I’m kind of wondering about the impact of scaling on log servers. I was doing some load testing last week with 4 log servers per region, and the log servers were clearly saturated on CPU (not on I/O, it seems). I quadrupled the number of log servers, but the processes still showed little improvement (maybe a 5 to 10% CPU reduction), while everything else seemed to stay similar.
This is kind of surprising to me; maybe there is still something I don’t understand about FDB. For the record, I was using a ratio of 10% writes to 90% reads, and we were reading randomly in a 10M key space.
I’m wondering whether I end up reading a lot from the log servers. Is this something that can happen, and if yes, under which conditions?
Also, I’m wondering what the current recommendation is for the ratio of log servers to storage servers with the SSD-2 storage engine. I found a few posts mentioning 1 to 8, but I’m wondering if that’s still the case.
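For anyone who wants to reproduce a similar mix, here is a minimal sketch of the workload shape described above (10% writes, 90% reads, uniformly random keys in a 10M key space). It only generates the operation stream and does not talk to FDB; the function name, the key format, and the seed are assumptions made for illustration, not what the original test used.

```python
import random


def generate_ops(n_ops, key_space=10_000_000, write_fraction=0.10, seed=0):
    """Yield (op, key) pairs with roughly `write_fraction` writes.

    Keys are drawn uniformly at random from `key_space` distinct values.
    The "key%08d" format is a hypothetical choice for this sketch.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    for _ in range(n_ops):
        key = b"key%08d" % rng.randrange(key_space)
        op = "write" if rng.random() < write_fraction else "read"
        yield op, key


if __name__ == "__main__":
    ops = list(generate_ops(10_000))
    writes = sum(1 for op, _ in ops if op == "write")
    print(f"{writes} writes out of {len(ops)} operations")
```

Each `(op, key)` pair could then be mapped onto a transaction in whatever client binding the load test uses.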
For what it’s worth, I ran into something similar recently; please see Log processes and CPU saturation for that discussion. We didn’t reach any firm conclusions there, but I think there are some plausible hypotheses to test.
Hi there, I also noticed this behaviour and left it to be investigated later, once the priority for that investigation increases.
I’m not specifying any storage engine for the logs, but I’m using Redwood as the general storage engine (we’re running v7.3).
I also noticed that with a 100M keyspace, only some K processes out of 9 log servers would be saturated.
This test was a couple of months ago, and I no longer have the telemetry.
Is it possible that the commit proxies actually send to more than the required log servers and then treat a transaction as committed once at least replication-factor log servers have replied?
It’s a bit different, the CP (commit proxy) sends the mutation to all logs in the primary (and the primary satellite in case of multi-region) and waits for all logs to respond, see: FDB Read and Write Path — FoundationDB ON documentation.
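To make the consequence of that fan-out concrete, here is a rough back-of-the-envelope model. It is a sketch under stated assumptions, not a description of FDB internals: it assumes every log receives one message per commit batch (as described above), and, as a separate assumption, that the mutation payload itself is spread evenly so that each mutation lands on `replication_factor` logs. The function names are made up for this example.

```python
def per_log_commit_rate(commit_batches_per_sec, num_logs):
    """Commit messages each log handles per second.

    Because the commit proxy sends every batch to all logs, this does
    not depend on num_logs in this model.
    """
    return commit_batches_per_sec


def per_log_payload_rate(mutation_bytes_per_sec, num_logs, replication_factor):
    """Mutation bytes each log receives per second.

    Assumption: payload is spread evenly and each mutation is stored on
    `replication_factor` logs (capped at the number of logs).
    """
    copies = min(replication_factor, num_logs)
    return mutation_bytes_per_sec * copies / num_logs


if __name__ == "__main__":
    for logs in (4, 16):
        print(
            logs,
            per_log_commit_rate(1000, logs),
            per_log_payload_rate(100.0, logs, 3),
        )
```

If this model is roughly right, quadrupling the logs shrinks the per-log payload but leaves the per-commit message rate unchanged, so a log whose CPU is dominated by fixed per-commit work would see only a small improvement. That is a hypothesis to check against real measurements, not a conclusion.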
Have you checked where the log processes are spending their time? E.g., could it be that they spend most of their CPU time handling networking?
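As a first step before profiling, something like the following could list the log processes with their CPU and network figures from the output of `fdbcli --exec 'status json'`. This is a sketch: the field names (`cluster.processes`, `roles[].role`, `cpu.usage_cores`, `network.megabits_sent.hz`, `network.megabits_received.hz`) are what I remember from the status JSON schema, so please verify them against your version. The helper name is made up. It does not break CPU down by activity; that would still need a profiler.

```python
import json
import sys


def log_process_cpu(status):
    """Return log-role processes sorted by CPU usage, highest first.

    `status` is the parsed output of `status json`. Missing fields are
    reported as None rather than raising, since the schema is assumed.
    """
    rows = []
    processes = status.get("cluster", {}).get("processes", {})
    for pid, proc in processes.items():
        roles = [r.get("role") for r in proc.get("roles", [])]
        if "log" not in roles:
            continue
        net = proc.get("network", {})
        rows.append({
            "id": pid,
            "address": proc.get("address"),
            "cpu_cores": proc.get("cpu", {}).get("usage_cores"),
            "mbit_sent": net.get("megabits_sent", {}).get("hz"),
            "mbit_received": net.get("megabits_received", {}).get("hz"),
        })
    rows.sort(key=lambda r: r["cpu_cores"] or 0.0, reverse=True)
    return rows


if __name__ == "__main__":
    # Usage: fdbcli --exec 'status json' > status.json
    #        python log_cpu.py status.json
    with open(sys.argv[1]) as f:
        for row in log_process_cpu(json.load(f)):
            print(row)
```

High CPU together with modest network throughput on the log processes would point away from the networking hypothesis, and vice versa.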