Scaling log server and log to storage ratio

mpatou_openai · May 6, 2025, 1:38am

I’m kind of wondering about the impact of scaling log servers. I was doing some load testing last week with 4 log servers per region, and the log servers were clearly saturated on CPU (not on I/O, it seems). So I quadrupled the number of log servers, and the processes still showed little improvement (maybe a 5 to 10% CPU reduction), while everything else seemed to stay similar.
This is kind of surprising to me; maybe there is still something I don’t understand about FDB. For the record, I was using a ratio of 10% writes to 90% reads, and we were reading randomly in a 10M-key space.
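For anyone trying to reproduce a similar workload, here is a minimal sketch of the operation mix described above (10% writes, 90% reads, keys drawn uniformly from a 10M-key space). It only generates the operations and does not talk to a cluster; the key layout and the names `make_key`, `next_op`, and `sample_mix` are assumptions for illustration, not what the original test used.

```python
import random

KEYSPACE = 10_000_000   # 10M keys, as in the test described above
WRITE_FRACTION = 0.10   # 10% writes, 90% reads


def make_key(i: int) -> bytes:
    # Hypothetical key layout: a fixed prefix plus a zero-padded index.
    return b"key/%010d" % i


def next_op(rng: random.Random):
    """Return ('write' or 'read', key) with the key uniform over the keyspace."""
    kind = "write" if rng.random() < WRITE_FRACTION else "read"
    return kind, make_key(rng.randrange(KEYSPACE))


def sample_mix(n: int, seed: int = 0) -> dict:
    """Draw n operations and count how many were reads and writes."""
    rng = random.Random(seed)
    counts = {"read": 0, "write": 0}
    for _ in range(n):
        kind, _key = next_op(rng)
        counts[kind] += 1
    return counts
```

With a uniform key choice over 10M keys, repeated hits on the same key are rare, so this is closer to a worst case for caching than a skewed production workload would be.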

I’m wondering whether I’m ending up reading a lot from the log servers. Is this something that can happen, and if yes, under which conditions?

Also, I’m wondering what the current recommendation is for the ratio of log servers to storage servers with the SSD-2 storage engine. I found a few posts mentioning 1 to 8, but I’m wondering if that’s still the case.

jon · May 12, 2025, 1:23pm

For what it’s worth, I ran into something similar recently; please see Log processes and CPU saturation for that discussion. We didn’t reach any firm conclusions there, but I think there are some plausible hypotheses to test.

msf · May 14, 2025, 2:10pm

Hi there, I also noticed this behaviour and left it to be investigated later, when the priority for that investigation increases.
I’m not specifying any storage engine for the logs, but I’m using Redwood as the general storage engine (we’re running v7.3).
I also noticed that with a 100M keyspace, only some K processes out of 9 log servers were saturated.
This test was a couple of months ago, and I no longer have the telemetry.

mpatou_openai · May 15, 2025, 4:39am

Is it possible that the commit proxies are actually sending to more than the required log servers and then treating a transaction as committed once at least replication-factor log servers have replied?

johscheuer · May 15, 2025, 9:02am

It’s a bit different: the CP (commit proxy) sends the mutations to all logs in the primary (and the primary satellite in case of multi-region) and waits for all logs to respond; see: FDB Read and Write Path — FoundationDB ON documentation.
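To make the consequence of that fan-out concrete, here is a back-of-the-envelope sketch. It takes the description above at face value (every commit batch produces one message to every log), and adds one assumption that is not stated in this thread: that the payload of a batch is stored on `replication` of the logs, so payload bytes per log shrink as logs are added while the message count per log does not. The function name and parameters are hypothetical.

```python
def per_log_load(batches_per_s: float, bytes_per_batch: float,
                 n_logs: int, replication: int = 3) -> dict:
    """Rough per-log load under an all-logs fan-out model.

    Assumption (not from the thread): payload is spread evenly, with each
    byte stored on `replication` logs; every log still sees every batch.
    """
    copies = min(replication, n_logs)
    return {
        # One message per commit batch per log, regardless of n_logs.
        "messages_per_s": batches_per_s,
        # Payload bytes per log shrink as more logs share the data.
        "payload_bytes_per_s": batches_per_s * bytes_per_batch * copies / n_logs,
    }


before = per_log_load(1000, 8192, n_logs=4)
after = per_log_load(1000, 8192, n_logs=16)
```

Under this model, going from 4 to 16 logs cuts the payload per log by 4x but leaves the number of messages each log handles unchanged. If per-message overhead dominates CPU, that would be consistent with only a small CPU reduction, though this is a hypothesis to test rather than a confirmed explanation.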

Have you checked where the log processes are spending their time? E.g., could it be that they spend most of their CPU time handling networking?
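As a first step before profiling, per-process CPU and network figures can be pulled out of the machine-readable cluster status. The sketch below takes an already-parsed `status json` document and lists the processes that carry the `log` role. The field names (`cluster.processes`, `roles[].role`, `cpu.usage_cores`, `network.megabits_sent.hz`, `network.megabits_received.hz`) are what I believe the status schema uses; please verify them against your FDB version. The function name `log_process_load` is made up for this example.

```python
def log_process_load(status: dict) -> list:
    """Summarize CPU and network usage for processes with the 'log' role.

    `status` is the parsed output of `status json`. Missing fields are
    reported as None rather than raising, since schemas vary by version.
    """
    rows = []
    processes = status.get("cluster", {}).get("processes", {})
    for proc_id, proc in processes.items():
        roles = [r["role"] for r in proc.get("roles", []) if "role" in r]
        if "log" not in roles:
            continue
        net = proc.get("network", {})
        rows.append({
            "id": proc_id,
            "address": proc.get("address"),
            "cpu_cores": proc.get("cpu", {}).get("usage_cores"),
            "mbit_sent": net.get("megabits_sent", {}).get("hz"),
            "mbit_received": net.get("megabits_received", {}).get("hz"),
            # Other roles sharing the process can also explain high CPU.
            "other_roles": sorted(r for r in roles if r != "log"),
        })
    # Busiest log processes first; unknown CPU sorts last.
    rows.sort(key=lambda r: r["cpu_cores"] or 0.0, reverse=True)
    return rows
```

The input can be obtained with something like `fdbcli --exec 'status json'` and parsed with `json.loads`. This only shows how busy each log process is, not where the time goes inside it; for that, a CPU profiler on a saturated process is still needed.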

mpatou_openai · May 15, 2025, 9:41pm

It’s not only that, but networking was clearly a good chunk of it.
