Log processes and CPU saturation

We’re performing some load tests with a new FoundationDB cluster (7.3.43 with three_data_hall replication and the Redwood storage engine).

In this test, we scaled up load to the point where our log processes’ CPU consumption was very close to saturated. At that point, we doubled the number of log processes and commit proxies from 4 to 8. We were surprised to observe that the per-process CPU load on both process classes remained the same despite unchanged offered load; in other words, we just went from 4 log processes at 95% CPU utilization to 8 log processes at 95% CPU utilization. We also observed a modest increase in commit latency as reported by the latency probe.
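
For reference, the reconfiguration step itself was just an fdbcli `configure`; the little wrapper below is a sketch of roughly what we ran (the counts are the real ones from the test, the wrapper itself is illustrative):

```python
# Sketch of the reconfiguration described above, assuming fdbcli is on PATH
# and picks up the default cluster file; the wrapper is illustrative only.
import subprocess

def reconfigure(logs: int, commit_proxies: int) -> None:
    # Ask the cluster to recruit the given number of log and commit proxy processes.
    cmd = f"configure logs={logs} commit_proxies={commit_proxies}"
    subprocess.run(["fdbcli", "--exec", cmd], check=True)

reconfigure(logs=8, commit_proxies=8)
```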

Is this expected, or have we committed a scaling faux pas? What metrics should we be watching to decide that it’s time to add more log processes? Is it normal/expected for log processes to be CPU-limited?

Thanks!

I’m now reasonably convinced that CPU utilization is not a good measure for log process saturation. We just threw more load at our log processes that looked CPU-saturated and… they handled it just fine?

I guess this focuses the question a little more: what metrics should we focus on to understand when it’s time to add more log processes?

I have not seen this with our log processes (our workload is very read heavy). For storage processes, I find the disk busy metric very useful to detect if the disk is saturated. Usually it’s either that or the CPU.
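
For what it’s worth, here is a rough sketch of how that metric can be pulled out of `status json`; the field names (`cluster.processes.*.disk.busy`, `roles`) are from memory of the status schema, so check them against your own output:

```python
# Sketch: report the per-process disk "busy" fraction for storage processes,
# using `fdbcli --exec "status json"`. Field names are assumptions from memory
# of the status schema; verify against your own cluster's output.
import json
import subprocess

raw = subprocess.run(["fdbcli", "--exec", "status json"],
                     capture_output=True, check=True).stdout
status = json.loads(raw)

for pid, proc in status["cluster"]["processes"].items():
    roles = {r["role"] for r in proc.get("roles", [])}
    if "storage" in roles:
        busy = proc.get("disk", {}).get("busy", 0.0)
        print(f"{proc.get('address', pid)}  disk busy: {busy:.0%}")
```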

I suspect that in our setup, the EBS disk driver can consume significant CPU (though I couldn’t prove it directly, so I might be wrong). Maybe your log process CPU is busy writing small batches of data to disk, and as the throughput increases, it writes larger batches to disk, achieving higher throughput per CPU?

Maybe your log process CPU is busy writing small batches of data to disk, and as the throughput increases, it writes larger batches to disk, achieving higher throughput per CPU?

That was my suspicion, too.

My other hypothesis is that this might have been another symptom of the “noisy neighbor” problem discussed in Significant changes in CPU load on resolver processes depending on placement in a cluster - #2 by jon. We’ll likely re-run this test with better CPU/memory isolation in the not-too-distant future, and I’ll report back if that makes a difference in this case.

If you look at status json, there is a QoS section containing batch_performance_limited_by; if the log server is listed there, then you are limited by the logs.
There is also performance_limited_by.
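
Something like the sketch below can surface those fields from `status json` (the exact field names under `cluster.qos` are from memory, so verify against your own cluster’s output):

```python
# Sketch: check the QoS section of status json for what ratekeeper reports
# as the limiting factor. The qos field names are assumptions from memory.
import json
import subprocess

raw = subprocess.run(["fdbcli", "--exec", "status json"],
                     capture_output=True, check=True).stdout
qos = json.loads(raw)["cluster"]["qos"]

for key in ("performance_limited_by", "batch_performance_limited_by"):
    limit = qos.get(key, {})
    print(f"{key}: {limit.get('name')} - {limit.get('description')}")
```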

I also found that disk busyness is not a good indicator. Based on my understanding, if there was at least one outstanding I/O in the queue at some point during the last second (not necessarily the same I/O the whole time, just at least one over the period), the busyness for that second will be 1. Given that disks nowadays can handle far more than that, it’s not a great measure; it seems to me it’s more important to know the IOPS capacity of your disk and check whether you are saturating it.
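
A rough sketch of what I mean, comparing observed per-process IOPS from `status json` against an assumed device limit; `DEVICE_IOPS_LIMIT` is a placeholder you’d fill in from your volume’s provisioned IOPS, and the `disk.reads.hz` / `disk.writes.hz` field names are from memory of the status schema:

```python
# Sketch: compare observed per-process IOPS against an assumed device limit
# instead of relying on the busyness number. DEVICE_IOPS_LIMIT and the
# disk.reads.hz / disk.writes.hz field names are assumptions.
import json
import subprocess

DEVICE_IOPS_LIMIT = 3000  # example value; use your disk's actual IOPS capacity

raw = subprocess.run(["fdbcli", "--exec", "status json"],
                     capture_output=True, check=True).stdout
for pid, proc in json.loads(raw)["cluster"]["processes"].items():
    disk = proc.get("disk", {})
    iops = disk.get("reads", {}).get("hz", 0.0) + disk.get("writes", {}).get("hz", 0.0)
    print(f"{proc.get('address', pid)}  ~{iops:.0f} IOPS "
          f"({iops / DEVICE_IOPS_LIMIT:.0%} of assumed limit)")
```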

It’s worth noting that both the storage and the log server won’t try to issue more I/O if there are already max_outstanding I/Os in flight (64 by default; it can be changed with knob_max_outstanding).
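
If you did want to change it, a knob like that is normally set in foundationdb.conf under the `[fdbserver]` section, something like the snippet below; the value is just an example, not a recommendation:

```
# foundationdb.conf (illustrative): raising the outstanding-I/O knob for all
# fdbserver processes on this host. The value shown is an example only.
[fdbserver]
knob_max_outstanding = 128
```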

I don’t know for sure about the log server, but the proxies need to talk to each other, so the more of them you add, the more they talk to each other and the less value you get out of each additional one.