How to troubleshoot throughput performance degradation?

Thanks for the reply. We currently create a RAID 0 array across all SSDs and run FoundationDB on it. We are in the process of changing our deployment code so that each log/storage process runs on its own disk and its own physical CPU.
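
For reference, here is a minimal sketch of the per-disk layout we are moving toward in foundationdb.conf; the ports and mount points are placeholders, and CPU pinning would still have to happen outside the config (e.g. with taskset on the fdbserver processes):

```ini
# Hypothetical excerpt: one fdbserver per disk, each with an explicit class.
# /mnt/disk0 and /mnt/disk1 are placeholder mount points, one SSD each.
[fdbserver.4500]
class = log
datadir = /mnt/disk0/foundationdb/4500

[fdbserver.4501]
class = storage
datadir = /mnt/disk1/foundationdb/4501
```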

> The 211 MB/s logical write rate seems higher than I would have initially thought to be sustainable.

I’m actually not sure how to estimate the disk requirements needed to sustain our traffic.

I tried something similar to this approach. In single redundancy mode, with 1 log and 19 storage processes, the log process shows a durable_bytes.hz of around 75 MB/s. With 19 log and 1 storage process, the storage process shows a durable_bytes.hz of around 15 MB/s. (That post used mutations instead, but I don’t know how to work with that number.) So in our case, the ratio of log:storage processes should be about 1:5.

To sustain 633 MB/s of physical traffic, I’ll need around 10 log processes and 50 storage processes, each with its own disk, which means a cluster of 60 disks in triple redundancy mode.
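
To make the arithmetic explicit, here is the back-of-the-envelope calculation as a small Python sketch; the per-process rates are just my measured numbers from above, not general constants:

```python
import math

# Measured saturation rates (durable_bytes.hz) from the experiments above;
# these are assumptions specific to our hardware, not general constants.
LOG_MBPS = 75       # one saturated log process
STORAGE_MBPS = 15   # one saturated storage process
TARGET_MBPS = 633   # physical write traffic to sustain

min_logs = math.ceil(TARGET_MBPS / LOG_MBPS)          # ceil(633/75) = 9
min_storages = math.ceil(TARGET_MBPS / STORAGE_MBPS)  # ceil(633/15) = 43

# Rounding up for headroom and keeping the 1:5 log:storage ratio gives
# the figures quoted above: 10 logs, 50 storages, 60 disks in total.
print(min_logs, min_storages, min_logs + min_storages)
```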

Does the above estimate look correct?

However, if I run dd against a single disk with nothing else on it, the write rate can reach 300+ MB/s (depending on the block size). Does that mean I could run more than one log/storage process per disk?
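
One thing that may explain the gap: dd with a large block size (and without oflag=dsync) mostly measures sequential, page-cache-buffered writes, while the transaction log fsyncs on every commit. Here is a quick sketch to see the difference on a given disk; the file path is a placeholder:

```python
import os, time

PATH = "/mnt/disk0/fsync_test.bin"   # hypothetical scratch file on the SSD
BLOCK = b"\0" * (1 << 20)            # 1 MiB per write
COUNT = 256                          # 256 MiB total

def bench(fsync_each_write: bool) -> float:
    fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    start = time.monotonic()
    for _ in range(COUNT):
        os.write(fd, BLOCK)
        if fsync_each_write:
            os.fsync(fd)             # durability barrier, like a log commit
    os.fsync(fd)                     # make the buffered run durable too
    elapsed = time.monotonic() - start
    os.close(fd)
    os.unlink(PATH)
    return COUNT / elapsed           # MiB/s

print(f"buffered, single fsync at end: {bench(False):7.1f} MiB/s")
print(f"fsync after every 1 MiB write: {bench(True):7.1f} MiB/s")
```

If the fsync-heavy number comes out far below the dd number, the dd result says little about how many log processes one disk can actually host.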

> StorageServerUpdateLag indicates that the storage server is in e-brake,

Thank you. Searching the GitHub repo, that’s the only e-brake event I could find.