How to troubleshoot throughput performance degradation?

I’ve been helping Ray debug this, and we’ve managed to simplify the problem considerably. We can reproduce it by writing fully random keys (no reads, no shared prefixes) at a high rate (saturating FDB, roughly 50-100k keys/sec) for about 6-7 minutes. After that point, FoundationDB’s log queue size spikes, storage queue sizes diverge (different processes show wildly different sizes), and throughput drops.
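For reference, the write pattern is roughly the following sketch. The key/value sizes, batch size, and helper names here are placeholders I’ve picked for illustration, not the exact parameters from our runs:

```python
import os

# Fully random keys with no shared prefix, so writes spread uniformly
# across the keyspace. Sizes below are illustrative assumptions.
KEY_BYTES = 16
VALUE_BYTES = 100

def make_batch(batch_size=500):
    """Generate one transaction's worth of uniformly random key/value pairs."""
    return [(os.urandom(KEY_BYTES), os.urandom(VALUE_BYTES))
            for _ in range(batch_size)]

# In the actual workload, many client threads loop over batches like this
# for ~6-7 minutes, committing each batch in its own transaction via the
# fdb bindings, along the lines of:
#
#   @fdb.transactional
#   def write_batch(tr, batch):
#       for k, v in batch:
#           tr[k] = v
batch = make_batch()
```

The important property is just that keys are uniformly random, so no storage team should be receiving a disproportionate share of the writes.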

Here’s an example with 10 i3.16xlarge instances running at 3x redundancy.

Here are 3 i3.16xlarge instances running at 1x redundancy. (A single instance’s log queue maxed out almost instantly, so this run didn’t let us demonstrate the odd divergence behavior.)

Here are 10 i3.16xlarge instances running at 1x redundancy. This run is stable, which suggests the problems we’re observing have something to do with the ratio of total machine/disk count to workload. However, I still don’t fully understand what’s happening.

My current best guess is that some kind of cleanup or other secondary work kicks in after that 6-7 minute mark, causing the load on the log processes to exceed their handling capacity. I’m not clear on what that work would be, however.

Any feedback would be enormously appreciated.