Knobs for dealing with a cluster with fairly high p99.9 latency

mpatou_openai · August 26, 2025, 11:38pm

We run multi-region clusters with a primary and a satellite, and we are seeing fairly large, spiky p99.9 commit latencies.
After some investigation, the spikes seem to come from our cloud vendor, but I’m wondering whether there are any knobs we should tune for a fairly high p99.9 latency.

tclinken · September 18, 2025, 3:24pm

I would look at the histograms from ProxyCommitData.actor.h to see where the bottleneck is during the commit latency spikes. The default histogram logging interval is 5 minutes (SERVER_KNOBS->HISTOGRAM_REPORT_INTERVAL), but I’d recommend lowering this knob to ~60 seconds. Then you can find logs with Type = Histogram, and Group = CommitProxy. The relevant Op values are:

CommitBatchQueueing: High latencies indicate that batches are too large
GetCommitVersion: High latencies indicate too many small batches, a saturated Master CPU, or potentially a bad network connection
Resolution: High latencies indicate large conflict sets, a CPU-saturated resolver, or potentially a bad network connection
PostResolutionQueueing: This rarely takes long; high latencies could be caused by inconsistent resolver latencies
ProcessingMutation: High latencies indicate large transactions or CPU-saturated commit proxies
TlogLogging: High latencies indicate that transaction logs are the bottleneck. To investigate further, look at the AsyncFileKAIO*Latency traces on tlogs and the Group = tLog histograms
ReplyCommit: I’ve never seen this latency be high; it should be very rare
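To pull these events out of a trace file, something like the following rough sketch might help. It assumes the default XML trace format (one `<Event ... />` element per line) and relies only on the Type, Group, and Op attributes mentioned above; the function names and the sample lines are made up for illustration, and real histogram events carry more attributes than shown here.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def commit_proxy_histograms(lines):
    """Yield attribute dicts for events with Type=Histogram and Group=CommitProxy."""
    for line in lines:
        line = line.strip()
        if not line.startswith("<Event"):
            continue  # skip the XML header and the <Trace> wrapper lines
        try:
            attrs = ET.fromstring(line).attrib
        except ET.ParseError:
            continue  # tolerate a partially written line at the end of a live file
        if attrs.get("Type") == "Histogram" and attrs.get("Group") == "CommitProxy":
            yield dict(attrs)


def count_by_op(lines):
    """Count matching histogram events per Op value."""
    return Counter(e.get("Op", "?") for e in commit_proxy_histograms(lines))


# Made-up sample lines, only to show the shape of the input.
sample = [
    '<?xml version="1.0"?>',
    "<Trace>",
    '<Event Severity="10" Time="1.0" Type="Histogram" Group="CommitProxy" Op="Resolution" />',
    '<Event Severity="10" Time="1.0" Type="Histogram" Group="tLog" Op="Other" />',
    '<Event Severity="10" Time="2.0" Type="Histo',
]
print(count_by_op(sample))
```

For a real file you would pass the open file object instead of `sample`, then inspect the attributes of the yielded events for the Op you care about around the time of a spike.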

If the issue isn’t with the tlog disks, then the best fix is likely to increase or decrease commit batch size, depending on the bottleneck. To modify commit batch size, the COMMIT_TRANSACTION_BATCH* knobs in ServerKnobs.cpp are relevant. If the issue is resolver or commit proxy CPU saturation, the fix may be to modify role counts rather than knobs.
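As a rough illustration of that triage, here is a small sketch that maps the slowest stage to the hints from this post. The `latency_by_op` input is hypothetical (you would fill it from whatever latency figure you read out of the histograms), picking the single slowest stage is a crude heuristic, and the "smaller"/"larger" batch direction is my reading of the Op descriptions rather than an official rule.

```python
# Hints paraphrased from the Op descriptions in this thread.
HINTS = {
    "CommitBatchQueueing": "batches may be too large: consider a smaller commit batch size",
    "GetCommitVersion": "too many small batches, Master CPU, or network: consider a larger commit batch size",
    "Resolution": "large conflict sets, resolver CPU, or network: consider more resolvers",
    "PostResolutionQueueing": "possibly inconsistent resolver latencies",
    "ProcessingMutation": "large transactions or commit proxy CPU: consider more commit proxies",
    "TlogLogging": "transaction logs are the bottleneck: inspect tlog disks and tLog histograms",
    "ReplyCommit": "unusual: rarely the bottleneck",
}


def suggest(latency_by_op):
    """Return (slowest_op, hint) for a dict of Op name -> observed latency."""
    if not latency_by_op:
        raise ValueError("no latencies supplied")
    slowest = max(latency_by_op, key=latency_by_op.get)
    return slowest, HINTS.get(slowest, "no hint for this Op")


# Hypothetical numbers (any consistent unit), not measurements.
op, hint = suggest({"CommitBatchQueueing": 0.4, "Resolution": 1.2, "TlogLogging": 9.5})
print(op, "->", hint)
```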
