We run multi-region clusters with a primary and a satellite, and we are seeing fairly large, spiky p99.9 commit latencies.
After some investigation, the spikes seem to come from our cloud vendor, but I’m wondering whether there are any knobs we should tune to deal with a fairly high p99.9 latency.
I would look at the histograms from ProxyCommitData.actor.h to see where the bottleneck is during the commit latency spikes. The default histogram logging interval is 5 minutes (SERVER_KNOBS->HISTOGRAM_REPORT_INTERVAL), but I’d recommend lowering this knob to ~60 seconds. Then you can find logs with Type = Histogram and Group = CommitProxy. The relevant Op values are:
- CommitBatchQueueing: High latencies indicate that batches are too large
- GetCommitVersion: High latencies indicate too many small batches, saturating the Master CPU, or potentially a bad network connection
- Resolution: High latencies indicate large conflict sets or potentially a CPU-saturated resolver, or bad network connection
- PostResolutionQueueing: This rarely takes long, could be caused by inconsistent resolver latencies
- ProcessingMutation: High latencies indicate large transactions or CPU-saturated commit proxies
- TlogLogging: High latencies indicate that transaction logs are the bottleneck, further investigation can be performed on the AsyncFileKAIO*Latency traces on tlogs and Group = tLog histograms
- ReplyCommit: I’ve never seen this latency be high, it should be very rare
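To pull these events out of the trace files, something like the following rough sketch could work. It assumes the trace logs are written as JSON with one event per line (the default XML format would need a different parser), and it relies only on the Type, Group, and Op fields mentioned above. The function name and the sample events are made up for illustration, and any other fields on a histogram event are passed through untouched rather than interpreted.

```python
import json
from collections import defaultdict


def commit_proxy_histograms(lines):
    """Group CommitProxy histogram trace events by their Op field.

    `lines` is any iterable of strings, e.g. an open trace file.
    Assumes one JSON object per line; anything else is skipped.
    """
    by_op = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # partial or non-JSON line
        if not isinstance(event, dict):
            continue
        if event.get("Type") == "Histogram" and event.get("Group") == "CommitProxy":
            by_op[event.get("Op", "<unknown>")].append(event)
    return dict(by_op)


# Hypothetical sample events, only to show the filtering.
SAMPLE = [
    '{"Type": "Histogram", "Group": "CommitProxy", "Op": "TlogLogging"}',
    '{"Type": "Histogram", "Group": "CommitProxy", "Op": "Resolution"}',
    '{"Type": "Histogram", "Group": "tLog", "Op": "SomethingElse"}',
    '{"Type": "ProcessMetrics"}',
    'not json at all',
]

if __name__ == "__main__":
    for op, events in sorted(commit_proxy_histograms(SAMPLE).items()):
        print(op, len(events))
```

Running the same function over trace files captured during a spike, and again over a quiet period, gives two sets of per-Op events to compare.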
If the issue isn’t with the tlog disks, then the best fix is likely to increase or decrease commit batch size, depending on the bottleneck. To modify commit batch size, the COMMIT_TRANSACTION_BATCH* knobs in ServerKnobs.cpp are relevant. If the issue is resolver or commit proxy CPU saturation, the fix may be to modify role counts rather than knobs.
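The guidance in this reply can be condensed into a small lookup table, sketched below. The hint text is a paraphrase of the points above, and the direction of each batch size change (smaller for CommitBatchQueueing, larger for GetCommitVersion) is my reading of them rather than official guidance. The per-Op latencies passed in are assumed to have been derived from the histograms by some other means; the names HINTS and worst_stage are hypothetical.

```python
# Paraphrased hints per commit stage; an interpretation, not official guidance.
HINTS = {
    "CommitBatchQueueing": "batches may be too large; try smaller batches via COMMIT_TRANSACTION_BATCH* knobs",
    "GetCommitVersion": "too many small batches, Master CPU saturation, or a bad network connection; try larger batches",
    "Resolution": "large conflict sets, a CPU-saturated resolver, or a bad network connection; consider role counts",
    "PostResolutionQueueing": "rarely slow; possibly inconsistent resolver latencies",
    "ProcessingMutation": "large transactions or CPU-saturated commit proxies; consider role counts",
    "TlogLogging": "transaction logs are the bottleneck; check tlog disk latency traces and tLog histograms",
    "ReplyCommit": "should be very rare",
}


def worst_stage(latency_by_op):
    """Return (op, hint) for the stage with the highest latency, or None if empty.

    `latency_by_op` maps an Op name to a latency in seconds, supplied by the caller.
    """
    if not latency_by_op:
        return None
    op = max(latency_by_op, key=latency_by_op.get)
    return op, HINTS.get(op, "no hint recorded for this Op")


if __name__ == "__main__":
    # Hypothetical numbers, only to show the call.
    print(worst_stage({"TlogLogging": 0.2, "Resolution": 0.01}))
```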