Knobs for dealing with a cluster with fairly high p99.9 latency

We run multi-region clusters with a primary and a satellite, and we are seeing fairly large, spiky p99.9 values on commits.
After some investigation, it seems to be coming from our cloud vendor, but I’m wondering whether there are any knobs we should tune for a cluster with fairly high p99.9 latency.

I would look at the histograms from ProxyCommitData.actor.h to see where the bottleneck is during the commit latency spikes. The default histogram logging interval is 5 minutes (SERVER_KNOBS->HISTOGRAM_REPORT_INTERVAL), but I’d recommend lowering this knob to ~60 seconds. Then you can find logs with Type = Histogram, and Group = CommitProxy. The relevant Op values are:

  • CommitBatchQueueing: High latencies indicate that batches are too large
  • GetCommitVersion: High latencies indicate too many small batches, saturating the Master CPU, or potentially a bad network connection
  • Resolution: High latencies indicate large conflict sets or potentially a CPU-saturated resolver, or bad network connection
  • PostResolutionQueueing: This rarely takes long; when it does, it could be caused by inconsistent resolver latencies
  • ProcessingMutation: High latencies indicate large transactions or CPU-saturated commit proxies
  • TlogLogging: High latencies indicate that transaction logs are the bottleneck. Further investigation can be performed on the AsyncFileKAIO*Latency traces on tlogs and the Group = tLog histograms
  • ReplyCommit: I’ve never seen this latency be high, it should be very rare
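
To make the filtering step concrete, here is a rough Python sketch that pulls the CommitProxy histogram events out of trace files and groups them by Op. It assumes the trace files are written as JSON with one event per line (the default format is XML, so this needs adjusting if you haven't switched formats). The field names Type, Group and Op are the ones mentioned above. The function name and the HINTS table are made up for illustration, and the script does not try to interpret the bucket fields inside each event.

```python
import json
import sys
from collections import defaultdict

# Hints paraphrased from the list above, keyed by Op value.
HINTS = {
    "CommitBatchQueueing": "batches may be too large",
    "GetCommitVersion": "too many small batches, Master CPU, or network",
    "Resolution": "large conflict sets, resolver CPU, or network",
    "PostResolutionQueueing": "inconsistent resolver latencies",
    "ProcessingMutation": "large transactions or commit proxy CPU",
    "TlogLogging": "transaction logs are the bottleneck",
    "ReplyCommit": "rarely high",
}


def commit_proxy_histograms(lines):
    """Group events with Type=Histogram and Group=CommitProxy by Op.

    Assumes each line holds one JSON-encoded trace event (an assumption
    about the trace format in use, not something this script detects).
    """
    by_op = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partial or non-JSON lines
        if not isinstance(event, dict):
            continue
        if event.get("Type") == "Histogram" and event.get("Group") == "CommitProxy":
            by_op[event.get("Op", "<missing Op>")].append(event)
    return dict(by_op)


if __name__ == "__main__":
    # Usage: python histograms.py <trace file> [<trace file> ...]
    collected = defaultdict(list)
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as handle:
            for op, events in commit_proxy_histograms(handle).items():
                collected[op].extend(events)
    for op in sorted(collected):
        hint = HINTS.get(op, "no hint")
        print(f"{op}: {len(collected[op])} histogram events ({hint})")
```

Run it over the commit proxy trace files covering a latency spike, then open the events for whichever Op looks suspicious and compare their buckets against a quiet period.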

If the issue isn’t with the tlog disks, then the best fix is likely to increase or decrease commit batch size, depending on the bottleneck. To modify commit batch size, the COMMIT_TRANSACTION_BATCH* knobs in ServerKnobs.cpp are relevant. If the issue is resolver or commit proxy CPU saturation, the fix may be to modify role counts rather than knobs.