Daily Pattern in WorstTLog Queue / Tuning TLog 2GB Queue size?

Hi All,

I’ve noticed a consistent daily pattern on our cluster in the Ratekeeper WorstTLogQueue metric. We also found this GH issue from @alexmiller: https://github.com/apple/foundationdb/issues/620. So far I have not been able to correlate the pattern with any specific performance issue, but as noted in the GH issue, this spillage has the potential to impact write workload performance.

I am wondering if anyone else has encountered this pattern and what actions were taken. Is this generally an indicator that we should add more logs (per @ajbeamon’s comment), or is it better to tune --knob_server_mem_limit if RAM is available?

What is blue and what is purple? What is “Worst TLog Queue” monitoring in terms of TLogMetric attributes?

Hiya @alexmiller. Thank you for the reply. The blue and purple series are 2 different c5.xlarge instances, each running 4 fdbserver processes with the “stateless” class.

The processes on the purple instance currently have the following roles.
– snip fdbtop output –
4500 66 5 - 23 stateless proxy
4501 7 3 - 1 stateless master
4502 69 3 - 23 stateless proxy
4503 69 3 - 23 stateless proxy

The processes on the blue instance:
– snip fdbtop output –
4500 66 19 - 23 stateless proxy
4501 66 4 - 22 stateless proxy
4502 8 3 - 2 stateless cluster_controller
4503 65 11 - 21 stateless proxy

The Worst TLog Queue metrics are from the trace files.

trace.010.050.002.210.4503.1548772635.qQYQre.26.xml: <Event Severity="10" Time="1549044301.613677" OriginalTime="1548797048.262023" Type="RkUpdate" ID="0000000000000000" TPSLimit="477.23" Reason="5" ReasonServerID="b1bb6a4e3d41df2b" ReleasedTPS="104.134" TPSBasis="104.134" StorageServers="36" Proxies="9" TLogs="5" WorstFreeSpaceStorageServer="669442251929" WorstFreeSpaceTLog="84773885905" WorstStorageServerQueue="1504647272" LimitingStorageServerQueue="69650439" WorstTLogQueue="2067779198" TotalDiskUsageBytes="1470329077864" WorstStorageServerVersionLag="0" LimitingStorageServerVersionLag="0" Machine="10.50.2.210:4503" LogGroup="default" Roles="MS" TrackLatestType="Rolled" />
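For anyone wanting to chart this themselves, a minimal sketch for pulling (Time, WorstTLogQueue) pairs out of the RkUpdate events in the XML trace files is below. This is only a sketch; the glob pattern and the plain print output are placeholders, not the tooling we actually run.

# Sketch: collect (Time, WorstTLogQueue) samples from RkUpdate events in XML trace files.
import glob
import xml.etree.ElementTree as ET

def worst_tlog_queue_samples(trace_glob="trace.*.xml"):
    samples = []
    for path in sorted(glob.glob(trace_glob)):
        try:
            for _, elem in ET.iterparse(path, events=("end",)):
                if elem.tag == "Event" and elem.get("Type") == "RkUpdate":
                    samples.append((float(elem.get("Time")),
                                    int(elem.get("WorstTLogQueue"))))
                elem.clear()  # keep memory bounded on large trace files
        except ET.ParseError:
            pass  # the most recent trace file may be truncated mid-write
    return sorted(samples)

for t, queue_bytes in worst_tlog_queue_samples():
    print(f"{t:.0f}\t{queue_bytes}")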

Huh, I had never actually used Ratekeeper’s summaries before. Useful. I should also ask: what version of FDB is this on?

WorstTLogQueue is calculated as the maximum of inputBytes - durableBytes across each TLog’s TLogMetrics, which is a familiar calculation. Spilling doesn’t spill the entire queue; it just limits the queue to TLOG_SPILL_THRESHOLD bytes. So if you saw WorstTLogQueue plateau at ~2GB, that would suggest spilling is happening. The fact that it does not plateau, and instead declines, makes me suspect that for purple there’s one mutation destined for a storage server that has failed, and once that one mutation is spilled, the rest of the queue is rapidly trimmed. For blue, this sounds more like a workload that bursts exactly on the hour, every hour, and causes queues to grow just enough that you barely start spilling.
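In other words (illustrative numbers only, not from your cluster):

# WorstTLogQueue is the max over all TLogs of (inputBytes - durableBytes).
# The per-log numbers below are made up purely to show the shape of the calculation.
tlog_metrics = {
    "log-a": {"inputBytes": 5_400_000_000, "durableBytes": 5_350_000_000},
    "log-b": {"inputBytes": 7_300_000_000, "durableBytes": 5_250_000_000},
}

def worst_tlog_queue(metrics):
    return max(m["inputBytes"] - m["durableBytes"] for m in metrics.values())

print(worst_tlog_queue(tlog_metrics))  # 2050000000, i.e. "log-b" is the worst log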

I don’t think I’d be overly concerned about the state of this right now. If you wanted to make sure you never spill, you could either increase the number of logs, or if you have the memory available, raise --knob_server_mem_limit and --knob_tlog_spill_threshold by 1-2GB, and you should be more than fine.
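If you go the knob route rather than adding logs, a sketch of what that might look like in foundationdb.conf is below. The values are purely illustrative (pick them relative to the defaults for your version), and fdbmonitor should pass keys in the [fdbserver] section through to fdbserver as command-line arguments, so this ends up equivalent to the --knob_... flags above; double-check against your deployment.

# Sketch only: raise both knobs together, by roughly the same 1-2GB of headroom.
[fdbserver]
knob_server_mem_limit = 10000000000
knob_tlog_spill_threshold = 3000000000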


Thank you Alex, that’s very helpful. The version is 6.0.16.

Oh, good, then I don’t need to go double check what changes have happened to tlogs in earlier versions.

I’ll be posting a design doc and operational guide for a new tlog spilling strategy to the forums sometime next week, which you’ll likely find relevant and interesting.


Something else to be aware of: Ratekeeper starts trying to limit the transaction rate when the size of the log queue hits 2GB, and it targets a maximum queue of 2.4GB. If you have a workload that pushes the queue much above 2GB, it may start getting throttled.
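If you want to alert before that happens, a trivial check against those numbers, reusing the (time, queue) samples from the trace-parsing sketch earlier in the thread (the 90% warning margin is an arbitrary choice):

# Thresholds taken from the comment above: ratekeeper begins limiting at ~2GB
# of TLog queue and targets a 2.4GB maximum.
THROTTLE_ONSET_BYTES = 2_000_000_000
TARGET_MAX_BYTES = 2_400_000_000

def flag_throttle_risk(samples, warn_fraction=0.9):
    # samples: iterable of (time_seconds, worst_tlog_queue_bytes) pairs
    for t, queue_bytes in samples:
        if queue_bytes >= TARGET_MAX_BYTES:
            print(f"{t:.0f}s queue={queue_bytes}: at ratekeeper's 2.4GB target maximum")
        elif queue_bytes >= THROTTLE_ONSET_BYTES:
            print(f"{t:.0f}s queue={queue_bytes}: ratekeeper is likely limiting")
        elif queue_bytes >= warn_fraction * THROTTLE_ONSET_BYTES:
            print(f"{t:.0f}s queue={queue_bytes}: approaching the 2GB throttle onset")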