Daily Pattern in WorstTLog Queue / Tuning TLog 2GB Queue size?

rjenkins · February 1, 2019, 4:28pm

Hi All,

I’ve noticed a consistent pattern on our cluster in the Ratekeeper WorstTLogQueue metric. Additionally, we’ve found this GH issue from @alexmiller: https://github.com/apple/foundationdb/issues/620. So far I have not been able to correlate these to specific performance issues, but as noted in the GH issue, there is potential for this spillage to impact write workload performance.

I am wondering if anyone else has encountered this pattern and what actions were taken. Is this generally an indicator that we should add more logs (per @ajbeamon’s comment) or is it best to attempt to tune with --knob_server_mem_limit if RAM is available? (Related thread: Cluster tuning cookbook)

alexmiller · February 1, 2019, 6:29pm

What is blue and what is purple? What is “Worst TLog Queue” monitoring in terms of TLogMetric attributes?

rjenkins · February 1, 2019, 6:45pm

Hiya @alexmiller. Thank you for the reply. The blue and purple series are 2 different c5.xlarge instances, each running 4 fdbserver processes with the “stateless” class.

The processes on the purple instance currently have the following roles.
– snip fdbtop output –
4500 66 5 - 23 stateless proxy
4501 7 3 - 1 stateless master
4502 69 3 - 23 stateless proxy
4503 69 3 - 23 stateless proxy

The processes on the blue instance:
– snip fdbtop output –
4500 66 19 - 23 stateless proxy
4501 66 4 - 22 stateless proxy
4502 8 3 - 2 stateless cluster_controller
4503 65 11 - 21 stateless proxy

The Worst TLog Queue metrics are from the trace files.

trace.010.050.002.210.4503.1548772635.qQYQre.26.xml: <Event Severity="10" Time="1549044301.613677" OriginalTime="1548797048.262023" Type="RkUpdate" ID="0000000000000000" TPSLimit="477.23" Reason="5" ReasonServerID="b1bb6a4e3d41df2b" ReleasedTPS="104.134" TPSBasis="104.134" StorageServers="36" Proxies="9" TLogs="5" WorstFreeSpaceStorageServer="669442251929" WorstFreeSpaceTLog="84773885905" WorstStorageServerQueue="1504647272" LimitingStorageServerQueue="69650439" WorstTLogQueue="2067779198" TotalDiskUsageBytes="1470329077864" WorstStorageServerVersionLag="0" LimitingStorageServerVersionLag="0" Machine="10.50.2.210:4503" LogGroup="default" Roles="MS" TrackLatestType="Rolled" />
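For anyone wanting to plot the same series, here is a minimal sketch of pulling WorstTLogQueue out of a grep'd RkUpdate line like the one above. The function name `parse_rk_update` is made up for illustration, and it assumes each line carries one complete `<Event ... />` element, optionally prefixed by the trace file name as grep prints it.

```python
import xml.etree.ElementTree as ET


def parse_rk_update(line):
    """Return (time, worst_tlog_queue_bytes) for an RkUpdate line, else None.

    Hypothetical helper: assumes the line holds a single self-closing
    <Event ... /> element, possibly preceded by a "filename: " prefix.
    """
    start = line.find("<Event")
    if start == -1:
        return None
    event = ET.fromstring(line[start:].strip())
    if event.get("Type") != "RkUpdate":
        return None
    return float(event.get("Time")), int(event.get("WorstTLogQueue"))


# Shortened version of the event quoted above (most attributes dropped).
sample = (
    'trace.xml: <Event Severity="10" Time="1549044301.613677" '
    'Type="RkUpdate" TLogs="5" WorstTLogQueue="2067779198" '
    'Machine="10.50.2.210:4503" />'
)
when, queue_bytes = parse_rk_update(sample)
print(when, queue_bytes)
```

Running this over every RkUpdate line gives a (time, bytes) series that can be fed to whatever graphing tool you use.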

alexmiller · February 1, 2019, 8:53pm

Huh, I had never actually used Ratekeeper’s summaries before. Useful. I should also ask what version of FDB this is on?

WorstTLogQueue is calculating the maximum of inputBytes - durableBytes from each TLog’s TLogMetrics, which is a familiar calculation. Spilling doesn’t spill the entire queue, it just limits the queue to TLOG_SPILL_THRESHOLD bytes. So if you saw WorstTLogQueue plateau at ~2GB, then that would suggest spilling is happening. The fact that it does not plateau, and instead declines, makes me suspect that for purple there’s, like, one mutation destined for a storage server that’s failed, and once that one mutation is spilled, the rest of the queue is rapidly trimmed. For blue this would sound more like you have a workload that bursts exactly on the hour every hour, and causes queues to grow just enough that you start barely spilling.
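As a rough illustration of that calculation (not the actual Ratekeeper code), the per-TLog queue size and its maximum could be sketched like this. The dict layout and the byte counts are invented for the example; only the inputBytes - durableBytes formula comes from the description above.

```python
def worst_tlog_queue(tlog_metrics):
    """Largest (inputBytes - durableBytes) across all TLogs.

    tlog_metrics is a hypothetical list of dicts, one per TLog,
    holding the two counters named in the post above.
    """
    return max(m["inputBytes"] - m["durableBytes"] for m in tlog_metrics)


# Made-up counters for three TLogs, purely for illustration.
metrics = [
    {"inputBytes": 9_000_000_000, "durableBytes": 8_700_000_000},
    {"inputBytes": 9_100_000_000, "durableBytes": 7_100_000_000},
    {"inputBytes": 8_900_000_000, "durableBytes": 8_850_000_000},
]
print(worst_tlog_queue(metrics))
```

The second TLog has the largest gap between what it has received and what it has made durable, so it alone determines the reported value.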

I don’t think I’d be overly concerned about the state of this right now. If you wanted to make sure you never spill, you could either increase the number of logs, or if you have the memory available, raise --knob_server_mem_limit and --knob_tlog_spill_threshold by 1-2GB, and you should be more than fine.

rjenkins · February 1, 2019, 9:02pm

Thank you Alex that’s very helpful, the version is 6.0.16.

alexmiller · February 1, 2019, 9:28pm

Oh, good, then I don’t need to go double check what changes have happened to tlogs in earlier versions.

I’ll be posting a design doc and operational guide for a new tlog spilling strategy to the forums sometime next week, which you’ll likely find relevant and interesting.

ajbeamon · February 5, 2019, 9:53pm

Something else to be aware of – ratekeeper starts trying to limit the transaction rate when the size of the log queue hits 2GB and targets a maximum queue of 2.4GB. If you have a workload that’s pushing the queue much above 2GB, it might start getting throttled.
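To make those two thresholds concrete, here is a small sketch that labels a WorstTLogQueue reading against them. It does not model how ratekeeper actually computes its limit. The function name, the labels, and the choice of 1 GB = 10^9 bytes are all assumptions made for illustration.

```python
# Assumption: "GB" is taken as 10**9 bytes; the real values may differ.
LIMIT_START_BYTES = 2_000_000_000   # ratekeeper starts limiting around here
TARGET_MAX_BYTES = 2_400_000_000    # queue size ratekeeper aims to stay under


def classify_tlog_queue(queue_bytes):
    """Label a WorstTLogQueue reading relative to the two thresholds."""
    if queue_bytes < LIMIT_START_BYTES:
        return "below limiting threshold"
    if queue_bytes <= TARGET_MAX_BYTES:
        return "limiting likely active"
    return "above target maximum"


# The WorstTLogQueue value from the RkUpdate event quoted earlier in the thread.
print(classify_tlog_queue(2_067_779_198))
```

Under these assumptions, the reading from the trace event earlier in the thread falls just inside the band where throttling could begin.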
