Storage Server CPU bottleneck - Growing data lag

Thank you for your reply, but I couldn’t find such knobs in 6.3: foundationdb/Knobs.cpp at release-6.3 · apple/foundationdb · GitHub

Where are they defined?

The data lag is the difference in version between what the logs have and what the storage servers have fetched from the logs. A high lag here implies that the storage servers are falling behind the most recent data and cannot serve that data in a timely manner for reads.

A growing durability lag can eventually result in an increase in data lag. The queue size on the storage server grows as the durability lag does (all non-durable versions are stored in the queue), and there is a limit on the size of that queue. When it hits that limit, the storage server will stop fetching versions from the logs until the storage server makes more data durable and reduces the size of the queue. As a result, the storage server can fall behind.
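This feedback loop can be sketched with a toy model (this is not FoundationDB code; the limit matches the default STORAGE_HARD_LIMIT_BYTES knob value, and the throughput numbers are made up for illustration):

```python
# Toy model of the storage server queue behaviour described above: non-durable
# versions accumulate in an in-memory queue, and once the queue reaches the
# hard limit, fetching from the logs pauses until enough data is made durable.

STORAGE_HARD_LIMIT_BYTES = 1500e6  # default knob value


class StorageQueueModel:
    def __init__(self):
        self.input_bytes = 0.0    # total bytes fetched from the logs
        self.durable_bytes = 0.0  # total bytes made durable on disk

    @property
    def queue_bytes(self):
        # same expression as input_bytes_total - durable_bytes_total metrics
        return self.input_bytes - self.durable_bytes

    def can_fetch(self):
        return self.queue_bytes < STORAGE_HARD_LIMIT_BYTES

    def step(self, incoming_bytes, disk_write_bytes):
        # fetch from the logs only while the queue is under the hard limit
        if self.can_fetch():
            self.input_bytes += incoming_bytes
        # disk keeps draining the queue regardless
        self.durable_bytes = min(self.input_bytes,
                                 self.durable_bytes + disk_write_bytes)


model = StorageQueueModel()
# write-bound scenario: 100 MB/s arriving, only 40 MB/s made durable
for _ in range(40):
    model.step(incoming_bytes=100e6, disk_write_bytes=40e6)

# the queue saturates near the hard limit and fetching stalls intermittently
print(model.queue_bytes)
```

In a model like this the queue climbs until it hits the limit, after which fetching pauses and resumes in a sawtooth pattern — which is exactly why the storage server falls further behind the logs while remaining disk-bound.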

It’s also possible that something else is going on, and whatever is causing your storage server to be slow in making things durable is also causing it to be slow in fetching data from the logs.

Thank you for your response.

I’ve run some tests with a write-bound workload and done some research into fdb’s behaviour. Here is a short description of the results:

  1. If a storage server writes data more slowly than new transactions are committed, then the durability lag grows. The fdb_process_storage_durability_lag_seconds metric shows this lag for every fdbserver.
  2. If the durability lag grows on two or more fdbservers, then the fdb_qos_limiting_storage_server_durability_lag_seconds metric grows as well.
  3. When fdb_qos_limiting_storage_server_durability_lag_seconds reaches a certain limit, the ratekeeper starts throttling and the commit rate is reduced.
  4. But if only one fdbserver has a growing fdb_process_storage_durability_lag_seconds, the ratekeeper does not limit the transaction rate, and other problems may occur.
  5. When the durability lag grows, the newly changed data is stored in the in-memory storage queue of this fdbserver process.
  6. The current size of the in-memory queue of each storage server can be monitored as the expression fdb_process_storage_input_bytes_total - fdb_process_storage_durable_bytes_total.
  7. The maximum allowable storage server queue size in bytes is defined by the knob STORAGE_HARD_LIMIT_BYTES, with the default value 1500e6.
  8. When the storage server queue size reaches this limit, the StorageServerUpdateLag event is traced.
  9. The storage server stops fetching new mutations from the TLog until the queue size drops below STORAGE_HARD_LIMIT_BYTES.
  10. If more transactions are committed while the storage server is not fetching mutations from the TLog, the data lag grows. The fdb_process_storage_data_lag_seconds metric shows this lag for each fdbserver storage process.
  11. The limit on fdb_process_storage_data_lag_seconds is defined by the knob DD_SS_FAILURE_VERSIONLAG, which is expressed in versions (by default 1 sec = 1e6 versions). The default value of 250e6 versions corresponds to 250 seconds.
  12. When fdb_process_storage_data_lag_seconds reaches this limit, the ratekeeper marks this storage server process as Undesired, and data starts moving from this storage process to others. The SSVersionDiffLarge event is traced in this case. Unfortunately, this event has the information severity level (10) instead of warning or even error.
  13. Currently, neither status json nor any metric shows that a server is marked as Undesired. The only way to tell that there is an Undesired storage server in the cluster is the presence of MovingData events in the traces with a non-zero value of the PriorityTeamContainsUndesiredServer attribute. This is hard to monitor, and it is hard to determine which servers are currently undesired.
  14. The expectation was that, after moving some data to other storage processes, fewer mutations would arrive at this storage process, so fdb_process_storage_data_lag_seconds would decrease below the DD_SS_ALLOWED_VERSIONLAG knob (200 seconds by default) and the data movement would stop.
  15. Moving data requires reading that data. Unfortunately, fdb reads the data not only from the other replicas but from this storage process as well, which adds read workload.
  16. Usually, adding a significant read workload to a write-bound system dramatically reduces the write throughput available from the disk subsystem.
  17. This storage process then writes data much more slowly than before, and in practice the data lag never decreases below DD_SS_ALLOWED_VERSIONLAG until almost all data has been removed from this storage server.
  18. So some storage servers may effectively be removed from the cluster under heavy write workload, which is not good.
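The thresholds above can be combined into a small monitoring helper. The metric names and knob defaults are the ones observed above; the function itself and the alert wording are hypothetical, not part of any FoundationDB tooling, and the knob values should be verified against the Knobs.cpp of your FoundationDB version:

```python
# Hypothetical monitoring helper summarizing the thresholds described above.
# Knob defaults are taken from the observations in this post; verify them
# against the Knobs.cpp of your FoundationDB version.

VERSIONS_PER_SECOND = 1e6          # 1 second = 1e6 versions by default
STORAGE_HARD_LIMIT_BYTES = 1500e6  # max storage server queue size
DD_SS_FAILURE_VERSIONLAG = 250e6   # server marked Undesired (~250 s)
DD_SS_ALLOWED_VERSIONLAG = 200e6   # data movement stops below this (~200 s)


def storage_server_health(input_bytes_total, durable_bytes_total,
                          data_lag_seconds):
    """Classify one storage process from the metrics discussed above."""
    queue_bytes = input_bytes_total - durable_bytes_total
    data_lag_versions = data_lag_seconds * VERSIONS_PER_SECOND

    issues = []
    if queue_bytes >= STORAGE_HARD_LIMIT_BYTES:
        issues.append("queue at hard limit: fetching from TLogs paused")
    if data_lag_versions >= DD_SS_FAILURE_VERSIONLAG:
        issues.append("data lag over failure threshold: likely Undesired")
    elif data_lag_versions >= DD_SS_ALLOWED_VERSIONLAG:
        issues.append("data lag over allowed threshold: data may keep moving")
    return issues or ["ok"]


# a process with a 1.6 GB queue and 260 s of data lag trips both checks
print(storage_server_health(5_000_000_000, 3_400_000_000, 260))
```

Something along these lines could be wired into alerting so that a single lagging storage server is caught before it is marked Undesired, since the ratekeeper will not throttle on its behalf.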