Thank you for your reply, but I couldn’t find such knobs in 6.3 (foundationdb/Knobs.cpp at release-6.3 · apple/foundationdb · GitHub). Where are they defined?
The data lag is the difference in version between what the logs have and what the storage servers have fetched from the logs. A high lag here implies that the storage servers are falling behind the most recent data and cannot serve that data in a timely manner for reads.
A growing durability lag can eventually result in an increase in data lag. The queue size on the storage server grows as the durability lag does (all non-durable versions are stored in the queue), and there is a limit on the size of that queue. When it hits that limit, the storage server will stop fetching versions from the logs until the storage server makes more data durable and reduces the size of the queue. As a result, the storage server can fall behind.
It’s also possible that something else is going on, and whatever is causing your storage server to be slow in making things durable is also causing it to be slow in fetching data from the logs.
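To make the causal chain above concrete, here is a toy simulation (my own sketch, not FDB code) of a storage server that ingests data faster than it can make it durable: the non-durable queue grows, hits a hard limit, fetching from the logs pauses, and the server falls behind (its version lag keeps growing).

```python
def simulate(steps, input_rate, durable_rate, hard_limit):
    """Toy model: returns (queue_size, version_lag) after `steps` ticks.

    The names and rates here are illustrative only; this is not how the
    real storage server is implemented, just the shape of the feedback
    loop described above.
    """
    queue = 0        # non-durable data held by the storage server
    log_version = 0  # newest version on the logs
    fetched = 0      # newest version this storage server has applied
    for _ in range(steps):
        log_version += input_rate              # writers keep committing
        if queue < hard_limit:                 # queue below the hard limit?
            # Simplification: fetch bandwidth equals the input rate, so a
            # server that ever stalls can never catch back up.
            grab = min(input_rate, log_version - fetched)
            queue += grab
            fetched += grab
        queue = max(0, queue - durable_rate)   # durability drains the queue
    return queue, log_version - fetched
```

With `durable_rate < input_rate` the queue saturates at the limit and the version lag grows without bound, which is exactly the "storage server falls behind" outcome described above; with `durable_rate >= input_rate` both stay at zero.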
Thank you for your response.
I’ve run some tests of a write-bound workload and done some research into fdb’s behaviour. Here is a short description of what I found:

The `fdb_process_storage_durability_lag_seconds` metric shows the durability lag for every fdbserver, and the `fdb_qos_limiting_storage_server_durability_lag_seconds` metric grows as well. When `fdb_qos_limiting_storage_server_durability_lag_seconds` reaches some limit, fdb activates the ratekeeper and the commit rate should be reduced.

If the ratekeeper does not react to a growing `fdb_process_storage_durability_lag_seconds`, it doesn’t limit the transaction rate, and other things may happen. The non-durable queue size (`fdb_process_storage_input_bytes_total - fdb_process_storage_durable_bytes_total`) is limited by the knob `STORAGE_HARD_LIMIT_BYTES`, with the default value 1500e6. When the queue reaches this limit, a `StorageServerUpdateLag` event is traced and the storage server stops fetching new versions until the queue drops back under `STORAGE_HARD_LIMIT_BYTES`.
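For what it’s worth, the queue size above can be watched directly from the two counters. A minimal sketch (the helper names are mine, not an FDB API) that warns before the queue reaches the `STORAGE_HARD_LIMIT_BYTES` default of 1500e6:

```python
# Default value of the STORAGE_HARD_LIMIT_BYTES knob, as discussed above.
STORAGE_HARD_LIMIT_BYTES = 1_500_000_000

def storage_queue_bytes(input_bytes_total, durable_bytes_total):
    """Non-durable bytes queued on a storage server, i.e. the difference
    fdb_process_storage_input_bytes_total - fdb_process_storage_durable_bytes_total."""
    return input_bytes_total - durable_bytes_total

def queue_alert(input_bytes_total, durable_bytes_total, warn_ratio=0.8):
    """True once the queue is within warn_ratio of the hard limit,
    i.e. before the storage server stops fetching versions."""
    queue = storage_queue_bytes(input_bytes_total, durable_bytes_total)
    return queue >= warn_ratio * STORAGE_HARD_LIMIT_BYTES
```

The same comparison could of course be written as a Prometheus alerting rule over the two metrics; the point is only that the stall condition is observable before it happens.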
The `fdb_process_storage_data_lag_seconds` metric shows the data lag for each fdbserver storage process. A limit on `fdb_process_storage_data_lag_seconds` is defined with the knob `DD_SS_FAILURE_VERSIONLAG`, in versions (by default 1 sec = 1e6 versions); the default value is 250 seconds. When `fdb_process_storage_data_lag_seconds` reaches this limit, the ratekeeper marks this storage server process as `Undesired` and starts moving data from this storage process to others. The event `SSVersionDiffLarge` is traced in this case. Unfortunately, this event has the information level (10) instead of warning or even error.

Neither `status json` nor any metrics show that the server is marked as `Undesired`. The only way to tell that there is an `Undesired` storage server in the cluster is the existence of `MovingData` events in the traces with a non-zero value of the `PriorityTeamContainsUndesiredServer` attribute. It is hard to monitor this, and it is hard to determine which servers are undesired at any given moment.

The data moving stops once `fdb_process_storage_data_lag_seconds` decreases below the `DD_SS_ALLOWED_VERSIONLAG` knob (200 seconds by default). If the lag stays above `DD_SS_ALLOWED_VERSIONLAG`, data keeps moving until almost all data is removed from this storage server.
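Until there is a better signal, one workaround is to scan the trace files for those events. A rough sketch, assuming fdbserver runs with `--trace_format json` so that each trace line is a JSON object whose `Type` field names the event (the attribute name is taken from the events described above; treat this as an illustration, not a supported API):

```python
import json

def undesired_server_signals(trace_lines):
    """Yield MovingData trace events whose PriorityTeamContainsUndesiredServer
    attribute is non-zero, i.e. the only currently visible hint that some
    storage server has been marked Undesired."""
    for line in trace_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue                       # skip partial or non-JSON lines
        if event.get("Type") != "MovingData":
            continue
        # JSON trace attribute values may be strings, so coerce to int.
        if int(event.get("PriorityTeamContainsUndesiredServer", 0)) != 0:
            yield event
```

Feeding this the lines of the current trace file (or tailing it) at least gives an alertable signal, though it still doesn’t identify which server is undesired.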