Storage queue size: cluster.qos.worst_queue_bytes_storage_server contains the maximum size in bytes of a storage queue. Each storage server has mutations that have not yet been made durable, stored in its storage queue. If this value gets too large, it indicates a storage server is falling behind. A large storage queue will cause the ratekeeper to increase throttling. However, depending on the configuration, the ratekeeper can ignore the worst storage queue from one fault domain. Thus, ratekeeper uses cluster.qos.limiting_queue_bytes_storage_server to determine the throttling level.
What does “depending on the configuration, the ratekeeper can ignore the worst storage queue from one fault domain” mean? What is the configuration? Is it the knob MAX_MACHINES_FALLING_BEHIND? What would I configure to tolerate more storage servers falling behind?
Ah, “configuration” is indeed overloaded there. I believe it refers to the replication configuration.
If you ran configure single on your database, then ratekeeper can’t ignore the worst storage queue, because that storage server is always the only replica of some piece of data. In any other replication configuration, ratekeeper already ignores the worst storage queues from one fault domain. No additional changes are needed.
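To check how much difference the ignored fault domain makes on a given cluster, the two status fields quoted in the question can be compared. The sketch below assumes you already have the parsed status document (for example, the JSON printed by `status json` in fdbcli) and that the dotted names map directly to nested keys. The helper name `queue_metrics` and the sample values are made up for illustration.

```python
def queue_metrics(status):
    """Return (worst, limiting, gap) storage queue sizes in bytes.

    `status` is assumed to be the parsed status JSON document, with the
    fields nested as cluster -> qos -> <field name>.
    """
    qos = status["cluster"]["qos"]
    worst = qos["worst_queue_bytes_storage_server"]
    limiting = qos["limiting_queue_bytes_storage_server"]
    # A large gap suggests the worst queue sits in the fault domain
    # that ratekeeper is currently ignoring.
    return worst, limiting, worst - limiting


# Hypothetical sample, not real cluster output.
sample_status = {
    "cluster": {
        "qos": {
            "worst_queue_bytes_storage_server": 900_000_000,
            "limiting_queue_bytes_storage_server": 120_000_000,
        }
    }
}

worst, limiting, gap = queue_metrics(sample_status)
print(f"worst={worst} limiting={limiting} gap={gap}")
```

If the two values are equal, nothing is being ignored at that moment, and throttling is driven by the worst queue itself.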
If you’re running the triple configuration, and one storage server in a team falls behind, then ratekeeper won’t begin limiting. If a storage server in a different storage team, but in the same zone_id, starts falling behind, it too will be ignored. If a storage server in a third storage team begins falling behind, and this one is in a different zone_id, then ratekeeper will begin limiting. The rule here specifically is “one zone of worst_storage_queues may be ignored”, not “the worst storage server in each team may be ignored”.
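The rule can be illustrated with a toy model. This is only a sketch of the behaviour described above, not ratekeeper's actual code: it assumes each storage server is a (zone_id, queue_bytes) pair, ignores every server in the zone that holds the single worst queue, and reports the worst queue among the remaining zones. The function name and the `can_ignore_one_zone` flag (standing in for "any replication mode other than single") are hypothetical.

```python
from collections import defaultdict


def limiting_queue_bytes(servers, can_ignore_one_zone=True):
    """Toy model: worst storage queue after ignoring one whole zone.

    servers: iterable of (zone_id, queue_bytes) pairs.
    can_ignore_one_zone: False models `configure single`, where no
    queue can be ignored.
    """
    worst_by_zone = defaultdict(int)
    for zone_id, queue_bytes in servers:
        worst_by_zone[zone_id] = max(worst_by_zone[zone_id], queue_bytes)

    if not worst_by_zone:
        return 0
    if not can_ignore_one_zone:
        return max(worst_by_zone.values())

    # Drop the zone containing the overall worst queue, then take the
    # worst queue among the zones that are left.
    ignored_zone = max(worst_by_zone, key=worst_by_zone.get)
    remaining = [b for z, b in worst_by_zone.items() if z != ignored_zone]
    return max(remaining, default=0)


# Two lagging servers in the same zone: both are ignored.
same_zone = [("z1", 900), ("z1", 800), ("z2", 100), ("z3", 50)]
print(limiting_queue_bytes(same_zone))

# A third lagging server in a different zone is not ignored.
second_zone = same_zone + [("z2", 700)]
print(limiting_queue_bytes(second_zone))
```

In the first case the limiting value stays at the small z2 queue even though two servers are far behind, because both are in z1. In the second case the z2 server's queue becomes the limiting value, which is the point at which limiting would begin in this model.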