We have used RELOCATION_PARALLELISM_PER_SOURCE_SERVER to control data movement in the past, but is that still the right way to do it? Basically, we want to reduce the urgency with which the cluster heals itself when it has one node left, because healing can push the log queues high enough that cluster throughput suffers.
We use the following two knobs on clusters, especially where we are IOPS-limited.
Yeah, we have those set, but they don’t seem to be able to control the log queues (storage queues are OK).
How large are the log queues growing? They are expected to grow during a failure up to at least 1.5 GB.
Yeah, but that causes enough slowness (ratekeeper throttling) that the latency of reads and writes becomes noticeable. What we’re after is reducing the impact of healing: it shouldn’t be so aggressive that the cluster has little headroom left to handle, for instance, a spike in traffic.
We simulated a node failure, and it basically pegged the tlogs at 1.8 GB or above, which means any additional write load could cause latencies to spike. Contrast that with just adding a new node (which results in low-priority moves): that doesn’t have the same impact on the cluster.
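To put rough numbers on the headroom point, here is a minimal back-of-the-envelope sketch in Python. It is not FoundationDB code: the function name, the 2.0 GB throttle point, and the 50 MB/s burst rate are all assumptions picked for illustration, so substitute whatever your cluster's ratekeeper settings and traffic actually look like. Only the 1.5 GB and 1.8 GB queue sizes come from this thread.

```python
def seconds_of_headroom(queue_bytes, throttle_bytes, excess_write_bytes_per_s):
    """Seconds until a log queue reaches the throttle point.

    excess_write_bytes_per_s is the net rate at which the queue grows
    (incoming writes minus what the tlog manages to drain).
    """
    remaining = max(throttle_bytes - queue_bytes, 0)
    if excess_write_bytes_per_s <= 0:
        # Queue is not growing, so it never reaches the throttle point.
        return float("inf")
    return remaining / excess_write_bytes_per_s


# Hypothetical values, for illustration only.
ASSUMED_THROTTLE_BYTES = 2_000_000_000  # assumed point where throttling bites
ASSUMED_BURST_BYTES_PER_S = 50_000_000  # assumed net queue growth in a spike

for label, queue_bytes in [
    ("expected during a failure", 1_500_000_000),
    ("observed in our test", 1_800_000_000),
]:
    secs = seconds_of_headroom(
        queue_bytes, ASSUMED_THROTTLE_BYTES, ASSUMED_BURST_BYTES_PER_S
    )
    print(f"{label}: {secs:.1f} s of headroom")
```

Under those assumed numbers, a queue at 1.5 GB leaves about 10 seconds before throttling kicks in, while a queue pegged at 1.8 GB leaves about 4 seconds, which is why the same traffic spike hurts so much more while the cluster is healing.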