Unexpected repartitioning of the database

This is a follow-up to this question here, where we noticed high disk IO during heavy delete operations. The original problem with the latency of the delete txns was fixed, but we now have a problem with throughput.

For context: we have a batch job that clears data older than x days, and depending on the run date it can clear up to 5-10% of the data in the cluster (> 10 million keys). Our cluster size is about 140 GB, and post-delete it comes down to about 132 GB. We run a triple-replication cluster, ssd-2 storage engine, across 5 machines with 3 storage servers per machine. Server version is 6.2.19.
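For illustration, a retention job like the one described is often structured as one small clear per time bucket rather than a single huge range clear. This is only a sketch: the key schema below (`data/<YYYY-MM-DD>/<id>`) is hypothetical, since the post doesn't describe the actual layout, and in practice each yielded range would be passed to the FDB client's `clear_range` in its own transaction.

```python
from datetime import date, timedelta

def daily_clear_ranges(today, retention_days, horizon_days=30):
    """Yield (begin, end) byte-string key ranges, one per expired day,
    oldest first.

    Assumes a hypothetical key layout of b'data/<YYYY-MM-DD>/<id>'.
    Clearing one day per transaction keeps each clear small instead of
    issuing one enormous clear_range.
    """
    cutoff = today - timedelta(days=retention_days)
    day = cutoff - timedelta(days=horizon_days)
    while day < cutoff:
        prefix = f"data/{day.isoformat()}/".encode()
        # End key: the prefix with its last byte incremented, which is
        # the first key strictly after every key under this prefix.
        yield prefix, prefix[:-1] + bytes([prefix[-1] + 1])
        day += timedelta(days=1)
```

Each range could then be cleared in its own transaction, which also makes it easy to pace the job if needed.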

Now the problem: sometimes while the delete is happening, the DB starts moving a significant amount of data. This pushes almost all of the storage servers to 60-90% disk utilization, and eventually Ratekeeper kicks in. During the peak contention period I see TPSLimit go down as low as 2.142, and we notice significant delays in opening transactions.

In the DataDistributor log I see AverageShardSize go from 33953350 to 32671480, with a max InQueue=350. InProgress goes up to 75. The number of shards comes down from approximately 4500 to 4120 during this period.
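As a rough sanity check, shard count times average shard size lands in the same ballpark as the reported ~140 GB / ~132 GB logical sizes, so the shard-count drop looks like ordinary consolidation of shrunken shards after the delete. This assumes AverageShardSize is in bytes, which the log field doesn't state explicitly:

```python
# shards * average shard size, before and after the delete,
# using the figures quoted from the DataDistributor log
before_gb = 4500 * 33953350 / 1e9
after_gb = 4120 * 32671480 / 1e9
print(round(before_gb), round(after_gb))  # roughly 153 and 135 GB
```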

The whole redistribution takes about 2 hours and then everything returns to normal. We are not setting any knobs for min/max shard size. We are running with spring cleaning turned off (--knob_spring_cleaning_max_vacuum_pages 0). I can get more information from logs if required.

Would explicitly setting the min (a very low number) and max shard size (about 100 MB?) alleviate this problem? Also, is there a way to slow down this data movement? The queue depth I see in the DD logs worries me. Please help.

Did you try to dial down the data-distribution knobs? They define how aggressively DD balances data.

Thanks, I will set these and see how it behaves.

Are you still running your workload during the period when storage servers are busy? How busy is the cluster when you are just running your workload with no data distribution? Maybe similarly interesting, how busy is the cluster if you turn off your workload and let it do data distribution only?

Data distribution is intended to not contribute excessively to the load on the cluster, but if you are already running near saturation then data distribution can certainly push you over the edge. In other words, it’s expected that you’ll run with sufficient headroom in your cluster to accommodate the activities of data distribution. If it’s the case, though, that data distribution is a substantial portion of the load on the cluster when you saturate, then that’s a sign that something may not be quite right. It could just be that DD needs to be slowed down, or maybe it could be behaving in a non-ideal way that would be better fixed by a change to the code.

What version are you running?


I was waiting for a few runs to observe the behavior before posting here. It's a very consistent behavior.

  1. At a particular time of day we start the delete jobs. This is a burst job and it cleans quite a lot.
  2. Our usual WriteHz is around 5-6K, but during the delete we easily go up to 15-20K write ops.
  3. About 10-15 min into the delete job, rebalancing kicks in.
  4. At the same time Ratekeeper kicks in. Interestingly, the overall WriteHz calculated in status.json stays roughly the same, but our latencies definitely go up.

It's fairly busy. When this specific job runs we are usually at around 2.5x the average load.

The rebalance kicks in only when we run the burst delete job, so it's hard for me to simulate this behaviour in isolation.

Unfortunately this is not the case. It's exactly when the cluster approaches saturation that data redistribution kicks in. In a separate environment I tried setting --knob_dd_rebalance_parallelism 25, and I do see the number of InProgress shards come down (usually < 40). MountainChopper and ValleyFiller report a PollingInterval of 120, which I think is the max. The TPSLimit I see in Ratekeeper is more generous when the parallelism is reduced.

Reading through the code, "checkDelay" is the key parameter throttling the number of in-progress redistributions. I plan to bump BG_DD_MAX_WAIT up to, say, 240 and leave DD_REBALANCE_PARALLELISM at 25. Does that sound like a decent approach?
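For reference, knobs like the ones discussed in this thread can be set per-process in foundationdb.conf rather than on the command line. The values below are simply the ones mentioned above, not recommendations, and you'd want to verify the exact knob names against your server version:

```ini
[fdbserver]
  knob_dd_rebalance_parallelism = 25
  knob_bg_dd_max_wait = 240
  knob_spring_cleaning_max_vacuum_pages = 0
```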

My thought was to wait for this burst job to cause the rebalancing work, and then while data movement is going on suspend your workload. I don’t know how long data movement will continue in your particular circumstances, but if it’s long enough it could give you a sense for how much work the movement is doing in isolation.

My gut says that the best response to this would be to increase the size of the cluster or to reduce the rate of your burst jobs to give your cluster a little more breathing room. You could potentially also try using batch priority for your burst jobs, which would cause them to be throttled before anything else, though you should be aware that batch priority work is treated as non-critical and can be shut off for extended periods due to hardware failures, etc.
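On reducing the rate of the burst jobs: one simple client-side approach is to pace the delete batches so the job never exceeds a target ops/sec. This is a generic sketch, not tied to the FDB client; the delete loop would call `throttle` once per committed batch:

```python
import time

class BatchPacer:
    """Caps a batch job at roughly max_ops_per_sec by sleeping
    between batches when the job runs ahead of its budget."""

    def __init__(self, max_ops_per_sec):
        self.max_ops = max_ops_per_sec
        self.start = time.monotonic()
        self.ops_done = 0

    def throttle(self, batch_size):
        # Record the work, then sleep until enough wall-clock time has
        # elapsed to keep the average rate at or below max_ops.
        self.ops_done += batch_size
        required = self.ops_done / self.max_ops
        elapsed = time.monotonic() - self.start
        if required > elapsed:
            time.sleep(required - elapsed)
```

Spreading the same deletes over a longer window trades job duration for headroom, which may also soften the rebalancing burst that follows.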

You can certainly opt to slow down data movement instead, but if you are still operating reasonably close to saturation, you may find that the cluster is a little brittle in the face of various events (e.g. a failed disk, slight increases in workload, variation in the internal work done by the cluster, etc.). You’ll also of course be slower to heal in response to the various scenarios that need data distribution to fix things.