Seeing lots of rebalancing after fleet-wide restarts

We operate one of the largest FDB clusters. We recently restarted our fleet cluster-wide for a kernel update. Since then, we have seen long stabilization periods; specifically, there was long rebalancing between the nodes, which pushed the storage queue above 1 GB. Has anyone run into this? Any explanation of why this happens and how to avoid the long stabilization? We would expect rebalancing only when adding or removing a node. We have the following knob changed from its default: we had set a custom shard size of 100 MB and then moved it back to the default of 500 MB.
# Knobs
knob_max_shard_bytes = 100000000


When the shard size changes, data distribution will reshuffle data to create shards that conform to the new size. For example, when you increase the size from 100 MB to 500 MB, existing shards will be merged into larger shards. Two to-be-merged shards can sit on two different hosts; merging them requires relocating one of them.
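One way to watch this reshuffle drain is to track `cluster.data.moving_data` in the output of fdbcli's `status json`. Below is a minimal sketch; the nested field path and the `in_flight_bytes` field name are from our recollection of the status schema, so verify them against the `status json` output of your FDB version:

```python
import json
import subprocess

def moving_data_bytes(status_json: str) -> int:
    """Return the number of bytes data distribution is currently
    relocating, according to a `status json` document (0 if absent)."""
    status = json.loads(status_json)
    moving = status.get("cluster", {}).get("data", {}).get("moving_data", {})
    return moving.get("in_flight_bytes", 0)

# Example with a trimmed-down status document (shape assumed):
sample = '{"cluster": {"data": {"moving_data": {"in_flight_bytes": 1234567}}}}'
print(moving_data_bytes(sample))  # 1234567

# Against a live cluster (requires fdbcli on PATH):
# out = subprocess.run(["fdbcli", "--exec", "status json"],
#                      capture_output=True, text=True).stdout
# print(moving_data_bytes(out))
```

Polling this value after the restart would let you confirm the "long stabilization" is data-distribution movement rather than something else, and see when it trends back to zero.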
