Add storage nodes: a big batch or N small batches?

Our FDB cluster currently has 70 storage nodes (K8s pods), and we want to add 10 more. The replication factor is 3, and we run FDB v6.2.27.

Is it better to add 10 pods at once or add in smaller batches (like 5+5, or 3+3+4)?

My main goal is to minimize the temporary increase in disk usage during the rebalance. I don’t know how much (in percentage terms) the rebalance would raise disk usage. Is there a formula for a rough estimate?

We have had some unpleasant experiences with temporary disk space spikes. Last year, with FDB v6.2.11, we ran into trouble when newly added storage nodes reached 98% disk usage. We then upgraded to v6.2.27, which has worked better.

Last week, OS maintenance caused several storage pods to drop out of the FDB cluster because of a network security issue. During the rebalance that followed, I saw disk usage on some pods go up to 90%. We know that when disk usage reaches 95%, the FDB cluster halts.

Therefore we’d like to be very cautious this time. Thanks.

Many data distribution issues present in 6.2.11 have been fixed. Since you are running 6.2.27, you very likely will not see any of those issues again. There are also a couple of safety mechanisms that avoid adding more shards to nodes with more than 80% utilization.

Having said that, in my experience it is better to add the pods in one shot than in small batches, because that gives the data distribution algorithm a better chance to rebalance. Adding 10 nodes in one shot should work well, assuming your old nodes are not already heavily utilized (hopefully you have more than 30% free space).
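As a rough sanity check on headroom, the expected average utilization after the rebalance settles can be worked out with simple arithmetic. The sketch below is not an FDB formula: it assumes equal-sized disks and evenly spread data, and it only estimates the steady state, not the temporary spike during data movement. The function names are made up for illustration.

```python
def usage_after_rebalance(old_nodes, added_nodes, avg_used_fraction):
    """Average per-node disk usage once the same data is spread evenly
    over old_nodes + added_nodes (assumes equal disk sizes)."""
    total_nodes = old_nodes + added_nodes
    return avg_used_fraction * old_nodes / total_nodes


def free_fraction(used_fraction):
    """Free space as a fraction of the disk."""
    return 1.0 - used_fraction


if __name__ == "__main__":
    # Hypothetical starting point: 70 nodes at 70% used, adding 10 nodes.
    before = 0.70
    after = usage_after_rebalance(70, 10, before)
    print(f"before: {before:.1%} used, {free_fraction(before):.1%} free")
    print(f"after:  {after:.1%} used, {free_fraction(after):.1%} free")
```

With 70 nodes at 70% used, adding 10 nodes brings the average down to about 61%. Individual nodes can still sit well above the average while shards are being moved, so this is only a lower bound on what to expect during the rebalance.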

But if you want to be even more careful, and if your infrastructure supports it, the best way is to add 80 new storage nodes (in one shot) and exclude the old 70 nodes. That would rebalance the data in the best way possible. That said, I think in your case adding 10 in one shot is probably good enough.
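Whichever approach you take, it may help to watch per-process disk usage while the rebalance runs and compare it with the 80% and 95% levels mentioned in this thread. Here is a minimal sketch that reads the output of `fdbcli --exec "status json"` saved to a file. It assumes the status document exposes `cluster.processes.<id>.disk.free_bytes`, `disk.total_bytes`, `address`, and `roles[].role`; please check those field names against your version's status output. The function names and the file name are hypothetical.

```python
import json


def load_status(path):
    """Load a saved `status json` document from disk."""
    with open(path) as f:
        return json.load(f)


def storage_disk_usage(status):
    """Return {address: used_fraction} for processes with a storage role.

    Assumes the field layout described above; processes missing disk
    figures are skipped rather than guessed at.
    """
    usage = {}
    processes = status.get("cluster", {}).get("processes", {})
    for proc in processes.values():
        roles = [r.get("role") for r in proc.get("roles", [])]
        disk = proc.get("disk", {})
        total = disk.get("total_bytes")
        free = disk.get("free_bytes")
        if "storage" not in roles or not total or free is None:
            continue
        usage[proc.get("address", "unknown")] = 1.0 - free / total
    return usage


def over_threshold(usage, threshold=0.80):
    """Addresses whose used fraction is at or above the threshold."""
    return sorted(addr for addr, used in usage.items() if used >= threshold)
```

For example, after running `fdbcli --exec "status json" > status.json`, calling `over_threshold(storage_disk_usage(load_status("status.json")), 0.80)` would list the storage processes at or above 80% used. Note that several processes sharing one disk will report the same figures.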

Bhaskar, thanks for sharing your thoughts. I was thinking along the same lines and preferred a big batch, so it is nice to have your confirmation.
As for your second idea, we don’t have enough spare resources to implement that approach.
Thank you.