Migrating from a large cluster to another

So good to know this is not only a problem for us :wink:

We run on SSD, but the storage engine and TLog should be irrelevant for the specific issue we ran into. The issue seems to be that with too many processes the coordinators and the Cluster Controller (CC) run into problems. More specifically, the CC misses a heartbeat window and a re-election happens. This then repeats every 10-30 seconds, so we stopped making progress. We ran into this problem when we grew the cluster to ~580 processes. Additionally, many clients (probably on the order of 1000) were connected to the cluster as well.

Stopping load to the cluster is not an option for us (yet). Doing all maintenance online is a hard business requirement for us. Currently we work around the issue by cutting down the number of clients. We also add only one new machine at a time and then remove one old machine. This means the whole process takes very long and is very painful.
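
To make the one-in, one-out procedure concrete, here is a rough Python sketch of the ordering. It is only an illustration, not our actual tooling: the function names and host addresses are made up, and the only real command it refers to is `fdbcli --exec "exclude <address>"`, which as far as I know waits until data has been moved off the excluded process. Actually starting and stopping machines is left out.

```python
from itertools import zip_longest
from typing import List, Optional, Tuple


def rolling_replacement_plan(
    old_hosts: List[str], new_hosts: List[str]
) -> List[Tuple[str, str]]:
    """Return (action, host) steps that swap machines one at a time.

    Hypothetical helper: each new machine is added before one old
    machine is excluded and removed, so the process count never jumps.
    """
    steps: List[Tuple[str, str]] = []
    pairs = zip_longest(new_hosts, old_hosts)  # pads the shorter list with None
    for new, old in pairs:
        new_host: Optional[str] = new
        old_host: Optional[str] = old
        if new_host is not None:
            steps.append(("add", new_host))
        if old_host is not None:
            steps.append(("exclude", old_host))  # drain data off the old machine
            steps.append(("remove", old_host))   # then shut it down
    return steps


def fdbcli_exclude_command(address: str) -> List[str]:
    """Build the argv for excluding one address via fdbcli (not executed here)."""
    return ["fdbcli", "--exec", f"exclude {address}"]


if __name__ == "__main__":
    # Example addresses are placeholders, not real hosts.
    for action, host in rolling_replacement_plan(
        ["10.0.0.1:4500", "10.0.0.2:4500"], ["10.0.1.1:4500", "10.0.1.2:4500"]
    ):
        print(action, host)
```

The point of interleaving the steps is that the cluster only ever has one extra machine at a time, which is exactly why it is so slow for a cluster of this size.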

There's a knob to change the polling frequency for the Cluster Controller. My understanding is that if we increase this value, the coordinators will wait longer before they react to failures, but it might help with the issue. We are currently testing this on a non-production cluster.

How many clients do you typically have connected to an FDB cluster? Did you implement some kind of proxy service for this?