Migrating from a large cluster to another

markus.pilman · September 24, 2018, 5:58pm

So good to know this is not only a problem for us.

We run on SSD, but the storage engine and TLog should be irrelevant for the specific issue we ran into. It seems that with too many processes the coordinators and the CC run into problems. More specifically, the ClusterCoordinator misses a heartbeat window and a reelection happens (this then repeats every 10-30 seconds, so we stopped making progress). We ran into this problem when we grew the cluster to ~580 processes. Additionally, many clients (probably on the order of 1000) were connected to the cluster as well.

Stopping load to the cluster is not an option for us (yet). Doing all maintenance online is a hard business requirement for us. Currently we work around the issue by cutting down the number of clients. We also add only one new machine at a time and then remove one old machine. This makes the whole process very slow and very painful.
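To make the one-in, one-out procedure concrete, here is a rough sketch of how the ordering could be planned. It is not our actual tooling: the function names, the step labels and the machine addresses are all made up for illustration, and the "exclude" step only stands in for whatever drains data off the old machine (for example fdbcli's `exclude` command). The sketch only produces the order of the steps; it does not talk to a cluster.

```python
from typing import Iterator, List, Tuple

# (action, machine) pairs; the action labels are hypothetical names
# chosen for this sketch, not FoundationDB commands.
Step = Tuple[str, str]


def plan_rolling_replacement(old: List[str], new: List[str]) -> Iterator[Step]:
    """Yield steps that swap machines one pair at a time."""
    if len(old) != len(new):
        raise ValueError("this sketch assumes a one-to-one replacement")
    for new_machine, old_machine in zip(new, old):
        yield ("add", new_machine)           # bring up the new machine
        yield ("wait_healthy", new_machine)  # let the cluster settle
        yield ("exclude", old_machine)       # drain data off the old machine
        yield ("remove", old_machine)        # shut the old machine down


def peak_extra_machines(steps: List[Step]) -> int:
    """Return the largest number of machines above the starting count."""
    extra = peak = 0
    for action, _ in steps:
        if action == "add":
            extra += 1
        elif action == "remove":
            extra -= 1
        peak = max(peak, extra)
    return peak


if __name__ == "__main__":
    steps = list(plan_rolling_replacement(["old-1", "old-2"], ["new-1", "new-2"]))
    for action, machine in steps:
        print(action, machine)
    print("peak extra machines:", peak_extra_machines(steps))
```

The point of going pair by pair is that the cluster never holds more than one machine above its starting size, which keeps the process count from growing further at the cost of a long, serial migration.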

There’s a knob to change the polling frequency for the Cluster Controller. My understanding is that if we increase this value, the coordinators will wait longer before they react to failures, but it might help with the issue. We are currently testing this on a non-production cluster.

How many clients do you typically have connected to an FDB cluster? Did you implement some kind of proxy service for this?
