Troubles scaling up the cluster

Are these all connecting to the cluster as clients? How many thousands? Right now we recommend limiting the number of clients to somewhere around 1000 (this will likely depend on your hardware), but you could experiment to see what works in your setup. If you are running many thousands of clients, that can be taxing on the cluster controller to the point of causing system instability.
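If you want to keep an eye on how close you are to that number, here is a minimal sketch of a check you could run against the machine-readable status output. It assumes the client count appears under `cluster.clients.count` in the status JSON, which you should verify against your version. The names `CLIENT_SOFT_LIMIT`, `client_count`, and `check_clients` are made up for this example, and the 1000 figure is just the rough guideline above, not a hard limit.

```python
from typing import Optional

# Rough guideline from this thread; the workable number depends on your hardware.
CLIENT_SOFT_LIMIT = 1000


def client_count(status: dict) -> Optional[int]:
    """Return the connected-client count from a parsed status document, or None if absent."""
    # Assumed location: cluster.clients.count (check the status schema for your version).
    count = status.get("cluster", {}).get("clients", {}).get("count")
    return count if isinstance(count, int) else None


def check_clients(status: dict, limit: int = CLIENT_SOFT_LIMIT) -> str:
    """Summarize whether the client count is above the soft limit."""
    count = client_count(status)
    if count is None:
        return "unknown: no client count in status"
    if count > limit:
        return f"warning: {count} clients exceeds soft limit of {limit}"
    return f"ok: {count} clients"
```

You would feed it the parsed output of something like `fdbcli --exec "status json"`, for example via `json.loads(...)`, and alert on the warning case.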

There’s another thread here where we discussed this a bit: Migrating from a large cluster to another - #4 by markus.pilman

I’ll also say this is an area that we’re looking to improve. I think there are a handful of different changes that could make things better, but one in particular is to distribute the failure monitoring activity so that the single cluster controller isn’t responsible for doing it for every client.