Migrating from a large cluster to another

Howdy - you didn’t happen to mention the storage engine you are utilizing (SSD(2) or Memory) but this is something we at wavefront happen to do quite often.

We run clusters operating at millions of writes per second and tens of thousands of reads (depending on load)

We call these operations a Fleet Replacement, as sometimes we need to increase total resources and want to start by dropping a fresh new fleet of hardware that is say 2x as powerful and running a slightly different number of machines.

Now to what you probably care about :slight_smile:

While our primary usecase right now is FDB3 (and we are migrating slowly to FDB5 re: How to prevent tlogs from overcommitting) a lot of our techniques for a safe replacement and preventing outages have applied or worked well.

I’d first ask if you have the ability to temporarily suspend workloads in your cluster? Also what kind of IO throughput do you have? and what kind of storage engine are you utilizing?

Typically when we are working with clusters that are between 250 -> 400procs we suppress all workloads and enqueue them to be processed later so that we can join a new host into the cluster without worry of the tlogs becoming overwhelmed and our CC wreaking havoc. We then slowly ramp up the enqueued workload while the cluster rebalances and shifts the load (which is much faster on the memory tier vs. ssd tier, and if you are already IO constrained on the SSD tier your workloads rebalancing will actually be quite costly).