Migrating from a large cluster to another

We operate FDB in AWS with EC2 instances and EBS for all storage. While we can scale out FDB many times by adding more EC2 instances, there are cases when we need to migrate all the data from 1 FDB cluster to another. We call this a “wiggle”. We wiggle for reasons like 1) the AWS instances have reached their EBS throughput limit and we upsize the instances, 2) the SQLite btree is too fragmented, etc. The wiggle is achieved by setting up a new cluster, starting with its processes excluded. And then including new servers’ processes and excluding old servers’ processes until all old servers’ processes are excluded.

We currently have a cluster with 288 storage procs + other roles and say 1.5K clients connecting it (say 1/2 short lived monitoring and 1/2 persistent clients). This is the largest cluster we’ve built thus far. We started the wiggle over to a new cluster by adding a similar number of worker processes as excluded. As soon as we did that the cluster went down. We went through several recoveries and cluster coordinator re-elections. And we had to shutdown the new cluster processes in order to get the old cluster to recover successfully.

We suspect that the cluster controller was overloaded, and unable to heartbeat with the coordinators in a timely matter. And the coordinators nominated a new CC. This kept happening multiple times. We also saw the CC controller being trigger happy and finding a better master and killing the master with log messages of type BetterMasterExists. And then of course the controller itself got replaced.

  1. Have you seen similar issues with large clusters?
  2. There is clearly a limit with the number of workers a CC can handle. Adjusting timeout knobs can probably make this scale higher.
  3. Wiggles from one large cluster to another need different and more cumbersome steps.

Pl. can you share your experiences with large clusters? And how do you achieve migration to another larger cluster if you do that?

We also saw the CC controller being trigger happy and finding a better master and killing the master with log messages of type BetterMasterExists.
We figured out why this was happening. We have the desired tlogs set to 10, while the we use 6 machine ids to locality placement. And we introduced new machine ids as part of the wiggle, which resulting in an attempt to recruit more tlogs.

But the cluster coordinator limits in terms of number of clients and workers in the cluster isn’t well understood.
Would be great if folks running large clusters can share their experiences.

Howdy - you didn’t happen to mention the storage engine you are utilizing (SSD(2) or Memory) but this is something we at wavefront happen to do quite often.

We run clusters operating at millions of writes per second and tens of thousands of reads (depending on load)

We call these operations a Fleet Replacement, as sometimes we need to increase total resources and want to start by dropping a fresh new fleet of hardware that is say 2x as powerful and running a slightly different number of machines.

Now to what you probably care about :slight_smile:

While our primary usecase right now is FDB3 (and we are migrating slowly to FDB5 re: How to prevent tlogs from overcommitting) a lot of our techniques for a safe replacement and preventing outages have applied or worked well.

I’d first ask if you have the ability to temporarily suspend workloads in your cluster? Also what kind of IO throughput do you have? and what kind of storage engine are you utilizing?

Typically when we are working with clusters that are between 250 -> 400procs we suppress all workloads and enqueue them to be processed later so that we can join a new host into the cluster without worry of the tlogs becoming overwhelmed and our CC wreaking havoc. We then slowly ramp up the enqueued workload while the cluster rebalances and shifts the load (which is much faster on the memory tier vs. ssd tier, and if you are already IO constrained on the SSD tier your workloads rebalancing will actually be quite costly).

So good to know this is not only a problem for us :wink:

We run on SSD, but the Storage engine and TLog should be irrelevant for the specific issue we ran into. It seems that the issue is that with too many processes the coordinators and the CC run into problem. More specifically, the ClusterCoordinator loses a heartbeat window and reelection happens (this then happens ever 10-30 seconds, so we stopped making progress. We ran into this problem when we grew the cluster to ~580 processes. Additionally many clients (probably in the order of 1000) were connected to the cluster as well.

Stopping load to the cluster is not an option for us (yet). Doing all maintenance online is a hard business requirement for us. Currently we solve the issue by cutting down the number of clients. We also add only one machine at a time and remove an old machine. This means the whole process takes very long and is very painful.

There’s a knob to change polling frequency for the Cluster Controller. My understanding is, that if we increase this value the coordinators will wait longer before they react to failures but it might help with the issue. We are currently testing this on a non-production cluster.

How many clients do you typically have connected to an FDB cluster? Did you implement some kind of proxy service for this?

This is indeed a limitation in the current system. The cluster controller maintains connections with everybody and does some work related to failure monitoring that can end up being saturating when there are a lot of server and/or client processes

Starting in 5.1, the cluster controller can be configured to run on its own process, which gives a little bit more headroom and makes things a lot more predictable (you don’t have to worry about it sharing a process with another role that’s using a lot of CPU). It doesn’t change the underlying limitation, but it does make things a bit better.

I think it’s possible to turn down how often other processes are checking in with the cluster controller, which can reduce the load put on it at the cost of increased response times to failures. The relevant knobs in the current code are SERVER_REQUEST_INTERVAL and CLIENT_REQUEST_INTERNAL (https://github.com/apple/foundationdb/blob/a10ac552a5bca9324e815b6b6d16b13f643059b9/flow/Knobs.cpp#L37). I don’t remember what it’s named in 3.0 releases, as I think it used to exist as only one knob, but hopefully it’s not too hard to find. As a caveat, I don’t think I’ve ever tried changing these knobs, so I can’t guarantee there aren’t any unexpected side-effects.

As you point out, you can also try to adjust how long a cluster controller is allowed to not respond before the coordinators elect a new one.

One other option when migrating to a new set of hardware that may or may not be feasible in your situation is to do it piecemeal. You can choose a smaller number of hosts to migrate at one time that doesn’t saturate the cluster controller, and when those finish start another batch. Going this route will probably result in extra recoveries.

Thanks for your answer. I thought we could do stuff like that but we can’t change knobs in production without testing them thoroughly first. Currently we are doing a piecemeal migration. This will take much longer.

But this will be a problem. The load on our cluster is steadily increasing as we are growing very fast. So we’ll have to find ways to improve this. We do not only add machines to migrate, we also grow the size of the cluster.

Do you have some pointers on those things:

  1. How expensive are clients to this compared to fdbserver processes (apart from the load they generate)? My understanding is that they also ping the cluster controller and they also need to communicate with the coordinators, the proxies and the storage nodes.
  2. My understanding is, that opening a database is quite expensive. The client needs to connect to all coordinators, ask for the leader and then connect to the leader to get the ClientDBInfo. We run fdbcli --exec ‘status json’ in many places to gather metrics. Do you have any experience that could help us quantify the cost of this?

The clients maintain connections with everyone they talk to, so the processes you named will have some baseline amount of communication with each client. Probably the most expensive bit outside of actual load is going to be the failure monitoring activity that they do with the cluster controller.

I’m not actually sure offhand how expensive it is to setup a new connection to the cluster. It does do the things you describe, but I couldn’t quantify how expensive that actually is. I do know that collecting status can be quite costly, so if you are doing this frequently you may experience some trouble. Particularly in larger clusters, status can cause slow tasks that starve other tasks on the process (this is worse in 3.0 because it could be sharing a process with a latency-sensitive role). If you have frequent status calls, this can be also end up utilizing a large percentage of your CPU. For what it’s worth, in 6.0 there is some work to make status calls significantly less costly.

On the other hand, the calls through fdbcli will end up closing the connections, so there aren’t any of the ongoing client maintenance costs.

Thanks for your answer!

we had issues with status json before. We are considering setting the STATUS_MIN_TIME_BETWEEN_REQUESTS knob to something like 5 seconds to mitigate against this. We also reduced the number of monitoring clients already.

This obviously will all only work in the short term. Did you ever develop ideas for long-term solutions for this scaling issue? Our production clusters are all growing very quickly and we need to find a way to scale more.

One obvious thing we can do is reduce the number of clients by using some kind of proxy-solution. But this will only help us so far - we’ll probably also need something that allows us to add more servers. We are currently running tests to figure out where the current scaling limits are (we don’t have numbers yet, but it seems to be somewhere around 1500).

what kind of TCP performance tuning have you done to improve the latency and response times of the network. sometimes improving how the underlying OS handles network calls can greatly improve the CC.

These are our FDB specific sysctl configurations we run on on ubuntu and have had great success in improving our reliability.

      - { name: net.core.somaxconn, value: 1000 }
      - { name: net.core.netdev_max_backlog, value: 5000 }
      - { name: net.core.rmem_max, value: 16777216 }
      - { name: net.core.wmem_max, value: 16777216 }
      - { name: net.ipv4.tcp_wmem, value: "4096 12582912 16777216" }
      - { name: net.ipv4.tcp_rmem, value: "4096 12582912 16777216" }
      - { name: net.ipv4.tcp_max_syn_backlog, value: 8096 }
      - { name: net.ipv4.tcp_slow_start_after_idle, value: 0 }
      - { name: net.ipv4.tcp_tw_reuse, value: 1 }
      - { name: net.core.default_qdisc, value: "fq_codel" }
      - { name: net.ipv4.tcp_mtu_probing, value: 1 }
      - { name: net.ipv4.tcp_tw_recycle, value: 1 }

for instance one of our larger SSD clusters is roughly around 16 nodes w/ 256 procs and 79TB in KV store, we can sustain around 28k writes and 33k reads and add and remove capacity without incident.

  Read rate              - 28099 Hz
  Write rate             - 33465 Hz

Our disk configuration and layout though may be drastically different than yours, and we heavily utilize AWS primitives for high speed read caches with high hit rates and LVM raids too improve write throughput.

Reading through the answers, I have a similar one. We have introduced a new servers into the cluster and I would like to move as quickly as possible. We have to distinguished clusters, where on one storage servers are running on just 375GB NVMe disks, while the other one has 3TB of NVMe disks. I wish I could tell the cluster to somehow move the data to the 3TB disks instead of being to even, as I’m planning to dismantle the small ones.

In last couple of days, I was able to remove around 50 servers, but the last 12 are going incredibly slowly, as it takes them hours to free up even 20GB of space. If I remove a storage server before there is at least 20GB on the most full, the whole cluster goes to halt and things take much longer to recover.

Anything I can do to help the cluster to understand my intent of removing the small servers? Or anything else I can do to force a mass evacuation of the data between servers?

You can use the exclude command, and point it to the servers that you’d like to remove. See Removing Machines from a Cluster.

I’m aware. But when I try to remove one server from the 12 servers fleet, I get this.

ERROR: This exclude may cause the total free space in the cluster to drop below 10%.

Should I maybe remove all of them at the same time? Or what’s the best way to do this.

Assuming you’ve added the processes with the big disks before excluding, you’re probably getting this because the space computation thinks the non-excluded small disks will be too full after the exclude. Specifying them all in the same invocation of exclude should avoid this problem.

I’ll also mention that the space check is a bit pessimistic about how much free space will remain, and if you think the check isn’t accurately characterizing what the outcome will be, you can skip it by adding the FORCE keyword (e.g. > exclude FORCE <ip> <ip> ...)

Got it. Wasn’t sure if I can exclude them all at the same time. This makes things much easier.

Yep, found out the hard way.

Thank you

Worked like a charm. Took around 12 hours, but it was much faster than doing this manually one by one.