So, we’re updating the AMI our FoundationDB cluster runs on. We want to do this as hitlessly as possible. Previously, we’ve only ever done it in a… ‘hitful’ way, by replacing all the machines and having the new ones come up and mount the volume and associated network interface (and thus IP) of their predecessor.
What has worked for us when we’ve made changes to storage class machines is to use the exclude command to exclude the machineid of each of the machines we want to shut down, wait for some time until all the data has been shuffled off them and onto the new storage processes on the new storage class machines (at which point the excluded processes also drop the storage role in their status), then shut down the ‘old’ machines, and delete them and their disks.
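For anyone wanting to script the "wait until the excluded processes have dropped their roles" step, here is a minimal sketch in Python. It assumes the parsed output of `status json` has a `cluster.processes` map whose entries carry `address`, `excluded` and `roles` fields, and that addresses are IPv4 `ip:port` strings; the function name and the sample data are made up for illustration.

```python
from typing import Iterable


def safe_to_shut_down(status: dict, old_ips: Iterable[str]) -> bool:
    """Return True once every process on the old machines is excluded and
    no longer reports any role.

    Assumes the `status json` layout described above and IPv4 addresses.
    """
    old = set(old_ips)
    processes = status.get("cluster", {}).get("processes", {})
    for proc in processes.values():
        # "10.0.0.1:4500" or "10.0.0.1:4500:tls" -> "10.0.0.1"
        ip = proc.get("address", "").split(":", 1)[0]
        if ip not in old:
            continue
        # Not yet excluded, or still holding a role: not safe yet.
        if not proc.get("excluded", False) or proc.get("roles"):
            return False
    return True


# Hypothetical, hand-written status fragment for illustration only.
sample_status = {
    "cluster": {
        "processes": {
            "a": {"address": "10.0.0.1:4500", "excluded": True, "roles": []},
            "b": {"address": "10.0.0.2:4500", "excluded": True,
                  "roles": [{"role": "storage"}]},
            "c": {"address": "10.0.0.3:4500", "excluded": False,
                  "roles": [{"role": "storage"}]},
        }
    }
}
```

In practice the status dictionary could come from running `fdbcli --exec 'status json'` and passing the output through `json.loads`; polling this check in a loop is one way to replace "wait for some time".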
This worked fine for storage class machines. During the data move the cluster reported itself as ‘Healing’, but was still responding to requests. Again, this was as expected.
When we did the same for machines running processes assigned the stateless classes, not only did their processes not drop their associated roles (with processes on non-excluded machines picking them up), but when we forced a role to migrate by shutting down the fdbmonitor (and thus also the managed fdbserver) process on that machine, FDB frequently chose already-excluded processes, which had no assigned roles, as the new processes to run the next generation of that role. The cluster was listed as ‘Healthy’ throughout this migration attempt, so it either doesn’t think it needs to heal, or the healing process completes so quickly that we don’t see it.
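To make the symptom above easy to reproduce and report, a small sketch like the following lists every excluded process that still holds (or has newly been given) a role. As before, it assumes `status json` exposes `cluster.processes` entries with `address`, `excluded` and `roles` (each role being an object with a `role` name); the function name and sample data are hypothetical.

```python
def roles_on_excluded(status: dict) -> dict:
    """Map the address of each excluded process to the sorted list of role
    names it still reports. Processes without roles are omitted."""
    result = {}
    processes = status.get("cluster", {}).get("processes", {})
    for proc in processes.values():
        if not proc.get("excluded", False):
            continue
        roles = sorted(r.get("role", "unknown") for r in proc.get("roles", []))
        if roles:
            result[proc.get("address", "unknown")] = roles
    return result


# Hypothetical status fragment: one excluded stateless process was handed
# roles again, one excluded process is idle, one process is not excluded.
sample_stateless_status = {
    "cluster": {
        "processes": {
            "p1": {"address": "10.0.1.1:4500", "excluded": True,
                   "roles": [{"role": "proxy"}, {"role": "master"}]},
            "p2": {"address": "10.0.1.2:4500", "excluded": True, "roles": []},
            "p3": {"address": "10.0.1.3:4500", "excluded": False,
                   "roles": [{"role": "resolver"}]},
        }
    }
}
```

Running this before and after stopping fdbmonitor on a machine would show whether a role really did land on an already-excluded process, rather than relying on reading the full status output by eye.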
What gives? I thought excluding a process was supposed to shuffle all data (which, because this is a very lightly loaded cluster, is basically 0 outside of the storage processes) owned by that process across to another one via a rebalance operation, and then drop any roles from the process and not assign new ones? If that’s not the case, what should we be doing to safely migrate these roles to machines we’re not about to shut down?
N.B. I know the coordinator role is different and we have to manually migrate that. I’m worried about the others.