So, we’re updating the AMI our FoundationDB cluster runs on, and we want to do this as hitlessly as possible. Previously, we’ve only ever done it in a… ‘hitful’ way, by replacing all the machines and having each new one come up and mount the volume and associated network interface (and thus IP) of its predecessor.
What has worked for us when we’ve made changes to storage class machines is to use the exclude command to exclude the machineid of each of the machines we want to shut down, wait until all the data has been shuffled off them and onto the new storage processes on the new storage class machines (at which point the excluded processes also drop the storage role in their status), then shut down the ‘old’ machines and delete them and their disks.
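For reference, the "wait until the excluded processes drop their roles" step can be checked from the machine-readable status rather than by eye. This is only a rough sketch: it assumes the status document has the `cluster.processes` map with `address`, `excluded`, and `roles` fields, as our cluster's `status json` output does, and the `sample` data below is made up for illustration.

```python
def excluded_processes_with_roles(status):
    """Return {address: [role names]} for excluded processes that still hold roles.

    `status` is the parsed output of `status json`. The field names used here
    (cluster.processes, excluded, roles, address) are assumed from that output.
    """
    processes = status.get("cluster", {}).get("processes", {})
    remaining = {}
    for proc in processes.values():
        if not proc.get("excluded", False):
            continue  # only excluded processes are of interest
        roles = [r.get("role") for r in proc.get("roles", [])]
        if roles:
            remaining[proc.get("address", "unknown")] = roles
    return remaining


# Hypothetical, heavily trimmed status document (not real cluster output).
sample = {
    "cluster": {
        "processes": {
            "a": {"address": "10.0.0.1:4500", "excluded": True,
                  "roles": [{"role": "storage"}]},
            "b": {"address": "10.0.0.2:4500", "excluded": True, "roles": []},
            "c": {"address": "10.0.0.3:4500", "excluded": False,
                  "roles": [{"role": "storage"}]},
        }
    }
}

print(excluded_processes_with_roles(sample))
# {'10.0.0.1:4500': ['storage']}
```

An empty result would mean every excluded process has dropped its roles, which is the point at which we shut the old machines down. Against a live cluster the status could be loaded with something like `json.loads(subprocess.check_output(["fdbcli", "--exec", "status json"]))`.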
This worked fine for storage class machines. During the data move the cluster reported itself as ‘Healing’, but was still responding to requests. Again, this was as expected.
When we did the same for machines running processes assigned the log or stateless classes, their processes did not drop their associated roles, and processes on non-excluded machines did not pick those roles up. Worse, if we force a role to migrate by shutting down the fdbmonitor process (and thus also the managed fdbserver process) on that machine, FDB frequently chooses already-excluded processes, which had no assigned roles, as the new processes to run the next generation of that role. The cluster is listed as ‘Healthy’ throughout this migration attempt, so either it doesn’t think it needs to heal, or the healing process completes so quickly that we don’t see it.
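To make that observation concrete, here is a rough sketch of how two status snapshots (one taken before stopping fdbmonitor, one after) could be compared to list roles that landed on excluded processes. Same caveats as any quick script: the field names (`cluster.processes`, `address`, `excluded`, `roles`) are assumed from `status json` output, and the snapshots below are invented for illustration.

```python
def role_map(status):
    """Map each process address to (excluded flag, set of role names)."""
    processes = status.get("cluster", {}).get("processes", {})
    result = {}
    for proc in processes.values():
        roles = {r.get("role") for r in proc.get("roles", [])}
        result[proc.get("address", "unknown")] = (proc.get("excluded", False), roles)
    return result


def roles_gained_by_excluded(before, after):
    """Return {address: [roles]} that excluded processes hold in `after` but not in `before`."""
    old = role_map(before)
    gained = {}
    for address, (excluded, roles) in role_map(after).items():
        if not excluded:
            continue
        previous_roles = old.get(address, (False, set()))[1]
        new_roles = roles - previous_roles
        if new_roles:
            gained[address] = sorted(new_roles)
    return gained


# Hypothetical snapshots: the log role moves onto an already-excluded process.
before = {"cluster": {"processes": {
    "a": {"address": "10.0.0.1:4500", "excluded": True, "roles": []},
    "b": {"address": "10.0.0.2:4500", "excluded": True, "roles": [{"role": "log"}]},
}}}
after = {"cluster": {"processes": {
    "a": {"address": "10.0.0.1:4500", "excluded": True, "roles": [{"role": "log"}]},
}}}

print(roles_gained_by_excluded(before, after))
# {'10.0.0.1:4500': ['log']}
```

A non-empty result is the behaviour described above: a role recruited onto a process that was already excluded and previously held no roles.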
What gives? I thought excluding a process was supposed to shuffle all data owned by that process (which, because this is a very lightly loaded cluster, is basically 0 outside of the storage processes) across to another one via a rebalance operation, then drop any roles from the process and not assign it new ones. If that’s not the case, what should we be doing to safely migrate these roles to machines we’re not about to shut down?
N.B. I know the coordinator role is different and we have to manually migrate that. I’m worried about the others.