Excluding non-storage processes (on FDB 7.1.23)

So, we’re updating the AMI our FoundationDB cluster runs on. We want to do this as hitlessly as possible. Previously, we’ve only ever done it in a… ‘hitful’ way: replacing all the machines and having the new ones come up and mount the volume and associated network interface (and thus IP) of their predecessor.

What has worked for us when we’ve made changes to storage class machines is to use the exclude command to exclude the machineid of each of the machines we want to shut down, wait for some time until all the data has been shuffled off them and onto the new storage processes on the new storage class machines (at which point the excluded processes also drop the storage role in their status), then shut down the ‘old’ machines, and delete them and their disks.
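Roughly, the fdbcli steps we run for that look like the sketch below (the instance ID here is just an illustrative placeholder, not one of our machines):

```
# In fdbcli: exclude everything on the old storage machine by its machine ID.
exclude locality_machineid:i-0123456789abcdef0

# Watch the cluster until the excluded processes no longer report a storage
# role and data distribution has finished moving shards off them.
status details

# Once the roles are gone: stop fdbmonitor on the old machine, terminate the
# instance, and delete its volumes.
```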

This worked fine, for storage class machines. During the data move the cluster reported itself as ‘Healing’, but was still responding to requests. Again, this was as expected.

When we did the same for machines running processes assigned the log or stateless classes, not only did their processes not drop their associated roles (and have processes on non-excluded machines pick them up), but if we forced a role to migrate by shutting down the fdbmonitor (and thus also the managed fdbserver) process on that machine, FDB frequently chose already-excluded processes, which had no assigned roles, as the processes to run the next generation of that role. The cluster is listed as ‘Healthy’ throughout this migration attempt, so either it doesn’t think it needs to heal, or the healing process completes so quickly that we don’t see it.

What gives? I thought excluding a process was supposed to shuffle all data (which, because this is a very lightly loaded cluster, is basically 0 outside of the storage processes) owned by that process across to another one via a rebalance operation, and then drop any roles from the process and not assign new ones? If that’s not the case, what should we be doing to safely migrate these roles to machines we’re not about to shut down?

N.B. I know the coordinator role is different and we have to manually migrate that. I’m worried about the others.

What has worked for us when we’ve made changes to storage class machines is to use the exclude command to exclude the machineid of each of the machines we want to shut down, wait for some time until all the data has been shuffled off them and onto the new storage processes on the new storage class machines (at which point the excluded processes also drop the storage role in their status), then shut down the ‘old’ machines, and delete them and their disks.

If you only want to update the AMI and keep the data on the associated EBS volumes you might want to look into the Maintenance mode · apple/foundationdb Wiki · GitHub instead of doing an exclusion and a migration to a new EC2 instance. That would allow you to run with reduced fault tolerance for a “short” time (probably a few minutes) until the new EC2 instance is up and running again.
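A rough sketch of how that could look in fdbcli (the zone ID and timeout here are placeholders; the zone ID needs to match the locality zone of the machine being replaced):

```
# In fdbcli: stop data distribution from re-replicating away from this zone
# for the given number of seconds while the instance is swapped out.
maintenance on <zone-id> 600

# ... replace the EC2 instance, re-attach the EBS volume and the network
#     interface, start fdbmonitor again ...

# Clear maintenance mode once the new instance has rejoined.
maintenance off
```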

When we did the same for machines running processes assigned the log or stateless classes, not only did their processes not drop their associated roles (and have processes on non-excluded machines pick them up), but if we forced a role to migrate by shutting down the fdbmonitor (and thus also the managed fdbserver) process on that machine, FDB frequently chose already-excluded processes, which had no assigned roles, as the processes to run the next generation of that role. The cluster is listed as ‘Healthy’ throughout this migration attempt, so either it doesn’t think it needs to heal, or the healing process completes so quickly that we don’t see it.

I haven’t seen a scenario where an excluded process picks up an active role in a new generation; if you have a minimal reproducible scenario that would be helpful. When a stateless process is excluded, the exclusion normally takes effect almost instantly (since there is no state to replicate and a new recovery will be triggered). For log processes it’s almost the same: there will be a recovery and the log process will not be part of the new generation, but FDB will ensure that all mutations (logs) are persisted before the log role is dropped. If your cluster doesn’t have a high write load, that should only take a few seconds. I would be surprised if the role is not dropped, since we do testing with the FDB Kubernetes operator for this version and a deletion actually checks that the role has been dropped: https://github.com/FoundationDB/fdb-kubernetes-operator/blob/main/fdbclient/admin_client.go#L472-L474. If the role still exists, that could mean either that the process was not excluded or, in the case of a log process, that the process still has mutations that are not yet persisted in the storage subsystem.
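If it helps, one quick way to check that from the command line is something like this (a rough sketch, assuming jq is available and the usual status json layout):

```
# List excluded processes together with any roles they still hold; an empty
# "roles" array means the role has been dropped.
fdbcli --exec 'status json' | jq '
  .cluster.processes
  | to_entries[]
  | select(.value.excluded == true)
  | {address: .value.address, class: .value.class_type, roles: [.value.roles[].role]}'
```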

What gives? I thought excluding a process was supposed to shuffle all data (which, because this is a very lightly loaded cluster, is basically 0 outside of the storage processes) owned by that process across to another one via a rebalance operation, and then drop any roles from the process and not assign new ones? If that’s not the case, what should we be doing to safely migrate these roles to machines we’re not about to shut down?

That’s the expected behaviour. Like I said, if you could share a reproducible example that would be helpful; so far I haven’t seen this behaviour, which doesn’t mean it doesn’t exist.

As a side note: the exclude command performs only a very limited set of checks before doing the exclusion, so you want to ensure that you have enough processes in your cluster for your configuration; see: New safety checks on exclusions that could take a database unavailable · Issue #1292 · apple/foundationdb · GitHub.
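A cheap sanity check before excluding (again a sketch, assuming jq and the usual status json fields) is to look at the reported fault tolerance and the process count per class first:

```
# Show remaining fault tolerance and how many processes of each class exist,
# so you can see whether the cluster can afford to lose the excluded ones.
fdbcli --exec 'status json' | jq '{
  fault_tolerance: .cluster.fault_tolerance,
  processes_by_class: (.cluster.processes
    | [.[].class_type]
    | group_by(.)
    | map({(.[0]): length})
    | add)
}'
```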

If you only want to update the AMI and keep the data on the associated EBS volumes you might want to look into the Maintenance mode · apple/foundationdb Wiki · GitHub

Interesting :). We do, however, have a load of tooling already built around the existing system, which also means we don’t need separate processes for switching out the AMI vs swapping the disks to new ones that are larger/have more IOPS/whatever. We’re happy to take the hit on time/cost for that consistency, for now, because our data volumes are going to be small enough for a few years yet that it’s not a major issue. We’re also intending to run a bank on this once we get approval, so people are going to be leery of anything which lowers fault tolerance, even for a short while. Once we’re at the point of it taking days to replicate everything, we may well rework things :wink:

if you have a minimal reproducible scenario that would be helpful

Annoyingly, this is the first time we’ve seen it. We’ll try and work up a reproduction today (we’re in the UK, so will be running on GMT).

As an initial reference, we’re running in three_data_hall mode. We’re spread across the eu-west-2 region, with the intention of being able to handle an entire AZ going offline and still being able to do rolling updates if necessary by taking a single machine at a time out across the remaining 2 AZs.

As such, our ‘standard’ setup is:

  • 9 ‘coordinator’ class machines (3 per AZ) which we manually configure as… the coordinators
  • 3 ‘stateless’ class machines (2 processes per machine, 1 machine per AZ)
  • 9 ‘log’ class machines (3x3 again, of which the default 4 are in use with role ‘log’ at any one time, but it allows FDB to roll to an entire new log generation within the existing collection of machines under normal circumstances)
  • 6 ‘storage’ class machines (2 per AZ).

We have a startup script which queries the tags on an instance to know what class it is, attaches and mounts a previously-unattached volume of that class in the same AZ, and uses the tags on the volume to locate the associated network interface (and thus IP), which it also attaches, before writing the FDB config file with

  • locality-machineid as the AWS instance ID (i-0e6f3e2c231f77558)
  • locality-data-hall as the AZ (eu-west-2a)
  • locality-dcid as the region (eu-west-2)

before finally starting fdbmonitor via a systemd init script.
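For context, the generated config ends up looking roughly like this (a trimmed, illustrative sketch; the ports and paths are our defaults and the class obviously varies per machine):

```
[fdbmonitor]
user = foundationdb

[general]
cluster-file = /etc/foundationdb/fdb.cluster

[fdbserver]
command = /usr/sbin/fdbserver
public-address = auto:$ID
listen-address = public
datadir = /var/lib/foundationdb/data/$ID
logdir = /var/log/foundationdb
class = storage
locality-machineid = i-0e6f3e2c231f77558
locality-data-hall = eu-west-2a
locality-dcid = eu-west-2

[fdbserver.4500]
```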

What we’ve done with our rolling update is double the number of running instances for each class (via a new set of ASGs for each), so 18 coordinators, 6 stateless (12 processes), 18 logs, 12 storage. We’ve not touched the coordinators yet, so the original 9 are still running as them and the new 9 should just be sat there doing nothing, but for stateless, log, and storage we excluded locality_machineid:<AWS instance ID> for all of the old instances. They’re listed as excluded in status json. The storage processes migrated their data over to the new, non-excluded instances, and the old ones are now safe to shut down. The log and stateless processes didn’t change.

If any of that sounds like a wrong/broken setup on a first pass, just in general but especially as a potential cause of our issues, it’d be really good to know :slight_smile:

Ok, that last message sat unsent for a while whilst we did some debugging in our own codebase for other things, but we’ve done some more digging.

My colleague tried excluding the nodes directly, via IP/port instead of a ‘locality’ statement, and the roles were immediately moved as you’d expect, for both log and stateless class machines. But status json definitely listed them as excluded after our previous exclude locality_machineid:XXXX statements. Is there potentially some bug in the way localities are matched up to processes? But if so, why did it work fine for our storage processes…
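Concretely, the two forms we tried look like this (the addresses and instance ID here are placeholders):

```
# Excluding by locality: the processes show up as excluded in status json,
# but the log/stateless roles stayed where they were.
exclude locality_machineid:i-0123456789abcdef0

# Excluding the same processes by address: the roles moved straight away,
# as expected.
exclude 10.0.1.23:4500 10.0.1.23:4501
```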

This also raises concerns over our plans to define a new locality that would allow us to exclude an entire ‘generation’ (we’ve taken FDB’s nomenclature for roles/processes and extended it to our definition for the ASGs) of processes all at once, rather than having to look up the AWS instance ID (or IP/port). We know it wouldn’t be paid attention to for data replication/duplication, which is fine/good, but (we thought) it would allow us to exclude locality_classgeneration:log2 (for example). But if the base machineid has issues…
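For reference, what we had in mind is roughly the sketch below (untested, and assuming fdbserver accepts arbitrary locality-* keys in the config the same way it does the built-in ones):

```
# foundationdb.conf on every machine in a given 'generation' (sketch):
locality-classgeneration = log2
```

and then, in theory, a single fdbcli command to drain the lot:

```
exclude locality_classgeneration:log2
```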

(As said colleague) we also noticed that exclusions with locality_* do not block until complete, unlike exclusions by HOST:PORT.

As a follow-up: we’ve been able to replicate this with three nodes, no class set.

If we exclude by locality_* then we observe the behaviour where the process is marked as excluded in status json, but it doesn’t actually shift any roles. Whereas if we exclude by IP:PORT, it behaves as expected.

I’ve shoved up a gist with three status jsons if it’s any help; happy to shift this to an issue on GitHub if it’s more useful.

My colleague tried excluding the nodes directly, via IP/port instead of a ‘locality’ statement, and the roles were immediately moved as you’d expect, for both log and stateless class machines. But status json definitely listed them as excluded after our previous exclude locality_machineid:XXXX statements. Is there potentially some bug in the way localities are matched up to processes? But if so, why did it work fine for our storage processes…

Using the locality might be the issue. Our current tests only use IP:port pairs to exclude machines. I will create a test case using localities to see if we observe the same behaviour. In general that sounds like a bug in the locality setup for exclusions.

Thanks for reporting!