Upgrading FDB without downtime

We are trying to deploy FDB on EC2 hosts.

The documentation on upgrading says:

you need to upgrade all of the processes at once, because the old and new processes will be unable to communicate with each other.

However, that is impossible to accomplish across multiple hosts at exactly the same moment, and even the documentation only describes the upgrade process for a single host. Does that mean it is impossible to avoid downtime when upgrading a production FDB cluster, which will definitely span multiple hosts?

Here is a gist from my notes. Hope it helps.

I would recommend considering the fdb-kubernetes-operator to manage the lifecycle of the cluster.

We are not using Kubernetes, so the operator is not an option for us.

Regarding your gist:

I am not sure I understand it fully. Does it suggest that all processes should be killed via `fdbcli --exec kill` and then restarted? Won't that result in downtime?

Maybe FDB core developers can provide a better answer.

But from what I understand, there will be a small impact on client latency (FDB uses a fat client), and in-flight transactions will be retried. It is still your responsibility to ensure that your transactions are idempotent.
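The idempotency point can be sketched with plain Python dicts. This is not the FDB client API, just an illustration of the pattern: record a unique transaction id in the same atomic step that applies the write, so a retry after an ambiguous commit result becomes a no-op instead of a double-applied effect.

```python
def transfer_idempotent(store: dict, committed_ids: set, from_acct: str,
                        to_acct: str, amount: int, txn_id: str) -> None:
    """Illustrative sketch (plain dicts, not the FDB API): a transaction
    that is safe to retry, because the unique txn_id is recorded in the
    same atomic step that applies the transfer."""
    if txn_id in committed_ids:
        # A retry of a transaction that already committed: do nothing.
        return
    store[from_acct] -= amount      # the effect...
    store[to_acct] += amount
    committed_ids.add(txn_id)       # ...and the marker, applied together
```

Running the same transaction twice with the same id (simulating a client retry after an ambiguous result) moves the money exactly once.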

At a high level, from what I understand, FDB has a single recovery path (used when individual components fail) and a single upgrade path, both of which are extensively tested using simulation testing.

Yes, this is correct.

Yes, but if done well, the downtime should be short (1-5 seconds).

Not supporting rolling upgrades has been a deliberate design choice: it drastically reduces the testing surface (and there are a few other benefits).

If you write your own operations, the burden of making sure this will be fast is on you. This is roughly how we do it:

  1. Make sure the new client version is installed on all clients (in addition to the old one; you want the application to be able to talk to both versions of FDB so it can fail over – FDB's multi-version client will take care of this for you).
  2. Install the new version of FDB on all machines (we use the versioned RPM, but you could also just copy the fdbserver binaries over).
  3. Either change the foundationdb.conf file to point to the new binary, or change the symlink to the binary on all machines (depending on how it is installed). If you change foundationdb.conf, you also need to set an option so that fdbmonitor doesn't automatically restart processes when the file changes.
  4. Run `kill; kill all` in fdbcli. The fdbserver processes on all machines will kill themselves, and fdbmonitor will immediately restart them.
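Step 3 can be sketched as a small Python helper that rewrites a foundationdb.conf in memory. The `command =` line format, the `[general]` section, and the `kill_on_configuration_change` option name are assumptions based on stock fdbmonitor configs, so verify them against your install before using anything like this:

```python
import re

def point_conf_at_new_binary(conf_text: str, new_binary: str) -> str:
    """Sketch for step 3: point every fdbserver `command = ...` line in a
    foundationdb.conf at the new binary, and disable automatic restarts on
    config change so fdbmonitor waits for the coordinated bounce.

    Assumptions: stock conf layout with a [general] section and the
    fdbmonitor option `kill_on_configuration_change` -- check your install.
    """
    # Rewrite every `command = ...fdbserver...` line to the new binary path.
    conf_text = re.sub(
        r"(?m)^(command\s*=\s*).*fdbserver.*$",
        lambda m: m.group(1) + new_binary,
        conf_text,
    )
    # Make sure fdbmonitor does not bounce processes when the file changes.
    if "kill_on_configuration_change" in conf_text:
        conf_text = re.sub(
            r"(?m)^kill_on_configuration_change\s*=.*$",
            "kill_on_configuration_change = false",
            conf_text,
        )
    else:
        # Insert the option under [general]; if that section is missing,
        # this sketch leaves the conf unchanged.
        conf_text = conf_text.replace(
            "[general]", "[general]\nkill_on_configuration_change = false", 1
        )
    return conf_text
```

After writing the rewritten conf to all machines, the actual bounce is still the single `kill; kill all` in fdbcli from step 4.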

If you do the above correctly, the downtime should be very short. Obviously, the processes won’t restart all at exactly the same time (as you pointed out, this is impossible to achieve). But that is fine: as soon as a majority of the processes run the new version, the old processes won’t be able to rejoin the cluster until they also get bounced.

You might want to include some automation here to also check the cluster health (check whether `kill` in fdbcli returns all the processes you expect to see, whether the cluster reports healthy, etc.).
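A minimal sketch of such a check, operating on the output of `fdbcli --exec "status json"`. The field names used below (`client.database_status.healthy`, per-process `version` under `cluster.processes`) match status json documents I have seen, but treat them as assumptions and verify against your FDB version:

```python
import json

def cluster_ready(status_json: str, expected_version: str,
                  expected_count: int) -> bool:
    """Sketch: decide whether a post-bounce cluster looks healthy, given the
    raw string output of `fdbcli --exec "status json"`.

    Field names are assumptions -- verify them for your FDB version."""
    status = json.loads(status_json)
    healthy = (status.get("client", {})
                     .get("database_status", {})
                     .get("healthy", False))
    processes = status.get("cluster", {}).get("processes", {})
    # Every expected process should be back and reporting the new version.
    versions_ok = (
        len(processes) == expected_count
        and all(p.get("version") == expected_version
                for p in processes.values())
    )
    return bool(healthy) and versions_ok
```

An upgrade script would poll this until it returns true (with a timeout), and page an operator instead of proceeding if it never does.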
