Upgrading FDB without downtime

We are trying to deploy FDB on EC2 hosts.

The documentation on upgrading says:

you need to upgrade all of the processes at once, because the old and new processes will be unable to communicate with each other.

However, that is impossible to accomplish across multiple hosts at exactly the same moment, and even the documentation only describes the upgrade process for a single host. Does that mean it is impossible to avoid downtime when upgrading a production FDB cluster, which will definitely span multiple hosts?

Here is a gist from my notes. Hope it helps.

I would recommend considering the fdb-kubernetes-operator to manage the lifecycle of the cluster.

We are not using Kubernetes, so the operator is not an option for us.

Regarding your gist:

I am not sure I understand it fully. Does it suggest that all processes should be killed via `fdbcli --exec kill` and then restarted? Won't that result in downtime?

Maybe FDB core developers can provide a better answer.

But from what I understand, there will be a small impact on client latency (FDB uses a fat client), and in-flight transactions will be retried. It is still your responsibility to ensure that your transactions are idempotent.
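The idempotency point can be sketched with plain Python dicts. This is not the FDB client API, just an illustration of the pattern: record a unique transaction id in the same atomic step that applies the write, so a retry after an ambiguous commit result becomes a no-op instead of a double-applied effect.

```python
def transfer_idempotent(store: dict, committed_ids: set, from_acct: str,
                        to_acct: str, amount: int, txn_id: str) -> None:
    """Illustrative sketch (plain dicts, not the FDB API): a transaction
    that is safe to retry, because the unique txn_id is recorded in the
    same atomic step that applies the transfer."""
    if txn_id in committed_ids:
        # A retry of a transaction that already committed: do nothing.
        return
    store[from_acct] -= amount      # the effect...
    store[to_acct] += amount
    committed_ids.add(txn_id)       # ...and the marker, applied together
```

Running the same transaction twice with the same id (simulating a client retry after an ambiguous result) moves the money exactly once.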

At a high level, from what I understand, FDB has a single recovery path (used when individual components fail) and a single upgrade path, both of which are extensively tested using simulation testing.

Yes, this is correct.

Yes, but if done well, the downtime should be short (1-5 seconds).

Not supporting rolling upgrades has been a deliberate design choice: it drastically reduces the testing surface (and there are a few other benefits).

If you write your own operations, the burden of making sure this will be fast is on you. This is roughly how we do it:

  1. Make sure the new client version is installed on all clients (in addition to the old one; you want the application to be able to talk to both versions of FDB so it can fail over – FDB's multi-version client will take care of this for you).
  2. Install the new version of FDB on all machines (we use the versioned RPM, but you could also just copy the fdbserver binaries over).
  3. Either change the foundationdb.conf file to point to the new binary, or change the symlink to the binary on all machines (depending on how it is installed). If you change foundationdb.conf, you also need to set an option so that fdbmonitor doesn't automatically restart processes when the file changes.
  4. Run `kill; kill all` in fdbcli. The fdbserver processes on all machines will kill themselves, and fdbmonitor will immediately restart them.
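Step 3 can be sketched as a small Python helper that rewrites a foundationdb.conf in memory. The `command =` line format, the `[general]` section, and the `kill_on_configuration_change` option name are assumptions based on stock fdbmonitor configs, so verify them against your install before using anything like this:

```python
import re

def point_conf_at_new_binary(conf_text: str, new_binary: str) -> str:
    """Sketch for step 3: point every fdbserver `command = ...` line in a
    foundationdb.conf at the new binary, and disable automatic restarts on
    config change so fdbmonitor waits for the coordinated bounce.

    Assumptions: stock conf layout with a [general] section and the
    fdbmonitor option `kill_on_configuration_change` -- check your install.
    """
    # Rewrite every `command = ...fdbserver...` line to the new binary path.
    conf_text = re.sub(
        r"(?m)^(command\s*=\s*).*fdbserver.*$",
        lambda m: m.group(1) + new_binary,
        conf_text,
    )
    # Make sure fdbmonitor does not bounce processes when the file changes.
    if "kill_on_configuration_change" in conf_text:
        conf_text = re.sub(
            r"(?m)^kill_on_configuration_change\s*=.*$",
            "kill_on_configuration_change = false",
            conf_text,
        )
    else:
        # Insert the option under [general]; if that section is missing,
        # this sketch leaves the conf unchanged.
        conf_text = conf_text.replace(
            "[general]", "[general]\nkill_on_configuration_change = false", 1
        )
    return conf_text
```

After writing the rewritten conf to all machines, the actual bounce is still the single `kill; kill all` in fdbcli from step 4.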

If you do the above correctly, the downtime should be very short. Obviously, the processes won’t restart all at exactly the same time (as you pointed out, this is impossible to achieve). But that is fine: as soon as a majority of the processes run the new version, the old processes won’t be able to rejoin the cluster until they also get bounced.

You might want to include some automation here to also check the cluster health (check whether `kill` in fdbcli returns all the processes you expect to see, whether the cluster reports healthy, etc.).
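A minimal sketch of such a check, operating on the output of `fdbcli --exec "status json"`. The field names used below (`client.database_status.healthy`, per-process `version` under `cluster.processes`) match status json documents I have seen, but treat them as assumptions and verify against your FDB version:

```python
import json

def cluster_ready(status_json: str, expected_version: str,
                  expected_count: int) -> bool:
    """Sketch: decide whether a post-bounce cluster looks healthy, given the
    raw string output of `fdbcli --exec "status json"`.

    Field names are assumptions -- verify them for your FDB version."""
    status = json.loads(status_json)
    healthy = (status.get("client", {})
                     .get("database_status", {})
                     .get("healthy", False))
    processes = status.get("cluster", {}).get("processes", {})
    # Every expected process should be back and reporting the new version.
    versions_ok = (
        len(processes) == expected_count
        and all(p.get("version") == expected_version
                for p in processes.values())
    )
    return bool(healthy) and versions_ok
```

An upgrade script would poll this until it returns true (with a timeout), and page an operator instead of proceeding if it never does.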
