End-to-end upgrade process including client & server

I could do with an ELI5 (Explain it Like I’m 5 years old) explanation on upgrading a globally distributed FDB cluster,
not a large one, but a redundant one, encompassing both the server side and the client side.

The specific situation is an upgrade from 7.1.29 to 7.3.33, but it is the general process that interests me. I’ve read all the forum posts, docs, notes, and gists, and it’s still not clear to me.

There is a lot of discussion around server-side upgrades, which I can summarise as:

  • do lots of prior testing
  • deploy the new binaries everywhere all at once
  • bounce all the servers very quickly, especially across multiple regions
  • update your backups ASAP too
  • hope that this works just fine and there is no fallout

What am I supposed to do about clients? My understanding is that there are two “protocols” in play:

  • the API version, such that a client can request earlier version compatibility
  • the protocol version (the network wire protocol, which fdbcli also speaks), which must match the servers exactly
  • the C client is apparently capable of choosing the .so dynamically to match the server, if multiple versions are installed
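For what it’s worth, the multiversion client mechanism in that last bullet can be driven from code (network options) or from the environment. A minimal staging sketch, assuming the `FDB_NETWORK_OPTION_EXTERNAL_CLIENT_DIRECTORY` environment variable (the env-var form of the `EXTERNAL_CLIENT_DIRECTORY` network option) and illustrative paths; the `touch`ed files stand in for real `.so` files:

```shell
# Stage both client libraries in one directory that the multiversion
# client scans for external clients (paths are illustrative):
mkdir -p /tmp/fdb-ext-clients
touch /tmp/fdb-ext-clients/libfdb_c.so.7.1.29
touch /tmp/fdb-ext-clients/libfdb_c.so.7.3.33

# Point the client bindings at that directory via the environment:
export FDB_NETWORK_OPTION_EXTERNAL_CLIENT_DIRECTORY=/tmp/fdb-ext-clients
ls "$FDB_NETWORK_OPTION_EXTERNAL_CLIENT_DIRECTORY"
```

At connect time the client then picks whichever library speaks the cluster’s protocol version, which is what makes the “stage the new library first, restart clients, upgrade servers later” ordering work.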

After reading all this I do not understand how I should approach upgrading
the overall infrastructure - servers & clients.

It seems I should:

  1. have packages for both old & new FDB versions for client & server available
  2. on all clients, deploy the new libfdb_c.so.7.3.33 alongside the existing libfdb_c.so.7.1.29 so that the client can pick the appropriate library as needed, then restart them; they will continue using the older version initially
  3. on all servers, deploy the new binaries, bounce all servers as fast as practical
  4. over time, upgrade clients to the new API version, to match the server version

Have I got this right?

Are there any tricks to simplifying this, or reducing the risk, or allowing more time during the upgrade?

Overall, this process feels very fragile, with many points where a large server and client fleet must be upgraded in lockstep, restarting basically everything, within a very short window.

I am not sure that, operationally, I could run such an upgrade in a way that can be rolled back if required without significant downtime.

What I would prefer:

  • older clients (6.x and 7.x) can seamlessly connect to a newer server
  • cluster upgrades should be incrementally possible

Is that possible, somehow? Does that change for a multi-region setup?

[edited]

References & Links

Overall process for upgrade from N to N+1:

  1. Deploy a new version of your clients that links directly against the version-N library, with N+1 configured as an external client library.
  2. Go to all your servers and replace the binary of version N with N+1, or replace the config to point to /path/to/N+1 for fdbserver, or adjust a symlink that fdbmonitor uses. Whichever way you wish to structure staging the new binaries in place.
  3. fdbcli> kill; kill all
  4. Servers are all instructed to kill themselves. fdbmonitor restarts them as the new version. Clients all connect under the new version.
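Step 2 is often done with a symlink flip, so that staging the binaries and restarting the cluster stay decoupled. A filesystem-only sketch with illustrative paths (the fdbcli step is shown as a comment, since it needs a live cluster):

```shell
# Lay out versioned binary directories (dummy files stand in for real binaries):
mkdir -p /tmp/fdb-bins/7.1.29 /tmp/fdb-bins/7.3.33
touch /tmp/fdb-bins/7.1.29/fdbserver /tmp/fdb-bins/7.3.33/fdbserver

# fdbmonitor's config points at the "current" symlink:
ln -sfn /tmp/fdb-bins/7.1.29 /tmp/fdb-bins/current

# Stage the upgrade by flipping the symlink; nothing restarts yet:
ln -sfn /tmp/fdb-bins/7.3.33 /tmp/fdb-bins/current
readlink /tmp/fdb-bins/current

# Once every machine is staged, trigger the coordinated restart:
#   fdbcli --exec 'kill; kill all'
```

The point of the layout is that you can stage N+1 across the whole fleet at whatever pace you like; only the final kill; kill all is time-critical.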

Using kill; kill all to synchronize the global restart into the new version is a perhaps under-documented but pervasively used idiom.

You can have as many external client library versions around as you’d like, but there is no incremental upgrade. There is no cross protocol version support, neither from client to server, nor from server to server, so everything goes from N to N+1 all in one step. The whole “upgrade one member of the quorum before upgrading the rest” sort of incremental rollout is not a supported thing. This is no different in multi-region setups: prepare the clients, stage the binaries everywhere, kill; kill all, and the whole multi-region cluster upgrades.

You are allowed to downgrade, but only to the most recent patch release of the previous minor version. You may upgrade from 7.1.0 to 7.3.33, but you may only downgrade to 7.1.59. (because I guess 7.2 doesn’t exist?)

You are never required to change the API version in your code, but you aren’t allowed to use an internal client library (the version that’s directly linked to and installed as libfdb_c.so) that’s a lower version than the API version you define. I think I’ve seen random scripts or programs that are still using API versions that predate the foundationdb acquisition against recent FDB client libraries, and they run fine.


Thanks @alexmiller, that’s really helpful.

In practice, have the server upgrades gone without issues? It’s a very different approach to what I’ve done in the past, and I’m nervous about such a big Flag Day.

It’s been a while since I’ve talked with FDB users, but I don’t have any memory of community users having complaints about an upgrade going poorly. I don’t actually recall hearing about many of them doing upgrades at all, and questions posted to the forums would sometimes reveal that some users were still on notably old versions and were suggested to upgrade. Quickly skimming over all the forum posts referencing upgrades, the only direct commentary on issues found from an upgrade were by Wavefront folk, who both have significant production use, and use parts of FoundationDB (the memory storage engine) that no one else does afaik.

Releases are moved out of pre-release status once they’ve made it through Apple, etc. production stably, which means any performance or latency issues are very likely to have been already ironed out. Upgrades tend to go well enough that downgrade support didn’t even exist until FDB 6.3 :wink:

I’m aware that “upgrade everything all at once” is not the standard SRE approved way of doing upgrades, but there isn’t really a sensible way to do rolling upgrades without quorums and a highly partitioned architecture anyway. If you get the process down of staging the new binaries and kill; kill all-ing the cluster into a new version on some pre-production/test cluster, I expect that you’ll be fine.

Thank you for your reply.

Talking about replacing fdb binaries, it is common practice to use Linux package managers (dnf or apt) for this task. Unfortunately, when I do dnf update foundationdb-server foundationdb-clients, the .rpm scriptlet restarts the server process without any synchronisation with the overall upgrade process.

Talking about upgrading clients, there is a foundationdb-clients-versioned package that allows multiple versions to be installed simultaneously.

So the ideal upgrade plan for N to N+1 would be the following:

  1. install foundationdb-clients-versioned packages of the N+1 version to all client machines
  2. update foundationdb-{clients,server} packages on all server machines without restarting (it is not possible now without a small modification of the scriptlet)
  3. fdbcli> kill; kill all
  4. all clients reconnect automatically with the new library versions
  5. remove foundationdb-clients-versioned packages of the N version from all client machines
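That plan could be sketched as a dry-run script. This assumes the package names used in this thread and dnf’s tsflags=noscripts transaction flag to skip the RPM scriptlets (so the package swap does not restart the server); verify against your actual package’s scriptlets before relying on it:

```shell
# Print the commands for each stage instead of running them; execute
# them stage by stage, on the machines indicated, in practice.
upgrade_plan() {
  echo "dnf install -y foundationdb-clients-versioned-7.3.33"                               # 1. all client machines
  echo "dnf update -y --setopt=tsflags=noscripts foundationdb-clients foundationdb-server"  # 2. all server machines, no restart
  echo "fdbcli --exec 'kill; kill all; sleep 5; status'"                                    # 3. once, from anywhere
  echo "dnf remove -y foundationdb-clients-versioned-7.1.29"                                # 5. cleanup, later
}
upgrade_plan
```

Step 4 has no command: once the servers come back as 7.3.33, the multiversion clients reconnect with the new library on their own.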

I’m not sure if you saw this documentation from the FDB Kubernetes operator: fdb-kubernetes-operator/docs/manual/upgrades.md at main · FoundationDB/fdb-kubernetes-operator · GitHub. Even though the docs are for the FDB operator most, if not all, of the steps are applicable for non-Kubernetes installations.

Just a note on the fdbcli kill; kill all call. From an interactive session this approach is fine; if not, you should add a sleep step, like fdbcli kill; kill all; sleep 5; status, otherwise the fdbcli process can exit before the kill request has been sent to all processes, e.g.: fdb-kubernetes-operator/fdbclient/admin_client.go at main · FoundationDB/fdb-kubernetes-operator · GitHub.