New API to get the version of a running cluster without needing a matching libfdb_c library

Following the discussion in How do bindings get an appropriate copy of `fdb_c`?, it looks like it would be a great help if bindings and/or custom deployment scripts could detect the version of a currently running fdb cluster without requiring the appropriate version of libfdb_c to already be installed.

The goal would be to detect the current version (let’s say 5.1.7 currently), then obtain the corresponding libfdb_c library, either by downloading it from a private repo or directly from foundationdb.org, and then we would be in the expected situation of having a matching libfdb_c version installed for bindings to use.

Currently, you need an already existing compatible version of libfdb_c (so for our example, at least a 5.1.x version) to bootstrap the process.

For custom scripts that must deploy an application without previous knowledge of the running cluster version, this can be tricky.

For example, today I create a custom installation script that targets fdb 5.1.x, ship a version of my binding that supports at least API version 510, and include libfdb_c_5.1.7.dll in the setup. Assuming I also instruct the sysadmins to set up a 5.1.x cluster for the application to use, this will work fine.

But let’s say that in 2 or 3 years the fdb cluster has been upgraded to something like v7.1.x, and all of a sudden a major event wipes the application servers and they must be reinstalled using the initial setup script (or maybe they move to a new datacenter, a new OS version, etc…). During all that, the fdb cluster is still running fine and will stay at version 7.1.x.

The custom script will reinstall libfdb_c_5.1.7.dll, but then it will not be able to connect to the cluster, and the error messages will probably not give me the actual cluster version. So the customer will have to get in contact and request an updated version of the installation script. Or, if they know about the problem, they would need to customize the install package themselves to replace the libfdb_c after install (not easy to do with MSI or custom package systems).

An ideal solution would be for all (future) versions of libfdb_c to include an additional API that offers a very simple way to get the current version of a cluster:

  • the custom script could load any version of libfdb_c it has, call this method, and get the current version.
  • if the version matches the currently loaded libfdb_c, then proceed with the rest of the installation as planned.
  • if the version does not match:
    • either abort with an explicit error message: "You need to download version x.y from foobar.org and place it in this folder / edit this file / etc..."
    • or download the library itself from any source, replace the library with the appropriate one, and then continue the install.

As a side bonus, if a binding loads the currently deployed libfdb_c library and starts getting errors about the version not matching the cluster, it could call the same API and display a more useful error message: "Your libfdb_c library found at /path/to/somewhere is version X.x but the cluster is at version Y.y, which is not compatible. You need to update the library and restart the process."

Notes:

  • I don’t think we can rely on using fdbcli because if we have a running fdbcli installed locally, then it means we already have the client binaries installed on the app server, and here we are talking about custom deployment scenarios.
  • the library would need access to a valid fdb.cluster file, which could be in a custom path, so the API would need to take the path to the cluster file as a parameter.
  • I’m not sure if the cluster name is still required to be “DB”, but if it can change in the future, the API should also ask for this.

Something like this?

int fdb_get_cluster_version(char* cluster_file, char* db_name)

I’m not sure if it should return the current version in the same format as the API version, so 517, or maybe a more traditional version string like 5.1.7? It should also be able to return an error code signaling that the version of the cluster is so different that it does not even support this request anymore.
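To make the bootstrap flow concrete, here is a minimal sketch of how an installer could use such an API through whichever libfdb_c it happens to ship. The loading code is standard POSIX dlopen (on Windows, LoadLibrary/GetProcAddress would play the same role); fdb_get_cluster_version is the hypothetical API proposed above, and I’m assuming here that it returns the version as a non-negative integer (517 for 5.1.7) and a negative value on error:

#include <dlfcn.h>
#include <stdio.h>

/* Matches the hypothetical signature proposed above. */
typedef int (*get_cluster_version_fn)(char* cluster_file, char* db_name);

int main(void) {
    /* Load whichever libfdb_c the installer happens to ship. */
    void* lib = dlopen("./libfdb_c.so", RTLD_NOW);
    if (!lib) {
        fprintf(stderr, "cannot load libfdb_c: %s\n", dlerror());
        return 1;
    }

    get_cluster_version_fn get_version =
        (get_cluster_version_fn)dlsym(lib, "fdb_get_cluster_version");
    if (!get_version) {
        fprintf(stderr, "this libfdb_c predates fdb_get_cluster_version\n");
        return 1;
    }

    /* Assumption: negative means error, otherwise the cluster version. */
    int version = get_version("/etc/foundationdb/fdb.cluster", "DB");
    if (version < 0) {
        fprintf(stderr, "could not determine the cluster version\n");
        return 1;
    }

    printf("cluster is at version %d\n", version); /* e.g. 517 for 5.1.7 */
    dlclose(lib);
    return 0;
}

The installer could then compare this number to the version of the libfdb_c it shipped and decide whether to proceed, abort with an explicit message, or download a matching library.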


One thing to note is that the industry is quickly moving toward deploying with immutable containers (i.e. Docker, Kubernetes, etc.). With an immutable container deploy model, the situation you describe (where an admin moves datacenters using an initial deploy script) is no longer a concern, since upgrading FDB client libraries involves publishing new containers with new client libs rather than mutating the state of a running image. When an admin moves datacenters using containers, they will likely just tell the cluster manager to drain from datacenter A into datacenter B (or redeploy all containers from A into B using the current/latest versions). In the immutable container world, downloading DLLs would be considered a “bad” thing.

All that being said, the rest of the proposal sounds great (including offering an error to the user for mismatched client/server versions) :slight_smile:

You could also deploy alongside your app a folder with one version of libfdb_c for each minor version of FDB and use a mechanism like this to choose the right one from among those options. In some sense, this is what the multi-version client is already doing, but this would let you, say, load the most recent libfdb_c as the primary and, as multi-version clients, one copy of every minor version at least as new as the server version. This makes you ready for no-downtime upgrades of whatever cluster you are connecting to without having extraneous external clients loaded, which can introduce extra traffic (though I think on release-5.2 there is a change so that versions as new as 5.2 won’t be as expensive to load as incompatible secondary clients, so maybe that’s not such a big deal). This isn’t quite the problem you’re trying to solve (I don’t think), but it’s somewhat related.
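For reference, pointing the multi-version client at such a folder is already possible through the existing external-client-directory network option; a minimal sketch (the directory path is illustrative, and error checking is omitted):

#define FDB_API_VERSION 510
#include <foundationdb/fdb_c.h>
#include <string.h>

void setup_multi_version_client(void) {
    /* Directory containing one libfdb_c per FDB minor version,
     * e.g. libfdb_c_5.0.so, libfdb_c_5.1.so, ... (illustrative path). */
    const char* dir = "/opt/myapp/fdb-clients";

    fdb_select_api_version(FDB_API_VERSION);

    /* The loaded library stays primary; the external copies are tried as
     * well, so whichever one is protocol-compatible with the cluster ends
     * up servicing requests. */
    fdb_network_set_option(FDB_NET_OPTION_EXTERNAL_CLIENT_DIRECTORY,
                           (const uint8_t*)dir, (int)strlen(dir));

    fdb_setup_network();
    /* fdb_run_network() must then run on a dedicated thread. */
}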

Would this be more convenient as an fdbcli command? I guess it depends on what you mean by “script”, but it seems to me that if this is a shell script, it seems weird to load a DLL and (somehow) call a function inside of it rather than issuing a CLI command. I can see other instances where an API call is the easiest thing, though, so I’m not saying that a method in FDB C is necessarily the wrong thing. In other words, would it be better if it existed in both places?

One slightly tricky case to handle is the upgrade case. Arguably, the correct behavior is (1) if the version of the client is newer than the version of the server, then it could be that the client has been upgraded before the server and there is an ongoing upgrade, so keep looping until the cluster is upgraded (the current behavior), and (2) if the version of the client is older than the version of the server, because we don’t support downgrading, this must be a configuration issue, so you should throw an error about version incompatibilities. Then some care needs to be taken in the multi-version transaction class to only surface the (non-retriable) error all the way up if none of the loaded fdb_c versions are compatible.
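A tiny sketch of that policy, assuming for simplicity that versions reduce to a single comparable number (all names here are hypothetical):

/* Hypothetical sketch of the upgrade-aware connection policy described above. */
typedef enum { KEEP_RETRYING, FAIL_INCOMPATIBLE, PROCEED } connect_policy;

connect_policy classify_version_mismatch(int client_version, int server_version) {
    if (client_version > server_version)
        return KEEP_RETRYING;     /* the cluster may be mid-upgrade, behind the client */
    if (client_version < server_version)
        return FAIL_INCOMPATIBLE; /* servers are never downgraded: configuration error */
    return PROCEED;
}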


Would the API call make more sense as a member on an FDBCluster (or FDBDatabase)? So, something like:

int fdb_cluster_get_version(FDBCluster* cluster);

It seems like the “cluster” is the thing that has a version, given that all databases (or, well, the one database) in a cluster will share a version, and then you would specify how it gets the cluster file and DB name* in the normal way (i.e., the same way it is done for all other operations). Presumably, this would all be hidden from the user if they did happen to have a CLI lying around, but if they didn’t, they could do something like fdb.createCluster(path_to_cluster_file).getVersion() in their binding of choice.
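In C, that flow might look like the following sketch; everything here is the existing 5.x fdb_c API except fdb_cluster_get_version, which is the hypothetical call from above (error checking omitted for brevity):

#define FDB_API_VERSION 510
#include <foundationdb/fdb_c.h>
#include <pthread.h>
#include <stdio.h>

static void* run_network(void* arg) { fdb_run_network(); return NULL; }

int print_cluster_version(const char* cluster_file) {
    fdb_select_api_version(FDB_API_VERSION);
    fdb_setup_network();
    pthread_t net;
    pthread_create(&net, NULL, run_network, NULL);

    /* Existing API: resolve the cluster file into a cluster handle. */
    FDBFuture* f = fdb_create_cluster(cluster_file);
    fdb_future_block_until_ready(f);
    FDBCluster* cluster = NULL;
    fdb_future_get_cluster(f, &cluster);
    fdb_future_destroy(f);

    int version = fdb_cluster_get_version(cluster); /* hypothetical */
    printf("cluster version: %d\n", version);

    fdb_cluster_destroy(cluster);
    fdb_stop_network();
    pthread_join(net, NULL);
    return 0;
}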

From an implementation point of view, it’s probably easiest to get the protocol version, given that all messages already have that meta-data inside of them, so it would just require the client attempting to talk to the cluster and seeing what the header is. This is also the thing that actually determines whether the client can talk to the server (i.e., it is a protocol version mismatch that strictly prohibits two versions from communicating), so in some sense it’s the thing that you actually have to watch out for changing. However, it is also really supposed to be an internal number, so while having clients look at it to see whether it matches expectations somewhat makes sense, it would seem weird if they started using it to guess what the semantic version was.

I think you’d want to return the version as a string, but I could be wrong. Returning it as a single integer like the API version seems somewhat weird to me, but I guess it depends on whether you want this endpoint for human edification or for machines to make decisions on. (For the latter, you might want to put it in some kind of struct with major, minor, and patch fields.) And then you have to ask if your script needs to be able to handle different servers with different protocol versions within a prerelease build. (Until we actually release a version of fdbserver, we might update the protocol version of unreleased software at will multiple times. This lets us experiment with the protocol freely until such time as it’s ready to release.)
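For the machine-consumption case, a sketch of what such a struct-returning variant could look like (all names here are illustrative, not a real fdb_c API):

#define FDB_API_VERSION 510
#include <foundationdb/fdb_c.h>

/* Illustrative only: one possible machine-readable shape for the result. */
typedef struct {
    int major;         /* e.g. 5 */
    int minor;         /* e.g. 1 */
    int patch;         /* e.g. 7 */
    int is_prerelease; /* nonzero for alpha/beta or unreleased protocol versions */
} FDBClusterVersion;

/* Hypothetical: fills out_version, or returns an error if the cluster is
 * unreachable or too different to even answer the request. */
fdb_error_t fdb_cluster_get_semantic_version(FDBCluster* cluster,
                                             FDBClusterVersion* out_version);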

But those are just my two cents. I’m also not convinced that I actually said anything after saying so much, but that’s what came to mind.

* The create cluster API still takes a DB name, and it is still the case that it will throw an error if it isn’t literally “DB”. Some of the bindings, like the Java bindings, no longer give the user the ability to specify anything else.

Maybe I should have clarified previously: We want to deploy and use FDB as part of a solution that can be installed by people with minimal to no training at all, so everything must be as automated as possible, without requiring user input, and we cannot assume that they will always perform the most appropriate action.

For the vast majority of the installs, I foresee very tiny deployments, either 1 or maybe 3 servers in the cluster. If the customer does not choose to buy upgrades or a technical support contract, he/she must still be able to reinstall/re-deploy the solution, using the binaries and documentation that they found in an old backup folder.

I know that people who write distributed databases with ACID guarantees for a living always cringe at the idea of their baby running on crappy hardware with buggy hard-drive firmware and with no-one monitoring the log files or reading the documentation :). The cloud can help solve this issue, but not everybody is there yet, and we still need to support customers that prefer - for whatever reason - to stay on-premise, or cannot use the cloud for legal reasons.

I foresee four different situations:

  1. Single instance deployment where the application and fdb are installed on the same machine, and deployed at the same time. These customers will probably have no pre-existing virtual machines or container infrastructure, and certainly no-one trained in the complexities of distributed databases.

  2. Medium deployments with multiple servers, and possibly an existing VM / container infrastructure if we are lucky. The deployment will most probably be done by technicians with minimal training, with access to level 2 phone/email tech support. They will probably have no experience with distributed databases (or any kind of database, even), so we can’t rely on them being fluent with all the tools or even reading log files.

  3. Large accounts with their own data centers and trained sysadmins/sub-contractors where everything is designed and tested before going to production (well,… most of the times :sweat_smile: ).

  4. Customers that are already 99.9% in the cloud that don’t need anything installed on-premise (at least not a distributed database). We handle everything, and hopefully we know what we are doing :wink:

For cases 3 and 4, we will have trained people handling the installation and upgrade process, so we can impose any constraints we want and require them to think about what they are doing.

The main issues I need to resolve are cases 1 and 2. We need a way to make sure that the installation process can - if possible - handle most situations, or at least give us proper error messages so that we can minimize the time needed to troubleshoot. I never again want to spend hours collecting logs with some weird error code and chasing red herrings, only to realize that they used the old setup with old binaries.

This works if the customer always downloads the latest installation package, which by default redistributes all previous major versions of libfdb_c.

This does not work if for some reason, someone uses an old package that they kept from the initial install a few years ago. I’ve seen this happen so many times with our product, hence why I need a solution here :slight_smile:

The current version of our product already checks the SQL schema on startup, and notifies the admin of a problem if the database schema matches an older version, or even if it looks “from the future”. Also, most SQL drivers out there will still be able to connect to older/newer versions and allow us to query for minimum supported features (or to enable/disable some behavior depending on their availability).

I cannot do that currently with fdb if the cluster is more recent, because I won’t be able to connect to it and read any keys (including a schema version key). I cannot even tell what future version it is. Worse, the user-visible symptom will be a timeout (the error code will be in log files that nobody ever bothers to read). I’m trying to prevent that :slight_smile:

I used the word “script” as a general term. I would probably have some executable that would use the .NET binding and only call the API needed to print the version string to stdout, and exit. Or maybe a PowerShell script could call the binding itself. I’ll let the guys who are experts in custom deployment decide what they prefer.

Using fdbcli currently would require having the correct version deployed, which would not work in this case (the cluster having been upgraded since the initial deployment). Unless, of course, fdbcli gets support for the get_version API and does this for us; then we could use it directly instead of via a binding. I guess having access to this via the bindings would allow better error messages at runtime.

Sure, but wouldn’t this need to return an FdbFuture<int>? Since it has to connect to at least one coordinator, it is async and could also fail if no coordinators are valid (most probably if the fdb.cluster file is obsolete).

If futures require a running network thread, this could become a bit more difficult, since we cannot stop/restart it in the same process…

Ideally, it would be nice if this API could work in a way that requires minimal setup. But if this is not possible, then at least if it is properly documented we can work around that.
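To illustrate, a future-returning variant might follow the usual fdb_c pattern; in this sketch, fdb_cluster_get_server_version and fdb_future_get_cluster_version are hypothetical names, while the future plumbing is the existing API:

#define FDB_API_VERSION 510
#include <foundationdb/fdb_c.h>

fdb_error_t get_server_version(FDBCluster* cluster, int* out_version) {
    FDBFuture* f = fdb_cluster_get_server_version(cluster); /* hypothetical */
    fdb_error_t err = fdb_future_block_until_ready(f);
    if (!err)
        err = fdb_future_get_error(f); /* e.g. no reachable coordinators */
    if (!err)
        err = fdb_future_get_cluster_version(f, out_version); /* hypothetical */
    fdb_future_destroy(f);
    return err;
}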

I’m not sure either. The only thing I see that a string would give us is support for -alpha/-beta suffixes, but I don’t think you guys use them? Also, since this is the version of the cluster itself, I think that if someone decided to upgrade the entire cluster to some pre-release version, they probably had good reasons, and the application should respect that choice.

Thinking about this more, one solution could be to address the problem with auto-updates of the application itself. We could, like you say, ship all previous versions of libfdb_c so that it can connect to current or older versions.

But the current library is almost 6 MB, and shipping multiple copies of it will quickly grow to a large fraction of the current setup package. We want to keep the installation package as slim as possible.

Scenario:

  • Customer downloads version N that - at the time of download - uses the most current fdb and libfdb_c versions.
  • Over time, the application auto-updates to add new features without changing the fdb version. So far so good: reinstalling from a previous version will only lose features (until the next auto-update).
  • If at some point we need to use new fdb features, we will require upgrading the cluster itself to version N+1. This cannot be done by the application itself if the cluster is running on different servers, so action by the sysadmins (or a sub-contractor) is required.
  • This process will repeat itself for a few cycles, and we are now some years later at fdb version N+2.

If, at some point, the app server needs to be reinstalled for whatever reason, and the admin uses the old binary installer (and by “old” it could be the installer from last week, before yesterday’s version that upgraded the cluster), then the installer or application will not be able to talk to the upgraded cluster, and will not provide any useful error message, except maybe timeouts in the logs.

The customer will call us with a symptom like “the installer is stuck” / “the application will not start”, which could be anything, and resolving the ticket will take too much time and effort.

  • One way this could happen is when an attempt to upgrade both the application and the fdb cluster fails, and the customer wants to quickly revert to a previous version, but the cluster has already been upgraded to version N+1. If the previous installer can tell us which version of the library we need, we can instruct the customer to download it and install it in some well-known location, while waiting for a full resolution of the initial problem.

tl;dr: I’m looking for a solution where some automatic installation process (or application at runtime) can display a message like “You are using version X of the client library, but the cluster is at version Y. Call your technical support and tell them exactly that”. This way, the issue can be solved in a matter of minutes. The installer could even download the new version automatically.


A couple other random thoughts:

fdbcli now tells you if it’s trying to connect to any processes that are incompatible. Note that it’s possible that some but not all processes are incompatible if you have a partially upgraded cluster. It won’t tell you what version the cluster is running at (it only knows about protocol versions, not release versions), but at least it’s a more informative signal than printing nothing. We’ve tossed around some ideas for how to do this in the native library, and I think the currently favored idea is to have some side channel that we can push this kind of information through (so as not to interrupt the connection attempt). It’s probably going to be a little bit more complicated when using the client library because of the multi-version client, but it should be doable.

The API version’s value is technically not tied to the release version, though we try to make them line up. It’s probably best to use the release version in this case (e.g. 5.1.7). However, given that the processes in a cluster could be running any combination of 5.1 versions, maybe the right number to provide would just be “5.1”? Or we’d have to try to better define what constitutes a cluster’s version. Prerelease versions (at least those that are incompatible with the release) should be specifically marked as well to avoid confusion.

I agree this seems like the most natural place, though it’s a bit unfortunate that the typical use of our bindings (fdb.open) doesn’t have the user dealing with Cluster objects at all.

Since the goal of the API is to detect which version of libfdb_c to use, it should at least return a value that points to the minimum version that fits the bill. So for 5.1.7, it should probably return 510, because (I think?) libfdb_c v5.1.0 should be able to connect to processes running v5.1.7?

If the installation process gets 510, it could have its own logic to decide the range of compatible versions (5.1.0 <= v <= 5.1.7) and either use the lowest version, OR use the highest version available (not sure which is better; I’d say the highest version with all the bugfixes, but it could also introduce new bugs…).
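A sketch of that selection logic, assuming the API reports the cluster’s major/minor pair and the installer knows which library versions it has bundled (the data layout is illustrative):

#include <stddef.h>

typedef struct { int major, minor, patch; } lib_version;

/* Pick the newest bundled libfdb_c within the cluster's major.minor series;
 * patch releases within the same series are assumed compatible. */
const lib_version* pick_client_library(const lib_version* available, int count,
                                       int cluster_major, int cluster_minor) {
    const lib_version* best = NULL;
    for (int i = 0; i < count; i++) {
        if (available[i].major != cluster_major ||
            available[i].minor != cluster_minor)
            continue;
        if (!best || available[i].patch > best->patch)
            best = &available[i]; /* prefer the newest patch, with all the bugfixes */
    }
    return best; /* NULL: nothing compatible bundled, time to download */
}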

That’s true, though the bindings could expose a static helper method that creates a cluster instance, calls the API, and then destroys the handle immediately after (a bit wasteful, but it is expected to be called only once, during installation or startup of the application).

And internally, if the bindings need to resolve this version at runtime - when transactions start failing because the cluster was just upgraded - they could use the existing cluster handle that they should have somewhere (at least in my case, the database object references the cluster object which wraps a valid ClusterHandle).

The status json can give the runtime version of the nodes in the cluster. Do you think that getting that info would be possible for an older client?

If the client needs to connect to a storage (or master?) node to get this info, then it would need to support the new protocol anyway. The other option would be to have a dedicated request type that is protocol-version neutral, just to get this information (part of what status json returns), and that would not change in future versions. Not sure if this is feasible or even a good idea…

I suspect the easiest thing to do here would be to report out the protocol versions that are already being exchanged and then have some external mapping between release versions and protocol versions. I haven’t thought about it much, but if it were feasible to update this connection packet with a release version number as well, then that could be reported instead (I’m not sure yet what the implications of this are, though). The result is that you would know the (protocol and maybe release) versions of the processes you were trying to connect to, which in an incompatible scenario are going to be the coordinators only.
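Such an external mapping could be as simple as a lookup table shipped with the installer; a sketch (the protocol version constants below are placeholders, not the actual values FDB puts on the wire):

#include <stdint.h>
#include <stddef.h>

typedef struct { uint64_t protocol_version; const char* release_series; } protocol_entry;

/* Placeholder values for illustration only. */
static const protocol_entry known_protocols[] = {
    { 0x0FDB00A550000001ULL, "5.0" },
    { 0x0FDB00A551000001ULL, "5.1" },
};

const char* release_for_protocol(uint64_t pv) {
    for (size_t i = 0; i < sizeof known_protocols / sizeof known_protocols[0]; i++)
        if (known_protocols[i].protocol_version == pv)
            return known_protocols[i].release_series;
    return NULL; /* unknown: likely a release newer than this mapping */
}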

Providing a version agnostic way for reading status sounds like a much thornier feature, as it would require sufficient testing of this feature across versions (which our simulator doesn’t currently support), and you’d have to deal with the possibility that the status schema can change over versions.

Ignoring prerelease versions, 5.1.x are all compatible, so the best thing for the client to do is probably to use whatever the newest version of 5.1 is that they can get their hands on. In general, we enforce by policy that patch releases within the same major/minor version are compatible, and that any major or minor version change is protocol incompatible. For that reason, I think the major and minor version (e.g. 5.1) are the most relevant details to know, in addition to whether it’s a prerelease.

The problem with using the API version (510) is that there aren’t any guarantees about its relation to the server version. It’s possible we could formalize something (which would probably also require setting the max minor version to 9 and maybe also not changing the API version on patch releases), but I haven’t given any thought to whether that’s a good idea.