Doubts regarding fdb.cluster file

Hi, I am having a bit of confusion understanding the semantics of a “legal” fdb.cluster file:

Reading the documents here, I assumed that in order for a client to connect to an fdb cluster, it must have an fdb.cluster file identical to the one used by the servers in the cluster. However, this does not seem to be correct. I performed a toy experiment, and here are the findings:

  1. Created an FDB cluster with two processes on a single node, and assigned one of the processes to be the coordinator (say 127.0.0.1:4500).
  2. Copied the fdb.cluster file from /etc/foundationdb/fdb.cluster to a temp location and named it fdb_client.cluster.
  3. Started fdbcli on the server node without any special cluster file parameters (so that it uses the file at the default location, /etc/foundationdb/fdb.cluster), then changed the coordinator to 127.0.0.1:4501 and exited fdbcli.
  4. Observed that the ID and the coordinator address in the updated /etc/foundationdb/fdb.cluster had changed (due to changing the coordinator).
  5. Started fdbcli with -C fdb_client.cluster, using the copy made earlier in the temp location. fdbcli was able to join the cluster successfully (note that this file still had the old coordinator ip:port and the old ID prior to this step). After fdbcli joined the cluster, it updated fdb_client.cluster (in the temp location) to match the contents of /etc/foundationdb/fdb.cluster.

I also tried repeating the above steps (fresh start, with the same initial conditions), but this time, prior to step (5), I manually edited the description in the fdb_client.cluster to something random. Now, in step (5), fdbcli was not able to connect to the cluster.

So, I have doubts about what constitutes a valid fdb.cluster file for a client that wants to join an existing cluster. From the above experiments, I could observe that (a) if the fdb.cluster file that the client is using points to “some” live process in the cluster (not necessarily a coordinator), the client is able to join the cluster, and it then updates its cluster file; and (b) even if the ID in the client’s cluster file does not match that of the fdb cluster, the client is able to join, as long as the description matches.

Could someone please clarify what is the minimum requirement for a client’s fdb.cluster file to be considered legal in order to join a running cluster?

–thanks

Just so I don’t have to keep going back and forth between here and our docs, here’s the cluster file format:

description:ID@ip1:port1,ip2:port2,ip3:port3
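
To make the format concrete, here is a minimal Python sketch that splits such a line into its three parts. It is not FDB's own parser: the names ClusterFile and parse_cluster_string are made up for illustration, and the character classes (letters, digits and underscore for the description; letters and digits for the ID) are an assumption about what is allowed. It also assumes a single-line file and does not validate the individual addresses.

```python
import re
from typing import List, NamedTuple


class ClusterFile(NamedTuple):
    """Hypothetical container for the three parts of a cluster file line."""
    description: str
    cluster_id: str
    coordinators: List[str]


# Assumed character sets; adjust if your deployment differs.
_PATTERN = re.compile(
    r"^(?P<description>[A-Za-z0-9_]+):(?P<id>[A-Za-z0-9]+)@(?P<coords>.+)$"
)


def parse_cluster_string(text: str) -> ClusterFile:
    """Split 'description:ID@ip1:port1,ip2:port2' into its parts."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a description:ID@addresses line: {text!r}")
    # The coordinator list is comma-separated; addresses are kept as strings.
    coordinators = [c.strip() for c in match.group("coords").split(",")]
    if any(not c for c in coordinators):
        raise ValueError(f"empty coordinator address in: {text!r}")
    return ClusterFile(match.group("description"), match.group("id"), coordinators)
```

For example, parse_cluster_string("mydb:abc123@127.0.0.1:4500") would give a description of "mydb", an ID of "abc123" and a single coordinator address.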

To connect to an FDB cluster, you must have a matching description. In some sense, the description should be used to track the progress of a single cluster through time as machines are added and removed.*

Every time the coordinators are changed, the ID should be updated as the ID uniquely identifies the coordinator set. However, if you try to connect to the cluster using an older ID (after it has been changed), the server will give you the newer ID and then the client updates its cluster file. This can be done by serving the updated file from one of the old coordinators even if they are no longer in the cluster.

The reason for this is to allow for changing the coordinators midstream without downtime. When the coordinator change happens, the update is propagated to any connected client and they update their file. Any dormant client will also pick up the change when they wake back up. The problem case happens when all of the coordinators are changed and removed from the cluster. (This might happen if, say, the cluster is moved to an entirely different set of hardware.) In that case, any client that doesn’t connect to the database between the cluster file being changed and the old coordinators being removed from the cluster will be forever more unable to connect to the cluster unless they can get the updated file from someone else.
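
The forwarding behaviour described above can be sketched as a toy model in Python. This is only an illustration of the idea, not how FDB implements it: ToyProcess, forward_to and resolve are hypothetical names, the ID is not checked here, and a "process" is just a dictionary entry keyed by its address.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ToyProcess:
    """A running process in the toy model (not real FDB code)."""
    is_coordinator: bool = False
    # Newer connection string that a former coordinator hands to stale clients.
    forward_to: Optional[str] = None


def resolve(connection_string: str,
            running: Dict[str, ToyProcess],
            max_hops: int = 10) -> Optional[str]:
    """Follow forwards from a possibly stale connection string.

    Returns the up-to-date connection string, or None if no listed
    process is still running and able to answer.
    """
    current = connection_string
    for _ in range(max_hops):
        addresses = current.split("@", 1)[1].split(",")
        next_string = None
        for addr in addresses:
            proc = running.get(addr)
            if proc is None:
                continue  # this process has been removed from the cluster
            if proc.is_coordinator:
                return current  # the string already names a live coordinator
            if proc.forward_to is not None:
                next_string = proc.forward_to
                break
        if next_string is None:
            return None  # nobody left to tell the client about the change
        current = next_string
    return None
```

In this model, a dormant client holding the old string still resolves to the new one while the old coordinator process is running; once that process is deleted from the dictionary, resolve returns None, which corresponds to the problem case above.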

But it’s a little more stringent than just letting the client connect regardless of the ID. For example, if you take your cluster file copy (i.e., fdb_client.cluster) and just randomly change the ID, I believe you’ll find that you can’t connect. Likewise, if you randomly change one of the coordinators in the file, you shouldn’t be able to connect even if that process is in the cluster.

So, I believe the minimal requirements are:

  1. The description must exactly match the description in the servers’ cluster files.
  2. The ID must match either the current or a previous ID used by the cluster (assuming at least one coordinator from when that ID was the current ID is still in the cluster).
  3. The coordinator set should match the coordinator set associated with the ID.

Or something along those lines.
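
As a rough illustration, the three requirements above can be written as a small Python predicate. This encodes only the rules as stated in this post, which are themselves a best guess; Generation, can_connect and the history list are hypothetical names, not part of any FDB API.

```python
from typing import FrozenSet, Iterable, List, NamedTuple, Set


class Generation(NamedTuple):
    """One (ID, coordinator set) pair the cluster has used over time."""
    cluster_id: str
    coordinators: FrozenSet[str]


def can_connect(description: str,
                cluster_id: str,
                coordinators: Iterable[str],
                server_description: str,
                history: List[Generation],
                running: Set[str]) -> bool:
    """Check a client's cluster file against the three rules listed above."""
    # Rule 1: the description must match exactly.
    if description != server_description:
        return False
    for gen in history:
        # Rule 2: the ID must be the current one or a previous one.
        if gen.cluster_id != cluster_id:
            continue
        # Rule 3: the coordinator set must be the one tied to that ID.
        if frozenset(coordinators) != gen.coordinators:
            return False
        # Rule 2, continued: some coordinator from that era must still be around.
        return any(addr in running for addr in gen.coordinators)
    return False
```

With a history of two generations (say "abc123" with 127.0.0.1:4500 and then "def456" with 127.0.0.1:4501), a client holding the first generation's file passes the check only while 127.0.0.1:4500 is still in the running set.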


* But if you change all of the machines, is it really the same cluster?

Thank you @alloc! This explains why I was able to connect using a stale fdb.cluster file.

And this was precisely the test case I was checking for: how do dormant clients find out about the cluster's coordinators if all of the previous coordinator roles have moved to new processes?

Yeah, unfortunately, I think the only way to handle that case is to have some other way of vending the cluster file. You probably want to know what the cluster file is for your cluster anyway so you can give it to new clients when they start up, so it might actually be the case that you just need to redeploy those dormant clients with the new file (retrieved from whatever system is remembering the file).