How would I recover from this failed cluster move?

I attempted to move a development FDB cluster to a new set of machines, and broke the cluster, resulting in data loss. I’d like to understand where I went wrong, and whether it’s possible to recover any of the data.

Here were the steps I took:

  1. Initial “green” cluster: 3 machines, 2 processes each, replication configuration ‘double’
  2. Spin up an identical “blue” cluster (a fresh FDB install) and cluster its 3 machines together.
  3. Join the “blue” cluster to the “green” cluster by copying the cluster file and restarting FDB. Now in one “super” cluster, 6 machines, 12 processes total.
  4. Exclude the three “green” machines with the fdbcli command exclude FORCE, roughly as sketched below (this feels like my mistake, but a regular exclude kept saying it couldn’t calculate the size of the database)
  5. After all 3 excludes succeed, shut down FoundationDB on the green cluster.
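
For reference, the exclusion in step 4 looked roughly like this in fdbcli (the IPs are placeholders, not the real addresses):

  fdb> exclude FORCE 10.0.0.1 10.0.0.2 10.0.0.3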

At this point, the new cluster is in a bad state. I eventually wiped it and set up a fresh cluster, but I still have the 3 excluded green nodes around. Is there anything I can do to get them clustered and included again, with their data intact?

You have the correct idea of how to move a cluster to a new set of machines. However, there are protections in place within the database that would have prevented the blue cluster processes from joining the green cluster. Data loss protection, initially implemented in Data loss protection v3 by sfc-gh-ljoswiak · Pull Request #8560 · apple/foundationdb · GitHub, is a feature that prevents processes from one FDB cluster from joining another.

When a stateful FDB process comes online (like a tlog or storage server), it attempts to register with the cluster controller. The cluster controller knows which stateful processes belong to the cluster (this is stored in the coordinated state on the coordinators). Without data loss protection, if a tlog from the blue cluster comes online and registers with the green cluster, the tlog will not be recognized because it is from a different cluster, and it will wipe all of its on-disk data. This is a problem because it leads to potential data loss scenarios in production if cluster files are accidentally updated to point to the wrong cluster. The data loss protection PR above prevents this issue by assigning a “cluster ID” UID to each stateful process upon its first registration with the cluster controller. This value is stored in a file called clusterId in the data directory of each stateful process.

To correctly perform machine moves, you should stop the blue cluster, delete the clusterId file from each process’s data directory, and then modify the cluster file to point to the green cluster. Alternatively, you can copy the cluster file from the green cluster to the blue machines before running them for the first time.
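
A rough sketch of that sequence, run on each blue machine, assuming a standard packaged install (cluster file at /etc/foundationdb/fdb.cluster, data directories under /var/lib/foundationdb/data; adjust the paths and service name for your setup):

  # stop the fdbserver processes on this blue machine
  sudo systemctl stop foundationdb
  # remove the clusterId marker from each process's data directory
  sudo rm -f /var/lib/foundationdb/data/*/clusterId
  # point this machine at the green cluster
  sudo cp /path/to/green/fdb.cluster /etc/foundationdb/fdb.cluster
  sudo systemctl start foundationdb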

If you still have the logs around, you can look for WorkerBelongsToExistingCluster traces to verify that data loss protection is the cause of the issue. These get logged on the cluster controller, so they should all come from one machine in the green cluster. You can also look for ZombieProcess traces, which should have been output on each of the blue cluster processes, indicating they are refusing to join the new cluster and are going into a zombie state.
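
For example, something along these lines, assuming the default trace log directory of /var/log/foundationdb:

  # on the green machines (one of them was the cluster controller):
  grep -l WorkerBelongsToExistingCluster /var/log/foundationdb/trace.*.xml
  # on the blue machines:
  grep -l ZombieProcess /var/log/foundationdb/trace.*.xml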

To attempt to recover your data, I would set up your original green cluster by itself, and then include the excluded processes. The result of your original exclude FORCE would have been to exclude all the processes in the cluster. I’m not sure what exactly happens in this case, but you may be able to recover the data by simply including the processes back.
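
Once the green cluster is up on its own, the include would look something like this in fdbcli (IPs are placeholders):

  fdb> include all

or, to include only specific machines:

  fdb> include 10.0.0.1 10.0.0.2 10.0.0.3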

I’m not sure if data loss protection triggered here - I can’t find any WorkerBelongsToExistingCluster or ZombieProcess traces. Do you know the expected severity for those log messages?

Also, I’m not sure how to start the green cluster on its own. I attempt to change the cluster file to point to the old (excluded, green) coordinators, but every time it starts, it gets rewritten to point to the new (blue) coordinators, which have since been destroyed/recreated with new IPs (and are therefore unreachable).

I may have misinterpreted this

  • Join the “blue” cluster to the “green” cluster by copying the cluster file and restarting FDB. Now in one “super” cluster, 6 machines, 12 processes total.

Did you change the cluster files on the green cluster processes to point to the blue cluster, or did you change them on the blue cluster to point to the green cluster? Based on the cluster file containing blue cluster IPs, it sounds like you changed the cluster files on the green cluster.

In either case, I’m a little surprised you don’t see either of those traces - they should both log as SevError. If you didn’t run configure new on the blue cluster, then I don’t believe a clusterId file gets created, so that could be it. But then the coordinator IPs in the cluster file wouldn’t have been changed unless you updated the coordinators using the fdbcli coordinators command.

Also, please add the output of fdbserver --version.

I changed the cluster files on the blue cluster to point to green. After that, I ran exclude FORCE on each of the green machine IPs, and then I ran coordinators auto (sorry, I forgot to mention this), which explains why the IPs were changed.

My FDB server version is:

FoundationDB 7.2 (v7.2.0)
source version 5eae3be195ee5c1302878459d3d1d34282b1ee60
protocol fdb00b072000000

If they were SevError, I definitely didn’t see those log messages. I checked both our logging pipeline and the files on disk on the green cluster. min_trace_severity was set to 20, but that shouldn’t have prevented SevError messages, if I understand that knob correctly.

This source version does not have data loss protection. The good news is that, since you modified the cluster files on the blue cluster, it is unlikely that the stateful processes on the green cluster wiped their data. However, I am not sure it will be possible to restore their data given the current state of the database.

I attempt to change the cluster file to point to the old (excluded, green) coordinators, but every time it starts, it gets rewritten to point to the new (blue) coordinators, which have since been destroyed/recreated with new IPs

I believe this behavior exists because the old coordinators store the value of the new coordinators, as a way to update clients that have not yet received the newest set of coordinators. The cluster connection string is stored internally in the database in the system key \xff/coordinators, so just updating the cluster file won’t necessarily work. Unfortunately, since it sounds like you deleted the data on the latest set of coordinators (on the blue cluster), the database has no way to bootstrap. It’s probably theoretically possible to “fix” the old coordinators (on the green cluster) in some way such that you can start the cluster, but I’m not aware of any tooling we have to perform such an action.
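
For what it’s worth, on a healthy cluster you can inspect that value from fdbcli, roughly like this (reading the system keyspace requires enabling a transaction option first):

  fdb> option on READ_SYSTEM_KEYS
  fdb> get \xff/coordinators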

Is there any documentation on the format of the data stored in each process’ data directory? If I could manually modify the system key you mention within the old coordinators, that might help me “fix” the old coordinators, but that would require some knowledge of the internal data format.

Also, perhaps it belongs on a separate thread, but is there any documentation describing why my use of exclude [IP] didn’t work, leading me to use the erroneous exclude FORCE [IP]? (As described in the first post in this thread.)

I was talking to @alexmiller about this, and he mentioned a recent change that can directly dump key-value pairs from a storage server SQLite file: Add role 'kvfiledump' to dump key-values from storage file by ppggff · Pull Request #6302 · apple/foundationdb · GitHub. You can run this with fdbserver --role kvfiledump --kvfile <path_to_sqlite_file>.
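
For example, pointed at one of the storage files in a green process’s data directory (the path and file name here are only an illustration; with the ssd engine the storage files are the .sqlite files in each storage server’s data directory):

  fdbserver --role kvfiledump --kvfile /var/lib/foundationdb/data/4500/storage-<UID>.sqlite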

If I could manually modify the system key you mention within the old coordinators, that might help me “fix” the old coordinators

Thinking about this a little more, I don’t think fixing the old coordinators would help in this case. The state of the new coordinators is likely still needed to correctly bring the cluster back up. The difficulty is that it sounds like your storage servers still have all the data, but since the coordinator files are missing, the cluster can’t correctly boot. Hopefully the kvfiledump tool above will help.

Also, perhaps it belongs on a separate thread, but is there any documentation describing why my use of exclude [IP] didn’t work, leading me to use the erroneous exclude FORCE [IP]?

I’m not aware of any documentation on the failure modes of exclude. The check occurs in checkExclusion (https://github.com/apple/foundationdb/blob/16b5a22cef0bce4f7f40e7d8bc9b4279a3df1b56/fdbclient/SpecialKeySpace.actor.cpp#L952), and presumably you saw one of the errors in that function when you originally attempted the exclude.

This is helpful, thanks!

After a little experimentation, it looks like the data was deleted from the green storage servers when I ran the exclude, as the dumps are empty now. Presumably it would’ve been present on the blue cluster we migrated to (which has since been deleted).

I was able to try out the kvfiledump command on our new cluster though, and it works as expected! For future readers: the server must be stopped, otherwise you’ll get an error like:

Dump start: , end: \xff\xff, debug: true
Fatal Error: Broken promise