How to recover an FDB database after an attempt to exclude the single satellite node?

Hello!

I have a 2-region, 3-DC configuration. Each region has 3 nodes in the primary DC and one node in a satellite DC. “satellite_redundancy_mode” is set to “one_satellite_single”.

Now I want to remove all satellite DCs from the configuration.

Excluding the single satellite DC node from the second region with the exclude command in fdbcli completed without any issues. But excluding the satellite DC node from the first region hangs.

After this hang the database became unavailable.

db> status details
Using cluster file `fdb-primary.cluster’.
Recruiting new transaction servers.
Need at least 2 log servers across unique zones, 1 commit proxies, 1 GRV
proxies and 1 resolvers.

Have 8 non-excluded processes on 8 machines across 8 zones.
Unable to locate the data distributor worker.
Unable to locate the ratekeeper worker.

It seems that excluding the last satellite node with the one_satellite_single policy was a bad idea.

Is there any way to recover data from this database and restore the cluster to operating mode?

Seems the error message isn’t very good for this case.

As long as you still have enough of your original processes, I would expect it to be possible to start up some new satellite processes and recover the database. I don’t think it’s necessary that the satellite logs be the same or have their original data if you’ve lost it.

With the specified policy, it is required that your commits all go to a satellite before they can be considered complete. In order to remove the satellite, you’ll first need to change the policy to no longer have this requirement, and once done then you can remove the processes.

This may be another case where this would be useful: https://github.com/apple/foundationdb/issues/1292
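To make the "change the policy first" step more concrete, here is a rough Python sketch that takes a region configuration and produces a copy with the satellite settings stripped, which could then be saved to a file and applied with fileconfigure in fdbcli. The key names follow the documented region configuration format as I understand it; the datacenter ids (dc1, sat1, dc2, sat2) and priorities are made up for illustration, and I have not verified that dropping the satellites in one step is safe on every version, so treat this as an assumption to check against the docs rather than a recipe.

```python
import copy
import json


def strip_satellites(config):
    """Return a copy of a region configuration with all satellite settings removed."""
    result = copy.deepcopy(config)
    for region in result.get("regions", []):
        # Keep only datacenters that are not flagged as satellites.
        region["datacenters"] = [
            dc for dc in region.get("datacenters", []) if not dc.get("satellite")
        ]
        # Drop region-level satellite keys such as satellite_redundancy_mode.
        for key in [k for k in region if k.startswith("satellite")]:
            del region[key]
    return result


# Hypothetical two-region layout; ids and priorities are placeholders.
example = {
    "regions": [
        {
            "datacenters": [
                {"id": "dc1", "priority": 1},
                {"id": "sat1", "priority": 1, "satellite": 1},
            ],
            "satellite_redundancy_mode": "one_satellite_single",
        },
        {
            "datacenters": [
                {"id": "dc2", "priority": 0},
                {"id": "sat2", "priority": 0, "satellite": 1},
            ],
            "satellite_redundancy_mode": "one_satellite_single",
        },
    ]
}

if __name__ == "__main__":
    print(json.dumps(strip_satellites(example), indent=2))
```

Only after the new configuration has been applied and the cluster is healthy would the satellite processes be excluded and shut down.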

All original satellite processes are running and have kept their data. But they do not help, because they have been excluded by the exclude command.

I’ve created another node in the satellite datacenter and started a new process on it. fdbcli now shows 9 processes instead of 8, but fdb doesn’t use the new process as a log destination.

Are you able to re-include them?

How do I re-include it?

In fdbcli:

> include <IP:port>

Or if you excluded the whole IP,

> include <IP>

The include command hangs:

fdb> include 192.168.56.92
WARNING: Long delay (Ctrl-C to interrupt)
The database is unavailable; type `status’ for more information.

Ok, I wasn’t sure if that would happen or not. You could also potentially try reconfiguring the cluster to remove the satellite, which is probably more likely to work.

Excellent!
Including the new machine does not work. But including the old machine works!

fdb> include 192.168.56.91
WARNING: Long delay (Ctrl-C to interrupt)
The database is unavailable; type `status’ for more information.
fdb> status details

… A good status …

So the right way to recover fdb after excluding the last satellite machine is to re-include that same machine.
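Once status is good again, it may be worth checking that nothing is still marked as excluded before touching the configuration further. A small Python sketch, assuming the parsed output of `fdbcli --exec "status json"` and the per-process "excluded" and "address" fields found under cluster.processes; the sample document and port numbers below are invented for illustration:

```python
import json


def excluded_addresses(status):
    """List addresses of processes that status json reports as excluded."""
    processes = status.get("cluster", {}).get("processes", {})
    return sorted(
        # Fall back to the process id if the entry has no address field.
        info.get("address", proc_id)
        for proc_id, info in processes.items()
        if info.get("excluded")
    )


# Invented sample; a real status json document has many more fields.
sample = json.loads("""
{
  "cluster": {
    "processes": {
      "a1": {"address": "192.168.56.91:4500", "excluded": true},
      "b2": {"address": "192.168.56.92:4500", "excluded": false}
    }
  }
}
""")

if __name__ == "__main__":
    print(excluded_addresses(sample))
```

An empty list would suggest that every process has been re-included; any address that is still listed can be passed to include in fdbcli.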