How to recover an fdb database after attempting to exclude the single satellite node?

osamarin · October 9, 2020, 10:51am

Hello!

I have a 2-region 3dc configuration. Each region has 3 nodes in the primary dc and one node in a satellite dc. “satellite_redundancy_mode” is set to “one_satellite_single”.

Now I want to remove all satellite dcs from the configuration.

Excluding the single satellite dc node from the second region with the exclude command in fdbcli completed without any issues. But excluding the satellite dc node from the first region hangs.

After this hang the database became unavailable.

fdb> status details
Using cluster file `fdb-primary.cluster'.
Recruiting new transaction servers.
Need at least 2 log servers across unique zones, 1 commit proxies, 1 GRV
proxies and 1 resolvers.

Have 8 non-excluded processes on 8 machines across 8 zones.
Unable to locate the data distributor worker.
Unable to locate the ratekeeper worker.

It seems excluding the last satellite node with the one_satellite_single policy was a bad idea.

Is there any way to recover data from this database and restore the cluster to operating mode?

ajbeamon · October 9, 2020, 3:10pm

Seems the error message isn’t very good for this case.

Is there any way to recover data from this database and restore the cluster to operating mode?

As long as you still have enough of your original processes, I would expect it should be possible to start up some new satellite processes and recover the database. I don’t think it’s necessary that the satellite logs be the same or have their original data if you’ve lost it.

With the specified policy, it is required that your commits all go to a satellite before they can be considered complete. In order to remove the satellite, you’ll first need to change the policy to no longer have this requirement, and once done then you can remove the processes.
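A rough sketch of the first step described above (changing the policy before touching the processes). It assumes the region configuration has the JSON shape described in the FoundationDB multi-region configuration docs (a "regions" list whose entries carry "datacenters", "satellite_redundancy_mode" and optionally "satellite_logs"), and that the result would be applied with fdbcli's fileconfigure command. The datacenter ids and the strip_satellites helper are hypothetical, so compare the output with your own configuration before applying anything.

```python
import copy
import json

# Region-level keys that only make sense when satellites are configured.
SATELLITE_KEYS = ("satellite_redundancy_mode", "satellite_logs")


def strip_satellites(config):
    """Return a copy of a region configuration without satellite settings."""
    result = copy.deepcopy(config)
    for region in result.get("regions", []):
        # Keep only datacenters that are not flagged as satellites.
        region["datacenters"] = [
            dc for dc in region.get("datacenters", [])
            if not dc.get("satellite")
        ]
        for key in SATELLITE_KEYS:
            region.pop(key, None)
    return result


# Hypothetical two-region layout resembling the one in this thread.
example = {
    "regions": [
        {
            "datacenters": [
                {"id": "dc1", "priority": 1},
                {"id": "dc1-sat", "priority": 1, "satellite": 1},
            ],
            "satellite_redundancy_mode": "one_satellite_single",
        },
        {
            "datacenters": [
                {"id": "dc2", "priority": 0},
                {"id": "dc2-sat", "priority": 0, "satellite": 1},
            ],
            "satellite_redundancy_mode": "one_satellite_single",
        },
    ]
}

if __name__ == "__main__":
    # Write this JSON to a file and review it before using fileconfigure.
    print(json.dumps(strip_satellites(example), indent=2))
```

Only once the cluster is healthy under the new configuration would the satellite processes be excluded.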

john_brownlee · October 9, 2020, 3:25pm

This may be another case where this would be useful: https://github.com/apple/foundationdb/issues/1292

osamarin · October 9, 2020, 4:50pm

As long as you still have enough of your original processes, I would expect it should be possible to start up some new satellite processes and recover the database. I don’t think it’s necessary that the satellite logs be the same or have their original data if you’ve lost it.

All original satellite processes are started and have kept their data. But they do not help because they have been excluded by the exclude command.

I’ve created another node in the satellite datacenter and started a new process on it. fdbcli now shows 9 processes instead of 8, but fdb doesn’t use the new process as a log destination.
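To check which processes the cluster still treats as excluded, one option is to inspect the machine-readable status. The sketch below assumes the output of fdbcli's status json command exposes cluster.processes.<id>.address and cluster.processes.<id>.excluded. The sample document is made up and heavily trimmed, the port 4500 is an assumption, and excluded_addresses is a hypothetical helper.

```python
import json


def excluded_addresses(status):
    """Return the addresses of processes that the status marks as excluded."""
    processes = status.get("cluster", {}).get("processes", {})
    return sorted(
        info.get("address", pid)  # fall back to the process id if no address
        for pid, info in processes.items()
        if info.get("excluded")
    )


# Made-up, trimmed stand-in for `status json` output.
sample = json.loads("""
{"cluster": {"processes": {
  "a1": {"address": "192.168.56.91:4500", "excluded": true},
  "b2": {"address": "192.168.56.92:4500", "excluded": false},
  "c3": {"address": "192.168.56.11:4500", "excluded": false}
}}}
""")

if __name__ == "__main__":
    print(excluded_addresses(sample))
```

Whether the status is fully populated while the database is unavailable is not something this thread confirms, so treat the result as a hint only.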

ajbeamon · October 9, 2020, 4:51pm

Are you able to re-include them?

osamarin · October 9, 2020, 4:52pm

How do I re-include it?

ajbeamon · October 9, 2020, 4:54pm

In fdbcli:

> include <IP:port>

Or if you excluded the whole IP,

> include <IP>

osamarin · October 9, 2020, 4:55pm

The include command hangs:

fdb> include 192.168.56.92
WARNING: Long delay (Ctrl-C to interrupt)
The database is unavailable; type `status' for more information.

ajbeamon · October 9, 2020, 4:57pm

Ok, I wasn’t sure if that would happen or not. You could also potentially try reconfiguring the cluster to remove the satellite, which is probably more likely to work.

osamarin · October 9, 2020, 5:00pm

Excellent!
Including the new machine does not work. But including the old machine works!

fdb> include 192.168.56.91
WARNING: Long delay (Ctrl-C to interrupt)
The database is unavailable; type `status' for more information.
fdb> status details

… A good status …

osamarin · October 9, 2020, 5:02pm

So the right way to recover fdb after excluding the last satellite machine is to re-include the same machine.
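For anyone scripting this, here is a minimal sketch of the sequence that worked in this thread (include the originally excluded satellite, then check the status), expressed as non-interactive fdbcli invocations using its -C and --exec options. The helper names are hypothetical, and the cluster file and address are simply the ones quoted above. The snippet only builds and prints the commands; it does not run them.

```python
import shlex


def fdbcli_argv(cluster_file, command):
    """Build an argv list that runs one fdbcli command non-interactively."""
    return ["fdbcli", "-C", cluster_file, "--exec", command]


def recovery_commands(cluster_file, satellite_address):
    """Commands mirroring the thread: re-include the old satellite, then check."""
    return [
        fdbcli_argv(cluster_file, f"include {satellite_address}"),
        fdbcli_argv(cluster_file, "status details"),
    ]


if __name__ == "__main__":
    for argv in recovery_commands("fdb-primary.cluster", "192.168.56.91"):
        # shlex.join quotes the --exec argument for pasting into a shell.
        print(shlex.join(argv))
```

As seen above, the include may still report a long delay before the cluster recovers, so a script running these commands would need to tolerate that.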
