How to save partial data after losing all tlog nodes and some storage nodes

ppggff · January 15, 2022, 11:38am

We had a 40-node cluster that we expanded to 60 nodes. The 20 additional nodes
comprise 8 tlog nodes (all of the cluster's tlogs) and 12 storage nodes.

During the expansion rebalance (20% of the data had been transferred), the 20 new nodes were
accidentally lost: their processes were killed and their data directories were deleted.

Restarting the cluster now reports “Locking coordination state.” and hangs.

In this case:

Is there any way to start the cluster? Data loss is acceptable.
Is there any way to dump the data from the original remaining 40 nodes?
For example, could we read the data from sqlite directly, fix up the <key, value> data, and load it into a new FDB cluster?

mengxu · January 16, 2022, 6:47am

Which FDB version are you using?
What configuration did you use? (triple ssd?)
Did you configure HA (aka fearless) for this cluster?
How much data loss can you accept? (If you configure a new database, all data will be lost, but the cluster will come back. I guess that’s not what you want.)

If this is a test cluster, you can try the fdbcli command force_recovery_with_data_loss. It drops the tLog data and tries to bring the cluster back. Its implementation is at:

https://github.com/apple/foundationdb/blob/de39293b8d6033c23702e0cd920eb6b09a16eb79/fdbcli/fdbcli.actor.cpp#L2053-L2058

    if (tokencmp(tokens[0], "force_recovery_with_data_loss")) {
        bool _result = wait(makeInterruptable(forceRecoveryWithDataLossCommandActor(db, tokens)));
        if (!_result)
            is_error = true;
        continue;
    }
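As a rough illustration of how that command might be issued non-interactively, here is a minimal Python sketch. It assumes `fdbcli` is on the PATH and that the command takes a datacenter ID argument; the cluster file path and the `dc1` ID are placeholders, not values from this thread. It defaults to a dry run that only prints the command line, since the operation discards data.

```python
import shlex
import subprocess
from typing import List


def build_force_recovery_cmd(cluster_file: str, dcid: str) -> List[str]:
    """Build argv for a one-shot fdbcli call (cluster_file and dcid are placeholders)."""
    return [
        "fdbcli",
        "-C", cluster_file,  # cluster file to connect with
        "--exec", f"force_recovery_with_data_loss {dcid}",
    ]


def run(cmd: List[str], dry_run: bool = True) -> str:
    """Return the shell-quoted command on a dry run, otherwise fdbcli's stdout."""
    if dry_run:
        return shlex.join(cmd)
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout


if __name__ == "__main__":
    cmd = build_force_recovery_cmd("/etc/foundationdb/fdb.cluster", "dc1")
    print(run(cmd))  # dry run: prints the command instead of executing it
```

Passing `dry_run=False` would actually execute the command, so it is worth reviewing the printed line first and keeping a copy of the remaining data directories.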

ppggff · January 16, 2022, 7:15am

@mengxu Thanks a lot. This is the same problem as in the topic “Recovery from lost all transaction node”.
We will try it.

ppggff · January 27, 2022, 2:48pm

I created a PR to dump data directly from the storage files (sqlite). I hope it will help to recover as much data as possible.
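For the second half of that plan, loading recovered <key, value> pairs into a new cluster, a minimal Python sketch might look like the following. It is not part of the PR: the `batches` and `load` helpers are hypothetical names, the 1 MB batch size is an arbitrary conservative choice, and `load` assumes the standard `fdb` Python bindings with `fdb.api_version(...)` already called by the caller. The size limits reflect FoundationDB's documented maximums for keys, values, and transactions.

```python
from typing import Iterable, Iterator, List, Tuple

KV = Tuple[bytes, bytes]

MAX_KEY = 10_000         # FoundationDB key size limit in bytes
MAX_VALUE = 100_000      # FoundationDB value size limit in bytes
BATCH_BYTES = 1_000_000  # assumed batch size, well below the 10 MB transaction limit


def batches(pairs: Iterable[KV], limit: int = BATCH_BYTES) -> Iterator[List[KV]]:
    """Group pairs so each batch holds at most `limit` bytes of keys plus values."""
    batch: List[KV] = []
    size = 0
    for key, value in pairs:
        if len(key) > MAX_KEY or len(value) > MAX_VALUE:
            raise ValueError(f"pair too large for FoundationDB: key={key[:32]!r}")
        n = len(key) + len(value)
        if batch and size + n > limit:
            yield batch
            batch, size = [], 0
        batch.append((key, value))
        size += n
    if batch:
        yield batch


def load(db, pairs: Iterable[KV]) -> int:
    """Write pairs to `db` (assumed to come from fdb.open()), one transaction per batch."""
    import fdb  # imported here so the batching helper works without the bindings

    @fdb.transactional
    def write(tr, batch):
        for key, value in batch:
            tr[key] = value

    written = 0
    for batch in batches(pairs):
        write(db, batch)
        written += len(batch)
    return written
```

Splitting the load into small transactions keeps each commit under the size limit and lets an interrupted load resume from the last committed batch, provided the pairs are fed in a stable order.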
