How to save partial data after losing all tlog nodes and some storage nodes

We had a 40-node cluster that was expanded to 60 nodes. The 20 additional nodes
included 8 tlog nodes (all of the cluster's tlogs) and 12 storage nodes.

During the rebalance that followed the expansion (about 20% of the data had been
transferred), the 20 new nodes were accidentally lost: their processes were killed
and their data directories were deleted.

Restarting the cluster now reports “Locking coordination state.” and hangs.

In this case:

  1. Is there any way to start the cluster? Some data loss is acceptable.
  2. Is there any way to dump the data from the original remaining 40 nodes?
    For example, by reading the data from sqlite directly, so that we can fix up the <key, value> data and load it into a new FDB cluster.

Which FDB version are you using?
What configuration did you use? (triple ssd?)
Did you configure HA (aka fearless) for this cluster?
How much data loss can you accept? (If you configure a new db, all data will be lost, but the cluster will be back. I guess that’s not what you want.)

If this is a test cluster, you can try the fdbcli command force_recovery_with_data_loss. It will drop the tLog data and try to bring the cluster back. Its implementation is at:

@mengxu Thanks a lot. This is the same problem as in "Recovery from lost all transaction node".
We will try it.

I created a PR to dump data directly from the storage files (sqlite). I hope it will help recover as much data as possible.
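
Once <key, value> pairs have been dumped and fixed up, they still need to be loaded into the new cluster. Below is a minimal sketch of that step, assuming the pairs are already available as a Python iterable of (key, value) bytes; how they are produced from the dump is not covered here. The names `batches`, `load`, and `BATCH_BYTES` are hypothetical, and the 1 MB per-transaction target is my own assumption chosen to stay well under FoundationDB's documented limits (keys up to 10,000 bytes, values up to 100,000 bytes, transactions up to 10,000,000 bytes). The `api_version` default is also an assumption and should match the installed client.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

KV = Tuple[bytes, bytes]

# Documented FoundationDB size limits for a single key and a single value.
MAX_KEY_BYTES = 10_000
MAX_VALUE_BYTES = 100_000
# Assumed target size per transaction (not an FDB constant); kept far below
# the 10,000,000-byte transaction limit to leave room for overhead.
BATCH_BYTES = 1_000_000


def batches(pairs: Iterable[KV], limit: int = BATCH_BYTES) -> Iterator[List[KV]]:
    """Group pairs into lists whose combined key+value size stays within limit."""
    batch: List[KV] = []
    size = 0
    for key, value in pairs:
        if len(key) > MAX_KEY_BYTES or len(value) > MAX_VALUE_BYTES:
            raise ValueError(f"pair too large for FDB, key starts with {key[:32]!r}")
        pair_size = len(key) + len(value)
        # Close the current batch before it would exceed the limit.
        if batch and size + pair_size > limit:
            yield batch
            batch, size = [], 0
        batch.append((key, value))
        size += pair_size
    if batch:
        yield batch


def load(pairs: Iterable[KV], cluster_file: Optional[str] = None,
         api_version: int = 630) -> int:
    """Write all pairs to the cluster, one transaction per batch."""
    import fdb  # imported here so batches() can be used without the client

    fdb.api_version(api_version)
    db = fdb.open(cluster_file)

    @fdb.transactional
    def write(tr, batch: List[KV]) -> None:
        for key, value in batch:
            tr[key] = value

    total = 0
    for batch in batches(pairs):
        write(db, batch)  # retried automatically on retryable errors
        total += len(batch)
    return total
```

Each batch only sets keys, so a retried transaction writes the same values again and the load can be rerun safely from the start if it is interrupted.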
