We had a 40-node cluster that we expanded to 60 nodes. The 20 additional nodes
include 8 tlog nodes (it’s all tlogs) and 12 storage nodes.
During the rebalance that followed the expansion (20% of the data transferred), the 20 new nodes were
accidentally lost: their processes were killed and their data directories were deleted.
Now, restarting the cluster reports “Locking coordination state” and hangs.
In this case:
Is there any way to start the cluster? Data loss is acceptable.
Is there any way to dump the data from the original remaining 40 nodes?
For example, by reading the data from SQLite directly, so that we can fix the <key, value> data and load it into a new FDB cluster.
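To make the "load into a new FDB cluster" step concrete, here is a minimal sketch of what the loading side might look like with the Python binding. It assumes the <key, value> pairs have already been recovered as bytes (getting them out of the storage files is not covered here). The names batch_pairs, load_into_fdb and MAX_BATCH_BYTES are hypothetical, and the API version 630 and the 1 MB batch budget are assumptions to adjust for the installed client.

```python
from typing import Iterable, Iterator, List, Tuple

KV = Tuple[bytes, bytes]

# Assumed per-transaction budget. FoundationDB rejects transactions larger
# than 10,000,000 bytes, so this value stays well below that limit.
MAX_BATCH_BYTES = 1_000_000


def batch_pairs(pairs: Iterable[KV],
                max_bytes: int = MAX_BATCH_BYTES) -> Iterator[List[KV]]:
    """Group pairs into batches whose key+value bytes stay within max_bytes.

    A single pair larger than max_bytes is emitted alone in its own batch.
    """
    batch: List[KV] = []
    size = 0
    for key, value in pairs:
        pair_size = len(key) + len(value)
        # Close the current batch before it would exceed the budget.
        if batch and size + pair_size > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append((key, value))
        size += pair_size
    if batch:
        yield batch


def load_into_fdb(pairs: Iterable[KV], api_version: int = 630) -> int:
    """Write recovered pairs into the cluster named by the default cluster file.

    Returns the number of pairs written. The API version is an assumption.
    """
    import fdb  # imported lazily so batch_pairs works without the client

    fdb.api_version(api_version)
    db = fdb.open()

    @fdb.transactional
    def write_batch(tr, batch):
        # One transaction per batch; retried automatically on conflicts.
        for key, value in batch:
            tr[key] = value

    total = 0
    for batch in batch_pairs(pairs):
        write_batch(db, batch)
        total += len(batch)
    return total
```

Calling load_into_fdb(recovered_pairs) on a machine whose default cluster file points at the new cluster would write everything in size-bounded transactions. Batching matters here because one huge transaction would be rejected by the cluster.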
Which FDB version are you using?
What configuration did you use? (triple ssd?)
Did you configure HA (aka fearless) for this cluster?
How much data loss can you accept? (If you configure a new db, all data will be lost, but the cluster will be back. I guess that’s not what you want.)
If this is a test cluster, you can try force_recovery_with_data_loss in fdbcli. It will drop the tLog data and try to get the cluster back. Its implementation is at: