Recovery/Reviving a Storage Full Cluster

Hi there,

I’m doing some firefighting drills of “full cluster” scenarios using FDB 7.3.57 in a staging environment. Small test cluster:

  • 3 physical nodes, 3 NVMe drives per node for storage,
  • fdb-logs and OS on a separate drive,
  • configured for double replication,
  • Redwood storage engine

Scenario 1:

  1. fill up cluster completely
  2. writes and reads hang (clients hang)
  3. initiate “rescue drill”

We used knob_min_available_space_ratio=0.001 to bring fdbcli back to life, but we forgot to turn off the write workload, so it kept filling up the cluster.
On top of that, due to a misconfiguration we also filled up node-A’s OS drive, which took that node offline and left it unresponsive.
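
For scale, here is a rough sketch of how much free space such a ratio corresponds to. The 1 TB drive size is a made-up figure, and the 0.05 value used for comparison is an assumption about the stock knob default, not something checked against 7.3.57. The function only multiplies ratio by capacity; the real server-side check may apply additional floors.

```python
def reserved_bytes(total_bytes: int, min_available_space_ratio: float) -> int:
    """Free space (in bytes) that the given ratio asks to keep on a drive."""
    return round(total_bytes * min_available_space_ratio)


DRIVE_BYTES = 1_000_000_000_000  # hypothetical 1 TB NVMe drive
ASSUMED_DEFAULT_RATIO = 0.05     # assumption: stock default, unverified
DRILL_RATIO = 0.001              # value used in the drill above

for label, ratio in (("assumed default", ASSUMED_DEFAULT_RATIO), ("drill", DRILL_RATIO)):
    gb = reserved_bytes(DRIVE_BYTES, ratio) / 1e9
    print(f"{label}: keep about {gb:.0f} GB free")
```

Under these assumptions the lowered ratio leaves very little headroom, which is why a write workload that keeps running can use it up again quickly.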

We now can’t find a way of making the cluster respond to any command.
We have different types of data and we know we have a few options to try:

  1. reduce redundancy to single
  2. clearrange on data we’re willing to drop (and restore from backup)

Some context:

  • We don’t want to focus on fixing node-A, because this accidental loss is a realistic event: we could lose a node to hardware failure.
  • The goal here is to build up FoundationDB knowledge and learn how to operate it under unlikely/bad scenarios too.
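
To make options 1 and 2 concrete, here is a minimal sketch that builds non-interactive fdbcli invocations for them. The helper name `fdbcli_argv` and the key range `scratch/` to `scratch0` are hypothetical; `-C`, `--exec`, `configure single`, `writemode on` and `clearrange` are the standard fdbcli options and commands. Both commands still need a database that accepts commits, which is exactly what is missing here, so this only spells out what would be run once fdbcli responds.

```python
import shlex
from typing import List, Optional


def fdbcli_argv(commands: List[str], cluster_file: Optional[str] = None) -> List[str]:
    """Build an argv list that runs the given fdbcli commands non-interactively."""
    argv = ["fdbcli"]
    if cluster_file is not None:
        argv += ["-C", cluster_file]
    # --exec takes a single string of semicolon-separated commands.
    argv += ["--exec", "; ".join(commands)]
    return argv


# Option 1: drop from double to single replication.
reduce_redundancy = fdbcli_argv(["configure single"])

# Option 2: clear a range we are willing to lose.
# "scratch/" and "scratch0" are hypothetical keys covering the prefix "scratch/".
drop_range = fdbcli_argv(["writemode on", "clearrange scratch/ scratch0"])

for argv in (reduce_redundancy, drop_range):
    print(shlex.join(argv))
```

Each list can be handed to `subprocess.run(argv, timeout=...)` so that a hung fdbcli does not block the rescue script indefinitely.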

However, we can’t get fdbcli to work at all:

# fdbcli
Using cluster file `/etc/foundationdb/fdb.cluster'.

The database is unavailable; type `status' for more information.

Welcome to the fdbcli. For help, type `help'.
fdb> coordinators
Cluster description: b6NBkaFl
Cluster coordinators (3): 10.100.196.187:4500,10.100.196.188:4500,10.100.36.123:4500
Type `help coordinators' to learn how to change this information.
fdb> status details

WARNING: Long delay (Ctrl-C to interrupt)

Using cluster file `/etc/foundationdb/fdb.cluster'.

Could not communicate with all of the coordination servers.
  The database will remain operational as long as we
  can connect to a quorum of servers, however the fault
  tolerance of the system is reduced as long as the
  servers remain disconnected.

  10.100.36.123:4500  (unreachable)
  10.100.196.188:4500  (reachable)
  10.100.196.187:4500  (reachable)

Timed out fetching cluster status.

Configuration:
  Redundancy mode        - unknown
  Storage engine         - unknown
  Log engine             - unknown
  Encryption at-rest     - disabled
  Coordinators           - unknown
  Usable Regions         - unknown

Cluster:
  FoundationDB processes - unknown
  Zones                  - unknown
  Machines               - unknown

Data:
  Replication health     - unknown
  Moving data            - unknown
  Sum of key-value sizes - unknown
  Disk space used        - unknown

Operating space:
  Unable to retrieve operating space status

Workload:
  Read rate              - unknown
  Write rate             - unknown
  Transactions started   - unknown
  Transactions committed - unknown
  Conflict rate          - unknown

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Process performance details:

Client time: 05/14/25 16:19:41 
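
One thing the output above does show is the coordinator state. Coordinators work by majority, and 2 of 3 reachable is still a majority, which suggests the lost coordinator by itself is not what keeps the database down. A small sketch that extracts this from pasted status text (the function names are made up for illustration, and the regular expression assumes the line layout shown above):

```python
import re
from typing import Dict

# Matches lines such as "  10.100.36.123:4500  (unreachable)".
COORD_LINE = re.compile(r"^\s*(\S+:\d+)\s+\((reachable|unreachable)\)\s*$")


def coordinator_reachability(status_text: str) -> Dict[str, bool]:
    """Map each coordinator address found in the text to True if reachable."""
    reach = {}
    for line in status_text.splitlines():
        match = COORD_LINE.match(line)
        if match:
            reach[match.group(1)] = match.group(2) == "reachable"
    return reach


def has_quorum(reach: Dict[str, bool]) -> bool:
    """True when strictly more than half of the coordinators are reachable."""
    return sum(reach.values()) > len(reach) // 2


STATUS_EXCERPT = """\
  10.100.36.123:4500  (unreachable)
  10.100.196.188:4500  (reachable)
  10.100.196.187:4500  (reachable)
"""

reach = coordinator_reachability(STATUS_EXCERPT)
print(f"{sum(reach.values())}/{len(reach)} reachable, quorum: {has_quorum(reach)}")
```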

Does anyone have tips on how to get fdbcli to operate on a “limping” cluster that needs operator rescue?
I’ve seen references to PRIORITY_SYSTEM_IMMEDIATE in the documentation, which seem to hint that I should be able to start fdbcli with this priority for all transactions, but I can’t figure out how. I’ve tried this:


fdb> option on PRIORITY_SYSTEM_IMMEDIATE
Option enabled for all transactions
fdb> begin
Transaction started
fdb> status details

WARNING: Long delay (Ctrl-C to interrupt)

Using cluster file `/etc/foundationdb/fdb.cluster'.

Could not communicate with all of the coordination servers.
  The database will remain operational as long as we
  can connect to a quorum of servers, however the fault
  tolerance of the system is reduced as long as the
  servers remain disconnected.

  10.100.36.123:4500  (unreachable)
  10.100.196.188:4500  (reachable)
  10.100.196.187:4500  (reachable)

Timed out fetching cluster status.

Configuration:
  Redundancy mode        - unknown
  Storage engine         - unknown
  Log engine             - unknown
  Encryption at-rest     - disabled
 ... 

This is similar to the output above, so running status details under PRIORITY_SYSTEM_IMMEDIATE clearly didn’t help.

Thanks in advance, all help is welcome

Have you checked the fdbcli log at the point where it hangs? That might provide some additional hints. PRIORITY_SYSTEM_IMMEDIATE is already used by most fdbcli calls.
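
If it helps, fdbcli only writes trace files when started with `--log` (optionally with `--log-dir PATH`). Below is a minimal sketch for pulling the higher-severity events out of such a file. It assumes the default XML trace format with one `<Event ... />` element per line; the function name and the sample events are made up for illustration and are not real fdbcli output.

```python
import re
from typing import Iterable, Iterator, Tuple

SEVERITY = re.compile(r'Severity="(\d+)"')
EVENT_TYPE = re.compile(r'\bType="([^"]*)"')


def notable_events(lines: Iterable[str], min_severity: int = 30) -> Iterator[Tuple[int, str]]:
    """Yield (severity, type) for trace events at or above min_severity."""
    for line in lines:
        severity = SEVERITY.search(line)
        event_type = EVENT_TYPE.search(line)
        if severity and event_type and int(severity.group(1)) >= min_severity:
            yield int(severity.group(1)), event_type.group(1)


# Made-up sample lines in the assumed format, not real fdbcli output.
SAMPLE_TRACE = [
    '<Event Severity="10" Time="1.0" Type="HypotheticalInfoEvent" />',
    '<Event Severity="20" Time="2.0" Type="HypotheticalWarnEvent" />',
    '<Event Severity="40" Time="3.0" Type="HypotheticalErrorEvent" />',
]

for severity, event_type in notable_events(SAMPLE_TRACE):
    print(severity, event_type)
```

For a real file, pass an open file object instead of the sample list, and lower `min_severity` to 20 to include warnings as well.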
