Hi there,
I’m doing some firefighting drills of “full cluster” scenarios using FDB 7.3.57 in a staging environment. Small test cluster:
- 3 physical nodes, 3 NVMe drives per node for storage,
- fdb-logs and OS on separate drive
- set for double-replication,
- redwood engine
Scenario 1:
- fill up cluster completely
- writes and reads hang (clients hang)
- initiate “rescue drill”
We used knob_min_available_space_ratio=0.001 to get fdbcli to come back alive, but...
We forgot to turn off the write workload, and it continued to fill up the cluster. And...
Due to a misconfiguration we filled up node-A's OS drive, which put the node offline and unresponsive.
We now can’t find a way of making the cluster respond to any command.
We have different types of data, and we know we have a few options to try:
- reduce redundancy to single
- clearrange on data we’re willing to drop (and restore from backup)
We don’t want to focus on fixing node-A, because this accidental loss is a realistic event: we could lose a node to hardware failure. The goal here is to increase our FoundationDB knowledge and learn how to operate it under unlikely/bad scenarios as well.
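For concreteness, the two options above could be scripted as one-shot fdbcli runs. This is only a sketch: the key range "scratch/" to "scratch0" is a hypothetical placeholder for data we would be willing to drop, the 30-second timeout is arbitrary, and the script only builds and prints the command lines without running them. Both commands still need the cluster to accept commits, which is exactly what fails in our current state.

```python
import shlex

# Cluster file path as used in the fdbcli sessions in this post.
CLUSTER_FILE = "/etc/foundationdb/fdb.cluster"


def fdbcli_argv(commands, timeout_s=30, cluster_file=CLUSTER_FILE):
    """Build an argv list for a one-shot fdbcli run; nothing is executed here."""
    return [
        "fdbcli",
        "-C", cluster_file,
        "--timeout", str(timeout_s),      # give up instead of hanging forever
        "--exec", "; ".join(commands),    # fdbcli separates commands with ';'
    ]


# Option 1: drop to single replication to free space.
reduce_redundancy = fdbcli_argv(["configure single"])

# Option 2: clear a range we can restore from backup.
# "scratch/" .. "scratch0" is a HYPOTHETICAL prefix range ('0' sorts right after '/').
drop_range = fdbcli_argv(["writemode on", 'clearrange "scratch/" "scratch0"'])

for argv in (reduce_redundancy, drop_range):
    print(shlex.join(argv))
```

Printing the commands first makes it easy to review them before pasting them into a shell on one of the surviving nodes.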
However, we can’t get fdbcli to work at all:
# fdbcli
Using cluster file `/etc/foundationdb/fdb.cluster'.
The database is unavailable; type `status' for more information.
Welcome to the fdbcli. For help, type `help'.
fdb> coordinators
Cluster description: b6NBkaFl
Cluster coordinators (3): 10.100.196.187:4500,10.100.196.188:4500,10.100.36.123:4500
Type `help coordinators' to learn how to change this information.
fdb> status details
WARNING: Long delay (Ctrl-C to interrupt)
Using cluster file `/etc/foundationdb/fdb.cluster'.
Could not communicate with all of the coordination servers.
The database will remain operational as long as we
can connect to a quorum of servers, however the fault
tolerance of the system is reduced as long as the
servers remain disconnected.
10.100.36.123:4500 (unreachable)
10.100.196.188:4500 (reachable)
10.100.196.187:4500 (reachable)
Timed out fetching cluster status.
Configuration:
Redundancy mode - unknown
Storage engine - unknown
Log engine - unknown
Encryption at-rest - disabled
Coordinators - unknown
Usable Regions - unknown
Cluster:
FoundationDB processes - unknown
Zones - unknown
Machines - unknown
Data:
Replication health - unknown
Moving data - unknown
Sum of key-value sizes - unknown
Disk space used - unknown
Operating space:
Unable to retrieve operating space status
Workload:
Read rate - unknown
Write rate - unknown
Transactions started - unknown
Transactions committed - unknown
Conflict rate - unknown
Backup and DR:
Running backups - 0
Running DRs - 0
Process performance details:
Client time: 05/14/25 16:19:41
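To double-check the coordinator situation, the reachability lines from the output above can be parsed with a small script. This is a minimal sketch that assumes plain IPv4 address:port lines exactly as printed here (it would not match addresses with a ":tls" suffix). It confirms that 2 of 3 coordinators are reachable, which is a majority, so the missing coordinator alone should not explain why the database is unavailable.

```python
import re

# Matches lines such as "10.100.36.123:4500 (unreachable)" from `status details`.
COORD_LINE = re.compile(r"^\s*(\S+:\d+)\s+\((reachable|unreachable)\)\s*$")


def coordinator_quorum(status_text):
    """Return (reachable, total, has_quorum) parsed from fdbcli status text."""
    states = {}
    for line in status_text.splitlines():
        m = COORD_LINE.match(line)
        if m:
            states[m.group(1)] = (m.group(2) == "reachable")
    total = len(states)
    reachable = sum(states.values())
    # A quorum is a strict majority of the configured coordinators.
    return reachable, total, total > 0 and reachable > total // 2


sample = """\
10.100.36.123:4500 (unreachable)
10.100.196.188:4500 (reachable)
10.100.196.187:4500 (reachable)
"""
print(coordinator_quorum(sample))  # → (2, 3, True)
```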
Does anyone have tips on how to get fdbcli to operate on a “limping” cluster that needs operator rescue?
I’ve seen references to PRIORITY_SYSTEM_IMMEDIATE in the documentation, which seem to hint that I should be able to start fdbcli using this priority for all transactions, but I can’t figure out how. I’ve tried this:
fdb> option on PRIORITY_SYSTEM_IMMEDIATE
Option enabled for all transactions
fdb> begin
Transaction started
fdb> status details
WARNING: Long delay (Ctrl-C to interrupt)
Using cluster file `/etc/foundationdb/fdb.cluster'.
Could not communicate with all of the coordination servers.
The database will remain operational as long as we
can connect to a quorum of servers, however the fault
tolerance of the system is reduced as long as the
servers remain disconnected.
10.100.36.123:4500 (unreachable)
10.100.196.188:4500 (reachable)
10.100.196.187:4500 (reachable)
Timed out fetching cluster status.
Configuration:
Redundancy mode - unknown
Storage engine - unknown
Log engine - unknown
Encryption at-rest - disabled
...
The rest is similar to the output above. Running status details under PRIORITY_SYSTEM_IMMEDIATE clearly didn’t work.
Thanks in advance. All help is welcome.