Here’s what happens when I use fallocate to fill up a disk on one of the storage servers:
WARNING: Long delay (Ctrl-C to interrupt)
Using cluster file `/etc/foundationdb/fdb.cluster'.
Unable to start default priority transaction after 5 seconds.
Unable to start batch priority transaction after 5 seconds.
Unable to retrieve all status information.
Configuration:
  Redundancy mode - triple
  Storage engine - ssd-2
  Coordinators - 5
  Desired Proxies - 16
  Desired Logs - 3

Cluster:
  FoundationDB processes - 44
  Machines - 10
  Memory availability - 5.8 GB per process on machine with least available
  Retransmissions rate - 1 Hz
  Fault Tolerance - 2 machines
  Server time - 11/06/18 00:11:28

Data:
  Replication health - Healthy (Rebalancing)
  Moving data - 0.102 GB
  Sum of key-value sizes - 316.707 GB
  Disk space used - 1.844 TB

Operating space:
  Storage server - 0.0 GB free on most full server
  Log server - 574.4 GB free on most full server

Workload:
  Read rate - 227 Hz
  Write rate - 7 Hz
  Transactions started - 2 Hz
  Transactions committed - 1 Hz
  Conflict rate - 0 Hz
  Performance limited by process: Storage server running out of space (approaching 5% limit).
  Most limiting process: 10.50.139.81:4502

Backup and DR:
  Running backups - 0
  Running DRs - 0
As soon as I delete the file, everything goes back to normal:
Using cluster file `/etc/foundationdb/fdb.cluster'.
Configuration:
  Redundancy mode - triple
  Storage engine - ssd-2
  Coordinators - 5
  Desired Proxies - 16
  Desired Logs - 3

Cluster:
  FoundationDB processes - 44
  Machines - 10
  Memory availability - 5.8 GB per process on machine with least available
  Retransmissions rate - 92 Hz
  Fault Tolerance - 2 machines
  Server time - 11/06/18 00:12:16

Data:
  Replication health - Healthy (Rebalancing)
  Moving data - 0.051 GB
  Sum of key-value sizes - 316.717 GB
  Disk space used - 1.844 TB

Operating space:
  Storage server - 522.2 GB free on most full server
  Log server - 574.4 GB free on most full server

Workload:
  Read rate - 96497 Hz
  Write rate - 330 Hz
  Transactions started - 29585 Hz
  Transactions committed - 46 Hz
  Conflict rate - 1 Hz

Backup and DR:
  Running backups - 0
  Running DRs - 0
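From reading the docs, I assumed FoundationDB could withstand a failure like this, where a single storage server runs out of disk space. Instead, the whole cluster is throttled to almost nothing (read rate drops from 96497 Hz to 227 Hz) until the space is freed. Are we doing something wrong, or am I misreading the docs?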
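
For anyone trying to reproduce or watch for this, here is a minimal sketch that lists processes whose free disk fraction is at or below the 5% mentioned in the "Performance limited by" message. It is not part of the setup described above. It assumes the `status json` document exposes `cluster.processes.<id>.disk.free_bytes`, `disk.total_bytes` and `address`, and that the 5% limit refers to free space relative to total disk size. The helper names `low_space_processes` and `fetch_status` are made up for illustration.

```python
import json
import subprocess

# Mirrors the "approaching 5% limit" text in the status output (assumed to be
# free space as a fraction of total disk size).
MIN_FREE_FRACTION = 0.05


def low_space_processes(status, min_free_fraction=MIN_FREE_FRACTION):
    """Return (address, free_fraction) pairs at or below the threshold.

    Assumes the parsed status document has cluster.processes.<id>.disk with
    free_bytes and total_bytes, plus cluster.processes.<id>.address.
    """
    flagged = []
    processes = status.get("cluster", {}).get("processes", {})
    for proc in processes.values():
        disk = proc.get("disk", {})
        total = disk.get("total_bytes")
        free = disk.get("free_bytes")
        if not total or free is None:
            continue  # field missing or zero: skip rather than guess
        fraction = free / total
        if fraction <= min_free_fraction:
            flagged.append((proc.get("address", "unknown"), fraction))
    # Most constrained process first.
    return sorted(flagged, key=lambda item: item[1])


def fetch_status(cluster_file="/etc/foundationdb/fdb.cluster"):
    """Run `fdbcli --exec "status json"` and parse the result.

    If fdbcli prints warnings before the JSON (as in the degraded case above),
    the parse will fail and the output needs trimming first.
    """
    out = subprocess.run(
        ["fdbcli", "-C", cluster_file, "--exec", "status json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)


if __name__ == "__main__":
    for address, fraction in low_space_processes(fetch_status()):
        print(f"{address}: {fraction:.1%} free")
```

Run against the cluster while the fallocate file is in place, this should flag the same process that status reports as most limiting, provided the field names above match your FoundationDB version.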