Full disk on one machine results in 99% performance degradation

Here’s what happens when I use fallocate to fill up a disk on one of the storage servers:
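
(The fill itself was nothing more elaborate than preallocating a large file on the volume backing the FDB data directory; the path and size below are illustrative, not the exact values used.)

  # Example only: preallocate a big file to exhaust free space on the storage volume.
  fallocate -l 450G /var/lib/foundationdb/data/fill-disk.bin

With the disk full, fdbcli status reports: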

WARNING: Long delay (Ctrl-C to interrupt)

Using cluster file `/etc/foundationdb/fdb.cluster'.

Unable to start default priority transaction after 5 seconds.

Unable to start batch priority transaction after 5 seconds.

Unable to retrieve all status information.

Configuration:
  Redundancy mode        - triple
  Storage engine         - ssd-2
  Coordinators           - 5
  Desired Proxies        - 16
  Desired Logs           - 3

Cluster:
  FoundationDB processes - 44
  Machines               - 10
  Memory availability    - 5.8 GB per process on machine with least available
  Retransmissions rate   - 1 Hz
  Fault Tolerance        - 2 machines
  Server time            - 11/06/18 00:11:28

Data:
  Replication health     - Healthy (Rebalancing)
  Moving data            - 0.102 GB
  Sum of key-value sizes - 316.707 GB
  Disk space used        - 1.844 TB

Operating space:
  Storage server         - 0.0 GB free on most full server
  Log server             - 574.4 GB free on most full server

Workload:
  Read rate              - 227 Hz
  Write rate             - 7 Hz
  Transactions started   - 2 Hz
  Transactions committed - 1 Hz
  Conflict rate          - 0 Hz
  Performance limited by process: Storage server running out of space (approaching 5% limit).
  Most limiting process: 10.50.139.81:4502

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

As soon as I delete the file, everything goes back to normal:

Using cluster file `/etc/foundationdb/fdb.cluster'.

Configuration:
  Redundancy mode        - triple
  Storage engine         - ssd-2
  Coordinators           - 5
  Desired Proxies        - 16
  Desired Logs           - 3

Cluster:
  FoundationDB processes - 44
  Machines               - 10
  Memory availability    - 5.8 GB per process on machine with least available
  Retransmissions rate   - 92 Hz
  Fault Tolerance        - 2 machines
  Server time            - 11/06/18 00:12:16

Data:
  Replication health     - Healthy (Rebalancing)
  Moving data            - 0.051 GB
  Sum of key-value sizes - 316.717 GB
  Disk space used        - 1.844 TB

Operating space:
  Storage server         - 522.2 GB free on most full server
  Log server             - 574.4 GB free on most full server

Workload:
  Read rate              - 96497 Hz
  Write rate             - 330 Hz
  Transactions started   - 29585 Hz
  Transactions committed - 46 Hz
  Conflict rate          - 1 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Reading the docs, I would have assumed FoundationDB could withstand this kind of failure. Are we doing something wrong, or am I interpreting the docs incorrectly?

This is the intended behavior of the cluster. As processes run out of space, the cluster gradually stops handing out read versions, which significantly limits how many transactions can start. This mechanism (known as ratekeeper) is the same one used to slow down clients when they are saturating the cluster. In this case, we are attempting to protect a relatively small amount of space on the disks (the larger of 5% of the disk or 100 MB), which makes recovery much easier than if the disk were completely full.
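
In other words, the protected amount on a given data volume is max(5% of the volume, 100 MB). If you want to see what that works out to on one of your servers, a rough check looks like this (assuming GNU df; the data directory path is an example):

  DATA_DIR=/var/lib/foundationdb/data              # example path; use your own
  disk_bytes=$(df -B1 --output=size "$DATA_DIR" | tail -n1)
  floor=$((100 * 1024 * 1024))                     # the 100 MB floor
  five_pct=$((disk_bytes / 20))                    # 5% of the volume
  echo "protected free space: $(( five_pct > floor ? five_pct : floor )) bytes"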

Even when only one process is running out of space, the cluster reacts the same way. This is in part because our data distribution process tries to keep roughly the same amount of data on every storage server. You could imagine data distribution being more sophisticated and assigning data based on how much space each server has available, and perhaps one day it will, but in that scenario you’d also want to be careful that load stays appropriately leveled across the processes.

It’s also possible that the cluster could attempt to proactively remove a process that has run out of disk space if the rest of the cluster has ample space available. I think there’s some desire to add a general feature like this (i.e. giving the cluster a budget to remove processes that are behaving poorly for some reason), and this may be a good candidate for one of those reasons.

Thanks for the detailed answer! I’d put my vote in for ejecting processes that are out of space. Until then, it looks like we’ll have to set up a disk space monitor that can terminate the processes if the disk gets too close to full (something like the sketch below).
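
A minimal version of that monitor could be a cron’d shell check along these lines; the path, threshold, and kill step are placeholders rather than what we actually run:

  #!/bin/sh
  # Sketch of a low-disk watchdog for an FDB storage host (assumes GNU df).
  DATA_DIR=/var/lib/foundationdb/data   # volume backing the storage processes
  MIN_FREE_PCT=10                       # stay well above ratekeeper's 5% cutoff

  used_pct=$(df --output=pcent "$DATA_DIR" | tail -n1 | tr -dc '0-9')
  free_pct=$((100 - used_pct))

  if [ "$free_pct" -lt "$MIN_FREE_PCT" ]; then
      # Blunt approach: stop the local fdbserver processes before the disk hits zero.
      pkill -f fdbserver
  fi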

Is the difference here that you’re sharing disks with other processes, and have no way to carve off a portion of that disk for only FDB?

You might also wish to look at using the exclude command instead of kill. See removing machines from a cluster.
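
For reference, the exclusion itself is a one-liner from fdbcli; the address below is just the “most limiting process” from the status output above, used as an example:

  # Mark the process as excluded; data distribution will move its data elsewhere.
  fdbcli --exec 'exclude 10.50.139.81:4502'
  # Once the disk has been cleaned up, let it rejoin:
  fdbcli --exec 'include 10.50.139.81:4502'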

@alexmiller These are i3.xlarge EC2 instances, so they definitely have plenty of IOPS to keep FDB happy. The scenario that prompted this repro test is that backups filled a disk in production and knocked out the cluster. We could probably partition the backup storage away from the primary storage to make this less likely.

Ah, yes, I would recommend doing that if you can. Even moving it to a different volume on the same host would be good, as fdbserver basically assumes that it has the disk to itself.
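
If it helps, the separation can be as simple as dedicating a second volume to backup output; the device name, mount point, and backup URL below are examples, so check the fdbbackup docs for the exact invocation:

  # Put backup output on its own volume so a full backup disk can't starve
  # the fdbserver data directory.
  mkfs.ext4 /dev/nvme1n1                 # spare volume; device name is an example
  mkdir -p /mnt/fdb-backup
  mount /dev/nvme1n1 /mnt/fdb-backup

  # Point backups at the dedicated volume instead of the data disk.
  fdbbackup start -d file:///mnt/fdb-backup/mybackup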