Full disk on one machine results in 99% performance degradation

This is the intended behavior of the cluster. As processes run out of space, the cluster gradually stops handing out read versions, which sharply limits how many new transactions can start. This mechanism (known as ratekeeper) is the same one used to slow down clients when they are saturating the cluster. In this case, we are attempting to protect a relatively small amount of space on each disk (the larger of 5% or 100MB), which makes recovery much easier than if a disk fills completely.
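To make that threshold concrete, here is a small Python sketch of the reserve calculation described above (the function names are illustrative, not part of the FoundationDB API):

```python
# Illustrative only: the "larger of 5% or 100MB" reserve described above.
MIN_RESERVE_BYTES = 100 * 1024 * 1024  # 100MB floor

def protected_bytes(disk_total_bytes: int) -> int:
    """Space ratekeeper tries to keep free on a storage disk."""
    return max(int(disk_total_bytes * 0.05), MIN_RESERVE_BYTES)

def would_throttle(disk_total_bytes: int, disk_free_bytes: int) -> bool:
    """True once free space has dipped into the protected reserve."""
    return disk_free_bytes <= protected_bytes(disk_total_bytes)

# Example: a 500GB disk reserves 25GB (5% > 100MB), so throttling begins
# once free space drops to ~25GB rather than waiting for a truly full disk.
print(protected_bytes(500 * 1024**3) / 1024**3)           # ~25.0
print(would_throttle(500 * 1024**3, 20 * 1024**3))        # True
```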

In the case that only one process is running out of space, the cluster still reacts the same way. This is in part because our data distribution process tries to keep roughly the same amount of data on every storage server. You could imagine data distribution being more sophisticated and assigning data based on how much space is available, and perhaps one day it will, but in that scenario you'd also probably want to be more careful about keeping the load appropriately leveled across processes.
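A rough way to see why one nearly full disk gates the whole cluster under even data distribution: the usable headroom is roughly the smallest per-server free space times the number of servers. A purely illustrative back-of-the-envelope sketch (it ignores replication and other details):

```python
# Rough illustration: with even data distribution, the storage server with the
# least headroom caps how much more data the cluster can accept.
def usable_headroom_bytes(free_bytes_per_server: list[int]) -> int:
    """Approximate additional data the cluster can take before the fullest
    server runs out, assuming new data spreads evenly across servers."""
    return min(free_bytes_per_server) * len(free_bytes_per_server)

# Four servers with plenty of room plus one nearly full server: the cluster's
# effective headroom is dominated by the nearly full one.
free = [400 * 1024**3] * 4 + [2 * 1024**3]
print(usable_headroom_bytes(free) / 1024**3)  # 10.0 GB, not ~1.6 TB
```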

It’s also possible that the cluster could proactively remove a process that has run out of disk space when the rest of the cluster has ample space available. I think there’s some desire to add a general feature like this (i.e. giving the cluster a budget to remove processes that are misbehaving for some reason), and running out of disk space may be a good candidate for one of those reasons.