Are there any 'likely problems' folks can point me to that would result in a storage role exceeding its 8GB RAM allocation and being killed?

I’m trying to avoid asking a question that’s an XY problem here.

Our engineers managed to take down our dev/testing FDB (7.3, redwood) cluster over the weekend.

They wanted a large volume of data of the sort that would normally be produced by our application over a number of months, to test how well some new components of the system scale beyond small datasets. So they wrote a job to generate a high volume of data as fast as possible. This meant creating a reasonably large number of initial resources, pushing requests that referenced those resources through the applications at a much higher rate per second than we see in prod to generate additional data, and running various normally-nightly tasks more frequently as well. The result was our applications doing a much higher volume of reads (mostly) and writes to the cluster, to the point that they hit the IOPS limits on the volumes we have deployed for each storage process.

At this point, we started to see our storage processes crashing and being restarted by fdbmonitor, but failing to stay up. The output in the logs was ERROR: Out of memory. This seems to have come from FDB itself exceeding the 8GB memory allocation it has by default, rather than the underlying instances running out of RAM and the kernel killing the processes. We didn’t see any noticeable change/problem in our metrics for the storage process data or durability lags, or the log queue length.

We got it to stabilise by shutting down the data-generation jobs and increasing the memory and cache-memory config options for the storage processes, but my understanding is that needing to do this is generally a bit of a code smell: a sign that we’re doing something wrong/unexpected in how we interact with FDB.

I’m trying to understand more about why the storage nodes would have grown beyond their normal RAM allocation, and then OOMed.

My current working theory:

  • Our application is using JVM virtual threads instead of a thread pool to run a lot of the system (IIRC the FDB JNI interaction pins virtual threads, so it actually spawns a new ‘real’ thread each time specifically for the FDB tx, via .[read|run]Async returning a CompletableFuture).
  • Several of these data-generation jobs do a lot of parallelism to try and complete ASAP.
  • So they don’t have a fixed pool of ‘worker threads’ that will be slowed down if individual transactions slow down. The only limit on the count of in-progress requests is the memory limit on the job (see the sketch after this list).
  • As such, if the FDB ratekeeper slows down tx creation to try and force the application to back off, the result will be more txes in flight simultaneously (potentially taking longer to start/return), rather than a fixed max count of in-flight txes being artificially slowed to give the DB more ‘breathing room’.
  • It doesn’t look like FDB will ever say “I can’t handle any more txes right now, go away”, it will just try and slow down tx creation to exert backpressure.
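To make that concrete, here is a minimal Java sketch (not our actual job code; the key names, payload, and MAX_IN_FLIGHT value are made up) of the kind of cap on in-flight transactions we don’t currently have. With the semaphore, ratekeeper slowing transaction starts also slows submission; without it, slower starts just mean more transactions in flight at once.

```java
import com.apple.foundationdb.Database;
import com.apple.foundationdb.FDB;
import com.apple.foundationdb.tuple.Tuple;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;

public class BoundedGenerator {
    // Illustrative cap on concurrent transactions; the right value is workload-specific.
    private static final int MAX_IN_FLIGHT = 64;

    public static void main(String[] args) throws InterruptedException {
        FDB fdb = FDB.selectAPIVersion(730);
        try (Database db = fdb.open()) {
            Semaphore inFlight = new Semaphore(MAX_IN_FLIGHT);
            for (int i = 0; i < 1_000_000; i++) {
                // Block here once MAX_IN_FLIGHT transactions are outstanding, so
                // slower transaction starts translate into a slower submission rate
                // rather than an ever-growing set of in-flight transactions.
                inFlight.acquire();
                final long n = i;
                db.runAsync(tr -> {
                    tr.set(Tuple.from("generated", n).pack(),
                           Tuple.from("placeholder-payload").pack());
                    return CompletableFuture.completedFuture((Void) null);
                }).whenComplete((result, error) -> inFlight.release());
            }
            // Drain the remaining in-flight transactions before closing the database.
            inFlight.acquire(MAX_IN_FLIGHT);
        }
    }
}
```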

The challenge to this theory from within my team was that the ratekeeper slowing tx startup as a backpressure mechanism would surely show up as increased memory (tracking more ‘starting’ txes) on the commit proxies or similar, not as an issue on the storage nodes. I also wondered if there was some upper limit to how much the ratekeeper would artificially slow tx start.

So, does my theory sound sensible? Am I completely off-base (quite possible)? If so, is there something else I should be looking at to help understand what caused this so we can try to avoid anything similar in prod?

Ratekeeper will throttle normal (non-“system_priority_immediate”) transactions all the way to 0 under the right conditions, such as more than 2 Storage Queues growing beyond something like 1.8GB.

Based on:

“We didn’t see any noticeable change/problem in our metrics for the storage process data or durability lags, or the log queue length.”

it sounds like that did not happen here, because the cluster was still working at some non-zero transaction rate despite all the StorageServer restarts caused by OOMs. This is not unexpected, because Redwood is very fast at recovering from disk and returning to normal write throughput. The return-to-performance time will be especially short if the writes have high key locality.
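If you want to confirm this the next time it happens, you can poll the cluster’s status JSON while the workload runs and watch cluster.qos.performance_limited_by plus the per-storage-role input_bytes/durable_bytes counters (their difference is the storage queue). Here is a rough Java sketch; it reads the same document that fdbcli’s “status json” returns, via the \xff\xff/status/json key, and leaves JSON parsing to whatever library you already use.

```java
import com.apple.foundationdb.Database;
import com.apple.foundationdb.FDB;
import java.nio.charset.StandardCharsets;

public class StatusProbe {
    public static void main(String[] args) {
        FDB fdb = FDB.selectAPIVersion(730);
        try (Database db = fdb.open()) {
            // "\xff\xff/status/json" as raw bytes. Java has no \x escapes, so
            // encode the two 0xFF prefix bytes via ISO-8859-1.
            byte[] statusKey = "\u00ff\u00ff/status/json".getBytes(StandardCharsets.ISO_8859_1);
            String statusJson = db.run(tr -> {
                byte[] value = tr.get(statusKey).join();
                return new String(value, StandardCharsets.UTF_8);
            });
            // Feed this into a JSON library and watch cluster.qos.performance_limited_by
            // plus, for each storage role, input_bytes.counter - durable_bytes.counter
            // (the storage queue size).
            System.out.println(statusJson);
        }
    }
}
```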

As for the default memory configuration, it’s probably the case that large-scale FDB users have paid little attention to the defaults for several years, while StorageServer memory usage outside of the storage engine’s page cache has grown. FDB does not currently limit its memory usage in reaction to its current usage vs. its budget, so it is essentially up to the user to set the cache-memory and memory limits.

  • The cache-memory option sets the page cache size for the Redwood and ssd-2 storage engines. They will reliably stick to this limit aside from an occasional tiny overage when too many page eviction attempts encounter temporarily pinned pages.
  • The memory option sets the total memory usage (specifically RSS, not virtual memory) for the process. For a Storage class process, this setting must be large enough to accommodate the sum of:
    • cache-memory
    • Storage engine memory usage aside from its page cache (such as temporary memory used by reads or pending writes)
    • StorageServer memory other than the storage engine memory listed above.
      StorageServer memory will vary based on its current user workload, shard movement activity, and logical data size.

I don’t think there is any documentation describing how to arrive at what memory should be relative to cache-memory, and in fact in the fleets I’ve been involved with we have updated these settings occasionally based on observed memory usage. As a general rule, I think setting memory to (1.5 * cache-memory + 4GB) would be a stable configuration.
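As a purely illustrative example, applying that rule to a storage process with a 16GiB page cache would look something like the stanza below in foundationdb.conf (sizes are examples only; double-check the option spellings against the sample config shipped with your version).

```
# Example only: storage-class process with a 16GiB page cache,
# memory sized as roughly 1.5 * cache-memory + 4GiB = 28GiB.
[fdbserver.4500]
class = storage
cache-memory = 16GiB
memory = 28GiB
```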
