I’m trying to avoid asking a question that’s an XY problem here.
Our engineers managed to take down our dev/testing FDB (7.3, redwood) cluster over the weekend.
They wanted a large volume of data of the sort our application would normally produce over a number of months, to test how well some new components of the system scale beyond small datasets. So they wrote a job to generate a high volume of data as fast as possible: creating a reasonably large number of initial resources, pushing requests that referenced those resources through the applications at a much higher rate per second than we see in prod to generate additional data, and running various (normally nightly) tasks more frequently too. The result was our applications doing a much higher volume of reads (mostly) and writes against the cluster, to the point that they hit the IOPS limits on the volumes we have deployed for each storage process.
At this point, we started to see our storage processes crashing and being restarted by `fdbmonitor`, but failing to stay up stably. The output in the logs was `ERROR: Out of memory`. This seems to have come from FDB itself exceeding the 8GB memory allocation it has by default, rather than the underlying instances running out of RAM and being killed by the kernel. We didn't see any noticeable change/problem in our metrics for the storage process data or durability lags, or the log queue length.
We got it to stabilise by shutting down the data-generation jobs and increasing the `memory` and `cache-memory` config options for the storage processes, but my understanding is that having to do this is generally a bit of a code smell: it suggests we're doing something wrong/unexpected in how we interact with FDB.
I’m trying to understand more about why the storage nodes would have grown beyond their normal RAM allocation, and then OOMed.
My current working theory:
- Our application is using JVM virtual threads instead of a thread pool to run a lot of the system (IIRC the FDB JNI interaction pins virtual threads, so it actually spawns a new 'real' thread each time specifically for the FDB tx, via `.readAsync`/`.runAsync` returning a `CompletableFuture`).
- Several of these data-generation jobs do a lot of parallelism to try and complete ASAP.
- So they don’t have a fixed pool of ‘worker threads’ that will be slowed down if individual transactions slow down. The only limit on the count of in-progress requests is the memory limit on the job.
- As such, if the FDB ratekeeper slows down tx creation to try and force the application to back off, the result will be more txes in flight simultaneously (potentially taking longer to start/return), rather than a fixed max count of in-flight txes being artificially slowed to give the DB more 'breathing room' (see the sketch after this list).
- It doesn't look like FDB will ever say "I can't handle any more txes right now, go away"; it will just try to slow down tx creation to exert backpressure.
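To make those last two bullets concrete, here's a minimal sketch of the kind of bound we don't currently have, assuming the FDB Java bindings and Java 21 virtual threads (the class name, `MAX_IN_FLIGHT`, and the keys/values are made up for illustration). A plain `Semaphore` caps the number of in-flight transactions, so when ratekeeper slows tx starts the submitting loop actually blocks instead of piling up more outstanding txes:

```java
import com.apple.foundationdb.Database;
import com.apple.foundationdb.FDB;
import com.apple.foundationdb.tuple.Tuple;

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

public class BoundedLoadGen {
    // Made-up cap; without something like this the only limit is the job's heap.
    private static final int MAX_IN_FLIGHT = 64;

    public static void main(String[] args) throws Exception {
        FDB fdb = FDB.selectAPIVersion(730);
        try (Database db = fdb.open();
             ExecutorService workers = Executors.newVirtualThreadPerTaskExecutor()) {

            Semaphore inFlight = new Semaphore(MAX_IN_FLIGHT);

            for (long i = 0; i < 1_000_000; i++) {
                long id = i;
                // Blocks the submitting loop once MAX_IN_FLIGHT transactions are
                // outstanding, so ratekeeper slowing tx starts translates into
                // real backpressure instead of more in-flight txes.
                inFlight.acquire();
                workers.submit(() -> {
                    CompletableFuture<Void> txn = db.runAsync(tr -> {
                        // Illustrative write; keys/values are made up.
                        tr.set(Tuple.from("loadgen", id).pack(),
                               Tuple.from("payload-" + id).pack());
                        return CompletableFuture.<Void>completedFuture(null);
                    });
                    txn.whenComplete((ok, err) -> inFlight.release());
                });
            }
        }
    }
}
```

Our current jobs have no equivalent of the `acquire()`/`release()` pair, so (if my theory is right) the only ceiling on outstanding work is the job's own heap.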
The challenge to this theory from within my team was that the ratekeeper slowing tx startup as a backpressure mechanism would surely show as increased memory (tracking more ‘starting’ txes) on the commit proxies or similar, not an issue on the storage nodes. I also wondered if there was some upper limit to how much the ratekeeper would artificially slow tx start.
So, does my theory sound sensible? Am I completely off-base (quite possible)? If so, is there something else I should be looking at to help understand what caused this so we can try to avoid anything similar in prod?