We’ve got a single container running on K8s that is reading messages from Kafka and pushing then into FoundationDB running on EC2 instances (not via the K8s operator, as that currently results in limitations we’re not OK with).
Our cluster is not huge, but we’re running in
three_data_hall mode with 9 coordinators, 6 ‘stateless’ class nodes, 9 available log class nodes (4 in use, 5 standby), and 6 storage processes across 3 nodes. Each storage and log class node has 8GB RAM, and gp3 disks (one disk per process on the storage nodes) configured for a max of 6000 iops and 150MB/sec throughput.
With all of that, our single container is able to completely kill the FDB cluster. It reads a batch of messages from Kafka, then fans out into parallel threads to write them to FDB. This works fine for a short amount of time (15 minutes or so), but we see the durability lag, data lag, log queue size etc all gradually increase. The log queue size in particular seems to max out at 1.5GB. It’ll run like that for a short while, then the cluster just becomes unavailable. The app gets no response,
fdbcli on a server reports that “the configuration could not be read”, etc. The only way to bring it back seems to be to reboot the log and storage nodes. When they come back up the cluster goes into recovery/repartitioning, and eventually ends up back in a healthy state. It doesn’t seem to have lost any of the writes that the app actually saw it accept.
The thing is, I thought this was the purpose of ratekeeper? I thought when it detected queues/lag growing like this it was supposed to rate limit incoming requests to prevent the cluster being overloaded. Why is that not happening?