How does Ratekeeper actually _work_, and can I tune it?

We’ve got a single container running on K8s that is reading messages from Kafka and pushing them into FoundationDB running on EC2 instances (not via the K8s operator, as that currently results in limitations we’re not OK with).

Our cluster is not huge, but we’re running in three_data_hall mode with 9 coordinators, 6 ‘stateless’ class nodes, 9 available log class nodes (4 in use, 5 standby), and 6 storage processes across 3 nodes. Each storage and log class node has 8GB RAM and gp3 disks (one disk per process on the storage nodes) configured for a max of 6,000 IOPS and 150 MB/s throughput.

With all of that, our single container is able to completely kill the FDB cluster. It reads a batch of messages from Kafka, then fans out into parallel threads to write them to FDB. This works fine for a short amount of time (15 minutes or so), but then we see the durability lag, data lag, log queue size, etc. all gradually increase. The log queue size in particular seems to max out at 1.5GB. It’ll run like that for a short while, then the cluster just becomes unavailable. The app gets no response, fdbcli on a server reports that “the configuration could not be read”, etc. The only way to bring it back seems to be to reboot the log and storage nodes. When they come back up the cluster goes into recovery/repartitioning, and eventually ends up back in a healthy state. It doesn’t seem to have lost any of the writes that the app actually saw it accept.
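To give a rough idea, the write path looks something like the sketch below (heavily simplified, not our actual code; it assumes the kafka-python and foundationdb Python packages, and the topic name, broker address, and worker count are just placeholders):

```python
from concurrent.futures import ThreadPoolExecutor

import fdb
from kafka import KafkaConsumer

fdb.api_version(630)
db = fdb.open()  # default cluster file

@fdb.transactional
def store_message(tr, key, value):
    tr[key] = value  # one small transaction per Kafka record

consumer = KafkaConsumer("events", bootstrap_servers="kafka:9092")  # placeholders

with ThreadPoolExecutor(max_workers=32) as pool:
    while True:
        batch = consumer.poll(timeout_ms=1000, max_records=500)
        for records in batch.values():
            # Fan the batch out across threads, one FDB transaction per record.
            futures = [pool.submit(store_message, db, r.key, r.value) for r in records]
            for f in futures:
                f.result()  # surface any errors before polling the next batch
```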

The thing is, I thought this was the purpose of ratekeeper? I thought that when it detected queues/lag growing like this, it was supposed to rate-limit incoming requests to prevent the cluster being overloaded. Why is that not happening?


Are you doing blind writes (i.e. your transactions only have sets) without reading anything? If you don’t read, it won’t even use ratekeeper.
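By blind writes I mean transactions that look roughly like this minimal sketch (Python bindings, illustrative only):

```python
import fdb

fdb.api_version(630)
db = fdb.open()

@fdb.transactional
def blind_write(tr, key, value):
    # No reads anywhere in the transaction -- just a set and an implicit commit.
    tr[key] = value
```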

I didn’t realise ratekeeper only applied to gets. Why is that? It rate-limits by slowing down the transaction commit, doesn’t it? So surely that’s just as valid with sets as with gets?

In answer to the question though: yes, we are doing gets. We’re checking for the existence of the key before we try to write it. This is really just a belt-and-braces thing, as all the messages are keyed on UUID and so should be unique, except in the cases where we’ve purposefully duplicated them from one topic to another; in that case they’re identical, so it doesn’t matter if we write the same message multiple times. And that’s even before taking into account FDB’s own transaction conflict detection.
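Concretely, each write is roughly along these lines (a sketch, not our actual code; the key layout here is made up):

```python
import fdb

fdb.api_version(630)
db = fdb.open()

@fdb.transactional
def write_if_absent(tr, key, value):
    # Belt-and-braces: tr[key] returns a future, and .present() waits on it
    # and tells us whether the key already exists.
    if tr[key].present():
        return False  # duplicate message, nothing to do
    tr[key] = value
    return True
```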

Ratekeeper is applied to GetReadVersionRequests, which happen once per transaction (even if it’s a blind write). The only way to avoid ratekeeper is to call set_read_version on your transaction. To me it sounds like ratekeeper isn’t quite doing what it’s intended to do here.
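That is, something like the following sketch would skip the GRV entirely (and with it ratekeeper’s throttling); the cached_version here is hypothetical, e.g. a commit version saved from an earlier transaction:

```python
import fdb

fdb.api_version(630)
db = fdb.open()

def write_without_grv(db, key, value, cached_version):
    # cached_version is hypothetical -- e.g. a commit version saved earlier.
    # Supplying it means the client never issues a GetReadVersionRequest,
    # so ratekeeper never gets a chance to throttle this transaction.
    tr = db.create_transaction()
    tr.set_read_version(cached_version)
    tr[key] = value
    tr.commit().wait()  # retry handling omitted for brevity
```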

I stand corrected, though I’m quite sure we had to insert gets at Wavefront in order to stop our blind writes from knocking over FDB. Maybe there is something else at play here. (Edit: now that I think about it, we may have been calling set_read_version.)

Do you know what the bottleneck is? Are you hitting the maximum IOPS on your disks?

We’re not calling set_read_version. I’ve done some more digging and testing, and I think I’m getting somewhere:

We’re deploying our cluster across multiple EC2 instances in AWS, with EBS volumes as the data store. Apparently, anything below an 8xlarge(!) in the c, m, and t series (the t series doesn’t go that high, so that’s all of them; possibly others too, but those are the ones I looked at) has a 30-minute ‘burst limit’ on I/O to its volumes. Those instances can only achieve the maximum data rates that the volume (if it’s provisioned that high) and the instance type/size claim for 30 minutes in a 24-hour period, tracked separately for IOPS and bytes, after which they’re rate-limited to a much lower baseline.

So problem one was that we were exceeding that baseline, draining our EBS I/O ‘balance’, and then getting heavily rate-limited. We weren’t exceeding the max I/O on the disks while we were in the burst period, but once we were throttled to the baseline we were. When that happened, we saw the log instances flatlining at a 1.5GB log queue length (but not dying), and then eventually the storage instances disappearing from the cluster. At no point did I see any message indicating ratekeeper had kicked in.
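For what it’s worth, the instance-level burst balance shows up in CloudWatch as the EBSIOBalance% / EBSByteBalance% metrics (on the instance sizes that have a burst baseline), so you can watch it drain. Something like this sketch (assuming boto3 and CloudWatch access; the region and instance ID are placeholders) is enough to see it:

```python
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")  # placeholder region

def ebs_balance(instance_id, metric):
    # metric is "EBSIOBalance%" or "EBSByteBalance%"; both are percentages,
    # and 0% means the instance is pinned at its baseline EBS performance.
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName=metric,
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=datetime.utcnow() - timedelta(hours=1),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=["Minimum"],
    )
    return sorted(resp["Datapoints"], key=lambda d: d["Timestamp"])

for point in ebs_balance("i-0123456789abcdef0", "EBSIOBalance%"):  # placeholder instance ID
    print(point["Timestamp"], point["Minimum"])
```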

Problem two was that I had allocated storage processes to instances at two processes per 8GB node. That leaves slightly under the recommended 4GB per process once you account for some for the OS etc. I believe there was some memory exhaustion happening on the instances, not just with the FDB processes, as when the storage instances dropped out of the cluster I was unable to log into them via any standard AWS methods, and I had to terminate them and allow them to be restarted by the ASGs we’ve created. Once I did that, and the new instances mounted the existing volumes, the cluster recovered (assuming I had stopped our application that was sending all the data).

So I’ve just done another test: scaled the storage instances from 8GB to 16GB, turned on our import application again, and waited a bit. I’m now seeing our EBS volumes being rate-limited, but the storage processes haven’t dropped out of the cluster. The log queue length has dropped down to ~1GB and stabilised there, and the durability lag is ~5 mins. Obviously none of that is ideal, but the important thing is the cluster hasn’t completely fallen over, so when the load stops it should recover without manual intervention. I’m also seeing a message in fdbcli:

Performance limited by process: Storage server performance (storage queue).
  Most limiting process: 10.1.246.97:4500

which I think means ratekeeper is now doing its thing correctly?

So if my assumptions are correct, I just need to adjust our cluster topology a bit. Rather than trying to fit 2 storage processes on a single 8GB instance, I should be OK fitting 3 on a 16GB instance (a bit over 5GB per process), and then deploying fewer instances. That’ll also mean I probably want to reduce the volume size that each process gets, to keep the same amount of overall storage. I can do that, no worries.
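In other words, each storage instance’s foundationdb.conf would end up looking roughly like this (a sketch using the stock paths; the ports and exact process count are illustrative, not a recommendation):

```ini
# Sketch: one 16GB storage instance running three storage processes.
[fdbserver]
command = /usr/sbin/fdbserver
public_address = auto:$ID
listen_address = public
datadir = /var/lib/foundationdb/data/$ID
logdir = /var/log/foundationdb
class = storage

# One section per process; with one disk per process, each volume would be
# mounted at the corresponding datadir.
[fdbserver.4500]
[fdbserver.4501]
[fdbserver.4502]
```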

This is a very abnormal application that we’re using for testing anyway. We wouldn’t normally expect the cluster to be loaded so highly and for so long, but if it does happen, due to a DoS or similar, we want to be sure it won’t completely collapse 🙂


There’s still a meaningful difference between a get and a GRV-only transaction, because inserting a get means that the read version has to be readable on a storage server before the client will submit the write. It essentially makes clients slow down their writes as storage gets behind in versions, even if ratekeeper isn’t doing that job as well as it needs to be.
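In other words, even a read whose result you throw away acts as backpressure; a minimal sketch of what I mean (Python bindings, illustrative only):

```python
import fdb

fdb.api_version(630)
db = fdb.open()

@fdb.transactional
def write_with_backpressure(tr, key, value):
    # Waiting on any read (the value is thrown away) forces the transaction's
    # read version to be readable on a storage server before the write is
    # submitted, so writers slow down as storage falls behind in versions.
    tr[key].present()
    tr[key] = value
```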

IIRC the workloads for FDB’s write throughput benchmarking had a read in each write transaction because it was a more realistic representation of effective write throughput for the cluster.