Backpressure against data pipelines

We’re using FDB with the FDB record layer, though this question isn’t specific to the record layer.

We recently ran a large data job that made a lot of FDB calls. Our monitoring showed that storage_server_durability_lag was climbing. In the past, when this has happened for a sustained period of time, it has caused the cluster to become unhealthy. This time, luckily, we were able to catch it and stop the job.

In the future, we’d like whatever is running transactions against FDB to auto-detect this kind of situation and slow down the data load, so it doesn’t put too much load on FoundationDB. Right now, the client just goes as fast as it can, without regard for server health. We’d like to limit the client’s interactions with the server to a healthy level: ideally not failing, but just slowing things down to a manageable rate.

Is there a recommended way to do this?

There are two ways in which clients can adjust transaction rate in face of cluster saturation:

  1. The FDB Ratekeeper tracks the various buffers and lags in the cluster and instructs the proxies to add latency when handing out read versions at the start of new transactions. This automatically throttles transactions and thereby reduces cluster load.

  2. Clients can periodically (every minute or so) query the cluster status (\xff\xff/status/json) and throttle their workload if the status indicates the cluster may be saturating.
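As a rough sketch of option 2: the helpers below decide whether to back off based on the worst storage durability lag found in a parsed status document. The durability_lag field layout and the 30-second threshold are assumptions here; verify the field names against the machine-readable status for your FDB version.

```python
def max_storage_durability_lag(status):
    """Worst durability lag (seconds) across storage roles in a parsed
    status document. The durability_lag field layout is an assumption;
    check the machine-readable status for your FDB version."""
    worst = 0.0
    for proc in status.get("cluster", {}).get("processes", {}).values():
        for role in proc.get("roles", []):
            if role.get("role") == "storage":
                lag = role.get("durability_lag", {}).get("seconds", 0.0)
                worst = max(worst, lag)
    return worst

def should_throttle(status, lag_threshold_sec=30.0):
    """Back off once any storage server is this many seconds behind."""
    return max_storage_durability_lag(status) > lag_threshold_sec

# In the data-load loop (db is an fdb.Database; poll roughly once a minute):
#   raw = db.create_transaction()[b'\xff\xff/status/json']
#   if should_throttle(json.loads(bytes(raw).decode())):
#       time.sleep(5)  # pause before issuing the next batch
```

Reading \xff\xff/status/json inside a transaction avoids needing a separate monitoring path, and sleeping between batches slows the load without failing it.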

You mentioned that in the past a sustained high transaction rate caused the cluster to become unhealthy - what was the issue you observed? Was it just that new transactions took significantly longer to start (due to proxy throttling), or was it some other issue?

@gaurav We are running FDB version 6.1.8 with RF=3 and the main impact that we are seeing, when the cluster is running a heavy workload, is that the storage & transaction log queues continue to grow in size and the clients don’t seem to be throttling. The storage durability lag increases to the order of minutes and the database becomes unhealthy.

My understanding is that the ratekeeper should kick in and the clients should be throttling before the database becomes unhealthy but that never seems to happen.

As far as I know we run our transactions with default priority & are not seeing any impact to default transaction latencies.

Yeah… Ratekeeper is supposed to be kicking in exactly in that circumstance…

  • Could you hunt down and post some RkUpdate trace events from when your cluster is unhealthy and it should be throttled?
  • Are you doing any sort of read version caching in your clients that would cause you to potentially bypass ratekeeper?
  • Do you know if this is a read-heavy workload or a write-heavy workload?

Hi @alexmiller,

  1. I dug through the available trace logs from when our database was unhealthy but couldn’t find any RkUpdate events during that period. The only RkUpdate trace events that I could find are from trace logs when I first brought up the cluster. We will try to simulate the condition again and I’ll try to collect any events.

  2. Afaik, we aren’t doing anything special client-side for read version caching.

  3. We have a read-then-write workload, with a read-to-write ratio of about 4:1.

One thing that I learned from the “What do you monitor?” thread is this: when the queues reach a certain threshold (currently 900 MB on storage, 1.6 GB on logs), ratekeeper will slowly start to throttle transactions. If the queues continue to grow, the throttling becomes more aggressive, attempting to hold certain limits (1 GB on storage, 2 GB on logs). It may be worth noting, though, that ratekeeper will let one fault domain’s worth of storage servers fall arbitrarily far behind without throttling, in which case a storage server will eventually stop trying to fetch new data when it reaches what we call the e-brake (1.5 GB).
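A client-side sanity check against those soft thresholds might look like the sketch below, computing each queue as bytes ingested minus bytes made durable. The input_bytes/durable_bytes counter fields are an assumption about the machine-readable status layout; verify them for your FDB version.

```python
# Illustrative soft thresholds from the discussion above.
STORAGE_SOFT_BYTES = 900 * 1024 ** 2   # 900 MB on storage
LOG_SOFT_BYTES = int(1.6 * 1024 ** 3)  # 1.6 GB on logs

def queue_bytes(role):
    """Approximate queue size: bytes ingested minus bytes made durable.
    Field names are an assumption; verify them in your status output."""
    return (role.get("input_bytes", {}).get("counter", 0)
            - role.get("durable_bytes", {}).get("counter", 0))

def processes_near_throttle(status):
    """Return (process_id, role, queue_bytes) for queues past the soft limits."""
    flagged = []
    for pid, proc in status.get("cluster", {}).get("processes", {}).items():
        for role in proc.get("roles", []):
            q = queue_bytes(role)
            if role.get("role") == "storage" and q > STORAGE_SOFT_BYTES:
                flagged.append((pid, "storage", q))
            elif role.get("role") == "log" and q > LOG_SOFT_BYTES:
                flagged.append((pid, "log", q))
    return flagged
```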

For us, when the database was unhealthy, one of our storage processes had a peak queue size of 1.08 GB, but the rest of the storage processes were below 560 MB and all tx log processes had queue sizes around 515 MB. Could it be that, because these conditions were suboptimal but not critical, RK throttling never kicked in?

You should be getting an RkUpdate trace event every 5 seconds. You can run grep -P 'Roles=.*?RK' trace.xml to make sure you’re collecting traces from that process at all. If not, then that seems like a suspicious gap: either you’re somehow missing the process’s logs, or your FDB cluster was completely unavailable during that time, in which case I’m going to want to peek at all trace events that match Type="MasterRecover.*" to figure out where in recovery you’re stuck.
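Since each fdbserver trace event is a single XML element on its own line, a quick scan for RkUpdate events can be done without a full XML parse. A minimal sketch (the trace*.xml glob is an assumption about your log directory layout):

```python
import glob
import re

def find_rk_updates(trace_glob="trace*.xml"):
    """Collect RkUpdate trace events; each event is one XML element per line."""
    pattern = re.compile(r'Type="RkUpdate"')
    hits = []
    for path in sorted(glob.glob(trace_glob)):
        with open(path, errors="replace") as f:
            for line in f:
                if pattern.search(line):
                    hits.append(line.strip())
    return hits
```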

Yes, it sounds like Ratekeeper wouldn’t be limiting due to storage or log queues. However, you mentioned that storage servers were minutes behind, and Ratekeeper will also limit based on how far your storage servers are behind in terms of versions (and thus time). So if you do have more than one storage server per team that’s minutes behind in applying mutations, then that should also cause the cluster to be heavily throttled. I’m still struggling to figure out how your cluster would be unhealthy in a way that Ratekeeper wouldn’t catch.

If you can share a small window of some TLogMetrics and StorageMetrics from the unhealthy time along with the RkUpdate logs, that’d be great.

@alexmiller I increased the log file sizes for our cluster as it was at the default sizes and I think the original logs got rotated. We were able to recreate the situation and I have RkUpdate trace logs & observed tx log & storage metrics here: https://gist.github.com/rclerk-branch/a77edcef6256598b8bfb2823a70e5224. I’ve also attached observed metrics for TLog & storage queues.

While monitoring live, it seems that Ratekeeper does kick in when FDB is unhealthy and the storage durability lag is climbing, but it seems to be more relaxed and doesn’t try to be very aggressive. It seems to be happy with the suboptimal perf and doesn’t think it’s critical.

Looking at all of this… I’m not… actually sure what is unhealthy? Your RkUpdate says that your storage servers are keeping up in fetching data from the transaction logs, so your clients should still be able to do reads just fine. Which appears to be true, as your read and write latencies don’t look impacted. I’ll agree that you have storage servers with a large queue of data to write to disk, but they’re not yet at the point where we’ll limit based on that, and the only things that should be affected are storage queue size and transaction log queue size (due to buffering the data in memory on the storage servers, and not popping it from the TLogs, respectively).

So, I’ll agree that this cluster doesn’t look like it’s the happiest, but I also don’t see any signs of it being bad in any way that should impact your workload. Could you expand on what sort of unhealthiness you’re seeing?

Thanks for the insight. This is helpful to understand. Given that this is my first time operating FDB, I wasn’t certain as to whether this is healthy behavior or a concerning pattern.

Since fdbmonitor reports the database status as unhealthy, there was a concern about whether this pattern of suboptimal performance becomes harmful over long periods of time. Also, since we weren’t seeing Ratekeeper in action very often, likely because we were under the ops threshold where it really kicks in, there was confusion about whether we needed to implement throttling logic in the application ourselves.

After conducting more experiments last week, we did see Ratekeeper kick in actively as we continued to push the write workload. The only thing we noticed, as we left the heavy write workload running for a longer period of time, is that Ratekeeper keeps the storage & tlog queues at manageable levels but doesn’t aggressively try to bring them down, which may be expected behavior but wasn’t something I was aware of beforehand.

We have a better understanding of the Ratekeeper thresholds and patterns now. Appreciate your patience and guidance!