What do you monitor?

These queues (the storage queue and the log queue) exist on the storage and log processes, respectively, and they more or less represent work that’s outstanding in the system. The system enforces limits on the queue sizes via ratekeeper, a process that artificially introduces latency at transaction start (i.e. get read version) to limit the transaction rate and protect the cluster from being overrun.

I’ll try to give you an idea of how the queues work before I describe why you might want to keep an eye on them. When data is committed, it eventually makes its way to the transaction logs to be made durable. This data is also stored in memory on the log processes at this time (to support quick fetching by the storage servers), so we consider it part of the queue and track how much memory it’s using.

Meanwhile, the storage servers are asynchronously fetching the most recently committed data from the logs. When a storage server gets a set of mutations, it holds them in memory, and these mutations get tracked in the storage server’s queue. It should be noted that the data structures holding this data in memory on the logs and storage servers are different, which means that the same mutations can take up different amounts of queue space on the logs and on the storage servers.

Without going into too much detail, as part of our MVCC implementation, storage servers are able to serve reads up to 5+ seconds old. The way this works is that the oldest available version is stored on disk and subsequent versions are held in a multi-version memory structure. This means that each mutation will reside in memory for 5 seconds, and during this time it is considered part of the storage queue. At the point when the data is made durable, it is removed from the memory data structure and consequently no longer tracked in the storage queue.
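From the client side, you can see this window by pinning a transaction to an older read version. Below is a minimal sketch with the Python bindings (untested, and the key is just a placeholder): a read at a version inside the ~5 second window is served from that in-memory structure, while an older version fails with past_version (renamed transaction_too_old in 6.0).

```python
import time
import fdb

fdb.api_version(520)  # match this to your cluster/client version
db = fdb.open()

# Grab the current read version.
tr = db.create_transaction()
old_version = tr.get_read_version().wait()

time.sleep(4)  # still inside the ~5 second window the storage servers keep in memory

# A new transaction pinned to that older version can still be served.
tr2 = db.create_transaction()
tr2.set_read_version(old_version)
try:
    print(tr2.get(b'some-key').wait())  # 'some-key' is a placeholder
except fdb.FDBError as e:
    # If the pinned version has fallen out of the window, the read fails
    # with past_version (transaction_too_old in 6.0 and later).
    print('read version too old:', e)
```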

Once the storage server has made mutations durable, it notifies the logs that it’s done so. When all storage servers responsible for a set of mutations have reported this, the logs can safely discard that data and remove it from their queue.

What you expect to see in normal circumstances is that the log and storage queues will hold about 5 seconds’ worth of mutations (actually, it’s a little more in practice, something like 7-8s). If something isn’t able to keep up for some reason, then you may see the log and/or storage queues start to grow beyond that. One case where you may see this is when a storage server fails. In that scenario, the failed storage server can’t fetch data from the logs, and the logs end up holding onto that data until it is replicated elsewhere or the storage server comes back and gets the data.
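As a back-of-envelope illustration (the write rate here is made up; the retention figure is just the rough 7-8s from above):

```python
# A steady-state queue should be roughly (mutation input rate) x (retention time),
# since mutations sit in the queue for the ~5s MVCC window plus some overhead.
input_rate_bytes_per_sec = 20e6   # hypothetical sustained write rate: 20 MB/s
retention_seconds = 7.5           # ~5s of versions plus overhead, per the text above
expected_queue_bytes = input_rate_bytes_per_sec * retention_seconds  # ~150 MB
```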

In this scenario, you could imagine that this takes a while and the queue on the log could grow quite large. To protect against this, we have a mechanism on the logs called spilling. When the spilling conditions are met (currently, this is when the log queue >= 1.5GB), the data is written to a separate file on disk and removed from the memory structures. When this occurs, we also consider it to be removed from the queue.

Another case where queues might grow is if a process is saturated. If a process reaches 100% CPU, say, it may be unable to keep up with the incoming work and its queues may grow. When the queues reach a certain threshold (currently 900 MB on storage, 1.6GB on logs), ratekeeper will slowly start to throttle transactions. If the queues continue to grow, the throttling becomes more aggressive, trying to hold the queues at certain limits (1GB on storage, 2GB on logs). It may be worth noting, though, that ratekeeper will let 1 fault domain’s worth of storage servers fall arbitrarily far behind without throttling, in which case those storage servers will eventually stop trying to fetch new data when they reach what we call the e-brake (1.5GB).
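To make those numbers concrete, here’s a small sketch that classifies a storage queue size against the thresholds quoted above (the numbers come from the text; classify_storage_queue is just a hypothetical helper, not anything that exists in FoundationDB):

```python
# Current default thresholds described above (these are knobs and can change).
STORAGE_THROTTLE_START = 900e6   # ratekeeper starts throttling gently
STORAGE_TARGET = 1.0e9           # ratekeeper throttles aggressively toward this
STORAGE_EBRAKE = 1.5e9           # the storage server stops fetching new data
LOG_THROTTLE_START = 1.6e9
LOG_TARGET = 2.0e9

def classify_storage_queue(queue_bytes):
    # Caveat from the text: ratekeeper ignores up to one fault domain's worth
    # of lagging storage servers, so a single huge queue doesn't necessarily
    # mean the whole cluster is being throttled.
    if queue_bytes < STORAGE_THROTTLE_START:
        return 'normal'
    if queue_bytes < STORAGE_TARGET:
        return 'ratekeeper throttling gently'
    if queue_bytes < STORAGE_EBRAKE:
        return 'ratekeeper throttling hard'
    return 'e-brake: storage server stops fetching'
```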

The reason you may want to monitor the queues is that they can give you early warning that your workload is going to saturate the cluster. You may also be able to see that you are approaching the queue limits, even if the queues aren’t currently growing.

However, alerting based on the queues should be done with care. There are different reasons that queues can grow large, and you may not want to treat all of them the same. As illustrated above, a large queue could mean:

  1. The cluster isn’t keeping up with the workload
  2. The cluster is keeping up with the workload, but 5 seconds’ worth of mutations is itself a sizable portion of the queue limit
  3. The cluster has had a failure somewhere and the queues are large while it heals

I tend to think of situation #1 as more serious than the other two (i.e. I may want to know immediately if #1 is happening, eventually if #2 is happening, and possibly not at all if #3 is happening).

Based on the current behavior of the system, I probably wouldn’t recommend aggressive alerting on the log queue size at anything less than the spill threshold (you could alert at 1.6GB, perhaps) because otherwise you will get alerted any time #3 happens. Alerting on storage queue sizes is less prone to that and probably more reasonable (I currently do it), though it’s not immune.

To keep an eye on #2, I currently don’t use queue sizes directly. Instead, I look at something called cluster.processes.<id>.roles.input_bytes.hz for the storage and log processes in status json, which measures the rate that data is coming into the queue (durable_bytes, by the way, measures data leaving the queue, and you can get a queue size for a process by computing input_bytes.counter - durable_bytes.counter). From that input rate, I can estimate how big the queues should be.
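For reference, here’s how I’d pull those numbers out of status json with the Python bindings. This is a sketch under a couple of assumptions: roles shows up as a list of per-role metric objects (which is how it looks in the versions I’ve used), and the 7.5s retention figure is just the rough number from earlier.

```python
import json
import fdb

fdb.api_version(520)
db = fdb.open()

@fdb.transactional
def get_status(tr):
    # The client exposes status json at the special key \xff\xff/status/json.
    return json.loads(tr.get(b'\xff\xff/status/json').wait())

status = get_status(db)
RETENTION_SECONDS = 7.5  # ~5s of versions plus overhead

for process_id, process in status['cluster']['processes'].items():
    for role in process.get('roles', []):
        if role.get('role') not in ('storage', 'log') or 'input_bytes' not in role:
            continue
        input_bytes = role['input_bytes']      # data entering the queue
        durable_bytes = role['durable_bytes']  # data leaving the queue
        queue_bytes = input_bytes['counter'] - durable_bytes['counter']
        expected_bytes = input_bytes['hz'] * RETENTION_SECONDS
        print(process_id, role['role'], 'queue:', queue_bytes, 'expected:', int(expected_bytes))
```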

For #1, in addition to watching the storage queue, you can also keep an eye on storage lag (in 5.2 it’s at cluster.processes.<id>.roles.data_version_lag for storage roles; in 6.0 you can use cluster.processes.<id>.roles.data_lag.seconds). This measures how far behind a storage server is from the logs, and if it gets and stays much larger than 5 seconds (or 5,000,000 versions in 5.2), that could also indicate something’s going on. In practice, I actually do something a little more complicated: I watch for the worst lag getting very large (hundreds of seconds), and for the [replication_factor]th worst lag, after subtracting failed processes, sitting at around 15s or so (i.e. the 3rd worst for triple replication, but the 2nd worst if a process is failed).
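Here’s a sketch of that check against the 6.0 field (same status json assumptions as above; the replication factor is an example value, and the adjustment for failed processes is left as a comment):

```python
import json
import fdb

fdb.api_version(600)
db = fdb.open()

@fdb.transactional
def get_status(tr):
    return json.loads(tr.get(b'\xff\xff/status/json').wait())

status = get_status(db)

# Collect data lag for every storage role that reports it (6.0 field name;
# on 5.2 you'd read data_version_lag and treat ~1,000,000 versions as a second).
lags = sorted(
    (role['data_lag']['seconds']
     for process in status['cluster']['processes'].values()
     for role in process.get('roles', [])
     if role.get('role') == 'storage' and 'data_lag' in role),
    reverse=True,
)

replication_factor = 3  # example: triple replication

worst = lags[0] if lags else 0.0

# N is the replication factor minus the number of failed storage servers
# (failed servers usually stop reporting, so in practice that means looking
# one position earlier in the sorted list per failure).
n = replication_factor
nth_worst = lags[n - 1] if len(lags) >= n else 0.0

# Per the text: the worst lag climbing into the hundreds of seconds, or the
# Nth worst sitting around 15s, are the conditions worth paying attention to.
print('worst storage lag (s):', worst)
print('lag at position', n, 'from the worst (s):', nth_worst)
```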
