Determine cluster availability for processing new transactions

Hello there!
We’re working on a method to determine the cluster’s health based on its ability to process new transactions. To ascertain this, we were relying on the database availability and quorum fields in the status JSON. However, we discovered an exception to this rule, where the database is still reported as available but cannot process new transactions because a storage server has reached the 5% free-space limit. We understand that there might be more such exceptions that need to be accounted for.
Hence, it’d be helpful to know if there’s a better method to determine whether my db is capable of processing new transactions, albeit at reduced performance.

Thanks!

The client.database_status.available field will report that the cluster is available if the client can connect to the cluster to get status, and if as part of that status request the cluster controller was able to start an immediate priority transaction, read a key, and commit the transaction. There are some timeouts set on different parts of this operation, and if those are hit the database will be regarded as unavailable for this status field.
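
If it helps to see this field programmatically, the status document can be read from a client by fetching the `\xff\xff/status/json` special key. Below is a minimal sketch using the Python binding; the API version, default cluster file, and the particular field check are assumptions you’d adapt to your setup:

```python
import json
import fdb

fdb.api_version(630)  # assumption: use whatever API version matches your cluster
db = fdb.open()       # assumption: default cluster file location

@fdb.transactional
def read_status(tr):
    # The machine-readable status document is exposed under this special key.
    return json.loads(tr.get(b'\xff\xff/status/json').wait())

status = read_status(db)
available = status.get('client', {}).get('database_status', {}).get('available', False)
print('available (immediate-priority probe):', available)
```
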

There are various cases where a client of the cluster may experience unavailability but not be able to detect it through this status field. For example:

  1. Ratekeeper throttling that delays GRV requests significantly. There are a variety of reasons this throttling could happen, including the example you listed where a process runs out of space.
  2. Some keys could become unavailable for reads. I mentioned that status checks whether a read works, but it doesn’t read from all shards in the cluster. If some set of data has gone missing or its storage servers are unable to handle reads in a timely manner, then any transactions relying on that data will not work.
  3. A problem limited to a subset of the proxies could prevent some clients from starting and/or committing transactions while not preventing all transactions.

Monitoring #1 is perhaps most easily accomplished by performing latency probes against the cluster to check the timeliness of default-priority GRVs. Unlike the immediate-priority transactions used to populate the availability field in status, these are subject to ratekeeper, so you could use them to determine whether it’s possible to start transactions quickly enough. Status includes a latency probe at cluster.latency_probe.transaction_start_seconds that you could use, but note that it caps out at 5 seconds. If you are happy to treat the database as unavailable when this value crosses a threshold at or below 5 seconds, then it may be sufficient; otherwise, you would probably need to run your own probe. Also, be aware that this value can be missing from status, which should be treated as if the latency is greater than 5 seconds.
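
If you do end up running your own probe, it only needs to time a default-priority GRV. Here is a rough sketch with the Python binding; the timeout and alert threshold are placeholder values, not recommendations:

```python
import time
import fdb

fdb.api_version(630)  # assumption: match your cluster's API version
db = fdb.open()

def grv_probe_seconds(db, timeout_ms=5000):
    """Time one default-priority get-read-version (GRV) request.

    Unlike the immediate-priority transaction behind client.database_status.available,
    a default-priority GRV is subject to ratekeeper throttling, so a slow or failed
    probe suggests that ordinary transactions cannot start promptly.
    """
    tr = db.create_transaction()
    tr.options.set_timeout(timeout_ms)  # fail the probe rather than wait forever
    start = time.monotonic()
    try:
        tr.get_read_version().wait()
        return time.monotonic() - start
    except fdb.FDBError:
        return None  # timed out or errored: treat as slower than the probe window

latency = grv_probe_seconds(db)
if latency is None or latency > 1.0:  # pick whatever threshold defines "available" for you
    print('cluster is not starting default-priority transactions promptly')
```
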

For #2, this is a bit trickier. We can detect when the cluster knows that all replicas of some piece of data are missing (cluster.data.min_replicas_remaining). To determine if data is present but unreadable (e.g. because storage servers are too busy or have fallen behind), we would need to use a few other metrics. We can probably tell if a storage server is too busy by looking at its read latency statistics (cluster.processes[<ID>].roles[N].read_latency_statistics for storage roles). We can tell if it is behind by checking whether cluster.processes[<ID>].roles[N].data_lag.seconds is large (>5s or so). In both of these cases the problem needs to affect all replicas of some subset of data, so you’d probably only want to report an availability issue if the number of storage servers with problems is at least as large as your replication factor (3 in triple redundancy mode).
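
To illustrate how those pieces might fit together, here is a hedged sketch that walks the status document for storage roles and applies the “enough lagging replicas” reasoning above. The 5-second lag threshold and the replication factor of 3 are just the example values from the previous paragraph; the field paths follow the status layout described there:

```python
def lagging_storage_servers(status, lag_threshold_seconds=5.0):
    """Count storage roles that are significantly behind (or not reporting lag).

    `status` is the parsed machine-readable status document, laid out as
    cluster.processes[<ID>].roles[N] as described above.
    """
    lagging = 0
    for process in status.get('cluster', {}).get('processes', {}).values():
        for role in process.get('roles', []):
            if role.get('role') != 'storage':
                continue
            lag = role.get('data_lag', {}).get('seconds')
            if lag is None or lag > lag_threshold_seconds:
                lagging += 1
    return lagging

# `status` here is the parsed document from the read_status() sketch earlier.
# Only flag a potential availability problem if the number of affected storage
# servers is at least the replication factor (3 for triple redundancy), since
# the issue must hit every replica of some subset of data to block reads.
replication_factor = 3
if lagging_storage_servers(status) >= replication_factor:
    print('some data may be stale or unreadable')
```
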

For #3, we can probably cover a good number of the cases here by checking proxy request latencies (cluster.processes[<ID>].roles[N].grv_latency_statistics and commit_latency_statistics for proxy roles).
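
A similar walk over proxy roles could surface that. Note this is only a sketch: the exact shape of grv_latency_statistics varies between FoundationDB versions (flat in some, keyed by priority in others), and the role names and p99 threshold below are assumptions:

```python
def slow_proxies(status, p99_threshold_seconds=1.0):
    """Return process IDs whose proxy roles report high GRV or commit latency."""

    def p99(stats):
        # grv_latency_statistics may be a flat stats object, or keyed by priority
        # ("default"/"batch"), depending on the FoundationDB version.
        if not isinstance(stats, dict):
            return None
        if 'p99' in stats:
            return stats['p99']
        nested = [v['p99'] for v in stats.values()
                  if isinstance(v, dict) and 'p99' in v]
        return max(nested) if nested else None

    slow = []
    for proc_id, process in status.get('cluster', {}).get('processes', {}).items():
        for role in process.get('roles', []):
            # Older versions report a single "proxy" role; newer ones split it
            # into "grv_proxy" and "commit_proxy".
            if role.get('role') not in ('proxy', 'grv_proxy', 'commit_proxy'):
                continue
            for stat_name in ('grv_latency_statistics', 'commit_latency_statistics'):
                latency = p99(role.get(stat_name, {}))
                if latency is not None and latency > p99_threshold_seconds:
                    slow.append(proc_id)
    return slow
```
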

I realize this is quite a bit to monitor in order to observe the availability of the cluster, and that’s not ideal. This is something I’m currently thinking about and hoping to improve going forward. I’m not sure it’s necessarily a good idea to roll all of this data up into a single binary available field, as that limits others’ ability to define availability the way they want (e.g. is a 5-second downtime unavailable, or 30? Does one busy team of storage servers count as complete unavailability?). However, it seems doable that we could provide a few basic statistics about the cluster’s availability that could be reasoned about more easily than what I described above.
