Hi @alexmiller,
-
I dug through the available trace logs from when our database was unhealthy but couldn’t find any RkUpdate events during that period. The only RkUpdate trace events I could find are from when I first brought up the cluster. We will try to reproduce the condition and I’ll collect any RkUpdate events that show up.
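In case it’s useful when we re-run the test, this is roughly how I’m planning to scan the logs for RkUpdate events. It’s only a minimal sketch assuming the default XML trace format; the log directory path is specific to our setup:

```python
#!/usr/bin/env python3
"""Sketch: scan FoundationDB XML trace logs for RkUpdate events."""
import glob
import xml.etree.ElementTree as ET

LOG_DIR = "/var/log/foundationdb"  # assumption: adjust to your fdbserver log dir

for path in sorted(glob.glob(f"{LOG_DIR}/trace.*.xml")):
    try:
        root = ET.parse(path).getroot()
    except ET.ParseError:
        continue  # a trace file still being written may be truncated
    for event in root.iter("Event"):
        if event.get("Type") == "RkUpdate":
            # Dump the full attribute map so we don't guess at field names.
            print(path, dict(event.attrib))
```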
-
Afaik, we aren’t doing anything special client-side for read version caching.
-
We do have a read-then-write workload, but the read-to-write ratio is 4:1.
One thing that I learned in the “What do you monitor?” thread is this:

> When the queues reach a certain threshold (currently 900 MB on storage, 1.6GB on logs), ratekeeper will slowly start to throttle transactions. If the queue continues to grow, the throttling becomes more aggressive, attempting to achieve certain limits (1GB on storage, 2GB on logs). It may be worth noting, though, that ratekeeper will let 1 fault domain’s worth of storage servers fall arbitrarily far behind without throttling, in which case the storage server will eventually stop trying to fetch new data when it reaches what we call the e-brake (1.5GB).
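To make sure I’m reading that correctly, here is a toy model of the thresholds as I understand them. This is only a sketch of the behaviour described in that quote, not the real ratekeeper algorithm, and the constant names are mine for illustration:

```python
"""Toy model of the throttling thresholds quoted above (not the real algorithm)."""

STORAGE_SPRING_START = 900e6   # storage queue size where throttling slowly begins
STORAGE_TARGET       = 1000e6  # storage queue size ratekeeper tries to hold
TLOG_SPRING_START    = 1.6e9   # tlog queue size where throttling slowly begins
TLOG_TARGET          = 2.0e9   # tlog queue size ratekeeper tries to hold
STORAGE_EBRAKE       = 1.5e9   # storage server stops fetching new data here

def throttle_factor(queue_bytes: float, spring_start: float, target: float) -> float:
    """Return 1.0 for no throttling; smaller values mean heavier throttling."""
    if queue_bytes <= spring_start:
        return 1.0   # below the spring region: full rate
    if queue_bytes >= target:
        return 0.0   # at/over the target: throttle hard (toy simplification)
    # Inside the spring region, throttling ramps up as the queue grows.
    return (target - queue_bytes) / (target - spring_start)
```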
In our case, when the database was unhealthy, one storage process had a peak queue size of 1.08GB, but the rest of the storage processes were below 560MB and all tx log processes had queue sizes around 515MB. Could it be that, because these conditions were suboptimal but never crossed the critical thresholds, the RK throttling never kicked in?
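If I’ve understood the fault-domain exception correctly, a rough check of our numbers against those thresholds would look like the sketch below. The queue values are illustrative stand-ins for what we observed, and whether ratekeeper really excludes that single worst process is exactly what I’m unsure about:

```python
# Illustrative stand-ins for our observed peaks (bytes).
storage_queues = [1.08e9, 560e6, 560e6]   # one outlier; the rest were < 560 MB
tlog_queues    = [515e6] * 3              # all tlog queues were ~515 MB

STORAGE_SPRING_START = 900e6   # storage throttling begins here (per the quote)
TLOG_SPRING_START    = 1.6e9   # tlog throttling begins here (per the quote)

# If ratekeeper ignores one fault domain's worth of storage servers, the single
# 1.08 GB outlier would not count; every remaining queue is below the spring start.
remaining = sorted(storage_queues)[:-1]
no_storage_throttle = all(q < STORAGE_SPRING_START for q in remaining)
no_tlog_throttle    = all(q < TLOG_SPRING_START for q in tlog_queues)
print(no_storage_throttle and no_tlog_throttle)   # True -> no throttling expected
```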