Health / Readiness checks

We’re running FDB on k8s without the operator for a few reasons. We’re not sure what we should be using for liveness and readiness probes. We know enough not to use status json but don’t know what would be better. I tried to look through the operator code and couldn’t find anything. Do you have any recommendations?

Can you elaborate? We use status json for this; you just shouldn’t call into it too often, as it is quite expensive. What we do is run it only on the machines that run a coordinator.
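
For reference, a client can fetch this document without shelling out to fdbcli by reading the \xff\xff/status/json special key. A minimal Go sketch, assuming API version 710 and the standard Go bindings (this is illustrative, not the poster’s actual tooling):

    package main

    import (
        "fmt"
        "log"

        "github.com/apple/foundationdb/bindings/go/src/fdb"
    )

    func main() {
        fdb.MustAPIVersion(710) // assumed API version
        db := fdb.MustOpenDefault()

        // Reading the special key \xff\xff/status/json returns the same
        // document that `status json` produces in fdbcli.
        raw, err := db.ReadTransact(func(rtr fdb.ReadTransaction) (interface{}, error) {
            return rtr.Get(fdb.Key("\xff\xff/status/json")).Get()
        })
        if err != nil {
            log.Fatalf("fetching status json: %v", err)
        }
        fmt.Println(string(raw.([]byte)))
    }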

Not sure about the readiness checks (I used the operator), but I wrote a small Go app that converts status json into Prometheus metrics; it may help you, since the health state is exposed:

# TYPE fdb_database_status gauge
fdb_database_status{state="available"} 1
fdb_database_status{state="healthy"} 1
fdb_database_status{state="quorum_reachable"} 1
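
If it helps, a rough sketch of how such a conversion could look in Go, using prometheus/client_golang. The status json field paths (client.database_status.*, client.coordinators.quorum_reachable) and the 30-second poll interval are my assumptions, not the author’s actual code:

    package main

    import (
        "encoding/json"
        "log"
        "net/http"
        "time"

        "github.com/apple/foundationdb/bindings/go/src/fdb"
        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    var dbStatus = prometheus.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "fdb_database_status",
            Help: "Database health flags from status json (1 = true).",
        },
        []string{"state"},
    )

    func b2f(b bool) float64 {
        if b {
            return 1
        }
        return 0
    }

    func main() {
        prometheus.MustRegister(dbStatus)

        fdb.MustAPIVersion(710)
        db := fdb.MustOpenDefault()

        go func() {
            for ; ; time.Sleep(30 * time.Second) { // poll sparingly; status json is expensive
                raw, err := db.ReadTransact(func(rtr fdb.ReadTransaction) (interface{}, error) {
                    return rtr.Get(fdb.Key("\xff\xff/status/json")).Get()
                })
                if err != nil {
                    log.Printf("status json: %v", err)
                    continue
                }
                // Decode only the health flags we export; field paths assumed
                // from the client section of status json.
                var s struct {
                    Client struct {
                        DatabaseStatus struct {
                            Available bool `json:"available"`
                            Healthy   bool `json:"healthy"`
                        } `json:"database_status"`
                        Coordinators struct {
                            QuorumReachable bool `json:"quorum_reachable"`
                        } `json:"coordinators"`
                    } `json:"client"`
                }
                if err := json.Unmarshal(raw.([]byte), &s); err != nil {
                    log.Printf("decoding status json: %v", err)
                    continue
                }
                dbStatus.WithLabelValues("available").Set(b2f(s.Client.DatabaseStatus.Available))
                dbStatus.WithLabelValues("healthy").Set(b2f(s.Client.DatabaseStatus.Healthy))
                dbStatus.WithLabelValues("quorum_reachable").Set(b2f(s.Client.Coordinators.QuorumReachable))
            }
        }()

        http.Handle("/metrics", promhttp.Handler())
        log.Fatal(http.ListenAndServe(":9090", nil))
    }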

By not using status json, I meant not using it on every node. Ideally, we would use something that could be run locally on the node. I’m unsure how difficult it would be to have a small number of pods call status json and update the pod readiness of every pod in the cluster.

Yes, running it on every machine won’t scale. But there are alternatives:

  1. You could write the status json back into fdb with a timestamp of the last update (see the sketch after this list). The drawback is that you won’t be able to fetch anything if fdb is down (though it’s unclear whether you need to…). Doing a get request every 5 seconds or so from each machine should work even for large clusters; if not, you can write multiple copies.
  2. You could write the result of status json to a third service (like S3). This has the drawback that you need to rely on another service.
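
A minimal sketch of option 1 in Go, assuming hypothetical keys health/status and health/updated_at and API version 710. A small number of pods would run refresh on a timer; every node’s probe runs check:

    package main

    import (
        "errors"
        "fmt"
        "os"
        "time"

        "github.com/apple/foundationdb/bindings/go/src/fdb"
    )

    const (
        statusKey  = "health/status"     // hypothetical key for the cached status json
        updatedKey = "health/updated_at" // hypothetical key for the last-update timestamp
    )

    // refresh runs on a small number of pods: fetch status json, cache it in fdb.
    func refresh(db fdb.Database) error {
        _, err := db.Transact(func(tr fdb.Transaction) (interface{}, error) {
            status := tr.Get(fdb.Key("\xff\xff/status/json")).MustGet()
            tr.Set(fdb.Key(statusKey), status)
            tr.Set(fdb.Key(updatedKey), []byte(time.Now().UTC().Format(time.RFC3339)))
            return nil, nil
        })
        return err
    }

    // check runs locally on every node: a single cheap get plus a staleness bound.
    func check(db fdb.Database, maxAge time.Duration) error {
        raw, err := db.ReadTransact(func(rtr fdb.ReadTransaction) (interface{}, error) {
            return rtr.Get(fdb.Key(updatedKey)).Get()
        })
        if err != nil {
            return err
        }
        ts, err := time.Parse(time.RFC3339, string(raw.([]byte)))
        if err != nil {
            return err
        }
        if time.Since(ts) > maxAge {
            return errors.New("cached status json is stale")
        }
        return nil
    }

    func main() {
        fdb.MustAPIVersion(710)
        db := fdb.MustOpenDefault()
        if err := check(db, time.Minute); err != nil {
            fmt.Fprintln(os.Stderr, "not ready:", err)
            os.Exit(1)
        }
        fmt.Println("ready")
    }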

One thing we do for liveness checks on the cluster is to repeatedly run a transaction to test that we can get a read version, read the database, and commit. For GRV and commit, this is usually quite informative. For reads it’s a little less so given that some keys could be unreadable while others aren’t, and you presumably wouldn’t want to test reads from everywhere.
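
In the Go bindings, such a liveness transaction could look roughly like this; the probe key and the 5-second timeout are my assumptions, not the poster’s actual implementation:

    package main

    import (
        "fmt"
        "os"
        "time"

        "github.com/apple/foundationdb/bindings/go/src/fdb"
    )

    func main() {
        fdb.MustAPIVersion(710)
        db := fdb.MustOpenDefault()

        probeKey := fdb.Key("liveness_probe") // hypothetical key owned by the probe

        _, err := db.Transact(func(tr fdb.Transaction) (interface{}, error) {
            tr.Options().SetTimeout(5000) // fail fast (ms) instead of retrying forever

            // 1. GRV: getting a read version exercises the proxies.
            if _, err := tr.GetReadVersion().Get(); err != nil {
                return nil, err
            }
            // 2. Read: one key exercises the storage read path (with the caveat
            //    above: a single key can't prove every shard is readable).
            if _, err := tr.Get(probeKey).Get(); err != nil {
                return nil, err
            }
            // 3. Write; Transact commits when we return nil, exercising the
            //    commit path end to end.
            tr.Set(probeKey, []byte(time.Now().UTC().Format(time.RFC3339)))
            return nil, nil
        })
        if err != nil {
            fmt.Fprintln(os.Stderr, "liveness check failed:", err)
            os.Exit(1)
        }
        fmt.Println("ok")
    }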

We don’t currently use Kubernetes readiness or liveness checks for the FDB processes, largely because we haven’t found a need for them. I think the most important use for them is when you have multiple processes under a single service, so that the service can figure out which processes to route traffic to, but FDB doesn’t have that kind of use case.

While you can use status JSON to get at this information, you’ll need to be careful to separate cluster-level health from process-level health. You can get process-level error information, which could be a helpful thing to expose, but it will depend on what you’re trying to do in response to the readiness and liveness checks. You could also try a basic TCP probe, but I don’t know if that would serve your needs.
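
For the TCP option, keep in mind the probe only proves the process is accepting connections. A sketch, assuming the fdbserver process listens on the conventional port 4500 (Kubernetes’ built-in tcpSocket probe does the same thing without a sidecar):

    package main

    import (
        "fmt"
        "net"
        "os"
        "time"
    )

    func main() {
        // Assumed address; substitute the pod's actual fdbserver port.
        conn, err := net.DialTimeout("tcp", "127.0.0.1:4500", 2*time.Second)
        if err != nil {
            fmt.Fprintln(os.Stderr, "tcp probe failed:", err)
            os.Exit(1)
        }
        conn.Close()
        fmt.Println("ok")
    }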

Readiness checks are also valuable for things like pod disruption budgets. AFAIU, the controller uses the readiness state of pods to determine whether or not it is safe to kill / move pods in the statefulset. They are also useful in ensuring that a rolling restart of the statefulset doesn’t proceed faster than the pods can re-join the cluster.