Health / Readiness checks

We’re running FDB on k8s without the operator for a few reasons. We’re not sure what we should be using for liveness and readiness probes. We know enough not to use status json but don’t know what would be better. I tried to look through the operator code and couldn’t find anything. Do you have any recommendations?

Can you elaborate? We use status json for this; you just shouldn’t call into it too often, as it is quite expensive. What we do is run it only on the machines that run a coordinator.
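
For reference, a client can fetch this document without shelling out to fdbcli by reading the \xff\xff/status/json special key. A minimal Go sketch, assuming API version 710 and the standard Go bindings (this is illustrative, not the poster’s actual tooling):

    package main

    import (
        "fmt"
        "log"

        "github.com/apple/foundationdb/bindings/go/src/fdb"
    )

    func main() {
        fdb.MustAPIVersion(710) // assumed API version
        db := fdb.MustOpenDefault()

        // Reading the special key \xff\xff/status/json returns the same
        // document that `status json` produces in fdbcli.
        raw, err := db.ReadTransact(func(rtr fdb.ReadTransaction) (interface{}, error) {
            return rtr.Get(fdb.Key("\xff\xff/status/json")).Get()
        })
        if err != nil {
            log.Fatalf("fetching status json: %v", err)
        }
        fmt.Println(string(raw.([]byte)))
    }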

Not sure about the readiness checks (I used the operator), but I wrote a small Go app that converts status json into Prometheus metrics; it may help you, since the health state is exposed:

# TYPE fdb_database_status gauge
fdb_database_status{state="available"} 1
fdb_database_status{state="healthy"} 1
fdb_database_status{state="quorum_reachable"} 1
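
If it helps, a rough sketch of how such a conversion could look in Go, using prometheus/client_golang. The status json field paths (client.database_status.*, client.coordinators.quorum_reachable) and the 30-second poll interval are my assumptions, not the author’s actual code:

    package main

    import (
        "encoding/json"
        "log"
        "net/http"
        "time"

        "github.com/apple/foundationdb/bindings/go/src/fdb"
        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    var dbStatus = prometheus.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "fdb_database_status",
            Help: "Database health flags from status json (1 = true).",
        },
        []string{"state"},
    )

    func b2f(b bool) float64 {
        if b {
            return 1
        }
        return 0
    }

    func main() {
        prometheus.MustRegister(dbStatus)

        fdb.MustAPIVersion(710)
        db := fdb.MustOpenDefault()

        go func() {
            for ; ; time.Sleep(30 * time.Second) { // poll sparingly; status json is expensive
                raw, err := db.ReadTransact(func(rtr fdb.ReadTransaction) (interface{}, error) {
                    return rtr.Get(fdb.Key("\xff\xff/status/json")).Get()
                })
                if err != nil {
                    log.Printf("status json: %v", err)
                    continue
                }
                // Decode only the health flags we export; field paths assumed
                // from the client section of status json.
                var s struct {
                    Client struct {
                        DatabaseStatus struct {
                            Available bool `json:"available"`
                            Healthy   bool `json:"healthy"`
                        } `json:"database_status"`
                        Coordinators struct {
                            QuorumReachable bool `json:"quorum_reachable"`
                        } `json:"coordinators"`
                    } `json:"client"`
                }
                if err := json.Unmarshal(raw.([]byte), &s); err != nil {
                    log.Printf("decoding status json: %v", err)
                    continue
                }
                dbStatus.WithLabelValues("available").Set(b2f(s.Client.DatabaseStatus.Available))
                dbStatus.WithLabelValues("healthy").Set(b2f(s.Client.DatabaseStatus.Healthy))
                dbStatus.WithLabelValues("quorum_reachable").Set(b2f(s.Client.Coordinators.QuorumReachable))
            }
        }()

        http.Handle("/metrics", promhttp.Handler())
        log.Fatal(http.ListenAndServe(":9090", nil))
    }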

By not using status json, I meant not using it on every node. Ideally, we would use something that could be run locally on the node. I’m unsure how difficult it would be to have a small number of pods call status json and update the pod readiness of every pod in the cluster.

Yes, running it on every machine won’t scale. But there are alternatives:

  1. You could write the status json back into fdb with a timestamp of the last update (see the sketch after this list). The drawback is that you won’t be able to fetch anything if fdb is down (though it’s unclear whether you need to…). Doing a get request every 5 seconds or so from each machine should work even for large clusters; if not, you can write multiple copies.
  2. You could write the result of status json to a third service (like S3). This has the drawback that you need to rely on another service.
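
A minimal sketch of option 1 in Go, assuming hypothetical keys health/status and health/updated_at and API version 710. A small number of pods would run refresh on a timer; every node’s probe runs check:

    package main

    import (
        "errors"
        "fmt"
        "os"
        "time"

        "github.com/apple/foundationdb/bindings/go/src/fdb"
    )

    const (
        statusKey  = "health/status"     // hypothetical key for the cached status json
        updatedKey = "health/updated_at" // hypothetical key for the last-update timestamp
    )

    // refresh runs on a small number of pods: fetch status json, cache it in fdb.
    func refresh(db fdb.Database) error {
        _, err := db.Transact(func(tr fdb.Transaction) (interface{}, error) {
            status := tr.Get(fdb.Key("\xff\xff/status/json")).MustGet()
            tr.Set(fdb.Key(statusKey), status)
            tr.Set(fdb.Key(updatedKey), []byte(time.Now().UTC().Format(time.RFC3339)))
            return nil, nil
        })
        return err
    }

    // check runs locally on every node: a single cheap get plus a staleness bound.
    func check(db fdb.Database, maxAge time.Duration) error {
        raw, err := db.ReadTransact(func(rtr fdb.ReadTransaction) (interface{}, error) {
            return rtr.Get(fdb.Key(updatedKey)).Get()
        })
        if err != nil {
            return err
        }
        ts, err := time.Parse(time.RFC3339, string(raw.([]byte)))
        if err != nil {
            return err
        }
        if time.Since(ts) > maxAge {
            return errors.New("cached status json is stale")
        }
        return nil
    }

    func main() {
        fdb.MustAPIVersion(710)
        db := fdb.MustOpenDefault()
        if err := check(db, time.Minute); err != nil {
            fmt.Fprintln(os.Stderr, "not ready:", err)
            os.Exit(1)
        }
        fmt.Println("ready")
    }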

One thing we do for liveness checks on the cluster is to repeatedly run a transaction to test that we can get a read version, read the database, and commit. For GRV and commit, this is usually quite informative. For reads it’s a little less so given that some keys could be unreadable while others aren’t, and you presumably wouldn’t want to test reads from everywhere.
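
In the Go bindings, such a liveness transaction could look roughly like this; the probe key and the 5-second timeout are my assumptions, not the poster’s actual implementation:

    package main

    import (
        "fmt"
        "os"
        "time"

        "github.com/apple/foundationdb/bindings/go/src/fdb"
    )

    func main() {
        fdb.MustAPIVersion(710)
        db := fdb.MustOpenDefault()

        probeKey := fdb.Key("liveness_probe") // hypothetical key owned by the probe

        _, err := db.Transact(func(tr fdb.Transaction) (interface{}, error) {
            tr.Options().SetTimeout(5000) // fail fast (ms) instead of retrying forever

            // 1. GRV: getting a read version exercises the proxies.
            if _, err := tr.GetReadVersion().Get(); err != nil {
                return nil, err
            }
            // 2. Read: one key exercises the storage read path (with the caveat
            //    above: a single key can't prove every shard is readable).
            if _, err := tr.Get(probeKey).Get(); err != nil {
                return nil, err
            }
            // 3. Write; Transact commits when we return nil, exercising the
            //    commit path end to end.
            tr.Set(probeKey, []byte(time.Now().UTC().Format(time.RFC3339)))
            return nil, nil
        })
        if err != nil {
            fmt.Fprintln(os.Stderr, "liveness check failed:", err)
            os.Exit(1)
        }
        fmt.Println("ok")
    }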

We don’t currently use Kubernetes readiness or liveness checks for the FDB processes, largely because we haven’t found a need for them. I think the most important use for them is when you have multiple processes under a single service, so that the service can figure out which processes to route traffic to, but FDB doesn’t have that kind of use case.

While you can use status JSON to get at this information, you’ll need to be careful to separate cluster-level health from process-level health. You can get process-level error information, which could be a helpful thing to expose, but it will depend on what you’re trying to do in response to the readiness and liveness checks. You could also try a basic TCP probe, but I don’t know if that would serve your needs.
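
For the TCP option, keep in mind the probe only proves the process is accepting connections. A sketch, assuming the fdbserver process listens on the conventional port 4500 (Kubernetes’ built-in tcpSocket probe does the same thing without a sidecar):

    package main

    import (
        "fmt"
        "net"
        "os"
        "time"
    )

    func main() {
        // Assumed address; substitute the pod's actual fdbserver port.
        conn, err := net.DialTimeout("tcp", "127.0.0.1:4500", 2*time.Second)
        if err != nil {
            fmt.Fprintln(os.Stderr, "tcp probe failed:", err)
            os.Exit(1)
        }
        conn.Close()
        fmt.Println("ok")
    }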

Readiness checks are also valuable for things like pod disruption budgets. AFAIU, the controller uses the readiness state of pods to determine whether or not it is safe to kill / move pods in the statefulset. They are also useful in ensuring that a rolling restart of the statefulset doesn’t proceed faster than the pods can re-join the cluster.