Database reporting not healthy despite having all the processes running

mpatou_openai · December 14, 2024, 8:52pm

We used to alert on .client.database_status.available, but during one incident we discovered that it still reported the database as available even though some shards were unavailable. Since then we have moved to .client.database_status.healthy as a better measure of database health. However, this metric seems to be oversensitive. For instance, today we got an alert, apparently because the datamover had an error a few hours ago:

Last logged error: DataDistribution: internal_error at Sat Dec 14 19:08:55 2024

Why is the client reporting the database as not healthy when the error seems to have already been corrected by DD?

johscheuer · December 20, 2024, 2:56pm

I wouldn’t recommend alerting on the cluster health, as the health metric is very sensitive (see "Monitored Metrics" in the FoundationDB documentation); even a client with an outdated connection string can cause the cluster to be reported as “unhealthy”. For the replica case, I would recommend looking at the team_tracker metrics in the machine-readable status (see "Machine-Readable Status" in the FoundationDB documentation).

In your case you probably have to restart the process, as the error will remain present until the process is restarted. The error is kept on purpose, to allow better debugging and to keep reporting the failure state until the process is intentionally restarted.
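
To illustrate the team_tracker suggestion, here is a minimal Python sketch of a replica-based check. It assumes the status document has already been obtained (for example from `fdbcli --exec 'status json'`) and parsed into a dict. The field names `cluster.data.team_trackers`, `primary` and `state.min_replicas_remaining` are my reading of the machine-readable status schema and should be verified against your FoundationDB version; the function name `replication_alerts` and the `min_replicas` threshold are hypothetical.

```python
from typing import Any, Dict, List


def replication_alerts(status: Dict[str, Any], min_replicas: int = 1) -> List[str]:
    """Return alert messages derived from a parsed `status json` document.

    Assumed layout (verify against your FoundationDB version):
      status["client"]["database_status"]["available"]  -> bool
      status["cluster"]["data"]["team_trackers"]         -> list of trackers,
        each with "primary" (bool) and "state"["min_replicas_remaining"] (int)
    """
    alerts: List[str] = []

    # Keep the coarse availability check as a separate signal.
    db_status = status.get("client", {}).get("database_status", {})
    if not db_status.get("available", False):
        alerts.append("database unavailable")

    trackers = status.get("cluster", {}).get("data", {}).get("team_trackers", [])
    if not trackers:
        # Without trackers we cannot say anything about replication.
        alerts.append("no team_trackers reported")

    for tracker in trackers:
        region = "primary" if tracker.get("primary") else "remote"
        remaining = tracker.get("state", {}).get("min_replicas_remaining")
        # Alert only when some data has fewer replicas left than the threshold.
        if remaining is not None and remaining < min_replicas:
            alerts.append(f"{region}: min_replicas_remaining={remaining}")

    return alerts


# Hand-written sample, not real cluster output.
sample = {
    "client": {"database_status": {"available": True, "healthy": False}},
    "cluster": {
        "data": {
            "team_trackers": [
                {"primary": True, "state": {"healthy": False, "min_replicas_remaining": 0}}
            ]
        }
    },
}
print(replication_alerts(sample))
```

With this shape, a cluster that is "available" but has lost all replicas of some data still raises an alert, while a stale process error or an outdated client connection string on its own does not.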

mpatou_openai · February 15, 2025, 7:57pm

We used to alert on the database not being available, but we found that some ranges could be unavailable while this metric still reported the database as available.
I will have a look at the team_tracker.
