Database reporting not healthy despite having all the processes running

We used to alert on .client.database_status.available, but during one incident we discovered that it still reported the database as available even though some shards were unavailable. Since then we have moved to .client.database_status.healthy as a better measure of the health of the database. However, this metric seems to be oversensitive. For instance, today we got an alert, apparently because the data mover had an error a few hours ago:

Last logged error: DataDistribution: internal_error at Sat Dec 14 19:08:55 2024

Why is the client reporting the database as not healthy if the error seems to have been corrected by DD?

I wouldn’t recommend alerting on the cluster health, as the health metric is very sensitive (see “Monitored Metrics” in the FoundationDB documentation); even a client with an outdated connection string can cause the cluster to be reported as “unhealthy”. For the replica case, I would recommend looking at the team_tracker metrics in the machine-readable status (see “Machine-Readable Status” in the FoundationDB documentation).
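As a rough illustration of that suggestion, here is a minimal Python sketch that pulls the team tracker entries out of the machine-readable status and returns the ones that are not healthy. It assumes the status document exposes them under cluster.data.team_trackers, each with a state.healthy flag, and that `fdbcli --exec 'status json'` prints plain JSON; the function names are made up for this example, so please check the field names against the status output of your own version before alerting on them.

```python
import json
import subprocess


def load_status(cluster_file=None):
    """Run `fdbcli --exec 'status json'` and return the parsed document.

    Assumes fdbcli is on PATH and prints only the JSON document.
    """
    cmd = ["fdbcli"]
    if cluster_file:
        cmd += ["-C", cluster_file]
    cmd += ["--exec", "status json"]
    out = subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
    return json.loads(out)


def unhealthy_team_trackers(status):
    """Return the team tracker entries whose state is not healthy.

    Assumed layout: status["cluster"]["data"]["team_trackers"] is a list of
    objects that each carry a "state" object with a boolean "healthy" field.
    A tracker without that field is treated as unhealthy.
    """
    trackers = status.get("cluster", {}).get("data", {}).get("team_trackers", [])
    return [t for t in trackers if not t.get("state", {}).get("healthy", False)]
```

An alert could then fire only when `unhealthy_team_trackers(load_status())` is non-empty, instead of on the overall healthy flag.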

In your case you will probably have to restart the process, as the error remains present until the process is restarted. The error is kept persistently to allow better debugging and to report the failure state until the process is intentionally restarted.
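To find out which process needs the restart, something along these lines might help. It is only a sketch: it takes an already parsed status json document and assumes that each entry under cluster.processes has an "address" and an optional "messages" list whose items carry "name" and "description" fields. The function name is hypothetical, and whether this particular DataDistribution error shows up in that list is something to verify on your cluster.

```python
def processes_with_messages(status):
    """Map process address -> list of message texts, for processes reporting any.

    Assumed layout: status["cluster"]["processes"] is a dict keyed by process
    id; each value may contain "address" and a "messages" list of objects
    with "name" and "description" fields.
    """
    result = {}
    processes = status.get("cluster", {}).get("processes", {})
    for process_id, process in processes.items():
        messages = process.get("messages", [])
        if not messages:
            continue
        # Fall back to the process id when no address is reported.
        key = process.get("address", process_id)
        result[key] = [m.get("description") or m.get("name", "") for m in messages]
    return result
```

Any address returned here is a candidate for an intentional restart once the error has been looked at.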