Today we had an issue in a three_data_hall cluster that was unbalanced across AZs due to a misconfiguration.
There were 5 TLog-eligible processes in AZ X, 2 in AZ Y, and 1 in AZ Z.
A TLog in AZ Y died, leaving the cluster unable to recover. I added another TLog in AZ Y and it recovered.
We have monitoring on the max_zone_failures_without_losing_availability metric emitted by status json, but it was never lower than 2 the entire time.
Is there a way FDB can monitor this for me, or do I need to do this myself?