Today we had an issue in a three_data_hall cluster that was unbalanced across AZs due to a misconfiguration.
There were 5 TLog-eligible processes in AZ X, 2 in AZ Y, and 1 in AZ Z.
A TLog in AZ Y died, leaving the cluster unable to recover. I added another TLog process in AZ Y and it recovered.
We have monitoring on the max_zone_failures_without_losing_availability metric emitted by status json,
but it was never lower than 2 the entire time.
Is there a way FDB can detect this kind of per-AZ TLog imbalance for me, or do I need to monitor it myself?
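
If I do end up doing this myself, here is a minimal sketch of the check I have in mind. It counts TLog-eligible processes per data hall from parsed status json output and flags any hall below a threshold. Several things here are assumptions on my part, not confirmed FDB behavior: that the class_type values "log" and "transaction" are the right definition of TLog-eligible, that the data hall is found under each process's locality as data_hall, and that excluded processes should be skipped. The function names, the min_per_hall threshold, and expected_halls are hypothetical.

```python
from collections import Counter

# Assumption: these class_type values are what counts as "TLog-eligible".
TLOG_CLASSES = {"log", "transaction"}


def tlog_counts_by_hall(status, locality_key="data_hall"):
    """Count TLog-eligible processes per data hall in a parsed status document."""
    counts = Counter()
    processes = status.get("cluster", {}).get("processes", {})
    for proc in processes.values():
        # Assumption: excluded processes should not count toward capacity.
        if proc.get("excluded"):
            continue
        if proc.get("class_type") not in TLOG_CLASSES:
            continue
        hall = proc.get("locality", {}).get(locality_key, "unknown")
        counts[hall] += 1
    return counts


def halls_below(counts, min_per_hall, expected_halls=()):
    """Return {hall: count} for every hall with fewer than min_per_hall processes.

    expected_halls lets a hall with zero eligible processes still be reported.
    """
    halls = set(counts) | set(expected_halls)
    return {h: counts.get(h, 0) for h in halls if counts.get(h, 0) < min_per_hall}


def example_status():
    """Build a fake status document matching the 5 / 2 / 1 layout described above."""
    layout = {"X": 5, "Y": 2, "Z": 1}
    processes = {}
    for hall, n in layout.items():
        for i in range(n):
            processes[f"{hall}-{i}"] = {
                "class_type": "log",
                "excluded": False,
                "locality": {"data_hall": hall},
            }
    return {"cluster": {"processes": processes}}
```

The input would be the output of fdbcli --exec 'status json' passed through json.loads. With the layout from this incident and a threshold of 3, the check would have flagged AZ Y and AZ Z before the TLog died. The right threshold depends on the redundancy mode, so I left it as a parameter.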