Metric equivalent to "max zone failures without losing availability" for tlogs?

ryanworl · April 2, 2021, 10:02pm

Today we had an issue in a three_data_hall cluster that was unbalanced across AZs due to a misconfiguration.

There were 5 TLog-eligible processes in AZ X, 2 in AZ Y, and 1 in AZ Z.

A TLog in AZ Y died, leaving the cluster unable to recover. After I added another TLog process in AZ Y, it recovered.

We have monitoring on the max_zone_failures_without_losing_availability metric emitted by status json, but it never dropped below 2 during the incident.

Is there a way FDB can monitor this for me, or do I need to do this myself?
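
For reference, here is a rough sketch of what the do-it-yourself check might look like, working from parsed `status json` output. It assumes each entry under `cluster.processes` carries `class_type` and a `locality` map with a `data_hall` key; the set of classes treated as TLog-eligible, the thresholds (2 eligible processes in each of 2 data halls, which is my understanding of what three_data_hall needs for logs), and the function names are all my own assumptions. It also counts processes rather than distinct zones, so it is only an approximation of the real replication policy.

```python
from collections import Counter

# Assumption: which process classes count as TLog-eligible for this check.
TLOG_ELIGIBLE_CLASSES = {"log", "transaction"}


def tlog_eligible_per_locality(status, locality_key="data_hall",
                               eligible=TLOG_ELIGIBLE_CLASSES):
    """Count TLog-eligible, non-excluded processes per locality value."""
    counts = Counter()
    processes = status.get("cluster", {}).get("processes", {})
    for proc in processes.values():
        if proc.get("excluded"):
            continue
        if proc.get("class_type") in eligible:
            where = proc.get("locality", {}).get(locality_key, "unknown")
            counts[where] += 1
    return counts


def survives_single_loss(counts, per_hall=2, halls_needed=2):
    """True if, after losing any one eligible process, at least
    `halls_needed` localities still have `per_hall` eligible processes.
    Both thresholds are assumptions, not values read from FDB."""
    for victim in counts:
        remaining = dict(counts)
        remaining[victim] -= 1
        ok_halls = sum(1 for n in remaining.values() if n >= per_hall)
        if ok_halls < halls_needed:
            return False
    return True


def make_status(layout):
    """Build a minimal fake status document for trying the check out."""
    procs = {}
    for hall, n in layout.items():
        for i in range(n):
            procs[f"{hall}-{i}"] = {
                "class_type": "log",
                "locality": {"data_hall": hall},
            }
    return {"cluster": {"processes": procs}}


# The layout from this incident: 5 in X, 2 in Y, 1 in Z.
incident = tlog_eligible_per_locality(make_status({"X": 5, "Y": 2, "Z": 1}))
print(dict(incident), survives_single_loss(incident))
```

In practice the input would come from something like `json.loads` on the output of `fdbcli --exec 'status json'`. On the layout above, the check flags the cluster as unable to tolerate a single TLog loss, which is what actually happened.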
