We have one FDB cluster with the following status:
```
Cluster:
  FoundationDB processes - 331
  Zones                  - 19
  Machines               - 211
  Memory availability    - 135.2 GB per process on machine with least available
  Retransmissions rate   - 2 Hz
  Fault Tolerance        - 0 zones (2 without data loss)
  Server time            - 06/23/20 01:18:48
```
I am a little confused by the “Fault Tolerance” line and would like some advice on how to read the values and why such a value combination occurs.
Is the following reading correct?
- The first number (“0 zones”) is for “cluster fault tolerance max zone failures without losing availability”
- The second number in parens (“2 without data loss”) is for “cluster fault tolerance max zone failures without losing data”
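For context, here is how I am pulling these fields programmatically. This is a minimal sketch assuming the `status json` output exposes the two field names quoted above under `cluster.fault_tolerance`; please correct me if the actual paths differ.

```python
import json

# Fragment mimicking `fdbcli --exec 'status json'` output for our first
# cluster. The field paths are my assumption based on the names quoted
# above; verify them against your cluster's actual status json.
status = json.loads("""
{
  "cluster": {
    "fault_tolerance": {
      "max_zone_failures_without_losing_availability": 0,
      "max_zone_failures_without_losing_data": 2
    }
  }
}
""")

ft = status["cluster"]["fault_tolerance"]
print(ft["max_zone_failures_without_losing_availability"])  # 0
print(ft["max_zone_failures_without_losing_data"])          # 2
```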
Please explain losing availability vs. losing data. Under what circumstances would such a combination occur? What does it mean from a database operations point of view?
Why is there such a big difference in fault tolerance between losing availability and losing data (0 zones vs 2 zones)?
For the availability case, my hypothesis is that it is related to the ratio of the number of machines to the number of zones in the cluster. (In our deployment, “zone” is mapped to “rack”.) When the ratio is smaller, i.e., fewer machines per zone, the number of tolerable zone failures would be higher. Is this the case?
Is there a formula for the ratio (rule of thumb) that can give us some assurance to have at least 1 zone failure tolerance (for availability and/or data)?
For comparison, we have another cluster with the following status:
```
Cluster:
  FoundationDB processes - 141
  Zones                  - 23
  Machines               - 94
  Memory availability    - 140.0 GB per process on machine with least available
  Retransmissions rate   - 144 Hz
  Fault Tolerance        - 2 zones
  Server time            - 06/23/20 19:10:05
```
I assume the single value of “2 zones” above applies to both availability and data.
In this second cluster, we have fewer machines (94) than the first cluster (211) and more zones (23) than the first one (19), so the machines-per-zone ratio is smaller (94/23 ≈ 4.1) than the first cluster’s (211/19 ≈ 11.1).
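The ratio comparison above, spelled out (the cluster names are just labels I chose for this post):

```python
# Machines-per-zone ratio for the two clusters described above.
clusters = {
    "cluster_1": {"machines": 211, "zones": 19},  # Fault Tolerance - 0 zones (2 without data loss)
    "cluster_2": {"machines": 94, "zones": 23},   # Fault Tolerance - 2 zones
}

for name, c in clusters.items():
    ratio = c["machines"] / c["zones"]
    print(f"{name}: {ratio:.1f} machines per zone")
```

So if my hypothesis is right, cluster_2's much smaller ratio would explain its better availability fault tolerance.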