Max Tolerable Zone Failures for Availability and Data

We have one FDB cluster with the following status:

Cluster:
  FoundationDB processes - 331
  Zones                  - 19
  Machines               - 211
  Memory availability    - 135.2 GB per process on machine with least available
  Retransmissions rate   - 2 Hz
  Fault Tolerance        - 0 zones (2 without data loss)
  Server time            - 06/23/20 01:18:48

I am a little confused by the “Fault Tolerance” line and would like some advice on how to read the values and why such a combination of values occurs.

Is the following reading correct?

  • The first number (“0 zones”) is for “cluster fault tolerance max zone failures without losing availability”
  • The second number in parens (“2 without data loss”) is for “cluster fault tolerance max zone failures without losing data”

Please explain losing availability vs. losing data. Under what circumstances would such a combination occur? What does it mean from a DB operations point of view?

Why is there such a big difference in fault tolerance between losing availability and losing data (0 zones vs 2 zones)?

For the availability case, my hypothesis is that it is related to the ratio of the number of machines to the number of zones in the cluster. (In our deployment, “zone” is mapped to “rack”.) When the ratio is smaller, i.e., fewer machines per zone, the allowed zone failures would be higher. Is this the case?

Is there a formula or rule of thumb for this ratio that would give us some assurance of tolerating at least 1 zone failure (for availability and/or data)?

For comparison, we have another cluster with the following status:

Cluster:
  FoundationDB processes - 141
  Zones                  - 23
  Machines               - 94
  Memory availability    - 140.0 GB per process on machine with least available
  Retransmissions rate   - 144 Hz
  Fault Tolerance        - 2 zones
  Server time            - 06/23/20 19:10:05

I assume the single value of “2 zones” above applies to both availability and data.

In this second cluster, we have fewer machines (94) than the first cluster (211) and more zones (23) than the first one (19), so the machines-per-zone ratio is smaller (94/23 ≈ 4) than in the first cluster (211/19 ≈ 11).

Thanks.


Let’s ask the question: what is the minimum number of zones we must lose to cause data loss?
It is the smallest of these: the tLog replication factor (tLog_replica number of tLogs), OR a majority of the coordinators, OR the storage server replication factor (replica number of SSes).

So the maximum number of failed zones we can tolerate without losing data is the minimum number above minus one.
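As a rough illustration of the rule above, here is a minimal Python sketch. The function name and parameters are hypothetical (they are not FDB APIs), and it assumes every tLog replica, storage replica, and coordinator sits in its own zone.

```python
def max_zone_failures_without_data_loss(tlog_replicas, storage_replicas, num_coordinators):
    """Hypothetical helper: zone failures tolerable without losing data.

    Assumes each replica and each coordinator lives in a distinct zone.
    """
    # Losing a majority of the coordinators is one way to lose data.
    coordinator_majority = num_coordinators // 2 + 1
    # Minimum number of zones whose loss can cause data loss.
    min_zones_to_lose_data = min(tlog_replicas, storage_replicas, coordinator_majority)
    # We can tolerate one fewer than that.
    return min_zones_to_lose_data - 1


# Example: 3 tLog replicas, 3 storage replicas, 5 coordinators.
print(max_zone_failures_without_data_loss(3, 3, 5))
```

With 3 tLog replicas, 3 storage replicas, and 5 coordinators (illustrative values, not taken from the clusters in the question), this gives 2, which would be consistent with a “2 without data loss” reading.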

About availability: in a multi-region configuration, we need to replicate data to the satellite for every commit.
For example, say the satellite’s tLog replication factor is 2 and it has only 2 zones. If any zone in the satellite fails, FDB won’t be able to recruit new tLogs to satisfy the tLog replication factor, so commits won’t succeed. Let’s define the number of extra tLog zones in a satellite (i.e., extraEligibleTLogZones) as the number of available zones in the satellite minus the tLog replication factor.

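The number of zone failures it takes to lose availability is the minimum extraEligibleTLogZones across satellites, plus 1. So the number of failed zones we can tolerate is that minimum extraEligibleTLogZones itself (0 in the example above, where a single zone failure already blocks commits).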

The related code is at https://github.com/apple/foundationdb/blob/e10704fd76138c6519a4bf743d7698b43f9eb9df/fdbserver/Status.actor.cpp#L1987-L1990
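To make the satellite example concrete, here is a small Python sketch of the availability side. It is not the FDB implementation: the function names are hypothetical, and it reads the example as "extraEligibleTLogZones failures are tolerated, and one more causes unavailability". Capping the result by the data-loss tolerance and flooring it at 0 are also assumptions of this sketch.

```python
def extra_eligible_tlog_zones(available_zones, tlog_replication_factor):
    """Zones left over after placing one tLog replica per zone."""
    return available_zones - tlog_replication_factor


def max_zone_failures_without_losing_availability(satellites, max_without_data_loss):
    """Hypothetical helper.

    satellites: list of (available_zones, tlog_replication_factor) pairs.
    max_without_data_loss: zone failures tolerable without data loss.
    """
    # The tightest satellite decides how many zone failures still allow commits.
    extra = min(extra_eligible_tlog_zones(z, r) for z, r in satellites)
    # Assumption: never report more than the data-loss tolerance, and never below 0.
    return max(min(extra, max_without_data_loss), 0)


# Satellite with 2 zones and tLog replication factor 2: no spare zone.
print(max_zone_failures_without_losing_availability([(2, 2)], 2))
```

For the satellite described above (2 zones, replication factor 2) this gives 0, which would match a status line of “0 zones (2 without data loss)”.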

(BTW, this is a really good question.)