Fault Tolerance changes from "2 machines" to "0 machines (2 without data loss)"

(binzhang) #1

We have an FDB cluster configured across three data centers (two usable regions).

With usable_regions=1, we see “Fault Tolerance - 2 machines”.
After changing to usable_regions=2, we start to see “Fault Tolerance - 0 machines (2 without data loss)”.

What could cause the fault tolerance to drop to 0 machines?

fdb> status details

Using cluster file `/var/lib/foundationdb/fdb.cluster'.

Redundancy mode - triple
Storage engine - ssd-2
Coordinators - 5
Desired Proxies - 4

FoundationDB processes - 48
Machines - 36
Memory availability - 163.2 GB per process on machine with least available
Retransmissions rate - 2 Hz
Fault Tolerance - 0 machines (2 without data loss)
Server time - 05/30/19 00:32:27

Replication health - Healthy (Rebalancing)
Moving data - 0.015 GB
Sum of key-value sizes - 1.405 GB
Disk space used - 11.000 GB

(Alex Miller) #2

(Sorry, I had half a reply typed out to you, meant to double check with Evan, and then forgot.)

Region-ifying a cluster is supposed to be a three-step process:

  1. usable_regions=1 regions=[...]
  2. usable_regions=2 regions=[...] where the datacenter that doesn’t yet have a full copy of the database is set to a priority of -1 in regions.
  3. usable_regions=2 regions=[...] where regions now has a >=0 priority for both datacenters.
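
As a rough illustration of how the regions value differs between Step 2 and Step 3, here is a minimal Python sketch that builds the two JSON documents. The datacenter ids `dc1` and `dc2` and the helper `build_regions` are hypothetical, satellites are left out entirely, and the JSON shape (`regions` / `datacenters` / `id` / `priority`) is an assumption based on the regions configuration documentation, so check it against your version before feeding anything to fdbcli.

```python
import json


def build_regions(primary_dc, remote_dc, remote_has_full_copy):
    """Build a regions document for a two-region cluster (hypothetical helper).

    The remote datacenter gets priority -1 until it holds a full copy of the
    database (Step 2), and a priority >= 0 afterwards (Step 3).
    """
    remote_priority = 0 if remote_has_full_copy else -1
    return {
        "regions": [
            # Primary region: always eligible to host the active database.
            {"datacenters": [{"id": primary_dc, "priority": 1}]},
            # Remote region: only the priority changes between Step 2 and Step 3.
            {"datacenters": [{"id": remote_dc, "priority": remote_priority}]},
        ]
    }


# "dc1" and "dc2" are placeholder ids, not taken from the cluster in this thread.
step2 = build_regions("dc1", "dc2", remote_has_full_copy=False)
step3 = build_regions("dc1", "dc2", remote_has_full_copy=True)

print(json.dumps(step2, indent=2))
print(json.dumps(step3, indent=2))
```

The only difference between the two outputs is the remote datacenter's priority, which is the change that should wait until the remote side has caught up.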

What you’re seeing would make sense to me if you went straight from Step 1 to Step 3, as the remote side would be down to a fault tolerance of 0, but the primary still has copies of the data.

It’s also possible that you only have three zoneids in the remote DC, so if you lose one machine in the remote DC, you wouldn’t be able to recover to it?
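
To make the zoneid point concrete: triple redundancy wants three replicas in three distinct zones, so with exactly three zones there is no spare zone to re-replicate into after a loss. The sketch below is only that arithmetic, not how status actually computes its number, and `spare_zones` is a hypothetical helper.

```python
def spare_zones(zone_count, replicas=3):
    """Zones that can be lost while a full replica set still fits in distinct zones."""
    return max(0, zone_count - replicas)


print(spare_zones(3))  # 0: losing any zone leaves nowhere to place a third replica
print(spare_zones(5))  # 2
```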

…or it’s possible that there’s a bug in status.

Either way, more details on the exact steps you took and the exact layout and configuration of your cluster would be helpful. :slight_smile: