Could anyone recommend an ideal setup (nodes and processes) for achieving Fault Tolerance = 3?
I tried setting up nodes across 6 regions, each region having 2 nodes. With coordinators = auto and double mode, I got Fault Tolerance = 1. In the same setup, changing the mode to triple raised Fault Tolerance to 2. We then added one more node (7 in total), but Fault Tolerance is still 2.
This table shows the minimum number of machines for Fault Tolerance as 4. Since we have 7 machines, I expected the Fault Tolerance value to become 3.
Is there a way to achieve Fault Tolerance > 2 in triple mode? Please advise.
There was a thread that discussed some easily misread aspects of that table, which also led to some edits to its wording.
Fault Tolerance will always be, at best, Total Number of Copies - 1. So Fault Tolerance = 2 is the best that you will get with triple replication.
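To make the "copies minus one" rule concrete, here is a minimal Python sketch. The function and dictionary names are made up for illustration, and the copy counts cover only the single, double, and triple modes. It gives the best case only; the value that status actually reports can be lower, for example when there are too few machines or coordinators.

```python
# Number of copies of each key kept by the basic redundancy modes.
COPIES_BY_MODE = {"single": 1, "double": 2, "triple": 3}


def best_case_fault_tolerance(mode: str) -> int:
    """Upper bound on fault tolerance: total number of copies minus one."""
    return COPIES_BY_MODE[mode] - 1


for mode in ("single", "double", "triple"):
    print(mode, best_case_fault_tolerance(mode))
```

Adding machines beyond what the mode needs does not raise this bound, which is why going from 6 to 7 nodes left the value at 2.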
Also see Optimal configuration for more than 3 DCs and a linked thread therein for a longer discussion on FDB on more than two regions.
Note that FDB is not a quorum-based system. The “a majority of nodes is required” logic only applies to coordinators, which do form a quorum-based system. Otherwise, the number of processes involved in holding a shard of data is explicitly set as part of the configuration, as the table attempted to describe.
I think a 2-region triple configuration will give you the highest effective fault tolerance right now. If all copies are lost in one region, data distribution in multi-region understands how to copy it back across. You’d theoretically be able to lose 5 machines and FDB would still be able to recover, as long as you run 11 coordinators. I recall the status message wording being a bit weird with multi-region currently, so I think it will still show 2, but will eventually tell you “but (N) without data loss” if you start killing enough things.
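The 11-coordinator figure follows from plain majority-quorum arithmetic: a quorum of n members stays available as long as more than half are reachable. A rough sketch of that arithmetic, with hypothetical helper names (this is not an FDB API):

```python
def coordinator_failures_tolerated(coordinators: int) -> int:
    """How many coordinators can fail while a strict majority remains."""
    if coordinators < 1:
        raise ValueError("need at least one coordinator")
    return (coordinators - 1) // 2


def coordinators_needed(failures: int) -> int:
    """Smallest coordinator count that survives the given number of failures."""
    if failures < 0:
        raise ValueError("failures must be non-negative")
    return 2 * failures + 1


print(coordinators_needed(5))              # 11
print(coordinator_failures_tolerated(11))  # 5
```

This covers only the coordinators. Whether the data itself survives is governed separately by the replication mode.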
I’ve typically seen people plan for 1 planned and 1 unplanned outage, so triple replication and a fault tolerance of 2 has generally been fine for most folk. Any reason you’re extra concerned about copies of your data suddenly going missing or corrupt?
Also just in case it isn’t clear, fault tolerance here refers to the number of faults you can safely tolerate without losing data or availability. So in triple mode, you can safely tolerate two failures because there are 3 total replicas of each key in the database (temporarily ignoring other aspects of fault tolerance, such as coordinator count). Three failures could potentially cause you to lose all replicas of some of your data, so you don’t have a fault tolerance of 3.
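If it helps to see why 7 machines do not buy a third failure, here is a small brute-force sketch. The team layout is entirely made up for illustration and is not how FDB's data distribution actually picks teams. It only assumes that each shard lives on some set of 3 machines.

```python
from itertools import combinations


def all_shards_survive(teams, failed):
    """True if every team still has at least one machine that did not fail."""
    failed = set(failed)
    return all(any(m not in failed for m in team) for team in teams)


def worst_case_tolerance(machines, teams):
    """Largest k such that ANY k simultaneous machine failures lose no shard."""
    k = 0
    while k < len(machines) and all(
        all_shards_survive(teams, failed)
        for failed in combinations(machines, k + 1)
    ):
        k += 1
    return k


machines = list(range(7))
# Hypothetical placement: each tuple is the 3 machines holding one shard.
teams = [(0, 1, 2), (2, 3, 4), (4, 5, 6), (0, 3, 6)]

print(worst_case_tolerance(machines, teams))  # 2
```

Any two failures leave every team with a replica. Losing machines 0, 1, and 2 together wipes out the first team, so the guaranteed tolerance stays at 2 however many machines you add.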