ex3ndr
(Steve Korshakov)
May 18, 2021, 7:17pm
1
For some reason, just after I added locality_zoneid to each machine, status started to show a fault tolerance of 0 zones.
My setup:
triple SSD replication
5 coordinators, each in a separate zone
3 machines with 2 processes each, spread across 3 zones; all have the correct zone id set
What am I missing? Why can’t I lose any zone?
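Each machine’s foundationdb.conf has its locality set roughly like this (the zone name is a placeholder, different on each machine):

[fdbserver]
locality_zoneid = zone-1

[fdbserver.4500]
[fdbserver.4501]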
ex3ndr
(Steve Korshakov)
May 18, 2021, 8:14pm
2
I switched everything to locality_data_hall and got “Fault tolerance: 0 machines”. Then I switched to three_data_hall and it’s still the same. What could be wrong?
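For reference, what I ran from fdbcli was roughly:

configure three_data_hall
status

and the fault tolerance line I’m quoting comes from the status output.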
ex3ndr
(Steve Korshakov)
May 18, 2021, 8:21pm
3
ajbeamon
(A.J. Beamon)
May 18, 2021, 8:53pm
4
A triple replicated cluster requires three transaction logs in different zones in order to be available. If you have only three zones in your cluster, then the loss of any one zone would mean that your cluster is unavailable. As a result, your fault tolerance is zero.
If you add two additional zones, then your fault tolerance should reach the maximum value of two. You could also configure the cluster to double replication, where the maximum fault tolerance is one.
See Configuration — FoundationDB 6.3 for more information.
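For example, switching to double replication from fdbcli would look something like this (using the default cluster file path):

fdbcli -C /etc/foundationdb/fdb.cluster --exec "configure double ssd"
fdbcli -C /etc/foundationdb/fdb.cluster --exec "status"

Once data redistribution finishes, status should report a fault tolerance of one zone.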
ex3ndr
(Steve Korshakov)
May 18, 2021, 9:32pm
5
Thank you! I will try to convert one storage process to a log and check whether that works.
But the docs you linked don’t mention that triple replication requires logs in three zones to work. Also, why doesn’t fdbcli
make it clear that this is about availability rather than data integrity? Otherwise, why would losing a single log server under triple replication cause data loss? Is that really true?
UPD: Yep, I added a new log process and it’s still zero fault tolerance.
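For the new log process I just set the process class in foundationdb.conf, roughly like this (the port matches the process I repurposed):

[fdbserver.4501]
class = log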
UPD2: This is my fdbtop output:
ip port cpu% mem% iops net class roles
-------------- ------ ---- ---- ------ ----- ----------- ------------------------------------
10.138.0.100 4500 14 35 2799 55 storage storage
4501 3 34 2795 2 storage storage
4502 11 33 2798 63 storage storage
4503 16 34 2793 88 storage storage
-------------- ------ ---- ---- ------ ----- ----------- ------------------------------------
10.138.0.101 4500 4 34 3841 8 storage storage
4501 12 34 3842 52 storage storage
4502 9 34 3841 59 storage storage
4503 8 34 3841 5 storage storage
-------------- ------ ---- ---- ------ ----- ----------- ------------------------------------
10.138.0.119 4500 8 35 1623 1 storage storage
4501 22 4 1613 8 log log
-------------- ------ ---- ---- ------ ----- ----------- ------------------------------------
10.138.0.120 4500 22 35 6700 67 storage storage
4501 16 33 6877 92 storage storage
-------------- ------ ---- ---- ------ ----- ----------- ------------------------------------
10.138.0.121 4500 20 35 3932 70 storage storage
4501 2 14 3933 0 storage storage
-------------- ------ ---- ---- ------ ----- ----------- ------------------------------------
10.138.0.14 4500 15 34 3651 85 storage storage
4501 4 34 3651 3 storage storage
4502 5 34 3657 9 storage storage
4503 12 34 3648 47 storage storage
-------------- ------ ---- ---- ------ ----- ----------- ------------------------------------
10.138.0.48 4500 16 35 1317 54 storage storage
4501 17 4 1317 7 log log
-------------- ------ ---- ---- ------ ----- ----------- ------------------------------------
10.138.0.50 4500 24 41 5463 117 storage storage
4501 18 6 5503 7 log log
-------------- ------ ---- ---- ------ ----- ----------- ------------------------------------
10.138.0.88 4500 43 35 8461 238 storage storage
4501 39 35 8352 72 log log,storage
-------------- ------ ---- ---- ------ ----- ----------- ------------------------------------
10.138.0.89 4500 13 34 1305 3 storage storage
4501 3 25 1305 0 storage storage
-------------- ------ ---- ---- ------ ----- ----------- ------------------------------------
10.138.0.90 4500 7 34 1032 2 storage storage
4501 2 9 1038 1 storage storage
-------------- ------ ---- ---- ------ ----- ----------- ------------------------------------
10.138.0.91 4500 25 34 5706 96 storage storage
4501 3 34 5737 1 storage storage
-------------- ------ ---- ---- ------ ----- ----------- ------------------------------------
10.138.0.94 4500 4 5 - 5 stateless cluster_controller
4501 1 4 - 0 stateless resolver
4502 1 4 - 1 stateless resolver
4503 4 4 - 2 stateless data_distributor,master,ratekeeper
-------------- ------ ---- ---- ------ ----- ----------- ------------------------------------
10.138.0.99 4500 33 6 - 23 stateless proxy
4501 31 4 - 22 stateless proxy
4502 32 5 - 22 stateless proxy
4503 1 4 - 0 stateless resolver
ajbeamon
(A.J. Beamon)
May 18, 2021, 10:55pm
6
Can you send a copy of your status json output with the 4th log (and preferably running in triple redundancy again)?
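Something like this will capture it to a file:

fdbcli --exec "status json" > status.json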
ex3ndr
(Steve Korshakov)
May 18, 2021, 11:38pm
7
Meanwhile I upgraded the cluster to 6.3, which surfaced some extra stats about the logs, and they look OK too.
FDB 6.3.12, triple replication, 4 logs: { "client" : { "cluster_file" : { "path" : "/etc/foun - Pastebin.com
I wonder if this has something to do with the coordinators not being part of the cluster, so the fault tolerance calculation is missing their locality information?
It looks like the code assumes it can fetch the correct zone information for coordinators from the list of workers, but if a coordinator isn’t a worker, it would get a default-constructed empty string as its zone, and then all coordinators would appear to be in the same (empty) zone?
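To illustrate the pattern I suspect (this is only a sketch of the idea, not the actual FoundationDB code): if coordinator zones are looked up in a map keyed by worker address, every coordinator that isn’t a worker silently collapses into one empty zone:

#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

int main() {
    // zone id per process, built from the list of workers
    std::map<std::string, std::string> zoneOfWorker = {
        {"10.138.0.100:4500", "zone-a"},
        {"10.138.0.101:4500", "zone-b"},
    };

    // two of these coordinators are not workers (addresses are made up)
    std::vector<std::string> coordinators = {
        "10.138.0.100:4500", "10.138.0.200:4500", "10.138.0.201:4500"};

    std::set<std::string> coordinatorZones;
    for (const auto& addr : coordinators)
        // operator[] default-constructs an empty string for unknown keys,
        // so every non-worker coordinator lands in the same "" zone
        coordinatorZones.insert(zoneOfWorker[addr]);

    // prints 2 (zone-a and ""), even if the coordinators sit in 3 real zones
    std::cout << coordinatorZones.size() << "\n";
}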
ajbeamon
(A.J. Beamon)
May 19, 2021, 4:26pm
9
That seems very plausible, good catch. I’ve created an issue to address this problem: Coordinator fault tolerance calculation depends on coordinators being part of the cluster · Issue #4833 · apple/foundationdb · GitHub .
I’m not sure that there’s a great way to work around this problem except to have your coordinators be part of the cluster for now.
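For example, from fdbcli you could either point the coordinators at five processes that are in the cluster (ideally in five different zones) or let it pick for you:

coordinators 10.138.0.100:4500 10.138.0.101:4500 10.138.0.14:4500 10.138.0.48:4500 10.138.0.50:4500
coordinators auto

(the addresses above are just examples taken from your process list).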