6.0.18 Reporting incorrect Fault Tolerance via fdbcli with triple redundancy mode?

(Ray Jenkins) #1

Hi All,

I noticed today our production cluster on 6.0.18 is reporting via fdbcli status

Fault Tolerance - 0 machines (1 without data loss)

Unsure why, as it is provisioned for triple and has sufficient capacity to handle 2 machine failures, I ran an experiment in staging and I’m seeing the same results. Our cluster is deployed in AWS and uses 3 distinct ASGs. We assign a process class to the nodes per ASG, either transaction, stateless, or storage, (per previous configuration guidelines discussed here, primarily from @alexmiller’s posts). Cluster tuning cookbook / Questions about process classes, recruitment behavior and cluster configuration

Here is a rough timeline of my steps in order to reproduce.

  1. Boot 3 node cluster (1 txn log nodes, 1 storage node, 1 stateless node), sdd engine, single redundancy mode. fdbcli status reports Fault Tolerance of 0 Machines, as expected.
  2. Scale cluster up to 9 nodes (3 txn, 3 storage, 3 stateless), configure double, re-coordinate with coordinators auto. After cluster heals to achieve the double replica config, fdbcli status reports Fault Tolerance - 1 machine. At this point I increased the number of logs to 6 and proxies to 9 with configure logs=6 and configure proxies=9 as I had extra processes running on these new boxes that weren’t yet recruited. Fault Tolerance showed 1 machine as expected. Great, everything is working fine at this point!
  3. Scale cluster up to 15 nodes (5 txn, 5 storage, 5 stateless), configure triple, re-coordinate with coordinators auto. fdbcli coordinators reports 5 coordinators as expected. After cluster heals to support triple replica fdbcli status shows. Fault Tolerance - 0 machines (1 without data loss).

At this point I reduced the number of txn logs and proxies to 3 just to see if that made a difference but we’re still showing Fault Tolerance - 0 machines (1 without data loss).

Is it possible this is a known issues in the fdbcli status output? Here’s some additional configuration info.

  Redundancy mode        - triple
  Storage engine         - ssd-2
  Coordinators           - 5
  Desired Proxies        - 3
  Desired Logs           - 3

  FoundationDB processes - 50
  Machines               - 15
  Memory availability    - 1.8 GB per process on machine with least available
                           >>>>> (WARNING: 4.0 GB recommended) <<<<<
  Fault Tolerance        - 0 machines (1 without data loss)
  Server time            - 04/04/19 18:52:01

  Replication health     - Healthy (Rebalancing)
  Moving data            - 0.000 GB
  Sum of key-value sizes - 56.471 GB
  Disk space used        - 306.350 GB

Operating space:
  Storage server         - 753.7 GB free on most full server
  Log server             - 420.2 GB free on most full server

  Read rate              - 31 Hz
  Write rate             - 96 Hz
  Transactions started   - 21 Hz
  Transactions committed - 17 Hz
  Conflict rate          - 0 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Client time: 04/04/19 18:52:01

fdbtop output
— snip —

ssh "fdbcli --exec 'status json'" | fdbtop
ip               port    cpu%  mem%  iops    net  class          roles               
---------------  ------  ----  ----  ------  ---  -------------  --------------------        4500    0     2     -       0    stateless                         
                  4501    0     2     -       0    stateless                         
                  4502    1     2     -       0    stateless      proxy              
                  4503    0     2     -       0    stateless                         
---------------  ------  ----  ----  ------  ---  -------------  --------------------    4500    0     2     -       0    stateless                         
                  4501    3     3     -       1    stateless      cluster_controller 
                  4502    1     2     -       0    stateless                         
                  4503    1     2     -       0    stateless      resolver           
---------------  ------  ----  ----  ------  ---  -------------  --------------------    4500    1     3     -       0    transaction                       
                  4501    2     5     27      1    transaction    log                
---------------  ------  ----  ----  ------  ---  -------------  --------------------     4500    0     2     -       0    stateless                         
                  4501    2     3     -       1    stateless      master             
                  4502    0     2     -       0    stateless                         
                  4503    0     2     -       0    stateless                         
---------------  ------  ----  ----  ------  ---  -------------  --------------------     4500    2     33    892     0    storage        storage            
                  4501    3     33    888     0    storage        storage            
                  4502    2     32    853     0    storage        storage            
                  4503    2     33    853     0    storage        storage            
---------------  ------  ----  ----  ------  ---  -------------  --------------------     4500    2     30    239     0    storage        storage            
                  4501    2     32    231     0    storage        storage            
                  4502    1     31    231     0    storage        storage            
                  4503    2     31    239     0    storage        storage            
---------------  ------  ----  ----  ------  ---  -------------  --------------------     4500    3     4     27      1    transaction    log                
                  4501    1     4     -       0    transaction                       
---------------  ------  ----  ----  ------  ---  -------------  --------------------     4500    1     3     -       0    stateless                         
                  4501    0     2     -       0    stateless                         
                  4502    1     3     -       0    stateless                         
                  4503    1     3     -       0    stateless                         
---------------  ------  ----  ----  ------  ---  -------------  --------------------      4500    3     34    1191    0    storage        storage            
                  4501    3     33    1160    0    storage        storage            
                  4502    3     33    1201    0    storage        storage            
                  4503    3     34    1160    0    storage        storage            
---------------  ------  ----  ----  ------  ---  -------------  --------------------      4500    2     3     -       0    transaction                       
                  4501    1     3     -       0    transaction                       
---------------  ------  ----  ----  ------  ---  -------------  --------------------     4500    2     31    413     0    storage        storage            
                  4501    1     31    413     0    storage        storage            
                  4502    1     31    364     0    storage        storage            
                  4503    2     32    413     0    storage        storage            
---------------  ------  ----  ----  ------  ---  -------------  --------------------     4500    2     3     -       0    transaction                       
                  4501    1     4     -       0    transaction                       
---------------  ------  ----  ----  ------  ---  -------------  --------------------      4500    2     32    1013    0    storage        storage            
                  4501    3     32    1024    0    storage        storage            
                  4502    2     33    1024    0    storage        storage            
                  4503    2     33    1024    0    storage        storage            
---------------  ------  ----  ----  ------  ---  -------------  --------------------      4500    2     4     -       0    transaction                       
                  4501    2     4     27      0    transaction    log                
---------------  ------  ----  ----  ------  ---  -------------  --------------------      4500    1     3     -       0    stateless                         
                  4501    1     3     -       0    stateless      proxy              
                  4502    1     3     -       0    stateless                         
                  4503    1     4     -       0    stateless      proxy              


(A.J. Beamon) #2

Any chance you could send over the output of status json as well?

(Ray Jenkins) #4

@ajbeamon I believe it’s too large to post on the forum, the form will reject. Is there a particular part you want me to snip, or perhaps attach in an email?

(A.J. Beamon) #5

It looks like this is happening because there are only 3 Zone IDs being used in the cluster. The Zone ID represents different fault domains in the cluster, which we use for distributing data that needs to be sufficiently failure independent. For example, the different replicas of each piece of data must live in separate fault domains. The fault tolerance calculation is actually counting how many fault domains you can lose, and I believe the reason you would lose availability after 1 fault domain failure is that the cluster wouldn’t be able to recruit logs in the required three different fault domains.

That status describes your fault tolerance in terms of machines is a relic of the fact that in the past, our only fault domain was at the machine level (although you could trick it by manually assigning machine IDs). I think we need to clean up the language around this area a bit. Additionally, I think we should log additional information about zone counts in the status output, which could help to identify these issues more easily. I’ve recently filed a GitHub issue to that effect.

As for how to resolve your current issue, I think you’ll need to choose a different unit to serve as your fault domain, such as machines or racks, etc. Or you could add more of your current zones if that made sense.

If you need to maintain certain properties with respect to how the cluster distributes data between your current zones such that you can’t easily use a finer grained zone, it may be possible to make this work with a multi-datacenter mode. There have been a variety of changes in this area recently, and I’m admittedly not an expert on the state-of-the-art here, so maybe @alexmiller or @Evan could provide more guidance on that.

(A.J. Beamon) #6

I also created https://github.com/apple/foundationdb/issues/1421 to update the language we use when reporting fault tolerance.

(Ray Jenkins) #7

@ajbeamon Thank you for the explanation! When I built the tooling for the cluster I added locality_zoneid to the build process and selected the AWS availability zone to use for my locality_zoneid. I did this primarily to ensure when set to triple my data would be properly distributed across storage replicas in 3 distinct AZs. This is to avoid the risk of all replicas landing in a single AZ. It is quite common to deploy across 3 AZs and distribute your replicas across them for safety, we do this with Kafka and Cassandra, but I did not consider the txn nodes. Now that I think about it, that makes sense.

It is a bit more nuanced though than what you described, is it not? i.e. I believe the reason you would lose availability after 1 fault domain failure is that the cluster wouldn’t be able to recruit logs in the required three different fault domains.

Let’s say we have 5 txn nodes in the following 3 fault domains. [a,b,c,a,b]. Now as long as we write to at least 3 distinct fault domains [a,b,c] we can sustain a loss of up to two fault domains and not lose any data, correct?

As far as fulfilling the triple replication contract after failure, if we lose c : [a,b,a,b] remains, two b instances : [a,c,a] remains or two a instances : [b,c,b] remains, then we can no longer sustain the triple replica contract.

However if we lose a single a or b, or even both an a and b we are left with [ a,b,c ] and that still fulfills triple replication, correct?

(Note: Edit to s/failure domains/fault domains/)

(A.J. Beamon) #8

I believe this is where the multi-datacenter modes may be able to help, as they can ensure your data is replicated to multiple AZs.

I think this is mostly correct, as you’ll should still have a copy of 1 replica of data on the storage servers and logs. The reason data fault tolerance is reported as 1, however, is because of the coordinators. I don’t remember all of the ways that you can run into trouble here, but for example it’s possible some failure scenarios could have resulted in a network partition that allowed commits on the other AZs before they disappeared. There may also be potential issues with the state stored on the coordinators if you don’t have a quorum of them available, but I’m not sure of the details there.

This is true, though our reporting for fault tolerance isn’t so granular at the moment – it’s currently only able to tell you how many fault domains you can lose safely. It’s possible as we get into more elaborate configurations that we would want a more expressive way to describe what the fault tolerance is.

(Ray Jenkins) #9

I’d be interesting in hearing more details about your ideas here @ajbeamon. I was originally under the impression that specifying --datacenter_id was for multi-regional clusters as described here https://apple.github.io/foundationdb/configuration.html#configuring-regions.

Are you perhaps referring to the data-center-aware mode? https://apple.github.io/foundationdb/configuration.html#datacenter-aware-mode and in particular setting configure three_datacenter?

It looks like I’d also need to specify --datacenter_id to use that per the docs.

When using the datacenter-aware mode, all fdbserver processes should be passed a valid datacenter identifier on the command line.

fdb> configure three_datacenter
ERROR: Not enough processes exist to support the specified configuration
Type `configure FORCE *’ to configure without this check

One thing I am not clear on is how does that differs and interacts with locality_zoneid. The only obvious difference from the docs are triple mode require 3 active fault domains to continue to be available for writes.

triple mode
(best for 5+ machines)

FoundationDB replicates data to three machines, and at least three available machines are required to make progress

While three_datacenter (somewhat self contradicting) says it attempts to replicate across two datacenters and stays up if only two are available. It then goes on to say data is triple replicated.

FoundationDB attempts to replicate data across two datacenters and will stay up with only two available. Data is triple replicated. For maximum availability, you should use five coordination servers: two in two of the datacenters and one in the third datacenter.

So is that essentially saying it’s triple replicated, but will remain available for writes as long as two datacenters are still available?

(Evan Tschannen) #10

I actually think the configuration the best matches what you are trying to do is the three_data_hall configuration. The AZs will be specified as the “data_hall” locality attribute.

This will require one storage replica in each AZ, similar to what you have now. However, it only requires transaction logs in only two of the three AZs, so that way even if one AZ is down the cluster will still be able to recruit a new generation of transaction logs. Note that this configuration have four total replicas of the transaction log data, so you many need to configuration slightly more processes as transaction logs to keep the same performance as before.

I just looked and realized we do not have a corresponding three_data_hall_fallback configuration, I will add it as a patch for 6.1. You will need this to discard the data is a failed AZ if the failure lasts long enough that the TLogs become low on space.

(Ray Jenkins) #11

Thanks @Evan and @ajbeamon . I ran an experiment this morning and I’m curious if this might be a viable work around for us to increase availability while still getting triple replica distribution across the storage nodes. I removed the locality_zoneid from the txn nodes, restarted fdbserver on those machines and now the fdbcli is showing.

Fault Tolerance - 2 machines

I’ve left locality_zoneid on the stateless and storage nodes.

Of course, I’m now giving up the ability to ensure these writes are getting replicated across AZs or fault domains on the transaction log nodes. Without the locality_zoneid it’s just reverting back to machine_id and any write could for example, land on 3 txn nodes all in a single AZ, correct?

As you mention in https://www.youtube.com/watch?v=EMwhsGsxfPU around the nine minute mark, the txn nodes are distributed WALs. Under normal operation you’re fsync’ing and then replicating out to the storage nodes from page cache very quickly, and then cleaning up.

Assuming I’d be willing to take the risk of non fault domain/AZ-aware transaction log replication and the potential consequence of data loss in the event of losing an AZ, i.e. some set of transactions were replicated across 3 transaction logs nodes in a single AZ, do you foresee any potential issues with this configuration?

My biggest concern is that I’d be introducing some configuration that has never been tested on FoundationDB, i.e. some portion of machines with and without locality_zoneid, and there are some hidden gotchas.

However, if you believe this is a viable configuration option, this may be a trade-off we’d consider. Our data is being fed to FoundationDB from Kafka and we already have triple replication there. Further our workload is idempotent, so in the event of such a failure we could restore FoundationDB and simply replay the data from Kafka.

(Alex Miller) #12

The policy for three_data_hall mode is:

  • For a team of storage servers, please recruit 1 storage server per datahall
  • For transaction logs, please ensure commits are on at least 2 different zoneids in each of 2 different datahalls (minimum 4 logs total)

So --locality_data_hall is taking the place of --locality_zoneid in ensuring distribution across AZs. If you had the concept of a rack of machines within your AZ, then it’d be appropriate to set zoneids according to the rack to help prevent the loss of both transaction logs in an AZ at once.

To be clear, with --locality_data_hall set to the AZ a process is in, you’re still getting replication across AZ’s, and the ability to survive the loss of one, but not two, AZs.

(Ray Jenkins) #13

Hiya @alexmiller thank you for the reply.

Yes I understand the path here Evan and you are describing with the locality_data_hall replacing the use of locality_zoneid on the machines. My reply is actually in regards to a different configuration entirely, that is, using triple and simply removing locality_zoneid on the transaction log nodes. I’m curious if that configuration I describe is a viable one, given the trade-offs, and as a potential interim solution until Evan includes the patch he references in 6.1.

In terms of moving to the three_data_hall approach that’s definitely on the table for us to consider, it seems like that might be the best approach long term.

Do you have any additional thoughts on the configuration I described above?

(Alex Miller) #14

Ah, my apologies. I indeed cannot read, and maybe I shouldn’t respond to forum requests before the morning caffeine…

This is entirely correct.

The path of machineid becoming zoneid is well tested; I think it’s what simulation actually does. I wouldn’t be concerned about running in this way, as most people that know to set --locality_zoneid would also be running in this way.

If you’re fine potentially having to wipe and recreate your cluster if an AZ goes down, then I agree that this is a configuration you could consider.

(Ryan Worl) #15

It may be worth investigating whether the partition mode of EC2 placement groups would let you manufacture more fault domains per AZ.

(Evan Tschannen) #16

Even without a three_data_hall_fallback configuration, you could still configure double to recover if a AZ failed for a long period of time. It is just going allow putting both replicas in one AZ until you are able to configure back to three_data_hall.

This might be acceptable as a short term option since the only reason you would be doing this configuration change is because one AZ has failed for more than a few hours.