How to check if data is fully replicated?

I’m trying to work out which fields in the fdbcli --exec 'status json' output are useful for this.

cluster.data.state.name: I think this is “Replication health” in status output. Does “healthy” also mean data is fully replicated?

cluster.data.team_trackers.state.name: What does this mean? It takes the same set of enum values as cluster.data.state.name. What does “team” mean in FoundationDB?

cluster.data.full_replication: I think this means data is fully replicated? If so, is there any difference between this and cluster.data.state.name == 'healthy'?

cluster.recovery_state.name: Does “fully_recovered” mean the cluster is ready to accept transactions, regardless of whether the data in the cluster is healthy?

Assume we want to replace a failed machine programmatically. After we exclude the machine (via fdbcli), which field should we monitor to make sure it’s safe to terminate that instance?

Also, are there other fields to check to make sure the cluster is in a “healthy” state?

If the replication health shows healthy, then all data has the configured number of replicas.
The code related to this is defined in https://github.com/apple/foundationdb/blob/master/fdbserver/Status.actor.cpp#L1237
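
For checking this from a script, here is a minimal sketch. It assumes fdbcli is on the PATH, that `fdbcli --exec 'status json'` prints only the JSON document on stdout, and that the field path is the one named in the question (cluster.data.state.name). The helper names get_path, fetch_status and replication_is_healthy are made up for illustration and are not part of any FoundationDB API.

```python
import json
import subprocess


def get_path(doc, dotted_path, default=None):
    """Walk a nested dict along a dotted path such as 'cluster.data.state.name'."""
    node = doc
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


def fetch_status(fdbcli="fdbcli"):
    """Run `fdbcli --exec 'status json'` and parse stdout as JSON.

    Assumption: stdout holds nothing but the JSON document.
    """
    result = subprocess.run(
        [fdbcli, "--exec", "status json"],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)


def replication_is_healthy(status):
    """True only when cluster.data.state.name is reported as 'healthy'."""
    return get_path(status, "cluster.data.state.name") == "healthy"
```

A caller would use it as replication_is_healthy(fetch_status()). A missing field is treated as "not healthy" rather than raising, so that an unexpected status document fails safe.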

A team is a collection of k storage servers (where k is the replication factor). We only build a limited number of teams to host data, so that when k servers or machines fail at the same time, the probability of a data-unavailability event is very low.

(Sorry, answering with confidence what these fields mean requires reading the code, which I haven’t had time to do. I had previously typed out, but not posted, a reply to the spirit of your question. I’ll post that now and return to the rest of your question when I have time.)

I think this is the crux of your question, so I’ll start with this. You should terminate the instance when exclude tells you that it is safe to do so. The exclude command waits synchronously for data distribution to finish moving data off the failed machine, and returns once it has.

Using the example from “Removing machines from a cluster”:

fdb> exclude 1.2.3.4 1.2.3.5 1.2.3.6
Waiting for state to be removed from all excluded servers.  This may take a while.
[a long wait]
It is now safe to remove these machines or processes from the cluster.

There’s also a note that if you ^C out of an exclude, re-issuing the same exclude lets you resume the wait.
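
For automation, the same wait can be driven from a script. This is a rough sketch under a few assumptions: fdbcli is on the PATH, `fdbcli --exec` accepts the exclude command the same way it accepts 'status json', the process exits with code 0 on success, and the "safe to remove" message matches the text in the transcript above. The function names are hypothetical.

```python
import subprocess

# Message copied from the transcript above; other versions may word it differently.
SAFE_MESSAGE = "It is now safe to remove these machines or processes from the cluster."


def build_exclude_command(addresses, fdbcli="fdbcli"):
    """Build the argv list for `fdbcli --exec 'exclude <addr> ...'`."""
    if not addresses:
        raise ValueError("at least one address is required")
    return [fdbcli, "--exec", "exclude " + " ".join(addresses)]


def exclude_reported_safe(output):
    """True if fdbcli's output contains the 'safe to remove' message."""
    return SAFE_MESSAGE in output


def exclude_and_wait(addresses, fdbcli="fdbcli"):
    """Run exclude and block until fdbcli returns, which may take a long time.

    Returns True only if the command exited cleanly (assumed to be exit
    code 0) and printed the 'safe to remove' message.
    """
    result = subprocess.run(
        build_exclude_command(addresses, fdbcli),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0 and exclude_reported_safe(result.stdout)
```

If the script is interrupted, calling exclude_and_wait again with the same addresses should resume the wait, since it re-issues the same exclude.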

I’d recommend using this over the other proposed methods of determining replication health, because:

  1. It’s more specific to the particular machine, and won’t include other data distribution activities (splits, merges, etc.)
  2. It’s what our automation uses.

A “team” is a group of hosts whose size equals the replication factor. Shards of data are assigned to a team. Part of data distribution is building and maintaining these teams. I’m sure @mengxu would be happy to answer any questions you have about team building :wink:

EDIT: turns out he already did
