How to check if data is fully replicated?

rayjcwu · July 10, 2019, 9:06pm

I’m checking whether there is anything else valuable in the fdbcli --exec 'status json' output.

cluster.data.state.name: I think this is “Replication health” in status output. Does “healthy” also mean data is fully replicated?

cluster.data.team_trackers.state.name: What does this mean? It has the same enum values as cluster.data.state.name. What does “team” mean in FoundationDB?

cluster.data.full_replication: I think this means data is fully replicated? If so, is there any difference between this and cluster.data.state.name == 'healthy'?

cluster.recovery_state.name: Does “fully_recovered” mean the cluster is ready to accept transactions, regardless of whether the data in the cluster is healthy?

Assume we want to replace a failed machine programmatically. After we exclude a machine (via fdbcli), what field should we monitor to make sure it’s safe to terminate that instance?

Also, are there other fields to check to make sure the cluster is in a “healthy” state?

mengxu · July 11, 2019, 10:03pm

If the replication health shows healthy, then all data has the configured number of replicas.
The code related to this is defined in https://github.com/apple/foundationdb/blob/master/fdbserver/Status.actor.cpp#L1237
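As a rough illustration of that check, the sketch below reads a status json document and reports whether cluster.data.state.name is "healthy". The helper names (dig, replication_is_healthy) and the sample document are made up for illustration; the field path is the one quoted in the question, and real input would come from fdbcli --exec 'status json'.

```python
import json


def dig(doc, path, default=None):
    """Walk a dotted path through nested dicts; return default if a key is missing."""
    node = doc
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


def replication_is_healthy(status):
    """True only when cluster.data.state.name is exactly 'healthy'."""
    return dig(status, "cluster.data.state.name") == "healthy"


# Hypothetical, heavily trimmed sample; real output has many more fields.
sample = json.loads('{"cluster": {"data": {"state": {"name": "healthy"}}}}')
print(replication_is_healthy(sample))
```

Treating a missing field as "not healthy" is a deliberate choice here: an automation script should fail closed if the status document is incomplete.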

A team is a collection of k storage servers (where k is the replication factor). We build only a limited number of teams to host data, so that when k servers or machines fail at the same time, the probability that some data becomes unavailable is very low.
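To get a feel for why limiting the number of teams helps, here is a back-of-the-envelope sketch. It assumes that exactly k of n storage servers fail at once, that every set of k servers is equally likely to be the failing one, and that all teams are distinct. Under those assumptions, data becomes unavailable only if the failed set is exactly one of the teams. The function name and the numbers are illustrative and are not taken from FoundationDB.

```python
from math import comb


def p_team_lost(n_servers, k, n_teams):
    """Probability that a random set of k failed servers is exactly one of the teams."""
    possible_sets = comb(n_servers, k)
    if n_teams > possible_sets:
        raise ValueError("cannot have more distinct teams than k-subsets")
    return n_teams / possible_sets


# Illustrative numbers only: 100 servers, triple replication.
few_teams = p_team_lost(100, 3, 500)            # a limited number of teams
all_teams = p_team_lost(100, 3, comb(100, 3))   # every possible triple is a team
print(few_teams, all_teams)
```

With every possible triple in use as a team, any three simultaneous failures lose some data; with fewer teams, most failure combinations leave at least one replica of everything.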

alexmiller · July 12, 2019, 9:41pm

(Sorry, your questions require reading code to answer with confidence what the fields mean, which I haven’t had time to do. I had typed out and not posted a reply to the spirit of your question before, which I’ll post now, and return to the rest of your question when I do have time.)

I think this is the crux of your question, so I’ll start with this. You should terminate the instance when exclude tells you that it is safe to do so. Exclude synchronously waits for the data distribution off of the failed machine to complete, and returns once it does.

Using the example from Removing machines from a cluster

fdb> exclude 1.2.3.4 1.2.3.5 1.2.3.6
Waiting for state to be removed from all excluded servers.  This may take a while.
[a long wait]
It is now safe to remove these machines or processes from the cluster.

And there’s also the note that if you ^C out of an exclude, re-issuing the same exclude will allow you to resume the wait.

I’d recommend using this over the other proposed methods of determining replication health, because

- It’s more specific to the particular machine, and won’t include other data distribution activities (splits, merges, etc.)
- It’s what our automation uses.
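For automation, that wait can be scripted by running the exclude through fdbcli --exec and looking for the confirmation line in its output. This is only a sketch: it assumes fdbcli is on the PATH and can find the cluster file, that the non-interactive command blocks the same way the interactive one does, and that the confirmation text matches the example above. The function names are hypothetical.

```python
import subprocess

SAFE_MESSAGE = "It is now safe to remove these machines or processes from the cluster."


def exclude_reported_safe(output):
    """Return True if the fdbcli output contains the confirmation line."""
    return SAFE_MESSAGE in output


def exclude_and_wait(addresses, timeout_s=None):
    """Run `exclude` for the given addresses and block until fdbcli returns.

    Assumes fdbcli is installed and configured. If the call is interrupted
    or times out, re-issuing the same exclude resumes the wait.
    """
    command = "exclude " + " ".join(addresses)
    result = subprocess.run(
        ["fdbcli", "--exec", command],
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return result.returncode == 0 and exclude_reported_safe(result.stdout)


# Offline check of the parser against the transcript from the example.
transcript = (
    "Waiting for state to be removed from all excluded servers.  This may take a while.\n"
    "It is now safe to remove these machines or processes from the cluster.\n"
)
print(exclude_reported_safe(transcript))
```

A caller would terminate the instance only when exclude_and_wait(["1.2.3.4"]) returns True, and retry the same call otherwise.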

A “team” is a group of hosts whose size equals the replication factor. Shards of data are assigned to a team. Part of data distribution builds and maintains these teams. I’m sure @mengxu would be happy to answer any questions you have about team building.

EDIT: turns out he already did
