UNHEALTHY: No replicas remain of some data

After yesterday’s AWS outage, two of our five nodes were replaced, and now we see “Replication health - UNHEALTHY: No replicas remain of some data”.

Full output:

# fdbcli --exec 'status details'
Using cluster file `/etc/foundationdb/fdb.cluster'.

Configuration:
  Redundancy mode        - double
  Storage engine         - ssd-2
  Coordinators           - 5
  Usable Regions         - 1

Cluster:
  FoundationDB processes - 10
  Zones                  - 5
  Machines               - 5
  Memory availability    - 7.7 GB per process on machine with least available
  Retransmissions rate   - 2 Hz
  Fault Tolerance        - 0 machines
  Server time            - 06/11/21 11:25:10

Data:
  Replication health     - UNHEALTHY: No replicas remain of some data
  Moving data            - unknown
  Sum of key-value sizes - unknown
  Disk space used        - 97.738 GB

Operating space:
  Storage server         - 61.7 GB free on most full server
  Log server             - 61.7 GB free on most full server

Workload:
  Read rate              - 27 Hz
  Write rate             - 1 Hz
  Transactions started   - 14 Hz
  Transactions committed - 1 Hz
  Conflict rate          - 0 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Process performance details:
  10.165.196.94:4500     (  1% cpu;  1% machine; 0.001 Gbps;  1% disk IO; 0.5 GB / 7.8 GB RAM  )
  10.165.196.94:4501     (  1% cpu;  1% machine; 0.001 Gbps;  1% disk IO; 0.6 GB / 7.8 GB RAM  )
  10.165.196.116:4500    (  1% cpu;  1% machine; 0.001 Gbps;  1% disk IO; 0.6 GB / 7.7 GB RAM  )
  10.165.196.116:4501    (  1% cpu;  1% machine; 0.001 Gbps;  1% disk IO; 0.4 GB / 7.7 GB RAM  )
  10.165.196.148:4500    (  1% cpu;  1% machine; 0.001 Gbps;  1% disk IO; 0.3 GB / 7.8 GB RAM  )
  10.165.196.148:4501    (  1% cpu;  1% machine; 0.001 Gbps;  1% disk IO; 0.4 GB / 7.8 GB RAM  )
  10.165.196.176:4500    (  1% cpu;  1% machine; 0.001 Gbps;  1% disk IO; 0.6 GB / 7.7 GB RAM  )
  10.165.196.176:4501    (  1% cpu;  1% machine; 0.001 Gbps;  1% disk IO; 0.4 GB / 7.7 GB RAM  )
  10.165.196.234:4500    (  0% cpu;  1% machine; 0.001 Gbps;  1% disk IO; 0.3 GB / 7.7 GB RAM  )
  10.165.196.234:4501    (  1% cpu;  1% machine; 0.001 Gbps;  1% disk IO; 0.4 GB / 7.7 GB RAM  )

Coordination servers:
  10.165.196.94:4500  (reachable)
  10.165.196.116:4500  (reachable)
  10.165.196.148:4500  (reachable)
  10.165.196.176:4500  (reachable)
  10.165.196.234:4500  (reachable)

Client time: 06/11/21 11:25:08

If I stop one node, ‘Replication health’ changes to ‘(Re)initializing automatic data distribution’, then to ‘Healthy’, and then, after some time, back to ‘UNHEALTHY: No replicas remain of some data’.

# fdbserver --version
FoundationDB 6.2 (v6.2.28)
source version 569ab46bf638cd0bfc86f192b724c9217e090760
protocol fdb00b062010001

Is there a way to fix it?


We also noticed that when retrieving certain keys, the (Python) client tries to connect to IPs that are no longer part of the cluster. Since we specified the double redundancy mode, is there a way to force replication of the remaining copy of the data onto the new servers? (We assume double means the original plus 2 copies, and that we lost the copies.)

Unfortunately, double replication means there are only two replicas of your data (there’s no “original”). Thus, after losing 2/5 hosts, it is reasonable to expect that some shards would be unavailable or lost. You will either need to bring one of the failed machines back online or restore from backup.

The behavior of the clients trying to connect to non-existent IPs is a side-effect of having shards of data with no live storage server associated with them.
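On the client side this can be mitigated (not fixed) by giving transactions a timeout, so that reads touching unavailable shards fail fast instead of retrying indefinitely. A sketch using the Python bindings, assuming API version 620 to match the FDB 6.2 server above:

```python
import fdb

fdb.api_version(620)   # match the server's API level (FDB 6.2)
db = fdb.open()        # uses the default /etc/foundationdb/fdb.cluster

# Abort any transaction that cannot complete within 5 seconds instead of
# retrying forever against storage servers that no longer exist.
db.options.set_transaction_timeout(5000)   # milliseconds

# Hypothetical read helper; the retry loop raises on timeout.
@fdb.transactional
def read_key(tr, key):
    return tr[key]
```

This only changes how quickly the client gives up; the underlying shards are still unavailable until the data is restored.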

If losing 2 hosts is an expected event, then I would suggest running with three copies of all data (triple), or looking into higher availability configurations (three_data_hall or multi-region).
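For reference, the redundancy mode is changed from fdbcli; a minimal sketch, to be run only against a healthy cluster since the change triggers background data movement:

```shell
# Switch the cluster to triple redundancy (3 replicas of every shard).
fdbcli --exec 'configure triple'

# Check the resulting configuration and watch replication health
# while data distribution creates the additional replicas.
fdbcli --exec 'status'
```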


Thanks for replying @alexmiller.

Assuming we set the redundancy mode to triple in the future and then lose 2 servers (holding 2 of the shard replicas), to make the database cluster available again we should:

  1. Set redundancy mode to single
  2. Add 2 new servers
  3. Set redundancy mode to triple and wait for the repartition process to finish creating the replicas

Is that sequence of steps correct?

You only need to add 2 new servers, so steps (1) and (3) are not necessary (nor would I recommend doing them). If 1 of the 3 replicas remains for a piece of data, FDB will automatically re-replicate that 1 copy back into 3 with no human involvement.
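In that scenario, recovery is purely capacity replacement. A sketch of the operator's side, assuming a systemd-based install and a hypothetical `existing-node` hostname:

```shell
# On each of the 2 replacement machines: install foundationdb-server,
# copy the cluster file from a surviving node, and start the processes.
scp existing-node:/etc/foundationdb/fdb.cluster /etc/foundationdb/fdb.cluster
systemctl start foundationdb

# Data distribution notices the new storage servers automatically;
# watch replication health return to Healthy.
watch -n 5 "fdbcli --exec 'status minimal'"
```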