UNHEALTHY: No replicas remain of some data

After yesterday’s AWS outage, two of our five nodes were replaced, and now we see “Replication health - UNHEALTHY: No replicas remain of some data”.

Full output:

# fdbcli --exec 'status details'
Using cluster file `/etc/foundationdb/fdb.cluster'.

Configuration:
  Redundancy mode        - double
  Storage engine         - ssd-2
  Coordinators           - 5
  Usable Regions         - 1

Cluster:
  FoundationDB processes - 10
  Zones                  - 5
  Machines               - 5
  Memory availability    - 7.7 GB per process on machine with least available
  Retransmissions rate   - 2 Hz
  Fault Tolerance        - 0 machines
  Server time            - 06/11/21 11:25:10

Data:
  Replication health     - UNHEALTHY: No replicas remain of some data
  Moving data            - unknown
  Sum of key-value sizes - unknown
  Disk space used        - 97.738 GB

Operating space:
  Storage server         - 61.7 GB free on most full server
  Log server             - 61.7 GB free on most full server

Workload:
  Read rate              - 27 Hz
  Write rate             - 1 Hz
  Transactions started   - 14 Hz
  Transactions committed - 1 Hz
  Conflict rate          - 0 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Process performance details:
  10.165.196.94:4500     (  1% cpu;  1% machine; 0.001 Gbps;  1% disk IO; 0.5 GB / 7.8 GB RAM  )
  10.165.196.94:4501     (  1% cpu;  1% machine; 0.001 Gbps;  1% disk IO; 0.6 GB / 7.8 GB RAM  )
  10.165.196.116:4500    (  1% cpu;  1% machine; 0.001 Gbps;  1% disk IO; 0.6 GB / 7.7 GB RAM  )
  10.165.196.116:4501    (  1% cpu;  1% machine; 0.001 Gbps;  1% disk IO; 0.4 GB / 7.7 GB RAM  )
  10.165.196.148:4500    (  1% cpu;  1% machine; 0.001 Gbps;  1% disk IO; 0.3 GB / 7.8 GB RAM  )
  10.165.196.148:4501    (  1% cpu;  1% machine; 0.001 Gbps;  1% disk IO; 0.4 GB / 7.8 GB RAM  )
  10.165.196.176:4500    (  1% cpu;  1% machine; 0.001 Gbps;  1% disk IO; 0.6 GB / 7.7 GB RAM  )
  10.165.196.176:4501    (  1% cpu;  1% machine; 0.001 Gbps;  1% disk IO; 0.4 GB / 7.7 GB RAM  )
  10.165.196.234:4500    (  0% cpu;  1% machine; 0.001 Gbps;  1% disk IO; 0.3 GB / 7.7 GB RAM  )
  10.165.196.234:4501    (  1% cpu;  1% machine; 0.001 Gbps;  1% disk IO; 0.4 GB / 7.7 GB RAM  )

Coordination servers:
  10.165.196.94:4500  (reachable)
  10.165.196.116:4500  (reachable)
  10.165.196.148:4500  (reachable)
  10.165.196.176:4500  (reachable)
  10.165.196.234:4500  (reachable)

Client time: 06/11/21 11:25:08

If I stop one node, ‘Replication health’ changes to ‘(Re)initializing automatic data distribution’, then to ‘Healthy’, and then, after some time, back to ‘UNHEALTHY: No replicas remain of some data’.

# fdbserver --version
FoundationDB 6.2 (v6.2.28)
source version 569ab46bf638cd0bfc86f192b724c9217e090760
protocol fdb00b062010001

Is there a way to fix it?


We also noticed that when retrieving certain keys, the (Python) client tries to connect to IPs that are no longer part of the cluster. Since we specified the double redundancy mode, is there a way to force replication of the remaining copy of the data onto the new servers? (We assume double means the original plus 2 copies, and that we lost the copies.)

Unfortunately, double replication means there are only two replicas of your data (there’s no “original”). Thus, after losing 2/5 hosts, it is reasonable to expect that some shards would be unavailable or lost. You will either need to bring one of the failed machines back online or restore from backup.

The behavior of the clients trying to connect to non-existent IPs is a side-effect of having shards of data with no live storage server associated with them.
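On the client side this can be mitigated (not fixed) by giving transactions a timeout, so that reads touching unavailable shards fail fast instead of retrying indefinitely. A sketch using the Python bindings, assuming API version 620 to match the FDB 6.2 server above:

```python
import fdb

fdb.api_version(620)   # match the server's API level (FDB 6.2)
db = fdb.open()        # uses the default /etc/foundationdb/fdb.cluster

# Abort any transaction that cannot complete within 5 seconds instead of
# retrying forever against storage servers that no longer exist.
db.options.set_transaction_timeout(5000)   # milliseconds

# Hypothetical read helper; the retry loop raises on timeout.
@fdb.transactional
def read_key(tr, key):
    return tr[key]
```

This only changes how quickly the client gives up; the underlying shards are still unavailable until the data is restored.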

If losing 2 hosts is an expected event, then I would suggest running with three copies of all data (triple), or looking into higher availability configurations (three_data_hall or multi-region).
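For reference, the redundancy mode is changed from fdbcli; a minimal sketch, to be run only against a healthy cluster since the change triggers background data movement:

```shell
# Switch the cluster to triple redundancy (3 replicas of every shard).
fdbcli --exec 'configure triple'

# Check the resulting configuration and watch replication health
# while data distribution creates the additional replicas.
fdbcli --exec 'status'
```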


Thanks for replying @alexmiller.

Assuming we set the redundancy mode to triple in the future and then lose 2 servers (holding 2 of the shard replicas), to make the database cluster available again we should:

  1. Set redundancy mode to single
  2. Add 2 new servers
  3. Set redundancy mode to triple and wait for the repartition process to finish creating the replicas

Is that sequence of steps correct?

You only need to add 2 new servers, so steps (1) and (3) are not necessary (nor would I recommend doing them). If 1 of the 3 replicas remains for a piece of data, FDB will automatically re-replicate that 1 copy back into 3 with no human involvement.
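In that scenario, recovery is purely capacity replacement. A sketch of the operator's side, assuming a systemd-based install and a hypothetical `existing-node` hostname:

```shell
# On each of the 2 replacement machines: install foundationdb-server,
# copy the cluster file from a surviving node, and start the processes.
scp existing-node:/etc/foundationdb/fdb.cluster /etc/foundationdb/fdb.cluster
systemctl start foundationdb

# Data distribution notices the new storage servers automatically;
# watch replication health return to Healthy.
watch -n 5 "fdbcli --exec 'status minimal'"
```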