Temporary hardware failure on singly replicated cluster

We were using a cluster of 13 machines running 91 processes (7 per machine), with a total of 19.4 TB of data on it. One of our machines was hit by a hardware failure. This immediately put the replication health of the cluster into an (understandably) unhealthy state, with the message UNHEALTHY: No replicas remain of some data. The machine could not be recovered for a week, and for that entire duration the replication health remained unhealthy and the Moving Data size stayed high (~4000 GB).

After a week, when we recovered the machine and started fdbmonitor on it, the replication health of the cluster returned to healthy and Moving Data came back down to a few hundred GB, but the cluster started to lag badly. One of the processes (on the newly returned machine) was pinned at 99.9% CPU, and the cluster reported that it was unable to reach some of the processes. We waited ~12 hours for the cluster to repair itself, but it did not recover. We also tried excluding all the processes on the impacted machine, which initially increased Moving Data to ~2000 GB; after ~12 hours it came down to 396 GB but would not decrease further, so the exclusion of these processes could never complete.

We don’t care about this 396 GB of data, as we can easily re-migrate it, but if we kill the fdbmonitor on the impacted machine the replication health of the cluster goes back to unhealthy.

Is there any way to tell the cluster that the lost data can be ignored so that it can recover? Also, any pointers on dealing with a situation like this in the future?

What configuration are you running (specifically, what replication mode)? Losing one machine shouldn’t result in losing all replicas unless you only have one replica or the cluster was already in a degraded state.
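For reference, the replication mode can be checked and changed from fdbcli. A sketch of what that looks like (the status output shown here is abbreviated; `configure` is only safe to run once you understand its implications for data movement):

```
fdb> status

  ...
  Redundancy mode        - single
  ...

fdb> configure double
Configuration changed
```

In `single` mode there is exactly one copy of each shard, so losing a machine really does mean losing all replicas of some data.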

What role is this process playing? You can tell by looking at status output, or by looking for the Roles attribute on trace events logged by the process.
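A rough sketch of pulling the Roles attribute out of a trace event. The sample event line below is hypothetical; real trace events are XML files in each fdbserver process’s log directory, with names like trace.&lt;ip&gt;.&lt;port&gt;.&lt;timestamp&gt;.xml:

```shell
# Hypothetical sample of a single trace event line; in practice you would
# grep the trace.*.xml files of the suspect process.
sample='<Event Severity="10" Type="Role" Machine="10.0.0.5:4500" Roles="SS" />'

# Extract the Roles attribute ("SS" = storage server, "TL" = transaction log, etc.)
echo "$sample" | grep -o 'Roles="[^"]*"'
```

Running this prints `Roles="SS"`, which would indicate a storage server.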

What version are you running?

Have you tried restarting all processes in the cluster? There have been cases where data distribution got stuck and a restart helped.
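For what it’s worth, a cluster-wide restart can be issued from fdbcli without touching fdbmonitor on each machine; a sketch, assuming fdbcli’s kill interface (fdbmonitor will restart the killed processes automatically):

```
fdb> kill
  ... (populates the list of known process addresses) ...
fdb> kill all
```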

Hmm, I’m not sure. You could always clear the data from the database, which is the usual way to tell the cluster you don’t need it, but I don’t know whether that would allow it to recover.
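Clearing a range can be done from fdbcli; a sketch, with placeholder keys (clearing requires enabling write mode first, and the begin/end keys here are stand-ins for whatever range you actually want to drop):

```
fdb> writemode on
fdb> clearrange <BEGINKEY> <ENDKEY>
Committed
```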

If something is wrong specifically with this host, then usually just removing it would be fine and the cluster would heal from the other replicas. With only one replica, though, that isn’t an option.
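Removing a host is normally done with an exclusion from fdbcli; a sketch, with a placeholder address (the `exclude` command waits until the data on that process has been re-replicated elsewhere, which is exactly what cannot finish with a single replica):

```
fdb> exclude 10.0.0.5:4500
  ... waits for data movement to complete ...
fdb> include all
  (undoes all exclusions if you change your mind)
```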

If instead something is wrong elsewhere in the cluster (e.g. in data distribution), then as mentioned above a restart could help. If you are running an older version, there have also been various fixes to data distribution that may help it behave better.

Thanks for your suggestions. We ultimately decided to clear out the cluster completely and migrate our data to a fresh cluster, but your suggestions will be very helpful if we get stuck in the same situation again. FYI, we were running in single replication mode on FDB version 6.2.19. I am not sure what the role of the process was (we no longer have access to the logs).

We hadn’t tried this; we’ll give it a shot if this happens again.

By clearing the data, do you mean using clearrange? We couldn’t use clearrange because we didn’t know which shards were mapped to that process. Is there a way to view the shard assignment? And if you meant clearing the data directories manually, wouldn’t the cluster keep complaining that it can’t find any replicas of some data?