We were running a cluster of 13 machines and 91 processes (7 processes per machine) holding a total of 19.4 TB of data. One of our machines was hit by a hardware failure. This immediately put the replication health of the cluster into an (understandably) unhealthy state with the message:

UNHEALTHY: No replicas remain of some data.

The machine could not be recovered for a week, and for that entire duration the replication health of the cluster remained unhealthy and the Moving Data size stayed high (~4000 GB).
After a week we recovered the machine and started fdbmonitor on it. The replication health of the cluster came back to healthy and Moving Data dropped to a few hundred GB, but the cluster started to lag badly. One of the processes (on the newly returned machine) was maxed out at 99.9% CPU, and the cluster reported that it was unable to reach some of the processes. We waited ~12 hours for the cluster to repair itself, but it did not recover. We then tried excluding all the processes on the impacted machine; this initially increased Moving Data to ~2000 GB, which after ~12 hours came down to 396 GB but would not decrease further, so the exclusion of these processes could not complete.
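For reference, the exclusion was issued with fdbcli's exclude command, roughly along these lines. The IP address and the port range here are hypothetical placeholders standing in for our actual layout of 7 processes per machine:

```shell
#!/bin/sh
# Hypothetical address of the impacted machine; replace with the real IP.
FAILED_IP="10.1.2.3"

# Build the address list for all 7 processes, assumed here to listen
# on consecutive ports 4500-4506 (adjust to your foundationdb.conf).
ADDRESSES=""
for PORT in $(seq 4500 4506); do
  ADDRESSES="$ADDRESSES ${FAILED_IP}:${PORT}"
done

# Print the fdbcli invocation rather than running it, since this is only
# a sketch; drop the echo to actually issue the exclusion.
echo fdbcli --exec "exclude${ADDRESSES}"
```

Progress of the exclusion was then watched via `fdbcli --exec "status details"`, which is where we saw Moving Data stall at 396 GB.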
We don’t care about this 396 GB of data, as we can easily re-migrate it, but if we kill fdbmonitor on the impacted machine, the replication health of the cluster goes back to unhealthy.
Is there any way to tell the cluster that the lost data can be ignored, so that it can recover? Also, any pointers on dealing with a situation like this in the future?