I have an FDB cluster (version 6.2.19) with double replication, consisting of 13 machines where each machine runs 7 processes (98 processes in total). On each machine I am using 4 mount points (of different sizes). My processes are assigned to the mount points in the following way:
I was migrating data to the cluster and had loaded 20.7TB when I noticed I was no longer able to write to (or read from) the cluster, and running status displayed the following messages:
Unable to start default priority transaction after 5 seconds
Unable to start batch priority transaction after 5 seconds
Unable to retrieve all status information
Performance limited by process: Storage server running out of space (approaching 5% limit)
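For context, here is a rough sketch of what that last message means per mount point; the function name and the exact 5% threshold are my assumptions based on the status message, not FDB internals:

```python
# Illustrative only: ratekeeper throttles transactions once any storage
# server's free space drops to roughly 5% of its disk, so a single small,
# full mount point can limit the whole cluster. Names/threshold assumed.

def near_space_limit(free_bytes: float, total_bytes: float,
                     min_free_ratio: float = 0.05) -> bool:
    """True once free space drops to or below min_free_ratio of the disk."""
    return free_bytes / total_bytes <= min_free_ratio

# A 1TB mount point with 40GB free is already past the limit, so every
# storage process on it throttles the cluster regardless of other disks.
print(near_space_limit(40e9, 1e12))   # True
print(near_space_limit(200e9, 1e12))  # False
```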
I took a look at the disk utilisation of my machines (they are used only as part of the cluster) and saw the following distribution on 8 of my 13 machines that run only storage processes:
To make the cluster usable again, I tried adding another machine to the cluster and adding another process on Mount Point 4, hoping that the cluster would rebalance itself, but Moving Data stayed at 0.0GB. I also tried excluding the limiting processes, but that didn't trigger data movement either. The cluster seemed to be "stuck": even commands like setclass weren't reflecting any changes. I then tried killing the data distributor process as mentioned here, but that only changed the values of "Sum of key-value sizes", "Moving Data" and "Replication Health" to unknown, and the cluster remained stuck.
My issue is similar to this one, but I can't use the solutions mentioned there: I don't have enough resources to add that many additional machines, and my cluster is already in double replication mode, so I can't lower the replication factor without risking data loss.
This is a critical issue for me and any suggestions on how to recover the cluster would be appreciated. I also have a few questions about how data is distributed across storage servers:
This post mentioned that the "data distribution process tries to keep roughly the same amount of data on every storage server". Does this mean the same absolute amount of data on each server, or the same percentage of each server's total capacity?
The post linked above also mentions that the performance degradation when any storage server reaches 95% of its capacity is intentional, as it makes the cluster easier to recover. Is it guaranteed that the cluster will always be recoverable in such scenarios?