I always show up with some fun issue. This time is no different.
After restoring the cluster, everything worked fine until two of our 14 servers went down at the same time. Normally this would not be an issue, but because of triple replication, re-replicating the lost data saturated the remaining servers, and the cluster is now running out of space.
Because these are NVMe servers, they hit an issue they can't recover from (a GCP thing).
I thought, no problem, I'll just add more servers and things will rebalance. However, that's not what happened. Now everything seems to have slowed almost to a halt. I added 10 more servers, which nearly doubled the available storage, but the cluster is still reporting that one of the servers is running out of space.
I found this discussion that explains why, but not the solution. So what are my options here? How do I tell the cluster that it now has plenty of space and that it should prioritize rebalancing?
I tried excluding the process, but it had no impact.
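For reference, this is roughly what I ran in `fdbcli` (the addresses below are placeholders, not the real ones):

```shell
# Exclude the dead storage process (placeholder address):
fdb> exclude 10.128.1.200:4500

# On FDB 6.1+, `exclude failed` is supposed to tell the cluster the
# process is permanently gone, so it does not wait for data to be moved
# off it -- though I'm not sure whether it applies in my situation:
fdb> exclude failed 10.128.1.200:4500
```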
```
Configuration:
  Redundancy mode        - triple
  Storage engine         - ssd-2
  Coordinators           - 3
  Exclusions             - 1 (type `exclude' for details)
  Desired Proxies        - 7
  Desired Resolvers      - 1
  Desired Logs           - 7

Cluster:
  FoundationDB processes - 72 (less 4 excluded; 0 with errors)
  Machines               - 18 (less 1 excluded)
  Memory availability    - 7.9 GB per process on machine with least available
  Fault Tolerance        - 0 machines
  Server time            - 09/27/19 19:42:27

Data:
  Replication health     - HEALING: Only two replicas remain of some data
  Moving data            - 372.732 GB
  Sum of key-value sizes - 3.649 TB
  Disk space used        - 3.322 TB

Operating space:
  Storage server         - 0.0 GB free on most full server
  Log server             - 339.4 GB free on most full server

Workload:
  Read rate              - 2 Hz
  Write rate             - 0 Hz
  Transactions started   - 1 Hz
  Transactions committed - 0 Hz
  Conflict rate          - 0 Hz

  Performance limited by process: Storage server running out of space (approaching 5% limit).
  Most limiting process: 10.128.1.131:4502

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Client time: 09/27/19 19:42:17
```
In addition, one of the servers that died was a coordinator, and when I try to change the coordinators, nothing happens.
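This is what I attempted for the coordinator change (again, placeholder addresses):

```shell
# Let FDB pick a new set of coordinators automatically:
fdb> coordinators auto

# Or specify replacement coordinators explicitly (placeholder addresses):
fdb> coordinators 10.128.1.10:4500 10.128.1.11:4500 10.128.1.12:4500
```

Neither form seems to do anything in my case.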
Is there a way to help the cluster understand that it now has plenty of space, so that it starts rebalancing properly?