We’ve been running an FDB cluster for a couple of months now and things have been working out well. However, in the middle of last week it started running out of space. I decided to nearly double its size (from 40 to 70 servers; each with 4 vCPUs, 15 GB RAM, and 1x 375 GB NVMe SSD). I only added storage-class processes, however, and a day later the log servers (we had 5 servers, 2 log processes per server) started running out of space as well. So I added 3 more servers with 2 log-class processes per server (2 are stateless) and left the cluster to do whatever it needs to do.
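For a sense of scale, here is a rough back-of-the-envelope sketch of the raw disk capacity before and after the resize. It assumes every server contributes its full 375 GB SSD and ignores replication, log/stateless servers, and filesystem overhead, so the usable numbers will be lower. The function name is just illustrative.

```python
# Figures taken from the post; treating each server's full SSD as usable
# storage is an assumption made only for this estimate.
SSD_GB_PER_SERVER = 375


def raw_capacity_gb(servers: int, ssd_gb: int = SSD_GB_PER_SERVER) -> int:
    """Raw (pre-replication) disk capacity across all servers, in GB."""
    return servers * ssd_gb


before = raw_capacity_gb(40)   # cluster size before the resize
after = raw_capacity_gb(70)    # cluster size after the resize
growth = after / before

print(f"before: {before} GB, after: {after} GB, growth: {growth:.2f}x")
```

So "almost double" works out to roughly 1.75x the raw capacity under these assumptions.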
But the cluster is now running into trouble. I can’t get any data from status, as it always returns something like this:
fdb> status details

Using cluster file `fdb.cluster'.

Unable to communicate with the cluster controller at 10.128.9.184:4500 to get
status.

Configuration:
  Redundancy mode        - unknown
  Storage engine         - unknown
  Coordinators           - unknown

Cluster:
  FoundationDB processes - unknown
  Machines               - unknown

Data:
  Replication health     - unknown
  Moving data            - unknown
  Sum of key-value sizes - unknown
  Disk space used        - unknown

Operating space:
  Unable to retrieve operating space status

Workload:
  Read rate              - unknown
  Write rate             - unknown
  Transactions started   - unknown
  Transactions committed - unknown
  Conflict rate          - unknown

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Process performance details:

Coordination servers:
  10.128.0.125:4500  (reachable)
  10.128.0.204:4500  (reachable)
  10.128.0.240:4500  (reachable)
  10.128.1.36:4500  (reachable)
  10.128.1.243:4500  (reachable)
  10.128.3.106:4500  (reachable)
  10.128.4.114:4500  (reachable)

Client time: 10/07/18 20:07:39
There are no errors in the process logs. I did restart the cluster (just to make sure), but it had no effect. The cluster itself works most of the time, but every 10 minutes or so things grind to a halt and then start working again after a couple of minutes.
Any suggestions on what to do? I have paused all writes to the cluster to give it time to recover, but I need to get it working as soon as possible. Any help is much appreciated.