Troubles scaling up the cluster

Hi guys,

We’ve been running an FDB cluster for a couple of months now and things have been working out well. However, mid last week it started running out of space, so I decided to almost double its size (from 40 to 70 servers; each with 4 vCPUs, 15 GB RAM, and 1x 375 GB NVMe SSD). I only added storage-class processes, though, and a day later the log servers (we had 5 of them, 2 log processes per server) started running out of space as well. So I added 3 more servers with 2 log-class processes each (the other 2 processes are stateless) and left the cluster to do whatever it needed to do.
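
For reference, the process layout on the newly added log servers looks roughly like the foundationdb.conf fragment below (a sketch only; the port numbers are illustrative, and the storage-only servers simply run class = storage on all four processes):

# sketch of foundationdb.conf on one of the new log servers (ports illustrative)
[fdbserver.4500]
class = log
[fdbserver.4501]
class = log
[fdbserver.4502]
class = stateless
[fdbserver.4503]
class = stateless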

But the cluster is now running into trouble. I can’t get any data from status, as it always returns something like this:

fdb> status details

Using cluster file `fdb.cluster'.

Unable to communicate with the cluster controller at 10.128.9.184:4500 to get
status.

Configuration:
  Redundancy mode        - unknown
  Storage engine         - unknown
  Coordinators           - unknown

Cluster:
  FoundationDB processes - unknown
  Machines               - unknown

Data:
  Replication health     - unknown
  Moving data            - unknown
  Sum of key-value sizes - unknown
  Disk space used        - unknown

Operating space:
  Unable to retrieve operating space status

Workload:
  Read rate              - unknown
  Write rate             - unknown
  Transactions started   - unknown
  Transactions committed - unknown
  Conflict rate          - unknown

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Process performance details:

Coordination servers:
  10.128.0.125:4500  (reachable)
  10.128.0.204:4500  (reachable)
  10.128.0.240:4500  (reachable)
  10.128.1.36:4500  (reachable)
  10.128.1.243:4500  (reachable)
  10.128.3.106:4500  (reachable)
  10.128.4.114:4500  (reachable)

Client time: 10/07/18 20:07:39

There are no errors in the process logs. I did restart the cluster (just to make sure) but it had no impact. The cluster itself works most of the time, but every 10 minutes or so things grind to a halt and then start working again after a couple of minutes.

Any suggestions on what to do? I paused all writes into the cluster to give it time to recover, but I need to get it working as soon as possible. Any help is much appreciated.

Hi Thomas, are you running with a large number of fdbserver instances + client instances? Could it be that the cluster controller is bottlenecked on monitoring all the connected entities?

Please see the recent threads https://forums.foundationdb.org/t/cluster-controller-cpu-utilization-pegged/740 and https://forums.foundationdb.org/t/migrating-from-a-large-cluster-to-another/721 to see if they give you any clues…

4 per server, so 280 total. There are no clients connecting at the moment.

I read through those discussions, but I’m not sure what the solution is. Should I start slowly removing servers?

This looks a lot like the cluster controller being overloaded. One way to find out is to go to the machine that runs the cluster controller and grep the trace logs of that process for events of type LongDelayOnClusterController.

That being said, 280 processes should not be an issue. The easiest way to run into this problem is to have many processes that run status json (or status or status details; internally these three do the same thing). We found experimentally that even a low number of these commands running simultaneously can kill the cluster controller (which will result in a recovery, and if you are very unlucky, this will happen all the time). A low number here means tens of simultaneous requests.

Do you have some monitoring that runs status json on every machine? 70 of those would certainly be problematic.

That’s exactly it. We run telegraf on each server, and it does just that. Idiotic, but it was easy to scale. Anyway, I turned them off. Is there anything I can do to help the cluster controller recover? 10 minutes in and no change. Currently there are no clients connecting to the cluster and no processes requesting any status updates.
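
For reference, each node was running something along these lines in its telegraf config (a rough sketch from memory, not the exact file):

# illustrative telegraf input; with a copy of this on every node,
# ~70 'status json' calls were hitting the cluster controller each interval
[[inputs.exec]]
  commands = ["fdbcli -C /etc/foundationdb/fdb.cluster --exec 'status json'"]
  timeout = "15s"
  data_format = "json"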

Unfortunately, even though I removed the status requests, the cluster controller is still experiencing problems for some reason:

n@:~$ sudo grep -rnw -e 'LongDelayOnClusterController' /var/log/foundationdb/ | wc -l
463
n@:~$ sudo grep -rnw -e 'LongDelayOnClusterController' /var/log/foundationdb/ | wc -l
490

The number of errors is still growing. Could this be caused by the cluster itself? Any idea what I could do to help it?
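
In case it helps anyone following along, this is roughly how I’m checking whether the count keeps climbing (just a loop around the same grep as above):

# print the running count of LongDelayOnClusterController events once a minute
while true; do
  echo "$(date)  $(sudo grep -rnw -e 'LongDelayOnClusterController' /var/log/foundationdb/ | wc -l)"
  sleep 60
done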

This is super weird. We run clusters that are much larger than yours, and we have run benchmarks with clusters of more than 1,000 fdbserver processes. I have to admit that I don’t have many ideas on how to help you here.

I appreciate your pointers though. At least I know what caused it. Just not sure how to get out of it.

I’m going to see what happens if I turn off the whole cluster and then start turning servers back on one after another. Maybe something will change.

Does restarting every process help (e.g. kill; kill all in fdbcli, assuming you are using something like fdbmonitor that will automatically restart the processes)?
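
In case it’s useful, a minimal way to do that from a single machine (assuming fdbmonitor is managing the processes so they come straight back up) would be roughly:

# 'kill' with no arguments populates the address list, 'kill all' then bounces every process
fdbcli --exec 'kill; kill all'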

Haven’t tried it through fdbcli, but I did try through Ansible and it didn’t help.

Anyway, I did stop all processes and then started adding them back slowly, and things did come back (at least I get status data now). However, I think I broke the cluster, as it now says that it doesn’t have access to some of the data:

Data:
  Replication health     - UNHEALTHY: No replicas remain of some data
  Moving data            - 1905.884 GB
  Sum of key-value sizes - 2.190 TB
  Disk space used        - 8.871 TB

I don’t think it will be able to recover from this, unfortunately. Fortunately we do have a backup of the raw data, but it will take me a week to restore it. A high price for running status json.

Running status is certainly expensive and can saturate the cluster controller if done a lot (although this is made a lot better in 6.0), but I don’t think it would lead to any cluster damage.

The missing data report here is possibly misleading. If you’ve added back all of your storage servers and there was no missing data before the problem, then my guess is that the data is all present. However, while adding back the storage servers slowly, there would have been a period where some data was missing, and you would get this message. Due to a reporting quirk, this status can persist even after all the data shows up if there is a lot of data movement queued. Can you let it run for a bit until data movement finishes and see if it becomes happy? Alternatively, I think there are some logs we can check to see if things are in an ok state, but I’ll have to wait until I’m able to look up the events to provide details.
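
If it helps while you wait, you can watch the relevant fields directly from status json on a single node, something like this (jq required; I’m writing the field paths from memory, so they may differ slightly between versions):

# moving_data should drain toward zero and data.state should eventually report healthy again
fdbcli --exec 'status json' \
  | jq '{state: .cluster.data.state, moving_data: .cluster.data.moving_data}'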

Yes, will do. One of the storage servers is stuck at the 5% limit and there is still 2 TB of data to be moved. I’ll give it the night and see if it comes back.

Even after 12 hours, the cluster is not recovering. It’s stuck at the same level without any material change.

Data:
  Replication health     - UNHEALTHY: No replicas remain of some data
  Moving data            - 1781.906 GB
  Sum of key-value sizes - 2.190 TB
  Disk space used        - 9.264 TB

I did bring back all the servers and all of them should have their original data, but it seems something broke along the way. Any ideas on what else I can try before I have to purge it all and start over?

So one thing you can try (this is a long shot but it is cheap to try) is to kill the master process. fdbmonitor will automatically restart it and it will force fdb to do a recovery.

We ran into a problem before where the master seemed to “forget” about some partitions. We never managed to find a way to reproduce the problem, but killing the master usually fixed it. However, we usually only “lost” one copy of the data, not all three of them. It is still worth a try.
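
Something along these lines should do it (a sketch, assuming you have jq available and the status json field names are as I remember them):

# find the process currently running the master role, then bounce only that one
MASTER=$(fdbcli --exec 'status json' \
  | jq -r '.cluster.processes[] | select(any(.roles[]; .role == "master")) | .address')
fdbcli --exec "kill; kill $MASTER"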

Funnily enough, that’s exactly what I did (out of desperation), and the whole cluster recovered after that. Quite surprising, to be honest.

At least I now know never to run multiple status commands at the same time, and how to deal with outages of this kind.

Really appreciate all the help you guys gave me.

Glad to hear you could fix this problem. This is definitely a bug.

@ajbeamon We had this happen to us as well (the master randomly losing track of some partitions or storage servers). We never managed to reproduce it in a test environment. If you have any idea how to reproduce this or what the bug could be, that would be great! Maybe we should create a bug report for it? I am honestly a bit out of ideas on how to reproduce and fix this, but since we are not the only ones seeing this behavior, there definitely seems to be a bug somewhere (and it seems to have existed for quite a while).

Maybe it could be reproduced the way I did it? I introduced only a few servers at a time to the cluster, and the cluster was just not able to recover from that. The second all processes got restarted, things recovered within an hour.

Good teamwork figuring out what could have happened and finding a way to recover it :clap:
I also went and changed my telegraf collection so that telegraf runs only from a single monitoring node. So thank you!

Also @ThomasJ I wanted to check if you had just run a ‘kill’ of the FDB process running the master role or if you had done this differently, in case it happens to anyone else. Was the master process re-created on the same node automatically?
Cheers!

Just that. I ran kill; kill all and the system came back up after an hour or so. Everything seems to have gone smoothly afterwards.

One thing I’ve noticed: now that there are 70 servers running, the first time I call status it always fails with unknowns, e.g.

fdb> status

Using cluster file `fdb.cluster'.

Unable to communicate with the cluster controller at 10.128.0.125:4503 to get
status.

Configuration:
  Redundancy mode        - unknown
  Storage engine         - unknown
  Coordinators           - unknown

Cluster:
  FoundationDB processes - unknown
  Machines               - unknown

Data:
  Replication health     - unknown
  Moving data            - unknown
  Sum of key-value sizes - unknown
  Disk space used        - unknown

Operating space:
  Unable to retrieve operating space status

Workload:
  Read rate              - unknown
  Write rate             - unknown
  Transactions started   - unknown
  Transactions committed - unknown
  Conflict rate          - unknown

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Client time: 10/10/18 08:36:36

Maybe there is some other way to collect statistics across the nodes? I know @ajbeamon mentioned that this may be reworked in 6.0, but I’m not sure if it addresses this exact issue.
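
For now the plan is to issue status json from a single monitoring node only and fan the numbers out from there; roughly something like this (a sketch, with the jq paths written from memory):

# one 'status json' call per interval covers the whole cluster
fdbcli --exec 'status json' > /tmp/fdb_status.json
jq '{reads_hz:  .cluster.workload.operations.reads.hz,
     writes_hz: .cluster.workload.operations.writes.hz,
     moving_gb: (.cluster.data.moving_data.in_queue_bytes / 1e9)}' /tmp/fdb_status.json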

The one issue I know of here is the one I mentioned where status is pessimistic about how many replicas remain when there is a lot of queued data movement. Data movements are given priorities based on why those movements are happening, with higher priorities given to movements restoring the replication factor. The fewer replicas of the data that remain, the higher the priority is. I believe we are currently determining what to report in status based on the highest priority active data movement task. It seems that when a lot of data movement is queued, these high priority tasks are sticking around for a while.

I’m not sure if this happens only when the data movement becomes unnecessary or if it would also be slow to execute if it were still needed. I’m not that familiar with how priorities are dealt with in data distribution and how data movement gets canceled, but I do intend to address this issue before too long (I have opened a GitHub issue for it: “Status replication health is pessimistic when replication factor is restored and a lot of data movement is queued”, https://github.com/apple/foundationdb/issues/833).

In this case, the reported replication health should return to normal when data movement completes, or possibly before. It should also be the case that the data is still accessible in the cluster. I think this is what happened in Thomas’s case, though possibly the data movement was unable to complete due to space issues. Bouncing the cluster (or even just the master) would have restarted data distribution and cleared out the high-priority movement, but I’m not sure why it helped with the space issue.

Does the above seem like it may be what you’ve encountered? If not and you have any other details that could point me in the right direction, please let me know.

EDIT:

Given that it took a while for the cluster to come back after bouncing the master, it may have been the combination of a few factors. The data movement to re-replicate the data would have been cancelled, removing pressure on disk space and opening the way for lower priority rebalancing work to happen. Additionally, the gradual process of reclaiming cleared data (which also occurs as a result of data movement) could be helping as well.