Troubles scaling up the cluster

Does this happen the first time you run status every time you open fdbcli? Is it also telling you that it’s having trouble communicating with the cluster controller? Is there any way you can check what the CPU usage looks like on the cluster controller to see if it may still be pretty busy? This can be done either from the ProcessMetrics event in the trace logs of the process (compute CPUSeconds/Elapsed) or with standard tools (e.g. top) on the machine itself.
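
For example, a rough sketch of pulling that ratio out of the trace files could look something like this (the log directory and file naming here are just assumptions for a fairly default setup; point it at the trace files of the cluster controller process):

```python
# Rough sketch: estimate a process's CPU utilization from its FDB trace logs
# by dividing CPUSeconds by Elapsed for each ProcessMetrics event.
import glob
import xml.etree.ElementTree as ET

for path in sorted(glob.glob("/var/log/foundationdb/trace.*.xml")):
    with open(path) as f:
        for line in f:
            if 'Type="ProcessMetrics"' not in line:
                continue
            try:
                event = ET.fromstring(line)
            except ET.ParseError:
                continue  # skip partially written lines
            cpu = float(event.get("CPUSeconds", 0))
            elapsed = float(event.get("Elapsed", 1))
            if elapsed > 0:
                print(f"{event.get('Time')}: {100 * cpu / elapsed:.1f}% CPU")
```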

Turns out the reporting issue I described has been addressed in 6.0: https://github.com/apple/foundationdb/pull/687.

I think there may be some other issues with data movement highlighted here – for example that it got to a point where it was stuck but bouncing the cluster unstuck it. I may try to reproduce this scenario to understand the details of why it couldn’t make progress and think about how we might be able to get it out of that condition automatically.

It’s hard for me to guess, but I had added 30 servers with 12TB of extra capacity two days prior to that. I know the cluster hadn’t been rebalanced yet, but it should not be running out of space. The log servers were, which is why I repurposed two more, but they seemed to be clearing up.

Your explanation does make sense to me, even though I really don’t know that much about the internals of FDB.

This happens any time I run status after not having done so for a while (let’s say overnight). The first time I run it, it returns the above; the second time, it returns the data. When I check the CPU, the load average is way below 4 (on 4 vCPU servers) and the utilization is not over 50% on any of the cores, both as reported directly on the server and by FDB (status details) itself.

I very much appreciate it.

Is this happening in new fdbcli sessions, or does it also happen in an already running session that’s been left idle for a while? The message about being unable to connect to the cluster controller occurs if it takes a long time to connect to the cluster controller or if you don’t hear anything from the cluster controller for a while (I think 4-5 seconds in both cases, depending on your version).

One possible cause is the cluster controller being busy as I mentioned, but it sounds like that may not be the case here. Just to confirm, do you know if the 4 vCPUs are backed by 4 physical cores, or is it possible you are getting 2 physical cores with hyperthreading? In the latter case, we’d need to make sure that the machine utilization is below 2 rather than 4.
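
If it’s easy to check, a quick sketch like this (assuming psutil is installed; what it reports depends on the CPU topology your cloud provider exposes to the VM) would show the logical vs. physical counts:

```python
# Compare logical CPUs to physical cores on the machine.
# psutil is a third-party package (pip install psutil), which is an assumption here.
import os
import psutil

logical = os.cpu_count()                    # vCPUs visible to the OS
physical = psutil.cpu_count(logical=False)  # physical cores, if the topology is exposed
print(f"logical CPUs: {logical}, physical cores: {physical}")
if physical and physical < logical:
    # With hyperthreading, keep machine utilization below the physical core count.
    print(f"hyperthreaded: target utilization below {physical}")
```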

Another possibility for the error is that it’s taking a really long time to establish a connection to the cluster controller. It could also be that there is really high latency in communication after you’ve established a connection, but if it’s true that it’s happening only at the start of an fdbcli session, then I’d be more inclined to think it has something to do with the initial connection.

If you can get trace logs from fdbcli when it does this (you can pass it the --log parameter to turn on logging), then it might give some clues. I’d be happy to look at the logs as well if you wanted to provide them.

The latter. I actually don’t usually open new sessions.

The latter again. We’re running in a cloud, so it’s a vCPU not a CPU.

The load average is around 1 across the whole cluster.

Good to know, I will do that. I just tried a new session a couple of times and it works smoothly. My guess is that a stale session has a harder time reconnecting to all of the servers. I added --log to the command and will upload the results once this happens again.

I’m still running into the same issue as described in the first message in this thread. I’m not sure why, as there is no status request being run anymore.

We do, however, have thousands of servers connecting from our infrastructure, and maybe they clog the cluster? Anyway, the only remedy I’ve found is to run kill; kill all in fdbcli. After that, the cluster recovers within a couple of minutes and everything is back to normal.

Could this be related to the size and setup of the cluster? Currently we have:

  • 4-core servers with 16GB RAM and a 375GB local SSD disk
  • 62 storage servers (2 storage processes, 2 stateless)
  • 7 log servers (2 log processes, 2 stateless)
  • 7 coordinators
  • 20 proxies
  • 20 resolvers

I started shrinking the cluster and increasing the disk size from 375GB to 3TB (still on NVMe disks). So the new setup will probably be:

  • 20 storage servers (2 storage processes, 2 stateless), 4 CPUs, 16GB RAM, 3TB local SSD
  • 3 log servers (2 log processes, 2 stateless)
  • 4 coordinators
  • 5 proxies
  • 5 resolvers

Maybe this will help? However, we need to grow the system, and 60TB is only going to be good for a short period of time. We’re still mostly testing it, but eventually we would like to put all our data in, which is a little bit below 1PB, or around 3PB with triple redundancy and some room to grow. That would bring us to around 1k servers. However, as we’re having issues making 70 work, I’m wondering what we could do better here.
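
For reference, the back-of-the-envelope math behind that estimate (all of the numbers are just the ones from above, nothing more precise):

```python
# Rough capacity math using the figures from this post.
logical_data_tb = 1000      # a little below 1 PB of application data
replication = 3             # triple redundancy
disk_per_server_tb = 3      # 3 TB local SSD per storage server

raw_tb = logical_data_tb * replication   # ~3 PB of raw capacity
servers = raw_tb / disk_per_server_tb    # ~1000 storage servers
print(f"~{servers:.0f} storage servers, before leaving headroom for growth")
```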

AJ has a far better magical debugging sense than I do, but I’m going to imagine that even he would do well with a bit more information. A pastebin of the status json output from when your cluster appears to be wedged would likely be insightful.

On an unrelated tangent, this is probably pretty overkill. I’d imagine you’d see an improvement in your latencies with little change to throughput if you instead ran with 7 proxies and 1 resolver. I think we’ve only ever seen workloads that are very write heavy with many small key writes benefit from more than one resolver, and most of the profiling I’ve done seemed to suggest that the marginal benefit of additional proxies beyond the number of logs tapers off pretty quickly.
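
If you do want to experiment with that, here’s a minimal sketch of applying it (just wrapping fdbcli to keep the example in one place; the configure syntax is the standard one, and the exact counts are only a suggestion):

```python
# Sketch: apply the suggested topology via fdbcli's --exec flag.
# Assumes fdbcli is on PATH and the default cluster file is in place.
import subprocess

subprocess.run(
    ["fdbcli", "--exec", "configure proxies=7 resolvers=1"],
    check=True,  # raise if fdbcli exits with an error
)
```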

This is very helpful. We do, in fact, have many writes with small keys. Should I lower the number of proxies to the number of log servers or to the number of log processes (we have 2 per server, so 14 instead of 7)?

I don’t think that’s possible. If you look at my first post, status returns almost nothing. I did run status json last time and it was almost empty. I will post whatever it returns the next time this occurs.

Are these all connecting to the cluster as clients? How many thousands? Right now we recommend limiting the number of clients to somewhere around 1000 (this will likely depend on your hardware), but you could experiment to see what works in your setup. If you are running many thousands of clients, that can be taxing on the cluster controller to the point of causing system instability.

There’s another thread here where we discussed this a bit: Migrating from a large cluster to another - #4 by markus.pilman

I’ll also say this is an area that we’re looking to improve. I think there are a handful of different changes that can make things better, but one in particular is to distribute the failure monitoring activity so that the single cluster controller isn’t responsible for doing this for everybody.

Correct, everything is a client. Well, there are around 3k servers, each running around 60 threads that connect to FDB. None of them hammers it with too much traffic, but they do tend to make a transaction every second or so.

How would you suggest dealing with this? PostgreSQL has pooling services; would something like that make sense here too? I was hoping to skip that and be able to connect directly, as it simplifies things radically.

This is 3k client processes each running 60 threads? The significant metric for the cluster controller is the number of client processes, as it does some work for each connected client process. Importantly, the cluster controller isn’t involved in the read or write paths, so your load shouldn’t be responsible for saturating it.

If it’s feasible, the first thing I would try is to reduce the number of your clients and see if that helps. You could try 500 and then maybe add some back if things look good. Assuming things get better, then we need to figure out a way to limit the number of clients connected to the database in your system.

One option is to simply run fewer of them. You should be able to push a cluster to its limits with a smaller number of clients, but of course if other factors require you to have 3k clients, then this won’t be an option.

Another option is to introduce a proxy application between your clients and the cluster. Your clients would interact with the proxies, and the proxies would actually connect to the cluster.
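
As a very rough sketch of what I mean (the HTTP interface, port, and names here are purely illustrative assumptions, not an official pattern; the point is just that only the proxy processes hold client connections to the cluster):

```python
# Minimal sketch of an application-level proxy: your clients talk to this
# service over HTTP, and only the proxy processes connect to FDB.
import fdb
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

fdb.api_version(600)   # match the API version of your installed client
db = fdb.open()        # one FDB client connection per proxy process

@fdb.transactional
def read_key(tr, key):
    val = tr[key]
    return bytes(val) if val.present() else None

class ProxyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # e.g. GET /some-key returns the value stored under b'some-key'
        value = read_key(db, self.path.lstrip("/").encode())
        if value is None:
            self.send_response(404)
            self.end_headers()
        else:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(value)

if __name__ == "__main__":
    ThreadingHTTPServer(("0.0.0.0", 8080), ProxyHandler).serve_forever()
```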

Correct.

Unfortunately, that’s the case. We can limit the number of threads but not the number of servers.

I was afraid that this was going to be the solution. It makes things incrementally more difficult.

I will do some more tests and report back what I learn.