Cluster Controller CPU Utilization Pegged

Hey -

I’m currently POC’ing an FDB cluster in AWS using three-datacenter replication across 3 availability zones. We have 4 nodes in each data center. Each node runs 4 FDB processes: two with the storage class, one log, and one stateless.

Once I started to increase my write throughput, I noticed that the cluster controller was the only process in `status details` pegged at about 95-99% CPU utilization. I ran an exclude on the process and let FDB migrate its role over to another stateless-class process, but that process exhibited the same behavior.

I let my continuous streaming job run overnight, and this morning I noticed I could no longer make a connection to the database because the cluster controller was unavailable. From the looks of it, though, clients that already had connections retained them.

I restarted the FDB process hosting the controller and confirmed that the controller address it was complaining about did in fact change (so at least I know failover is working). However, when I opened a terminal on the node, I could see via top that the new controller process was still pegged at 99% CPU.

Finally, I killed my streaming job, and immediately I was able to connect to the cluster.

So I guess my question is: how does one scale out the cluster controller, so that a single controller doesn’t get completely hammered by writes?

How many clients are connected to the database? The cluster controller isn’t involved in the write path, so it shouldn’t be the bottleneck for the write throughput of your database. However, it does maintain a connection to every process in the cluster and every client. Part of its responsibility is failure monitoring, and the cost of this adds up when you have a lot of clients and/or processes in the cluster. Your number of server processes seems OK, which is why I’m curious about the clients.
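If it helps, here’s a minimal sketch for checking the connected-client count from an app process, assuming the Python binding and the `\xff\xff/status/json` special key (the same JSON document fdbcli’s `status json` prints; the exact field layout can vary between FDB versions):

```python
import json
import fdb

fdb.api_version(630)  # assumes a client/server at or above this API version
db = fdb.open()       # uses the default cluster file


@fdb.transactional
def read_status_json(tr):
    # Reading this special key returns the same JSON as `status json` in fdbcli.
    return tr.get(b'\xff\xff/status/json').wait()


status = json.loads(read_status_json(db))
print("connected clients:", status["cluster"]["clients"]["count"])
```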

At the moment there is no way to scale out the responsibilities of the cluster controller, although it may be possible to reduce the connection costs with knobs. See the “Migrating from a large cluster to another” topic for more discussion of this.

What other roles were assigned to your cluster controller? Depending on your workload, you may need more processes to better distribute roles and isolate the cluster controller. Also, depending on the hardware and CPU architecture, you may just need a bit more horsepower under the hood.

We have found it’s best to isolate the CC on a dedicated process so that it is not also trying to perform proxy/resolver work alongside the CC work.

Each FDB process is single-threaded, so a shared process would essentially be trying to handle heavy write throughput and manage the CC work at the same time.
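If you want to see where the roles currently sit, a small sketch against the same `\xff\xff/status/json` special key can list them per process (Python binding assumed; field names like `class_type` and `roles` come from the status document and may differ slightly between FDB versions):

```python
import json
import fdb

fdb.api_version(630)
db = fdb.open()


@fdb.transactional
def read_status_json(tr):
    return tr.get(b'\xff\xff/status/json').wait()


# Print each process's address, configured class, and the roles it is running,
# e.g. to confirm cluster_controller is not sharing a process with proxy/resolver/log work.
status = json.loads(read_status_json(db))
for proc_id, proc in status["cluster"]["processes"].items():
    roles = sorted(r["role"] for r in proc.get("roles", []))
    print(proc.get("address", proc_id), proc.get("class_type", "?"), roles)
```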

@ajbeamon - your comment sent me down the right path. There was a code path I had totally overlooked that was opening a new connection for every aggregate I wrote. I honestly can’t believe it stayed online for as long as it did :laughing:
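In case it helps anyone who finds this thread later, the fix boils down to opening the database handle once and reusing it on the per-aggregate write path. A minimal sketch with the Python binding (the `write_aggregate` name and the key layout are made-up stand-ins, not my actual job):

```python
import fdb

fdb.api_version(630)

# Open the database once at startup and reuse the handle for every write,
# instead of re-opening it inside the per-aggregate code path.
db = fdb.open()


@fdb.transactional
def write_aggregate(tr, name, value):
    # Made-up key/value layout, for illustration only.
    tr[fdb.tuple.pack(("aggregate", name))] = fdb.tuple.pack((value,))


if __name__ == "__main__":
    for i in range(1000):
        write_aggregate(db, "counter", i)  # same handle every time
```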

I’m about to run the job again with my fix and I am optimistic it will scream.

Cheers!

That’s another good argument for implementing connection pooling under the hood in the bindings, to prevent this easy-to-make mistake.

Question: if the binding had been doing connection pooling under the hood, it would have prevented the perf hit, but it would also have hidden the issue in the code (recreating a new connection on each use). What do you think is better: doing magic under the hood to prevent “bad” code from misbehaving, or letting it “blow up” with bad performance during testing in order to reveal the issue and force a redesign of the code into something better?

There might be an opportunity to have both. I’m a fan of APIs being explicit where possible, so perhaps a connection pool API could be added, but you would have to explicitly create and use it.

But to answer your question, I think it’s preferable that everything blew up and surfaced the bug in my code. Otherwise, the bug would only have shown up as suboptimal performance, which can be a nightmare to track down, especially when you don’t know what counts as “suboptimal” without a bug-free baseline to compare against.

Maybe a settable, explicit maximum number of connections per process per cluster, initially set to 1?
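As a rough sketch of what that could look like at the application level today (the `get_database` helper, the cache, and the cap are all hypothetical, not part of any binding; note that the Python binding’s `fdb.open()` already reuses one handle per cluster file):

```python
import threading
import fdb

fdb.api_version(630)

_handles = {}
_lock = threading.Lock()
MAX_HANDLES_PER_PROCESS = 1  # the "initially set to 1" cap suggested above


def get_database(cluster_file=None):
    """Hypothetical explicit cache: at most MAX_HANDLES_PER_PROCESS database
    handles per process, each one reused for its cluster file."""
    with _lock:
        if cluster_file in _handles:
            return _handles[cluster_file]
        if len(_handles) >= MAX_HANDLES_PER_PROCESS:
            raise RuntimeError(
                "refusing to open another database handle; "
                "reuse the result of get_database() instead of re-opening"
            )
        _handles[cluster_file] = fdb.open(cluster_file)
        return _handles[cluster_file]
```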