Cluster Controller CPU Utilization Pegged

Hey -

I’m currently POC’ing an FDB cluster in AWS using three-datacenter replication across 3 availability zones. We have 4 nodes in each data center. Each node runs 4 FDB processes: two with the storage class, one log, and one stateless.

Once I started to increase my write throughput, I noticed that the cluster controller was the only process in `status details` pegged at about 95-99% CPU utilization. I ran an exclude on the process and let FDB migrate its role over to another stateless-class process, but that process exhibited the same behavior.

I let my continuous streaming job run overnight, and this morning I noticed I could no longer make a connection to the database because the cluster controller was unavailable. From the looks of it, though, clients that already had connections retained them.

I restarted the FDB process hosting the controller and confirmed that the controller address it was complaining about did in fact change (so at least I know failover is working). However, when I opened a terminal on the node, I could see via top that the new controller process was still pegged at 99% CPU.

Finally, I killed my streaming job, and immediately I was able to connect to the cluster.

So I guess my question is: how does one scale out the cluster controller, so that a single controller doesn’t get completely hammered by writes?

How many clients are connected to the database? The cluster controller isn’t involved in the write path, so it shouldn’t be the bottleneck for the write throughput of your database. However, it does maintain a connection to every process in the cluster and every client. Part of its responsibility is failure monitoring, and the cost of this adds up when you have a lot of clients and/or processes in the cluster. Your number of server processes seems OK, which is why I’m curious about the clients.
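If it helps, here’s a minimal sketch for checking the connected-client count from an app process, assuming the Python binding and the `\xff\xff/status/json` special key (the same JSON document fdbcli’s `status json` prints; the exact field layout can vary between FDB versions):

```python
import json
import fdb

fdb.api_version(630)  # assumes a client/server at or above this API version
db = fdb.open()       # uses the default cluster file


@fdb.transactional
def read_status_json(tr):
    # Reading this special key returns the same JSON as `status json` in fdbcli.
    return tr.get(b'\xff\xff/status/json').wait()


status = json.loads(read_status_json(db))
print("connected clients:", status["cluster"]["clients"]["count"])
```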

At the moment there is no way to scale out the responsibilities of the cluster controller, although it may be possible to reduce the connection costs with knobs. See the “Migrating from a large cluster to another” topic for more discussion of this.

What other roles were assigned to your cluster controller? Depending on your workload, you may need more processes to better distribute roles and isolate the cluster controller. Also, depending on the hardware and CPU architecture, you may just need a bit more horsepower under the hood.

We have found it’s best to isolate the CC on a dedicated process so that it is not also trying to perform proxy/resolver work alongside the CC work.

Each FDB process is single-threaded, so a shared process would essentially be trying to handle heavy write throughput and manage the CC work at the same time.
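If you want to see where the roles currently sit, a small sketch against the same `\xff\xff/status/json` special key can list them per process (Python binding assumed; field names like `class_type` and `roles` come from the status document and may differ slightly between FDB versions):

```python
import json
import fdb

fdb.api_version(630)
db = fdb.open()


@fdb.transactional
def read_status_json(tr):
    return tr.get(b'\xff\xff/status/json').wait()


# Print each process's address, configured class, and the roles it is running,
# e.g. to confirm cluster_controller is not sharing a process with proxy/resolver/log work.
status = json.loads(read_status_json(db))
for proc_id, proc in status["cluster"]["processes"].items():
    roles = sorted(r["role"] for r in proc.get("roles", []))
    print(proc.get("address", proc_id), proc.get("class_type", "?"), roles)
```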

@ajbeamon - your comment sent me down the right path. There was a code path I had totally overlooked that was opening a new connection for every aggregate I wrote. I honestly can’t believe it stayed online for as long as it did :laughing:
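In case it helps anyone who finds this thread later, the fix boils down to opening the database handle once and reusing it on the per-aggregate write path. A minimal sketch with the Python binding (the `write_aggregate` name and the key layout are made-up stand-ins, not my actual job):

```python
import fdb

fdb.api_version(630)

# Open the database once at startup and reuse the handle for every write,
# instead of re-opening it inside the per-aggregate code path.
db = fdb.open()


@fdb.transactional
def write_aggregate(tr, name, value):
    # Made-up key/value layout, for illustration only.
    tr[fdb.tuple.pack(("aggregate", name))] = fdb.tuple.pack((value,))


if __name__ == "__main__":
    for i in range(1000):
        write_aggregate(db, "counter", i)  # same handle every time
```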

I’m about to run the job again with my fix and I am optimistic it will scream.

Cheers!

That’s another good argument for implementing connection pooling under the hood in the bindings, to prevent this easy-to-make mistake.

Question: if the binding had been doing connection pooling under the hood, it would have prevented the perf hit, but it would also have hidden the issue in the code (recreating a new connection on each use). What do you think is better: doing magic under the hood to prevent “bad” code from misbehaving, or letting it “blow up” with bad performance during testing in order to reveal the issue and force a redesign of the code into something better?

There might be an opportunity to have both. I’m a fan of APIs being explicit where possible, so perhaps a connection pool API could be added, but you would have to explicitly create and use it.

But to answer your question, I think it’s preferable that everything blew up and surfaced the bug in my code. Otherwise, the bug would only have shown up as suboptimal performance, which can be a nightmare to track down, especially when you don’t know what counts as “suboptimal” without a bug-free baseline to compare against.

Maybe a settable, explicit maximum number of connections per process per cluster, initially set to 1?
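As a rough sketch of what that could look like at the application level today (the `get_database` helper, the cache, and the cap are all hypothetical, not part of any binding; note that the Python binding’s `fdb.open()` already reuses one handle per cluster file):

```python
import threading
import fdb

fdb.api_version(630)

_handles = {}
_lock = threading.Lock()
MAX_HANDLES_PER_PROCESS = 1  # the "initially set to 1" cap suggested above


def get_database(cluster_file=None):
    """Hypothetical explicit cache: at most MAX_HANDLES_PER_PROCESS database
    handles per process, each one reused for its cluster file."""
    with _lock:
        if cluster_file in _handles:
            return _handles[cluster_file]
        if len(_handles) >= MAX_HANDLES_PER_PROCESS:
            raise RuntimeError(
                "refusing to open another database handle; "
                "reuse the result of get_database() instead of re-opening"
            )
        _handles[cluster_file] = fdb.open(cluster_file)
        return _handles[cluster_file]
```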