I’m currently POC’ing an FDB cluster in AWS using three-datacenter replication across three availability zones. We have four nodes in each data center, and each node runs four FDB processes: two with class storage, one log, and one stateless.
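For reference, each node’s foundationdb.conf looks roughly like the sketch below (the ports and paths are illustrative placeholders, not our exact values):

```
[fdbmonitor]
user = foundationdb

[general]
cluster_file = /etc/foundationdb/fdb.cluster

## One [fdbserver.<port>] section per process on the node:
## two storage, one log, one stateless.
[fdbserver.4500]
class = storage

[fdbserver.4501]
class = storage

[fdbserver.4502]
class = log

[fdbserver.4503]
class = stateless
```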
Once I started increasing my write throughput, I noticed the cluster controller was the only process in `status details` pegged at about 95-99% CPU utilization. I ran an exclude on that process and let FDB migrate the role to another stateless-class process, but the new process exhibited the same behavior.
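The exclude itself was roughly the following (the IP:port is a placeholder for whichever stateless process was hosting the controller at the time):

```
$ fdbcli
fdb> status details          # per-process CPU and roles; controller process at ~99%
fdb> exclude 10.0.1.12:4503  # push the roles off this process
fdb> status details          # confirm the role landed on another stateless process
```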
I then let my continuous streaming job run overnight, and this morning I found I could no longer make new connections to the database because the cluster controller was unavailable. From the looks of it, though, clients that already had connections retained them.
I restarted the FDB process hosting the controller and confirmed that the controller address being reported did in fact change (so at least I know the role is failing over). However, when I opened a terminal on the new node, I could see via top that the process was still pegged at 99% CPU.
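For what it’s worth, this is how I’ve been locating the current controller, assuming jq is available and that the status json field layout matches what I’ve seen in my version:

```
# Print the address of the process currently holding the cluster_controller role.
fdbcli --exec 'status json' \
  | jq -r '.cluster.processes[]
           | select(.roles[].role == "cluster_controller")
           | .address'
```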
Finally, I killed my streaming job, and I was immediately able to connect to the cluster again.
So I guess my question is: how does one scale out the cluster controller so that a single controller doesn’t get completely hammered by writes?