How does FDB react to too many connected clients?

Hi, we need to support a large number of connected clients in our workload. We want to test the maximum number of connected clients FDB can support and how FDB reacts when that limit is exceeded. We use the metric “cluster.clients.count”, as described in monitored-metrics.rst in the apple/foundationdb GitHub repository, to compare against the number of clients we start. From the alert suggestion there, it seems FDB should be able to support more than 1000 connected clients; however, that is not what we observed in our test.
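
For anyone wanting to reproduce the comparison, here is a minimal sketch of pulling that metric out of a status document. It assumes the JSON was captured beforehand (for example with `fdbcli --exec 'status json'`) and that the field sits at `cluster.clients.count`, as the metric name suggests. The function name and the sample document are made up for illustration.

```python
import json
from typing import Optional


def connected_client_count(status: dict) -> Optional[int]:
    """Return cluster.clients.count from a parsed status document.

    Returns None if any level of the path is missing, which can happen
    when the cluster cannot report full status.
    """
    return status.get("cluster", {}).get("clients", {}).get("count")


# Hand-written fragment standing in for real `status json` output.
sample = json.loads('{"cluster": {"clients": {"count": 640}}}')
print(connected_client_count(sample))
```

Polling this while ramping up clients makes it easy to see where the reported count stops tracking the number of clients started.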

Our cluster setup:

  • FDB 7.1.15
  • Multi-region deployment
    • 9 coordinators, 3 per datacenter
    • 21 VMs for the primary datacenter, 3 for the satellite datacenter, and 21 VMs for the remote datacenter
    • In the primary datacenter: 3 VMs for logs, 3 VMs for stateless, and 15 VMs for storage

From our experiments, the maximum number of clients that can connect to the cluster without any errors is 640: 320 clients doing read-modify-update against the primary and 320 clients doing read-only work against the remote datacenter.
When we increased the number of clients beyond that, we started seeing FDB process failures.

  • Connected client count at 720:
    Triggers an FDB transaction system recovery:
    1. The cluster has some unreachable processes. (Stateless roles being killed by "Fatal Error: Network connection failed")
       Unable to retrieve all status information.
    2. Performing recovery transaction.
    3. (Re)initializing automatic data distribution
    4. Healthy
  • Connected client count > 900:
    The FDB cluster becomes unavailable and is unable to complete the transaction system recovery.

If we reduce the number of client processes but increase the threads per process, the cluster works fine. For example, the cluster can support 500 client processes x 4 threads each, but not 1000 client processes x 1 thread each.

Is this the expected behavior of FDB when it receives too many open client connections?
Is there any documentation we can refer to in order to understand how FDB internally handles too many connected clients?

I would love to understand this better and any help will be appreciated. Thanks!

You need to raise the file descriptor limit of the FDB processes. Please see All Coordinators Crashed At Same Time for previous discussion of this.

Thanks @alexmiller. The information in that thread, All Coordinators Crashed At Same Time (Foundationdb 6.2 - fdbserver going out of memory), is very useful.
We can now stably support 2000 connected clients after increasing the open file limit to LimitNOFILE=262144 at the process level.
However, as we keep adding more clients, we start seeing behavior similar to what we saw in our earlier tests.

  1. #clients > 2000: 1-2 proxies failed with Fatal Error: Network connection failed
  2. #clients > 3000: More stateless processes crashed and some VMs dropped out of the cluster. In some tests the coordinators crashed too.

We also observed that the count from ls -l /proc/<process-pid>/fd/ | wc -l keeps growing, but we have not seen it exceed the limit we set.
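
A rough sketch of how the descriptor count and the effective limit could be compared for a single process on Linux is below. It reads `/proc/<pid>/fd` and the "Max open files" row of `/proc/<pid>/limits`. The helper names and the `proc_root` parameter are hypothetical, added only so the functions can be pointed at a different directory. Note that `ls -l <dir> | wc -l` also counts the leading "total" line, so it reads one higher than the actual number of descriptors.

```python
import os
from typing import Tuple


def open_fd_count(pid: int, proc_root: str = "/proc") -> int:
    """Count the entries in <proc_root>/<pid>/fd (one per open descriptor)."""
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))


def parse_max_open_files(limits_text: str) -> Tuple[str, str]:
    """Return (soft, hard) from the 'Max open files' row of a limits file.

    Values are returned as strings because the kernel may print 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            fields = line.split()
            # Fields: 'Max', 'open', 'files', soft, hard, units
            return fields[3], fields[4]
    raise LookupError("no 'Max open files' row found")


def read_max_open_files(pid: int, proc_root: str = "/proc") -> Tuple[str, str]:
    """Read and parse <proc_root>/<pid>/limits for the given process."""
    with open(os.path.join(proc_root, str(pid), "limits")) as f:
        return parse_max_open_files(f.read())
```

Running both against each fdbserver PID would confirm whether the LimitNOFILE setting actually reached the process, which is worth checking since the limit is applied by whatever launches it.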

@alexmiller What is the maximum number of connected clients you have ever tested on a large-scale cluster?
What would be the limit, or the ideal number, of connected clients for FoundationDB in production?

“Network connection failed” doesn’t map to any reason that I’m aware of as to why a proxy would crash. Can you search your trace logs for the Severity=“40” reason why the processes crashed? (And maybe look a little bit above to see if there’s related information there.)
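
A minimal sketch of that search, assuming the trace files are line-oriented text in which each event carries a literal `Severity="40"` attribute (straight quotes). The function name, the context size, and the sample lines are illustrative and not taken from a real log.

```python
from collections import deque
from typing import Iterable, Iterator, List, Tuple


def find_severity_40(
    lines: Iterable[str], context: int = 5
) -> Iterator[Tuple[List[str], str]]:
    """Yield (preceding_lines, matching_line) for each Severity="40" event.

    Up to `context` lines immediately before each match are kept, so
    related events logged just before the error are shown with it.
    """
    recent = deque(maxlen=context)
    for raw in lines:
        line = raw.rstrip("\n")
        if 'Severity="40"' in line:
            yield list(recent), line
        recent.append(line)


# Made-up event lines, only to show the shape of the result.
sample = [
    '<Event Severity="10" Type="ExampleA" />',
    '<Event Severity="20" Type="ExampleB" />',
    '<Event Severity="40" Type="ExampleC" />',
]
for before, hit in find_severity_40(sample, context=2):
    print("\n".join(before))
    print(">>>", hit)
```

To run it against a real file, pass an open file object, for example `find_severity_40(open(path))`, where `path` is one of the crashed process's trace files.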

I’m definitely aware of multiple clusters running stably with over 3k clients connected, so there’s likely something still wrong with your setup/environment/test.

My shot in the dark prediction is that (number of fdbserver processes) * 8GB > (amount of memory in the VM).
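
That inequality can be checked per VM with a few lines. This is only a sketch of the arithmetic above: the 8 GB per-process figure comes from the post, while the function name and the example numbers are hypothetical.

```python
def memory_overcommitted(
    num_fdbserver_processes: int,
    vm_memory_gb: float,
    per_process_gb: float = 8.0,
) -> bool:
    """True if (number of fdbserver processes) * per-process GB exceeds VM memory."""
    return num_fdbserver_processes * per_process_gb > vm_memory_gb


# Hypothetical VM: 4 fdbserver processes on a 16 GB machine.
print(memory_overcommitted(4, 16))
```

If this comes back true for a VM, the processes on it could collectively grow past physical memory before any single one reaches its own cap.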