We need to support a very large number of concurrent clients. We cannot use a proxy because that would increase latency.
I believe the limiting factor is the cluster controller, which is a single process.
Is it possible to separate out the client and server controllers? Then there could be more than one client controller, and clients could choose one in a round-robin or data-center-aware way.
These are just my initial thoughts; I am sure I have overlooked lots of scenarios that need to be thought through.
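To make the idea concrete, here is a rough sketch in Python of the selection scheme I have in mind. The `ControllerPicker` class and the endpoint tuples are purely illustrative; nothing like this exists in FoundationDB today:

```python
import itertools

# Purely illustrative sketch of the proposal above: clients pick one of
# several client-facing controllers, either round-robin or preferring
# their own data center. Nothing here is FoundationDB's actual API.
class ControllerPicker:
    def __init__(self, endpoints, local_dc=None):
        # endpoints: list of (host, port, dc) tuples
        self.endpoints = endpoints
        self.local_dc = local_dc
        self._rr = itertools.cycle(endpoints)

    def pick(self):
        # Data-center-aware: prefer a controller in our own DC, if any.
        if self.local_dc is not None:
            local = [e for e in self.endpoints if e[2] == self.local_dc]
            if local:
                return local[0]
        # Otherwise plain round-robin across all controllers.
        return next(self._rr)

picker = ControllerPicker(
    [("10.0.0.1", 4500, "dc1"), ("10.0.1.1", 4500, "dc2")],
    local_dc="dc1",
)
host, port, dc = picker.pick()
```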
Note that each of the 100K clients may also connect to any given storage server; it’s not just the cluster controller that will be under stress in this scenario. The cluster controller does little for clients, basically just forwarding them (rare) changes to configuration information, so it should be easy to have multiple places to go for that. But without adding an extra hop to the read path, with huge numbers of clients you are either going to have huge numbers of open connections on storage servers, or lots of connections being opened and closed. It’s possible that the extra hop is in fact the highest-performance solution for your use case.
Also note that to find the cluster controller, clients have to talk to the coordinators…
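For context, the discovery path looks roughly like this. Every function below is a stub standing in for a network round trip, and none of these names are FoundationDB’s real internal API:

```python
# Schematic sketch of how a client finds the cluster. Each function is
# a stub; the real work is network round trips inside fdbclient.

def parse_cluster_file(path):
    # A cluster file looks like "description:id@host1:port,host2:port,...".
    with open(path) as f:
        _, hosts = f.read().strip().split("@", 1)
    return hosts.split(",")

def ask_coordinators_for_controller(coordinators):
    # Stub: really a round trip to the coordinators, which reply with
    # the address of the current cluster controller.
    return coordinators[0]

def subscribe_to_client_info(controller):
    # Stub: the cluster controller then forwards (rare) configuration
    # changes, such as the current set of proxies, to the client.
    return {"controller": controller, "proxies": []}

def connect_to_cluster(cluster_file_path):
    coordinators = parse_cluster_file(cluster_file_path)
    controller = ask_coordinators_for_controller(coordinators)
    return subscribe_to_client_info(controller)
```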
> But without adding an extra hop to the read path, with huge numbers of clients you are either going to have huge numbers of open connections on storage servers, or lots of connections being opened and closed.
Is a connection to the storage servers kept open even when there is no read or transaction ongoing?
How does the client get to know about changes in data partitioning (i.e., which storage server holds which key ranges)?
I’m curious about your use case: what exactly do you call a ‘client’, if you have at least 100K of them? And do they really need to talk directly to the database cluster?
If your definition of a client is a unique process running somewhere, would it be possible to have these 100K clients talk to some API backend, with far fewer servers, and then have those servers connect to the database? With database connection pooling, this could dramatically reduce the load on the cluster, and even though it is a “proxy”, the overall latency might even decrease (due to batching, reduced memory overhead, …).
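A minimal sketch of what such an API tier could look like, assuming Python and the standard `fdb` binding; the HTTP endpoint shape (`/get?key=…`) is made up for the example:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

import fdb

fdb.api_version(620)
db = fdb.open()  # one shared database handle per API server, not per end client

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # e.g. GET /get?key=foo -- the endpoint shape is invented here
        qs = parse_qs(urlparse(self.path).query)
        key = qs.get("key", [""])[0].encode()
        value = db[key]  # runs as its own transaction; None if absent
        self.send_response(200)
        self.end_headers()
        self.wfile.write(value if value is not None else b"")

HTTPServer(("0.0.0.0", 8080), Handler).serve_forever()
```

The point being: only the handful of API servers hold open connections to the cluster, while the 100K end clients speak plain HTTP to the tier in front.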