GRV throughput saturates at 400K per second

We are running a cluster across 5 bare-metal machines, each with 40 cores, 760 GB RAM, and a 4 TB SSD.

FDB version - 6.2.19
Storage Servers = 20 - 25
T-Logs = 3
Proxies = 3
Stateless = 10
Replication = 3

We run a Netty-based layer that acts as a service layer for client requests. At the moment we are running between 20 and 40 Netty processes, but we can scale out further if required. We are testing under actual user load before the full-blown production use case. The number of individual clients connecting to the Netty processes could be in the range of 40 - 50K (but we want to be able to scale up to 200K in the future). Each client has its own transaction boundaries and can't share transactions with other clients.

With the above requirement, we soon found that the cluster saturates at around 400K requests per second. We were expecting saturation to come from the storage processes, in which case we could scale the cluster, but the actual limitation comes from getting read versions (GRV). This is somewhat understandable given the limited number of proxies, but the bigger surprise is that there seems to be no way to scale it further. Adding more proxies in fact makes the problem worse: I could see latency creeping up with no benefit in actual throughput.
Following this thread, we tested setting MaxBatchSize on the client side, and it helped to some extent. Single-JVM throughput improved by 5 - 15%, but beyond that it is of no use.
If we reuse transactions across clients, single-JVM throughput jumps by 100%, which shows the bottleneck is not in the network thread or the storage servers.
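Since clients here cannot share a transaction, a related idea (not what was tested above) is to share only the read version for a short window, so that many transactions ride on one GRV round trip. Below is a minimal sketch under stated assumptions: `ReadVersionCache` is a hypothetical helper, not part of FDB. In a real layer the supplier would presumably wrap `getReadVersion()` on a fresh transaction, and each new transaction would apply the result with `setReadVersion(long)`; both exist in the FDB Java binding. The trade-off is that reads can be stale by up to the window, and writers may see more conflicts.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical helper: shares one read version among callers for a short window. */
final class ReadVersionCache {
    // Assumed to issue a real GRV, e.g. () -> db.createTransaction().getReadVersion()
    private final Supplier<CompletableFuture<Long>> fetch;
    private final long maxAgeNanos;   // how long one fetched version may be reused
    private final LongSupplier clock; // injectable for testing, e.g. System::nanoTime

    private CompletableFuture<Long> cached;
    private long fetchedAtNanos;

    ReadVersionCache(Supplier<CompletableFuture<Long>> fetch,
                     long maxAgeNanos,
                     LongSupplier clock) {
        this.fetch = fetch;
        this.maxAgeNanos = maxAgeNanos;
        this.clock = clock;
    }

    /** Returns the cached version future, fetching a new one if it is missing, failed, or too old. */
    synchronized CompletableFuture<Long> get() {
        long now = clock.getAsLong();
        boolean stale = cached == null
                || cached.isCompletedExceptionally()
                || now - fetchedAtNanos >= maxAgeNanos;
        if (stale) {
            // Callers arriving while this fetch is in flight share the same future.
            cached = fetch.get();
            fetchedAtNanos = now;
        }
        return cached;
    }
}
```

With a window of a millisecond or two, the number of GRV requests per JVM is bounded by roughly one per window regardless of client count. The window has to stay far below the 5-second transaction lifetime, otherwise reads fail as too old.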

Single-JVM performance in the prod setup (simulated with a test client that generates maximum concurrency):
Throughput - number of requests per second, each reading 1 KB in a new transaction.
Concurrency - absolute concurrency in the JVM as simulated by the test client (i.e., actual in-flight workload, not just open sockets).

Concurrency   Throughput (req/s)   Latency
2             3.1K                 0.6 ms
4             5.5K                 0.7 ms
8             9.5K                 0.9 ms
16            15K                  1.1 ms
32            20K                  1.6 ms
64            25K                  2.5 ms   (already saturated)

Across 40 JVMs:

Concurrency   Throughput (req/s)   Latency
64            52K                  1.2 ms
128           85K                  1.4 ms
256           133K                 1.9 ms
512           208K                 2.5 ms
1000          285K                 3.5 ms
1500          340K                 4.5 ms
2000          370K                 5.5 ms
2500          400K                 6.2 ms
3000          380K                 7.1 ms   (throughput is down)

Note: the same load across 20 JVMs was marginally better; going to 40 JVMs brought no benefit in this benchmark.
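As a sanity check on the numbers above: in a closed-loop test, Little's law says throughput is roughly concurrency divided by latency. The sketch below assumes the concurrency column is the total number of in-flight requests and that the latency column is the mean per-request latency; the class and method names are illustrative only.

```java
/** Little's law for a closed-loop load test: throughput = concurrency / latency. */
final class LittlesLaw {
    /** Expected requests per second for a given number of in-flight requests. */
    static double expectedThroughput(int concurrency, double latencyMillis) {
        return concurrency / (latencyMillis / 1000.0);
    }

    /** Ratio of measured to expected throughput; values well below 1 mean lost capacity. */
    static double efficiency(int concurrency, double latencyMillis, double measuredPerSec) {
        return measuredPerSec / expectedThroughput(concurrency, latencyMillis);
    }
}
```

For the single-JVM row at concurrency 64 and 2.5 ms this gives about 25.6K req/s against 25K measured. For the 40-JVM rows, 2500 at 6.2 ms gives about 403K against 400K measured, while 3000 at 7.1 ms gives about 423K against 380K measured. So up to the 400K point the tables are consistent with latency alone setting the throughput: extra concurrency is absorbed as extra latency rather than extra requests served.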

This shows that server-side batching isn't really working. This thread highlights the issue.

I understand that reusing transactions is useful, and we have already experimented with that, but as you can see, with 50K clients and an average latency expectation on the order of 1 - 3 ms, we are going to hit this limit pretty soon. Any more ideas on how to improve GRV throughput would be very useful.
