Ah yes, I forgot you mentioned trying the memory storage engine, which would eliminate disk issues.
I asked about “client processes”, not “clients.” By “process” I am referring to the operating system concept. You mentioned having up to 100 clients in a single process, which suggested maybe you are only using one process to run all of your clients, but I did not want to assume this.
The reason I asked about process count is that each unique process will have one instance of the fdb_c library, which has a single fdb network thread through which all communication is routed, so this can be a bottleneck. I do not know at what ops/s or KV bytes/s rate this becomes an issue; hopefully someone else can chime in about that.
Backup does many large parallel getRange() requests of sequential data, in ranges chosen to distribute well across the cluster. That is different from what you are doing.
My understanding of your code is that each instance of parallel.ts will get a single read version, then create 100 transactions with that read version (so no round trip to the cluster, which is good). Each transaction will launch 1000 parallel reads of the same 1000 keys and then wait for them, and then you wait for all 100 transactions to finish before printing the time it took. Basically, you are bursting a bunch of requests to a single storage team in your cluster (3 specific processes in triple replication) and waiting for all of them to return.
If you are running a single instance of parallel.ts as your benchmark, and when you say “100 clients” you are referring to the 100 parallel transactions in this code, then this is not a great benchmark for measuring sustained throughput. The amount of work outstanding at once will go from 0 to up to 100,000 very quickly (“up to” because likely some will complete by the time all of them are launched) and then fall rapidly until the last one completes. Your test is waiting for the final completion before it counts anything as completed.
To measure throughput, it is better to target some fixed number of requests outstanding, launch that number of requests, and when any of them complete then launch another request to replace it. Then you can measure the number of complete/launch events per second, which gives you the throughput (ops/second) benchmark at that level of parallelism. Then run the test again at higher levels of parallelism to find the saturation point, which is the point at which the throughput does not increase.
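As a rough illustration of that approach, here is a minimal TypeScript sketch. It does not use the FoundationDB bindings at all: `measureThroughput`, `Op`, and `fakeRead` are hypothetical names I made up, and `fakeRead` is just a timer standing in for a real read. The idea is only to show the shape of the loop, where each of the N slots launches a replacement request as soon as its previous one completes.

```typescript
// Hypothetical operation type: one unit of work, e.g. a single key read.
type Op = () => Promise<void>;

interface ThroughputResult {
  completed: number;
  elapsedMs: number;
  opsPerSecond: number;
}

// Keep `parallelism` operations outstanding for roughly `durationMs`,
// replacing each one as soon as it completes.
async function measureThroughput(
  op: Op,
  parallelism: number,
  durationMs: number,
): Promise<ThroughputResult> {
  const start = Date.now();
  const deadline = start + durationMs;
  let completed = 0;

  // Each worker is one "slot": it launches a new op whenever its previous one finishes.
  async function worker(): Promise<void> {
    while (Date.now() < deadline) {
      await op();
      completed++;
    }
  }

  const workers: Promise<void>[] = [];
  for (let i = 0; i < parallelism; i++) {
    workers.push(worker());
  }
  await Promise.all(workers);

  const elapsedMs = Math.max(1, Date.now() - start);
  return { completed, elapsedMs, opsPerSecond: (completed * 1000) / elapsedMs };
}

// Stand-in for a real read (an assumption, not an fdb call): resolves after a short timer.
const fakeRead: Op = () =>
  new Promise<void>((resolve) => setTimeout(resolve, 5));

// Run the same test at increasing parallelism and look for where ops/s stops rising.
async function sweep(): Promise<void> {
  for (const parallelism of [1, 10, 100]) {
    const r = await measureThroughput(fakeRead, parallelism, 1000);
    console.log(`parallelism=${parallelism} ops/s=${r.opsPerSecond.toFixed(0)}`);
  }
}
```

To try it, call `sweep()` and swap `fakeRead` for a function that performs one real read against your cluster. With the timer stand-in, throughput will simply scale with parallelism; against a real cluster it should flatten out once something saturates.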
EDIT: Once you have found this saturation limit for a single test process, then you can try running more than one test process, and if that gets you a higher aggregate throughput (and I expect it would) then the bottleneck was the single client process (with its single fdb library instance) and not the cluster.
Also, although this is not relevant to your issue, I wanted to point out that your quoted Mac disk speeds are for linear reads and writes; uncached random-access workloads are much slower.