How to achieve (1000 tx x 1000 reads each)/sec on a single client

We are struggling to find a way to do this, or to determine whether it is even possible.
We want to achieve 1000 parallel transactions per second, each doing 1k reads of 500-byte values.

What we tried:

  1. Have multiple clients (1-100) for a single process
  2. Reusing Read Version
  3. Increase transactions, but reduce reads
  4. Using memory engine

All of this is on a development machine (MacBook Pro, macOS), but we are getting about the same performance in our cluster.

We checked whether our application is the bottleneck and whether FDB has high CPU/IOPS usage, but neither is the case - the application is mostly idle waiting for responses and FDB is using 5% CPU at most.

So, our results (number of clients x number of transactions x number of reads):

  1. 1 x 100k x 1 ~ 3 sec (same with cached read version)
  2. 1 x 1000 x 1000 - aborted
  3. 1 x 1000 x 200 ~ 3 sec
  4. 1 x 1000 x 400 ~ 6 sec (linear growth!)
  5. 10 x 1000 x 100 ~ 125 sec (why did they not time out?)
  6. 100 x 10 x 1000 ~ 59 sec (no timeout again)
  7. 100 x 100 x 1000 + reuse read version ~ 174 sec

fdbcli usually shows ~30khz of reads at most.

All of this is far from the numbers in the official documentation:

Are we missing something?

This is not enough information to know for certain what is going on, but here is one possibility:

If your data set does not fit in cache, and your reads are random, then each 500 byte value you are trying to read exists in a different, uncached page of your storage engine files. This would mean that a read rate of 30k key-value pairs per second is reading 30,000 different random disk pages per second. Even though your values are only 500 bytes, those disk pages are 4k each, and so your disk is reading about 120 MBytes/s of random data. That’s probably about the limit of what your disk can do, which is why you are seeing low CPU usage - you are disk bound.
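The disk-bandwidth estimate above can be checked with a few lines of arithmetic. This is only a sketch of the calculation in the post: the 4 KiB page size and the 30k reads/s rate are taken from the text, and the function name is made up for illustration.

```python
# Back-of-the-envelope estimate of random-read disk bandwidth.
# Assumes every read touches a distinct, uncached page (the worst case
# described in the post above).

def random_read_bandwidth(reads_per_sec: int, page_bytes: int = 4096) -> int:
    """Bytes per second the disk must deliver for fully random page reads."""
    return reads_per_sec * page_bytes


if __name__ == "__main__":
    bw = random_read_bandwidth(30_000)
    # 30,000 pages/s * 4096 bytes = 122,880,000 bytes/s, i.e. about 120 MB/s
    print(f"{bw / 1e6:.0f} MB/s")
```

Note that the 500-byte value size does not appear in the formula at all: with uncached random reads, the page size sets the cost.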

Using a real cluster, with multiple storage engine processes to load balance across (requires replication of double or triple), you have more disks you can max out IOPs on, so you should be able to go faster. If you are not, then you probably need to increase the parallelism of your read requests.

You did not say how many client processes you are using, but you are trying to read 500 MBytes of key-value data per second so I can say with certainty that one process is too few. You need multiple processes, and possibly multiple machines depending on network connectivity.

I actually put together a very simple benchmark:

And as I mentioned, we tried the memory storage engine just to rule out disk issues. This is all the data there is, and everything is in cache.

I did say how many clients - the first number in my tests is the number of clients used.

Thank you, your math is useful, but reviewing my setup I can’t see anything that is saturated, neither disks nor network. For example, the Mac’s disk does 1300MB/s read / 600MB/s write. The network is just localhost.

It sounds like it is just some kind of throttling or natural latency somewhere.

Also, during backups I can see 200k/s reads in our cluster (we are using a single VM for backups), but our app was never able to reach that speed even with ~50 processes, not even close.

Ah yes, I forgot you mentioned trying the memory storage engine, which would eliminate disk issues.

I asked about “client processes”, not “clients.” By “process” I am referring to the operating system concept. You mentioned having up to 100 clients in a single process, which suggested maybe you are only using one process to run all of your clients, but I did not want to assume this.

The reason I asked about process count is that each unique process will have one instance of the fdb_c library, which has a single fdb network thread through which all communication is routed, so this can be a bottleneck. I do not know at what ops/s or KV bytes/s rate this becomes an issue; hopefully someone else can chime in about that.

Backup does many large parallel getRange() requests of sequential data in ranges chosen to distribute well across the cluster. That is different from what you are doing.

My understanding of your code is that each instance of parallel.ts will get a single read version, then create 100 transactions with that read version (so no round trip to the cluster which is good) and each transaction will launch 1000 parallel reads of the same 1000 keys and then wait for them, and then you wait for all 100 transactions to finish before printing the time it took. Basically, you are bursting a bunch of requests to a single storage team in your cluster (3 specific processes in triple replication) and waiting for all of them to return.

If you are running a single instance of parallel.ts as your benchmark, and when you say “100 clients” you are referring to the 100 parallel transactions in this code, then this is not a great benchmark for measuring sustained throughput. The amount of work outstanding at once will go from 0 to up to 100,000 very quickly (“up to” because likely some will complete by the time all of them are launched) and then fall rapidly until the last one completes. Your test is waiting for the final completion before it counts anything as completed.

To measure throughput, it is better to target some fixed number of requests outstanding, launch that number of requests, and when any of them complete then launch another request to replace it. Then you can measure the number of complete/launch events per second, which gives you the throughput (ops/second) benchmark at that level of parallelism. Then run the test again at higher levels of parallelism to find the saturation point, which is the point at which the throughput does not increase.
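A minimal sketch of that measurement loop, assuming an asyncio-style client. `fake_read` is a hypothetical stand-in for one database read (it just sleeps), and `measure_throughput` and `find_saturation` are illustrative names, not part of any FDB binding. Each worker holds one "slot": as soon as its request completes, it launches the replacement, so the number of requests outstanding stays fixed.

```python
import asyncio
import time
from typing import Awaitable, Callable, List, Tuple


async def fake_read() -> None:
    """Hypothetical stand-in for one database read; replace with a real call."""
    await asyncio.sleep(0.001)


async def measure_throughput(
    op: Callable[[], Awaitable[None]],
    outstanding: int,
    duration_s: float,
) -> float:
    """Keep `outstanding` operations in flight for `duration_s`; return ops/s."""
    completed = 0
    deadline = time.monotonic() + duration_s

    async def worker() -> None:
        nonlocal completed
        # One slot: when the request completes, immediately launch another.
        while time.monotonic() < deadline:
            await op()
            completed += 1

    start = time.monotonic()
    await asyncio.gather(*(worker() for _ in range(outstanding)))
    return completed / (time.monotonic() - start)


async def find_saturation(
    op: Callable[[], Awaitable[None]],
    levels: List[int],
    duration_s: float = 0.2,
) -> List[Tuple[int, float]]:
    """Run the benchmark at each parallelism level and collect (level, ops/s)."""
    results = []
    for n in levels:
        results.append((n, await measure_throughput(op, n, duration_s)))
    return results


if __name__ == "__main__":
    for n, rate in asyncio.run(find_saturation(fake_read, [1, 10, 100])):
        print(f"outstanding={n:4d}  throughput={rate:10.0f} ops/s")
```

With a real read in place of `fake_read`, the saturation point is the level after which the printed throughput stops increasing.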

EDIT: Once you have found this saturation limit for a single test process, then you can try running more than one test process, and if that gets you a higher aggregate throughput (and I expect it would) then the bottleneck was the single client process (with its single fdb library instance) and not the cluster.

Also, although this is not relevant to your issue, I wanted to point out that your quoted Mac disk speeds are for linear reads and writes; uncached random access workloads are much slower.

Ah, my bad, I left only one client in the sources, but in my benchmarks I iterated over different numbers of clients (the first for loop) from 1 to 100. This is just my latest iteration. It has different client instances - is this still the one network thread? So actual connection pooling is not possible?

Unfortunately, we need exactly this performance: 1k txs in parallel, each with 1000 random reads, within one instance of our app, and we need the results ASAP - this is direct request processing for customer apps, not background workers that could be delayed. This benchmark is quite close to what we want. Besides, I am not sure how we could budget transactions ourselves to get good balancing/queuing; I bet FDB knows much better how to distribute the workload than our application code does.

Right now we have a limit of 200 parallel transactions per process and this is very slow; it means our app can’t handle even 200 users without degradation.

Our nodes are barely loaded. Launching 10x more processes looks very suboptimal, since NodeJS (just like any other runtime) has some fixed overhead per process, and if the processes are barely used, that overhead becomes a major contributor to resource consumption.

For example, 1m concurrent users would require a whopping 5000 processes. I bet we are just doing something wrong or have misconfigured something. This would mean that to achieve performance we need 100x as many client processes as the FDB cluster has, which can’t be right.

Assuming that when you say “instance” you mean one launched process (for example, node <some_program>), then each instance would be using a different instance of the FDB library and therefore its own network thread. This is what you would want to happen; combining requests from different client app instances on the same box would not be an improvement.

An FDB network thread only opens at most one TCP connection to any cluster process it communicates with, and all requests to that node share that one connection. The FDB client talks to several roles in the cluster, but mainly the proxies (for getting read versions, discovering what storage server to talk to for a key range, and submitting transactions) and the storage servers.

If your values are 500 bytes each, and this is the amount of requests you need to perform in one second, then you are trying to read 500 MB/s with 1 million requests per second from a single instance of the FDB client library. While your cluster can go this fast, a single instance of the FDB client library cannot.
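The target load can be written out as a short calculation. The inputs (1000 transactions, 1000 reads each, 500-byte values, ~30k reads/s observed via fdbcli) all come from this thread; the function name is only illustrative.

```python
# Required request rate and payload bandwidth for the stated goal.

def required_load(tx_per_sec: int, reads_per_tx: int, value_bytes: int):
    """Return (requests per second, payload bytes per second)."""
    requests = tx_per_sec * reads_per_tx
    return requests, requests * value_bytes


if __name__ == "__main__":
    requests, payload = required_load(1000, 1000, 500)
    observed = 30_000  # reads/s reported by fdbcli earlier in the thread
    print(f"{requests:,} requests/s, {payload / 1e6:.0f} MB/s")
    print(f"gap vs observed: {requests / observed:.1f}x")
```

So the goal is roughly 33x the rate observed so far from one client process.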

It also seems likely that your network connection could not go this fast. Although 500 MBytes/s sounds slower than 10 Gigabit ethernet (which is 1.1GBytes / second) this traffic pattern of 1 million small requests per second to many different processes is very different from, say, streaming a file at 500 MByte/s.

Supposing though that your network connection was able to handle this, and your client machine was powerful enough, there may be some future in which this speed from a single client application is possible. There is a plan to eventually (no idea when…) have an RPC layer in front of FoundationDB.

This could help your use case because then a single process application could perform its reads with many threads and many connections to many RPC service instances in the FDB cluster. This would remove the bottleneck of all requests being routed through the FDB client library instance’s single network thread.

There is a discussion of the RPC layer concept here:

Your ideas are valid, but they do not reflect our tests. We narrowed down our test: reading the very same keys (1000 of them per tx), the keys are the only ones in the database (~1 MB of KV), memory storage. We added three more FDB stateless processes (still running on the MacBook Pro, 6-8 cores). We also built a simple golang benchmark equivalent to the NodeJS one to rule out the JS part: it turns out that performance is ~10% lower in NodeJS, but it is more reliable, and we found one minor improvement that brings it level with golang.

This is clearly some kind of “anti-pattern” benchmark, but we are basically trying to achieve one of the reference benchmarks:

Start 100 parallel clients, each doing:
    Loop for 1 minute:
        Start a transaction
        Perform a random 1000-key range read

Result: 3,600,000 keys/second

The main difference is that we want to get this in a single “instance” of our app.

Our research shows that the network thread is indeed shared across all clients in the same process, but when we actually tried to run everything in parallel we got ~20x performance degradation.

We literally just started 5 parallel instances of our app, each with one client, and we got 32 sec per instance instead of 1.5 sec.

Again, reusing read versions yields the same results.

We regularly do much more than this against a memory cluster but it seems like you’re hitting the same SS process constantly? You really only have <replication_factor> number of single-threaded FDB processes answering (you could try going triple if you have the option). The trace logs on the server will show you what sort of throughput you’re getting (we can regularly get triple digit megabytes/s out of a single FDB process, AWS i3).

On the client side, you are correct that the network thread is single threaded per process so you’d want to run more of your clients but again it seems to me that the dataset is just too small to begin with (there’s work to have read caches but that’s not in any released version).

BTW, in many non-critical (write-then-read) use cases, we re-use the same transaction (obv same read version) and recycle them every second. That way you also get some local caching too (if you’re really reading the same keys, even ranges, all the time).

Java but you get the idea:

That’s a really important difference, and another really important difference is that the range read mentioned in your example is 1 request out with a very large response returned. You are trying to do 1000 individual requests out with 1000 individual responses returned. This is very different, and has orders of magnitude more overhead (because 1000x more requests) in the client for cross-thread synchronization, network serialization and send, and receive overhead to match up the responses to requests (they are not 1:1 in order or even necessarily from the same TCP connection if replication is greater than single).

But what is your throughput from a single client process (and so a single FDB library instance)?

20Gbps fabric, we are seeing > 100 megabytes/s on a single JVM (highest priority to network thread). AWS m5/r5. Most of these are getRanges() with bounded thread pools for call-backs.

Do you know how many requests per second? The goal @ex3ndr is trying to hit here is 1 million get() (not get range) requests per second, each reading a 500 byte value, from a single client, and I am fairly certain this is several times faster than what can be achieved with small point reads on a single client library instance.

we re-use the same transaction

We are doing something like this, but just copying the read version. In our tests performance is similar for both variants (reusing the read version or the tx).

This is very different, and has orders of magnitude more overhead (because 1000x more requests)

One of the selling points is intelligent batching, but it seems that it is not implemented for reads? If every key is on the same SS, then the workload is about the same as for range reads. The only extra overhead is on the network, and that could be avoided or batched. It is just very surprising that this is not the case.

We are seeing per-second scan rates from a single machine going beyond 30k requests/s, so it’s definitely less than a million gets a second. For a data set that’s a few megabytes of KV, you’d almost be better off reading all of it in a single scan and just serving from an in-memory cache =p

No, and no. I’m sorry but you are completely ignoring the constant factor overhead of just having a request, on the client side and on the server side but in this case especially on the client. Once the per-request overhead is paid, the more work that request produces the faster your throughput will be. One million small requests per second is not at all the same as 1000 requests per second where each request produces 1000x the amount of output as each of the million small ones.

I’m going to try explaining this with a non-FDB example in hopes it will help. The concept of per-request overhead and the idea of maximizing per-request output to reduce overhead and maximize throughput is not unique to FDB.

Imagine you have a file sitting in Amazon S3. That file is 500 Mbytes in size. You have a client running in AWS, very close to S3 in terms of network latency and throughput, and you want to use that client to read that file. You open an HTTP connection to the S3 service and then do one of these two things:

  • A) Issue a single Read request to the file data with (offset 0, length 500 million)
  • B) Issue 1 million Read requests in a row, on that same HTTP connection, each requesting the next consecutive 500 bytes of the file. In other words, (offset 0, length 500) then (offset 500, length 500), then (offset 1000, length 500) and so on.

Does it make sense to you that these will not run at the same speed? B will be far, far slower than A. The B approach is incurring per-request overhead a million times. This overhead is significant compared to the work of returning 500 sequential bytes from the file once the request is parsed and the location in the file is reached.
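The difference between A and B can be put into a rough cost model: total time is the per-request overhead paid once per request, plus the time to move the bytes. The 20 microsecond overhead and 1 GB/s bandwidth below are purely illustrative assumptions, not measurements of S3 or FDB, and the model ignores any overlap from pipelining.

```python
# Toy cost model: time = n_requests * per_request_overhead + bytes / bandwidth.

def transfer_time(
    total_bytes: int,
    n_requests: int,
    per_request_overhead_s: float,
    bandwidth_bytes_per_s: float,
) -> float:
    """Seconds to fetch `total_bytes` split evenly across `n_requests`."""
    return n_requests * per_request_overhead_s + total_bytes / bandwidth_bytes_per_s


if __name__ == "__main__":
    OVERHEAD_S = 20e-6   # assumed cost of issuing and matching one request
    BANDWIDTH = 1e9      # assumed 1 GB/s of raw transfer capacity
    SIZE = 500_000_000   # the 500 MB file from the example

    a = transfer_time(SIZE, 1, OVERHEAD_S, BANDWIDTH)
    b = transfer_time(SIZE, 1_000_000, OVERHEAD_S, BANDWIDTH)
    print(f"A (1 request):          {a:.2f} s")
    print(f"B (1 million requests): {b:.2f} s")
```

Under these assumed numbers the byte-transfer term is identical for both approaches; B is slower only because the overhead term is multiplied by a million.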

Thank you!

@ex3ndr This sounds about the same as the 30khz you initially observed with a single process running many concurrent transactions.

As for why you could not get more with more client processes, one possibility is that you are over-committing the CPU cores on your single machine (you should not have fewer CPU cores than you have fdbserver processes + client processes).

Hey! I am not trying to dismiss the constant cost of a request! I am just trying to figure out a way to achieve this performance, and this thread shows that performance is limited when reading the very same 1k keys:

  1. By replication factor, maximum is 3
  2. By number of data centers, maximum is 3 too?
  3. Limited by a client process performance.

Since a single process can yield ~30khz, we get 30 * 9 = 270khz as the total maximum read rate for the same set of keys, meaning there is no way to read them faster? And even then we would have to guarantee that the workload is spread evenly between DCs, which we cannot actually do, so it is more like 90khz.
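The estimate in this post can be restated as a small formula. This only reproduces the poster's own model (per-process rate times replication factor times number of data centers); whether FDB actually scales this way is exactly the open question, so treat it as a restatement of the arithmetic rather than a property of the system.

```python
# The poster's ceiling estimate for reading one fixed set of keys.

def ceiling_khz(per_process_khz: int, replication_factor: int, datacenters: int) -> int:
    """Upper bound on read rate if every replica in every DC serves equally."""
    return per_process_khz * replication_factor * datacenters


if __name__ == "__main__":
    print(ceiling_khz(30, 3, 3), "khz if load is even across 3 DCs")
    print(ceiling_khz(30, 3, 1), "khz if only one DC's replicas are used")
```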

Is this a limiting factor of FDB - the maximum practical rate for reading the same keys will be ~90khz? That’s quite limiting. I bet there is a better option; is it possible to have read replicas?

This limitation is very unfortunate, since I often promote FDB as a replacement for Cassandra, but Cassandra at least has a way to scale such reads, and it seems there is no way to do so in FDB. Or am I wrong and there is a way?

If you can scale the number of clients and you are capped by a single fdb server process, you can always duplicate the keys and just randomly read from one replica (and increase writes).
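A minimal sketch of that key-duplication idea, with no FDB calls. The helper names and the "copy index as key prefix" layout are assumptions for illustration. The prefix is used so that the copies sit far apart in the keyspace and are more likely to land in different shards, though for a very small data set nothing guarantees they end up on different storage servers.

```python
import random
from typing import List


def replica_keys(key: bytes, copies: int) -> List[bytes]:
    """All physical keys for one logical key; a write must set every one of them."""
    return [str(i).encode() + b"/" + key for i in range(copies)]


def pick_read_key(key: bytes, copies: int, rng: random.Random) -> bytes:
    """Choose one copy at random so reads spread across the duplicates."""
    return replica_keys(key, copies)[rng.randrange(copies)]


if __name__ == "__main__":
    rng = random.Random()
    print(replica_keys(b"user/42", 3))
    print(pick_read_key(b"user/42", 3, rng))
```

The trade-off is the one named in the reply: every write becomes `copies` writes, in exchange for spreading reads of a hot key across more server processes.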