How does FDB execute multiple getRange requests in parallel?

(Kun Ren) #1

We found that FoundationDB is a very good product and plan to use it. I have some questions about how FoundationDB executes multiple getRange() requests.

  1. For example, suppose the application sends 1000 getRange requests to the FDB client, each wrapped in a CompletableFuture for
    asynchronous invocation. How does the FDB client handle these 1000 getRange requests? Let's say the client puts them into a
    request queue; how are they then executed in parallel? I have two guesses and am not sure which is correct:
    (1) Assume there is only one thread. Does this thread take a getRange request from the queue, immediately send it to the storage server, then take
    the next request from the queue and send it, receiving responses from the storage server asynchronously? With multiple threads, each thread would do the same thing, right? If so, many getRange requests could be running in parallel, which is good.
    (2) I checked the getRange code in NativeAPI.actor.cpp, and it looks like it queries multiple storage servers sequentially when a getRange needs to read data
    from several of them, because it wants to return the values in order. Does this mean each thread takes a getRange request from the queue and then blocks waiting for the storage server's response,
    moving on to the next request only after finishing the current one? If that is the case, with N worker threads there are at most N getRange requests running in parallel, right?

Can you please let me know whether the FDB client runs like (1) or (2)? Basically, how does the FDB client execute getRange requests in parallel? It would be great if you could show me the correct flow when the FDB client receives many getRange requests.

  2. Also, does the connection between the FDB client and a storage server use something like bidirectional streaming, or does the client just send getKeyValues requests to the storage server in parallel?

  3. Do you think batching multiple getKeyValues requests on the client side and sending them to the storage server in one call would be more efficient? Even if the client had a MultiGetRanges function, I think the latency should be the same as executing the getRange queries in parallel, since MultiGetRanges still needs to wait for all responses; I just want to confirm.

  4. Our current situation is that, inside one read-only transaction, our application may send many getRange requests to the FDB client, say more than 2500, which means 2500 CompletableFutures. Do you think that would be a problem? Or, if FDB already handles these getRange requests well, would it be fine?
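For reference, guess (1) above can be sketched with plain java.util.concurrent futures. This is not the FDB client, just a stand-in that models overlapped round trips; the 100-request count and 50 ms per-request latency are assumed figures. The point is that if all requests are issued before any is awaited, total time is close to one round trip, not the sum:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public class PipelineDemo {
    // Stand-in for one getRange round trip: a future that completes after
    // `ms` milliseconds, like a response arriving from a storage server.
    static CompletableFuture<Integer> fakeGetRange(int id, long ms) {
        return CompletableFuture.supplyAsync(() -> id,
                CompletableFuture.delayedExecutor(ms, TimeUnit.MILLISECONDS));
    }

    // Issue `n` requests up front without blocking, then wait for all of them.
    static long elapsedMillisFor(int n, long perRequestMs) {
        long start = System.nanoTime();
        List<CompletableFuture<Integer>> futures = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            futures.add(fakeGetRange(i, perRequestMs)); // send, do not wait
        }
        CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // 100 overlapped 50 ms "round trips" finish in roughly one round trip,
        // not 100 * 50 ms = 5000 ms.
        System.out.println("elapsed ms: " + elapsedMillisFor(100, 50));
    }
}
```

If the loop instead called join() right after each fakeGetRange (guess (2) with one thread), the same run would take roughly 100 * 50 ms.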

Thanks a lot and appreciate your reply.

(A.J. Beamon) #2
  1. There is a single thread on the client onto which all operations are serialized. This thread can issue requests to the cluster and then begin processing other requests while a request is outstanding; it does not need to block on the response. I’ll note that there is a little bit of work on the client thread associated with each request, and issuing a sufficiently large number of requests can saturate the thread and increase your apparent latencies.

  2. The client maintains connections to each process in the cluster it connects to, but each request and response are sent in whole rather than in some streaming way when looked at from the perspective of the native client. However, the range read interface provides a “streaming mode” that controls how big of batches each range request gets back, so a single range read issued from the bindings may actually correspond with several serial requests to the cluster.

  3. Probably in some sense (e.g. bytes sent over the network, number of function calls) it could be made more efficient to request multiple ranges in one request, though I’m not sure the win would be very significant. In terms of the latency, you may see worse latency in terms of time to first result with a MultiGetRange type of request if it’s required to wait for all results before returning.

  4. As mentioned above, there is a limit to how much you can do on a single client thread. I’m not aware of any issues with having 2500 requests in flight simultaneously, but depending on the rate at which you are issuing these batches of requests you may exceed the total rate at which the thread can process requests (in which case the best solution is to spread the load over more clients).

(Meng Xu) #3

Out of curiosity, what is the typical maximum number of requests before Ratekeeper kicks in to slow down the requests?

(Maybe a better question is: what metrics does Ratekeeper use for traffic control?)

(A.J. Beamon) #4

Ratekeeper manages load on the cluster side, and when it chooses to clamp down depends on the size of the cluster and the workload.

The limit I was referring to was how much work the client thread can do, and it is not controlled by ratekeeper. I don’t really have any real numbers to give you, but a rough guess might be that you can do some tens of thousands of get range requests per second on a client.
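That ceiling can be put into back-of-the-envelope arithmetic. The 20 microseconds of client-thread work per request below is purely an assumed figure, not a measured FDB number:

```java
public class ClientThreadCap {
    // One network thread with a fixed CPU cost per request caps throughput
    // at the reciprocal of that cost, no matter how many requests are in flight.
    static long maxRequestsPerSecond(double perRequestMicros) {
        return (long) (1_000_000.0 / perRequestMicros);
    }

    public static void main(String[] args) {
        // A hypothetical 20 microseconds per request gives a ceiling of
        // 50,000 requests/sec on one client, i.e. "some tens of thousands".
        System.out.println(maxRequestsPerSecond(20.0));
    }
}
```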

(Kun Ren) #5

Thanks for your reply.

  1. Just to confirm: FDB is able to execute multiple getRange requests in a pipelined fashion, running in parallel, right?

  2. Based on your response, let’s use the following simpler scenario to get more clarification on end-to-end latency.

Assume we have one client and one server, and 100 getRange(.) calls, {G0, G1, …, G99}. Each call Gi, i = 0, 1, …, 99, falls into a different range, but all ranges fall on the same server. These 100 getRange(.) calls originate from 100 Java getRange(.) calls (each returning a CompletableFuture) issued by the same Java client via the Java binding library.

(1) Following your response, in today’s FDB implementation the client thread can keep sending G0, G1, …, G99 in sequence, without blocking, by continuously pulling them from the request queue populated by the 100 Java calls. At the same time, the client thread can receive the responses to G0, G1, …, so receiving may overlap with sending later requests.
(2) Assume there is a multiGetRange implementation that lets the client thread pack the 100 getRange() requests into a single call and send that call to the server. The client then receives a single response containing all 100 getRange() results.
(3) In our Java application, we care only about the duration that starts when G0 is sent and ends when the responses to all getRange requests have been received.
(4) Each call, whether a single getRange(.) call or the hypothesized multiGetRange call, must go over the network (Cost 1), arrive at the server-side FDB core processing layer (Cost 2), be dispatched to the function that implements getRange() or multiGetRange (Cost 3), and finally be processed by getRange() or multiGetRange (Cost 4), following the conventional RPC paradigm. The return path may incur similar costs as well.
(5) In both implementations, our understanding is that Cost 1, Cost 2, and Cost 4 are more or less the same. Thus the only difference is Cost 3: the 100 getRange(.) calls will likely traverse the dispatcher’s call stack 100 times, whereas a single multiGetRange(.) call traverses it only once.

In summary, the further question is: is it true that the end-to-end latency difference between the 100 getRange(.) calls and the single multiGetRange(.) call comes down to Cost 3, the call stack of the dispatcher in the RPC paradigm?
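Spelling out that cost model (assuming, as in (1), that sends overlap on the wire while dispatch and processing serialize on the server, and ignoring the return path):

```
T(100 x getRange)    ≈ Cost 1 + Cost 2 + 100 x (Cost 3 + Cost 4)
T(1 x multiGetRange) ≈ Cost 1 + Cost 2 + Cost 3 + 100 x Cost 4
difference           ≈ 99 x Cost 3
```

So under these assumptions the gap is the 99 extra traversals of the dispatcher’s call stack (Cost 3).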

Thanks.

(Markus Pilman) #6

I think most of these statements are true, though I would like to point out that (as always with these kinds of things) there are many additional variables:

  1. Each getRange request will be pushed to the storage engine. This happens concurrently (in an asynchronous manner, not in parallel). This would be true (though it might be optimized there) even if FoundationDB provided a multiGetRange operation.
  2. If you are using the memory engine, all requests will be answered serially, due to the lack of parallelism (assuming all your requests go to the same storage process). But in practice you will have at least replication, so even if all requests hit the same partition, they would likely go to two or three storage processes, giving you some parallelism.
  3. If you are using the ssd storage engine, each read will issue a number of requests to the local disk, but only up to 64 read requests run concurrently. So your 100 requests will take at least two rounds, roughly doubling the latency.
  4. How much disk parallelism you get depends on your disk and your OS (plus OS configuration). It also depends on your cache size and cache hit rate (FDB’s own cache, not the OS cache, since FDB uses kernel AIO on Linux).

As you can see, most of these caveats would also apply to a multiGetRange request. So at the end of the day your question is very difficult to answer. But the short answer is: several getRange requests will typically be faster if they are executed in parallel, but there are limits.
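Point 3 above can be illustrated with a self-contained queueing sketch. This uses a plain thread pool as a stand-in for the ssd engine’s 64-slot read limit; the 50 ms per-read latency is an assumed figure:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class DiskQueueDemo {
    // Model a storage process that serves at most `slots` reads at once
    // (standing in for the ssd engine's cap of 64 concurrent disk reads):
    // a fixed pool of `slots` workers, each "read" taking `readMs`.
    static long elapsedMillisFor(int requests, int slots, long readMs) {
        ExecutorService pool = Executors.newFixedThreadPool(slots);
        long start = System.nanoTime();
        for (int i = 0; i < requests; i++) {
            pool.submit(() -> {
                try {
                    Thread.sleep(readMs); // simulated disk read latency
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // 100 reads through 64 slots need two waves, so roughly 2x one read's latency.
        System.out.println("elapsed ms: " + elapsedMillisFor(100, 64, 50));
    }
}
```

With 100 slots instead of 64, the same 100 reads would finish in roughly one wave (~50 ms).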