Are spikes of 500ms+ MaxRowReadLatency normal?

Hi all. I’m new to FoundationDB, and have been trying to get a feel for what it is capable of.

I’ve set up a test environment doing about 1k reads and 0.5k writes per second, running in ‘double ssd’ mode across 3 bare metal machines. Each machine runs 3 storage processes and 1 log process, with each process on its own SSD, plus a separate proxy process and a separate stateless process.

I have enabled the trace file in the little client I’ve written using the Rust bindings. Latency98 seems to hover around the 20ms mark, but MaxRowReadLatency seems to jump around - sometimes <20ms, sometimes 500ms, occasionally exceeding a second. Over the course of around an hour, I see bucket counts like this:

<10ms: 544
<25ms: 280
<100ms: 0
<400ms: 8
<800ms: 60
800ms+: 2

I’m just wondering if that sort of maximum latency is normal, or whether it might be indicative of a problem somewhere?

Even your lower latency figures are way too high. I’d expect a p99 for a single row read to be <2ms on bare metal.

Be mindful that FDB runs a single network thread per process. If you’re using multiple threads in your Rust client and running only one client process, the latency you’re observing is probably that of saturating the network thread and not of your FDB cluster. top should suffice to check for this.

Otherwise, I’m struggling to think of anything that could have been easily misconfigured that would result in a high MaxRowReadLatency. It’s thus probably time to start posting pastebins of your status json output from fdbcli and your client+server trace logs to see if we can work out what’s gone wrong.

Thanks Alex, that’s good to know. I suspect the issue may lie in my having used .wait() to block until a key’s value is returned, so I’ll start by rewriting that code and/or running another test in Go to see how things go.

The latency figure that you’re looking at comes from within the FDB client library itself. Using .wait() to block on the get(key) future is the expected thing to do, and would occur in your bindings wrapping the client library, and thus wouldn’t affect the client library’s perceived latency. This is why I’m suspicious that you’ve either accidentally fallen into an FDB benchmarking anti-pattern in your client code (of saturating the networking thread), or that there’s something horrible happening within your FDB cluster (like a recovery happening every 5s).
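If it helps, here’s a rough sketch of what I mean, assuming the `foundationdb` crate (the exact API differs between binding versions, and the key is a placeholder): block on the get future as usual, but also time the read on the client side. If the client-observed number is much larger than the MaxRowReadLatency in the trace, the time is being spent queueing for the saturated network thread rather than in the cluster.

```rust
// Rough sketch with the `foundationdb` crate (exact API differs between
// binding versions; the key is a placeholder). Blocking on the get future
// is fine; the latency reported in the client trace is measured inside
// the client library's network thread.
use std::time::Instant;

fn main() {
    // Boot starts the single network thread; keep the guard alive while
    // the database handle is in use.
    let _network = unsafe { foundationdb::boot() };
    let db = foundationdb::Database::default().expect("could not open database");

    let trx = db.create_trx().expect("could not create transaction");
    let start = Instant::now();
    // futures::executor::block_on plays the role of .wait() here.
    let _value = futures::executor::block_on(trx.get(b"some-key", false))
        .expect("read failed");
    println!("client-observed read latency: {:?}", start.elapsed());
}
```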

The .wait() calls in Rust block Rust’s event loop, so I thought that might explain it, but unfortunately the performance is the same without them.

The only other thing I can think of is that these servers are running in a shared data center - all 3 in the same building - but they are connected over a WireGuard link running on top of the data center’s normal network rather than over a separate private network. They have a <1ms ping to each other, and retransmits are only occasionally seen. Running the client app on one of the FDB servers or on a separate server does not seem to make a difference.

The client process is not CPU bound, and the keys and values are not large. If I monitor the status, I don’t see any recoveries happening. Here’s a snippet from the client trace files and the status output; please let me know if anything else might be useful.

client trace & status
https://pastebin.com/9Rt7v2tZ

status.json
https://pastebin.com/kj5XCNPM

In the server logs, the N2_ReadErrors aren’t related to this - they come from a separate script that runs once every 30 seconds, and the latency doesn’t change when that script is disabled.

https://github.com/eastern-grey/fdblogs/blob/master/server-logs.tgz

As an experiment I switched to ‘single ssd’ and excluded the 2nd and 3rd machines so that only a single machine is serving queries. The max row read latency is spiking a lot less, hovering around 1-10ms when running the client on the FDB server, with only one 200ms and one 500ms outlier in a sample of 500 metrics.

Perhaps it is network-related? Here’s a quick log from iperf across two of the machines.

https://pastebin.com/N5Uc9PUx

Ok, it looks like the network was to blame for the max row read latency. I’d naively assumed that lost packets would be retransmitted after around 2*RTT, but Linux’s minimum TCP retransmission timeout is 200ms by default. Setting rto_min to a smaller value on the relevant route seems to improve the worst-case latency, at the cost of some unnecessary retransmits.
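For anyone searching later, this is roughly what I mean - rto_min is applied per route rather than per interface, and the subnet, device name and value below are just examples for a setup like mine:

```
# Lower the minimum TCP retransmission timeout for traffic over the
# WireGuard route (placeholder subnet/interface/value).
ip route change 10.0.0.0/24 dev wg0 rto_min 20ms
```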

Regarding Latency98 in the trace file, it seems to only measure transactions that modify the database, so rather than the p98 of a single row read, I’m guessing the 20ms was the p98 of the time my test code took to complete a write transaction? Each transaction was doing approximately 2 gets and 3-4 (double) sets, so I’m guessing the per-op latency works out to be about the 2ms you mentioned.

Edit: I’ve spun this out into a new post: Proxies and packet loss/retransmits

Your server logs look largely fine. The only hints that I see are what you’ve already determined: your network seems to be flaky. Every time that there’s a network reconnection, your storage servers receive a small burst of read traffic, suggesting that whatever was reconnecting was building up read requests to send once the network connection was re-established. This would seem to be in line with the high tail read latency you’re seeing.

Writes in transactions are buffered locally, so that 20ms is just the time to send one commit message to the system and get a response back. The commit involves a lot of server<->server communication, so my guess is that in rare cases the same network flakiness that’s affecting your client also affects the communication between servers, and they too end up with multi-ms stalls sending packets to each other, which shows up in your total latency.
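To make that concrete, here’s a minimal sketch (using the `foundationdb` crate for illustration; keys, values and the function name are placeholders): the set calls just buffer mutations in the local transaction, and only commit() performs the network round trip that Latency98 captures.

```rust
// Illustrative sketch with the `foundationdb` crate (API details vary by
// version; keys/values are placeholders). set() only buffers the mutation
// client-side; the network round trip happens in commit().
use std::time::Instant;

fn write_some_rows(db: &foundationdb::Database) {
    let trx = db.create_trx().expect("could not create transaction");

    // Buffered locally, returns immediately.
    trx.set(b"row/1", b"value-1");
    trx.set(b"row/2", b"value-2");

    // Only this call talks to the cluster, so it is roughly what
    // Latency98 is timing.
    let start = Instant::now();
    futures::executor::block_on(trx.commit()).expect("commit failed");
    println!("commit latency: {:?}", start.elapsed());
}
```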

It takes 2s of not hearing a heartbeat for us to consider a node failed, and all of your network drops are shorter than 1s, so there’s no sign of recoveries happening or of FDB thinking a node failed - there’s only high tail latency.