High P99.9 latencies (~70 ms) on range reads (<1 KiB) with ~1000 reads per second

We have a deployment with 3 servers, using double replication.
Each server has 10 Gbit networking, 256 GB of RAM, 10 TB of NVMe disk, and 48 cores (96 threads).
We run with 3 coordinators (one per node) and a total of 48 fdbserver processes per node; all processes share the same NVMe drive. We use ssd-redwood-1 as the storage engine.

The servers all look “idle” in terms of CPU and I/O load, and network load is also small (100 Mbit/s utilization).

The load comes from a single client process, also on the same 10 Gbit network, doing sequential range reads of <1 KiB. The average read time is ~900 µs, but the P99.9 is around 70 ms and the P99.99 hovers around 350 ms.
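
To show what the read path looks like from our side, here is a simplified sketch of the read loop (not our exact code; it assumes the FDB C API at API version 710, and the key prefix, batch limit, and iteration count are illustrative):

```cpp
// Simplified sketch of the client read loop (illustrative, not our production code).
#define FDB_API_VERSION 710
#include <foundationdb/fdb_c.h>

#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <string>
#include <thread>
#include <vector>

static void check(fdb_error_t err) {
    if (err) {
        std::fprintf(stderr, "fdb error %d: %s\n", err, fdb_get_error(err));
        std::exit(1);
    }
}

int main() {
    check(fdb_select_api_version(FDB_API_VERSION));
    check(fdb_setup_network());
    std::thread network([] { check(fdb_run_network()); }); // the single fdbclient network thread

    FDBDatabase* db = nullptr;
    check(fdb_create_database(nullptr, &db)); // default cluster file

    const std::string begin = "records/"; // illustrative key prefix
    const std::string end   = "records0"; // '0' sorts just after '/'
    std::vector<double> latencies_ms;

    for (int i = 0; i < 100000; ++i) { // illustrative iteration count
        FDBTransaction* tr = nullptr;
        check(fdb_database_create_transaction(db, &tr));

        const auto t0 = std::chrono::steady_clock::now();
        FDBFuture* f = fdb_transaction_get_range(
            tr,
            FDB_KEYSEL_FIRST_GREATER_OR_EQUAL(
                reinterpret_cast<const uint8_t*>(begin.data()), (int)begin.size()),
            FDB_KEYSEL_FIRST_GREATER_OR_EQUAL(
                reinterpret_cast<const uint8_t*>(end.data()), (int)end.size()),
            10 /*limit*/, 0 /*target_bytes*/,
            FDB_STREAMING_MODE_WANT_ALL, 0 /*iteration*/,
            0 /*snapshot*/, 0 /*reverse*/);
        check(fdb_future_block_until_ready(f));
        check(fdb_future_get_error(f));

        const FDBKeyValue* kv = nullptr;
        int count = 0;
        fdb_bool_t more = 0;
        check(fdb_future_get_keyvalue_array(f, &kv, &count, &more));
        const auto t1 = std::chrono::steady_clock::now();
        latencies_ms.push_back(
            std::chrono::duration<double, std::milli>(t1 - t0).count());

        // (advancing `begin` past the last returned key to continue the
        //  sequential scan is omitted for brevity)
        fdb_future_destroy(f);
        fdb_transaction_destroy(tr);
    }

    // sort latencies_ms and read off the P99.9 / P99.99 indices for the plot below
    fdb_database_destroy(db);
    check(fdb_stop_network());
    network.join();
    return 0;
}
```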

bpftrace shows that the load is not I/O constrained and is mostly being served from the fdbserver processes’ cache.

Any tips on how to make FDB latencies more consistent? At such a small load, these high P99.9 latencies are definitely not expected.

Here’s the output of status details:

Configuration:
  Redundancy mode        - double
  Storage engine         - ssd-redwood-1
  Log engine             - ssd-2
  Encryption at-rest     - disabled
  Coordinators           - 3
  Desired Commit Proxies - 3
  Desired GRV Proxies    - 1
  Desired Resolvers      - 1
  Desired Logs           - 8
  Usable Regions         - 1

Cluster:
  FoundationDB processes - 144
  Zones                  - 3
  Machines               - 3
  Memory availability    - 3.3 GB per process on machine with least available
                           >>>>> (WARNING: 4.0 GB recommended) <<<<<
  Retransmissions rate   - 6 Hz
  Fault Tolerance        - 1 machines
  Server time            - 02/15/25 16:52:35

Data:
  Replication health     - Healthy
  Moving data            - 0.000 GB
  Sum of key-value sizes - 7.519 TB
  Disk space used        - 17.076 TB

Operating space:
  Storage server         - 8479.5 GB free on most full server
  Log server             - 8471.4 GB free on most full server

Workload:
  Read rate              - 1271 Hz
  Write rate             - 4552 Hz
  Transactions started   - 25 Hz
  Transactions committed - 11 Hz
  Conflict rate          - 0 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Process performance details:
  10.100.36.123:4500     (  1% cpu;  1% machine; 0.209 Gbps; 37% disk IO; 3.1 GB / 3.4 GB RAM  )
( ...144 entries that all look very similar....)
  10.100.196.188:4542    (  1% cpu;  1% machine; 0.068 Gbps; 20% disk IO; 2.9 GB / 3.3 GB RAM  )
  10.100.196.188:4543    (  1% cpu;  1% machine; 0.068 Gbps; 20% disk IO; 2.9 GB / 3.3 GB RAM  )
  10.100.196.188:4544    (  1% cpu;  1% machine; 0.068 Gbps; 21% disk IO; 3.1 GB / 3.3 GB RAM  )
  10.100.196.188:4545    (  1% cpu;  1% machine; 0.068 Gbps; 21% disk IO; 3.1 GB / 3.3 GB RAM  )
  10.100.196.188:4546    (  1% cpu;  1% machine; 0.068 Gbps; 21% disk IO; 3.0 GB / 3.3 GB RAM  )
  10.100.196.188:4547    (  1% cpu;  1% machine; 0.068 Gbps; 21% disk IO; 3.2 GB / 3.3 GB RAM  )

Coordination servers:
  10.100.36.123:4500  (reachable)
  10.100.196.187:4500  (reachable)
  10.100.196.188:4500  (reachable)

Client time: 02/15/25 16:52:35

WARNING: A single process is both a transaction log and a storage server.
  For best performance use dedicated disks for the transaction logs by setting process classes.

We don’t limit the RAM per process; the servers are dedicated to FoundationDB. We assign 1 CPU to each process and expect it to use 4 GiB per process. fdbcli status details shows utilization evenly spread across the 144 fdbserver processes.

Our C++ client code uses 1 fdbclient thread, and that thread’s busyness isn’t high (~16%).
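
For reference, the single fdbclient thread is just the one running fdb_run_network(); roughly like this (again a sketch assuming the FDB C API, not our exact initialization code):

```cpp
// Sketch of the client initialization (assumes the FDB C API at version 710).
// All reads issued by the application threads are completed by this one
// network thread, which is the fdbclient thread we measure at ~16% busy.
#define FDB_API_VERSION 710
#include <foundationdb/fdb_c.h>
#include <thread>

int main() {
    fdb_select_api_version(FDB_API_VERSION);
    fdb_setup_network();
    std::thread network([] { fdb_run_network(); }); // the single client network thread

    // ... application threads create transactions and block on futures here;
    //     all of their network I/O is multiplexed onto the thread above ...

    fdb_stop_network();
    network.join();
    return 0;
}
```
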
Any tips are appreciated, thank you.
We can share more details if necessary.

Here’s a plot of client-side latency metrics: