We have a deployment with 3 servers, using double replication.
Each server has 10 Gbit networking, 256 GB of RAM, 10 TB of NVMe disk, and 48 cores (96 threads).
We run with 3 coordinators (one per node) and a total of 48 fdbserver processes per node, all sharing the same NVMe drive. We use ssd-redwood-1 as the storage engine.
The servers all look “idle” in terms of CPU and I/O load, and network load is also small (around 100 Mbit/s utilization).
The load comes from a single process, also on the same 10 Gbit network, doing sequential reads of <1 KiB. The average read time is around 900 microseconds, but the P99.9 is around 70 ms and the P99.99 hovers around 350 ms.
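Roughly, each read is a single blocking get with a timer around it, along these lines (a simplified sketch, not our exact code; the helper name, key handling, and error handling are illustrative):

```cpp
// Simplified sketch of the client read path (illustrative only; key layout
// and error/retry handling are trimmed from the real code).
// Assumes the fdb_c network thread is already running and `db` is an open FDBDatabase*.
#define FDB_API_VERSION 710
#include <foundationdb/fdb_c.h>

#include <chrono>
#include <string>

// Perform one blocking read and return its latency in microseconds,
// or a negative value on error. (Hypothetical helper, for illustration.)
static long read_one(FDBDatabase* db, const std::string& key) {
    FDBTransaction* tr = nullptr;
    if (fdb_database_create_transaction(db, &tr) != 0)
        return -1;

    auto start = std::chrono::steady_clock::now();

    FDBFuture* f = fdb_transaction_get(
        tr, reinterpret_cast<const uint8_t*>(key.data()),
        static_cast<int>(key.size()), /*snapshot=*/0);

    fdb_error_t err = fdb_future_block_until_ready(f);
    if (err == 0)
        err = fdb_future_get_error(f);

    fdb_bool_t present = 0;
    const uint8_t* value = nullptr;
    int value_len = 0;
    if (err == 0)
        err = fdb_future_get_value(f, &present, &value, &value_len);

    auto elapsed_us = std::chrono::duration_cast<std::chrono::microseconds>(
                          std::chrono::steady_clock::now() - start)
                          .count();

    fdb_future_destroy(f);
    fdb_transaction_destroy(tr);
    return err == 0 ? static_cast<long>(elapsed_us) : -1;
}
```

The per-read latencies from this loop are what the percentiles above are computed from.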
bpftrace shows that the load is not I/O constrained and is mostly being served from the fdbserver processes’ cache.
Any tips on how to make FDB latencies more consistent? At such a small load, such high P99.9 latencies are definitely not expected.
Here’s the output of status details:
Configuration:
Redundancy mode - double
Storage engine - ssd-redwood-1
Log engine - ssd-2
Encryption at-rest - disabled
Coordinators - 3
Desired Commit Proxies - 3
Desired GRV Proxies - 1
Desired Resolvers - 1
Desired Logs - 8
Usable Regions - 1
Cluster:
FoundationDB processes - 144
Zones - 3
Machines - 3
Memory availability - 3.3 GB per process on machine with least available
>>>>> (WARNING: 4.0 GB recommended) <<<<<
Retransmissions rate - 6 Hz
Fault Tolerance - 1 machines
Server time - 02/15/25 16:52:35
Data:
Replication health - Healthy
Moving data - 0.000 GB
Sum of key-value sizes - 7.519 TB
Disk space used - 17.076 TB
Operating space:
Storage server - 8479.5 GB free on most full server
Log server - 8471.4 GB free on most full server
Workload:
Read rate - 1271 Hz
Write rate - 4552 Hz
Transactions started - 25 Hz
Transactions committed - 11 Hz
Conflict rate - 0 Hz
Backup and DR:
Running backups - 0
Running DRs - 0
Process performance details:
10.100.36.123:4500 ( 1% cpu; 1% machine; 0.209 Gbps; 37% disk IO; 3.1 GB / 3.4 GB RAM )
( ...144 entries that all look very similar....)
10.100.196.188:4542 ( 1% cpu; 1% machine; 0.068 Gbps; 20% disk IO; 2.9 GB / 3.3 GB RAM )
10.100.196.188:4543 ( 1% cpu; 1% machine; 0.068 Gbps; 20% disk IO; 2.9 GB / 3.3 GB RAM )
10.100.196.188:4544 ( 1% cpu; 1% machine; 0.068 Gbps; 21% disk IO; 3.1 GB / 3.3 GB RAM )
10.100.196.188:4545 ( 1% cpu; 1% machine; 0.068 Gbps; 21% disk IO; 3.1 GB / 3.3 GB RAM )
10.100.196.188:4546 ( 1% cpu; 1% machine; 0.068 Gbps; 21% disk IO; 3.0 GB / 3.3 GB RAM )
10.100.196.188:4547 ( 1% cpu; 1% machine; 0.068 Gbps; 21% disk IO; 3.2 GB / 3.3 GB RAM )
Coordination servers:
10.100.36.123:4500 (reachable)
10.100.196.187:4500 (reachable)
10.100.196.188:4500 (reachable)
Client time: 02/15/25 16:52:35
WARNING: A single process is both a transaction log and a storage server.
For best performance use dedicated disks for the transaction logs by setting process classes.
We don’t limit the RAM per process; the servers are dedicated to FoundationDB. We assign 1 CPU to each process and expect it to use 4 GiB per process. fdbcli status details shows utilization spread evenly across the 144 fdbserver processes.
Our C++ client code uses a single fdbclient network thread, and that thread’s busyness isn’t high (around 16%).
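For reference, the client initialization is the standard single-network-thread setup, roughly like this (a simplified sketch; the cluster file path and error handling are illustrative):

```cpp
// Sketch of the client startup (simplified): one fdb_c network thread,
// started once at process startup, which all reads funnel through.
#define FDB_API_VERSION 710
#include <foundationdb/fdb_c.h>

#include <cstdio>
#include <thread>

int main() {
    if (fdb_select_api_version(FDB_API_VERSION) != 0)
        return 1;
    if (fdb_setup_network() != 0) {
        std::fprintf(stderr, "fdb_setup_network failed\n");
        return 1;
    }

    // The single client network thread: every request in the process is
    // serviced by this one event loop.
    std::thread network([] { fdb_run_network(); });

    FDBDatabase* db = nullptr;
    // Cluster file path is the default location; illustrative only.
    if (fdb_create_database("/etc/foundationdb/fdb.cluster", &db) != 0) {
        std::fprintf(stderr, "fdb_create_database failed\n");
        return 1;
    }

    // ... issue reads (e.g. the read loop sketched above) ...

    fdb_database_destroy(db);
    fdb_stop_network();
    network.join();
    return 0;
}
```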
Any tips are appreciated, thank you.
We can share more details if necessary.
Here’s a plot of the client-side latency metrics: