Adding some more info.
Added 2 more storage nodes; we now have a 16-node FDB cluster.
3 i3.xl (Tx nodes) + 13 i3.4xl (SP nodes)
FoundationDB 6.2 (v6.2.15)
Redundancy mode - double
Storage engine - ssd-2
Coordinators - 6
Desired Resolvers - 6
Desired Logs - 6
FoundationDB processes - 220
Zones - 16
Machines - 16
Memory availability - 7.3 GB per process on machine with least available
Retransmissions rate - 3 Hz
Fault Tolerance - 1 machine
Server time - 09/03/20 04:43:55
Replication health - Healthy (Repartitioning.)
Moving data - 0.549 GB
Sum of key-value sizes - 574.755 GB
Disk space used - 7.380 TB
Storage server - 1182.3 GB free on most full server
Log server - 824.9 GB free on most full server
Read rate - 43579 Hz
Write rate - 95600 Hz
Transactions started - 277 Hz
Transactions committed - 260 Hz
Conflict rate - 3 Hz
Performance limited by process: Storage server performance (storage queue).
Each i3.4xl runs 12 storage processes (6 per disk); the remaining 4 are stateless processes (3 of these nodes also run 1 coordinator each).
All 3 i3.xl nodes run 2 Tx processes each (they also each run 1 coordinator + 1 stateless process).
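For context, the per-host layout on an i3.4xl corresponds to a foundationdb.conf roughly like this (a sketch, not our actual config; ports and datadir paths are assumptions, and only 3 of the 16 stanzas are shown):

```ini
; Sketch of one i3.4xl's foundationdb.conf (ports/paths are assumptions).
; In full there are 12 [fdbserver.<port>] stanzas with class = storage,
; 6 pointing at each NVMe disk via datadir, plus 4 with class = stateless.
[fdbserver.4500]
class = storage
datadir = /mnt/nvme0/fdb/4500

[fdbserver.4506]
class = storage
datadir = /mnt/nvme1/fdb/4506

[fdbserver.4512]
class = stateless
```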
Even at 40-50K RPS + 50-80K WPS, we see the cluster go into an unhealthy state. The FDB performance-limit reason shows 'storage_server_write_queue_size'.
Disk busy is under 60% (fdb_cluster_processes_disk_busy). Disk reads: 3K-6K, disk writes: 10K-30K.
We do see the storage-process queue go up to 1 GB at random on different SPs. It does not look like it always hits the same SP or the same host.
(fdb_cluster_processes_roles_storage_input_bytes_counter - fdb_cluster_processes_roles_storage_durable_bytes_counter)
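That difference is just the bytes a storage server has accepted but not yet made durable. A tiny sketch of the check (metric names are from the counters above; the sample readings are made up):

```python
# Sketch: derive storage-queue depth from the two counters quoted above.
# The sample values are hypothetical; in practice they come from the
# fdb_cluster_processes_roles_storage_* metrics per storage process.
GIB = 1024 ** 3

def storage_queue_bytes(input_bytes_counter: int, durable_bytes_counter: int) -> int:
    """Bytes ingested by a storage server that are not yet durable on disk."""
    return input_bytes_counter - durable_bytes_counter

# Hypothetical readings for one storage process:
queue = storage_queue_bytes(input_bytes_counter=9 * GIB, durable_bytes_counter=8 * GIB)
print(queue)  # ~1 GiB, the queue depth we see spike
```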
We see SP CPU hitting 100%.
Each write transaction has ~200 records; total transaction size is kept below 1 MB.
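The way we keep a ~200-record transaction under that cap is roughly this (a minimal sketch, not our actual client code; the chunking helper and record sizes are illustrative):

```python
# Sketch: split (key, value) byte pairs into batches whose summed size
# stays under a cap, so each FDB transaction commits well below the
# 10 MB hard limit (we cap ours near 1 MB).
from typing import Iterator, List, Tuple

Record = Tuple[bytes, bytes]

def batches(records: List[Record], cap_bytes: int = 1_000_000) -> Iterator[List[Record]]:
    batch: List[Record] = []
    size = 0
    for key, value in records:
        record_size = len(key) + len(value)
        if batch and size + record_size > cap_bytes:
            yield batch
            batch, size = [], 0
        batch.append((key, value))
        size += record_size
    if batch:
        yield batch

# ~200 records of ~4 KB each (~820 KB total) fit in a single batch:
recs = [(b"k%03d" % i, b"x" * 4096) for i in range(200)]
print(sum(1 for _ in batches(recs)))  # prints 1
```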
We also see a fair number of transaction conflicts.
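For what it's worth, our client-side handling of those conflicts is essentially the standard retry loop, sketched here with a stand-in ConflictError (the real bindings raise retryable FDB errors, and the bindings' transactional wrapper does this loop for you):

```python
# Sketch: retry a transaction body on conflict with jittered backoff,
# the way the FDB bindings' retry loop does. ConflictError is a stand-in
# for FDB's retryable 'not_committed' (conflict) error.
import random
import time

class ConflictError(Exception):
    """Stand-in for a retryable FDB conflict error."""

def run_transaction(body, max_retries: int = 10):
    for attempt in range(max_retries):
        try:
            return body()
        except ConflictError:
            # Brief, jittered backoff spreads retries apart.
            time.sleep(min(0.001 * 2 ** attempt, 0.1) * random.random())
    raise RuntimeError("transaction retries exhausted")

# Demo with a body that conflicts twice and then succeeds:
attempts = {"n": 0}
def body():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConflictError()
    return "committed"

print(run_transaction(body))  # prints "committed" after 2 conflicts
```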
Initially the cluster was 6 nodes (3 Tx + 3 SP); we have not seen linear performance gains as we grew to 13 SP nodes. Trying to figure out where the current bottleneck is. Any help is much appreciated.