We are running a FoundationDB cluster with the following configuration:
Configuration:
  Redundancy mode - triple
  Storage engine - ssd-2
  Coordinators - 8

Cluster:
  FoundationDB processes - 37
  Zones - 8
  Machines - 8
  Memory availability - 7.2 GB per process on machine with least available
  Retransmissions rate - 1 Hz
  Fault Tolerance - 2 machines
The process class configuration is:
4 storage + 1 proxy on 3 nodes
4 storage + 1 stateless on 2 nodes
1 log + 3 stateless on 3 nodes
Every machine has a single SSD.
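As a sanity check on the layout above, the per-node process classes can be tallied to confirm they add up to the 37 processes and 8 machines that the status output reports. This is only a sketch: the `LAYOUT` structure and the `tally` helper are hypothetical names introduced for illustration, and the numbers are copied from the list above.

```python
# Process layout as described above: (process classes per node, number of nodes).
LAYOUT = [
    ({"storage": 4, "proxy": 1}, 3),
    ({"storage": 4, "stateless": 1}, 2),
    ({"log": 1, "stateless": 3}, 3),
]


def tally(layout):
    """Return (machine count, process count, per-class process counts)."""
    machines = 0
    per_class = {}
    for classes, nodes in layout:
        machines += nodes
        for cls, count in classes.items():
            per_class[cls] = per_class.get(cls, 0) + count * nodes
    return machines, sum(per_class.values()), per_class


machines, processes, per_class = tally(LAYOUT)
print(machines, processes, per_class)
# 8 37 {'storage': 20, 'proxy': 3, 'stateless': 11, 'log': 3}
```

The tally also makes the disk sharing explicit: on five of the machines, four storage processes plus one other process share the same SSD.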
We observe periods with spikes in disk I/O (75-80% on some processes) during which transaction processing latency rises sharply. High disk I/O doesn't always lead to high latency, but there are sustained periods (~2 hr) where transaction processing is very slow and disk I/O is high. Very often this coincides with, or happens around the same time as, a decrease in the total disk space used. However, the latency spikes don't seem to correlate with the number of clear key transactions: sometimes a large number of these transactions are processed with minimal latency. It is also not the case that the write rate is low during these periods.
We want to figure out whether we are missing something happening in the cluster that leads to these periods of sustained high latency.
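One way to narrow this down might be to sample the machine-readable status during a slow period and record which processes have busy disks and what the cluster reports as its limiting factor. The sketch below assumes `fdbcli --exec "status json"` prints only the JSON document, and that the fields `cluster.processes.<id>.disk.busy` (taken to be a fraction between 0 and 1), `cluster.processes.<id>.roles[].role`, and `cluster.qos.performance_limited_by.name` are present; these should be checked against the status schema of the FoundationDB version in use. The function names `busy_disks`, `limiting_reason`, and `fetch_status` are hypothetical.

```python
import json
import subprocess


def busy_disks(status, threshold=0.75):
    """Return (address, roles, busy) tuples for processes whose disk busy
    fraction is at or above threshold, busiest first.

    Assumes each process entry carries disk.busy as a 0..1 fraction.
    """
    rows = []
    processes = status.get("cluster", {}).get("processes", {})
    for proc in processes.values():
        busy = proc.get("disk", {}).get("busy")
        if busy is None or busy < threshold:
            continue
        roles = sorted(r.get("role", "?") for r in proc.get("roles", []))
        rows.append((proc.get("address", "?"), roles, busy))
    return sorted(rows, key=lambda row: row[2], reverse=True)


def limiting_reason(status):
    """Return the name of the factor the cluster reports as limiting
    performance, or None if the field is absent (assumed field path)."""
    qos = status.get("cluster", {}).get("qos", {})
    return qos.get("performance_limited_by", {}).get("name")


def fetch_status():
    """Run fdbcli and parse its status json output (assumes fdbcli is on
    PATH and can locate the cluster file)."""
    result = subprocess.run(
        ["fdbcli", "--exec", "status json"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)
```

Logging `busy_disks(fetch_status())` and `limiting_reason(fetch_status())` every few seconds, alongside the observed transaction latency, would show whether the slow periods line up with busy disks on storage processes, on log processes, or on neither.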