We’re currently running load tests on the cluster, and initially (when the cluster is new) the tests run smoothly without any observable performance issues. However, after some time the cluster intermittently reports being unhealthy, citing either Storage Server Write Queue Size or Storage Server Write Bandwidth MVCC. Could someone provide insight into the potential factors causing this behavior?
The test started ~12:30 PM; we started seeing issues ~3:30 PM.
Some Observations:
Memory Utilization Trend: Memory utilization steadily increased until 3:30 PM, after which it stayed flat and we started seeing issues. What could cause this trend?
Disk Busy Levels: Disk Busy surged from ~40% to 100% on all storage nodes around 3:30 PM, with a less drastic increase on non-storage nodes (~40% → ~60%).
Disk Reads: A sudden spike in disk reads occurred at 3:30 PM, following a period of no disk reads despite the load test running since 12:30 PM. There was no change in the data sent or read by the load test clients. (A way to cross-check these numbers from status json is sketched below.)
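For reference, the Disk Busy and disk read numbers above can be cross-checked against status json. A minimal sketch, assuming fdbcli and jq are available (field paths as they appear in our 7.1 status json):
# per-process disk busyness, read rate, and write rate
fdbcli --exec 'status json' | jq -r '
  .cluster.processes[]
  | [.address, .class_type, .disk.busy, .disk.reads.hz, .disk.writes.hz]
  | @tsv'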
FoundationDB 7.1 (v7.1.43)
source version 0301045e333ac12c4d1184a4621ce6809d8211d4
protocol fdb00b071010000
Cluster Details
Cluster:
FoundationDB processes - 76
Zones - 10
Machines - 10
Memory availability - 7.4 GB per process on machine with least available
Retransmissions rate - 8 Hz
Fault Tolerance - 1 machines
Server time - 01/09/24 18:44:11
Data:
Replication health - Healthy (Repartitioning)
Moving data - 2.734 GB
Sum of key-value sizes - 680.918 GB
Disk space used - 1.592 TB
Operating space:
Storage server - 6426.4 GB free on most full server
Log server - 825.2 GB free on most full server
Workload:
Read rate - 163265 Hz
Write rate - 62972 Hz
Transactions started - 75801 Hz
Transactions committed - 41387 Hz
Conflict rate - 0 Hz
Performance limited by process: Storage server performance (storage queue).
Memory config for storage processes: cache_memory = 6 GiB, memory = 12 GiB
One theory I have: I suspect the current behavior is expected. As long as the data being read is still resident in the storage server’s cache, no disk reads are needed; FDB doesn’t resort to reading from disk until the available cache memory is exhausted. Once memory is fully utilized, any client reading a key that is no longer cached triggers a disk read. Given the substantial volume of operations we are driving, it appears we are approaching the upper limits of the storage servers. Please correct me if this doesn’t make sense.
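A rough back-of-envelope seems to support this (the storage-process count below is my assumption, not something I’ve confirmed):
  aggregate cache ≈ 50 storage processes × 6 GiB cache_memory ≈ 300 GiB
  logical data    ≈ 681 GB (≈ 1.6 TB on disk with replication)
Once the working set touched by the load test outgrows the aggregate cache, reads that used to be served from memory start hitting disk, which matches the sudden jump in disk reads and Disk Busy at ~3:30 PM.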
You can check this theory with a test workload. Something like this:
test.txt
testTitle=RandomReadWriteTest
testName=ReadWrite
; 10 minute test
testDuration=600.0
allowedLatency=0.025
transactionsPerSecond=1000000
writesPerTransactionA=0
readsPerTransactionA=5
writesPerTransactionB=1
readsPerTransactionB=1
; Fraction of transactions that will be of type B
alpha=0.18
; produces ~70 GB of key-value data
nodeCount=20000000
valueBytes=3800
; average 3600
minValueBytes=3400
discardEdgeMeasurements=false
warmingDelay=20.0
timeout=300000.0
databasePingDelay=300000.0
Run:
fdbserver -r multitest -f test.txt
You can increase nodeCount; the goal is a total key-value size larger than the sum of the cache sizes. After that you should be able to reproduce the behavior from the first message.
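For example, sizing nodeCount against the cache (a rough sketch; the 50 × 6 GiB aggregate cache figure is only an illustrative assumption):
; total key-value size ≈ nodeCount × average value size
;   20,000,000 × ~3,600 B ≈ 72 GB   (the ~70 GB noted above)
; to exceed, say, 300 GiB of aggregate cache (50 storage servers × 6 GiB each),
; raise nodeCount to roughly 90,000,000 or more:
nodeCount=90000000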
@ajames, to validate your theory, you can check a few metrics:
storage cache hit rate: this can be found in status json, i.e., storage_hit_rate. Usually we expect the value to be around 0.99. We use --cache_memory=6144MiB --class=storage for our storage servers; too little memory for the cache can be a reason for a lower hit ratio. (One way to pull this metric is sketched after this list.)
chart CPU busyness of storage servers by priority: a Splunk query like Type=NetworkMetrics Machine="10.19.12.12:4702" | timechart max(PriorityBusy*)
what environment are you using, bare metal or AWS? If cloud, how many IOPS are provisioned? Insufficient IOPS could be the cause of the poor performance.
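For the first check, here is a minimal sketch for pulling the hit rate out of status json (it assumes jq is available and searches for the storage_hit_rate field wherever it appears, since the exact path can vary by version):
# collect storage_hit_rate from every storage role reported in status json
fdbcli --exec 'status json' \
  | jq '[.. | objects | select(has("storage_hit_rate")) | .storage_hit_rate]'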
Thank you for the insights. Currently our storage_hit_rate stands at 0.999; however, the cluster is not under stress right now. I’ll monitor the situation and investigate further when the issue arises again.
Regarding the allocation of cache memory, with 12 GB dedicated to the storage process, is there a recommended percentage to allocate to cache memory? My assumption is that cache_memory is a subset of the total memory allocated (12 GB in our case). Could you confirm this?
Our infrastructure uses i3en instances with instance store, and so far no apparent bottleneck has been observed.
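For completeness, this is roughly how we spot-check the instance-store devices themselves; just standard iostat, nothing FDB-specific:
# extended per-device stats every second; %util pinned near 100 with high r/s
# at the same time as the FDB "Disk Busy" spike would point at the disk itself
iostat -x 1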
Additionally, I’d appreciate more clarity on the concept of “CPU busyness of storage servers by priorities.” How does this differ from the CPU usage for the storage process that we typically monitor in the status details?
Yes, cache_memory is a subset of the total memory allocated.
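For example, in foundationdb.conf both knobs are set on the same storage process and cache_memory is counted inside the memory budget (a sketch; the section name/port is illustrative, values mirror this thread):
# foundationdb.conf (sketch)
[fdbserver.4500]
class = storage
# hard per-process memory limit
memory = 12GiB
# page cache for the storage engine, carved out of the 12GiB above
cache_memory = 6GiB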
CPU busyness of storage servers by priority is the CPU usage of the storage server broken down by task priority (e.g., RPC, disk read, disk write, etc.), while status details reports only the aggregate value. The priorities are defined here.
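If Splunk isn’t handy, the same PriorityBusy* fields can be read straight from the trace logs (a sketch; it assumes JSON trace output via --trace_format json and the default /var/log/foundationdb trace directory):
# show the PriorityBusy* breakdown from the most recent NetworkMetrics event
grep -h '"Type": *"NetworkMetrics"' /var/log/foundationdb/trace.*.json | tail -1 \
  | jq 'with_entries(select(.key | startswith("PriorityBusy")))'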