Cluster Performance Issue (7.1.43)

We’re currently conducting load tests on the cluster, and initially (when the cluster is new) the tests run smoothly without any observable performance issues. However, after some time the cluster intermittently reports being unhealthy, citing either Storage Server Write Queue Size or Storage Server Write Bandwidth MVCC as the limiting factor. Could someone help provide insights into potential factors that might be causing this behavior?

The test started around 12:30 PM, and we started seeing issues around 3:30 PM.

Some Observations:

  1. Memory Utilization Trend: Memory utilization steadily increased until 3:30 PM, after which it stayed flat and we started seeing issues. What are potential causes for this trend?

  2. Disk Busy Levels: Disk busy surged from ~40% to 100% on all storage nodes around 3:30 PM, with a less drastic increase on non-storage nodes (~40% → ~60%).

  3. Disk Reads: A sudden spike in disk reads occurred at 3:30 PM, following a period of no disk reads despite the load test running since 12:30 PM. There were no changes in the data sent/read by the load test clients.

  4. Disk Writes: No notable changes in disk writes were observed.

  5. CPU Usage: CPU usage showed no significant changes before and after 3:30 PM.

  6. RPS and WPS: Total RPS and WPS reported by FDB remained consistent before and after 3:30 PM.

FDB Version

FoundationDB 7.1 (v7.1.43)
source version 0301045e333ac12c4d1184a4621ce6809d8211d4
protocol fdb00b071010000

Cluster Details

Cluster:
  FoundationDB processes - 76
  Zones                  - 10
  Machines               - 10
  Memory availability    - 7.4 GB per process on machine with least available
  Retransmissions rate   - 8 Hz
  Fault Tolerance        - 1 machines
  Server time            - 01/09/24 18:44:11

Data:
  Replication health     - Healthy (Repartitioning)
  Moving data            - 2.734 GB
  Sum of key-value sizes - 680.918 GB
  Disk space used        - 1.592 TB

Operating space:
  Storage server         - 6426.4 GB free on most full server
  Log server             - 825.2 GB free on most full server

Workload:
  Read rate              - 163265 Hz
  Write rate             - 62972 Hz
  Transactions started   - 75801 Hz
  Transactions committed - 41387 Hz
  Conflict rate          - 0 Hz
  Performance limited by process: Storage server performance (storage queue).

Memory config for storage process:
cache_memory 6GiB
memory 12GiB

Any insights on what is happening, or pointers on what to check, are much appreciated. cc: @amehta

Any help here is much appreciated.

One theory I have: I suspect the current behavior is expected. As long as the data being read still fits in storage server memory, reads are served from cache and no disk reads are issued. FDB doesn’t resort to reading from disk until the available memory is exhausted; once it is, a client reading a key that isn’t cached triggers a disk read. Given the substantial volume of data we are reading and writing, it appears we are approaching the upper limits of the storage servers. Please correct me if this doesn’t make sense.
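A rough back-of-the-envelope check supports this (assuming most of the 76 processes run the storage role): even at 76 × 6 GiB of cache_memory, that is only about 456 GiB of total page cache, well short of the ~681 GB of key-value data (~1.6 TB on disk with replication), so the full dataset cannot stay cached. This would also line up with memory climbing until ~3:30 PM and then flattening once the cache filled up.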

You can check this theory using a test workload. Something like this:
test.txt

    testTitle=RandomReadWriteTest
    testName=ReadWrite
    ; 10 minute test
    testDuration=600.0
    allowedLatency=0.025
    transactionsPerSecond=1000000
    writesPerTransactionA=0
    readsPerTransactionA=5
    writesPerTransactionB=1
    readsPerTransactionB=1
    ; Fraction of transactions that will be of type B
    alpha=0.18
    ; produces roughly 70 GB of key-value data
    nodeCount=20000000
    valueBytes=3800
    ; average value size ~3600 bytes
    minValueBytes=3400
    discardEdgeMeasurements=false
    warmingDelay=20.0
    timeout=300000.0
    databasePingDelay=300000.0

Run:

fdbserver -r multitest -f test.txt

You can increase nodeCount. The total key-value size should exceed the sum of the cache sizes; after that you should be able to reproduce the behavior described in the first message.
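For sizing: with nodeCount = 20,000,000 and an average value size of roughly 3,600 bytes, the test writes about 72 GB of key-value data (hence the "produces roughly 70 GB" comment above). To reproduce the issue, increase nodeCount until this total clearly exceeds the combined cache_memory of the storage processes under test.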

@ajames, to validate your theory, you can check a few metrics:

  • storage cache hit rate: this can be found in status json as storage_hit_rate. Usually we expect the value to be about 0.99. We use --cache_memory=6144MiB --class=storage for storage servers, so less memory for the cache can be a reason for a lower hit ratio (one way to pull this out of status json is sketched after this list).
  • chart the CPU busyness of storage servers by priority: a Splunk query like Type=NetworkMetrics Machine="10.19.12.12:4702" | timechart max(PriorityBusy*)
  • what environment are you using? Bare metal or AWS? If cloud, how many IOPS are provisioned? Insufficient IOPS could cause poor performance.
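To make the hit-rate check easy to repeat, here is a minimal sketch (a hypothetical helper, check_hit_rate.py, not an official tool) that pipes status json into Python and prints every storage_hit_rate it finds, plus the ratekeeper's limiting reason. It searches the document recursively rather than assuming a fixed JSON layout, since the exact paths can differ between versions.

    #!/usr/bin/env python3
    # Minimal sketch (not an official tool): read `status json` from stdin and
    # report every "storage_hit_rate" value found anywhere in the document,
    # plus the ratekeeper's limiting reason if present.
    #
    # Usage (assuming fdbcli prints only the JSON document):
    #   fdbcli --exec 'status json' | python3 check_hit_rate.py
    import json
    import sys

    def walk(node, path=""):
        """Yield (path, value) for every key named 'storage_hit_rate'."""
        if isinstance(node, dict):
            for key, value in node.items():
                child = f"{path}.{key}" if path else key
                if key == "storage_hit_rate":
                    yield child, value
                else:
                    yield from walk(value, child)
        elif isinstance(node, list):
            for i, item in enumerate(node):
                yield from walk(item, f"{path}[{i}]")

    status = json.load(sys.stdin)

    for path, rate in walk(status):
        flag = ""
        if isinstance(rate, (int, float)) and rate < 0.99:
            flag = "  <-- below the ~0.99 we usually expect"
        print(f"{path} = {rate}{flag}")

    # Path assumed from 7.1 status json; adjust if your version differs.
    limited = status.get("cluster", {}).get("qos", {}).get("performance_limited_by", {})
    if limited:
        print("performance_limited_by:", limited.get("description", limited))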

Thank you for the insights. Currently our storage_hit_rate stands at 0.999; however, the cluster is not under stress at the moment. I’ll monitor the situation and investigate further when issues arise.

Regarding the allocation of cache memory: with 12 GB dedicated to the storage process, is there a recommended percentage to allocate to cache memory? My assumption is that cache_memory is a subset of the total memory allocated (12 GB in our case). Could you confirm this?

Our infrastructure uses i3en machines with instance store, and so far no apparent bottleneck has been observed there.

Additionally, I’d appreciate more clarity on the concept of “CPU busyness of storage servers by priorities.” How does this differ from the CPU usage for the storage process that we typically monitor in the status details?

Yes, cache_memory is a subset of the total memory allocated.
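As a rough illustration (not an official recommendation): with memory = 12GiB and cache_memory = 6GiB, at most 6 GiB goes to the page cache and roughly 6 GiB is left for everything else the storage process needs. Since --memory is the hard limit beyond which the process kills itself and restarts, cache_memory should leave comfortable headroom below it.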

CPU busyness of storage servers by priorities is the detailed CPU usage of the storage server broken down by activity (e.g., RPC, disk read, disk write, etc.), while status details report only the aggregate value. The priorities are defined here.
