Understanding the testing class and performance tuning in FoundationDB

Hi all,

I created an FDB cluster with the Kubernetes operator on AWS. The K8s cluster spec:
8 c5a.4xlarge instances (16 cores, 32 GB RAM each), 128 cores and 256 GB RAM in total.
Here is my K8s FDB cluster spec:

apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  labels:
    cluster-group: foundationdb-cluster
  name: foundationdb-cluster
spec:
  version: 7.1.25
  faultDomain:
    key: foundationdb.org/none
  processCounts:
    cluster_controller: 1
    stateless: 8
    log: 4
    storage: 8
    test: 10
  databaseConfiguration:
    redundancy_mode: "double"
    commit_proxies: 4
    grv_proxies: 2
  processes:
    general:
      customParameters:
      - "knob_disable_posix_kernel_aio=1"
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: "100G"
      podTemplate:
        spec:
          containers:
            - name: foundationdb
              resources:
                requests:
                  cpu: 1
                  memory: 1Gi
                limits:
                  cpu: 4
                  memory: 8Gi

First question: do these hardware specs give the FDB cluster enough resources to achieve the best performance with this configuration?

Then I exec into one of the test pods and create the test file RandomReadWrite.txt:

; This file is used by the Atlas Load Generator
testTitle=RandomReadWriteTest
    testName=ReadWrite
    testDuration=60.0
    transactionsPerSecond=10000
    writesPerTransactionA=1
    readsPerTransactionA=100
    writesPerTransactionB=10
    readsPerTransactionB=10
    ; Fraction of transactions that will be of type B
    alpha=0.1
    nodeCount=20000000
    valueBytes=1000
    ; average 600
    minValueBytes=200
    discardEdgeMeasurements=false
    warmingDelay=20.0
    timeout=300000.0
    databasePingDelay=300000.0

My second question: please explain each parameter in the test file and the test scenario in detail. I took the file from the FoundationDB repo, and I don't see any documentation explaining them.

And run the test (num_testers is 10, matching the 10 test pods created in the YAML spec):
fdbserver -r multitest -f RandomReadWrite.txt --num_testers 10

I got output from all 10 test instances; the outputs are all roughly equivalent, so I post just one here:

Set perpetual_storage_wiggle=0 ...
Set perpetual_storage_wiggle=0 Done.
Run test:RandomReadWriteTest start
setting up test (RandomReadWriteTest)...
running test (RandomReadWriteTest)...
RandomReadWriteTest complete
checking test (RandomReadWriteTest)...
fetching metrics (RandomReadWriteTest)...
Metric (0, 0): Measured Duration, 60.000000, 60
Metric (0, 1): Transactions/sec, 181.016667, 181
Metric (0, 2): Operations/sec, 18282.683333, 1.83e+04
Metric (0, 3): A Transactions, 10861.000000, 10861
Metric (0, 4): B Transactions, 0.000000, 0
Metric (0, 5): Retries, 131.000000, 131
Metric (0, 6): Mean load time (seconds), 958.151461, 958
Metric (0, 7): Read rows, 1086100.000000, 1.09e+06
Metric (0, 8): Write rows, 10861.000000, 1.09e+04
Metric (0, 9): Mean Latency (ms), 1361.722780, 1.36e+03
Metric (0, 10): Median Latency (ms, averaged), 1350.329638, 1.35e+03
Metric (0, 11): 90% Latency (ms, averaged), 1473.537207, 1.47e+03
Metric (0, 12): 98% Latency (ms, averaged), 1581.157446, 1.58e+03
Metric (0, 13): Max Latency (ms, averaged), 4137.708664, 4.14e+03
Metric (0, 14): Mean Row Read Latency (ms), 1166.177251, 1.17e+03
Metric (0, 15): Median Row Read Latency (ms, averaged), 1165.833235, 1.17e+03
Metric (0, 16): Max Row Read Latency (ms, averaged), 1622.269392, 1.62e+03
Metric (0, 17): Mean Total Read Latency (ms), 1341.413475, 1.34e+03
Metric (0, 18): Median Total Read Latency (ms, averaged), 1345.561028, 1.35e+03
Metric (0, 19): Max Total Latency (ms, averaged), 1622.269392, 1.62e+03
Metric (0, 20): Mean GRV Latency (ms), 1.212514, 1.21
Metric (0, 21): Median GRV Latency (ms, averaged), 1.117945, 1.12
Metric (0, 22): Max GRV Latency (ms, averaged), 9.887934, 9.89
Metric (0, 23): Mean Commit Latency (ms), 3.062925, 3.06
Metric (0, 24): Median Commit Latency (ms, averaged), 2.967358, 2.97
Metric (0, 25): Max Commit Latency (ms, averaged), 11.859655, 11.9
Metric (0, 26): Read rows/sec, 18101.666667, 1.81e+04
Metric (0, 27): Write rows/sec, 181.016667, 181
Metric (0, 28): Bytes read/sec, 11150626.666667, 1.12e+07
Metric (0, 29): Bytes written/sec, 111506.266667, 1.12e+05

My third question: can you explain this output?
As I understand it (I may be wrong), my cluster (8 strong instances) can only handle a maximum of
Metric (0, 1): Transactions/sec, 181.016667, 181
And what is the difference between
Metric (0, 1): Transactions/sec, 181.016667, 181
Metric (0, 2): Operations/sec, 18282.683333, 1.83e+04

I really appreciate your help/advice.

First question: do these hardware specs give the FDB cluster enough resources to achieve the best performance with this configuration?

That depends on the workload pattern you are expecting; there is no general answer to that.

My second question: please explain each parameter in the test file and the test scenario in detail. I took the file from the FoundationDB repo, and I don't see any documentation explaining them.

There should be some limited documentation in the code itself: https://github.com/apple/foundationdb/blob/main/fdbserver/workloads/ReadWrite.actor.cpp#L366.
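To save a lookup, here is an annotated copy of the test file. The comments are my own reading of the option names in ReadWrite.actor.cpp, not official documentation, so treat them as assumptions:

```ini
; Annotated copy of RandomReadWrite.txt -- comments are my interpretation
testTitle=RandomReadWriteTest        ; label used in the test log output
    testName=ReadWrite               ; selects the ReadWrite workload class
    testDuration=60.0                ; measured run length in seconds
    transactionsPerSecond=10000      ; target (offered) transaction rate
    writesPerTransactionA=1          ; writes in each type-A transaction
    readsPerTransactionA=100         ; reads in each type-A transaction
    writesPerTransactionB=10         ; writes in each type-B transaction
    readsPerTransactionB=10          ; reads in each type-B transaction
    alpha=0.1                        ; fraction of transactions that are type B
    nodeCount=20000000               ; number of keys in the test keyspace
    valueBytes=1000                  ; maximum value size in bytes
    minValueBytes=200                ; minimum value size in bytes
    discardEdgeMeasurements=false    ; keep samples from ramp-up/ramp-down
    warmingDelay=20.0                ; seconds of warm-up before measuring
    timeout=300000.0                 ; give up after this many seconds
    databasePingDelay=300000.0       ; interval for database liveness pings
```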

My third question: can you explain this output?
As I understand it (I may be wrong), my cluster (8 strong instances) can only handle a maximum of
Metric (0, 1): Transactions/sec, 181.016667, 181

For this specific test configuration, yes. You might want to run the test for longer than 60 seconds to get a better signal.
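To illustrate why longer runs help (a hypothetical simulation, not FDB code; the mean and jitter values are made up): if per-second throughput fluctuates around a fixed mean, the error of the measured average shrinks roughly with the square root of the run length:

```python
import random
import statistics

random.seed(42)

TRUE_TPS = 181.0   # assumed steady-state transactions/sec
NOISE_SD = 30.0    # assumed per-second jitter

def measured_tps(seconds):
    """Average throughput observed over a run of the given length."""
    samples = [random.gauss(TRUE_TPS, NOISE_SD) for _ in range(seconds)]
    return statistics.mean(samples)

# Repeat each run length many times and compare the spread of the estimates.
short_runs = [measured_tps(60) for _ in range(200)]
long_runs = [measured_tps(600) for _ in range(200)]

print(f"60s runs:  stdev of estimate = {statistics.stdev(short_runs):.2f}")
print(f"600s runs: stdev of estimate = {statistics.stdev(long_runs):.2f}")
```

With these assumed numbers the 600-second estimate comes out roughly sqrt(10) times tighter than the 60-second one, which is the sense in which a longer run gives a better signal.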

And what is the difference between
Metric (0, 1): Transactions/sec, 181.016667, 181
Metric (0, 2): Operations/sec, 18282.683333, 1.83e+04

The difference between transactions and operations is that a transaction can contain one or more operations. In the configuration above you specified:

writesPerTransactionA=1
readsPerTransactionA=100
writesPerTransactionB=10
readsPerTransactionB=10

So a transaction of type A will perform 101 operations and a transaction of type B will perform 20 operations, and based on alpha=0.1, 90% of the transactions will be of type A.

Btw.: Is there any reason you set knob_disable_posix_kernel_aio, or is that just copied from the example in the operator repo? If so, you will want to remove this knob; we only set it there because it's required when running FDB inside Docker for Mac (or something similar that uses a VM under the hood).

My cluster's purpose is a 90% read / 10% write workload.
I know this question is quite general, but I don't know how many processes/roles to configure for that cluster.
For example, should the cluster have 8 or 16 storage processes, 4 or 8 log processes, 2 or 4 commit proxies, ...?
How many cores and how much RAM for each pod (each pod runs only one process)?
How many processes for the whole cluster?

Why does running the test for longer give a better signal?
I assumed the maximum number of transactions per second the cluster can handle is a fixed number (+/- a small deviation).

And what is the difference when I run with --num_testers set to 1 vs. 10?
Does 10 mean 10 tests run in parallel? If so, should transactions/sec and operations/sec be multiplied by 10?
In my case:
10 x Metric (0, 1): Transactions/sec, 181.016667, 181 ??
10 x Metric (0, 2): Operations/sec, 18282.683333, 1.83e+04 ??

I removed it.