Hi all,
I created an FDB cluster with the Kubernetes operator on AWS. The K8s cluster spec is:
8 c5a.4xlarge instances (16 cores, 33 GB RAM each) - 128 cores and 264 GB RAM in total.
Here is my k8s FDB cluster spec:
apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  labels:
    cluster-group: foundationdb-cluster
  name: foundationdb-cluster
spec:
  version: 7.1.25
  faultDomain:
    key: foundationdb.org/none
  processCounts:
    cluster_controller: 1
    stateless: 8
    log: 4
    storage: 8
    test: 10
  databaseConfiguration:
    redundancy_mode: "double"
    commit_proxies: 4
    grv_proxies: 2
  processes:
    general:
      customParameters:
        - "knob_disable_posix_kernel_aio=1"
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: "100G"
      podTemplate:
        spec:
          containers:
            - name: foundationdb
              resources:
                requests:
                  cpu: 1
                  memory: 1Gi
                limits:
                  cpu: 4
                  memory: 8Gi
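As a rough sanity check on my side (my own arithmetic, assuming the "general" resource limits above apply to every process pod):

```python
# Rough capacity check: total pod resource limits vs. cluster capacity.
# Assumption: the "general" limits (4 CPU / 8Gi) apply to every process pod.
process_counts = {
    "cluster_controller": 1,
    "stateless": 8,
    "log": 4,
    "storage": 8,
    "test": 10,
}
total_pods = sum(process_counts.values())
cpu_at_limit = total_pods * 4   # CPU limit per pod
mem_at_limit = total_pods * 8   # Gi memory limit per pod
print(total_pods, cpu_at_limit, mem_at_limit)
# 31 pods -> 124 CPU and 248 Gi at the limit, against 128 cores / 264 GB total
```

So at the limits the pods could consume almost the whole cluster, which is part of why I am unsure the sizing is right.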
First question: is this cluster hardware enough for an FDB cluster with this config to achieve the best performance?
Then I exec into one test pod and create the test file RandomReadWrite.txt:
; This file is used by the Atlas Load Generator
testTitle=RandomReadWriteTest
testName=ReadWrite
testDuration=60.0
transactionsPerSecond=10000
writesPerTransactionA=1
readsPerTransactionA=100
writesPerTransactionB=10
readsPerTransactionB=10
; Fraction of transactions that will be of type B
alpha=0.1
nodeCount=20000000
valueBytes=1000
; average 600
minValueBytes=200
discardEdgeMeasurements=false
warmingDelay=20.0
timeout=300000.0
databasePingDelay=300000.0
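For reference, my reading of the transaction-mix parameters (this may well be wrong, which is exactly why I am asking):

```python
# My reading of the mix parameters in RandomReadWrite.txt (may be wrong).
# A-transactions: 100 reads + 1 write; B-transactions: 10 reads + 10 writes;
# alpha = fraction of transactions that are type B.
tps = 10000   # transactionsPerSecond target
alpha = 0.1
reads_per_sec = tps * ((1 - alpha) * 100 + alpha * 10)
writes_per_sec = tps * ((1 - alpha) * 1 + alpha * 10)
print(int(reads_per_sec), int(writes_per_sec))
# target rates if the cluster kept up: 910000 reads/sec, 19000 writes/sec
```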
My second question: could you explain every parameter in this test file, and the test scenario, in detail? I took the file from the FoundationDB repo, and I can't find any documentation for these parameters.
And I run the test (num_testers is 10, matching the 10 test pods created in the YAML spec):
fdbserver -r multitest -f RandomReadWrite.txt --num_testers 10
I got output from all 10 test instances; they are roughly equivalent, so I post just one here:
Set perpetual_storage_wiggle=0 ...
Set perpetual_storage_wiggle=0 Done.
Run test:RandomReadWriteTest start
setting up test (RandomReadWriteTest)...
running test (RandomReadWriteTest)...
RandomReadWriteTest complete
checking test (RandomReadWriteTest)...
fetching metrics (RandomReadWriteTest)...
Metric (0, 0): Measured Duration, 60.000000, 60
Metric (0, 1): Transactions/sec, 181.016667, 181
Metric (0, 2): Operations/sec, 18282.683333, 1.83e+04
Metric (0, 3): A Transactions, 10861.000000, 10861
Metric (0, 4): B Transactions, 0.000000, 0
Metric (0, 5): Retries, 131.000000, 131
Metric (0, 6): Mean load time (seconds), 958.151461, 958
Metric (0, 7): Read rows, 1086100.000000, 1.09e+06
Metric (0, 8): Write rows, 10861.000000, 1.09e+04
Metric (0, 9): Mean Latency (ms), 1361.722780, 1.36e+03
Metric (0, 10): Median Latency (ms, averaged), 1350.329638, 1.35e+03
Metric (0, 11): 90% Latency (ms, averaged), 1473.537207, 1.47e+03
Metric (0, 12): 98% Latency (ms, averaged), 1581.157446, 1.58e+03
Metric (0, 13): Max Latency (ms, averaged), 4137.708664, 4.14e+03
Metric (0, 14): Mean Row Read Latency (ms), 1166.177251, 1.17e+03
Metric (0, 15): Median Row Read Latency (ms, averaged), 1165.833235, 1.17e+03
Metric (0, 16): Max Row Read Latency (ms, averaged), 1622.269392, 1.62e+03
Metric (0, 17): Mean Total Read Latency (ms), 1341.413475, 1.34e+03
Metric (0, 18): Median Total Read Latency (ms, averaged), 1345.561028, 1.35e+03
Metric (0, 19): Max Total Latency (ms, averaged), 1622.269392, 1.62e+03
Metric (0, 20): Mean GRV Latency (ms), 1.212514, 1.21
Metric (0, 21): Median GRV Latency (ms, averaged), 1.117945, 1.12
Metric (0, 22): Max GRV Latency (ms, averaged), 9.887934, 9.89
Metric (0, 23): Mean Commit Latency (ms), 3.062925, 3.06
Metric (0, 24): Median Commit Latency (ms, averaged), 2.967358, 2.97
Metric (0, 25): Max Commit Latency (ms, averaged), 11.859655, 11.9
Metric (0, 26): Read rows/sec, 18101.666667, 1.81e+04
Metric (0, 27): Write rows/sec, 181.016667, 181
Metric (0, 28): Bytes read/sec, 11150626.666667, 1.12e+07
Metric (0, 29): Bytes written/sec, 111506.266667, 1.12e+05
My third question: can you explain this output? As I understand it (I may be wrong), my cluster (8 strong instances) can only handle a maximum of:
Metric (0, 1): Transactions/sec, 181.016667, 181
And what is the difference between:
Metric (0, 1): Transactions/sec, 181.016667, 181
Metric (0, 2): Operations/sec, 18282.683333, 1.83e+04
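One relation I can see in the numbers myself (please correct me if this reading is wrong):

```python
# Checking whether Operations/sec is just Transactions/sec multiplied by the
# per-transaction operation count of an A-transaction (100 reads + 1 write).
tx_per_sec = 181.016667
ops_per_sec = 18282.683333
print(round(ops_per_sec / tx_per_sec))
# 101, i.e. readsPerTransactionA + writesPerTransactionA
```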
I really appreciate your help/advice.