We are running a FoundationDB benchmark to better understand its performance and scaling limits. We are currently running a read-only workload where each request is a point read, issuing the following fdb calls: fdb_database_create_transaction(), fdb_transaction_get(), and fdb_transaction_commit(). However, as we increase the number of requests, the latency of getting a read version (GRV) keeps increasing to unacceptably high levels, and we cannot figure out how to add hardware to address this and scale further.
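For reference, each benchmark request boils down to roughly the following (a minimal sketch against the FDB C API; error handling, the retry loop, network thread setup, and the actual keys are omitted or placeholders):

```c
/* Sketch of one benchmark request: a single point read.
 * Assumes fdb_select_api_version() was called, the network thread is running,
 * and `db` is an open FDBDatabase*. Error handling and retries are omitted. */
#define FDB_API_VERSION 610 /* adjust to the API version of the build */
#include <foundationdb/fdb_c.h>
#include <stdint.h>

fdb_error_t point_read(FDBDatabase* db, const uint8_t* key, int key_len) {
    FDBTransaction* tr = NULL;
    fdb_error_t err = fdb_database_create_transaction(db, &tr);
    if (err) return err;

    /* The first read on a transaction implicitly acquires a read version (GRV),
     * which the client batches before sending to a proxy; this is the latency
     * that degrades in the traces below. */
    FDBFuture* get_f = fdb_transaction_get(tr, key, key_len, 0 /* snapshot */);
    err = fdb_future_block_until_ready(get_f);
    if (!err) err = fdb_future_get_error(get_f);

    fdb_bool_t present;
    const uint8_t* value;
    int value_len;
    if (!err) err = fdb_future_get_value(get_f, &present, &value, &value_len);
    fdb_future_destroy(get_f);
    if (err) { fdb_transaction_destroy(tr); return err; }

    /* Committing a read-only transaction should be a client-side no-op, but the
     * benchmark issues the commit anyway, matching the workload described above. */
    FDBFuture* commit_f = fdb_transaction_commit(tr);
    err = fdb_future_block_until_ready(commit_f);
    if (!err) err = fdb_future_get_error(commit_f);
    fdb_future_destroy(commit_f);

    fdb_transaction_destroy(tr);
    return err;
}
```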
We were using 320 clients, and as we grew the request rate from 100k to 1.4 million requests per second, we observed read latencies degrade severely, e.g. the median went from 2ms to 25ms. From the fdb XML trace files, it seems we were entirely limited by GRV and not by reads from the storage servers. This is what the client TransactionMetrics look like at 1.42 million qps with 8 proxies and 320 clients:
<Event Severity="10" Time="1549044358.928279" Type="TransactionMetrics" ID="0000000000000000" ReadVersions="284189625" LogicalUncachedReads="284189625" PhysicalReadRequests="284189584" CommittedMutations="0" CommittedMutationBytes="0" CommitStarted="0" CommitCompleted="0" TooOld="0" FutureVersions="0" NotCommitted="0" MaybeCommitted="0" ResourceConstrained="0" TransactionBatcherReadVersionRequestsSent="14668992" TransactionBatcherMaxBatchSizeReached="13170029" MeanLatency="0" MedianLatency="0" Latency90="0" Latency98="0" MaxLatency="0" MeanRowReadLatency="0.00274302" MedianRowReadLatency="0.00192499" MaxRowReadLatency="0.0401535" MeanGRVLatency="0.0400555" MedianGRVLatency="0.0242767" MaxGRVLatency="0.193665" MeanCommitLatency="0" MedianCommitLatency="0" MaxCommitLatency="0" MeanMutationsPerCommit="0" MedianMutationsPerCommit="0" MaxMutationsPerCommit="0" MeanBytesPerCommit="0" MedianBytesPerCommit="0" MaxBytesPerCommit="0" GRVBatchTimeMean="0.00431965" GRVBatchTimeMedian="0.005" GRVBatchTimeMax="0.005" GRVBatchReplyLatenciesMean="0.0384088" GRVBatchReplyLatenciesMedian="0.0205247" GRVBatchReplyLatenciesMax="0.193665" Machine="10.58.120.56:3258" LogGroup="default" />
Note that we have added more metrics to the fdbclient to better understand the client batching behavior. GRVBatchTime is the client's target batch time, which seems to max out at 5ms, while GRVBatchReply is the time to get a response from the proxies, whose median was 20ms and mean was 38ms. MeanGRVLatency is about 40ms (roughly GRVBatchTimeMean / 2 + GRVBatchReplyLatenciesMean, as expected). We tried varying the number of proxies, but this did not seem to change the GRV latencies.
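As a back-of-envelope check on the numbers above (our own arithmetic, assuming a request waits on average half the batch window before the batched GRV request is sent): 0.00431965 / 2 + 0.0384088 ≈ 0.0406 s, which lines up with the reported MeanGRVLatency of 0.0401 s.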
We reduced the number of clients from 320 to 160, and latencies went back to normal at 1.4 million total rps:
<Event Severity="10" Time="1549060475.756485" Type="TransactionMetrics" ID="0000000000000000" ReadVersions="3919839" LogicalUncachedReads="3919839" PhysicalReadRequests="3919725" CommittedMutations="0" CommittedMutationBytes="0" CommitStarted="0" CommitCompleted="0" TooOld="0" FutureVersions="0" NotCommitted="0" MaybeCommitted="0" ResourceConstrained="0" TransactionBatcherReadVersionRequestsSent="216126" TransactionBatcherMaxBatchSizeReached="121876" MeanLatency="0" MedianLatency="0" Latency90="0" Latency98="0" MaxLatency="0" MeanRowReadLatency="0.00101261" MedianRowReadLatency="0.000856161" MaxRowReadLatency="0.00865197" MeanGRVLatency="0.00417926" MedianGRVLatency="0.00385022" MaxGRVLatency="0.0224671" MeanCommitLatency="0" MedianCommitLatency="0" MaxCommitLatency="0" MeanMutationsPerCommit="0" MedianMutationsPerCommit="0" MaxMutationsPerCommit="0" MeanBytesPerCommit="0" MedianBytesPerCommit="0" MaxBytesPerCommit="0" GRVBatchTimeMean="0.00187918" GRVBatchTimeMedian="0.00172091" GRVBatchTimeMax="0.005" GRVBatchReplyLatenciesMean="0.00383643" GRVBatchReplyLatenciesMedian="0.00326872" GRVBatchReplyLatenciesMax="0.0223789" Machine="10.58.121.24:25861" LogGroup="default" />
However, as we scaled the rps per client further, we hit another inflection point at 1.8 million qps. Here is an fdb client TransactionMetrics sample:
<Event Severity="10" Time="1549063033.139217" Type="TransactionMetrics" ID="0000000000000000" ReadVersions="8814201" LogicalUncachedReads="8814201" PhysicalReadRequests="8813119" CommittedMutations="0" CommittedMutationBytes="0" CommitStarted="0" CommitCompleted="0" TooOld="0" FutureVersions="0" NotCommitted="0" MaybeCommitted="0" ResourceConstrained="0" TransactionBatcherReadVersionRequestsSent="450788" TransactionBatcherMaxBatchSizeReached="399161" MeanLatency="0" MedianLatency="0" Latency90="0" Latency98="0" MaxLatency="0" MeanRowReadLatency="0.00156971" MedianRowReadLatency="0.00153208" MaxRowReadLatency="0.0099442" MeanGRVLatency="0.0407064" MedianGRVLatency="0.0392284" MaxGRVLatency="0.122648" MeanCommitLatency="0" MedianCommitLatency="0" MaxCommitLatency="0" MeanMutationsPerCommit="0" MedianMutationsPerCommit="0" MaxMutationsPerCommit="0" MeanBytesPerCommit="0" MedianBytesPerCommit="0" MaxBytesPerCommit="0" GRVBatchTimeMean="0.00440744" GRVBatchTimeMedian="0.005" GRVBatchTimeMax="0.005" GRVBatchReplyLatenciesMean="0.0411822" GRVBatchReplyLatenciesMedian="0.0399389" GRVBatchReplyLatenciesMax="0.123384" Machine="10.58.120.56:2754" LogGroup="default" />
Batch times are maxing out at 5ms (which seems to be a configurable knob), but MeanGRVLatency is over 40ms. We are also reaching the max batch size (MAX_BATCH_SIZE, configured to 20 in fdbclient/Knobs.cpp), which forces the client to send GRV requests to the proxies more often. Again, we tried varying the number of proxies (2, 4, 8, 12, 16, 20), but it did not seem to help, which somewhat makes sense given that every proxy is involved in computing the live read version. Ratekeeper does not seem to be kicking in. Given that row read latencies are low and storage server CPU and IO seem underutilized, we don't believe we are hitting any limits there. As far as we understand, the number of tlogs and resolvers should not matter for computing the live read version, and we are running a read-only workload on a cluster that has not taken writes in a few days.
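A rough back-of-envelope from the trace above (our own arithmetic): ReadVersions / TransactionBatcherReadVersionRequestsSent ≈ 8814201 / 450788 ≈ 19.6, so batches are essentially full at the max size of 20. At 1.8 million qps spread over 160 clients, that is about 11,250 reads/sec per client, or roughly 560 GRV requests/sec per client and ~90,000 GRV requests/sec hitting the proxies in aggregate.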
A couple of questions:
- What is the upper bound on the number of FoundationDB clients that can be used, and why?
- Any idea how we can scale the number of reads further by reconfiguring the cluster and throwing hardware at the problem, without changing any knobs? Or is the upper bound on transactions independent of the number of stateless servers?
- Are there any downsides to increasing MAX_BATCH_SIZE in fdbclient/Knobs.cpp to something larger than 20? Any reason why this value was chosen? (A sketch of how we would override it from the client side follows below.)
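For concreteness, this is roughly how we would experiment with a larger batch size from the client side. It is only a sketch: it assumes the knob network option also applies to client knobs such as MAX_BATCH_SIZE and that the lowercase knob name is accepted, neither of which we have verified, and the illustrative value of 100 is not a recommendation.

```c
/* Hypothetical knob override at client startup (unverified assumptions: the
 * "knob" network option accepts client knobs, and lowercase names are accepted).
 * Must be called after fdb_select_api_version() and before fdb_setup_network(). */
#define FDB_API_VERSION 610 /* adjust to the API version of the build */
#include <foundationdb/fdb_c.h>
#include <string.h>

fdb_error_t try_larger_grv_batch(void) {
    const char* knob = "max_batch_size=100"; /* illustrative value only */
    return fdb_network_set_option(FDB_NET_OPTION_KNOB,
                                  (const uint8_t*)knob,
                                  (int)strlen(knob));
}
```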
Setup:
160 clients
2 - 20 proxies, 2 resolvers, 16 tlogs - 12GB RAM each. Intel® Xeon® CPU E5-2620 v4 @ 2.10GHz.
416 storage servers total - 16GB RAM each. Intel® Xeon® CPU E5-2630 v4 @ 2.20GHz.
All servers in the cluster have round-trip times in the range of 0.1-1ms.
We use a development version of FDB at commit f418a85390734373b74f022ae6a61bed23eb92ee.
Why we believe Ratekeeper is not kicking in (from status json):
"performance_limited_by" : {
    "description" : "The database is not being saturated by the workload.",
    "name" : "workload",
    "reason_id" : 2
},