How to scale FoundationDB reads

We are running a FoundationDB benchmark to better understand its performance and scaling limits. We are currently running a read-only workload where each request is a point read, doing the following fdb calls: fdb_database_create_transaction(), fdb_transaction_get() and fdb_transaction_commit(). However, as we increase the number of requests, the latencies for getting a read version (GRV) keep increasing to unacceptably high levels, and we can't figure out how to add hardware to address this and scale further.
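For reference, each request is roughly the following, sketched here with the Python bindings rather than the C API (the key is just a placeholder):

```python
import fdb

fdb.api_version(600)  # whatever API version matches the client library
db = fdb.open()

tr = db.create_transaction()    # fdb_database_create_transaction()
value = tr[b"some/key"].wait()  # fdb_transaction_get(); the first read triggers the implicit GRV
tr.commit().wait()              # fdb_transaction_commit(); a no-op for a read-only transaction
```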

We were using 320 clients, and as we grew the number of requests from 100k to 1.4 million, we observed read latencies degrade severely, e.g. the median went from 2ms to 25ms. Looking at the fdb XML trace files, it seems we were entirely limited by GRV and not by reading from the storage servers. This is what the client TransactionMetrics look like at 1.42 million qps with 8 proxies and 320 clients:
<Event Severity="10" Time="1549044358.928279" Type="TransactionMetrics" ID="0000000000000000" ReadVersions="284189625" LogicalUncachedReads="284189625" PhysicalReadRequests="284189584" CommittedMutations="0" CommittedMutationBytes="0" CommitStarted="0" CommitCompleted="0" TooOld="0" FutureVersions="0" NotCommitted="0" MaybeCommitted="0" ResourceConstrained="0" TransactionBatcherReadVersionRequestsSent="14668992" TransactionBatcherMaxBatchSizeReached="13170029" MeanLatency="0" MedianLatency="0" Latency90="0" Latency98="0" MaxLatency="0" MeanRowReadLatency="0.00274302" MedianRowReadLatency="0.00192499" MaxRowReadLatency="0.0401535" MeanGRVLatency="0.0400555" MedianGRVLatency="0.0242767" MaxGRVLatency="0.193665" MeanCommitLatency="0" MedianCommitLatency="0" MaxCommitLatency="0" MeanMutationsPerCommit="0" MedianMutationsPerCommit="0" MaxMutationsPerCommit="0" MeanBytesPerCommit="0" MedianBytesPerCommit="0" MaxBytesPerCommit="0" GRVBatchTimeMean="0.00431965" GRVBatchTimeMedian="0.005" GRVBatchTimeMax="0.005" GRVBatchReplyLatenciesMean="0.0384088" GRVBatchReplyLatenciesMedian="0.0205247" GRVBatchReplyLatenciesMax="0.193665" Machine="10.58.120.56:3258" LogGroup="default" />

Note that we have added more metrics to the fdbclient to better understand the client batching behavior. GRVBatchTime is the client's target batch time, which seems to max out at 5ms, while GRVBatchReply is the time to get a response from the proxies, whose median was 20ms and mean was 38ms. MeanGRVLatency is about 40ms (roughly GRVBatchTimeMean / 2 + GRVBatchReplyLatenciesMean, as expected). We tried varying the number of proxies, but this did not seem to change the GRV latencies.
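As a quick sanity check of that relationship, plugging in the numbers from the trace above:

```python
grv_batch_time_mean = 0.00431965    # GRVBatchTimeMean from the event above
grv_batch_reply_mean = 0.0384088    # GRVBatchReplyLatenciesMean from the event above
print(grv_batch_time_mean / 2 + grv_batch_reply_mean)  # ~0.0406, vs. MeanGRVLatency of ~0.0401
```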

We reduced the number of clients from 320 to 160, and latencies went back to normal at 1.4 million total rps.
<Event Severity="10" Time="1549060475.756485" Type="TransactionMetrics" ID="0000000000000000" ReadVersions="3919839" LogicalUncachedReads="3919839" PhysicalReadRequests="3919725" CommittedMutations="0" CommittedMutationBytes="0" CommitStarted="0" CommitCompleted="0" TooOld="0" FutureVersions="0" NotCommitted="0" MaybeCommitted="0" ResourceConstrained="0" TransactionBatcherReadVersionRequestsSent="216126" TransactionBatcherMaxBatchSizeReached="121876" MeanLatency="0" MedianLatency="0" Latency90="0" Latency98="0" MaxLatency="0" MeanRowReadLatency="0.00101261" MedianRowReadLatency="0.000856161" MaxRowReadLatency="0.00865197" MeanGRVLatency="0.00417926" MedianGRVLatency="0.00385022" MaxGRVLatency="0.0224671" MeanCommitLatency="0" MedianCommitLatency="0" MaxCommitLatency="0" MeanMutationsPerCommit="0" MedianMutationsPerCommit="0" MaxMutationsPerCommit="0" MeanBytesPerCommit="0" MedianBytesPerCommit="0" MaxBytesPerCommit="0" GRVBatchTimeMean="0.00187918" GRVBatchTimeMedian="0.00172091" GRVBatchTimeMax="0.005" GRVBatchReplyLatenciesMean="0.00383643" GRVBatchReplyLatenciesMedian="0.00326872" GRVBatchReplyLatenciesMax="0.0223789" Machine="10.58.121.24:25861" LogGroup="default" />

However, as we scaled the rps per client further, we hit another inflection point at 1.8 million qps. Here is an fdb client TransactionMetrics sample:
<Event Severity="10" Time="1549063033.139217" Type="TransactionMetrics" ID="0000000000000000" ReadVersions="8814201" LogicalUncachedReads="8814201" PhysicalReadRequests="8813119" CommittedMutations="0" CommittedMutationBytes="0" CommitStarted="0" CommitCompleted="0" TooOld="0" FutureVersions="0" NotCommitted="0" MaybeCommitted="0" ResourceConstrained="0" TransactionBatcherReadVersionRequestsSent="450788" TransactionBatcherMaxBatchSizeReached="399161" MeanLatency="0" MedianLatency="0" Latency90="0" Latency98="0" MaxLatency="0" MeanRowReadLatency="0.00156971" MedianRowReadLatency="0.00153208" MaxRowReadLatency="0.0099442" MeanGRVLatency="0.0407064" MedianGRVLatency="0.0392284" MaxGRVLatency="0.122648" MeanCommitLatency="0" MedianCommitLatency="0" MaxCommitLatency="0" MeanMutationsPerCommit="0" MedianMutationsPerCommit="0" MaxMutationsPerCommit="0" MeanBytesPerCommit="0" MedianBytesPerCommit="0" MaxBytesPerCommit="0" GRVBatchTimeMean="0.00440744" GRVBatchTimeMedian="0.005" GRVBatchTimeMax="0.005" GRVBatchReplyLatenciesMean="0.0411822" GRVBatchReplyLatenciesMedian="0.0399389" GRVBatchReplyLatenciesMax="0.123384" Machine="10.58.120.56:2754" LogGroup="default" />

Batch times are maxing out at 5ms (which seems to be a configurable knob), but MeanGRVLatency is over 40ms. We are reaching the max batch size (configured to 20 in fdbclient/Knobs.cpp), which forces the client to send requests to the proxies more often. Again, we tried varying the number of proxies (2, 4, 8, 12, 16, 20), but it did not seem to help (which somewhat makes sense given that every single proxy is involved in computing the live read version). The rate limiter does not seem to be kicking in. Given that read latencies seem low and storage server CPU and IO seem underutilized, we don't believe we are hitting any limits there. As far as we understand, the number of tlogs and resolvers should not matter for computing the live read version, and we are running a read-only workload on a cluster that has not done writes in a few days.
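Our rough mental model of the client-side batching (a simplification of our own, not the actual fdbclient code): a client flushes a GRV batch to the proxies either when the batch window elapses or when MAX_BATCH_SIZE transactions are waiting, whichever comes first. Once the per-client rate crosses MAX_BATCH_SIZE per window, the request rate to the proxies grows linearly with load:

```python
MAX_BATCH_SIZE = 20   # fdbclient/Knobs.cpp default mentioned above
BATCH_WINDOW = 0.005  # seconds; the 5ms cap we observe as GRVBatchTimeMax

def grv_requests_per_second(txn_starts_per_second):
    """Estimated GRV requests a single client sends to the proxies per second."""
    if txn_starts_per_second * BATCH_WINDOW <= MAX_BATCH_SIZE:
        return 1.0 / BATCH_WINDOW                    # at most one request per window
    return txn_starts_per_second / MAX_BATCH_SIZE    # every batch fills up before the window ends

# ~11.25k transaction starts/s per client (1.8M qps / 160 clients) gives ~560
# requests/s per client, instead of the ~200/s a full 5ms window would allow.
print(grv_requests_per_second(1_800_000 / 160))
```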

A couple of questions:

  1. What is the upper bound of foundationdb clients that can be used and why?
  2. Any idea how we can scale the number of reads further by reconfiguring the cluster and throwing hardware at the problem without changing any knobs? Or is the upper bound of transactions independent of number of stateless servers?
  3. Are there any downsides of increasing MAX_BATCH_SIZE in fdbclient/Knobs.cpp to something larger than 20? Any reason why this value was chosen?

Setup:
160 clients
2 - 20 proxies, 2 resolvers, 16 tlogs - 12GB RAM each. Intel® Xeon® CPU E5-2620 v4 @ 2.10GHz.
416 storage servers total - 16GB RAM each. Intel® Xeon® CPU E5-2630 v4 @ 2.20GHz.
All servers in the cluster have round trip time in the range of 0.1-1ms.

We use a development version of FDB at commit f418a85390734373b74f022ae6a61bed23eb92ee.

Why we believe ratekeeper is not kicking in:
"performance_limited_by" : {
  "description" : "The database is not being saturated by the workload.",
  "name" : "workload",
  "reason_id" : 2
},


Do you absolutely need each individual point read to get a new read version from the cluster? In other words, do you require that each point read be able to see the latest committed version as of the start of the point read's transaction?

If the answer is no, then getting more read ops/sec is simple. Each client can save a recent read version obtained from transaction.getReadVersion() and reuse it in new transactions for some short time period (150ms?) by calling transaction.setVersion(). This avoids the GRV latency entirely for the transactions that use the cached version.
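A minimal sketch of that idea with the Python bindings (not thread-safe; the helper names and the 150ms window are just placeholders):

```python
import time
import fdb

fdb.api_version(600)
db = fdb.open()

CACHE_WINDOW = 0.150                   # seconds; the "150ms?" suggested above
_cache = {"version": None, "at": 0.0}

def cached_point_read(key):
    tr = db.create_transaction()
    now = time.monotonic()
    fresh = _cache["version"] is None or now - _cache["at"] >= CACHE_WINDOW
    if not fresh:
        # Reuse the recent read version: this transaction skips the GRV round trip.
        tr.set_read_version(_cache["version"])
    value = tr[key].wait()
    if fresh:
        # Remember the version this transaction obtained, for the next window.
        _cache["version"] = tr.get_read_version().wait()
        _cache["at"] = now
    return value                       # read-only, so no commit needed
```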

Obviously you could also batch multiple reads into the same transaction, but I did not suggest this because I assume (in a real application) that some of these transactions are going to decide to do writes based on what they've read, so it would be messy to have unrelated reads in the same transaction.

This would usually be my first suggestion when trying to measure how many reads a cluster can do and the latency of reads – namely to do multiple reads per transaction. If it really is the case that you need one read per transaction, then you could consider other options.
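For example, a sketch with the Python bindings of amortizing one GRV over many point reads (the keys are placeholders):

```python
import fdb

fdb.api_version(600)
db = fdb.open()

@fdb.transactional
def read_many(tr, keys):
    # One implicit GRV for the whole transaction; start every point read first
    # so they go to the storage servers in parallel, then collect the results.
    futures = [tr[k] for k in keys]
    return [f.wait() for f in futures]

values = read_many(db, [b"key/0001", b"key/0002", b"key/0003"])
```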

Caching read versions works too, but then you are complicating matters a bit with having to store read versions. If you are doing any writes, then you also must be ok with reading stale data, potentially increasing the rate of conflicts, and maybe also getting a suboptimal ratekeeper experience.

I’m assuming that what you’re doing here is intentionally trying to determine what’s the maximum number of linearizable point reads that you can do. Unlike many other databases, the first point read in a transaction is significantly more expensive than the second due to the implicit GRV call, and when there isn’t a second call, it’s just a benchmark of GRVs as you’ve discovered. In that case, you would need to try to get as much GRV throughput as possible, but admittedly, FDB isn’t great at this. If this workload is an accurate model of your production workload, then this is normally when I recommend anything that does partitioned quorums, e.g. Cassandra, as a better fit.

If what you’re trying to do is benchmark “my transactions do a lot of point reads, how does the throughput on that look?”, then just doing many point reads per transaction (as AJ recommends), or re-using read versions (as Steve recommends) are great directions to go.

EDIT: It’s possibly worth noting that on read-only transactions, fdb_transaction_commit is a no-op.

Upper bound of clients? Or upper bound of clients that are sitting in a tight loop issuing GetReadVersion requests?

I think the overall upper bound on clients is somewhere around 10k ±5k right now, at which point regardless of your configuration, one of {coordinators, cluster controller} will fall over due to trying to heartbeat so many connections. (There is actually a PR open right now that should improve this.)

For the latter case, I don’t think we have a performance test that does that, which you might have discovered.

If I wanted to benchmarket GRV throughput, I’d…

  • Find a machine with a fat NIC and a lot of cores
  • Run all my proxies on one machine to remove the network latency between them.
  • Set CAUSAL_READ_RISKY on all my transactions.

Each proxy needs to talk to all other proxies to calculate the read version (and a few TLogs if you don’t let us break causal consistency under faults via CAUSAL_READ_RISKY), so you’d want to minimize their latency. This ‘everyone talks to everyone’ behavior does mean GRV throughput doesn’t scale up linearly with proxies. You’re very unlikely to max out the bandwidth on one machine with just GRV requests, and you’d have to start from 1 and increment until you find your optimal number to handle as many requests as possible but not cause too much extra communication.
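For the third bullet, this is roughly what it looks like per transaction with the Python bindings (a sketch; the key is a placeholder):

```python
import fdb

fdb.api_version(600)
db = fdb.open()

tr = db.create_transaction()
tr.options.set_causal_read_risky()  # proxy skips confirming with the TLogs for this GRV
value = tr[b"some/key"].wait()      # read-only, so no commit needed
```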

This would not be an advisable setup to run in production though. It’s the exact opposite of how you’d try to set up a cluster for a write bandwidth test. Every commit you do would need to flow into and out of that one machine. Commits, having actual data, will stress your network bandwidth much sooner. We generally don’t recommend using CAUSAL_READ_RISKY by default, and not using it would require sending all requests across the network to all your transaction logs, likely defeating some of the benefit of gathering all your proxies on one machine.

I can’t think of any. Increasing that sounds like a solid improvement to your benchmark.

Adding knob modifications to the benchmarketing above, just crank up all the batching done on both the client and the proxies.

Thanks all for the replies! Yes, we are doing this intentionally, and we are trying to understand the scaling limitations of a single FoundationDB cluster. As discussed, there is an inherent maximum number of linearizable point reads you can do regardless of the size of the cluster.

We could do some transaction reuse on top of what fdbclient already does, but our thinking was that the fdbclient/proxy should already have robust logic for tuning the batch size, so we dug a bit deeper to try to understand how it works.

We tried increasing MAX_BATCH_SIZE in fdbclient/Knobs.cpp from 20 to 100 to see if this would have any noticeable effect. It helped, but only to a limited extent: GRV latencies collapse again at about 2 million rps. We tried uncommenting the code here (modified to log only once every 5 seconds): https://github.com/apple/foundationdb/blob/master/fdbserver/MasterProxyServer.actor.cpp#L1133 and, from the proxy's perspective, there seems to be no batching happening. It is almost always doing a new live read version computation for a single request from the client (the queue size is always 0 or 1). The proxy believes its processing time is 0.5ms, while the client observes GRV batch reply times of tens of ms. Here is a sample from the proxy:
<Event Severity="10" Time="1549476776.412145" Type="GRVBatch" ID="6c6259b8ac3827df" Elapsed="0.00059104" NTransactionToStart="10020" TransactionRate="1.60714e+10" TransactionQueueSize="0" NumTransactionsStarted="220" NumSystemTransactionsStarted="0" NumNonSystemTransactionsStarted="220" TransactionBudget="20" LastLeftToStart="9900" TransactionCount="3971886172" TxnStartIn="206636" TxnStartOut="206745" TxtStartBatch="1521.97" GRVBatchTime="0.000596639" TransactionQueueEmptyCount="70921627" TransactionQueueNonEmptyCount="336" Machine="10.50.46.151:5000" LogGroup="default" Roles="MP" />

At the same time clients observe very high GRV batch reply latencies.
<Event Severity="10" Time="1549476963.723564" Type="TransactionMetrics" ID="0000000000000000" ReadVersions="5721711" LogicalUncachedReads="5721711" PhysicalReadRequests="5721587" CommittedMutations="0" CommittedMutationBytes="0" CommitStarted="0" CommitCompleted="0" TooOld="0" FutureVersions="0" NotCommitted="0" MaybeCommitted="0" ResourceConstrained="0" TransactionBatcherReadVersionRequestsSent="67743" TransactionBatcherMaxBatchSizeReached="40495" MeanLatency="0" MedianLatency="0" Latency90="0" Latency98="0" MaxLatency="0" MeanRowReadLatency="0.00385984" MedianRowReadLatency="0.00353456" MaxRowReadLatency="0.0199761" MeanGRVLatency="0.0749949" MedianGRVLatency="0.0569313" MaxGRVLatency="0.222431" MeanCommitLatency="0" MedianCommitLatency="0" MaxCommitLatency="0" MeanMutationsPerCommit="0" MedianMutationsPerCommit="0" MaxMutationsPerCommit="0" MeanBytesPerCommit="0" MedianBytesPerCommit="0" MaxBytesPerCommit="0" GRVBatchTimeMean="0.00454283" GRVBatchTimeMedian="0.005" GRVBatchTimeMax="0.005" GRVBatchReplyLatenciesMean="0.0696152" GRVBatchReplyLatenciesMedian="0.0476301" GRVBatchReplyLatenciesMax="0.222431" Machine="10.58.252.48:30800" LogGroup="default" />

Any idea where the discrepancy might come from? Could this be somewhere in Flow or the network layer where responses are getting queued or delayed?

Here are some other proxy metrics for reference:
<Event Severity="10" Time="1549476736.334286" Type="NetworkMetrics" ID="0000000000000000" CantSleep="0" WontSleep="6" Yields="6" YieldCalls="18" YieldCallsTrue="0" SlowTaskSignals="0" YieldBigStack="0" RunLoopIterations="107439" TimersExecuted="189003" TasksExecuted="555124" ASIOEventsProcessed="316113" ReadCalls="176174" WriteCalls="179995" ReadProbes="176174" WriteProbes="0" PacketsRead="182697" PacketsGenerated="182693" WouldBlock="0" SlowTask1M="2" SlowTask2M="2" SlowTask4M="1" S2Pri1="2.50657e-07" S2Pri2050="2.50657e-07" S2Pri3050="2.50657e-07" S2Pri4050="2.43155e-07" S2Pri4950="2.43155e-07" S2Pri5050="2.42905e-07" S2Pri7050="4.6725e-08" S2Pri8050="4.6725e-08" Machine="10.50.46.151:5000" LogGroup="default" Roles="MP" /> <Event Severity="10" Time="1549476736.334286" Type="ProcessMetrics" ID="0000000000000000" Elapsed="5.00004" CPUSeconds="3.044" MainThreadCPUSeconds="3.044" UptimeSeconds="63983.4" Memory="10384719872" ResidentMemory="10271088640" UnusedAllocatedMemory="7503478784" MbpsSent="12.6714" MbpsReceived="10.9247" DiskTotalBytes="982914220032" DiskFreeBytes="903126876160" DiskQueueDepth="0" DiskIdleSeconds="4.42804" DiskReads="0" DiskWrites="255" DiskReadsCount="41200" DiskWritesCount="7585834" DiskWriteSectors="4480" DiskReadSectors="0" FileWrites="0" FileReads="0" CacheReadBytes="0" CacheFinds="0" CacheWritesBlocked="0" CacheReadsBlocked="0" CachePageReadsMerged="0" CacheWrites="0" CacheReads="0" ZoneID="b1b2d80aa10a68ae469f3494237a48df" MachineID="b1b2d80aa10a68ae469f3494237a48df" AIOSubmitCount="0" AIOCollectCount="0" AIOSubmitLag="0" AIODiskStall="0" CurrentConnections="160" ConnectionsEstablished="0" ConnectionsClosed="0" ConnectionErrors="0" Machine="10.50.46.151:5000" LogGroup="default" Roles="MP" TrackLatestType="Original" /> <Event Severity="10" Time="1549476736.334286" Type="MachineMetrics" ID="0000000000000000" Elapsed="5.00004" MbpsSent="68.6815" MbpsReceived="68.2728" OutSegs="453063" RetransSegs="1" CPUSeconds="0.356362" TotalMemory="67425738752" CommittedMemory="46388178944" AvailableMemory="21037559808" ZoneID="b1b2d80aa10a68ae469f3494237a48df" MachineID="b1b2d80aa10a68ae469f3494237a48df" Machine="10.50.46.151:5000" LogGroup="default" Roles="MP" TrackLatestType="Original" /> <Event Severity="10" Time="1549476739.893415" Type="ProxyMetrics" ID="6c6259b8ac3827df" Elapsed="5.00006" TxnStartIn="207412 246.742 3964283538" TxnStartOut="207452 334.648 3964283300" TxnStartBatch="1524.98 2.11975 88408118" TxnSystemPriorityStartIn="0 0 5" TxnSystemPriorityStartOut="0 0 5" TxnBatchPriorityStartIn="0 0 0" TxnBatchPriorityStartOut="0 0 0" TxnDefaultPriorityStartIn="207412 246.742 3964283533" TxnDefaultPriorityStartOut="207452 334.648 3964283295" TxnCommitIn="0 0 36347506" TxnCommitVersionAssigned="0 0 36347506" TxnCommitResolving="0 0 36347506" TxnCommitResolved="0 0 36347506" TxnCommitOut="0 0 36347506" TxnCommitOutSuccess="0 0 35961944" TxnConflicts="0 0 385562" CommitBatchIn="1.39998 2.06359 26336963" CommitBatchOut="1.39998 2.06536 26336963" MutationBytes="0 0 6612540859" Mutations="0 0 48747200" ConflictRanges="0 0 129504684" LastAssignedCommitVersion="1203110550331" Version="1203110550331" CommittedVersion="1203110550331" CommitBatchesMemBytesCount="0" Machine="10.50.46.151:5000" LogGroup="default" Roles="MP" TrackLatestType="Original" /> <Event Severity="10" Time="1549476741.334293" Type="MachineLoadDetail" ID="0000000000000000" User="21762913" Nice="502508" System="20518466" Idle="1526672512" IOWait="1512340" IRQ="0" SoftIRQ="10121770" Steal="0" Guest="0" 
Machine="10.50.46.151:5000" LogGroup="default" Roles="MP" />

Yeah, that sounds… not right. I’ve typed out three half-responses to you, but each time I check my theory against your TraceEvent, it doesn’t check out. So I don’t have an immediate answer for you.

I’ll ask around today in case someone else is more familiar with proxy code, but it’s probably about time for us to write a GRV performance test…


Stupid question, but is any part of your clients or servers running on Windows? I've seen random latency on the order of 10-15ms in the Windows implementation of fdb before…

It is all Linux (Ubuntu 16.04).

OK, then it's not related to that. Those random spikes went away when I tested on Linux.

What do the ProcessMetrics on the client look like? In particular, is it possible that the network thread is being saturated? If so, that could be a source of latency.

Something else you could look for would be SlowTasks on any of the involved processes (client, proxies, logs).

Yeah, digging into this further, some bottleneck on the client side would make sense, since client-side GRV latencies do seem to regress as we increase the number of reads even though the total number of GRV requests from client to proxy does not change. We are looking at the respective metrics and will update here.

Here are the process metrics from the client in a healthy state (500k qps total, 3.1k per client):
<Event Severity="10" Time="1549495001.367166" Type="NetworkMetrics" ID="0000000000000000" CantSleep="2930" WontSleep="1" Yields="2" YieldCalls="0" YieldCallsTrue="0" SlowTaskSignals="0" YieldBigStack="0" RunLoopIterations="76599" TimersExecuted="40494" TasksExecuted="157065" ASIOEventsProcessed="125925" ReadCalls="20387" WriteCalls="20385" ReadProbes="20387" WriteProbes="0" PacketsRead="20424" PacketsGenerated="20420" WouldBlock="0" SlowTask8M="1" S2Pri1="2.62143e-06" S2Pri2050="2.62143e-06" S2Pri3050="2.62143ffe-06" S2Pri4050="2.62143e-06" S2Pri4950="2.62143e-06" S2Pri5050="2.60932e-06" S2Pri7050="1.48976e-06" S2Pri8050="4.00047e-09" S2Pri10050="1.0661e-09" Machine="10.58.121.24:12310" LogGroup="default" /> <Event Severity="10" Time="1549495001.367166" Type="ProcessMetrics" ID="0000000000000000" Elapsed="5.00001" CPUSeconds="2.87021" MainThreadCPUSeconds="2.14769" UptimeSeconds="860.003" Memory="416571392" ResidentMemory="59912192" UnusedAllocatedMemory="524288" MbpsSent="2.2351" MbpsReceived="3.24368" DiskTotalBytes="0" DiskFreeBytes="0" DiskQueueDepth="0" DiskIdleSeconds="0" DiskReads="0" DiskWrites="0" DiskReadsCount="0" DiskWritesCount="0" DiskWriteSectors="0" DiskReadSectors="0" FileWrites="0" FileReads="0" CacheReadBytes="0" CacheFinds="0" CacheWritesBlocked="0" CacheReadsBlocked="0" CachePageReadsMerged="0" CacheWrites="0" CacheReads="0" ZoneID="[not set]" MachineID="[not set]" AIOSubmitCount="0" AIOCollectCount="0" AIOSubmitLag="0" AIODiskStall="0" CurrentConnections="425" ConnectionsEstablished="0" ConnectionsClosed="0" ConnectionErrors="0" Machine="10.58.121.24:12310" LogGroup="default" TrackLatestType="Original" /> <Event Severity="10" Time="1549495001.367166" Type="MachineMetrics" ID="0000000000000000" Elapsed="5.00001" MbpsSent="101.448" MbpsReceived="92.5736" OutSegs="620455" RetransSegs="0" CPUSeconds="1.53884" TotalMemory="67479838720" CommittedMemory="6997049344" AvailableMemory="60482789376" ZoneID="[not set]" MachineID="[not set]" Machine="10.58.121.24:12310" LogGroup="default" TrackLatestType="Original" /> <Event Severity="10" Time="1549495001.376438" Type="TransactionMetrics" ID="0000000000000000" ReadVersions="2694260" LogicalUncachedReads="2687277" PhysicalReadRequests="2687277" CommittedMutations="0" CommittedMutationBytes="0" CommitStarted="0" CommitCompleted="0" TooOld="0" FutureVersions="0" NotCommitted="0" MaybeCommitted="0" ResourceConstrained="0" TransactionBatcherReadVersionRequestsSent="23125" TransactionBatcherMaxBatchSizeReached="0" MeanLatency="0" MedianLatency="0" Latency90="0" Latency98="0" MaxLatency="0" MeanRowReadLatency="0.000286058" MedianRowReadLatency="0.000235319" MaxRowReadLatency="0.00270963" MeanGRVLatency="0.00261455" MedianGRVLatency="0.00257373" MaxGRVLatency="0.00393844" MeanCommitLatency="0" MedianCommitLatency="0" MaxCommitLatency="0" MeanMutationsPerCommit="0" MedianMutationsPerCommit="0" MaxMutationsPerCommit="0" MeanBytesPerCommit="0" MedianBytesPerCommit="0" MaxBytesPerCommit="0" GRVBatchTimeMean="0.000972856" GRVBatchTimeMedian="0.000977913" GRVBatchTimeMax="0.0010788" GRVBatchReplyLatenciesMean="0.00193746" GRVBatchReplyLatenciesMedian="0.00183296" GRVBatchReplyLatenciesMax="0.00301933" Machine="10.58.121.24:12310" LogGroup="default" /> <Event Severity="10" Time="1549495006.367175" Type="MachineLoadDetail" ID="0000000000000000" User="687273237" Nice="934901" System="295009870" Idle="2058843655" IOWait="2289794" IRQ="0" SoftIRQ="155368247" Steal="0" Guest="0" Machine="10.58.121.24:12310" LogGroup="default" /> <Event Severity="10" 
Time="1549495006.367175" Type="MemoryMetrics" ID="0000000000000000" AllocatedMemory16="262144" ApproximateUnusedMemory16="0" AllocatedMemory32="393216" ApproximateUnusedMemory32="131072" AllocatedMemory64="1048576" ApproximateUnusedMemory64="0" AllocatedMemory128="8126464" ApproximateUnusedMemory128="131072" AllocatedMemory256="786432" ApproximateUnusedMemory256="0" AllocatedMemory512="262144" ApproximateUnusedMemory512="0" AllocatedMemory1024="524288" ApproximateUnusedMemory1024="262144" AllocatedMemory2048="0" ApproximateUnusedMemory2048="0" AllocatedMemory4096="1835008" ApproximateUnusedMemory4096="0" Machine="10.58.121.24:12310" LogGroup="default" />

Here are the process metrics from the client in an unhealthy state (2 million qps total, 12.5k per client):
<Event Severity="10" Time="1549495967.906281" Type="NetworkMetrics" ID="0000000000000000" CantSleep="754" WontSleep="9583" Yields="9585" YieldCalls="0" YieldCallsTrue="0" SlowTaskSignals="0" YieldBigStack="0" RunLoopIterations="16102" TimersExecuted="134738" TasksExecuted="577462" ASIOEventsProcessed="136813" ReadCalls="66380" WriteCalls="66764" ReadProbes="66380" WriteProbes="0" PacketsRead="70946" PacketsGenerated="70948" WouldBlock="0" SlowTask1M="89" SlowTask2M="92" SlowTask4M="19" SlowTask8M="6" SlowTask16M="1" SlowTask32M="1" S2Pri1="30.6195" S2Pri2050="30.6195" S2Pri3050="30.6195" S2Pri4050="30.6195" S2Pri4950="30.6195" S2Pri5050="0.417555" S2Pri7050="0.390901" S2Pri8050="0.00018723" S2Pri10050="9.7441e-09" Machine="10.58.136.34:34355" LogGroup="default" /> <Event Severity="10" Time="1549495967.906281" Type="ProcessMetrics" ID="0000000000000000" Elapsed="5.00033" CPUSeconds="5.94292" MainThreadCPUSeconds="4.17738" UptimeSeconds="395.029" Memory="461516800" ResidentMemory="106426368" UnusedAllocatedMemory="34865152" MbpsSent="8.45999" MbpsReceived="12.7579" DiskTotalBytes="0" DiskFreeBytes="0" DiskQueueDepth="0" DiskIdleSeconds="0" DiskReads="0" DiskWrites="0" DiskReadsCount="0" DiskWritesCount="0" DiskWriteSectors="0" DiskReadSectors="0" FileWrites="0" FileReads="0" CacheReadBytes="0" CacheFinds="0" CacheWritesBlocked="0" CacheReadsBlocked="0" CachePageReadsMerged="0" CacheWrites="0" CacheReads="0" ZoneID="[not set]" MachineID="[not set]" AIOSubmitCount="0" AIOCollectCount="0" AIOSubmitLag="0" AIODiskStall="0" CurrentConnections="425" ConnectionsEstablished="0" ConnectionsClosed="0" ConnectionErrors="0" Machine="10.58.136.34:34355" LogGroup="default" TrackLatestType="Original" /> <Event Severity="10" Time="1549495967.906281" Type="MachineMetrics" ID="0000000000000000" Elapsed="5.00033" MbpsSent="290.851" MbpsReceived="311.339" OutSegs="1515292" RetransSegs="1" CPUSeconds="3.1805" TotalMemory="67479838720" CommittedMemory="7308333056" AvailableMemory="60171505664" ZoneID="[not set]" MachineID="[not set]" Machine="10.58.136.34:34355" LogGroup="default" TrackLatestType="Original" /> <Event Severity="10" Time="1549495971.358711" Type="TransactionMetrics" ID="0000000000000000" ReadVersions="4983126" LogicalUncachedReads="4980266" PhysicalReadRequests="4978992" CommittedMutations="0" CommittedMutationBytes="0" CommitStarted="0" CommitCompleted="0" TooOld="0" FutureVersions="0" NotCommitted="0" MaybeCommitted="0" ResourceConstrained="0" TransactionBatcherReadVersionRequestsSent="10267" TransactionBatcherMaxBatchSizeReached="7286" MeanLatency="0" MedianLatency="0" Latency90="0" Latency98="0" MaxLatency="0" MeanRowReadLatency="0.00137533" MedianRowReadLatency="0.000899792" MaxRowReadLatency="0.0170996" MeanGRVLatency="0.134298" MedianGRVLatency="0.121609" MaxGRVLatency="0.459229" MeanCommitLatency="0" MedianCommitLatency="0" MaxCommitLatency="0" MeanMutationsPerCommit="0" MedianMutationsPerCommit="0" MaxMutationsPerCommit="0" MeanBytesPerCommit="0" MedianBytesPerCommit="0" MaxBytesPerCommit="0" GRVBatchTimeMean="0.00480935" GRVBatchTimeMedian="0.005" GRVBatchTimeMax="0.005" GRVBatchReplyLatenciesMean="0.118248" GRVBatchReplyLatenciesMedian="0.100317" GRVBatchReplyLatenciesMax="0.457898" Machine="10.58.136.34:34355" LogGroup="default" /> <Event Severity="10" Time="1549495972.906609" Type="MachineLoadDetail" ID="0000000000000000" User="613283393" Nice="689132" System="279238315" Idle="1959405876" IOWait="1587405" IRQ="0" SoftIRQ="148393452" Steal="0" Guest="0" Machine="10.58.136.34:34355" 
LogGroup="default" /> <Event Severity="10" Time="1549495972.906609" Type="MemoryMetrics" ID="0000000000000000" AllocatedMemory16="393216" ApproximateUnusedMemory16="0" AllocatedMemory32="655360" ApproximateUnusedMemory32="262144" AllocatedMemory64="5373952" ApproximateUnusedMemory64="2883584" AllocatedMemory128="14024704" ApproximateUnusedMemory128="4194304" AllocatedMemory256="11796480" ApproximateUnusedMemory256="7864320" AllocatedMemory512="8781824" ApproximateUnusedMemory512="6160384" AllocatedMemory1024="9699328" ApproximateUnusedMemory1024="6291456" AllocatedMemory2048="0" ApproximateUnusedMemory2048="0" AllocatedMemory4096="1835008" ApproximateUnusedMemory4096="0" Machine="10.58.136.34:34355" LogGroup="default" />

We see a lot of SlowTask*M entries in the second sample. Does this mean we are bound on the client-side network thread? The stats on the machine look good and we are not saturating the NIC, but maybe we are sending too many packets per second. We are also going back to more clients to validate.

We also notice that the mean and max GRV latency degrade from 3ms to 700ms and from 200ms to 15s respectively, while the client's mean and median row read latencies hold steady below 1ms and only the max increases from 10ms to 200ms. Is there any prioritization of row read requests over GRV requests on the client side that would explain the disproportionate effect?

One theory I have about the batching is that the behavior is a result of a priority inversion on the proxies. Could you try swapping the values of TaskProxyGetConsistentReadVersion and TaskProxyGetRawCommittedVersion in network.h and let me know if that changes anything?

(so TaskProxyGetConsistentReadVersion = 8565, and TaskProxyGetRawCommittedVersion = 8500)

Slow tasks are measured in terms of their duration in clock cycles, so to account for the hundreds of milliseconds you are seeing, you’d probably need them to be at least in the 100s of millions of cycles. Based on the one event that you pasted, it doesn’t look like you are seeing slow tasks quite that high.

One thing that does stick out to me, though, is the set of fields named S2PriXXXX. Roughly speaking, these are intended to track how long it's taking to service tasks at various priorities. They add up the squared durations of the time spent working on higher priority tasks, so large numbers here usually mean that your process is very busy doing work at or above some particular priority. In this case, we see that between 4950 and 5050, there is a substantial change (30.6 vs. 0.4), so tasks at or below the busy priority are likely to experience significant latencies. It looks like parts of the get read version path use priority 5000, so that may explain it.

Also, based on this one event it doesn’t look like your network thread is completely saturated (I see MainThreadCPUSeconds=4.2 out of 5.0), but it is pretty busy. If you try to drive the same load to the cluster with more clients, do you still run into trouble?

Thanks Evan and A.J., for the responses. All that makes sense.

We switched to reusing transactions to try to find other bottlenecks besides the GRV. What we have observed is that we are bottlenecked by the clients (we can't do 2 million rps with 160 clients without latencies collapsing). What was confusing was that GRV latency increased even though we do very few GRVs, while reading from storage servers stayed steadier, which can be explained by reads having higher priority. When we doubled the number of clients and reused transactions, we were able to get higher throughput, so it seems we were client-bound, although we were not 100% sure why exactly (CPU and network on the machines looked fine).

We will probably get back to testing without reusing transactions and varying the number of clients later today to confirm whether we are bottlenecked by the client or the proxy in that scenario, and once we start being bottlenecked by the server, we will try Evan's fix and update the thread.

It turned out we were client-side limited, and adding more clients did increase the throughput. What was very confusing was that although we issued very few GRV requests, their latency skyrocketed while reads only degraded slightly. This seems to be because of the relative priority of the two operations as mentioned above. That seems reasonable given FDB's approach of slowing things down at GRV time when overloaded, but it was confusing.

To reduce any noise, we tried a GRV only benchmark (no actual reads):

| Clients x GRV/s | p50 client GRV mean (ms) | p50 client GRV batch time (ms) | p50 proxy GRV batch time (ms) | Proxy getLiveCommittedVersion/s | Client GRV requests to proxy/s | Client max GRV batch size reached (%) |
| --- | --- | --- | --- | --- | --- | --- |
| 640 x 100 | 2 | 1 | 0.25 | 20k | 20.5k | 0 |
| 640 x 200 | 2 | 1 | 0.28 | 19k | 40.4k | 0 |
| 640 x 400 | 3 | 1.3 | 0.5 | 12.6k | 78.8k | 0 |
| 640 x 800 | 3 | 1.7 | 0.7 | 8.9k | 92.6k | 0 |
| 640 x 1600 | 4 | 1.8 | 0.7 | 7.8k | 102k | 0 |
| 640 x 3200 | 5 | 2 | 0.7 | 7.5k | 105.1k | 0 |
| 640 x 6400 | 5.3 | 2 | 0.9 | 6.8k | 105.8k | 0 |
| 640 x 12800 | 5.3 | 2 | 0.9 | 6.8k | 106.5k | 0 |
| 640 x 25600 | 4.6 | 1.8 | 0.7 | 8k | 100k | 14 |

Observations:

  1. The first bottleneck we hit was MaxBatchSize. This forces the clients to send requests early, limiting the overall ability to batch. Once fdb clients start hitting that limit, the number of requests they send to the proxies keeps growing linearly as we scale GRV requests, and we quickly hit a breaking point where we overload the proxies. We set MaxBatchSize to 100 to achieve the above results; otherwise the cluster was getting into a bad state in the last two tests.
  2. The test is obviously very sensitive to the number of clients. If every client sends one batch request every 5 milliseconds, and the proxies have a hard limit on how many such requests they can process, this obviously puts a limit on the number of clients that can be involved (see the quick arithmetic after this list). We tested with 640 clients here.
  3. Batching worked pretty well and got us to about 16 million rps, and it seemed we could just keep scaling (until a client exceeds a MaxBatchSize request every 5ms). However, the 5ms mean GRV latency is not great for our use case. We tried varying the number of proxies from 8 to 16 and then 24 but did not see a major difference. We did not add a ton (like hundreds) of proxies, but we can perform that experiment if someone believes it could help at all. As far as we understand, adding more proxies gives them more bandwidth to answer the client batch requests but also requires more communication between the proxies themselves, so there is a trade-off there.
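The quick arithmetic referenced in observation 2, using the 5ms window from above:

```python
clients = 640
batch_window_s = 0.005                          # one GRV batch request per client per 5ms
requests_to_proxies = clients / batch_window_s  # 128,000 requests/s at saturation
print(requests_to_proxies)                      # the table above plateaus around 105k/s
```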

We don't see a way to get to that scale (16 million) with 640 clients without degrading the latencies significantly. Is that a correct observation?


To be clear, what you're observing isn't the intended behavior. Your previous posts clearly showed the proxy-side GRV batching logic isn't batching in your workload, and that's concerning. Dynamically resizing MaxBatchSize according to observed GRV latency (or just increasing it) also sounds like it's probably a good idea.

It seems like you’ve done good benchmarking work here, so I’m going to guess that, yes, this is as good as you’re going to get with the current state of the code. Did you happen to run the test that Evan suggested? Did that improve the batching and end result for you?

I’ll file an issue to track the GRV batching problems tomorrow, because it’s something that should be addressed regardless.

EDIT: Filed #1192.

I do think the batching is working as intended, at least to some degree. In my last response I said that 1) we were previously client-side limited and thus no proxy-side batching was needed (sorry about the confusion), and 2) proxy-side GRV batching is happening: you can see that proxy getLiveCommittedVersions (the proxy-side computation) is far lower than the client GRV requests as we scale requests.

I did try to change the priorities as Evan suggested, but it did not lead to a substantial improvement since batching is already working.

Otherwise, thanks for filing the issue. A GRV performance test that will help identify and fix the bottlenecks sounds good to me.

What are the recommendations for the cached GRV lifetime?
Let's say I expect only one active client performing operations on a subspace (no conflicts expected): can I cache GRV calls for at least 1 second, or more? Will it affect the database in any other way besides a higher chance of transaction conflicts?

I believe a read version remains valid for up to 5 seconds.
One issue I can think of when caching read versions aggressively: it bypasses the ratekeeper's ability to throttle transactions from starting when the cluster is unable to keep up with the load.
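Building on the caching sketch earlier in the thread, a rough version with a ~1s lifetime might look like this with the Python bindings (helper names are ours; as far as I know a version older than ~5s fails the read with transaction_too_old, error 1007, so we drop the cache and retry):

```python
import time
import fdb

fdb.api_version(600)
db = fdb.open()

CACHE_TTL = 1.0                        # the ~1 second lifetime asked about above
_cache = {"version": None, "at": 0.0}

def read_with_cached_grv(key):
    while True:
        tr = db.create_transaction()
        now = time.monotonic()
        use_cache = _cache["version"] is not None and now - _cache["at"] < CACHE_TTL
        if use_cache:
            tr.set_read_version(_cache["version"])
        try:
            value = tr[key].wait()
        except fdb.FDBError as e:
            if e.code == 1007:         # transaction_too_old: the cached version aged out
                _cache["version"], _cache["at"] = None, 0.0
                continue               # retry with a fresh GRV
            raise
        if not use_cache:
            _cache["version"], _cache["at"] = tr.get_read_version().wait(), now
        return value
```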
