Question on parallel commit latencies

Hiya,

I’ve been working on writing some prototype code for a CouchDB layer have finally gotten far enough to start analyzing various requests at the transaction tracing level. I’ve come across a somewhat curious behavior that I’m having trouble understanding based on my reading of the commit protocols in FoundationDB.

The following gantt chart shows seven total transactions. The first transaction is a single doc write. The second transaction is a two document write. Then there are five write transactions spawned in parallel that each write five documents.

My question has to do with the way that the commits are ending at different times. With five document writes across five parallel transactions I’m seeing each commit ending about 1-2ms after the previous. Tweaking that test to write fifty documents (i.e., 250 total) across five transactions extends that smear to something along the lines of 5-10ms between commit finish times.

I’m not particularly concerned about the absolute values on timing here as this is just a trace against a local dev server. What I’m finding curious though is that according to the versionstamps that are written as part of the parallel transactions I am getting multiple transactions into a single batch. My understanding is that proxies will batch transactions which are then submitted to resolvers and eventually the transaction logs. Given that, I would have expected to see any transactions that were part of the same batch to have a response from the proxy at roughly the same time rather than what appears to have been some sort of serialization of each commit.

Granted it’s also conceivable that even though transactions move through the system in a batch the transaction logs end up making each transaction durable individually which could explain what I’m seeing. But based on my shallow understanding of the process it was surprising enough I figured I’d ask for help understanding whatever I’m missing.

Thanks!

I am not sure what is the “commit time” on your figure. Is it the timestamp the client side sees the commit outcome? If so, I might have an explanation to what happened – it is because the outcome from a batch is sent back in order:

But it seems these timestamps are milliseconds apart, which is quite a large interval. So my interpretation can be wrong.

Thanks for the quick reply, Jingyu!

The script that generates those graphs is here:

The widths of all those bars in the gantt chart are the Latency reported in the trace-*.xml files generated using the FDB_NETWORK_OPTION_TRACE_ENABLE and FDB_TR_OPTION_TRANSACTION_LOGGING_ENABLE options.

And I agree, I’d expect at least some variation on reported times but was surprised to see the multi-millisecond differences as well.

Thanks for the code link though! I’ll use that as another jumping point to start off learning more parts of the code base. I’ve not yet got a super clear picture of where all of the important bits in the code are so its been a bit of a slog grep’ing through the codebase trying to figure out where various things are.

Your commit latency for a single document update is 25ms, which seems large for just a single write? What sort of cluster are you running this test against?

I’d be really suspicious that some priority issue is what’s causing this for you. Either many roles packed into one process causing each send() to involve switching to some other task, or local contention causing the transaction end time to be recorded artificially late. That said, I suppose we don’t have a test that measures the smear of reported transaction end times in batches, so it’s possible that there could be something actually weird here.

Thanks Alex,

This is just a local single fdbserver process on OS X. I was only looking at traces to double check that my code was doing what I intended when I noticed the odd behavior. I was just surprised because I was expecting batches to return similar-ish commit times for anything in a single batch so was just trying to figure out where my mental model was wrong there.

Also for those document writes, its 2 + 4 * $num_docs total writes per transaction. The writing on the chart is fairly tiny to read there. Though still low enough I wasn’t thinking that it was a factor.

I am aware that there were concerns about OS X performance though I couldn’t find anything in the forums as to why so I could totally believe this to just be a weird effect of the dev environment I’m using. I’m not at all concerned about the absolute values here, just that the commits were spread out even when they ended up in the same batch.

Thanks!

All the transactions from a single batch (i.e. with the same commit version) should be responded to at roughly the same time by the proxy. Jingyu posted the loop that does this, and my expectation is that it wouldn’t take very long for this loop to send replies to everyone.

On the client side, the latency is logged as soon as the client starts processing the reply. As long as one reply is being processed, however, subsequent ones will not be. I believe the response to this request will at the very least propagate all the way back up to your language binding, and depending on how you’ve written your client it could also include some of your client code as well. If you aren’t yourself running anything on the network thread in reaction to the commit completing, though, I wouldn’t expect this to take a particularly long time.

Another possibility that Alex alluded to is that your workload could be starving the client for brief intervals in between when it processes one commit response and when it starts processing the next. Is there any other activity going on with your client? It might be worth checking the ProcessMetrics client trace event to see if MainThreadCPUSeconds is large relative to elapsed, if the NetworkMetrics trace event has large values for S2Pri1 (an indicator for whether the event loop is ever taking a long time to flush), or whether you have any SlowTask trace events of relevant durations, meaning a single task runs for a while without yielding. The SlowTask events are sampled based on their duration, so depending on how long your test is, the absence of them may not be strong evidence of anything, but the presence of them could be interesting.

I finally managed to find some time to gather some more data on this issue.

That gist contains a simple python script and a few images showing before and after applying a diff. The flame graphs were all generated at 100,000 Hz sampling for roughly 10s while that script sat in a loop. The gantt charts are just a single iteration of the loop.

The first flame chart shows that nearly 40% of CPU time was spent working on just logging the commit messages. That observation lead me to commenting out the logging of the read and write conflict ranges along with the mutations. With that diff I generated the second gantt and flame charts.

Without the logging it’s clear that the commits are now processed in batches nearly simultaneously as I was originally expecting. However, I’m now a bit more curious as there still appears to be some odd structure to the recorded latencies.

My current hypothesis is that there’s some high-ish priority task in the network thread that runs periodically that takes on the order of 0.5ms (on this particular Linux VM on my laptop). So the amount of work that can be done must fit within that time window. With the commit logging disabled we can now fit a bunch of commits into a single slot before being interrupted to service something higher priority.

Judging from the two flame graphs epoll_wait looks to be the obvious candidate. Its been a long time since I’ve read a lot of benchmarking on various event loops, but epoll_wait doesn’t seem all that surprising.

Does that make sense to anyone else?

After reading more code today I think I’ve generally figured this out. If I understand correctly, the general structure of the main run loop in Net2::run looks like this:

  1. Execute runFunc if set which deals with async file io
  2. Calculate a sleep time depending on whether tasks are ready
  3. Run reactor.sleepAndReact
    • If sleep time is 0 becuase tasks are waiting we end up calling epoll_wait with a timeout of 0
    • Else, set a timer to wake us up after sleep time and epoll_wait for up to five minutes
  4. Dequeue completed tasks
  5. Handle tasks for up to TSC_YIELD_TIME CPU cycles
  6. Maybe log a slow task if one was found

Originally I was thinking that maybe something was being enqueued at a higher priority but it turns out that the run loop is actually structured to ensure that IO is processed every TSC_YIELD_TIME CPU cycles which is pretty cool.

The other interesting thing is that this behavior is triggered due to the trace logging itself. So while a somewhat interesting investigation it shouldn’t be an issue in production since we’re not going to be tracing every request like I was doing during development.

I ran into a bit of a performance gotchya related to this. Just setting a transaction debug identifier is enough to cause this behavior in production. I was under the mistaken impression that the trace enable network option would prevent them from being generated at all.

I hadn’t read through the code for this yet but my general idea was to always set the transaction id and then just toggle logging on and off when we wanted to capture. Now that I’ve read more that looks to be 100% backwards. We’ll want to always enable network tracing and then just tag particular transactions to be logged when needed.