How does Flow handle long operations?

A few years ago I was using the Tornado framework with Python. It was really similar to Flow in that it tried to provide high performance on a single thread via asynchronous operations.

The problem with Tornado is that it used an event loop to simulate concurrency. Functions could “yield” to the event loop whenever they were about to do something like I/O, which was akin to saying “OK, preempt me now and come back to this spot in the function when the long-running I/O is done; in the meantime, let someone else run their code while I’m waiting”.

The problem with this was always: what happens if I forget to yield? Say I start some CPU-intensive calculation and it takes half a second to finish. During that time, I’ve grabbed the single thread of execution and no one else can make any progress.

I’m not sure if Flow uses an event loop model, but it must have to deal with this same problem of long-running tasks hogging the single thread of execution. Does Flow have some sort of task/green-thread preemption built into it?

(Disclaimer: it’s been over three years since I worked on FoundationDB, and obviously some things have changed! Hopefully someone will correct me where I am either wrong or out of date.)

Your intuition that Flow also uses an event loop model is correct, as is the sense that blocking the single thread of execution(*) would be bad.

Every actor implicitly yields execution whenever it is waiting for a future value to be ready. Because all I/O should be performed through the IAsyncFile interface, which is fully asynchronous and returns futures, long-running I/O shouldn’t be an issue. The mechanism for obtaining an instance of an async file appears to have changed since I was last here; it looks like IAsyncFileSystem::filesystem()->open() will get you started.
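
To make that concrete, here is a minimal sketch of an actor reading a file through that interface. The flag names, exact signatures, and trace event are just my best guess at the API, so treat it as illustrative rather than authoritative:

```cpp
// Sketch of an actor doing async file I/O; flags and signatures may differ in the current codebase.
#include "fdbrpc/IAsyncFile.h"
#include "flow/actorcompiler.h" // must be the last include in .actor.cpp files

ACTOR Future<Void> readWholeFile(std::string filename) {
    // open() returns a future; the actor implicitly yields here until the handle is ready
    state Reference<IAsyncFile> file =
        wait(IAsyncFileSystem::filesystem()->open(filename, IAsyncFile::OPEN_READONLY, 0600));

    state int64_t size = wait(file->size());
    state std::string buffer((size_t)size, '\0');

    // read() is fully asynchronous too; other tasks keep running while the disk request is outstanding
    int bytesRead = wait(file->read(&buffer[0], (int)size, 0));

    TraceEvent("ReadWholeFile").detail("Filename", filename).detail("BytesRead", bytesRead);
    return Void();
}
```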

But that won’t save you from CPU-intensive calculations! For this purpose, you will see calls sprinkled throughout the codebase that look like Void _ = wait(yield()); or Void _ = wait(yield(somePriority));. These introduce an explicit yield point when there are other (or higher-priority) tasks waiting to run.
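
As a purely hypothetical illustration of the pattern, a CPU-bound loop might check in with the scheduler every so often like this; expensiveTransform() is a made-up stand-in for whatever work is dominating the CPU:

```cpp
int expensiveTransform(int x); // hypothetical CPU-bound helper

ACTOR Future<Void> processAllItems(std::vector<int> items) {
    state int64_t sum = 0;
    state int i = 0;
    for (; i < (int)items.size(); i++) {
        sum += expensiveTransform(items[i]); // purely CPU-bound, never blocks on I/O

        // Every so often, offer to yield so other (or higher-priority) tasks can run.
        // yield() completes immediately when nothing else is ready to run.
        if (i % 1000 == 0) {
            Void _ = wait(yield());
        }
    }
    TraceEvent("ProcessAllItemsDone").detail("Sum", sum);
    return Void();
}
```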

The main loop is (or was, last time I checked!) in flow/Net2.actor.cpp in Net2::run(). There you will see trace events that will log when there are long delays in the event loop, from which you can determine where you may need to introduce explicit yields in long-running operations so as to keep things humming along.

(*) “Single thread of execution” isn’t strictly true; I recall that log flushing happened on its own thread, at least. And the event loop isn’t the only form of (pseudo)concurrency, as the storage engine also uses coroutines to keep disk queues full.

Besides the ability to yield and the explicit prioritization that Ian mentions, the database is designed as far as possible not to have large tasks to do (especially at high priorities). For example, if you are reading a long range of data from the client, the actual requests sent to the server are more or less of the form “please give me another 100KB of data starting from here”, trading some single-client performance to avoid various possible head-of-line blocking performance problems. It’s also largely designed with data structures and algorithms that have bounded worst-case performance (rather than merely amortized performance) in order to keep latencies low. For example, if you quickly delete every value in a range and then read the now-empty range over and over (while those deletes are still in the MVCC window), we go to a lot of trouble to ensure that those reads are still fast.
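
To illustrate the shape of that chunked range-read pattern (this is not the actual client API; fetchChunk(), RangeChunk, and handleKeyValues() are made-up stand-ins), the client-side loop conceptually looks something like this:

```cpp
// Conceptual sketch only; the real request/reply types and byte limits differ.
struct RangeChunk {
    std::vector<std::pair<std::string, std::string>> keyValues;
    bool more;             // true if the server stopped at the byte limit
    std::string nextBegin; // where the next request should start
};

Future<RangeChunk> fetchChunk(std::string begin, std::string end, int byteLimit); // hypothetical
void handleKeyValues(std::vector<std::pair<std::string, std::string>> const& kvs); // hypothetical

ACTOR Future<Void> readWholeRange(std::string begin, std::string end) {
    state std::string cursor = begin;
    loop {
        // Each request is small ("another ~100KB starting from here"), so no single
        // server-side task grows large enough to hog the run loop.
        RangeChunk chunk = wait(fetchChunk(cursor, end, 100000 /* byte limit */));
        handleKeyValues(chunk.keyValues);
        if (!chunk.more)
            break;
        cursor = chunk.nextBegin;
    }
    return Void();
}
```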

Nevertheless, this is a type of performance problem that we can have, and there’s monitoring and debugging infrastructure to detect and fix these problems.

In terms of monitoring, we log a “SlowTask” trace event for CPU intensive tasks that take more than around 250ms.

We also periodically log “Net2SlowTaskTrace” trace events during a slow task that contain backtraces which help us determine what code is blocking the run loop.

We actually log SlowTask events for much shorter tasks than that, though at a certain threshold we start sampling the events. If the slow task profiler is on (which produces Net2SlowTaskTrace events), this threshold is at most 125ms.

Besides the 125ms one, the thresholds for these aren’t a fixed duration of time, but are based on the number of clock cycles that have elapsed. It looks like we currently start logging sampled events at 1,000,000 cycles. We start logging all events at 125ms if the slow task profiler is on, and otherwise we do so at 500,000,000 cycles. At 500,000,000 cycles, we also log the event at a higher severity (SevWarnAlways=30). See:
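
To make those thresholds concrete, here is a rough paraphrase of the decision logic as described above (this is not the linked code; the constant and helper names are made up, and the real values come from knobs):

```cpp
#include <cstdint>

void logSlowTaskEvent(bool sampled, bool warnAlways); // hypothetical helper

const int64_t SAMPLED_LOG_CYCLES = 1000000;       // start logging sampled SlowTask events
const int64_t ALWAYS_LOG_CYCLES = 500000000;      // log every event, and at SevWarnAlways (30)
const double PROFILER_ALWAYS_LOG_SECONDS = 0.125; // 125ms "log everything" threshold when the profiler is on

void maybeLogSlowTask(int64_t elapsedCycles, double elapsedSeconds, bool profilerOn) {
    bool logAll = elapsedCycles >= ALWAYS_LOG_CYCLES ||
                  (profilerOn && elapsedSeconds >= PROFILER_ALWAYS_LOG_SECONDS);
    if (logAll) {
        // At 500,000,000 cycles the event is also promoted to SevWarnAlways.
        logSlowTaskEvent(/*sampled=*/false, /*warnAlways=*/elapsedCycles >= ALWAYS_LOG_CYCLES);
    } else if (elapsedCycles >= SAMPLED_LOG_CYCLES) {
        logSlowTaskEvent(/*sampled=*/true, /*warnAlways=*/false);
    }
}
```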

I am not affiliated with FoundationDB; I am studying the codebase, though, and I have been interested in concurrency/non-blocking designs for some time now (mostly using coroutines/fibers and futures/promises continuations), and I really like what the FoundationDB folks have accomplished.

Specifically for fair scheduling, you need a way to support cooperative multitasking semantics, either via some scheduler or by introducing yield points where it makes sense, where you yield to another ready/runnable task depending on priority semantics (which should take into account how long the task has been waiting for its turn – see the various Linux kernel scheduler algorithms, for example). To accomplish that, you either restrict yourself to yielding only at predefined/specific code paths, or you need coroutines, each with its own stack, where you can save and restore registers (including the stack pointer, instruction pointer, etc.), which means you can yield from anywhere rather than only where you have a continuation that can be used for that purpose. In other words, if you have a loop with N iterations and you want to ask a scheduler anywhere inside that loop body whether you should yield to another ready task, it won’t work without “real” coroutines, unless you otherwise “abuse” the continuation facilities, so to speak.

There are trade-offs with every choice we make, of course. Coroutines require their own stack (well, there are stackless coroutines, but those are different constructs with all kinds of limitations), and context switching (which again involves storing and restoring CPU registers) is far, far more expensive than an indirect call (which is what FDB does).

I think the FDB design is brilliant and I am looking forward to spending more time studying the codebase :slight_smile:

Relative to implementations of async/await in other languages, a very nice feature of flow is automatic cancellation when the future returned by an actor is discarded. This is easy for us to do because C++, not being garbage collected, runs destructors in a predictable way. When an actor is cancelled, the wait statement it is executing throws an exception and the call stack of the actor unwinds, destroying and canceling futures it is holding in turn. So the common case of recursively canceling a “call tree” of actors, and indeed many other cases, work effortlessly and correctly. There is so little code dealing explicitly with cancellation that you could miss that it is a feature!
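
A small sketch of what that looks like from the actor’s point of view, using Flow’s delay() (doSomeWork() is a hypothetical helper, and I’m glossing over the exact error that cancellation throws):

```cpp
void doSomeWork(); // hypothetical

ACTOR Future<Void> worker() {
    loop {
        // Every wait is a cancellation point: if the caller discards the returned
        // future, this wait throws and the actor's stack unwinds, running destructors
        // and thereby cancelling any futures this actor is holding.
        Void _ = wait(delay(1.0));
        doSomeWork();
    }
}

ACTOR Future<Void> caller() {
    state Future<Void> w = worker();
    Void _ = wait(delay(0.5));
    // Discarding the future cancels worker(), and recursively anything it was waiting on.
    w = Future<Void>();
    return Void();
}
```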
