FoundationDB

Throughput of a Queue/Set like data structure


#1

Hey guys,

I’m new to FoundationDB and have only started playing with it. I’d like to use it to implement a queue / work distribution mechanism. I’m not too fussed about the order of the elements but would prefer it if the overhead was small.

I am running a single instance fdb server and a single client (with two threads on a dual core '17 Macbook Pro), but I’m only managing a throughput of about 120 removals per second. My transaction consists of finding an element and clearing it.

(I am using the queue example in the documentation, but I have changed it to make the get/clear transacted)

Is this normal throughput for a single server (no cluster)?

Thank you

Biturbo


(David Scherer) #2

No, that speed implies that you are latency bound. Either you are only removing items from the queue sequentially (in which case you are latency bound by definition) or your queue design makes your transactions all conflict (so that only one is happening at a time). At a guess you should be able to do some tens of thousands of operations per second on a single server if you get your workload properly parallel.


#3

Thanks Dave

I’m using the example from the documentation, and the client and server are running on the same machine.

How would you recommend designing a high throughput queue where the elements are removed one by one? I’m ok with no ordering.

More specifically, how can I efficiently implement “get and remove any KV”?

Apologies if I’m thinking of this the wrong way.

Thanks


(David Scherer) #4

Totally apart from queue design, you need to separate latency (“how fast can I do one operation on this data structure”) and throughput (“how many operations can (a huge number of clients) do on this data structure per second”)? FoundationDB’s latencies can be pretty low in the scheme of things - a tuned server setup with nice disks and network might be twice as fast as your laptop even with the overhead of communications between multiple machines - but they can never be good enough for a single client doing things one at a time to keep up with the throughputs it is capable of. To remove just 10,000 items per second from the queue one at a time you would have to have a whole-transaction latency of 100 microseconds, which is not possible with current hardware. And making the cluster bigger can’t help with this - nine women can’t have a baby in one month.

This is basically true of any distributed system, or of any system that offers durability (so that it has to sync to disk on every operation), and of course FoundationDB is both. For that matter in modern hardware even RAM has significantly higher throughput than 1/latency!

The usual use case for a “work queue” is that lots of processes/threads/actors/whatever are putting work on the queue and lots are taking work off the queue. It should be possible for, say, 100 parallel transactions to remove 100 different items from the queue in about the same time as 1 transaction can remove 1 item from the queue. If this isn’t true, it’s a problem with the design of your data model or transactions.

If the problem is that you really have a single process that wants to consume items from a queue sequentially, then the key is to not remove them one at a time! One process scanning through an ordered log with range reads (and erasing items with range clears) should be able to get much better performance, though it still won’t be able to keep a whole FoundationDB cluster busy.


#5

Thank you again for taking the time to reply.

I totally understand what you are saying, and of course throughput is bounded by latency. My question was more around the fact that I don’t know how to structure the transaction in order to allow sufficient concurrency, thus (by adding more parallel clients) increasing throughput.

More specifically, I believe the “primitive” I require is to “remove and return any element” (because I don’t care about order), and I am not sure how to go about doing that.

What I have currently is something that removes the first element in the range, which obviously results in transaction contention. What I am wondering is whether it is (even theoretically) possible to find a random key that exists without transaction contention, and if so any pointers as to how to achieve that in FoundationDB?

Edit: The ultimate goal is to prevent different workers from working on the same work items.

Many thanks


(Evan Tschannen) #6

Our backup and DR are implemented as clients of the database. We wrote a task bucket abstraction on top of the database to prevent different workers from doing the same work. You might be interested in looking at how we implemented that to give you some ideas for what you are planning.


(David Scherer) #7

Take a look at foundationdb/layers/taskbucket/__init__.py. It does more or less the same thing I think you are trying to do. I don’t think it’s extraordinarily optimized, but it should get you past the barrier of being totally latency bound.

The basic principle is to insert items with random keys, and to pick a random item by picking a random key and looking for the “next” item. (Some extra care to manage conflict ranges prevents remove transactions from conflicting unnecessarily with transactions adding new items that “would have been” removed by the remove transaction)


#8

Thank you both, I will look into that implementation.