Object store on FoundationDB

Hi! To become familiar with FoundationDB I recently tried implementing a simple object store on top of it. It was a very pleasant experience that only got me more excited about the concept of stateless layers.

I wrote a blog post detailing the implementation, check it out if you like: https://fabianlindfors.se/blog/building-an-object-store-with-foundation-db/


Thanks for sharing, super cool! The post is spot-on with regards to the layer approach, among other things.

I like that you’re using the tuple layer to store multiple data “elements” for each file (the data and content-type). Additionally, you’re splitting the data over multiple keys and then using prefix range reads to get it all back. Seems like a solid approach.
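The scheme described above could be sketched roughly like this (this is not the post's actual code; the key names are made up, and a plain Python dict with sorted tuple keys stands in for FDB and its tuple layer):

```python
# A dict with tuple keys stands in for FDB; tuple keys sort the same
# way the tuple layer's packed keys do.
store = {}

def put_object(name, content_type, data, chunk_size=3):
    # One key for the content type, one key per data chunk.
    store[("objects", name, "content_type")] = content_type
    for i in range(0, len(data), chunk_size):
        store[("objects", name, "data", i // chunk_size)] = data[i:i + chunk_size]

def get_object(name):
    # A prefix range read over ("objects", name, "data") returns the
    # chunks in order, since the keys sort lexicographically.
    prefix = ("objects", name, "data")
    chunks = [v for k, v in sorted(store.items()) if k[:3] == prefix]
    return store[("objects", name, "content_type")], b"".join(chunks)

put_object("hello.txt", "text/plain", b"hello world")
print(get_object("hello.txt"))  # ('text/plain', b'hello world')
```

In real FDB the prefix range read would be a single `GetRange` over the packed tuple prefix; the sorted-dict scan here just mimics that ordering.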

This seems quite useful for anyone who would like something like S3 for small files with very low latencies. Maybe it will even be more S3 API-compatible some day!


Glad to hear! FoundationDB made it dead simple to implement what would otherwise be a quite complex service, especially considering how easy it would be to scale the store out.

If I find the time it would be great to expand on the concept and build an actual usable object store. Leaving the complexity to FoundationDB makes potential features (such as S3 compatibility) super easy to approach and build.

I really enjoyed your well-written blog post too!

The questions that popped up, though, were:

  • Is the single connection from Go client to FDB a bottleneck in some ways?
  • What exactly did you have in mind when writing about the ‘necessary error handling and validation’ that it lacks?
  • What would be your preferred approach to lift the 10MB transaction limit in order to allow larger file objects?

Any thoughts you can share on this would be greatly appreciated!


Thank you!

  • I haven’t had a chance to test this. Judging by the Go bindings documentation, it seems that all parts are safe for use from goroutines, so that should be the only limit. Considering how efficient goroutines are, I’d guess they won’t become a bottleneck, but I’m not certain at all.
  • Currently the program won’t validate the uploaded file size or content type at all, meaning an upload can appear successful even though the file might not have been saved. It also won’t handle any errors from the FDB connection (however unlikely).
  • I’m not nearly qualified enough to answer this question, but I’m guessing the only way would be to split the upload over multiple transactions. According to the documentation, each transaction should preferably stay below 1 MB. This of course means losing the ACID guarantees that a single transaction brings. To avoid concurrent write conflicts, some kind of object-level lock would probably have to be implemented, but I don’t know how that is best achieved.

If anyone else has something to add/correct, please chime in!

I don’t know how the Go bindings work, but in Python implementing something like that will be very challenging, especially if you want the blob/object layer to cooperate with other layers and keep the ACID semantics. A crappy solution would be to store the file on disk and use a background job to push it into the database…

By the way, I think the name “object layer” is misleading; it reminds me of ZODB.


From what I understand there are no differences in capabilities between the Go and Python bindings. If, at the beginning of an upload, the file hash is added to FDB as a kind of lock, it could be checked at the start of each subsequent transaction to ensure the object hasn’t been overwritten by a different upload. Once all the transactions succeed, the rest of the metadata could be added, and not until then is the object/file ready to be read. This way we can ensure an object won’t be read unless it’s fully uploaded. If there are two separate uploads, the earlier one would fail early because it checks the hash in each transaction, ensuring the two uploads won’t overwrite each other. Once again, I’m not qualified to answer this, but it might be a reasonable solution.
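The lock idea could look something like this (a hypothetical sketch: the key names and function boundaries are invented, a dict stands in for FDB, and each function call stands in for one transaction):

```python
store = {}

class UploadSuperseded(Exception):
    pass

def begin_upload(path, upload_id):
    store[("uploading", path)] = upload_id  # acts as the per-object "lock"

def write_chunk(path, upload_id, index, data):
    # Every transaction re-checks the lock: if another upload has
    # replaced it, this upload fails early instead of interleaving writes.
    if store.get(("uploading", path)) != upload_id:
        raise UploadSuperseded(path)
    store[("chunks", path, upload_id, index)] = data

def finish_upload(path, upload_id, content_type):
    if store.get(("uploading", path)) != upload_id:
        raise UploadSuperseded(path)
    # Only now does the object become readable.
    store[("meta", path)] = (upload_id, content_type)

begin_upload("a.txt", "u1")
write_chunk("a.txt", "u1", 0, b"hi")
begin_upload("a.txt", "u2")        # a second upload takes over
try:
    write_chunk("a.txt", "u1", 1, b"!")
except UploadSuperseded:
    print("first upload failed early")
```

In real FDB the lock check and the chunk write would sit in the same transaction, so the check is atomic with the write.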

Regarding the naming, I agree that “object store” is way too generic a name. It does, however, seem to be an accepted term, at least in the sphere of cloud services; both AWS and Google Cloud use it, for example.

I should have noticed and commented earlier…

Splitting work across transactions is a common pattern and there’s often a pretty clean way to make this work while maintaining ACID properties. Generally, the approach is to add a layer of indirection between the data “chunks” and the notion of a completed “file”. A data model could comprise two types of tuples, one for file paths and one for data chunks:

files/file1 = [id1]
files/file2 = [id2]
data/id1/chunk1 = [bytes]
data/id1/chunk2 = [bytes]
data/id1/chunk3 = [bytes]
data/id2/chunk1 = [bytes]

A process to make this work could be:

  1. At the client, determine a unique ID for the data contents. Let’s say it’s 42 for this example.
  2. Begin, over the course of multiple transactions, to upload data into the data section of the database.
  3. When the uploads are all complete, set the files/x key to point to the correct region of the data space.
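The steps above could be sketched like so (a toy walk-through, not a real implementation: a dict stands in for FDB, and each function call stands in for one or several transactions):

```python
store = {}

def upload(path, data, data_id, chunk_size=4):
    # Step 2: write chunks into the data/ space, possibly over many
    # transactions; the file is not visible to readers yet.
    for i in range(0, len(data), chunk_size):
        store[("data", data_id, i // chunk_size)] = data[i:i + chunk_size]
    # Step 3: flip the files/ pointer in one final transaction, making
    # the fully uploaded data visible atomically.
    store[("files", path)] = data_id

def read(path):
    data_id = store[("files", path)]
    chunks = [v for k, v in sorted(store.items())
              if k[:2] == ("data", data_id)]
    return b"".join(chunks)

upload("file1", b"hello object store", data_id=42)  # step 1: pick ID 42
print(read("file1"))  # b'hello object store'
```

The key point is that only step 3 needs to be atomic: until the files/ key is set, readers simply never see the partially written chunks.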

A few considerations: this will work most simply for immutable files. Reads of especially large files could take more than 5 seconds and therefore have to span more than one transaction. This is trivial if the files are not changing and, if they are changing, could also easily be done by simply pointing the file to a new data chunk region (at the risk of reading consistent and complete copies of deleted or modified files).

Hopefully that’s something to get you started!


Thanks for laying this out Ben!

Seems like a good way of handling the transaction limits. I was thinking one could potentially use a hash of the file contents as the ID. That way, duplicate files across the store won’t be stored redundantly, and it would enable a stateless way of generating a unique ID. It might cause problems with file deletions, but that could easily be solved by some kind of atomic reference counter. Thoughts?
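Something like this, perhaps (a rough sketch with made-up key names; a dict stands in for FDB, and the refcount updates would need to be atomic ops or transactional in the real thing):

```python
import hashlib

store = {}

def upload(path, data):
    # Content-addressed ID: the chunk region is keyed by the SHA-256
    # of the contents, so identical files share one copy.
    data_id = hashlib.sha256(data).hexdigest()
    if store.get(("refcount", data_id), 0) == 0:
        store[("data", data_id)] = data     # first copy: store the bytes
    store[("refcount", data_id)] = store.get(("refcount", data_id), 0) + 1
    store[("files", path)] = data_id

def delete(path):
    data_id = store.pop(("files", path))
    store[("refcount", data_id)] -= 1
    if store[("refcount", data_id)] == 0:   # last reference: reclaim
        del store[("data", data_id)]
        del store[("refcount", data_id)]

upload("a", b"same bytes")
upload("b", b"same bytes")                  # deduplicated: one stored copy
delete("a")                                 # "b" still readable
delete("b")                                 # now the data is reclaimed
```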

Hope I find the time to extend the code with your suggestions!

Yeah, that sounds reasonable enough to me (assuming that hash collisions aren’t a problem).

One problem that you could run into with the atomic reference counter approach is contention on the counter. I can see roughly three approaches:

  1. You add a read conflict and a write conflict to the reference counter each transaction. This is the simplest and is probably fine if you don’t expect multiple objects with the same hash all that often (which should be the case for most files; maybe not the empty file, but perhaps that should be handled differently anyway…).
  2. You can use atomic ops to increment the reference counter on insertions, but add the read-conflict key on deletes anyway. This lets you have conflict-free inserts, but your deletions might conflict if two people decrement a reference counter at the same time. (You will also need to add a read conflict range on the data themselves, or you might increment the counter while concurrently deleting it. Oops.) Then the delete logic is (1) do a read (with a conflict range), (2) decrement, and (3) if it hits zero, delete the object itself. Two transactions might end up doing (1) and (2) at the same time, but the resolver will fail one of them, and when it retries, it will re-do (1) and (2) before (maybe) doing (3).
  3. Garbage collection! Then all of your updates to the reference counter are conflict-free, but you need to periodically sweep all of the reference counters (using a conflict-free range read), then for each one that is zero, add a read conflict key on that key and delete the data. (You will still need to add a read conflict key on the data on insertions, as in the second option.) This allows everything to be conflict-free at the expense of keeping deleted objects around a little longer than necessary. (But it also means that if you have some weird pattern where a specific document keeps getting uploaded and deleted and uploaded again, then because you are lazily deleting things, each upload after the first one is essentially free.) The downside is that now you need a sweep job, and you also need to sweep over all counters.*
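The third option might look roughly like this (a toy sketch, not real FDB code: a dict stands in for the database, and the decrement stands in for a conflict-free atomic add):

```python
store = {}

def insert(data_id, data):
    store[("data", data_id)] = data
    store[("refcount", data_id)] = store.get(("refcount", data_id), 0) + 1

def delete(data_id):
    # In FDB this would be an atomic add(-1): no read, no conflicts.
    store[("refcount", data_id)] -= 1

def sweep():
    # Periodic job: range-read all counters; reclaim anything at zero.
    # (The real version would add a read conflict key per zeroed counter
    # before deleting, to guard against a concurrent re-insert.)
    for k in [k for k in store if k[0] == "refcount" and store[k] == 0]:
        del store[("data", k[1])]
        del store[k]

insert("h1", b"x"); insert("h1", b"x"); insert("h2", b"y")
delete("h1"); delete("h2")
sweep()
# "h1" survives (count is still 1); "h2" is reclaimed.
```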

My guess is that the second option is probably good enough and probably what you’d want to do, but it really will depend on the rate of deletes in your store, how hot your system is (because (3) might create hot keys), and how often people upload the same document twice (as either (2) or (3) might be premature optimization if (1) will do just fine).

* Or do you? If you get really spiffy, you can do something like keep a hierarchical data structure where for each range of reference counters, you can keep another key that you increment (or maybe max with 1 or use as a bit mask if you want to be really fancy) whenever you do a delete. Then, you only need to sweep over (1) the range of those secondary keys and (2) any range of reference counters that have had any deletes. On a sweep, you can zero out that secondary entry (you will need to add a read conflict key before you zero it out in case there is a concurrent delete in that range) after removing all of the things that have been deleted forever. If that still is too many keys, you can add another layer of the same on top of that key and so on.
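The footnote’s hierarchy could be sketched like so (everything here is made up for illustration: the range size, the key names, and the dict standing in for FDB):

```python
store = {}
RANGE = 100  # counters 0..99 share dirty slot 0, 100..199 share slot 1, ...

def insert(cid):
    store[("refcount", cid)] = store.get(("refcount", cid), 0) + 1

def delete(cid):
    store[("refcount", cid)] -= 1
    store[("dirty", cid // RANGE)] = 1   # flag this range for the sweeper

def sweep():
    # Only scan counter ranges whose secondary "dirty" key is set.
    dirty = [k[1] for k in list(store) if k[0] == "dirty"]
    for slot in dirty:
        lo, hi = slot * RANGE, (slot + 1) * RANGE
        for k in [k for k in list(store)
                  if k[0] == "refcount" and lo <= k[1] < hi and store[k] == 0]:
            del store[k]             # reclaim dead counters (and their data)
        del store[("dirty", slot)]   # zero out the secondary key
    return dirty                     # which ranges were actually scanned

insert(5); insert(205); delete(5)
print(sweep())  # [0]: only range 0 was dirty; range 2 was never scanned
```

As the footnote says, the real version would need a read conflict key on the dirty key before zeroing it, and the same trick can be stacked into further layers if one level of dirty keys is still too many to scan.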