FoundationDB

Discussion thread for new storage engine ideas


(Dave Rosenthal) #1

All, one of the most interesting and potentially easy ways to improve FoundationDB is to integrate new storage engines. Storage engines are simple key-value stores that work on a local machine. They don’t need to handle transactions, etc. at all. All you need to do to implement a new one is follow the IKeyValueAbstration, shown here:

[https://github.com/apple/foundationdb/blob/e0c8175f3ccad92c582a3e70e9bcae58fff53633/fdbserver/IKeyValueStore.h]

The only non-standard/tricky part is that erasing a range is expected to be a fast operation, even for large ranges.

Currently there are only two storage engines: The main SQLite-based one, and a “toy” one that works only for datasets that fit in memory. Many interesting ones could be built/integrated, for example, a log-structured engine for spinning disks, a better-optimized b-tree one, or even one that directly used some non-disk remote storage.

Discuss :slight_smile:


Large Range Scans - avoid 5s limit
(sumit jha) #2

Definitely it would be needed and a big improvisation over existing storage engine, integration with rocksDB is something can cover couple of use cases which would be a big boost .
Would like to thought of others .


#3

LMDB would seem to be a good fit for this as its a lean and efficient KV store, although it does have some idiocracies:


(Steve Atherton) #4

Here are some design points of SQLite which a new FDB storage engine may want to avoid:

  • SQLite uses a BTree rather than a B+Tree so it stores full key/value pairs in internal tree nodes, which harms branching factor when using large key/value pairs.
  • There is no prefix or other type of compression in SQLite.
  • Large key/value pairs are stored using a linked list of 4k overflow pages exclusive to that record. This adds additional disk latencies when reading large records and can also increase slack space.
  • SQLite uses a page-based WAL so every 4k page modified in a commit ends up being written in full twice.

(Rinat Abdullin) #5

LMDB is a nice piece of engineering that might be a good fit. We’ve used it in production as in-process app storage for years. Key benefits:

  • Simple design and small codebase
  • Predictable and consistent performance
  • Single writer and parallel readers working with a proper ACID transactions
  • Uses memory-mapped files and lets OS manage the cache (it is different from SQLite in that sense)
  • Aligns well with FDB concepts (e.g. FDB could be replaced with LMDB while running the application code in a deterministic simulation)

However, unlike SQLite, LMDB seems to feature less impressive test suite. Its write amplification factor is also a thing to consider.


(Will Wilson) #6

I think some form of compressing storage engine could be very useful. Prefix compression might be especially handy given that many layers want to represent data hierarchically (and tuples/directories encourage this).

The straw-man argument for why this should be low priority is that you could easily implement it in a layer, and it could be done “adaptively” by storing the versioned compression dictionary itself in the database, keeping metadata about what dictionary version(s) different portions of keyspace were written at, and having a background process that periodically re-compresses old data and garbage collects old dictionary versions.

You could probably do better by writing something special-purpose for each layer that takes advantage of what it knows about the layer’s data model. But that seems like an optimization, and it’d be nice to have a default besides “no compression”.


Best practice of storing structs. Should I pack or store fields separately
(David Scherer) #7

When integrating with off the shelf storage engines that do I/O synchronously (including via memory mapped files) I think some effort may be needed to get decent performance. FoundationDB uses a single-threaded asynchronous execution model and blocking that thread to read from disk will be disastrous for throughput and latency.

Our adaptation of sqlite uses coroutines to let it use our asynchronous I/O abstraction. That approach won’t work for memory mapped I/O. You could use a real thread pool but I’m not sure how performance will be.

A storage engine that doesn’t use our I/O abstractions also won’t be simulatable, so I will be somewhat nervous about correctness.


RocksDB backend
(Amirouche) #8

I don’t know about how fast is wiredtiger for that, but I think it’s a good candidate for those not bothered by gpl.


(cih.y2k) #9

@dave,

  1. although not fully support async I/O, but maybe thread pool solution works with this (https://github.com/leo-yuriev/libmdbx), a much better version of lmdb. And if it works, we can have long read operation.

perf-slide-2

  1. About in-memory storage engine, maybe we can use B+ tree index and mentx engine from tarantool
    https://github.com/tarantool/tarantool/blob/2.0/src/box/memtx_tree.h

https://github.com/tarantool/tarantool/blob/2.0/src/box/memtx_engine.h


(Ryan Worl) #10

Apparently Async IO is at least in the prototyping stage for RocksDB.

RocksDB has seen a ton of mileage inside and outside Facebook.


(Yang Ruifeng ) #11

what about the sqlite/lsm,any idea?


#12

Range deletes are quick using a LSM, e.g.:


(Dave Lester) #13

Related to this topic, one talk at FoundationDB Summit may of particular interest: “Future of FoundationDB Storage Engines


(Deepankar Sharma) #14

Will the talks at the summit be recorded? (Especially this one)


(Erthink) #15

I will be glad to answer a questions about libmdbx.


(Dave Lester) #16

Yes, talks will be recorded and published online after the event.


(Meai1) #17

would be incredible to try this as a backing store: https://github.com/Microsoft/FASTER


(cih.y2k) #18

It seems Faster only support point operations. Not ordered key value data store.


#19

SQLite has added a LSM backend.
https://www.sqlite.org/cgi/src/dir?ci=9b37bbf5f338dea9&name=ext/lsm1
FDB is currently running SQLite with BTREE.


(Austin Seipp) #20

This may be a naive question, but one I haven’t seen addressed is: as a user, why would I want a new storage engine using these alternatives (RocksDB, LMDB, etc) instead of the existing ones, and what features would it offer me over using the existing engine? There are a lot of suggestions for adding different key/value engines onto FoundationDB in this thread, but there are no insights as to why they are good additions, or what gaps they fill, or excel at, that the current storage engine does suboptimally at.

It is worth noting that FoundationDB 6.1 will apparently feature a new “Redwood Storage Engine”, which will be described in the upcoming Summit 2018 talk “The Future of FoundationDB Storage Engines”. This has already been merged into master. This new storage engine seems like it will have some major distinct advantages that require more changes than just coupling an existing storage engine onto the bottom of FDB.

In addition to higher write throughput, this new design will have a smaller space footprint via key prefix compression and a configurable ability to enable long running read transactions by trading off a reasonable amount of disk space to persist old data versions on disk.

This seems like a great advancement and being able to plan my design around long-running read transactions being available is exciting. An interesting question would be what Redwood could further have improved about it, too (including a new low-level storage/tree layer, perhaps)


While we’re at it and I’m throwing out naive ideas here, it would be interesting to think about a persistent-memory backend for FoundationDB. Intel is soon going to ship Optane DIMMs with more widespread availability in their new Cascade Lake platforms. pmem support is also actively available in QEMU, so it can be programmed against, just not with realistic performance numbers. But having a storage engine designed for highly-dense persistent DIMMs (~7TB with lower latency than NVMe, per machine) would be very interesting to experiment with, and Optane seems like it could be very attractive for a lot of workloads.