FoundationDB

RocksDB backend


#1

Are there any plans to support an LSM-tree engine such as “RocksDB” as a compile-time option?


(Yang Ruifeng ) #2

Hi, here is a demo that uses RocksDB as the backend:
rocksdb backend for foundationdb


(Ryan Worl) #3

This post explains why RocksDB, and other existing storage engines, are not included as options. They do synchronous IO and therefore require thread pools, which means they do not work with the simulation testing framework. The simulation testing framework requires that every part of the database be able to run in a single thread.

That doesn’t mean you can’t or shouldn’t write an adapter for RocksDB if you personally trust it, but you would need to do the testing yourself to verify things worked as expected.


#4

@ryanworl

Thanks for pointing that out.


(Dave Lester) #5

Any plans to contribute this back to the OSS project, or thoughts on whether you’re looking to maintain this outside of the demo?


#6

@yangruifeng
How did you solve the async issue?
A dedicated thread?
The async-patch was never applied to RocksDB.


(Alex Miller) #7

@yangruifeng, is this a thing that you’d be interested in trying to see if there’s a way to adapt RocksDB to run in a more evented style and integrated into the main FDB run loop so that it can get merged into mainline FDB? For the sqlite storage engine, there’s a layer that adapts stackful coroutines into flow, CoroFlow.actor.cpp, that you might be able to use to similarly flow-ify RocksDB operations.

But like, well done though. I’m really happy to see that integrating in a new storage engine wasn’t too much work, and was feasible to do without help and guidance.

The code looks like all operations are tossed onto a rocksdb::ThreadPool.
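For readers unfamiliar with the pattern, here is a minimal sketch of what "tossing operations onto a thread pool" looks like in general. It uses only the C++ standard library. `BlockingPool` and `blockingGet` are hypothetical names made up for illustration; this is not `rocksdb::ThreadPool` and not the code from the demo.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a thread pool; NOT rocksdb::ThreadPool.
class BlockingPool {
public:
    explicit BlockingPool(std::size_t n) {
        assert(n > 0);
        for (std::size_t i = 0; i < n; ++i)
            workers_.emplace_back([this] { run(); });
    }

    ~BlockingPool() {
        {
            std::lock_guard<std::mutex> lock(mu_);
            stopping_ = true;
        }
        cv_.notify_all();
        for (auto& w : workers_) w.join();
    }

    // Queue a blocking call. The caller gets a future back immediately and
    // is not blocked by the call itself.
    template <class F>
    auto submit(F fn) -> std::future<decltype(fn())> {
        using R = decltype(fn());
        auto task = std::make_shared<std::packaged_task<R()>>(std::move(fn));
        std::future<R> result = task->get_future();
        {
            std::lock_guard<std::mutex> lock(mu_);
            queue_.push([task] { (*task)(); });
        }
        cv_.notify_one();
        return result;
    }

private:
    // Worker loop: pop jobs until the pool is stopping and the queue is empty.
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mu_);
                cv_.wait(lock, [this] { return stopping_ || !queue_.empty(); });
                if (queue_.empty()) return;
                job = std::move(queue_.front());
                queue_.pop();
            }
            job();
        }
    }

    std::mutex mu_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> queue_;
    std::vector<std::thread> workers_;
    bool stopping_ = false;
};

// Hypothetical synchronous read, standing in for a blocking storage call.
std::string blockingGet(const std::string& key) { return "value-for-" + key; }
```

A caller would write something like `auto f = pool.submit([] { return blockingGet("k1"); });` and pick up the result from the future later. The extra threads are exactly what single-threaded simulation cannot accommodate.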


#8

“WiredTiger” (also LSM) would support async ops, but I don’t know if range deletes are possible.


(Yang Ruifeng ) #9

@doublemax @alexmiller
Yes, a dedicated thread pool (rocksdb::ThreadPool).

@alexmiller
Do you mean something like this?
This is my old design, but I worry that RocksDB write stalls would affect the main thread.
[image: design diagram]

@davelester
The code is rough and any suggestions are welcome; after that I can contribute this back to the OSS project.


(Ricky Saltzer) #10

Would there be any benefit to having an LMDB-based backend as well?


(Alex Miller) #11

There was a RocksDB meetup today, and Nutanix presented some very similar work that they did in dispatching RocksDB reads and writes from a pool of user-level threads / fibers. I’m hoping slides will show up soon that I can point you towards.

Your diagram is roughly a layering of what I’d expect this to look like. You’re going to be following a general pattern of getting RocksDB operations to run in a coroutine, and for any call that they would make to block, call into Flow code that returns a future as to when the coroutine can resume, and then suspend the coroutine and switch to something new. CoroFlow, and its existing usage by sqlite, should be of help in figuring out how to do that.

I think the rough concrete outline of work would be to define a new rocksdb::Env that implements:

  1. most functions with their platform::* equivalent
  2. a new implementation of SequentialFile and RandomAccessFile that wrap IAsyncFile
  3. Schedule() as a function that spawns a new coroutine

And then see what fails. I’d probably start by defining a Flow unit test that initializes a FlowRocksDBEnv and calls get() on a random key, and start implementing the minimal things you need to get that working.
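To make the scheduling part of that outline concrete, here is a rough sketch under stated assumptions. `SingleThreadEnv` is a hypothetical, heavily simplified class: it is not `rocksdb::Env`, it leaves out priorities and all file APIs, and it runs each task to completion instead of suspending a coroutine. It only shows the idea of queuing scheduled work for a single-threaded run loop rather than handing it to a thread pool.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for the scheduling part of an Env-like
// interface. NOT rocksdb::Env.
class SingleThreadEnv {
public:
    using Fn = void (*)(void*);

    // Instead of handing work to a thread pool, remember it for the run loop.
    void Schedule(Fn fn, void* arg) {
        assert(fn != nullptr);
        pending_.emplace_back(fn, arg);
    }

    // Run queued work on the calling thread, in FIFO order, until none is
    // left. Tasks may call Schedule() again; that work runs later in the
    // same loop. Returns the number of tasks executed.
    std::size_t RunUntilIdle() {
        std::size_t ran = 0;
        while (!pending_.empty()) {
            std::pair<Fn, void*> task = pending_.front();
            pending_.pop_front();
            task.first(task.second);
            ++ran;
        }
        return ran;
    }

private:
    std::deque<std::pair<Fn, void*>> pending_;
};

// Example tasks (hypothetical) that record the order in which they ran.
struct Trace {
    SingleThreadEnv* env;
    std::vector<std::string> events;
};

void flushTask(void* arg) {
    static_cast<Trace*>(arg)->events.push_back("flush");
}

void compactTask(void* arg) {
    Trace* t = static_cast<Trace*>(arg);
    t->events.push_back("compact");
    t->env->Schedule(flushTask, t);  // follow-up work queued from inside a task
}
```

Because everything runs on one thread in a fixed order, the same inputs always produce the same sequence of events, which is the property simulation depends on. A real integration would replace the run-to-completion step with a coroutine that suspends on a Flow future.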

This strongly relies upon the assumption that all blocking operations that RocksDB does somehow go through rocksdb::Env. Anything from memory-mapped files, to sleep() calls, to mutex usage, would invalidate this assumption, and probably make integrating RocksDB into deterministic simulation infeasible.


The above defines the work that would need to be done in order to integrate RocksDB into FDB’s deterministic simulation tests. For running with RocksDB on real world clusters, I’m concerned that trying to pack the CPU load of compactions onto the same thread as what serves user read requests would be impractical. I suspect that although packing everything into one thread will be needed for testing, we’ll need to run with compactions running on threads in the background, but we can figure that out once it’s shown that we can solve the testing side of this problem.
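One way to picture that split is a small executor interface with two implementations, one that runs background work inline for testing and one that runs it on separate threads for real clusters. This is only a sketch: `BackgroundExecutor`, `InlineExecutor`, `ThreadedExecutor`, and `mergeRuns` are hypothetical names, and the "compaction" here is just a merge of two sorted vectors.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical interface: decides where background work such as a
// compaction runs.
struct BackgroundExecutor {
    virtual ~BackgroundExecutor() = default;
    virtual void run(std::function<void()> job) = 0;
    virtual void drain() = 0;  // wait until all submitted jobs have finished
};

// Testing-style: run the job immediately on the caller's thread.
struct InlineExecutor : BackgroundExecutor {
    void run(std::function<void()> job) override { job(); }
    void drain() override {}
};

// Production-style: run each job on its own thread, off the serving thread.
struct ThreadedExecutor : BackgroundExecutor {
    ~ThreadedExecutor() override { drain(); }
    void run(std::function<void()> job) override {
        threads_.emplace_back(std::move(job));
    }
    void drain() override {
        for (auto& t : threads_) t.join();
        threads_.clear();
    }

private:
    std::vector<std::thread> threads_;
};

// Hypothetical "compaction": merge two sorted runs into one sorted run.
std::vector<int> mergeRuns(const std::vector<int>& a, const std::vector<int>& b) {
    assert(std::is_sorted(a.begin(), a.end()));
    assert(std::is_sorted(b.begin(), b.end()));
    std::vector<int> out(a.size() + b.size());
    std::merge(a.begin(), a.end(), b.begin(), b.end(), out.begin());
    return out;
}
```

The storage code would call `executor.run(...)` without knowing which implementation is behind it, so the same compaction logic could be exercised single-threaded in tests and moved to background threads in production.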


#12

“SQLite4” is also an LSM (with fast range deletes).
Maybe FDB should simply change to SQLite4?
https://sqlite.org/src4/doc/trunk/www/lsmusr.wiki


(Christophe Chevalier) #13

It seems that work on sqlite4’s LSM engine has stopped for some time now:

Last commit: https://sqlite.org/src4/info/c0b7f14c0976ed5e


(Yang Ruifeng ) #14

LSM engine has been folded into SQLite3


#15

You are right. The LSM was merged into SQLite3 (interesting).
A list of possible storage engines: github.com/pmwkaa/engine.so

The LSM trees are: SQLite3, LevelDB/RocksDB and WiredTiger


#16

Maybe it is easier to enable the LSM tree of SQLite?
https://www.sqlite.org/cgi/src/dir?ci=9b37bbf5f338dea9&name=ext/lsm1