Another storage engine idea

I’ve been reading papers on storage engines recently, trying to find an alternative storage engine design (i.e. not a b-tree) which preserves as much of the functionality of FDB’s existing storage engines as possible, but makes different trade-offs to increase write throughput and perhaps decrease write latency as well. By functionality, I mean the expectation of most layers that there is good range and point read performance.

It would also have to be a good fit for Flow, which means not only being single-threaded or adaptable as such, but also not conceptually requiring long periods of non-interruptible work or long periods of high memory usage. A non-goal is making a storage engine which is suitable for spinning disks, given how atypical I think it would be to expect that today.

Ideally it would not require a large amount of new code to be written. :slightly_smiling_face:

My search has led me to a design which is in the family of “index+log” (see the outline from a RocksDB developer). I haven’t seen it in any papers, probably because of trade-off #2 below.

The most widely used implementation of this family is Badger, from what I can tell, which follows a design like the one in the WiscKey paper. It uses an LSM to store the keys + pointers to positions in a log file, which holds the keys and the actual data. The idea is that the LSM, when it contains only keys + pointers, will almost always fit in main memory, and fetching the data from the log will require one non-cached random read if the log file is larger than main memory. The log file is compacted by reading it from the beginning, checking the LSM to see whether the key is still alive at that log file position, and re-inserting it at the end of the log if it is. Otherwise you skip it. Eventually the log contains enough dead data that you can swap to a new file.
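
To make the compaction step concrete, here is a minimal sketch in C++ of the liveness check it describes. The names (`LogEntry`, `compactValueLog`) are hypothetical and a plain `std::map` stands in for the real LSM index of key → log position; this is an illustration of the idea, not Badger’s or WiscKey’s actual code.

```cpp
// Minimal sketch of WiscKey-style value-log compaction. A std::map stands in
// for the LSM index of key -> offset of that key's newest value in the log.
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct LogEntry {
    std::string key;
    std::string value;
    uint64_t offset;  // position of this entry in the old log file
};

std::vector<LogEntry> compactValueLog(const std::vector<LogEntry>& oldLog,
                                      std::map<std::string, uint64_t>& index) {
    std::vector<LogEntry> newLog;
    uint64_t newOffset = 0;
    for (const LogEntry& entry : oldLog) {
        auto it = index.find(entry.key);
        if (it != index.end() && it->second == entry.offset) {
            // Still live: copy it to the new log and repoint the index at it.
            LogEntry kept = entry;
            kept.offset = newOffset;
            newLog.push_back(kept);
            it->second = newOffset;
            // A real log would also account for record headers here.
            newOffset += entry.key.size() + entry.value.size();
        }
        // Otherwise the entry has been overwritten or deleted; drop it.
    }
    return newLog;
}
```

The key point is that an entry is live only if the index still points at the exact position where the old log stored it; anything else has been superseded and can be dropped.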

The downside to just re-implementing this design is that it requires writing an LSM from scratch, which would be a lot of code, and the range read performance would still be questionable.

My design uses components which already exist today, namely the memory engine’s binary tree and log structure, and adds a new log file component which is mostly the same as the existing WAL for the memory engine.

The memory engine would operate as normal, writing keys and values into the WAL and storing keys and values in memory. During snapshotting, it would examine each value and check whether it is a real value or a pointer into the log file. Real values would be appended to the log file and replaced in memory with a pointer. This means the memory engine’s snapshot files would over time become only slightly larger than the size of the keys and pointers. Compaction would work like the WiscKey design.
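
Here is a rough sketch, again with hypothetical names, of what the “value or pointer” distinction and the snapshot pass could look like. A `std::variant` plays the role of the in-memory value slot, and the value log is stubbed out with an in-memory vector; the real memory engine’s data structures would obviously differ.

```cpp
// Rough sketch of the "value or pointer" scheme. StoredValue is either the
// value itself (before snapshotting) or a pointer into the value log.
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <variant>
#include <vector>

struct LogPointer {
    uint64_t fileId;
    uint64_t offset;
};

using StoredValue = std::variant<std::string, LogPointer>;

// Stand-in for the on-disk value log; append() records key+value and returns
// where the entry landed.
struct ValueLog {
    std::vector<std::pair<std::string, std::string>> entries;
    LogPointer append(const std::string& key, const std::string& value) {
        entries.emplace_back(key, value);
        return LogPointer{0, static_cast<uint64_t>(entries.size() - 1)};
    }
};

// Snapshot pass: spill inline values to the log and keep only pointers in
// memory, so the snapshot file is roughly keys + pointers in size.
void spillValuesToLog(std::map<std::string, StoredValue>& tree, ValueLog& vlog) {
    for (auto& [key, stored] : tree) {
        if (const std::string* inlineValue = std::get_if<std::string>(&stored)) {
            stored = vlog.append(key, *inlineValue);
        }
        // Entries that are already pointers are left alone.
    }
}
```

After this pass, the in-memory tree (and therefore the snapshot) only needs to hold keys plus small fixed-size pointers, which is what keeps snapshot size close to the size of the key space.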

The trade-offs made here are as follows:

  1. Increased space usage from dead data when values are updated or deleted. I think this is a small price to pay in the cloud, where IOPS are harder to get than space.
  2. All keys and log pointers must fit in memory. This is a significant drawback. I think most clusters are sized such that this would be the case anyway even if it were not a requirement, but it must be noted.
  3. Data locality for caching purposes is different from a b-tree’s. Values written near each other in time would be more likely to be in cache than values that are adjacent in key order (as they would be in a b-tree).
  4. I think write IOPS amplification would be less than a b-tree’s, but write IO bandwidth amplification would probably be similar or even higher. Again, I think this is a small price to pay in the cloud, where IO bandwidth is easier to come by than IOPS.

Hopefully you all found this interesting. I think this storage engine design would be a good option for workloads which require high write throughput, reasonable read throughput, and can afford some extra space usage. I’m not really expecting anything to come out of it unless I do it myself, but I thought it was worth writing down.

You might be interested in Bε-trees. https://www.usenix.org/publications/login/oct15/bender

Comparing the tradeoffs very quickly:

  1. They do maintain (potentially quite a few) tombstones for dead data, so there’s a bit of increased space usage, though range deletes can be stored as range tombstones.
  2. They can work while holding anywhere between one page and all pages in memory.
  3. Data locality is pretty close to a b-tree’s.
  4. Reduced IOPS: pages are intended to be very large, and all reads/writes are done on entire pages.

The very short version of their operation is that you’ve got a tree, where every node maintains a “little” log (buffer) of all of the writes that have hit it. When that log grows too large, the operations are all pushed down toward the leaf nodes, adding any new data, reconciling tombstones, etc.
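
For anyone who hasn’t seen the structure before, here is a toy sketch of that buffering behaviour (hypothetical names, a tiny buffer limit, and no node splitting or rebalancing): writes accumulate in a node’s buffer and are only pushed down, level by level, when the buffer overflows, with deletes carried as tombstone messages until they reach a leaf.

```cpp
// Toy sketch of Be-tree-style buffered writes; illustration only.
#include <cstddef>
#include <map>
#include <memory>
#include <optional>
#include <string>
#include <utility>
#include <vector>

struct Message {
    std::string key;
    std::optional<std::string> value;  // nullopt acts as a delete tombstone
};

struct Node {
    bool isLeaf = true;
    std::vector<Message> buffer;                 // pending writes for this subtree
    std::map<std::string, std::string> records;  // leaf data
    std::vector<std::string> pivots;             // pivots[i] separates children i and i+1
    std::vector<std::unique_ptr<Node>> children;
};

constexpr size_t kBufferLimit = 4;  // tiny limit just for illustration

size_t childIndexFor(const Node& n, const std::string& key) {
    size_t i = 0;
    while (i < n.pivots.size() && key >= n.pivots[i]) ++i;
    return i;
}

void flush(Node& n);

void insertMessage(Node& n, Message m) {
    n.buffer.push_back(std::move(m));
    if (n.buffer.size() > kBufferLimit) flush(n);
}

void flush(Node& n) {
    if (n.isLeaf) {
        // Apply buffered writes: add/overwrite records, reconcile tombstones.
        for (auto& m : n.buffer) {
            if (m.value) n.records[m.key] = *m.value;
            else n.records.erase(m.key);
        }
    } else {
        // Push each message down to the child that owns its key range.
        for (auto& m : n.buffer) {
            insertMessage(*n.children[childIndexFor(n, m.key)], std::move(m));
        }
    }
    n.buffer.clear();
}
```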

A C++ implementation of WiscKey is here: https://github.com/pingcap/titan