Discussion thread for new storage engine ideas

(cih.y2k) #21

I’m excited to see Redwood. I also love the design of the Bw-tree for both in-memory and on-disk engines.

(David Scherer) #22

This may be a naive question, but one I haven’t seen addressed is: as a user, why would I want a new storage engine using these alternatives (RocksDB, LMDB, etc.) instead of the existing ones, and what features would it offer me over the existing engine? There are a lot of suggestions in this thread for adding different key/value engines to FoundationDB, but no insights into why they would be good additions, what gaps they fill, or what they excel at that the current storage engine handles suboptimally.

This is in fact a great question. Some thoughts:

Redwood is exciting and will hopefully resolve some of the most important limitations of the existing storage engine. I hope that it will eventually become the default storage engine for most purposes. I’m sure we’ll learn more about its progress at the summit, but I suspect that long-running read transactions may not be available in its first release, precisely because they will require changes to the interface between the storage server and the storage engine.

No single storage engine can do everything perfectly, so there will still be gaps, for example:

  • Write performance on spinning disks or other disk subsystems where random I/O is extremely expensive. Redwood, like the ssd2 engine, is optimized for solid state disks. If all or part of your data is big and cold, for cost reasons you might prefer to store it on a disk subsystem which has much worse random I/O performance. Something more toward the LSM-tree part of the design space, which does writes in huge blocks, would do well here. Likewise, it should be possible to replicate data to (for example) one btree on SSD and two LSM trees on spinning disks, and direct all random reads to the btree, giving you the read performance of the btree and fault tolerance at little more than 1/3 the storage cost of having all replicas on btrees.

  • Performance with extremely small random writes. Again, LSM trees have a theoretical advantage here. For example, if you have a frequently written table with many secondary indices, you might want to put the indices on an LSM storage engine.

  • Space overhead of replication. If you store very big and cold data, you might not want to store multiple copies of it anywhere at all. There are various replication schemes based on error correction (RAID-5 is the simplest one; systems like S3 also work on this principle) that can achieve high fault tolerance levels with much less than 2x space overhead. They aren’t a great fit for typical “hot” database data, but if you want FoundationDB to be the only stateful system in your architecture, you want it to be able to do this.

  • Compression. Redwood will offer key prefix compression, but it’s possible for other technologies to compress data more aggressively.

  • Optimization for specific use cases. For example, if you want to use FoundationDB as a block storage device you will be storing lots of exactly 4KiB values with tiny keys, and you could imagine a storage engine which could store the values sector-aligned with no I/O overhead.

  • Optimization for things like Optane or NVRAM, or direct access to raw flash memory without a firmware flash translation layer.

  • Ephemeral data. Maybe you know that certain key ranges are being used just for caching, and you don’t mind having them stored in RAM with no backing disk, instructing the database to simply clear any key ranges that are lost to failures. An in-memory variant of Redwood might be pretty good at this, but maybe you could do even better.
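
To put rough numbers on the mixed-replica and error-correction points in the list above, here is a back-of-the-envelope sketch. The relative prices (spinning disk at one tenth the cost of SSD per GB) and the 10+4 layout are assumptions made up for illustration; they are not measurements and not FoundationDB settings.

```python
# Hypothetical relative prices per stored GB; illustrative assumptions only.
SSD_COST = 1.0
HDD_COST = 0.1


def replica_cost(ssd_replicas: int, hdd_replicas: int) -> float:
    """Relative storage cost per GB of logical data for a given replica mix."""
    return ssd_replicas * SSD_COST + hdd_replicas * HDD_COST


def erasure_overhead(data_shards: int, parity_shards: int) -> float:
    """Space overhead factor of a (data + parity) error-correcting layout."""
    return (data_shards + parity_shards) / data_shards


# Three btree replicas on SSD versus one btree on SSD plus two LSM trees on HDD.
all_ssd = replica_cost(3, 0)
mixed = replica_cost(1, 2)
ratio = mixed / all_ssd

# A RAID-5-like layout: 4 data shards plus 1 parity shard survives one lost shard.
raid5_like = erasure_overhead(4, 1)

# An assumed wider layout: 10 data shards plus 4 parity shards.
wide = erasure_overhead(10, 4)
```

Under these assumed prices the mixed layout costs about 0.4 of the all-SSD layout, which is the "little more than 1/3" figure, and both error-correcting layouts stay well under the 2x overhead of keeping two full copies.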

In general, having multiple storage engines becomes much more useful with the ability to assign different key ranges, and different replicas of a given key range, to different storage engine types. This core capability doesn’t exist yet, but should be relatively straightforward to add.
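
As a minimal sketch of what such an assignment could look like, here is a toy range map from key ranges to one storage engine type per replica. Everything in it is hypothetical: the `EngineMap` class, its methods, and the `"lsm"` engine name are invented for illustration, and none of this is an existing FoundationDB interface.

```python
import bisect


class EngineMap:
    """Toy map from key ranges to a tuple of engine types, one per replica."""

    def __init__(self, default):
        # starts[i] is the first key of range i; engines[i] applies from
        # starts[i] up to (but not including) starts[i + 1].
        self.starts = [b""]
        self.engines = [default]

    def lookup(self, key: bytes):
        """Return the engine tuple for the range containing `key`."""
        i = bisect.bisect_right(self.starts, key) - 1
        return self.engines[i]

    def assign(self, begin: bytes, end: bytes, engines) -> None:
        """Set the engine tuple for the half-open key range [begin, end)."""
        if begin >= end:
            raise ValueError("begin must sort before end")
        # Remember what applied at `end` so keys from `end` on are unchanged.
        after = self.lookup(end)
        lo = bisect.bisect_left(self.starts, begin)
        hi = bisect.bisect_right(self.starts, end)
        # Replace every boundary inside [begin, end] with the two new ones.
        self.starts[lo:hi] = [begin, end]
        self.engines[lo:hi] = [engines, after]


# Hypothetical policy: btrees everywhere by default, one btree plus two LSM
# replicas for secondary indices, and RAM-only storage for a cache range.
m = EngineMap(default=("ssd", "ssd", "ssd"))
m.assign(b"idx/", b"idx0", ("ssd", "lsm", "lsm"))
m.assign(b"cache/", b"cache0", ("memory",))
```

A read for `b"idx/users/42"` would then resolve to the mixed btree and LSM placement, while keys outside the assigned ranges keep the default. The hard part in a real system is moving data when an assignment changes, which this sketch does not attempt.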



This storage engine looks promising

(seddonm1) #24

I’m looking forward to the Redwood Storage Engine PrefixTree (prefix “compression”) - it should help a lot with how I am looking to store data and compare with RocksDB. Looking forward to the video.