Discussion thread for new storage engine ideas

I’m excited to see Redwood. I also love the design of the Bw-tree, for both in-memory and on-disk engines.


This may be a naive question, but one I haven’t seen addressed: as a user, why would I want a new storage engine built on these alternatives (RocksDB, LMDB, etc.) instead of the existing ones, and what would it offer me over the existing engine? There are a lot of suggestions in this thread for adding different key/value engines to FoundationDB, but little insight into why they would be good additions, what gaps they would fill, or where they would excel compared to the current storage engine.

This is in fact a great question. Some thoughts:

Redwood is exciting and will hopefully resolve some of the most important limitations of the existing storage engine. I hope that it will eventually become the default storage engine for most purposes. I’m sure we’ll learn more about its progress at the summit, but I suspect that long-running read transactions may not be available in its first release, precisely because they will require changes to the interface between the storage server and the storage engine.

No single storage engine can do everything perfectly, so there will still be gaps, for example:

  • Write performance on spinning disks or other disk subsystems where random I/O is extremely expensive. Redwood, like the ssd2 engine, is optimized for solid-state disks. If all or part of your data is big and cold, for cost reasons you might prefer to store it on a disk subsystem with much worse random I/O performance. Something closer to the LSM-tree part of the design space, which does its writes in huge blocks, would do well here. Likewise, it should be possible to replicate data to (for example) one btree on SSD and two LSM trees on spinning disks, and direct all random reads to the btree, giving you the read performance of the btree and fault tolerance at little more than 1/3 the storage cost of having all replicas on btrees (see the rough cost sketch after this list).

  • Performance with extremely small random writes. Again LSM trees have a theoretical advantage here. For example, if you have a frequently written table with many secondary indices, you might want to put the indices on an LSM storage engine.

  • Space overhead of replication. If you store very big and cold data, you might not want to store multiple full copies of it anywhere at all. There are various replication schemes based on error correction (RAID-5 is the simplest; systems like S3 also work on this principle) that can achieve high levels of fault tolerance with much less than 2x space overhead (again, see the cost sketch after this list). They aren’t a great fit for typical “hot” database data, but if you want FoundationDB to be the only stateful system in your architecture, you want it to be able to do this.

  • Compression. Redwood will offer key prefix compression, but it’s possible for other technologies to compress data more aggressively.

  • Optimization for specific use cases. For example, if you want to use FoundationDB as a block storage device you will be storing lots of exactly 4KiB values with tiny keys, and you could imagine a storage engine which could store the values sector-aligned with no I/O overhead.

  • Optimization for things like Optane or NVRAM, or for direct access to raw flash memory without a firmware flash translation layer.

  • Ephemeral data. Maybe you know that certain key ranges are being used just for caching, and you don’t mind having them stored in RAM with no backing disk, instructing the database to simply clear any key ranges that are lost to failures. An in-memory variant of Redwood might be pretty good at this, but maybe you could do even better.
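To put rough numbers on the mixed-replica and erasure-coding bullets above, here is a back-of-the-envelope sketch. The per-GB prices and the 10+4 coding parameters are made-up assumptions for illustration only, not a statement about how FoundationDB would implement any of this:

```python
# Back-of-the-envelope storage cost comparison (all prices are made up).
data_gb = 10_000        # 10 TB of logical data
SSD_PRICE = 0.10        # assumed $/GB-month for SSD
HDD_PRICE = 0.01        # assumed $/GB-month for spinning disk

# Triple replication with every replica on an SSD btree.
all_ssd = 3 * data_gb * SSD_PRICE

# One btree replica on SSD (serving the random reads) plus two LSM-tree
# replicas on spinning disk.
mixed = data_gb * SSD_PRICE + 2 * data_gb * HDD_PRICE

# Erasure coding: k data shards + m parity shards tolerate m losses with
# a space overhead of (k + m) / k instead of one full copy per replica.
k, m = 10, 4
erasure_coded = data_gb * (k + m) / k * HDD_PRICE

print(f"3x SSD btree replicas:    ${all_ssd:,.0f}/month")
print(f"1 SSD + 2 HDD replicas:   ${mixed:,.0f}/month")
print(f"{k}+{m} erasure coded on HDD: ${erasure_coded:,.0f}/month")
```

With these assumed prices the mixed layout costs about 40% of the all-SSD layout (approaching 1/3 as the cold tier gets cheaper), and the erasure-coded layout stores the data with only 1.4x space overhead while tolerating four simultaneous losses.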

In general, having multiple storage engines becomes much more useful with the ability to assign different key ranges, and different replicas of a given key range, to different storage engine types. This core capability doesn’t exist yet, but should be relatively straightforward to add.

Dave


This storage engine looks promising

I’m looking forward to the Redwood Storage Engine PrefixTree (prefix “compression”): it should help a lot with how I am looking to store data, and I’m curious how it compares with RocksDB. Looking forward to the video.

Video of Steve’s talk on FoundationDB Redwood is now online: https://www.youtube.com/watch?v=nlus1Z7TVTI


Usually hot and cold data coexist in the same database. Different storage engines may be optimised for one or the other, but not for both at the same time.

It would be great to be able to specify a particular storage engine at the key-range level instead of for the whole database.

A separate engine could be especially useful for the transaction log. Its access pattern - a large volume of sequential writes - would be ideal for spinning-disk systems.

I am not sure I understand you correctly. Are you talking about cost or performance? For performance, the block cache within FDB is working very well for us (we have a very high cache hit rate and use our IOPS mostly for writing). If you have mostly cold data and some hot data, a B-Tree usually performs quite well.

If we had a storage engine that optimizes for spinning disks this would be very interesting though. In that case FDB could use that for cold data. Or we could have a deeper hierarchy (spinning disk with an SSD cache).

All this being said, SSDs have become pretty inexpensive, so I am not sure how much need there is for a feature like this.

We’re currently working on a caching role which will allow users to keep several copies of read-hot data in memory in a consistent way. This is not the same thing as what you ask for but it solves part of that problem.

This is already the case. The tlog writes all data into a disk queue and keeps the data in memory, as most data only has to be stored in the tlog for a few seconds. Only when something becomes unhealthy can these disk queues become large. In that case the tlogs will index the disk queue using the same B-Tree implementation as the storage engine.

I don’t know how well this would perform on a spinning disk. My intuition is that it would perform well as long as no storage servers fall behind. But I would fear that spilling could seriously degrade performance.

We could potentially write a special tlog that optimizes for spinning disks. However, I am not convinced that this would be beneficial for most users. TLogs (even in large systems) will typically only hold a few hundred megabytes of data. In order to survive large failures (for example the loss of an availability zone on AWS) you would only need to keep updates on these disks until you can restore the cluster to a healthy state. So tlog disks don’t need to be very large. Therefore, using SSDs for tlogs is not very expensive in the first place (and you want to have fast tlogs - slow tlogs can have global effects on performance).

When the whole database is huge and generates a high volume of IOPS, usually only a small part of the data is responsible for most of that IO.

I agree that the era of spinning disks is passing. But the idea of storage tiering still applies. For example, several cloud providers offer storage volumes whose cost depends on IOPS: dense-IO volumes are more expensive than low-IO ones. So the desirable approach is to have large low-IO volumes plus a few small high-IO volumes, which is cheaper than one large high-IO volume.
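As a rough illustration, here is the arithmetic with made-up volume prices and an assumed 5% IO-hot fraction (both numbers are guesses; plug in your own):

```python
# Illustrative tiering arithmetic; the prices and hot fraction are assumptions.
total_gb = 10_000
hot_fraction = 0.05      # assume 5% of the data drives most of the IO

HIGH_IO_PRICE = 0.12     # assumed $/GB-month for a high-IOPS volume
LOW_IO_PRICE = 0.04      # assumed $/GB-month for a low-IOPS volume

everything_high_io = total_gb * HIGH_IO_PRICE
tiered = (hot_fraction * total_gb * HIGH_IO_PRICE
          + (1 - hot_fraction) * total_gb * LOW_IO_PRICE)

print(f"all data on high-IO volumes: ${everything_high_io:,.0f}/month")
print(f"tiered (5% hot on high-IO):  ${tiered:,.0f}/month")
```

Even with these fairly conservative prices the tiered layout comes out at roughly a third of the cost, which is exactly why I want key-range-level control over placement.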

I do not trust any of the existing automatic SSD-caching algorithms, and I prefer to distribute data manually between different storage tiers. I would expect FDB to give me that configuration capability on a key-range basis (i.e., to redefine the storage engine per key range).

Yes, that would also be very useful. It would not affect the IO volume, but it would reduce the network transmission volume dramatically, especially if the administrator is not limited to automatic detection of read-hot data and can manually specify which key ranges should be cached.

Yes, but when planning a production environment we must lease high-IO space and pay for it even when it is not being used most of the time. Being able to use cheaper space for the rest of the data would make a real difference.

  1. I could not see why the B-Tree index is necessary if the data is accessed sequentially.
  2. Even if it is necessary, there are several fast algorithms (mostly based on merge sorting) for building B-Tree structures using only sequential data access; they are widely used in relational databases when building indexes (rough sketch below).
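For point 2, here is a rough sketch of the kind of bottom-up bulk load I mean: when the input is already sorted, leaves can be packed left to right and each internal level built in one further sequential pass, with no random inserts. This is a generic illustration, not how FoundationDB’s TLog or Redwood actually builds anything:

```python
# Toy bottom-up B-Tree bulk load from already-sorted keys.
# Generic illustration only -- not FoundationDB code.
FANOUT = 4  # tiny fanout so the shape is easy to inspect

def chunks(items, size):
    """Split a list into consecutive runs of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def bulk_load(sorted_keys):
    # Leaf level: consecutive runs of keys become leaves (one sequential pass).
    level = [{"keys": run, "children": None} for run in chunks(sorted_keys, FANOUT)]

    # Each internal level groups up to FANOUT children and keeps the smallest
    # key of each child as a routing key -- again one sequential pass per level.
    while len(level) > 1:
        level = [{"keys": [child["keys"][0] for child in group], "children": group}
                 for group in chunks(level, FANOUT)]
    return level[0]

root = bulk_load(list(range(1000)))
print("root routing keys:", root["keys"])
```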

The data is written sequentially, but not read sequentially. O(number of storage servers) streams of commit data are written together in one file on the transaction logs, and then their independent streams of data are read by storage servers. This data is held in memory for some time, so that the semi-random reads don’t cost IO, but once memory fills up, the commit data is dropped from memory and the index of which pieces of the commit log correspond to which storage-server-specific streams of data is written to disk.
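To make that access pattern concrete, here is a toy model of it (this is not FoundationDB code, and tags and commits are heavily simplified): commits for many storage servers are appended to one sequential log, but each storage server only wants its own slice back, so once the data has been dropped from memory its catch-up reads land on scattered offsets in that file.

```python
# Toy model of the TLog access pattern: sequential writes, per-tag
# scattered reads. Not FoundationDB code; heavily simplified.
import random
from collections import defaultdict

NUM_STORAGE_SERVERS = 5
log = []                             # the single sequentially-appended log
offsets_by_tag = defaultdict(list)   # what an on-disk index must provide

# Commits arrive in version order; each carries mutations destined for an
# arbitrary subset of storage servers (identified here by a "tag").
for version in range(1, 101):
    for tag in random.sample(range(NUM_STORAGE_SERVERS), random.randint(1, 3)):
        offsets_by_tag[tag].append(len(log))
        log.append((version, tag, f"mutation@{version}->ss{tag}"))

# One storage server catching up ("peeking" its tag) after the data has
# fallen out of memory needs these scattered positions within the file:
tag = 2
print(f"storage server {tag} reads log offsets:", offsets_by_tag[tag][:10], "...")
```

The offsets_by_tag map above is the part that has to be persisted somehow once memory fills up, whether as a B-Tree or by some other scheme.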

For far more details, see https://github.com/apple/foundationdb/pull/2088


Overall, configuring storage engine per keyspace is a thing that numerous people have suggested, and I think everyone roughly agrees it’d be a good option to have, but no one has volunteered the substantial amount of time required to actually implement it yet.

Thanks for the article, it’s very interesting. After reading it, it seems that a solution may exist which does not require any B-Tree index for the TLog. Where can we discuss this?

Can you open a new topic in Development > FDB Core?