FoundationDB on SSDs with atomic write support


I wonder whether it makes sense to enhance the storage engines of FoundationDB (SQLite and the upcoming Redwood) so that they can natively leverage the atomic write support of SSDs. Because SSDs write indirectly (out of place), they can easily support atomic writes for a request spanning multiple consecutive LBAs, and such SSDs are emerging on the market. If the underlying SSD supports atomic writes, the FoundationDB storage engine could disable journaling/WAL (SQLite) and indirect mapping (Redwood), which may lead to lower write amplification and even higher performance. I would sincerely appreciate any comments and advice!

I think you mean non-consecutive. Atomic write support enables a set of writes to arbitrary LBAs to be made atomically.

And yes, this feature should make it possible to atomically update SQLite’s B-Tree without the use of a WAL.

In Redwood, however, the indirection layer [ (logical page, version) → physical page ] is not just for facilitating atomic tree updates. It also enables old versions of pages to be kept for a configurable amount of time so that clients can efficiently read from the database at older data versions. Atomic writes provided by an SSD will not fill this need, but it is possible that Redwood could make use of atomic writes when configured not to retain any version history.
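
To make the role of that indirection layer concrete, here is a minimal sketch of a (logical page, version) → physical page map. It is an illustration only, not Redwood's actual data structure: the class `VersionedPageMap` and its methods `remap`, `lookup`, and `expire` are hypothetical names invented for this example.

```python
from bisect import bisect_right


class VersionedPageMap:
    """Toy (logical page, version) -> physical page map (hypothetical, not Redwood code)."""

    def __init__(self):
        # Per logical page: parallel lists, sorted by version.
        self._versions = {}
        self._physical = {}

    def remap(self, logical, version, physical):
        """Record that `logical` is stored at `physical` as of `version`."""
        versions = self._versions.setdefault(logical, [])
        pages = self._physical.setdefault(logical, [])
        if versions and version <= versions[-1]:
            raise ValueError("versions must be strictly increasing per logical page")
        versions.append(version)
        pages.append(physical)

    def lookup(self, logical, read_version):
        """Return the physical page visible at `read_version`, or None."""
        versions = self._versions.get(logical, [])
        i = bisect_right(versions, read_version)
        return self._physical[logical][i - 1] if i else None

    def expire(self, oldest_readable):
        """Drop mappings that no reader at or after `oldest_readable` can see.

        Returns the physical pages that became free.
        """
        freed = []
        for logical, versions in self._versions.items():
            # Keep the newest mapping at or before oldest_readable, plus all later ones.
            keep_from = max(bisect_right(versions, oldest_readable) - 1, 0)
            freed.extend(self._physical[logical][:keep_from])
            del versions[:keep_from]
            del self._physical[logical][:keep_from]
        return freed


page_map = VersionedPageMap()
page_map.remap(logical=7, version=10, physical=100)
page_map.remap(logical=7, version=20, physical=101)
print(page_map.lookup(7, 15))  # a reader at version 15 still sees physical page 100
```

An atomic in-place overwrite of physical page 100 would remove the need for the remap at version 20, but it would also destroy what the reader at version 15 expects to find, which is why atomic writes alone only help when no version history is retained.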

Thank you very much for sharing your comments. For SSDs to support atomic writes over non-consecutive LBAs, the interface has to be enhanced so that applications/filesystems can pass the atomic group information to the SSD. This may not be trivial and, to my understanding, is not supported by current standards like NVMe and SATA. What I meant was an atomic write spanning consecutive LBAs (which may not require an interface change if we can ensure that the write over consecutive LBAs falls into one BIO at the Linux block layer). However, that seems to be too strict to be useful to SQLite and Redwood. Thanks!
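
As a rough illustration of how strict the consecutive-LBA constraint is, the sketch below checks whether a set of dirty blocks could be covered by a single atomic write. The helper names and the `max_atomic_blocks` limit are assumptions made up for this example; they do not correspond to any real device or kernel API.

```python
def contiguous_runs(lbas):
    """Group LBAs into maximal runs of consecutive addresses, as (start, length) pairs."""
    runs = []
    for lba in sorted(set(lbas)):
        if runs and lba == runs[-1][0] + runs[-1][1]:
            # This LBA directly follows the current run, so extend it.
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            runs.append((lba, 1))
    return runs


def fits_one_atomic_write(lbas, max_atomic_blocks):
    """True if all dirty LBAs form one consecutive range no longer than the assumed limit."""
    runs = contiguous_runs(lbas)
    return len(runs) == 1 and runs[0][1] <= max_atomic_blocks


print(contiguous_runs([5, 3, 4, 9]))                          # [(3, 3), (9, 1)]
print(fits_one_atomic_write([3, 4, 5], max_atomic_blocks=8))     # True
print(fits_one_atomic_write([3, 4, 5, 9], max_atomic_blocks=8))  # False
```

A commit that dirties even one page outside the main run fails the check, so a storage engine would still need a WAL or similar mechanism for that case.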

The nature of B-Trees and B+Trees is such that making changes to the key space at or between existing keys will result in an essentially random block write pattern, so atomic writes over consecutive LBAs are not useful here. In contrast, a log-structured merge tree always issues serial block writes regardless of where the key space is changing.
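
The difference in write patterns can be sketched with a deliberately simplified model. Everything here is an assumption for illustration: `PAGE_KEYS`, the fixed key-to-leaf layout, and the function names are hypothetical, and the model ignores internal nodes, page splits, and LSM compaction.

```python
PAGE_KEYS = 100  # assumed number of keys per page (hypothetical)


def btree_dirty_pages(keys):
    """Leaf pages an update-in-place B+Tree would rewrite.

    Assumes leaf page i holds keys in [i * PAGE_KEYS, (i + 1) * PAGE_KEYS).
    """
    return sorted({k // PAGE_KEYS for k in keys})


def lsm_dirty_pages(keys, log_tail):
    """Pages an LSM flush would write: keys packed into fresh pages starting at log_tail."""
    n_pages = -(-len(set(keys)) // PAGE_KEYS)  # ceiling division
    return list(range(log_tail, log_tail + n_pages))


def is_consecutive(pages):
    """True if the page numbers form one unbroken ascending range."""
    return all(b == a + 1 for a, b in zip(pages, pages[1:]))


updated_keys = [5, 100_000, 999_999]
print(btree_dirty_pages(updated_keys))              # [0, 1000, 9999]
print(lsm_dirty_pages(updated_keys, log_tail=500))  # [500]
```

In this model the B+Tree commit touches three pages that are far apart, while the LSM flush lands in one consecutive range no matter which keys changed.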

I have not looked into the implementation details or the APIs for this feature. From these slides proposing an atomic write interface several years ago, I had the impression that arbitrary non-consecutive LBAs in the same atomic write were meant to be supported, but I guess what vendors ended up doing is different…

Yes, over the years quite a few academic papers have picked the low-hanging fruit of realizing atomic write support over non-consecutive LBAs. Unfortunately, only Fusion-io implemented it, many years ago, with its own NVMFS. Many thanks!