Redwood page fillfactor support

Hello,

We are now testing FoundationDB (with the Redwood storage engine) on compression-capable NVMe SSDs released by ScaleFlux, which internally compress each 4KB block transparently to the filesystem and applications. Results show very good storage cost savings (over 2x). I wonder whether Redwood supports (or will support) a user-configurable page fillfactor (like the one in PostgreSQL and Oracle) that partially fills each page and reserves some space in each page for future insertions/updates? On normal SSDs, fillfactor lets users configure the trade-off between storage cost and performance. On compression-capable SSDs, fillfactor lets users improve performance at almost zero extra storage cost, since the reserved slack compresses away. Does or will Redwood support page fillfactor?

Thanks,
Tong Zhang

Yes. It isn’t currently run-time configurable, but I plan to have some parameters exposed that determine how Redwood splits data into pages. I have not yet decided what these parameters will be. A fill factor or something like it is likely.

I’m curious about the shape of your data: things like key size, value size, whether the values are compressible, and whether the keys share common prefixes or repeat suffixes under different prefixes.

Redwood only stores unique key prefix bytes, but values are stored as-is, repeated suffixes under different prefixes are duplicated on disk, and there is a per-KV-pair overhead of about 10 bytes.
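To make that storage accounting concrete, here is a rough back-of-envelope model in Python. This is only a sketch, not Redwood's actual on-disk format: it assumes each key in a sorted run stores only the bytes that differ from its predecessor, values are stored verbatim, and the per-record overhead is the approximate 10-byte figure quoted above.

```python
PER_RECORD_OVERHEAD = 10  # approximate per-KV overhead quoted above


def common_prefix_len(a: bytes, b: bytes) -> int:
    """Length of the shared byte prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def estimated_bytes(sorted_kv_pairs):
    """Estimate stored bytes for sorted (key, value) pairs under a
    simple prefix-compression model: each key stores only the bytes
    not shared with the previous key; values are stored as-is."""
    total = 0
    prev_key = b""
    for key, value in sorted_kv_pairs:
        unique = len(key) - common_prefix_len(prev_key, key)
        total += unique + len(value) + PER_RECORD_OVERHEAD
        prev_key = key
    return total


# Keys with a long shared prefix compress well; the 100-byte values do not.
pairs = sorted((f"user/{i:08d}".encode(), b"x" * 100) for i in range(1000))
print(estimated_bytes(pairs))
```

Under this model, the 13-byte keys above cost barely more than one byte each after the first, while value bytes and per-record overhead dominate — which matches the point that value and suffix redundancy is where a compressing SSD can still help.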

Hi Steve,

Great to know that it is being planned! In our initial testing, we used YCSB with a value size of 1KB, and keys without common prefixes. Each value is taken from a corpus file with an over 2:1 compression ratio. The purpose of this initial testing was to confirm that, given good raw data compressibility, the compression-capable SSD can indeed transparently and substantially reduce FoundationDB's storage footprint. The SSD carries out zlib compression in its internal hardware engine (2.2GB/s compression and 3GB/s decompression, with a few microseconds of decompression latency, compared with 70~90 microseconds of flash memory chip read latency). It works best for B-tree based data management systems like Redwood (and MySQL, PostgreSQL). Could you please suggest ways to do some further tests (maybe with smaller KV sizes) on Redwood using the compression-capable SSD? Also, when will the fillfactor feature be available for us to try? We saw very good benefits when playing with the fillfactor in PostgreSQL. Thanks!

Redwood currently still has very high CPU overhead for random insertions, so you will get much higher write speeds when writing keys and values in sequential clustered groups; the larger the groups, the lower the CPU overhead. The same goes for key/value sizes: larger KV pairs incur less CPU overhead.

It would be a very interesting experiment to try Redwood on your compressing SSD with prefix compression turned off. Unfortunately there isn’t a way to flip that switch at the moment, but I do plan to make it a configuration option for use with largely incompressible keys.

Regarding the fill factor, is the idea to lower it because your SSD will compress away most of the slack space, so page splits become less frequent at little physical storage cost?

Thanks for the comments and suggestions. We will do some more experiments with different write patterns and KV sizes. It would be great if the prefix compression could be made configurable.

Regarding the fill factor, yes, exactly as you pointed out: the objective is to make page splits less frequent, since a compression-capable SSD can highly compress the slack space in each page, and hopefully this leads to higher performance for write-heavy workloads. Intuitively, I feel that compression-capable SSDs could make B-tree-based KV stores more attractive than log-structured merge-tree based KV stores like RocksDB in many applications.

Hi Steve, we just finished some further YCSB testing with smaller KV sizes (100-byte and 300-byte), and indeed the performance is noticeably lower than with the large KV size. Still, we see an over 3:1 compression ratio on our compression-capable SSD. Does this mean that being able to adjust the page fillfactor is more beneficial for scenarios with small KV sizes?

I would expect that the ratio of KV size to average page slack will mostly determine the number of page splits/rebuilds per write. So from the perspective of minimizing page splits, yes, smaller KV sizes will benefit more for the same slack size, and a higher slack size is better.
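The intuition above can be sketched with a toy model (an illustration only, not Redwood's actual behavior): a page with `slack` free bytes absorbs roughly `slack / kv_size` random inserts before it must split, so expected splits per insert scale with `kv_size / slack`.

```python
def splits_per_insert(page_size: int, fill_factor: float, kv_size: int) -> float:
    """Approximate page splits per inserted KV pair for random inserts,
    under a simple slack-absorption model: a page splits after its
    slack (page_size * (1 - fill_factor)) fills up with new KV pairs."""
    slack = page_size * (1.0 - fill_factor)
    return kv_size / slack


# Lowering the fill factor (more slack per page) reduces split frequency;
# on a compressing SSD the extra slack costs almost no physical space.
for ff in (0.9, 0.7, 0.5):
    print(ff, splits_per_insert(4096, ff, 100), splits_per_insert(4096, ff, 1000))
```

By this model, dropping the fill factor from 0.9 to 0.5 cuts split frequency by 5x regardless of KV size, while for a fixed fill factor the 100-byte workload sees 10x fewer splits than the 1000-byte one.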

Assuming your write pattern is random, I suspect the biggest reason you saw a performance drop is just the not-yet-optimized part of Redwood which I’m working on now. Redwood currently splits values over 256 bytes into chunks of up to 256 bytes each, so from Redwood’s perspective, writing 1000-byte values vs 250-byte values at the same KV byte rate means writing roughly the same number of keys internally. However, in the 1000-byte case those internal records come in groups of 4 that are definitely sequential, whereas in the 250-byte case, if the records are random, they incur a lot of CPU overhead because of how Redwood finds mutation points in the tree during its commit path. This will be fixed very soon; I’ll update this thread once the changes are merged.
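The chunking described above can be sketched as follows (assumed behavior for illustration, not Redwood's actual code): values longer than 256 bytes become multiple internal records of up to 256 bytes each.

```python
CHUNK_SIZE = 256  # the chunk limit described in the post above


def chunk_value(value: bytes):
    """Split a value into internal chunks of at most CHUNK_SIZE bytes;
    values at or under the limit stay as a single record."""
    if len(value) <= CHUNK_SIZE:
        return [value]
    return [value[i:i + CHUNK_SIZE] for i in range(0, len(value), CHUNK_SIZE)]


# A 1000-byte value becomes 4 adjacent (sequential) internal records;
# four random 250-byte values produce the same record count, but with
# four scattered mutation points in the tree.
print([len(c) for c in chunk_value(b"x" * 1000)])  # [256, 256, 256, 232]
print(len(chunk_value(b"x" * 250)))                # 1
```

This is why the two workloads write roughly the same number of internal keys at equal KV byte rates, yet differ so much in commit-path CPU cost.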

Hi Steve,

Thank you very much for the information. Yes, we used random write patterns in our tests. We look forward to the further-optimized Redwood. Meanwhile, I wonder whether Redwood could possibly get record schema information from the FoundationDB Record Layer? For structured records, we could apply some very simple data transformations to further improve page compressibility. We are now doing such research using MySQL/InnoDB as a test vehicle, and have seen significant improvement in page data compressibility. We are very excited about the unexplored potential for B-tree based data stores to fully benefit from new storage hardware with built-in transparent compression :slight_smile: