Redwood page fillfactor support

Hello,

We are now testing FoundationDB (with the Redwood storage engine) on compression-capable NVMe SSDs from ScaleFlux, which internally compress each 4KB block, transparently to the filesystem and applications. Results show very good storage cost savings (over 2x). I wonder whether Redwood supports (or will support) a user-configurable page fillfactor (like the one in PostgreSQL and Oracle) that partially fills each page, reserving some space in each page for future insertions/updates. Fillfactor lets users configure the trade-off between storage cost and performance on normal SSDs; on compression-capable SSDs, it lets users improve performance at almost zero extra storage cost. Does or will Redwood support a page fillfactor?

Thanks,
Tong Zhang

Yes. It isn’t currently run-time configurable, but I plan to have some parameters exposed that determine how Redwood splits data into pages. I have not yet decided what these parameters will be. A fill factor or something like it is likely.

I’m curious about the shape of your data: things like key size, value size, whether the values are compressible, and whether the keys share common prefixes or repeat suffixes under different prefixes.

Redwood only stores unique key prefix bytes, but values are stored as-is, repeated suffixes under different prefixes are duplicated on disk, and there is a per-KV-pair overhead of about 10 bytes.
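As a rough mental model of that storage cost, here is a toy Python sketch. It is not Redwood's actual on-disk format; the 10-byte per-pair overhead is just the approximate figure mentioned above, and the prefix elimination is modeled against the previous key only.

```python
# Toy estimate of on-disk KV size under prefix compression: store only the
# bytes of each key that differ from the previous key's prefix, the full
# value, and ~10 bytes of per-pair overhead. Illustrative model only, not
# Redwood's exact format.

def shared_prefix_len(a: bytes, b: bytes) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def estimate_size(pairs, overhead_per_pair=10):
    """pairs: iterable of (key, value) byte strings, sorted by key."""
    total = 0
    prev_key = b""
    for key, value in pairs:
        unique_key_bytes = len(key) - shared_prefix_len(prev_key, key)
        total += unique_key_bytes + len(value) + overhead_per_pair
        prev_key = key
    return total

pairs = [(b"user/0001", b"v" * 100), (b"user/0002", b"v" * 100)]
# Second key shares the 8-byte prefix b"user/000", so only 1 unique key byte:
# (9 + 100 + 10) + (1 + 100 + 10) = 230
print(estimate_size(pairs))  # 230
```

Note that the values (200 bytes here) are counted in full: as described above, only key prefixes are deduplicated, not value contents.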

Hi Steve,

Great to know that it is being planned! In our initial testing, we used YCSB with a value size of 1KB, and keys that do not share common prefixes. Each value is taken from a corpus file with over a 2:1 compression ratio. The purpose of this initial testing was to confirm that, given good raw data compressibility, the compression-capable SSD could indeed transparently and largely reduce the storage footprint of FoundationDB. The SSD carries out zlib compression in its internal hardware engine (2.2GB/s compression and 3GB/s decompression, with a few microseconds of decompression latency, compared with 70~90 microseconds of flash memory chip read latency). It works best for B-tree based data management systems like Redwood (and MySQL, PostgreSQL). Could you suggest ways to do some further testing (maybe with smaller KV sizes) on Redwood with the compression-capable SSD? Also, when will the fillfactor feature be available for us to try? We saw very good benefits when playing with the fillfactor in PostgreSQL. Thanks!

Redwood currently still has very high CPU overhead for random insertions, so you will get much higher write speeds when writing keys and values in sequential clustered groups. The larger the groups, the lower the CPU overhead. This also goes for key/value sizes: larger KV pairs have less CPU overhead.

It would be a very interesting experiment to try Redwood on your compressing SSD with prefix compression turned off. Unfortunately there isn’t a way to flip that switch at the moment, but I do plan to make it a configuration option for use with largely incompressible keys.

Regarding the fill factor, is the idea to lower it because your SSD will prevent waste of most of the slack space and then page splits will be less frequent?

Thanks for the comments and suggestions. We will do some more experiments with different write patterns and KV sizes. It would be great if the prefix compression could be made configurable.

Regarding the fill factor, yes, exactly as you pointed out: the objective is to make page splits less frequent, since a compression-capable SSD can highly compress the slack space in each page, and hopefully this leads to higher performance for write-heavy workloads. Intuitively, I feel that compression-capable SSDs could make B-tree-based KV stores more attractive than log-structured merge-tree based KV stores like RocksDB in many applications.

Hi Steve, we just finished some further YCSB testing with smaller KV sizes (100-byte and 300-byte), and indeed the performance is noticeably lower than with the larger KV size. Still, we see over a 3:1 compression ratio on our compression-capable SSD. Does this mean that being able to adjust the page fillfactor is more beneficial for scenarios with small KV sizes?

I would expect that the ratio of KV size to average page slack will mostly determine the number of page splits/rebuilds per write. So from the perspective of minimizing page splits, yes: smaller KV sizes benefit more for the same slack size, and a higher slack size is better.
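A back-of-the-envelope sketch of that ratio, in Python. The `page_size` and `fill_factor` values here are purely illustrative parameters for the model, not Redwood settings, and the model ignores everything except slack consumption:

```python
# Toy model: with fill factor f and page size P, a freshly rebuilt page has
# roughly (1 - f) * P bytes of slack, so it must be split/rebuilt about once
# per slack/kv_size insertions. The split rate per insertion therefore
# scales with kv_size / slack. Illustrative numbers only.

def splits_per_insert(kv_size, page_size=4096, fill_factor=0.66):
    slack = (1 - fill_factor) * page_size
    return kv_size / slack

# Smaller KV pairs fit more insertions into the same slack before a split:
print(splits_per_insert(100))
print(splits_per_insert(1000))
```

Under this model, lowering the fill factor (more slack) reduces splits proportionally, which is exactly the trade a compressing SSD makes cheap.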

Assuming your write pattern is random, I suspect the biggest reason you saw a performance drop is just the not-yet-optimized part of Redwood which I’m working on now. Redwood currently splits values over 256 bytes into chunks of up to 256 bytes each, so from Redwood’s perspective, writing 1000-byte values vs 250-byte values at the same KV-bytes rate means writing roughly the same number of keys internally. However, in the 1000-byte case those internal records come in groups of 4 that are definitely sequential, whereas in the 250-byte case, if the records are random, they incur a lot of CPU overhead because of how Redwood finds mutation points in the tree during its commit path. This will be fixed very soon; I’ll update this thread once the changes are merged.
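The chunking arithmetic above can be made concrete with a short sketch (the 256-byte chunk size is the figure stated above; the function is illustrative, not Redwood code):

```python
import math

CHUNK = 256  # chunk size described above; values <= 256 bytes stay whole

def internal_records(value_len: int) -> int:
    """Number of internal records one value becomes after chunking."""
    return max(1, math.ceil(value_len / CHUNK))

# A 1000-byte value becomes 4 sequential internal records; a 250-byte value
# is 1 record. At the same KV-bytes rate, both workloads generate roughly
# the same internal record count, but only the former's are guaranteed to
# arrive in sequential runs.
print(internal_records(1000), internal_records(250))  # 4 1
```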

Hi Steve,

Thank you very much for the information. Yes, we used random write patterns in our tests. We look forward to the further-optimized Redwood. Meanwhile, I wonder whether Redwood could possibly get record schema information from the FoundationDB Record Layer? For structured records, we could apply some very simple data transformations to further improve FoundationDB page compressibility. We are now doing such research using MySQL/InnoDB as a test vehicle, and have seen significant improvement in page data compressibility. We are very excited about the unexplored potential for B-tree based data stores to fully benefit from new storage hardware with built-in transparent compression :slight_smile:

The record schema information should be controlled (i.e., written) by the Record Layer (@alloc correct me if I’m wrong). I doubt it is a good idea to let the storage engine infer the schema information (if that is even possible). It’s reasonable to let the Record Layer write the schema info to a special key space, which can be stored in a subset of the storage engine.

Do you have a pointer to how MySQL/InnoDB does that? I’m happy to take a look.

Right, the schema is currently entirely managed by the Record Layer without actually ever telling the key-value store or the storage engine anything about it or even necessarily storing it in FDB. (And when it is in FDB, FDB has no way of knowing that this data is special meta-data.)

It, well, “could” be pushed down, but it would be quite a bit of plumbing. This is especially true because of the multi-tenancy model, i.e., the Record Layer expects to be able to store different databases (or record stores, or what have you) in the same FDB cluster with different schemata, so you would need to rendezvous a record with the schema for that particular subspace (or something like that).

Note also that the Record Layer allows you to provide an encrypting serializer, which probably makes knowing the schema itself less useful (as it’s just encrypted data if enabled).

That being said, depending on the nature of the optimizations being discussed, it may still be possible to do some amount of optimization.


Meng and Alec,

Thank you very much for the information! Regarding MySQL, its storage engine InnoDB has access to the table schema information, and each table has its own B-tree. Hence, we could readily leverage the schema information to carry out data transformations within each 16KB page in order to improve on-storage page compressibility (over 40% improvement in our experiments). Does this mean that, in Redwood, records with different schemas could share the same 4KB page (at least in the multi-tenancy model)? Intuitively, B-trees and compression-capable SSDs are an ideal match for each other, and it would be very interesting to see Redwood take full advantage of compression-capable SSDs.
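To illustrate the kind of schema-aware transformation being discussed, here is a toy Python sketch. It is not what InnoDB or Redwood does; it just shows one simple transform: with a known fixed-width record schema, grouping the same field from many records together ("column-major" layout) often compresses better than interleaved row-major records, because similar bytes become adjacent.

```python
import zlib

# Hypothetical fixed-schema records: (4-byte counter, constant tag,
# low-cardinality 8-byte field). Purely synthetic example data.
records = [(i.to_bytes(4, "big"), b"status-ok", bytes([i % 3]) * 8)
           for i in range(200)]

# Row-major: fields interleaved record by record (typical page layout).
row_major = b"".join(b"".join(r) for r in records)

# Column-major: each field's bytes from all records grouped together.
col_major = b"".join(b"".join(field) for field in zip(*records))

# Same bytes, different order; compare how well each layout compresses.
print(len(zlib.compress(row_major)), len(zlib.compress(col_major)))
```

The transform is lossless and reversible given the schema (fixed field widths), which is why access to schema information is the prerequisite for applying it inside a page.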

@Passion

The upcoming FDB 6.3 will include a much more CPU-optimized version of Redwood that should be able to push your SSD much harder. It also has knobs for page fill factor and page size. Fill factor can be changed at any time and will affect page builds on existing files, but be aware that the page size knob only applies to Redwood files created after the change. Existing Redwood files will use whatever page size they were created with.

The knobs are REDWOOD_DEFAULT_PAGE_SIZE which defaults to 4096, and REDWOOD_PAGE_REBUILD_FILL_FACTOR which defaults to 0.66.
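For example, assuming these knobs are set the same way as other FDB knobs (a `knob_`-prefixed, lowercased entry in the `[fdbserver]` section of `foundationdb.conf`, or the equivalent `--knob_` command-line argument), a sketch of a config fragment might look like:

```ini
# foundationdb.conf fragment (illustrative; knob names assumed to follow
# the usual FDB lowercased knob_ convention)
[fdbserver]
knob_redwood_default_page_size = 4096
knob_redwood_page_rebuild_fill_factor = 0.5
```

Remember from the note above that the page size value only takes effect for Redwood files created after the change.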

FDB 6.3 will be out sometime soon, or if you don’t want to wait you can build FDB yourself from


which has all of the changes now.

Hi Steve, thank you very much for the notice; we will give it a try and share the results.

Oh I should also point out, you can’t use a page size less than 4096. Is this something that you wanted to do?

4KB page size is perfect for us, and we should never need to reduce the size below 4KB. Thanks!