Page compression in Redwood


#1

Hello,

Is there any plan to implement page compression (something like the transparent page compression in MySQL/InnoDB) in Redwood? It could be very desirable for many users. Of course, the compression ratio strongly depends on the page size, and one may have to rely on the hole-punching support of the underlying filesystem. Whether the typical page size in Redwood is too small (4kB?) to benefit from page compression? Thanks!


(Steve Atherton) #2

Keys in Redwood are prefix compressed. No other form of compression is planned at this time. Applications with large values can of course compress them client side.


(Steve Atherton) #3

I’m sorry but I don’t quite follow. Compression ratio is mostly going to depend on the compression type and the compressibility of the data. How the compressed bytes are then stored in a file on disk is another matter. Writing pages (compressed or not) as a whole number of blocks on a block boundary happens to be the most efficient way to read from and write to a disk, but if minimal space usage is the main goal then finer resolution could be used. In other words, if a 4k uncompressed page becomes 2k compressed it could be colocated with all or part of another compressed block on disk so long as the pager knows how to find it by page number when asked.

I’m not sure what you are envisioning using hole punching for. Using it to mark freed pages as unused is always good, but I’m not sure how that relates to compression. If you are thinking you can use it to erase or insert blocks in a file as pages change compressed size, consider that this will badly fragment the file’s extents not to mention change the offset of every page location after an edit point. Please let me know if I’ve completely missed something.

Also, one of the goals of the Redwood design is to have a well defined and separate Pager interface to allow for different pager designs. (This interface actually exists already but its contract is not in final form or well documented). Theoretically one could use this to implement a transparently compressing pager.


#4

Hi Steve,

Sorry for not clarifying the above statement. In addition to compression type (e.g., lz4 vs. zlib) and data compressibility (e.g., text string vs. floating point number), the size of data chunk (e.g., 4kB vs. 16kB) to which compression is applied also plays an important role. This is because general-purpose compression algorithms (e.g., Snappy, lz4, zlib, and ZSTD) all use LZ search that tries to find repeated bytes in the to-be-compressed data chunk and accordingly use a pointer to replace those repeated bytes. In general, the larger the to-be-compressed data chunk is, the higher probability that we may find more repeated bytes and hence achieve a better compression ratio. My experience is that compression-over-16kB can achieve 10%~30% better compression ratio than compression-over-4kB.

As you pointed out, aligning compressed data with storage sector boundary (e.g., 4kB) is always desirable (especially for storage engines using B-tree structure like InnoDB and Redwood). Taking the transparent page compression feature in InnoDB as one example, with the default 16kB page size, after applying compression to each 16kB page, InnoDB aligns the compressed page to the 4kB boundary (e.g., if we compress one 16kB page to 10kB, then the page is stored as 12kB, wasting 2kB). To simplify the data management, InnoDB relies on sparse file and hole punching support from the underlying filesystem to return the unused one or more 4kB physical sectors back to the filesystem. Because of the use of sparse file, we do not need to change the LBA allocation when the compression results change. Yes, it certainly creates fragmentation, which is nevertheless not a big problem for modern SSDs.

The 4kB-alignment requirement indeed could cause significant space waste or even make page compression unfeasible (e.g., InnoDB/Redwood with 4kB page size). One of the major selling points of LSM-tree (e.g., RocksDB) is that its compression is not subject to the 4kB-alignment constraint, hence MyRocks claims to achieve better compression efficiency than InnoDB. Certainly prefix key encoding also helps MyRocks to improve the compression efficiency.

The Pager interface sounds interesting. Could you please point me to relevant documents or codes so that I could learn more about it? Thanks!


(Steve Atherton) #5

Conceptually, a Pager must be able to

  • Write a Page by page ID at a given Version
  • Read a Page by page ID at a given Version
  • Commit changes since last Commit atomically
  • Delete old (non-latest) versions of each page older than some given cutoff Version

How the pager organizes, stores, and reclaims versions of pages and whether or not they are compressed is internal to the pager.

The interface as it exists today is here but the comments are likely a bit out of date, the interface will be changing in the future, and the first implementation of the interface is not yet complete.


(Steve Atherton) #6

Thanks for the detailed explanations. For some reason I was thinking that compression dictionaries would not be created and stored per-page, but I guess that approach makes the most sense.

The hole-punching scheme is interesting but I would be concerned about overhead from fragmentation at the filesystem level (not the SSD). That said, I don’t have a better idea, and it sounds like it works well enough. (To use a concrete example: In ext4, a file extent tree is made up of internal nodes of around 340 entries each and leaf nodes of also around 340 entries each. Let’s say you have a 1TB sparse file of 16kB pages and at any given time 90% of pages compress to 12kB or smaller. That would require an extent tree of about 60 million extents, which itself is a 5-level B+tree of 4kB blocks which is over 700MB in size. If that isn’t in RAM at all times then page reads could incur an additional disk read latency, and of course any time a page changes its compressed block footprint a 4kB block of the extent tree must also be rewritten.)


#7

Hi Steve,

Thank you very much for the information and sharing your insightful comments. Yes, although hole-punching simplifies the design of applications, it indeed complicates the operation of the underlying filesystem and may incur performance penalty. Seems the ideal solution would be that SSDs can carry out compression internally and expose a thin-provisioned LBA space. Thanks.


(David Scherer) #8

If you were going to try to add page compression to redwood, you would want to do it while building the page rather than transparently at the pager level, so that you could target a compressed rather than uncompressed 4k page. But it would be critical to read performance to store the uncompressed page in memory. The great thing about prefix compressed pages is that they can still be searched fast.

I’m not super enthusiastic about this direction, but it’s possible. You could even get into maintaining shared compression dictionaries based on a sample of pages, but the engineering gets ugly when you have to retire dictionaries.


(Steve Atherton) #9

Growing an uncompressed searchable structure incrementally and stopping when its compressed size reaches some target without going over is a bit tricky. I agree it’s better but it’s also more complex, while a transparently compressing pager presenting a larger page size is pretty straightforward to implement and can still be beneficial for the right data set.

But I think if I were going to prioritize full key/value compression in Redwood, I would try to add some sort of dictionary to the prefix compressed binary tree structure so that full page decompression would not be required for search. The key suffix and value bytes could be encoded using a dictionary which is stored aligned to the end of the page and the binary tree grows toward it. Insertions into the tree are still possible (without balancing, new nodes are appended and leaf child links are updated) and the new key suffix and value can be encoded using the existing dictionary. A full page rebuild would build a new, optimal dictionary for the current data set.