Maximum file size on ext4

Hello,

Is it possible for the storage or tlog server to encounter a condition where it exceeds the maximum file size on ext4, i.e., 16 TiB? In other words, assuming a data file grew to this size and a write failed with an I/O error, does the data file get rolled over into a different/new file, or does the process just fail and degrade?

I see that there are two knobs, STORAGE_HARD_LIMIT_BYTES and TLOG_HARD_LIMIT_BYTES. Are these settings guaranteed to limit the size of the data files used by the storage and tlog servers, so that they can never exceed the maximum file size imposed by the filesystem?

Thank you in advance

Unfortunately, as far as I know, neither the DiskQueue structure (which backs the TLog) nor Redwood pays attention to filesystem file size limits specifically.

Redwood’s max file size with the default page size is 34 TB. Hitting the filesystem limit would manifest as a truncate(larger_size) failure, at which point the StorageServer would fail. The Redwood data file would still be readable and undamaged as of the last StorageServer commit, but unfortunately the StorageServer would fail repeatedly because FDB currently cannot use read-only StorageServer state.

Note that Redwood only grows its data file when its internal free block lists are empty. Small range clears free blocks immediately; large range clears free blocks on a delay, but still very quickly, well over 1 GB/s on a single Redwood instance, because large range clears only need to read a tiny fraction (<<1%) of the data being cleared. Just to complete the picture, the reason Redwood does not shrink its data files is to avoid the cost of compacting space internally to enable truncating the file to a smaller size.
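
Since neither engine will stop growing its files on its own, the practical mitigation today is to watch data file sizes externally. Here is a minimal sketch of what that could look like (just an illustration, not an FDB tool; the directory and warning threshold are assumptions you would adjust for your deployment):

```python
# Illustration only, not an FDB tool: walk an fdbserver data directory and warn
# when any file approaches the ext4 per-file limit (16 TiB with 4 KiB blocks).
import os
import sys

EXT4_MAX_FILE_BYTES = 16 * 2**40   # 16 TiB
WARN_RATIO = 0.8                   # arbitrary threshold chosen for this sketch

def check_data_dir(data_dir: str) -> None:
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            path = os.path.join(root, name)
            size = os.stat(path).st_size
            if size >= WARN_RATIO * EXT4_MAX_FILE_BYTES:
                pct = 100 * size / EXT4_MAX_FILE_BYTES
                print(f"WARNING: {path} is {size / 2**40:.2f} TiB ({pct:.0f}% of the ext4 limit)")

if __name__ == "__main__":
    # Example: python3 check_fdb_file_sizes.py /var/lib/foundationdb/data
    check_data_dir(sys.argv[1] if len(sys.argv) > 1 else ".")
```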

STORAGE_HARD_LIMIT_BYTES defaults to 1.5 GB and has nothing to do with disk usage; it relates to the in-memory Storage Queue, which holds non-persisted commits and the MVCC structure that enables reads at different versions.

TLOG_HARD_LIMIT_BYTES does relate to the file sizes used by the DiskQueue structure, but despite the name I’m not sure it is actually a hard limit in an unhealthy case where there is a long tail of old versioned data that has not yet been popped by all consumers.

Following up on this: no, this is NOT a hard limit on the TLog DiskQueue file size. The disk state of one TLog is always stored in two files. If the cluster is unhealthy and the TLog is accumulating old unconsumed data, one of these files will grow past the “hard limit” until the other file is fully consumed.
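
To make that concrete, here is a toy model of the behavior described above. It is my simplification for illustration only, not the actual DiskQueue code: pushes always append to the current file, and the writer can only switch to the other file once that file has been fully popped, so if consumers fall behind, the current file keeps growing regardless of the configured limit.

```python
# Toy model of the two-file DiskQueue behavior described above -- a deliberate
# simplification for illustration, not the real implementation.
class ToyDiskQueue:
    def __init__(self, limit_bytes: int):
        self.limit = limit_bytes
        self.files = [0, 0]   # bytes currently held in each of the two files
        self.writing = 0      # index of the file new pushes are appended to

    def push(self, nbytes: int) -> None:
        other = 1 - self.writing
        # The writer may only switch files once the other file is fully popped;
        # otherwise it must keep appending to the current file, past any limit.
        if self.files[self.writing] >= self.limit and self.files[other] == 0:
            self.writing = other
        self.files[self.writing] += nbytes

    def pop(self, nbytes: int) -> None:
        # Consumers pop the oldest data first, which lives in the non-writing
        # file whenever that file is non-empty.
        other = 1 - self.writing
        target = other if self.files[other] > 0 else self.writing
        self.files[target] = max(0, self.files[target] - nbytes)

q = ToyDiskQueue(limit_bytes=100)
for _ in range(50):
    q.push(10)                # 500 bytes pushed, nothing popped
print(q.files)                # [100, 400]: one file grew far past the "limit"
```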

Is internal fragmentation an issue with Redwood? With InnoDB we’ve run into substantial internal fragmentation issues on tables that rapidly churn through data.

We have not found Redwood’s internal slack space to be a problem, despite the fact that Redwood does not currently merge internal pages which have high slack. I plan to add this in the future, but it is very low priority because in practice we typically see ratios of logical KV bytes to Redwood structure size (including slack) of 0.9 to 1.2.

One reason for this is that FDB shuffles data around a good bit in response to writes, and any time the cluster relocates a data range from one Redwood instance to another it is effectively “compacted”: the destination builds a condensed, low-slack (or even negative-slack) subtree of that data, and the source deletes the subtree completely, which frees all of its blocks.

This incidental compaction is particularly effective at fighting internal slack because Redwood’s key prefix compression often more than pays for its structural overhead, and Redwood can bulk-build subtrees with nodes that are nearly full in compressed form, so their logical stored KV bytes are often greater than the node size. Another factor is that Redwood BTree nodes are variable-sized, as an integer multiple of the configured page size, so Redwood will upsize a node to hold more data in order to minimize slack in the node. For example, with an 8k page size and a large keyspace with 5k values, you would waste 3k per node if all nodes were 8k in size, but by using 16k nodes for that subtree you can fit 3 records with 15k of data in 16k of page space, which is 6% internal slack instead of 37%. Variable node sizing does not itself cause slack because the page components of a node do not have to be contiguous.
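
Spelling out the arithmetic in that example (only the numbers above are used; nothing else about Redwood's node format is assumed):

```python
# Slack arithmetic from the example above: 5 KiB values stored in nodes built
# from 8 KiB pages, comparing a one-page node to a two-page (16 KiB) node.
KIB = 1024
record = 5 * KIB

for node_size in (8 * KIB, 16 * KIB):
    records = node_size // record                 # whole records that fit
    slack = 1 - (records * record) / node_size    # fraction of the node wasted
    print(f"{node_size // KIB} KiB node: {records} record(s), {slack:.1%} slack")
# 8 KiB node: 1 record(s), 37.5% slack
# 16 KiB node: 3 record(s), 6.2% slack
```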

In order to accumulate a large amount of internal slack in Redwood BTree nodes, you would need a workload pattern where you delete most but not all of the records on many nodes and then don’t add any more records to them. You would have to do this in enough nodes that the waste is significant compared to the efficiency of the subtrees which do not hit this pattern, and the high-slack ranges would have to remain unmoved on the same StorageServer so they are never incidentally compacted.

All that said, production workloads have a way of finding worst-case scenarios, so it would be irresponsible not to have a fallback plan to combat Redwood internal node slack. That plan is the Perpetual Storage Wiggle, which essentially forces a move of all data periodically to compact it (safely, while maintaining the replication factor the entire time). The perpetual wiggle is mainly used for gradual-mode storage engine migration, which is strongly recommended for production clusters (as in, definitely use it), but it can also be left on, in which case it will “wiggle” (drain, refill) each StorageServer after it reaches a certain age, set by the knob DD_STORAGE_WIGGLE_MIN_SS_AGE_SEC, which defaults to 3 weeks. In practice, at Snowflake we increased this interval substantially and eventually just disabled the perpetual wiggle on our fleet because Redwood slack is not an issue at all.
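
For reference, a sketch of turning this on from a script (the configure parameters below are from the FDB 7.x docs for gradual migration and the perpetual wiggle; double-check them against your version, and the wrapper function itself is just an illustration):

```python
# Illustration only: enable gradual migration and the perpetual storage wiggle
# by shelling out to fdbcli. Verify the parameter names against your FDB version.
# (DD_STORAGE_WIGGLE_MIN_SS_AGE_SEC, mentioned above, is a server knob and is
# set on fdbserver rather than through fdbcli.)
import subprocess

def enable_perpetual_wiggle(cluster_file: str) -> None:
    subprocess.run(
        ["fdbcli", "-C", cluster_file, "--exec",
         "configure perpetual_storage_wiggle=1 storage_migration_type=gradual"],
        check=True,
    )

if __name__ == "__main__":
    enable_perpetual_wiggle("/etc/foundationdb/fdb.cluster")
```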
