Redwood storage engine runs out of memory

Hi, All.

I’d like to compare the performance of different FDB storage engines, so I’m running some tests.

I have a test environment with a 3-node FDB cluster. Each node runs four fdbserver processes: one stateless process, one transaction process, and two storage processes. I also have a transaction source that can send test transactions to FDB with an internal ratekeeper.

I’ve set memory limits for the processes:

memory = 5GiB
cache_memory = 3GiB

This configuration works perfectly with the ssd (SQLite) storage engine: I get a stable 2250 tps. If I send a higher TPS volume, the storage processes cannot keep up with writing changes, and both the storage queue and the durability lag start growing. So I regard 2250 tps as the maximum stable throughput.

But when I run the same test on the same testbed with the ssd-redwood-experimental storage engine, I can’t get any stable performance result. Even with a relatively small TPS volume (1000 tps), the storage processes consume memory without bound. They respect neither the cache_memory nor the memory limit, and after a short time fdbmonitor kills and restarts them:

The storage queue stays quite low (~7 MB) and does not grow during this time.

What am I missing? How can I achieve stable operation of Redwood without memory leaks and restarts?

fdbserver trace file

foundationdb.conf

I might be wrong, but my understanding is that the minimum memory requirement is 8GB with the default 2GB cache memory, so if you set cache_memory to 3GB, you should set memory to at least 9GB.

From the docs:

If you increase the cache_memory parameter, you should also increase the memory parameter by the same amount.

  1. With SQLite this recommendation applies only to general-purpose fdbserver processes, where a single process holds many memory objects. A dedicated storage process has only two large memory objects: the storage cache and the storage queue, which is 1.5GB by default. So memory = cache_memory + 2GB is enough for a dedicated SQLite storage process.
  2. I also tried setting
memory = 9GiB
cache_memory = 3GiB

The Out-Of-Memory problem persisted.

Hi Oleg,

Redwood does indeed have at least one memory leak, which I am currently investigating.

The one that I’ve found so far is a one-line patch, here is the diff.

This bug will “leak” (memory is held but useless) key-sized chunks of memory when keys are inserted into a page. These extra key copies only live as long as the page is in cache, and since insertion fills the page, which leads to a rebuild or split, I do not expect this to be an enormous source of memory leakage. I think there is something else in the update path which is holding on to memory that is no longer needed.

If it is convenient for you to try this patch and report your results that would be very helpful!

But a dedicated storage process has only two large memory objects: the storage cache and the storage queue that is 1.5GB by default.

Redwood actually has significant state in memory beyond this in the form of a per-page cache of reconstituted compressed keys. Unfortunately how large this is relative to the page cache varies based on how many unique paths in each page a workload has visited and how large and how compressible its keys are. Tracking the size of this cache in realtime and trimming or size limiting it is tricky, but in the short term at least I will add a knob to reduce Redwood’s effective page cache size by some % to leave room for the decompressed key cache.

The 3GB / 9GB test you ran would have been my next request - Does the memory usage curve change at all as the memory size grows?

Thank you very much for your answer. Right now I’m testing the ssd-rocksdb-experimental engine. After I complete those tests, I will return to testing Redwood with your patch.

But it would be nice if there were metrics showing the sizes of the page cache and the decompressed key state. That would help troubleshoot problems like this.
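In the meantime, some of these figures can be pulled out of `status json`. Here is a minimal sketch that extracts per-process memory usage and the storage queue depth; the field names (`memory.used_bytes` and the storage role's `input_bytes`/`durable_bytes` counters) are from my reading of the machine-readable status document and may vary between FDB versions:

```python
# Sketch: summarize storage process memory and queue depth from the JSON
# produced by `fdbcli --exec "status json"`. Field names are assumptions
# based on the machine-readable status format; verify for your FDB version.

def storage_stats(status: dict) -> list[dict]:
    """Return memory usage and storage queue size for each storage process."""
    stats = []
    for addr, proc in status["cluster"]["processes"].items():
        for role in proc.get("roles", []):
            if role.get("role") != "storage":
                continue
            # Storage queue = bytes accepted but not yet durable on disk.
            queue = (role["input_bytes"]["counter"]
                     - role["durable_bytes"]["counter"])
            stats.append({
                "address": addr,
                "memory_used": proc["memory"]["used_bytes"],
                "storage_queue": queue,
            })
    return stats
```

Feeding it the parsed output of `fdbcli --exec "status json"` would give one row per storage process, which is enough to watch whether memory grows while the queue stays flat.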

@SteavedHams This bug is not the cause of my issue, because I’m using 6.3.22, where there is no RedwoodRecordRef::updateCache method.

Which FDB version would you recommend for testing Redwood?

Redwood in FDB 6.3 is quite old and was far from complete; much has been refactored since then to remove CPU and memory overheads and improve IO scheduling. I recommend testing with FDB 7.0.0, in which Redwood is renamed to ssd-redwood-1-experimental. It still carries the experimental label, as it is not yet ready for production use.


I finally managed to test ssd-redwood-1-experimental on 7.0.0 with the patch you provided.

No more memory leaks occurred.

FDB in the same configuration gives:

  • 2250 tps sqlite 6.3
  • 2500 tps sqlite 7.0
  • 5000 tps redwood 7.0
  • 5500 tps rocksdb 6.3

Redwood storage processes take 4.65 GB of memory each when cache_memory is 3 GB.

So the right formula for dedicated storage process memory is cache_memory × 1.5 + 2GB (because there is a 1.5 GB limit on the storage queue).
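As arithmetic, the two sizing rules from this thread (empirical observations from these tests, not official FDB guidance) work out as follows:

```python
# Rough memory sizing for dedicated storage processes, per the formulas
# discussed in this thread. Empirical observations, not official guidance.
GiB = 1024 ** 3

def sqlite_storage_memory(cache_memory: int) -> float:
    # cache + 1.5 GiB storage queue limit + ~0.5 GiB of other overhead
    return cache_memory + 2 * GiB

def redwood_storage_memory(cache_memory: int) -> float:
    # extra ~50% of cache observed for Redwood's decompressed key state
    return cache_memory * 1.5 + 2 * GiB

print(sqlite_storage_memory(3 * GiB) / GiB)   # -> 5.0
print(redwood_storage_memory(3 * GiB) / GiB)  # -> 6.5
```

With cache_memory = 3 GiB this budgets 6.5 GiB per Redwood storage process, which comfortably covers the 4.65 GB observed above.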


The additional memory Redwood needs varies a lot by the workload’s key randomness and compressibility. In 7.1, this extra usage will be tracked and count against the page cache limit, so total memory usage should be closer to SQLite’s for a given cache size configuration. Also note that SQLite and Redwood do not make use of the OS page cache, so if you have unallocated memory on the host it would be better to give it to the storage processes. RocksDB does use the OS page cache, so anything unused by applications effectively increases its cache size.


There was only a little unallocated memory on the OS.

RocksDB uses I/O in the gentlest manner: it makes a few large I/O operations instead of the many small operations that SQLite and Redwood make. So if performance is bound by the SSD, RocksDB seems preferable.


These are not good general conclusions. Storage engine performance is highly workload dependent and there is no storage engine design which minimizes I/O for all workloads.

a few large I/O operations instead of many small operations

You say “I/O” here, but this is not correct: the statement applies only to writes. LSMs optimize for write I/O while BTrees optimize for read I/O. LSMs write only serially and only to new files, which enables low write amp at commit time. To pay for this, at read time there is read amplification because logically adjacent records are written to different places at different times. To close this gap, compaction rewrites existing data, combining multiple files into a single file to collate data logically and reduce the number of files that must be consulted at read time. Point reads can avoid most read amp using bloom filters, but probabilistic filtering is much harder to apply to range reads, so read amp can be especially high there, particularly when reading ranges with deleted records.

In contrast, BTrees pay write amplification at commit time to guarantee that at read time each record or range has a well defined location, minimizing read amplification. Redwood in particular does only batch updates, merging a sorted change set into the tree and pushing all updates to the leaf level immediately. For highly random small KV sets/clears, this will result in very high write amplification, but for more correlated/grouped changes or for large KV pairs the write amp can be very low.
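To make that write-amplification tradeoff concrete, here is a toy model. The page size, fanout, and the assumption that every random record dirties its own leaf are illustrative only, not Redwood’s actual parameters or accounting:

```python
# Toy model of BTree leaf write amplification under batch updates: each
# dirtied leaf page is rewritten whole, so write amp is roughly
# (pages touched * page size) / (logical bytes written).
def btree_write_amp(num_records: int, record_size: int,
                    page_size: int = 8192, records_per_page: int = 50,
                    clustered: bool = False) -> float:
    if clustered:
        # Correlated keys pack into adjacent leaves: few pages touched.
        pages = -(-num_records // records_per_page)  # ceiling division
    else:
        # Uniformly random keys: nearly every record dirties its own leaf.
        pages = num_records
    return (pages * page_size) / (num_records * record_size)

print(btree_write_amp(1000, 100, clustered=False))  # -> 81.92
print(btree_write_amp(1000, 100, clustered=True))   # -> 1.6384
```

The same batch of 1000 small records costs ~82x write amp when keys are random but under 2x when they land in adjacent leaves, which is the gap described above.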

So if performance is bound by the SSD, RocksDB seems preferable.

This really doesn’t follow. In a BTree, the write amp and read amp per operation are very predictable. With an LSM there is a constant competition for I/O between read amp from needing compaction and read+write I/O to perform that compaction. At disk saturation this is a delicate balance. On the BTree side, the biggest issue with IO saturation is reads starving writes or writes starving reads. With the SQLite engine, reads starving writes is a common problem at disk saturation, but Redwood uses prioritized I/O dispatch to keep performance stable even when the disk is saturated.

@SteavedHams Thank you for your response.

You are right, the test results reflect my workload model, which is mostly write-intensive with relatively little read volume:

There is a relatively small amount of static data and a huge amount of dynamic data. The static data are read and written (updated), but the dynamic data are only written (inserted).

Because the static data fit entirely into the storage cache, they do not cause any disk reads. So the only I/O with SQLite and Redwood is writes. Of course, there are some SSD reads during LSM compaction with RocksDB, but those happen asynchronously.

With RocksDB the SSD utilization is lower, but the CPU utilization is higher, because each logical read of static data requires merging a lot of data in memory.