Impact of workload concurrency on write amplification

Hello,

We just ran YCSB workloads over the latest FoundationDB (with the Redwood storage engine) on SSDs with built-in hardware-based transparent compression, and measured how transparent compression reduces the physical dataset footprint and write amplification.

One thing we observed is that the workload concurrency (i.e., the number of YCSB clients) has a significant impact on write amplification. For example, under a 100%-update YCSB workload with a 64B record size and a 4KB page size, write amplification (i.e., total_write_IO_traffic_volume / total_size_of_updated_records) increases by 3x when we reduce the client count from 32 to 4. Could anyone please shed light on what the reason could be?

The reason we are interested in write amplification (in addition to storage capacity) is that emerging QLC flash has very limited cycling endurance, so it is critical to keep write amplification low. Thanks.
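
For concreteness, here is a minimal sketch of how the write amplification number is derived; the figures passed in below are placeholders, not our measured values:

```python
def write_amplification(physical_bytes_written: int,
                        updates: int,
                        record_size: int = 64) -> float:
    """total_write_IO_traffic_volume / total_size_of_updated_records"""
    return physical_bytes_written / (updates * record_size)

# Placeholder inputs only: physical bytes come from the drive's host-write
# counter sampled before and after the run; updates from the YCSB run stats.
print(write_amplification(physical_bytes_written=128 * 2**30,
                          updates=50_000_000))
```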

Tong Zhang, ScaleFlux

When you reduce the client count, doesn’t that also decrease throughput? If it does, then what you are seeing is completely expected for a BTree and a small random update workload.

The more writes there are, the more likely it is that in a given commit more than one of those writes will fall on the same page. When that happens, there is less write amplification because you are able to update more KV bytes per 4KB page written.

Thanks for your response. Yes, the overall throughput drops as the client count is reduced. The dataset is about 200GB, and the updates are distributed uniformly over the entire dataset, so for either 32 or 4 clients there should be only a small probability that multiple updates fall into the same 4KB page in a given commit. A similar write amplification difference was also observed when we increased the record size from 64B to 256B. Could the difference partly come from the WAL or the pager (e.g., a lower client count may cause larger write amplification on WAL or pager writes, since each WAL/pager write to the SSD must be a full 4KB)?
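
To put a rough number on that, here is a quick back-of-the-envelope sketch (the per-commit batch sizes are guesses, not measured values):

```python
# Balls-into-bins estimate of how many distinct 4KB pages a commit batch of
# uniformly random updates touches, for a ~200GB dataset.
PAGE_SIZE = 4096
DATASET_BYTES = 200 * 2**30
NUM_PAGES = DATASET_BYTES // PAGE_SIZE            # ~52 million pages

def expected_distinct_pages(updates_per_commit: int) -> float:
    return NUM_PAGES * (1.0 - (1.0 - 1.0 / NUM_PAGES) ** updates_per_commit)

for batch in (1_000, 10_000, 100_000):            # guessed batch sizes
    distinct = expected_distinct_pages(batch)
    print(f"{batch:>7} updates/commit -> "
          f"{batch / distinct:.4f} updates per touched page")
```

Even at 100,000 updates per commit the overlap is well under 1%, which is why we suspect the difference comes from somewhere other than multiple updates sharing a page.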

The RedwoodMetrics trace events will help answer this question; if you can make them available to me, I should be able to tell you what is going on.

The storage servers do not care about or see the client count; they just see a stream of (Version, Mutation) pairs and send them to the storage engine in order, with a commit() issued periodically based on a time or byte limit, whichever comes first. Therefore it does not matter how many clients are generating the random writes; it only matters what the writes are in each commit batch on each storage server.
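
As a very rough sketch of that batching behavior (illustrative only; the byte and time limits below are placeholders, not the actual knob values):

```python
import time

COMMIT_BYTE_LIMIT = 10 * 2**20   # placeholder byte budget per commit
COMMIT_TIME_LIMIT = 0.25         # placeholder time budget in seconds

def apply_mutation_stream(mutations, engine):
    """Apply (version, mutation) pairs in order, committing when either the
    byte limit or the time limit is reached, whichever comes first."""
    bytes_since_commit = 0
    last_commit = time.monotonic()
    for version, mutation in mutations:           # already ordered by version
        engine.apply(version, mutation)
        bytes_since_commit += len(mutation)
        if (bytes_since_commit >= COMMIT_BYTE_LIMIT
                or time.monotonic() - last_commit >= COMMIT_TIME_LIMIT):
            engine.commit()
            bytes_since_commit = 0
            last_commit = time.monotonic()
    engine.commit()
```

The point is that the storage engine only sees commit batches; the client count matters only indirectly, through how many mutation bytes arrive per batch.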

The extra writes are likely coming from the Pager, yes, but it’s probably the case that in the lower-throughput regime the Pager has to make writes that it was able to skip in the higher-throughput regime.

Redwood does not have a WAL in the traditional sense. It writes new or updated pages onto free pages, and in the case of updated pages it writes a record to a pager log that says “logical page X as of version V is now located at physical page Y”. Eventually, in order to truncate records from the front of this pager log so that it does not grow indefinitely, once data versions prior to V are no longer being maintained the contents of Y might be copied onto physical page X. I say “might” because this copy will be skipped if it is possible to do so without data loss, and the more write activity you have, the more likely it is that skipping is possible.
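
Here is a toy model of that write path, just to make the remap idea concrete (none of this is Redwood’s actual code; the names and structures are made up):

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class RemapEntry:
    version: int    # commit version V
    page: int       # logical page X
    new_loc: int    # free physical page Y now holding X's new contents

@dataclass
class ToyPager:
    free_pages: List[int]
    remap_log: List[RemapEntry] = field(default_factory=list)
    location: Dict[int, int] = field(default_factory=dict)  # logical -> physical

    def update_page(self, page: int, version: int) -> None:
        # Write the new contents to a free page and append a remap record,
        # rather than writing a WAL record plus an in-place page update.
        new_loc = self.free_pages.pop()
        self.remap_log.append(RemapEntry(version, page, new_loc))
        self.location[page] = new_loc

pager = ToyPager(free_pages=list(range(1000, 1010)))
pager.update_page(page=7, version=42)
print(pager.remap_log[0])
```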

One mechanism for skipping some of these writes is that log truncation intentionally lags behind the oldest readable commit version, so that when a remap entry is popped from the front of the log the copy can be skipped if it is known that the page is updated again or freed prior to the first retained readable version. The longer the remap cleanup window is (this is a knob; it defaults to 50 storage engine commits, which, due to other knobs, equates to up to 25 seconds but will be less under high write load), the more skippable writes of this form there will be.

Another mechanism is that if, during the remap cleanup window, multiple sibling BTree nodes under the same parent node are updated, then the BTree will update the parent node to point directly to the new child locations, so when the child remap entries are truncated from the log the new page data does not have to be copied onto the original pages.

Most likely, these two optimizations are what reduce your write amplification with higher throughput. The RedwoodMetrics trace events will show whether or not this is the case.
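
To make those two skip conditions concrete, here is a sketch of the cleanup decision for the entry at the front of the remap log (again purely illustrative; the real logic and names differ):

```python
def cleanup_front(remap_log, oldest_readable_version,
                  freed_pages, reparented_pages, copy_page):
    """Pop the oldest remap entry, copying the new contents back onto the
    original physical page only when the copy cannot be skipped."""
    entry = remap_log[0]

    # Mechanism 1: the page was updated again, or freed, prior to the first
    # retained readable version, so no retained version needs the old mapping.
    remapped_again = any(e.page == entry.page
                         and entry.version < e.version < oldest_readable_version
                         for e in remap_log[1:])
    freed = entry.page in freed_pages

    # Mechanism 2: the parent BTree node was rewritten during the cleanup
    # window to point directly at the new child location, so nothing refers
    # to the original physical page anymore.
    reparented = entry.page in reparented_pages

    if not (remapped_again or freed or reparented):
        copy_page(src=entry.new_loc, dst=entry.page)   # the extra pager write

    return remap_log.pop(0)
```

The longer the cleanup window, the more time there is for one of these conditions to become true before an entry reaches the front of the log, which is why higher write throughput tends to reduce this extra write traffic.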

Really appreciate your detailed explanation! Yes, our observation now makes perfect sense, and we will further double-check against the RedwoodMetrics trace events. Thanks.