Reasons for not co-locating tlog and SS? IO characteristics of SS


Question 1

Hi, there have been a few posts (here, here, here and here) that warn of severe performance degradation if the SS and TLogs share either (i) a process or (ii) a disk. Can someone please provide some intuition for the reason for this degradation?

Some reasons given in the above threads relate to the SS saturating the CPU when doing heavy work and thereby starving the TLogs of CPU, while others point to total disk IO saturation as the cause (but AFAIK SSDs are hard to saturate if there is high enough concurrency).


Question 2

When trying to determine the IO characteristics of storage servers for read and write operations, there have been posts (here and here) suggesting that the SS issues all write IO at a single io-depth, implying that the SS’s write capacity will be bound by the underlying storage’s serial write IO throughput (i.e. fio --ioengine=libaio --direct=1 --iodepth=1).

However, @SteavedHams and @markus.pilman have since clarified (here and here) that it is not the write IO that is issued at a single io-depth; rather, the single io-depth limit applies to the read IO that fetches disk blocks before they are written back, implying that the limiting factor for SS write capacity will be the serial read IO throughput at a single io-depth (i.e. fio --ioengine=libaio --direct=1 --iodepth=1).
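
For concreteness, these are roughly the two fio measurements I have in mind; the paths, sizes and the 4k block size (chosen to roughly match the ssd-2 page size) are just placeholders:

    # serial 4K read IOPS at io-depth 1 (the bound per the later clarification)
    fio --name=serial-read --filename=/mnt/data/fio-test.dat --size=10g \
        --ioengine=libaio --direct=1 --iodepth=1 --rw=randread --bs=4k \
        --runtime=60 --time_based

    # serial 4K write IOPS at io-depth 1 (the bound per the earlier posts)
    fio --name=serial-write --filename=/mnt/data/fio-test.dat --size=10g \
        --ioengine=libaio --direct=1 --iodepth=1 --rw=randwrite --bs=4k \
        --runtime=60 --time_based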

It would be very helpful to have an under-the-hood explanation of how writes work in the SS and how one should reason about the IO characteristics of a storage server.

In practice, my observations seem to match the suggestion that write IO happens at a much larger io-depth: whenever the SS is doing heavy writes, iostat shows a much higher queue depth.


It would be great if these suggestions could be clarified/consolidated and any inaccuracies corrected. That would be very helpful for determining the likelihood of saturation a priori in environments where such decisions have to be made based on pre-checks (like fio) before installing and actually running the FDB-based components.

All the questions/observations above are for the ssd-2 storage engine, but similar guidance will be useful for the Redwood storage engine when it arrives.

thanks, gaurav


For #1, two other considerations are

  • The logs call fsync() for every commit version, so hundreds of times per second, while storage servers only call it once or twice per second. I think most drives incur some small hiccup in performance while an fsync is pending.
  • An SSD’s write performance per pattern (linear vs random) under a mixed linear+random workload is usually not the same as what each workload can achieve individually. I’m not entirely sure why this is, but it’s a thing. In other words, if a drive can do 300MB/s of linear writes and 50MB/s of random writes, then doing 25MB/s of random writes does not leave you with 150MB/s of linear write budget remaining; it is something less. (A rough fio sketch of this comparison is below.)
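
A rough way to see that second effect with fio, purely as a sketch (the paths, sizes and the 25MB/s rate cap are placeholders):

    # 1. linear write throughput on its own
    fio --name=linear --filename=/mnt/data/fio-linear.dat --size=10g \
        --ioengine=libaio --direct=1 --rw=write --bs=1m --iodepth=32 \
        --runtime=60 --time_based

    # 2. the same linear job while a rate-capped random write job runs on the same drive
    fio --name=linear --filename=/mnt/data/fio-linear.dat --size=10g \
        --ioengine=libaio --direct=1 --rw=write --bs=1m --iodepth=32 \
        --runtime=60 --time_based \
        --name=random --filename=/mnt/data/fio-random.dat --size=10g \
        --ioengine=libaio --direct=1 --rw=randwrite --bs=4k --iodepth=1 \
        --rate=25m --runtime=60 --time_based

If the drive degraded linearly, the linear job in run 2 would lose only the bandwidth the random job consumes; on most drives it loses noticeably more.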

For #2, I certainly agree that storage server I/O characteristics should be better explained, and in one place. Probably the single most detailed source of this information right now is my presentation and slide deck from the 2019 summit. The slides can be found here; the video is not yet linked but should be soon. FoundationDB Summit 2019: Redwood Storage Engine Update

Regarding the write queue depth: FDB uses SQLite on top of a file caching layer that holds all writes in memory until commit time and then issues them to disk all at once. This is done to coalesce multiple writes of the same pages during the commit cycle. So yes, the write queue depth is large while writes are being issued, but for much of the time writes are not being issued, and the bottleneck is the single-threaded writer reading the uncached pages it requires as it traverses the tree for each mutation and applies its changes.
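
As a very rough non-FDB approximation of that pattern, a fio job file along these lines (placeholder path and size; the burst size and idle time are made up) shows the same shape in iostat: a queue depth near 1 most of the time from the reader, with brief high-queue-depth spikes when the writer flushes a burst:

    [global]
    filename=/mnt/data/fio-test.dat
    size=10g
    direct=1
    ioengine=libaio
    bs=4k
    runtime=60
    time_based

    [reader]
    ; stands in for the single-threaded writer reading uncached pages
    rw=randread
    iodepth=1

    [writer]
    ; stands in for the once-per-commit flush of coalesced dirty pages
    rw=randwrite
    iodepth=64
    thinktime=500ms
    thinktime_blocks=1000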


Thanks, Steve, for these explanations! Your 2019 FDB talk cleared up a lot of my doubts.

One more question:

Where do you think the fsync() performance hit will most likely show up? Is it that the event loop is blocked for a short duration while fsync() is being submitted (I am assuming the event loop is shared by the TLog and SS in a single process)? Or does an outstanding fsync() cause other IOs to take longer to submit, thereby blocking the event loop? Or do the IOs themselves take longer to complete on the SSD if there are any outstanding fsync() calls?

I think the main concern is with doing simultaneous IOs, and on the storage server you’ll be most interested in the performance of reads. The fsync should be asynchronous and not block the event loop, and assuming everything works smoothly with kernel async IO during this time (which may not be a given), other IOs shouldn’t block the run loop. One exception to this is with truncations, which are synchronous and could block the event loop if an fsync caused them to be slow.


Mixed read/write performance is strongly dependent on the firmware of the underlying drive. On modern drives (especially ones that use LDPC error correction) reads are usually bottlenecked by the error corrector’s decode path, and that hardware is not used by the write path.

The number of fsyncs shouldn’t be an issue on enterprise drives, as the fsync’ed data should land in non-volatile RAM that’s flushed to NAND at power loss (or when it fills up).

Concurrent writes lead to poor read performance because the writes cause the drive firmware to garbage collect and wear level the underlying media. If the data is laid out poorly, the garbage collector needs to compact blocks by partially reading them, and that competes with host reads.

The RocksDB team (and other LSM implementations) have had good luck using the NVMe multi-stream extensions to tell the drive that some writes are logically independent of other writes, and that they have different expected lifetimes. The drive uses this information to place the data to optimize future garbage collection operations. I suspect that FDB could do something similar, and ensure that the tlogs and ss are being written to different NVMe streams. For a fair comparison, we should stripe the tlogs across the available drives, and think about how to distribute the storage server data to use all the hardware parallelism.

Here’s Samsung’s multi-stream presentation from the Flash Memory Summit: https://www.samsung.com/us/labs/pdfs/2016-08-fms-multi-stream-v4.pdf

Thanks AJ. From what I have gathered so far, the key reason for not co-locating logs and SSes on the same disk or in the same process is to avoid saturating the disk with excessive IOs. But if the SSD can support high concurrency (io-depth), which almost every SSD can, would the TLog and SS being on the same disk still matter?

In most scenarios I have seen, the SS is bottlenecked by single io-depth read performance when doing writes. Other than the possibility that TLog IO operations (writes + fsyncs) somehow reduce the SS read IOPS occurring concurrently with them, there does not seem to be any other strong reason I can see for why these cannot be co-located.
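
One a-priori check I was thinking of running (paths, sizes and rates below are just placeholders): measure the SS-like io-depth-1 random read performance on its own, then again while a TLog-like job (sequential writes with an fsync after each, a couple hundred times per second) runs against the same disk, and compare the read IOPS/latency:

    # SS-like reads alone
    fio --name=ss-reads --filename=/mnt/data/fio-ss.dat --size=10g \
        --ioengine=libaio --direct=1 --iodepth=1 --rw=randread --bs=4k \
        --runtime=60 --time_based

    # SS-like reads plus a TLog-like writer on the same disk
    fio --name=ss-reads --filename=/mnt/data/fio-ss.dat --size=10g \
        --ioengine=libaio --direct=1 --iodepth=1 --rw=randread --bs=4k \
        --runtime=60 --time_based \
        --name=tlog-like --filename=/mnt/data/fio-tlog.dat --size=2g \
        --ioengine=psync --direct=1 --rw=write --bs=64k --fsync=1 \
        --rate_iops=200 --runtime=60 --time_based

If the read numbers in the second run stay close to the first, the disk itself is probably not the reason to separate them, and the concern shifts to CPU/run-loop effects.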

I may be missing something obvious or repeating questions here, but it is only because I am still trying to get my head around how the IO interactions from the TLog affect SS performance.

Read latencies, though maybe this is no longer an issue at the disk level (EDIT: or maybe it never was? See below). Something a few years ago led me to believe that fsyncs were stalling read operations, so I did a non-FDB (but still flow-based) test where one process serially reads uncached blocks and prints read latency while another process writes one random block and calls fsync(), and the fsync calls correlated with a large spike in read latency.

I tried a few different disks from different manufacturers and got similar results. However, since I recently learned that io_submit() can block for milliseconds (for reasons having nothing to do with the disk), I’m now wondering if perhaps that is really what was going on.
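
For anyone wanting to try a rough non-flow version of that test today, a fio job file along these lines (placeholder path and size) pits an io-depth-1 reader against a handful of fsync’ed random writes per second; the interesting part is the reader’s latency percentiles with and without the [writer] section:

    [global]
    filename=/mnt/data/fio-test.dat
    size=10g
    direct=1
    bs=4k
    runtime=60
    time_based

    [reader]
    ; serially read uncached blocks, watch the latency percentiles
    ioengine=libaio
    rw=randread
    iodepth=1

    [writer]
    ; write one random block and fsync it, a few times per second
    ioengine=psync
    rw=randwrite
    fsync=1
    rate_iops=5

Since fio reports submission latency (slat) separately from completion latency (clat) for the async reader, that split might also help distinguish an io_submit() stall from a device-level stall.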
