SSD engine and IO queue depth

At this timestamp in his talk, @SteavedHams says the SSD engine has only one outstanding IOP at any given time, which I take to mean that if you deploy one storage process per disk, the queue depth would only ever reach one.

This makes me think we should be more explicit about telling people to deploy more storage processes per disk if their disks only reach peak IOPS at high queue depths (16-32 for many SSDs, and considerably more for NVMe).
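One way to sanity-check that for a given disk is to compare random-write IOPS at queue depth 1 and at a deeper queue. The fio invocation below is just a generic sketch, nothing FDB-specific; the device path and runtime are placeholders, and it should only be pointed at a scratch device since random writes will destroy its contents.

```
# 4k random-write IOPS at queue depth 1 vs. 32 (libaio, direct I/O).
# /dev/nvme0n1 and the 60s runtime are placeholders -- use a scratch device.
fio --name=qd1  --filename=/dev/nvme0n1 --ioengine=libaio --direct=1 \
    --rw=randwrite --bs=4k --iodepth=1  --runtime=60 --time_based
fio --name=qd32 --filename=/dev/nvme0n1 --ioengine=libaio --direct=1 \
    --rw=randwrite --bs=4k --iodepth=32 --runtime=60 --time_based
```

If the second run reports several times the IOPS of the first, that is the headroom a single queue-depth-1 writer per disk leaves on the table.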

The current Performance section of the documentation, for example, uses c3.8xlarge nodes, which have two disks each, so it probably ran at least 5 storage processes per disk. The bare metal deployment in the latency section of that doc probably had 9 or more storage processes per disk.

Redwood will (hopefully) address this, since the limitation is inherited from SQLite, but in the meantime I think adding a description of the issue to the docs would be useful.

I’d be happy to write it up, but I just want to confirm my assessment is correct before I do.

Just to clarify, the SSD engine’s single writer has a queue depth of 1, while there are many concurrent readers (currently hardcoded to 64).

If your assumption is "some workloads will see higher write throughput by deploying more than one storage server process per disk", then yes, it is true. However, we don't have a lot of experimental data to suggest how to decide what your StorageServer/disk count should be. It's going to depend on the disk and the workload, and the total number of processes per machine should probably not exceed the CPU core count. Probably the best advice is to try increasing it, as in the sketch below, and see if you like the results.
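Mechanically, trying it is just a foundationdb.conf change: give each extra process on the disk its own datadir and set its class to storage. A minimal sketch, with a hypothetical mount point and port numbers:

```
# Hypothetical fragment of /etc/foundationdb/foundationdb.conf:
# two storage-class processes whose data directories live on the same disk.
[fdbserver.4500]
class = storage
datadir = /mnt/data1/foundationdb/4500

[fdbserver.4501]
class = storage
datadir = /mnt/data1/foundationdb/4501
```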

Also, while I haven’t verified your quoted StorageServer/disk counts for our example deployments, I can tell you we did not choose them with the intent of suggesting a specific SS/disk density.


I am always very hesitant to disagree with Steve when it comes to disk/storage stuff :wink:

However, I don’t think this is the whole story. There is only one write coro-thread in the ssd storage engine, but writes only go to memory (in AsyncFileCached), so writing should be relatively fast. What actually generates disk writes is sync and flush, which write back all dirty pages to disk.

So the actual writing happens outside of the storage engine, and if you run with kernel AIO (the default on Linux) you can have many (64, IIRC) parallel writes in flight to the kernel.
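To make that concrete, here is a standalone sketch of Linux kernel AIO via libaio. It is not FDB's actual code path (the file name, block size, and write count are arbitrary), just an illustration of how a writeback can hand the kernel several writes in a single io_submit call, so the disk sees a queue depth well above one even though only one coro-thread issued them.

```c
// Minimal libaio sketch: submit several writes to the kernel at once and reap
// the completions. Build with: gcc aio_sketch.c -laio -o aio_sketch
#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define N_WRITES 8      /* number of writes kept in flight at once */
#define BLOCK 4096      /* O_DIRECT needs aligned buffers and offsets */

int main(void) {
    int fd = open("aio_demo.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    io_context_t ctx = 0;
    if (io_setup(N_WRITES, &ctx) < 0) { fprintf(stderr, "io_setup failed\n"); return 1; }

    struct iocb cbs[N_WRITES];
    struct iocb *cbp[N_WRITES];
    for (int i = 0; i < N_WRITES; i++) {
        void *buf;
        if (posix_memalign(&buf, BLOCK, BLOCK) != 0) return 1;
        memset(buf, 'A' + i, BLOCK);
        /* each iocb is an independent write the kernel may service in parallel */
        io_prep_pwrite(&cbs[i], fd, buf, BLOCK, (long long)i * BLOCK);
        cbp[i] = &cbs[i];
    }

    /* all N_WRITES are handed to the kernel in one call -> queue depth > 1 */
    int submitted = io_submit(ctx, N_WRITES, cbp);
    if (submitted != N_WRITES) { fprintf(stderr, "io_submit: %d\n", submitted); return 1; }

    /* wait for every write to complete */
    struct io_event events[N_WRITES];
    int done = io_getevents(ctx, N_WRITES, N_WRITES, events, NULL);
    printf("completed %d writes\n", done);

    io_destroy(ctx);
    close(fd);
    return 0;
}
```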

From experience I can tell you that you can easily saturate a (relatively slow) EBS volume with an almost write-only workload (we use huge caches, so almost all of our reads go to memory).

With fast disks, the bottleneck for us was often the CPU. In that case it might make sense to have more than one storage process per disk (though we currently don't run in that configuration).