Storage queue limiting performance when initially loading data

Hi,
We have been playing around with FoundationDB for a few months now and things are going great; however, we have been having some trouble getting good performance when initially loading large volumes of data into a cluster. When running a large load I pretty much always see the following message in fdbcli:

Performance limited by process: Storage server performance (storage queue).

In the details view for that particular process it doesn’t look like it is particularly struggling:

10.22.64.25:4501 ( 57% cpu; 14% machine; 0.045 Gbps; 60% disk IO; 3.4 GB / 7.3 GB RAM )

Looking at the status JSON for that process I do see that the durability lag is large:

"durability_lag": {
  "seconds": 88.504000000000005,
  "versions": 88503993
}

and that the storage queue is full.
With regards to my workload, we basically load some complex documents and populate simple indexes. In a single transaction we insert a number of documents, so there can be a large number of keys across different subspaces in a single write. One behaviour I also see on the cluster is that when I start populating it I initially get a write rate of about 180,000 Hz (around 350 transactions committed per second), but this slowly degrades over time; for example, after about 15 hours the workload sees about 80,000 Hz (around 170 transactions committed per second), with about 500 GB of KV data loaded in that time.
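The shape of each load transaction is roughly like this (a simplified sketch using the Python bindings; the subspace names and document shape here are placeholders, not our real data model):

import fdb

fdb.api_version(620)
db = fdb.open()

# Placeholder subspaces -- the real model has more of these.
docs = fdb.Subspace(('docs',))
by_author = fdb.Subspace(('idx', 'author'))

@fdb.transactional
def load_batch(tr, documents):
    # Each document becomes one key in the document subspace plus one or
    # more index keys in other subspaces, so a single commit carries many
    # keys spread across different parts of the keyspace.
    for doc in documents:
        tr[docs.pack((doc['id'],))] = doc['body']
        tr[by_author.pack((doc['author'], doc['id']))] = b''

load_batch(db, [
    {'id': 'doc-1', 'author': 'alice', 'body': b'...'},
    {'id': 'doc-2', 'author': 'bob', 'body': b'...'},
])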

I have tried a number of different cluster setups, but my current one is 6 nodes, each with 8 vCPUs, 30 GB of memory and two 375 GB NVMe disks. I run 4 processes on each node, 2 stateless class and 2 storage class; on 2 of the nodes I run log class instead of storage.

Does anyone know what may be going on here or what I should be investigating further? I've read through other posts and tried things out but haven't been able to get past this. And in general, does anyone have any thoughts on whether/how I can improve the write performance for these initial data loads into the cluster?
Thank you! (Apologies for the large post; we are relatively new to running FDB, so I'm not sure what information is most important here or what to be looking at.)

You could try deploying more storage processes per disk. It looks like you’ve got one storage process per disk right now, and you can’t reach peak IOPS with only one storage process writing at a time. I don’t think this is spelled out anywhere in the documentation, but it is worth a shot.

If you look at the benchmark numbers in the documentation, you’ll see multiple storage processes per disk. In some cases the machines only had a single disk and every process was writing to it.

Just to show my work a bit here:

It sounds like the steady state is somewhere between where you ended up (80kHz) and where you started (180kHz), so let’s go with 100kHz, since FDB’s capacity to absorb bursts of writes is significant and inflates the initial rate.

If only 4 nodes are running storage processes, each of those nodes has two disks, and each storage process has its own disk, that’s a total of 8 disks and 8 storage processes. If you divide that capacity evenly (since it sounds like you’re not writing data sequentially), that’s 12.5kHz per process, which is also per disk.


[Chart omitted: IOPS vs. number of outstanding IOs. Source workload was 66% read IOs / 33% write IOs, which is not identical to this one but useful as a model of expected performance.]

Notice that even NVMe SSDs require more outstanding IOs to reach their peak IOPS. You’re also going to be doing some read IO on the write path since this is a b-tree and not an LSM.

I don’t know the size of your writes, nor do I know the exact specs of the disks you’re using, but it sounds like you’ve gotten exactly the performance I would expect.

If you add more storage processes you’ll have more outstanding IOs, which should lead to higher throughput.

I’ve been benchmarking a number of NVMe-based storage systems recently, and it’s likely you need significantly more storage processes to saturate I/O.

You should do a rough comparison of raw disk partition throughput for a single process (e.g. dd) vs. the same on a formatted partition, and compare both to FDB.

Then run the same again with multiple dd processes in parallel, and see what throughput you can get.

In my case, 16-30 concurrent writers were sufficient.
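For the parallel run, a quick driver script along these lines works (a rough sketch; the target path, block counts, and worker count are placeholders, and it writes real data, so point it only at scratch space):

import subprocess
import sys
import time

# Placeholders -- use a scratch directory on the disk under test,
# never a device or path holding data you care about.
TARGET = "/mnt/disk1/fdb-bench/testfile"
BLOCKS_PER_WORKER = 1024              # 1 GiB per worker at bs=1M
WORKERS = int(sys.argv[1]) if len(sys.argv) > 1 else 8

start = time.time()
procs = []
for i in range(WORKERS):
    # Each dd writes its own file with direct IO and a final fdatasync,
    # roughly mimicking independent storage processes sharing one disk.
    procs.append(subprocess.Popen([
        "dd", "if=/dev/zero", f"of={TARGET}.{i}",
        "bs=1M", f"count={BLOCKS_PER_WORKER}",
        "oflag=direct", "conv=fdatasync",
    ]))
for p in procs:
    p.wait()

elapsed = time.time() - start
total_mib = WORKERS * BLOCKS_PER_WORKER
print(f"{WORKERS} writers: {total_mib / elapsed:.0f} MiB/s aggregate")

Run it with increasing worker counts (1, 4, 8, 16, ...) and watch where the aggregate throughput stops improving.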

1 Like

Given that enough people seem to have discovered this independently, I think this is a useful bit to add to the documentation.

I know there probably shouldn’t be an official recommendation of storage processes per disk as it is obviously specific to your hardware, but telling people “you need to run more than one to saturate a modern disk” seems like extremely conservative advice.

Hi,
Thank you very much for your answers, they are really helpful. I will increase the number of storage servers and see how that affects things. Once I have a bit more time I will also give the disk performance tests a go. I am running these tests on GCP using local SSDs, which should have good IO performance; hopefully I can get a good understanding of how they perform and how best to utilise them.

One question I do have: if the disk isn’t saturated, why do I see the storage queues growing in size? Is it that the process isn’t able to keep up with the amount of data coming into the system and process the writes at the same time, i.e. that it is CPU bound? (I am not doing any reads at the moment.)

The SSD storage engine issues one write IO at a time. It will issue up to 64 read IOs concurrently.

If you have more data coming in than can be written by 1 IO at a time on these 8 disks, your queue will eventually back up. Your disks can handle far more than one outstanding IO, and only reach their peak performance with many more.

Regardless of whether you’re doing reads at the application level, B-trees require reading before writing to know where the data goes, in contrast to LSMs, which only append or write larger immutable files.
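To make that concrete, here is a toy model (nothing like FDB’s actual code, just an illustration) of why a B-tree set incurs read IO even when the application never issues a read, while an LSM-style set does not:

import bisect

# Pretend on-disk B-tree: two leaf pages, each covering a key range.
PAGE_LOWS = ["a", "m"]                 # lowest key covered by each leaf
DISK = {0: {}, 1: {}}                  # leaf page contents on "disk"
cache = {}
reads_on_write_path = 0

def btree_set(key, value):
    """Setting a key means seeking to its leaf page and reading it first."""
    global reads_on_write_path
    page_id = bisect.bisect_right(PAGE_LOWS, key) - 1
    if page_id not in cache:           # page not cached -> read IO
        cache[page_id] = dict(DISK[page_id])
        reads_on_write_path += 1
    cache[page_id][key] = value        # modified in cache, flushed at commit

lsm_log = []
def lsm_set(key, value):
    """An LSM-style set just appends; no read is needed to place the data."""
    lsm_log.append((key, value))

btree_set("apple", b"1")
btree_set("melon", b"2")
lsm_set("apple", b"1")
print("B-tree reads on the write path:", reads_on_write_path)   # 2
print("LSM reads on the write path: 0")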

1 Like

That’s really helpful, thanks, I think I understand. So by adding more storage processes per disk, they will be able to issue more IOs to the disk concurrently, which means I will be using more of the disk’s capacity and will have more outstanding IOs, which helps performance as shown in the graph you posted.

1 Like

Just wanted to post a quick update on this topic. I increased the storage processes to 3 per disk on all of my nodes and this really helped: I was able to load at a rate of roughly 300,000 Hz over 12 hours, the rate stayed stable the whole time, and the queues sat at about 5-10% of capacity.
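For anyone finding this later, the change amounts to adding extra [fdbserver.<port>] sections to foundationdb.conf so that several storage processes share a disk, roughly like this (the ports and paths here are illustrative, not my exact config):

[fdbserver.4503]
class = storage
datadir = /mnt/disk1/foundationdb/4503

[fdbserver.4504]
class = storage
datadir = /mnt/disk1/foundationdb/4504

[fdbserver.4505]
class = storage
datadir = /mnt/disk1/foundationdb/4505

Each process still needs its own datadir, but the directories can all sit on the same disk.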

When I get some more time I plan to increase the number of processes to really understand how far I can push these disks and see what the maximum is. It feels like I should be able to push them further.

Thank you very much for your help and explanations on this.

2 Likes

I am running my cluster in SSD mode. Increasing the number of resolvers from 1 to 2 helped me, as the single resolver was becoming a bottleneck.
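For reference, the resolver count can be changed from fdbcli with something like:

configure resolvers=2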

If you can see that the two resolvers have spare CPU and memory once your data loading is finished, don’t forget to go back to one resolver. Resolvers do not communicate which transactions they’ve aborted, so to be safe they have to assume that any transaction which doesn’t conflict from their perspective actually committed. This means you’ll potentially abort more transactions in the future that could have committed in a single-resolver configuration. A single resolver knows which ranges to keep around because it sees every transaction completely.

Sorry for the confusion on this issue. To be very precise, the SSD engine does often have an IO queue depth of 1, but that is due to waiting on reads. Each set or clear must do a BTree seek for each key (or range endpoint), and any time those seeks encounter a non-cached page they must wait for it. Then they can modify the page and “write” it, but the write is buffered and not sent to disk until commit time, the idea being that the page may be modified again. I only recently realized this write buffering was happening; the interface made it not obvious, because IAsyncFile::write() returns a Future but in the AsyncFileCached implementation it just returns Void() and queues the write for later.

The SSD engine does do up to 64 get() or getRange() operations at once, but that queue is independent of the Writer thread which must do its own reads serially (using the shared cache) as part of its update path.

2 Likes