FoundationDB cluster performance issue - Periods of high disk I/O and sustained high latency

We are running a FoundationDB cluster with the following configuration:

Configuration:
  Redundancy mode        - triple
  Storage engine         - ssd-2
  Coordinators           - 8

Cluster:
  FoundationDB processes - 37
  Zones                  - 8
  Machines               - 8
  Memory availability    - 7.2 GB per process on machine with least available
  Retransmissions rate   - 1 Hz
  Fault Tolerance        - 2 machines

The process class configuration is -
4 storage + 1 proxy on 3 nodes
4 storage + 1 stateless on 2 nodes
1 log + 3 stateless on 3 nodes
Every machine has a single SSD.

We observe periods with spikes in disk I/O (75-80% on some processes) during which transaction processing latency increases quite a bit. High disk I/O doesn’t always lead to high latency, but there are sustained periods (~2 hours) where processing of transactions is very slow and disk I/O is high. Very often this coincides with a decrease in the total disk space used (or happens around that time), but there doesn’t seem to be a correlation between the number of clear-key transactions and the latency spikes. Sometimes a large number of these transactions are processed with minimal latency. It is also not the case that the write rate is low during these periods.
We want to figure out whether something we are missing is happening in the cluster and leading to these periods of sustained high latency.

Are you clearing keys or ranges? The number of clears does not matter so much, but the scope of the clears does. The ssd-2 engine has to do a lot of deferred cleanup over time after a large clear.
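To make the distinction concrete, a minimal sketch using the Python bindings (the keys here are made up):

    import fdb

    fdb.api_version(620)
    db = fdb.open()

    @fdb.transactional
    def purge_examples(tr):
        # Single-key clear: the cleanup work happens when the mutation is applied.
        tr.clear(b'users/42')
        # Range clear: if the cleared range covers whole BTree subtrees, ssd-2
        # defers part of the cleanup and works through it in the background.
        tr.clear_range(b'events/', b'events0')

    purge_examples(db)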

Also, 4 storage processes on a single SSD is probably too many. For ssd-2, two storage processes per disk is likely better than one, because the storage engine has low parallelism for writes and deferred cleanup work and a second process makes better use of the disk’s IOPS budget, but three storage processes per disk is likely too many.
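For reference, a sketch of what two storage-class processes per disk looks like in foundationdb.conf, assuming the standard fdbmonitor-managed install (the port numbers are illustrative):

    [fdbserver.4500]
    class = storage

    [fdbserver.4501]
    class = storage

    [fdbserver.4502]
    class = stateless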

We clear individual keys, but quite a lot of them - clearing 1 million keys over a period of a few minutes is common. This is spread across tens of thousands of transactions.

The biggest issue here is that we have no visibility into whether there are actions going on in the background (moving or purging data) that spike the disk usage.

Based on what I see, after a certain number of deletes some switch gets flipped (into which we have no visibility) and the disk goes crazy.

Even if the number of storage servers is a problem, which parameters can we use to determine that this is actually the problem? I can’t make much out of the status.json details.

How large are your keys and values?

You can look in the trace log for DiskMetrics and SpringCleaningMetrics events for some details about what I/O operations the ssd-2 engine is doing.
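If it helps, a quick way to pull those events out, assuming the default XML trace format (one event per line) and the usual packaged log directory - adjust the path to your logdir:

    import glob

    # Event types of interest in the fdbserver trace logs.
    WANTED = ('Type="DiskMetrics"', 'Type="SpringCleaningMetrics"')

    for path in sorted(glob.glob('/var/log/foundationdb/trace.*.xml')):
        with open(path, errors='replace') as f:
            for line in f:
                if any(marker in line for marker in WANTED):
                    print(line.rstrip())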

You also might want to try the upcoming Redwood storage engine. The first performant version of it will be in FDB 6.3, which should be officially released in the near future, but if you want to try it now and don’t mind building FDB yourself it is available in the release-6.3 branch and called ssd-redwood-experimental.

I suspect that Redwood’s performance will be more consistent over time. It also might be higher, depending on your workload. In any case, its metrics events (RedwoodMetrics) give a lot of insight into what the storage engine is doing and about your workload.
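If you do try it on a test cluster, I believe the usual storage engine configure command applies, using the engine name above (this triggers a full data migration, so it is not something to run casually on a production cluster):

    fdbcli --exec 'configure ssd-redwood-experimental'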

Thanks a lot for quick reply.

Keys - less than 64 bytes.
Values - varies; mostly about 32kb, but there can be a few that exceed 1mb.

I will look into the trace logs and come back with anything suspicious.


Are you splitting the 1MB values into chunks, or did you override the value size limit knob? Normally values are limited to 100kb.

In any case, these value sizes cause a terrible I/O pattern on the ssd-2 engine. Every 32kb value is made of a linked list of 8 pages, 4kb each, and due to the serial nature of a linked list, in order to clear one of the keys each of those value pages must be read, serially, to find the next page in the sequence and free it for reuse. Over time, the pages that comprise one of these large values can become randomly spread throughout the data file, and this further increases the write operations required for a set or clear, because when whole pages are added to or removed from a large value there are metadata pages that must be updated. The more random the value page set is, the larger the covering metadata page set will be.
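If you do split the large values, a minimal sketch of chunking a value across numbered keys, using the Python bindings (the subspace name and chunk size are made up):

    import fdb

    fdb.api_version(620)
    db = fdb.open()

    CHUNK = 32 * 1024                   # stays well under the 100kb value limit
    blobs = fdb.Subspace(('blobs',))    # hypothetical subspace for chunked values

    @fdb.transactional
    def write_value(tr, name, value):
        # Drop any previous chunks, then write the value as numbered chunks.
        tr.clear_range_startswith(blobs.pack((name,)))
        for i in range(0, len(value), CHUNK):
            tr[blobs.pack((name, i // CHUNK))] = value[i:i + CHUNK]

    @fdb.transactional
    def read_value(tr, name):
        r = blobs.range((name,))
        return b''.join(kv.value for kv in tr.get_range(r.start, r.stop))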

I suspect you will see better performance with the ssd-redwood-experimental storage engine, if not now then in the future (it is still being improved), though be aware that it will not leave the experimental stage for a while.

At the moment we are overriding the knob, but splitting them up should be straightforward. I am sure more than 99% of our values are less than 32kb, though.

I picked up on your pointer about “deferred cleanup” and did some analysis. This is what I found:

At 6:40 AM the purge job starts - disk usage is normal.
6:40 - 7:10 - disk usage is normal and TPS is better. The purge job is running but I don’t see a drop in the k/v size or disk space.
At 7:10 AM the purge job is still running and has cleared about 5 million keys - disk usage spikes (40% usage).
At 7:25 AM the purge job completes, with slow TPS between 7:10 and 7:25 AM - disk usage is still around 40%.
At 7:40 AM disk usage is back to normal (1-2%) and TPS improves. The overall k/v size and disk size drop and stabilize.

  1. I don’t have an exact count, but we have cleared about 10 million keys and the k/v size reduced by about 1GB.
  2. The total DB has about 500 million keys spread across approximately 40 subspaces.
  3. Disk metrics show an increase in write ops (about 5K) without any significant increase in application load.
  4. fdb version - 6.2.19

Is there any kind of B-tree rebalancing going on that could impact performance?

Redwood engine - this requires a bit more effort; I will find some time to experiment with it. Thanks for highlighting it.

Have you changed any other knobs?

The transaction size limit is bumped up to 20MB, but I can guarantee nothing exceeded 10MB; I can reproduce this with the 10MB limit too. The actual transaction size for the purge jobs is tiny - a few KB at most.

I would not use the term “rebalancing” to describe any of the deferred cleanup processes that happen within the BTree, but yes, there are background tasks performed over time in response to mutations that can impact performance. One is the execution of large range clears, which requires reading all of the data in the cleared range, though it happens slowly over time. The other is shrinking the data file, which can be IO intensive per unit of space yielded, depending on the data pattern and on other mutations happening while the vacuuming runs. This process does not change the BTree’s structure (fanout / node contents, etc., which is why I say it’s not really rebalancing anything); rather, it consolidates free space to the end of the data file so it can be truncated off of the file. This is probably at least part of what is happening in your (3) above, since your data file is shrinking.

Note that if all of your clears are of single keys and not ranges, the clear work is NOT deferred and it happens when the mutation is applied. Writing new large values (to new keys) will be faster than clearing existing large values due to the IO patterns involved.

Stepping back a bit, however, at the cluster level outside of the BTree (storage engine) the cluster does rebalance data across its Storage Server processes as you write and delete keys. This happens for several reasons, including when the write pattern causes some Storage Servers to hold significantly more data than others, or when a single shard (a contiguous range of the key space) is taking too many writes or has become too large or small compared to the rest of the shards. So it is possible your database is doing some rebalancing when you see high IO. The amount of data rebalancing happening at any given moment is reported in status, both in the cli output and in the json document.
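One way to watch that, pulling the machine-readable status from a script; the "moving_data" path is my reading of the status document layout, so inspect the full document if yours differs:

    import json
    import subprocess

    out = subprocess.check_output(['fdbcli', '--exec', 'status json']).decode()
    status = json.loads(out[out.index('{'):])      # skip any preamble fdbcli prints

    moving = status.get('cluster', {}).get('data', {}).get('moving_data', {})
    print(json.dumps(moving, indent=2))            # bytes in flight / queued for rebalancing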

Something else that could be happening here: when your dataset mostly fits in the page cache of the Storage Server processes, everything runs a lot faster, and your load and purge cycles might be pushing you back and forth across this threshold.
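A rough back-of-the-envelope check of that threshold; the 2 GiB default cache per process is an assumption, so substitute your actual --cache_memory setting and the KV size reported by status:

    # Assumptions: 20 storage-class processes (4 x 3 + 4 x 2 from the layout in
    # the first post) and the default ~2 GiB page cache per fdbserver process.
    storage_processes = 20
    cache_per_process_gib = 2.0
    kv_size_gib = 150.0                              # illustrative logical KV size

    per_server_data_gib = kv_size_gib * 3 / storage_processes   # triple redundancy
    print(f"~{cache_per_process_gib:.0f} GiB cache vs "
          f"~{per_server_data_gib:.1f} GiB data per storage server")
    # A few GiB of cache against ~20+ GiB of data per server means reads of old,
    # cold keys (such as a purge of aged records) will mostly miss the cache.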

I am pretty sure this is happening as I see a clear correlation between disk space usage and disk busy %. When the data files stop shrinking the throughput significantly increases.

Just did some calculations on how much we clear:

KV size - 150GB
Data cleared - 10GB (diff in k/v size pre- and post-delete).
Keys cleared - 8-10 million keys (individual clear commands).

I am a bit stuck here. Is there any option to make the data file cleanup process deferred, or make it take linear time?

Note that if all of your clears are of single keys and not ranges, the clear work is NOT deferred and it happens when the mutation is applied.

We have objects such that each is stored across two k/v pairs with a common key prefix. We do a clear range on that common prefix when we delete them. Do you think the deferred work for even these smaller ranges could be impacting performance as they pile up?
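For concreteness, the delete is roughly this shape (Python bindings; the subspace and field names here are illustrative):

    import fdb

    fdb.api_version(620)
    db = fdb.open()

    # Each object is two k/v pairs under a common prefix, e.g.
    # ('objects', object_id, 'meta') and ('objects', object_id, 'data').
    objects = fdb.Subspace(('objects',))

    @fdb.transactional
    def delete_object(tr, object_id):
        # One small range clear covering exactly this object's two keys.
        tr.clear_range_startswith(objects.pack((object_id,)))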

The file-shrinking deferred work might be having an effect on performance. There are some limits on how much of this “vacuuming” work is done per spring cleaning cycle, in an attempt to limit its effect on performance, but I think it can still be an issue in certain workloads.

You can actually just disable it and see what happens; it just means your files will not shrink on disk, but the free space inside them will still be reused. The option, which you must pass to your fdbserver processes (or at least the storage ones), is

--knob_spring_cleaning_max_vacuum_pages=0

and if that improves things you can try to find an acceptable non-zero value that doesn’t harm performance. I think a value of 100 will limit file shrinking to something around 400kb/s.
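For what it’s worth, if your processes are managed by fdbmonitor, the same knob can also go into foundationdb.conf, which (as I understand it) passes each entry to fdbserver as a --name=value flag; you could also scope it to specific [fdbserver.<port>] sections instead of the global one:

    [fdbserver]
    knob_spring_cleaning_max_vacuum_pages = 0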

The other type of deferred work, the deferred range clear, would not be happening when you clear just two keys. Deferred range clears are done when the range cleared completely covers one or more BTree subtrees, each with a single root node. The pages which link to those subtrees are modified to remove those links, and then the subtree roots themselves are queued for incremental recursion later. Clearing only 2 keys in an operation would not constitute a subtree so there would be no queuing of work for later.

Something else that causes overhead (though again not deferred) is that with your large values it is likely that many of the 4k nodes in your BTree have only 4 entries in them, with each entry then linking to a linked list of overflow pages that hold the large value bytes, as I described earlier. It is possible that your 2-key clears are triggering page underflows, where the page has too much unused space in it so an attempt is made to merge it with a sibling. Unfortunately, there are not currently any metrics that can confirm whether or not this is happening.

How many individual keys are you clearing in each transaction? In the past we have found that keeping the number of clear operations per transaction to a small number (<~10) was helpful, but we did not have anything concrete to explain this behavior.

I don’t have an exact answer; a batch process clears them, so it’s most likely a few thousand keys in each transaction. We are testing with the option which Steve suggested.

Next up will be to measure the number of keys per transaction.

I would suggest experimenting with splitting the clear ops across more transactions. Try 10 clears per transaction; it might be of help (a sketch is below).

See this thread where we faced some issues with clears: Troubleshooting queue build up
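Something along these lines (Python bindings; the helper names are made up):

    import fdb

    fdb.api_version(620)
    db = fdb.open()

    @fdb.transactional
    def clear_batch(tr, keys):
        for k in keys:
            tr.clear(k)

    def purge(db, keys_to_clear, batch_size=10):
        # Many small transactions instead of thousands of clears in one.
        keys = list(keys_to_clear)
        for i in range(0, len(keys), batch_size):
            clear_batch(db, keys[i:i + batch_size])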

Update on this thread - it turns out we have very complex delete logic. For every logical delete op we do 20-30 reads and another 20-30 actual delete ops in FDB. By definition these are old keys, and they are never in the cache.

We got the clue from the cache stats in the storage server, as there was a significant cache miss rate during this period. An optimization around the “read ops” made the actual delete transactions many times faster.

We now have a new problem with an insane amount of repartitioning that gets triggered during the deletes, but I will start a separate thread for that.

Thanks everyone.

Could you elaborate on what this involved? It might be helpful for others.

We have a complex index-like record store where a single record is broken into 20-30 individual k/v pairs in different subspaces. During deletes we were doing a “read” in each subspace for validation before deleting.

The fix was simple: we retrieve the original record with a single read and apply the same algorithm that is used during “insert” to generate the keys, apply the validations, and delete them. This means reads are reduced significantly.
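In sketch form, the new delete path looks roughly like this (Python; the subspaces and the key-derivation helper are simplified stand-ins for our real logic):

    import fdb

    fdb.api_version(620)
    db = fdb.open()

    records = fdb.Subspace(('records',))
    indexes = fdb.Subspace(('indexes',))

    def derive_index_keys(record_id, record_value):
        # Stand-in for the deterministic key-generation logic shared with the
        # insert path; the real version produces 20-30 keys across subspaces.
        return [indexes.pack((name, record_id)) for name in ('by_user', 'by_time')]

    @fdb.transactional
    def delete_record(tr, record_id):
        raw = tr[records.pack((record_id,))]
        if not raw.present():
            return
        # One read, then blind clears of the derived keys, instead of a
        # read-per-subspace validation pass.
        for key in derive_index_keys(record_id, raw):
            tr.clear(key)
        tr.clear(records.pack((record_id,)))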

Thanks. But wouldn’t a blind delete of the 20-30 keys result in implicit reading of the btree disk blocks (read, update, write)?

Curious to know how an explicit read of the keys followed by a delete would be any worse than blind deletes of the same keys, from a cache-friendliness and IO-activity perspective?

In fact, from what I’ve understood, gets of keys are done by the SS in parallel (with a higher queue depth), whereas the implicit reads (for writes/deletes) are done serially (with a queue depth of one).