What most affects the performance of the 'streaming mutation log' part of an FDB restore?

I know FDB restore is pretty slow, even for relatively small DBs. Restoring from a recent snapshot isn't too bad, but if there's a long stretch of mutation log to replay since the last snapshot, it has a massively adverse effect on restore time.

For instance, we seem to be able to restore a ~350 GB K/V-size DB from a just-completed (within the last hour or so) snapshot within ~90 minutes, which does cause a good amount of disk I/O on the hosts (though it doesn't seem to be hitting any host-performance bottlenecks). But when there's a day's worth of mutation logs (~4.5k individual files, ~4-7 GB of data), it takes a further 4-5 hours to complete the restore, during which the cluster seems to be doing very little in terms of CPU, memory, or disk I/O.
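To make the imbalance concrete, here's the back-of-envelope arithmetic on those figures (rough midpoints only, so treat the exact ratio loosely):

```python
# Rough rates from the numbers quoted above (approximate midpoints).
snapshot_gb, snapshot_hours = 350, 1.5   # ~350 GB of K/V data restored in ~90 min
log_gb, log_hours = 5.5, 4.5             # ~4-7 GB of mutation logs taking a further 4-5 h

snapshot_rate = snapshot_gb / snapshot_hours   # roughly 230 GB/h in the snapshot phase
log_rate = log_gb / log_hours                  # roughly 1.2 GB/h in the mutation-log phase

print(f"snapshot phase:     ~{snapshot_rate:.0f} GB/h")
print(f"mutation-log phase: ~{log_rate:.1f} GB/h "
      f"(~{snapshot_rate / log_rate:.0f}x slower per byte of input)")
```

So per byte of input, the mutation-log phase is a couple of orders of magnitude slower than the snapshot phase for us, which is why I'm focusing on it.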

(Side note: when we were on ssd-2 with FDB 7.1 on network-attached cloud storage, the bottleneck was quite clearly disk I/O, as the storage nodes of the restore cluster were maxing out that stat. Since moving to FDB 7.3 and ssd-redwood-1, I/O has been vastly lower both for day-to-day workloads and for test restores, and it no longer appears to be the issue.)

My hypothesis (from observing host performance stats during restore, and from reading this) is that once the latest snapshot data has been restored, the mutation log can only be applied in order, on a single thread or similar. So I'm wondering what most affects that time, how we might reduce it as much as possible, and how to avoid unexpected surprises where it jumps suddenly because of a system design change.

I can imagine that any or all of

  • Number of key updates present in the log files (a lot of updates to small keys/values, vs fewer updates to larger keys/values)
  • Number of log files (what actually triggers a new log file? Is it just time/versionstamp progression, or size too?)
  • Total size in bytes of the necessary log data (more updates made and/or just larger keys/values)
  • Number of individual transactions (or, number of unique versionstamps)

could affect the speed of a restore. But I don't know which will affect it most and which are basically non-issues. If anyone can offer guidance there, it would be much appreciated. (A rough sketch of how I'm tallying some of these factors from the backup bucket is below.) I'm going to poke through the codebase, so feel free to link me to bits of it, but I'm not a C++ coder generally, so it takes me a while to read/digest/mostly-understand.
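For reference, this is roughly how I'm measuring the file-count and total-bytes factors, assuming the mutation-log objects sit under a `logs/` sub-prefix of the backup container (the bucket/prefix names are placeholders, and the key layout is worth checking against your own bucket before trusting the totals):

```python
# Count and size the mutation-log objects under the backup prefix.
# Assumes boto3 credentials are configured; bucket/prefix names are placeholders.
import boto3

BUCKET = "my-backup-bucket"          # placeholder: substitute your own bucket
PREFIX = "fdb-backups/cluster-a/"    # placeholder: the backup container prefix

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

log_count, log_bytes = 0, 0
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        if "/logs/" in obj["Key"]:   # assumption about the container layout
            log_count += 1
            log_bytes += obj["Size"]

print(f"{log_count} log files, {log_bytes / 2**30:.1f} GiB of mutation-log data")
```

Correlating those counts against observed restore times across a few runs seems like the cheapest way to see which factor dominates, but I'd still like to understand the mechanism.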

I know data can be pulled from S3 a lot faster than even the snapshot portion of a restore manages, so would it be faster to pull the S3 backup prefix onto a local ephemeral NVMe disk on each node running a backup/restore agent, and then trigger the restore from the local path? Is this even possible, or do too many parts of the backup/restore system reference the original backup path, so a lot of unexpected stuff would break?
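To make that question concrete, this is roughly the shape of what I mean (untested; the paths are placeholders, the exact flag spellings should be checked against `fdbrestore --help` for your version, and whether a synced copy of an S3 backup container is even readable via a `file://` URL is exactly what I'm asking):

```python
# Untested sketch: stage the backup locally on every backup_agent host, then restore
# from the local copy. Paths are placeholders; whether fdbrestore accepts a backup
# container copied out of S3 like this is the open question.
import subprocess

S3_URL = "s3://my-backup-bucket/fdb-backups/cluster-a"   # placeholder
STAGING = "/mnt/nvme/fdb-restore-staging"                # local ephemeral NVMe
CLUSTER_FILE = "/etc/foundationdb/fdb.cluster"

# 1. Pull the whole backup prefix onto local NVMe (run on each agent host).
subprocess.run(["aws", "s3", "sync", S3_URL, STAGING], check=True)

# 2. Point the restore at the local copy via a file:// backup URL and wait for it.
subprocess.run(
    ["fdbrestore", "start",
     "-r", f"file://{STAGING}",
     "--dest-cluster-file", CLUSTER_FILE,   # flag spelling may differ by version
     "-w"],
    check=True,
)
```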

I'm aware there's a partitioned backup/restore process, marked as experimental since 2019, which according to the design doc allows multiple mutation log files to be processed simultaneously. We tried to run the backup side of it on a test source cluster, and it locked the entire cluster up in such a way that we never managed to recover it. We still don't understand why, so we're understandably not comfortable deploying it on our production DB.

I'm also aware there's a backup mechanism that involves calling out to an external script simultaneously on all the non-stateless hosts, intended for things like triggering a hypervisor-level disk snapshot. We'd like to investigate that in future (and I'd be interested to know what the DB does to keep everything in sync while that is running: does it just hold everything in memory and not flush to disk?), but for the time being I want to know what I can squeeze out of the existing fdbbackup/fdbrestore.