What most affects the performance of the 'streaming mutation log' part of an FDB restore?

I know FDB restore is pretty slow, even for relatively small DBs. If you’ve done a recent snapshot, it’s not bad. But if there’s a good length of mutation log to read since the last snapshot, it has a massively adverse effect on the restore time.

For instance, we can restore an ~350GB K/V size DB from a just-completed (within the last hour or so) snapshot in ~90m, which does cause a good amount of disk I/O on the hosts (though it doesn't seem to hit any bottlenecks on host performance). But when there's a day's worth of mutation logs (~4.5k individual files, ~4-7GB of data), it takes a further 4-5 hours to complete the restore operation, during which the cluster seems to be doing very little in terms of CPU, memory, or disk I/O.

(Side note: when we were on ssd-2 with FDB 7.1 and network-attached cloud storage, the bottleneck was quite clearly disk I/O, as the storage nodes of the restore cluster were maxing out that stat. Since moving to FDB 7.3 and ssd-redwood-1 the I/O has been vastly lower, both for day-to-day workloads and test restores, and it no longer appears to be the issue.)

My hypothesis (from observing host performance stats during restore, and from reading this) is that once the latest snapshot data has been restored, the mutation log can only be applied in order on a single thread, or something similar. So I'm wondering what most affects that time, and how we might reduce it as much as possible and/or avoid unexpected surprises where it suddenly jumps due to a system design change.

I can imagine that any or all of

  • Number of key updates present in the log files (a lot of updates to small keys/values, vs fewer updates to larger keys/values)
  • Number of log files (what actually triggers a new log file? Is it just time/versionstamp progression, or size too?)
  • Total size in bytes of the necessary log data (more updates made and/or just larger keys/values)
  • Number of individual transactions (or, number of unique versionstamps)

could affect the speed of a restore, but I don't know which matter most and which are basically non-issues. If anyone can provide guidance there, it would be much appreciated. I'm going to poke through the codebase, so feel free to link me to relevant bits of it, but I'm not a C++ coder generally, so it takes me a while to read/digest/mostly-understand.
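For reference, this is roughly how I've been sizing those factors per backup. Just a sketch: flag spellings vary a bit between FDB versions (dashes vs underscores, so check `fdbbackup --help`), and the `logs/` prefix is my assumption about the container layout rather than something I've confirmed in the code:

```bash
# Summarise a backup container: snapshots, restorable version range,
# and total bytes of range data vs mutation log data.
BACKUP_URL='blobstore://<key>:<secret>@<host>/<backup-name>?bucket=<bucket>'
fdbbackup describe -d "$BACKUP_URL"

# Rough count and total size of the individual mutation log files,
# assuming they sit under a logs/ prefix in the container.
aws s3 ls --recursive --summarize --human-readable "s3://<bucket>/<backup-name>/logs/"
```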

I know data can be pulled from S3 a lot faster than even the snapshot portion of a restore is doing, so would it be faster to pull the S3 backup prefix to a local ephemeral NVMe disk on each node running a backup/restore agent, and then trigger the restore from the local path? Is this even possible, or do too many parts of the backup/restore system reference the original backup path and a lot of unexpected stuff will break?
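To make that concrete, something like the following is what I have in mind. Entirely untested, and I don't know whether a blobstore container copied down to disk is actually valid as a `file://` container, so treat every path and flag here as an assumption:

```bash
BUCKET='<bucket>'                      # placeholders
BACKUP='<backup-name>'
LOCAL="/mnt/nvme/fdb-backup/$BACKUP"

# On each host running a backup_agent: copy the backup container from S3
# to local NVMe, at the same path everywhere.
aws s3 sync "s3://$BUCKET/$BACKUP/" "$LOCAL/"

# Then point the restore at the local copy instead of the blobstore URL.
fdbrestore start \
  --dest-cluster-file /etc/foundationdb/fdb.cluster \
  -r "file://$LOCAL"
```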

I'm aware there's a partitioned backup/restore process which is marked as experimental (and has been since 2019), and which, per the design doc, allows multiple mutation log files to be processed simultaneously. We tried to run the backup side on a test source cluster, and it locked the entire cluster up in such a way that we never managed to recover it. We still don't understand why, so we're understandably not comfortable deploying it on our production DB.

I’m also aware there’s a backup system that involves calling out to an external script simultaneously on all the non-stateless hosts, intended for things like triggering a hypervisor-level disk snapshot. We’d like to investigate that in future (and I’d be interested to know what the DB does to keep everything in sync while that is running. Does it just hold everything in memory and not flush to disk?), but for the time being I want to know what I can squeeze from the existing fdbbackup/fdbrestore.

@danmeyers were you able to achieve any faster restores? We’ve been experimenting with tuning the cluster but are also not getting very far. Similar to you, we have observed a small restore of < 5 GB of KV size as reported by the database taking about 4-5 hours too. fdbrestore status reports 5k files, so I assume we are running into the same situation you describe where there are lots of mutation logs.

@hxu Sort of. We made some tradeoffs to get the results we wanted:

  • We invoke fdbbackup to create new backups with the flag --initial-snapshot-interval 3600. For our current DB (~750GB), spreading the initial full snapshot over ~1hr doesn't load it significantly (without the flag it runs as fast as possible and completes in 10-15 mins, but performance of the rest of the system suffers as a result).
  • We configure backup snapshotting within a backup to be every 6 hours.
  • We have a system that auto-creates an entirely new backup and stops the old one once a day.
    • Actually, we have 2 backups running at any one time; one rotates at 11:00 and the other at 15:00. When one is stopped, a new backup is created with the same tag (there doesn't appear to be any way to clean up tags, and we didn't want the list to grow endlessly), roughly as sketched after this list. That way there's always at least one restorable backup running while the other is non-restorable during its initial snapshot phase.
  • We’ve done a load of work on how our system interacts with FDB to cut the amount of data we write in a given time period right down. We needed to do that anyway, otherwise our storage costs would just be unmanageable. So while the DB itself is quite large, we’re only writing ~5GB mutation logs in a day now, and the actual change in key size for the snapshot is much smaller.
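In shell terms, one tag's daily rotation looks roughly like the following. A sketch only: the tag name and URL are made up, the 1h/6h intervals are just our values from above, and the exact flag spellings (and whether you want -z for a continuous backup) depend on your FDB version, so check fdbbackup --help:

```bash
TAG='rotating-a'                                        # hypothetical tag name
NEW_URL="blobstore://<host>/backup-$(date +%Y%m%d)?bucket=<bucket>"

# Stop the current backup on this tag (it halts once restorable).
fdbbackup discontinue -t "$TAG"

# Start a fresh container under the same tag: spread the initial snapshot
# over ~1h, then take a new snapshot every 6h, and keep running (-z).
fdbbackup start -t "$TAG" -d "$NEW_URL" \
  --initial-snapshot-interval 3600 \
  -s 21600 \
  -z
```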

With all of that combined, we can restore the DB to the latest point in time we have a valid backup for in ~4 hrs, although it is creeping up still as the DB grows. The main reason for the regular recreation of the backups is that you can’t set S3 to expire old data from an existing running backup and still have something valid to restore from. So by recreating the entire backup we can aggressively expire the old backups and keep our S3 costs down even though we’re snapshotting really frequently.
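Concretely, once the replacement backup has completed its first snapshot and is restorable, the retired container can just be removed wholesale, either via an S3 lifecycle rule on that day's prefix or with something like this (again just a sketch, and note it deletes the whole container):

```bash
# Destructive: removes all data in the retired backup container.
OLD_URL='blobstore://<host>/backup-<yesterday>?bucket=<bucket>'    # hypothetical
fdbbackup delete -d "$OLD_URL"
```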

We did a test where we snapshotted every 2 hours and restore was noticeably faster again (likewise if we restore just after a snapshot has completed), so it's still the mutation logs that take the majority of the restore time… We're still looking for a long-term solution for faster restores.