Faster restores of an FDB cluster?

danm · February 22, 2024, 2:49pm

We’re backing up our FDB 7.1 cluster directly to S3 via fdbbackup. Recently, as we load more data into our system, we’re seeing much longer restore times than we’d like. The K/V size is not growing much, but we are seeing a lot of ‘churn’ on certain key prefixes, which means the backup log bytes are about 10x the K/V bytes.

I am aware of a faster backup and restore method which seems to have been introduced back in 6.x. We tried it on 7.1 and it completely destroyed our source cluster, and we don’t really understand why. The cluster locked up, no transactions could be run against it, and we couldn’t stop the backup process to make the cluster usable again. We ended up having to spin up a replacement cluster and restore our data using the old/slow method. This was all in our non-prod environment, so it wasn’t a massive issue. But since then I’ve seen various other posts on here about people testing that backup method on 6.3, 7.0, and 7.1, and no-one ever mentioned the cluster locking up like we saw…

Regardless, my understanding is that even that method would only give an ~2x speed increase over the existing version, which is still going to very quickly blast through our RTO with any reasonable data volumes. How are other people speeding up restores?
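To make the RTO concern concrete, here is a rough back-of-the-envelope sketch in Python. It assumes restore time is simply the total bytes to apply (K/V snapshot plus backup log) divided by a sustained apply rate, which is a simplification. The function name, the 1 TB data size and the 100 MB/s rate are illustrative assumptions, not measurements from our cluster. Only the 10x log ratio and the ~2x speedup come from the numbers above.

```python
def estimate_restore_hours(kv_bytes, log_to_kv_ratio, apply_bytes_per_sec, speedup=1.0):
    """Rough restore time, assuming every snapshot and log byte is applied at one rate."""
    if apply_bytes_per_sec <= 0 or speedup <= 0:
        raise ValueError("throughput and speedup must be positive")
    # Total work is the K/V snapshot plus the backup log, expressed via the ratio.
    total_bytes = kv_bytes * (1 + log_to_kv_ratio)
    seconds = total_bytes / (apply_bytes_per_sec * speedup)
    return seconds / 3600


# Illustrative inputs only: 1 TB of K/V data, log bytes 10x that, 100 MB/s applied.
baseline = estimate_restore_hours(1e12, 10, 100e6)
doubled = estimate_restore_hours(1e12, 10, 100e6, speedup=2.0)
print(f"baseline: {baseline:.1f} h, with 2x speedup: {doubled:.1f} h")
```

Under this simple model the log bytes dominate the total, so a 2x faster restore only halves a number that keeps growing with churn.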

So far we’ve identified a cluster makeup that seems to improve things. We deploy as single replica instead of three_data_hall, with 3x more storage nodes and additional commit proxies. Increasing beyond those numbers doesn’t seem to give us a noticeable performance uplift, at least currently. Once the data is on the cluster we convert to three_data_hall and then migrate data via exclusions and scale our nodes back down to get to a more ‘reasonable’ size/cost of cluster that can handle our traffic levels.
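The sequence above could be sketched roughly as follows. This is a hypothetical helper that only builds the command lines and does not run anything. The function name, cluster file path, backup URL and addresses are placeholders. The `fdbrestore` flags shown (`-r`, `--dest-cluster-file`, `-w`) are my understanding of the 7.1 CLI and should be checked against `fdbrestore --help` for your version.

```python
import shlex


def post_restore_plan(cluster_file, backup_url, temporary_storage_addrs):
    """Return the commands, in order, as argv lists; nothing is executed here."""
    fdbcli = ["fdbcli", "-C", cluster_file, "--exec"]
    plan = [
        # 1. Restore into the wide, single-replica cluster and wait for completion.
        #    Flag spellings are an assumption; verify with `fdbrestore --help`.
        ["fdbrestore", "start", "-r", backup_url,
         "--dest-cluster-file", cluster_file, "-w"],
        # 2. Switch to the production redundancy mode once the data is loaded.
        fdbcli + ["configure three_data_hall"],
    ]
    if temporary_storage_addrs:
        # 3. Exclude the temporary storage processes so their data migrates off
        #    before the nodes are scaled back down.
        plan.append(fdbcli + ["exclude " + " ".join(temporary_storage_addrs)])
    return plan


# Placeholder values for illustration only.
example_plan = post_restore_plan(
    "/etc/foundationdb/fdb.cluster",
    "blobstore://<placeholder>",
    ["10.0.0.11:4500", "10.0.0.12:4500"],
)
for argv in example_plan:
    print(shlex.join(argv))
```

Keeping the plan as data makes it easy to review the order of steps before wiring it into whatever actually runs the commands.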

Has anyone else found other ways to improve the restore speed? We have noticed that the throughput of the cluster nodes when rebalancing data seems to be significantly higher/faster than pulling that data from S3, so we were wondering if a mounted ‘backup’ volume which we periodically snapshot to S3 or similar might improve matters. Has anyone tried anything like that?

We’re also not yet using the redwood storage engine. Is there anything in there that would improve restore performance?
