We’re backing up our FDB 7.1 cluster directly to S3 via fdbbackup. Recently, as we load more data into our system, we’re seeing much longer restore times than we’d like. The K/V size is not growing much, but we are seeing a lot of ‘churn’ on certain key prefixes, which means the backup log bytes are 10x more than the K/V bytes.
I am aware of a faster backup and restore method which seems to have been introduced back in 6.x. We tried it on 7.1 and it completely destroyed our source cluster, and we don’t really understand why. The cluster locked up, no transactions could be run against it, and we couldn’t stop the backup process to make the cluster usable again, even without backups. We ended up having to spin up a replacement cluster and restore our data using the old/slow method. This was all in our non-prod environment, so it wasn’t a massive issue. But since then I’ve seen various other posts on here about people testing that backup method on 6.3, 7.0, and 7.1, and no one ever mentioned the cluster locking up like we saw…
Regardless, my understanding is that even that method would only give a ~2x speed increase over the existing one, which would still very quickly blow through our RTO at any reasonable data volume. How are other people speeding up restores?
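To make the RTO concern concrete, here is a rough back-of-the-envelope sketch. It assumes restore time scales linearly with total bytes read (snapshot K/V bytes plus backup log bytes) at a flat sustained throughput, which is a simplification. The function name, the 500 GiB K/V size and the 100 MiB/s throughput are made-up illustration values, not measurements from our cluster; only the 10x log ratio and the ~2x speedup come from the numbers above.

```python
def estimate_restore_hours(kv_gib, log_ratio, throughput_mib_s, speedup=1.0):
    """Rough restore time in hours.

    kv_gib           -- size of the K/V snapshot data in GiB (hypothetical input)
    log_ratio        -- backup log bytes as a multiple of K/V bytes (10 in our case)
    throughput_mib_s -- assumed sustained restore throughput in MiB/s (hypothetical)
    speedup          -- multiplier for a faster restore method (~2 per my understanding)
    """
    total_mib = kv_gib * (1 + log_ratio) * 1024  # snapshot bytes + log bytes
    seconds = total_mib / (throughput_mib_s * speedup)
    return seconds / 3600


# Illustrative numbers only: 500 GiB of K/V, 10x log bytes, 100 MiB/s.
baseline = estimate_restore_hours(500, 10, 100)
faster = estimate_restore_hours(500, 10, 100, speedup=2.0)
print(f"baseline: {baseline:.1f} h, with 2x speedup: {faster:.1f} h")
```

Under these assumptions the log bytes dominate the total, so a 2x faster method halves the estimate but does not change the fact that the log volume, not the K/V size, is what drives it.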
So far we’ve identified a cluster makeup that seems to improve things. We deploy as single replica instead of three_data_hall, with 3x more storage nodes and additional commit proxies. Increasing beyond those numbers doesn’t seem to give us a noticeable performance uplift, at least currently. Once the data is on the cluster we convert to three_data_hall and then migrate data via exclusions and scale our nodes back down to get to a more ‘reasonable’ size/cost of cluster that can handle our traffic levels.
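For anyone wanting to try the same approach, the sequence above can be sketched as an ordered list of commands. This is only an outline of the order of operations, not our actual tooling: `restore_plan` is a hypothetical helper, the `ssd` storage engine, the backup URL, the cluster file path and the addresses are placeholder assumptions, and the exact flag spellings should be checked against `fdbrestore --help` and `fdbcli` help for your version.

```python
def restore_plan(backup_url, cluster_file, commit_proxies, temp_storage_addrs):
    """Return the shell commands for the restore procedure, in order.

    All arguments are placeholders supplied by the caller; nothing is executed here.
    """
    fdbcli = f"fdbcli -C {cluster_file} --exec"
    addrs = " ".join(temp_storage_addrs)
    return [
        # 1. Bring up the new, oversized cluster with a single replica.
        #    The "ssd" storage engine is an assumption.
        f'{fdbcli} "configure new single ssd"',
        # 2. Add extra commit proxies for the duration of the restore.
        f'{fdbcli} "configure commit_proxies={commit_proxies}"',
        # 3. Restore from the backup and wait for completion (-w).
        f"fdbrestore start -r {backup_url} --dest_cluster_file {cluster_file} -w",
        # 4. Convert to the target redundancy mode once the data is loaded.
        f'{fdbcli} "configure three_data_hall"',
        # 5. Exclude the temporary storage processes so data migrates off them
        #    before the nodes are scaled back down.
        f'{fdbcli} "exclude {addrs}"',
    ]


# Hypothetical values for illustration only.
plan = restore_plan(
    backup_url="blobstore://example-backup-url",
    cluster_file="/etc/foundationdb/fdb.cluster",
    commit_proxies=6,
    temp_storage_addrs=["10.0.0.11:4500", "10.0.0.12:4500"],
)
for step in plan:
    print(step)
```

The ordering is the point: the redundancy change and the exclusions only happen after the restore has finished, so the slow part runs against the widest, least-replicated cluster.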
Has anyone else found other ways to improve the restore speed? We have noticed that the throughput of the cluster nodes when rebalancing data seems to be significantly higher/faster than pulling that data from S3, so we were wondering if a mounted ‘backup’ volume which we periodically snapshot to S3 or similar might improve matters. Has anyone tried anything like that?
We’re also not yet using the Redwood storage engine. Is there anything in it that would improve restore performance?