I’m currently testing FoundationDB’s backup and restore processes, and I’ve noticed that restoring a backup from S3 (approximately 250 GB of data) to a new cluster with the same configuration as the source cluster takes around 24 hours. I’m using the default backup and restore configurations, and the restored data comes from a backup that is less than 6 hours old.
More details:
FDB version - 7.1.43
Configuration:
Redundancy mode - double
Storage engine - ssd-2
Coordinators - 5
Usable Regions - 1
Cluster:
FoundationDB processes - 92
Zones - 8
Machines - 8
Memory availability - 7.2 GB per process on machine with least available
Fault Tolerance - 1 machines
Server time - 12/08/23 22:12:38
We have distributed 60 storage processes across three i3en.6xl nodes, while the remaining five i3.xl nodes primarily handle non-storage processes.
Additionally, I’ve noticed an interesting pattern of alternating high and low activity periods in the workload. The graphs below show intervals of intensive work followed by periods of reduced activity before the workload resumes its higher pace.
Is this expected, and are there specific parameters I can tune to speed up the restore process? Your guidance on potential optimizations would be greatly appreciated.
In your cluster configuration, an i3en.6xl has 24 vCPUs (so 12 physical cores) and 192 GB of RAM, and you are putting 30 Storage Servers on each of these machines. In my experience, this is too many. Each Storage Server gets only 0.4 physical cores and only 6.4 GB of memory, and that assumes the operating system uses no memory at all, which of course is not realistic. I suggest giving each Storage Server one physical CPU core and at least 12 GiB of --memory with 6 GiB of --cache_memory.
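For reference, that kind of layout can be expressed in foundationdb.conf with one fdbserver section per process. This is only a sketch; the ports, paths, and per-machine process count below are illustrative rather than taken from your cluster:

```
# Sketch of a per-machine foundationdb.conf for a 12-core storage host,
# giving each Storage Server one core, 12GiB of memory and 6GiB of page cache.
[fdbserver]
command = /usr/sbin/fdbserver
datadir = /var/lib/foundationdb/data/$ID
logdir  = /var/log/foundationdb

# One Storage Server per physical core; ports 4500+ are illustrative.
[fdbserver.4500]
class        = storage
memory       = 12GiB
cache_memory = 6GiB

[fdbserver.4501]
class        = storage
memory       = 12GiB
cache_memory = 6GiB

# ... one section per remaining storage port ...
```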
If there are enough Backup Agents on the cluster (2x the TLog count should be enough), then Restore will ideally saturate the destination cluster with write traffic, so having a properly configured cluster is important. Each of your TLog processes should also have its own physical CPU core, and if you see them near 100% CPU during restore, increase the number of TLogs in your cluster.
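Backup Agents are typically launched from foundationdb.conf as well; additional numbered backup_agent sections start additional agents on that host. A minimal sketch, assuming the default package paths, with the per-host count chosen only for illustration (size the cluster-wide total to roughly 2x your TLog count):

```
# Run several backup_agent instances on this host (count is illustrative).
[backup_agent]
command = /usr/lib/foundationdb/backup_agent/backup_agent
logdir  = /var/log/foundationdb

[backup_agent.1]
[backup_agent.2]
[backup_agent.3]
[backup_agent.4]
```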
Restore executes in a two-stage pipeline.
Load Stage
Snapshot and Mutation Log data from the backup are loaded into the database key space for some Version Range from the backup.
Backup Agents execute this stage directly so more of them will increase throughput. They commit many randomly ordered ~1MB chunks of sequential data to the cluster, which is essentially an ideal insert workload for FDB and will max out the cluster’s write bandwidth.
Apply Stage
The Commit Proxy applies the staged mutations for some Version Range from the backup.
The Backup Agent directs this stage with control transactions but the actual work is done on the first Commit Proxy.
The stages are executed in pairs, where the Load Stage for one Version Range is run at the same time as the Apply Stage for the previous Version Range. For various reasons it is unlikely that the paired stages will take the same amount of time, so when one finishes before the other you will see various cluster metrics cycle between different patterns.
Also, you should see greatly improved restore performance (and improved performance in general) if you configure your destination cluster to use the Redwood storage engine. In FDB 7.1, Redwood is named ssd-redwood-1-experimental, but it is production ready; Snowflake started using Redwood in production with FDB 7.1 without any issues.
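For example, on a new (empty) destination cluster the engine can be chosen at configuration time, and an existing cluster can be switched with a gradual data migration. A sketch using fdbcli, with the cluster file path assumed:

```
# New (empty) destination cluster: pick Redwood when configuring the database.
fdbcli -C /etc/foundationdb/fdb.cluster --exec "configure new double ssd-redwood-1-experimental"

# Existing cluster: switching the storage engine triggers a gradual data migration.
fdbcli -C /etc/foundationdb/fdb.cluster --exec "configure ssd-redwood-1-experimental"
```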
I’ve successfully cut the restore time to under 30 minutes with these suggestions. Thank you @SteavedHams! I’ve observed that the disks are fully utilized, which signals that we may have reached the cluster’s saturation point.
Sorry for the missing Redwood docs; I intend to write them in the next 6 weeks or so. At Snowflake, we’ve finished migrating our entire fleet to Redwood. With that done, I strongly recommend using Redwood for all production FDB deployments*. The documentation will give a design overview, but more importantly it will explain what users need to know about monitoring Redwood and migrating to it.
Redwood significantly outperforms the ssd-2 engine for all workloads we’ve tested, and I do not know of any workload where Redwood would not show at least some performance improvement. Compared to ssd-2, Redwood has lower CPU usage, lower disk IO, and lower read latency for the same workload and configuration.
*: More specifically, for all FDB deployments that use dedicated/reserved space for StorageServer processes. This is because Redwood does not shrink its data file when data is cleared; instead, it tracks and reuses the free space internally.
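One practical consequence is that OS-level disk usage may stay flat after large clears, while the reusable space shows up in the Storage Servers’ kvstore metrics in status json. A monitoring sketch, assuming jq is available and that these kvstore_* field names match your FDB version:

```
# Inspect per-Storage-Server on-disk vs. reusable space (kvstore_* field names assumed).
fdbcli -C /etc/foundationdb/fdb.cluster --exec "status json" \
  | jq '.cluster.processes[].roles[]
        | select(.role == "storage")
        | {kvstore_total_bytes, kvstore_used_bytes, kvstore_available_bytes}'
```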
As for FDB versions, I don’t work for Apple so I can’t speak to what is going on with the 7.1 and 7.3 releases or which versions they are running in production.
While I think it is likely that FDB 7.3 is suitable for production use, Snowflake has not used or tested that exact release, so I can’t say for sure based on evidence or experience. FWIW, Snowflake has deployed a sort of early variant of FDB 7.3 to its entire fleet. This version comes from the snowflake/release-71.3 branch (a strange name, for reasons I won’t go into), which was created from main at an earlier point than release-7.3. Since the official release-7.3 branch is newer than the Snowflake branch, I suspect it is suitable for production use, but nobody has published yet that they have proven it at scale.