I tested some FDB backup/restore scenarios for our dbs and found that backups were slow and restores were very slow. I'd like some advice on performance tuning for backup/restore.
Let me first describe our example db cluster and how the backup and restore were done.
- The db contains 1.5 TB of KV data.
- The cluster uses 2 FDB regions across 3 DCs, in triple replication mode.
- It is deployed on Kubernetes with TLS enabled, on 20 storage pods/nodes.
- We need to back up to a remote storage target, as a K8s pod’s local storage is not guaranteed to persist when the pod goes bad. Only one type of remote storage is available in the RNPCI env where the cluster is deployed: an NFS drive (NetApp Filer).
- We use dedicated pods/nodes (separate from the data nodes) for the backup agents.
- Before restoring, we disabled cross-DC replication and then reduced redundancy further to single mode. After the restore, we re-enabled triple mode and cross-DC replication.
Here are the best performance times we’ve achieved:
- backup time: 8 hrs, at about 3.1 GB per minute.
- restore time: 27 hrs, at about 0.93 GB per minute (minute, not second). At that point the db is operational in single-replica mode. After turning on triple mode and cross-DC replication, it takes another 9 hrs to reach a fully redundant state.
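For reference, this is how the rates above are derived. It is only the arithmetic, assuming 1 TB = 1000 GB and that the full 1.5 TB is what gets moved in each direction; the helper name `gb_per_min` is just something I made up for this post.

```python
# Figures quoted in this post; 1 TB is taken as 1000 GB (an assumption).
DB_SIZE_GB = 1.5 * 1000
BACKUP_HOURS = 8
RESTORE_HOURS = 27
REREPLICATION_HOURS = 9  # time to get back to triple mode + cross-DC


def gb_per_min(size_gb, hours):
    """Average throughput in GB per minute for a transfer lasting `hours`."""
    return size_gb / (hours * 60)


backup_rate = gb_per_min(DB_SIZE_GB, BACKUP_HOURS)
restore_rate = gb_per_min(DB_SIZE_GB, RESTORE_HOURS)

# Convert GB/min to MB/s to compare against NFS or network limits.
print(f"backup:  {backup_rate:.2f} GB/min ({backup_rate * 1000 / 60:.0f} MB/s)")
print(f"restore: {restore_rate:.2f} GB/min ({restore_rate * 1000 / 60:.0f} MB/s)")
print(f"restore + re-replication: {RESTORE_HOURS + REREPLICATION_HOURS} hrs")
```

In per-second terms that is roughly 52 MB/s for the backup and 15 MB/s for the restore, and about 36 hrs in total before the cluster is fully redundant again.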
The restore is too slow and we need to tune it. I tried parallel restore with multiple backup_agent processes on one node, and with multiple agents across 2 nodes. I want to see how the backup agents are collaborating, e.g., which agent is working and how fast, so that I can adjust the config. But I don’t see such info in the status output.
Is there a way to get performance data on each backup agent?
What are good methods to identify bottlenecks in the restore operation? Which is usually the most critical: I/O, CPU, network, or the number of backup agents?
What type of storage is usually faster for backup/restore? NFS, blob, or something else?
If you can share your backup/restore performance data and tips, I’d highly appreciate it. Please be specific about how you achieved your best performance, such as:
- backup agent config (# of processes, # of nodes),
- what type of storage destination is used,
- whether you disable triple mode before restore and re-enable it after,
- the env and topology in which your cluster is deployed.