Backup & restore performance tuning

I tested some FDB backup/restore scenarios for our dbs and found that backups were slow and restores very slow. I'd like to get some advice on performance tuning for backup/restore.

Let me first describe our example db cluster and how backup/restore were done.

  • The db contains 1.5TB KV data.
  • The cluster uses 2 FDB regions with 3 DCs, with triple replication mode.
  • It is deployed on Kubernetes with TLS enabled, 20 storage pods/nodes.
  • We need to back up to a remote storage target, as a K8s pod’s local storage cannot be guaranteed to persist when the pod goes bad. Only one type of remote storage is available in the RNPCI env where the cluster is deployed: an NFS drive (NetApp Filer).
  • Use dedicated pods/nodes (separate from data nodes) for backup agents.
  • When restoring, we disabled cross-DC replication first and further reduced replication to single mode. After the restore, we re-enabled triple mode and cross-DC replication (roughly the sequence sketched below).
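
Roughly, the sequence looked like this (a simplified sketch; the cluster-file path and backup URL are placeholders, and exact flags may differ by fdbcli/fdbrestore version):

```sh
# Before the restore: drop to one usable region and single replication.
fdbcli -C /etc/foundationdb/fdb.cluster --exec "configure usable_regions=1"
fdbcli -C /etc/foundationdb/fdb.cluster --exec "configure single"

# Run the restore (the database stays locked while it runs).
fdbrestore start -r file:///mnt/nfs/backups/mybackup \
    --dest_cluster_file /etc/foundationdb/fdb.cluster -w

# After the restore: back to triple replication and both regions.
fdbcli -C /etc/foundationdb/fdb.cluster --exec "configure triple"
fdbcli -C /etc/foundationdb/fdb.cluster --exec "configure usable_regions=2"
```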

Here are the best performance times we’ve achieved:

  • backup time: 8 hrs, at 3.1GB per min.
  • restore time: 27 hrs, at 0.93GB per min (minute, not second). At that point the db is operational in single-replica mode. After turning on triple mode and cross-DC replication, it took another 9 hrs to reach a fully redundant state.

The restore is too slow and we need to tune it. I tried parallel restore with multiple backup_agents on one node, and with multiple agents on 2 nodes. I want to see how the backup agents are collaborating, e.g., which agent is working and how fast, so that I can adjust the config. But I don’t see such info in the status output.

Is there a way to get performance data on each backup agent?

What are good methods to identify bottlenecks in the restore operation? Which is usually most crucial: I/O, CPU, network, or the number of backup agents?

What type of storage is usually faster for backup/restore? NFS, blob, others?

If you can share your backup/restore performance data and tips, I’d highly appreciate it. Please be specific about how you achieved your best performance, such as:

  • backup agent config (# of processes, # of nodes),
  • what type of storage destination is used,
  • whether you disable triple mode before restore and re-enable it after,
  • the env and topology your cluster is deployed in.

Thank you.

Leo

  • One backup agent per 6 fdb storage processes should perform well assuming their hardware and cluster connectivity is comparable to that of the cluster’s storage processes.

  • The backup agents get work randomly assigned; for best performance they should be essentially interchangeable (similar hardware, similar connectivity).

  • The backup medium (blob, filesystem) does not matter as far as FDB is concerned; only the speed of and connectivity to that medium matter, and in most instances this will not be the bottleneck.

  • Switching to single replication during restore is a good idea. Double is also an option, which still maintains some fault tolerance during the restore.

  • Backup agents can produce a trace log with the --logdir option, and there are trace events beginning with FileBackup and FileRestore that can tell you about how much work they are doing. But there really is no need to go here unless you suspect that your agents’ capabilities are extremely mismatched. For example, one backup agent with a much slower connection than all the others could spend a long time finishing the tasks it has been assigned long after all of the others have finished all other outstanding work.
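
For example (a rough sketch; paths are placeholders, and trace files are XML by default):

```sh
# Run an agent with trace logging enabled.
backup_agent -C /etc/foundationdb/fdb.cluster --logdir /var/log/fdb-backup-agent &

# Count the FileBackup*/FileRestore* trace events this agent has emitted,
# as a rough view of how much work it is doing.
grep -h -E -o 'Type="File(Backup|Restore)[^"]*"' /var/log/fdb-backup-agent/trace.*.xml \
  | sort | uniq -c | sort -rn
```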

Some more info about backup/restore performance limits:

The current default backup and restore tools operate mostly as clients of the database. This means that backup will not go faster than the read rate of your fdb cluster, and restore will not go faster than the write rate of your cluster. Adding more agents only helps up to these limits.

A backup that runs to a restorable state and then ends will write a randomly-ordered series of key space snapshots at different versions which collectively cover the entire range being backed up, plus the mutation log during that period. This should max out the read speed of a cluster as long as you have enough backup agents. The random ordering means the agents will spread the work evenly over all storage teams.
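
For example, a plain snapshot backup that runs until restorable and then stops can be started like this (destination URL and tag are placeholders; check fdbbackup --help for your version's exact flags):

```sh
# One-shot backup: without the continuous (-z) option it ends once a complete,
# restorable snapshot has been written.
fdbbackup start -C /etc/foundationdb/fdb.cluster \
    -d file:///mnt/nfs/backups/mybackup -t nightly -w

# Progress and recent agent errors for the tag.
fdbbackup status -C /etc/foundationdb/fdb.cluster -t nightly
```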

Restore will write all of a backup’s key range snapshots to the database and apply the backup’s mutation log to update them to the target version. This process normally maxes out at the rate at which a single fdb proxy can apply the backup’s mutation log, which is around 100MB/s on a well-tuned cluster. A backup from a mostly-idle cluster can restore faster than this since it will have very few mutations to apply which are subject to this bottleneck.

Coming in FDB 6.3 is a completely new restore process which eliminates the single-proxy bottleneck. It uses a set of processes called restore workers to apply the backup’s mutation log in parallel using large memory buffers and then commit the coalesced results to the database cluster in parallel. This should bring restore speeds much closer to the write speed of the cluster even for backups that have large mutation logs.


Hi Steve,

Thanks a lot for the tips and the info on the inner working of backup and restore. A couple of follow-up questions.

  1. On the number of backup agents: currently our backup agents use the same hardware config as storage nodes, with 3 CPUs. We are flexible to get more powerful nodes from the Kubernetes cluster. Let’s assume we expand to 40 storage processes and would like to use 7 backup agents (~40/6). Which of the following is better, or is there a difference?
  • using 3 nodes of 3 CPUs, with 3, 3, 1 backup agents on them.
  • using 1 node of 7 CPUs, with 7 backup agents.
  2. How do I detect and tune around the fdb proxy bottleneck? We are using fdb v6.2.

Appreciate it.

Leo

Another question:
FDB has a DR backup/restore option, where logs are continuously applied to the DR target cluster. Is it possible to configure a delay, so that logs are applied only after a certain amount of time?

This function is useful for remediating human errors like accidental deletion of data. MongoDB has a similar feature, called Delayed Replica Set Members.

Thanks.

Leo

Yet another question. FDB restore has the option to restore selected key ranges. I’m wondering what conditions need to be met to be able to perform a selective restore of a key range.

  • Can it be restored from a full db backup? Or must it be from a backup done on the same key range?
  • Can I keep the rest of the db (all keys out of the selected key range) unchanged and unaffected?
  • Any additional actions needed for full db consistency at the end?

Thanks.

Leo

You can restore a more restrictive set of key range(s) than what the backup contains, and leave the rest of the database untouched. At the end of the restore, the target range will have been restored to the restore target version.

There is one catch currently, however, that the database is locked during restore so you can’t continue to use the untouched ranges. This is so that clients do not write to the range being restored. It’s not uncommon for users to be unaware of all of the things using their database, so relaxing this part of the restore process is not a great idea, but in the future locks will likely be key range based and restore can just lock the ranges it will be restoring.
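
For example, a selective restore of a single range from a full backup might look like this (a sketch with placeholder paths and a hypothetical range; the exact -k range syntax may differ by version, so check fdbrestore --help):

```sh
# Restore only one key range; keys outside it are left untouched, but the
# database is locked for the duration of the restore.
fdbrestore start -r file:///mnt/nfs/backups/mybackup \
    --dest_cluster_file /etc/foundationdb/fdb.cluster \
    -k "app/orders/ app/orders0" -w
```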

I’m not aware of this currently being a feature, although it was discussed as a useful feature in the early days of DR. The idea was to pause the application of mutations on the DR side and let them buffer in the database (where they are initially stored before being applied), but I’m not sure if/when this would be implemented.

Just so I understand you correctly, you’re saying you want to have 40 storage processes but only 3 physical nodes with 3 CPUs each?

EDIT: I think I get it; these are just referring to the nodes running backup agents. The first option is likely to perform better. In networking, bandwidth isn’t a completely fungible commodity to be split over an arbitrary number of hosts. In other words, if you want to use X total bandwidth, even if X is less than the network interface maximum speed on a single host, you may not actually achieve X on a single host, but you probably will using multiple hosts, each using less bandwidth.

The 3 nodes or 1 node are for backup agents only, separate from storage processes. I was thinking of running as many backup agent processes on a node as it has CPUs. So I’ll start 3 backup agent processes on a node with 3 CPUs, and 7 on a node with 7 CPUs.

Sorry for not being clear on that.

Thank you for your replies to my other questions.

Sorry I was being slow, @alexmiller explained it to me and I updated my post before you replied.

I can see that more hosts (thus likely more network interface cards) have an advantage over fewer hosts in terms of accumulated network performance. Thanks, Steve.

Yes, and this sort of thing can be in play at multiple levels. Talking to the storage service (be it blob store or remote filesystem) from 3 client IPs could be better than just 1 if it tries to balance/limit bandwidth by IP to enforce some notion of fairness.

Hi folks, I have questions on fault tolerance with backup agents.

When we have multiple nodes with multiple agents on each node collaborating together,

  • What happens if one of the nodes or agents dies?
  • Will the backup job still produce a consistent and complete backup at the end?
  • Is there any effect on the restorability of the backup?

Also I’ve been wondering

  • Is there a leader/controller elected among the agents?
  • Through what mechanism do the agents collaborate?
  • Where is the metadata of backup/restore activities stored?
  • How do the agents use the metadata to divide and conquer?
  • Does each agent get the same amount of work pre-assigned, or is the workload dynamically adjusted?

Thank you.
Leo

Backup agents are all stateless workers that coordinate through the database to grab smaller tasks as part of the overall backup. If a backup agent dies, the work it was assigned will eventually be picked up and completed by some other backup agent.

That thread has links to the TaskBucket implementation in FDB that’s used to coordinate backup agent work, and the next post down has a link to something similar, but written in python (and smaller, clearer, and probably easier to quickly grok).

This is also all changing in 6.3: there will be a new backup implementation available that uses a different sort of worker, one that is instead recruited and managed as part of the cluster. You can read more about that in the design doc.

Backup agents can die at any time; this does not affect the integrity of the backup. As long as one backup agent is running and successfully connected to your cluster and backup storage medium, backup progress will be made.

Backup agents use a shared work queue stored in FDB. The queue stores tasks under different priorities. The agents use FDB transactions to “claim” a task exclusively for execution. That claim times out after a period, so while an agent is executing a task it routinely extends its claim, again with an FDB transaction. If the agent fails, it will no longer extend its task ownership claim. Another agent will then notice this (all agents check occasionally for expired task claims) and transactionally return the task to an executable state.

A backup begins as a single task, and its execution transactionally creates other tasks as it completes, and those tasks do the same. Some tasks that represent large amounts of work are split into many smaller tasks representing the same work in aggregate. Tasks are not “pre”-assigned to agents. Each agent has a number of task slots, determined by a knob, and when an agent has a free task slot it will look for work to claim as described above. The order in which tasks are claimed by agents for execution is random within a priority level.
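
If it helps to see the shape of that claim/extend/timeout cycle, here is a toy sketch using the fdb Python bindings. It is not FDB’s actual TaskBucket code; it ignores priorities and task creation, and the names (the toy_tasks subspace, the 30-second claim, agent_id) are made up for illustration:

```python
import time
import uuid

import fdb

fdb.api_version(620)
db = fdb.open()

# Toy queue: key = (task_id,), value = (owner_id, claim_expiry_epoch_seconds).
tasks = fdb.Subspace(("toy_tasks",))
CLAIM_SECONDS = 30


@fdb.transactional
def try_claim_task(tr, agent_id):
    """Transactionally claim one unowned-or-expired task; return its id or None."""
    now = time.time()
    for key, value in tr.get_range_startswith(tasks.key(), limit=100):
        task_id = tasks.unpack(key)[0]
        owner, expiry = fdb.tuple.unpack(value)
        if owner == b"" or expiry < now:  # free, or previous owner's claim expired
            tr[key] = fdb.tuple.pack((agent_id, now + CLAIM_SECONDS))
            return task_id
    return None


@fdb.transactional
def extend_claim(tr, task_id, agent_id):
    """Called periodically by the owning agent while it works on the task."""
    key = tasks.pack((task_id,))
    val = tr[key]
    if not val.present() or fdb.tuple.unpack(val)[0] != agent_id:
        raise RuntimeError("claim lost; another agent has taken the task over")
    tr[key] = fdb.tuple.pack((agent_id, time.time() + CLAIM_SECONDS))


@fdb.transactional
def finish_task(tr, task_id):
    """Delete a completed task (a real task would also enqueue follow-up tasks)."""
    del tr[tasks.pack((task_id,))]


if __name__ == "__main__":
    me = uuid.uuid4().bytes
    task = try_claim_task(db, me)
    if task is not None:
        extend_claim(db, task, me)  # in practice repeated on a timer while working
        finish_task(db, task)
```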

To be clear, the part that is changing is how mutations are written to the backup. Instead of the backup tasks writing the mutations themselves, one of the initial backup tasks will essentially configure a new database feature where the database will recruit “backup worker” roles and stream mutation logs more efficiently to a backup destination URL until it is told, by another backup task, to stop.

Everything else (backup start/stop/abort/pause, key range snapshots and metadata written to backup) remains the same.

Quite interesting algorithm for parallelism in backup/restore!

I figure that since KV pairs in FDB are more independent of each other than rows/tables in a relational db, which have rich data integrity constraints, we have more flexibility and creativity for parallel ops. Is that right?

I did a preliminary test with multiple backup nodes, but one of the nodes erred. Describing the backup returns “restorable”. Given the info you provided, I’ll try restoring it.

Thank you.

Yes, we need to improve the description around the “recent errors” output. They do not necessarily mean that anything is “wrong”, just that errors did occur during execution. The errors are only a problem if they are happening constantly and the backup is not making any progress.
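
A quick way to check both things is the backup tooling itself (placeholder URL and tag; a rough sketch):

```sh
# Is the backup restorable, and over what version range?
fdbbackup describe -d file:///mnt/nfs/backups/mybackup

# Is the tag still making progress, and what are the recent agent errors?
fdbbackup status -C /etc/foundationdb/fdb.cluster -t nightly
```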

Not quite, no. When you restore a backup in FDB, you restore to a specific version, and the entire key space that was backed up (all keys and values) is restored to that version. Keys and values from different parts of the keyspace are no more independent than, say, a set of tables or indexes in a relational model that are all restored to a specific consistent version.