Restore is slow and parallel restore doesn't achieve a performance boost

We set up a set of backup nodes/pods separate from the regular data nodes. We are very happy that backup with multiple backup nodes has sped up significantly. For example, with a 50GB cluster in the test env, a single backup node with multiple agents takes 9 min to back up; with 2 backup nodes it's 5 min; with 4 nodes it's 3 min. For a large prod db of 3TB, we were able to reduce the backup time from 13.5 hours to 2.5 hours with six backup nodes.
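
For reference, a setup like this is started roughly as follows; the cluster-file path, backup URL, and tag name are placeholders (a blobstore:// URL can be used instead of file:// for remote storage):

    # On each backup pod, run several backup agents against the same cluster file.
    for i in $(seq 1 6); do
        backup_agent -C /etc/foundationdb/fdb.cluster &
    done

    # Start a whole-database backup and check its progress.
    fdbbackup start  -C /etc/foundationdb/fdb.cluster -d file:///backups/testdb -t nightly
    fdbbackup status -C /etc/foundationdb/fdb.cluster -t nightly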

On the other hand, I cannot achieve any performance boost for RESTORE with N nodes. Restores are all very slow. For the 50GB db in triple replication mode, I tested 4 times, and each restore took 2 to 3 hours. The durations were similar whether it was one backup node or N backup nodes. It seems there is NO parallelism for restore.
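
For reference, these were plain whole-database restores started roughly along these lines; the backup URL and cluster-file path are placeholders, and exact option spellings can differ between releases (check fdbrestore --help):

    # Start a restore of the whole keyspace from the backup and wait for it to finish.
    fdbrestore start --dest_cluster_file /etc/foundationdb/fdb.cluster \
        -r file:///backups/testdb -w

    # Or poll progress from another shell.
    fdbrestore status --dest_cluster_file /etc/foundationdb/fdb.cluster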

I haven't tested restoring the large prod db with N backup nodes yet. In the past, I did a restore with 1 backup node, and it took 27 hours for 1.5TB.

The restore seems to be quite slow. Is this normal with the numbers above?

Why are the backup time and the restore time so different (3 mins vs 2.5 hours)?

Am I missing something with parallel restore? How do I troubleshoot it?

Thank you.
Leo

I assume you are running version 6.2, which still uses the old restore.
The old (currently released) restore can restore the key-value range files in parallel, but it has to apply the mutations in the log files sequentially.
The new restore (which will hopefully be released in 6.3 as an experimental feature) is a customized Spark-like system that can process all backup files in parallel and asynchronously, and pre-applies the mutations in memory so that the amount of mutation data applied to FDB is much smaller.

As to the speed, your experiment shows less than 10 MB/s of backup-data processing, which is very slow compared to what we have seen. The old restore should be able to process at least 50 MB/s (sometimes 100 MB/s) of backup data with many hosts (@SteavedHams do you know how many hosts are used for the existing restore?). The new fast restore, in a very preliminary experiment, shows 70 MB/s with only 13 processes and a 4-host FDB cluster. We are still experimenting with the new fast restore. The target speedup for the new fast restore is 5x over the current restore. (I'm trying to make it faster than that.)

@SteavedHams may want to chime in about the old (current) restore.

Yes, we are using v6.2.
Are the “key-value range files” the ones generated by the key-range options on the fdbbackup command?
We back up the whole db and do not use key-range backups, so the restore won't run in parallel in our case. Is that right?

I just had a new restore finish: 15GB of backup data restored in 15 min, which is about 16 MB/sec. For the large restore I mentioned, it's 1.5TB in key-value size (the backup size was bigger, around 1.7TB), and the restore speed for that one was about 17 MB/sec. Still well below 50 MB/sec.
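
(Those rates are just size divided by wall-clock time, e.g.:)

    # 15 GB restored in 15 min
    echo "15 * 1000 / (15 * 60)" | bc -l             # ~16.7 MB/sec
    # 1.7 TB of backup data restored in 27 hours
    echo "1.7 * 1000 * 1000 / (27 * 3600)" | bc -l   # ~17.5 MB/sec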

Can the performance of our remote storage be a limiting factor in restore? Will giving the backup agents more RAM help? Does the replication factor affect restore speed? (Is single replication faster than triple for restore?)

Thanks.

No.
Regardless of what range is being backed up, the backup process generates many small key range files representing small slices of the key range, and they are generated in a random order over time to spread the read workload over the storage servers in the cluster.

Restore loads, in parallel, all of the key range files within the restored range (in your case all of the database) but restricted to some range of versions, and after that a serial process applies mutations that your cluster executed during that version window. While that serial process is happening, the key range files for the next version range are being loaded.

This means that restore speed depends largely on how much your cluster was being mutated during the backup, as applying those mutations is serial.

The speed of the backup also affects restore speed because it affects how parallel the key range file load can be and how much mutation data needs to be applied.
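
One way to see how much snapshot data versus mutation-log data a backup contains, and which versions it is restorable to, is fdbbackup describe (the URL here is a placeholder; output details vary a bit by version):

    fdbbackup describe -d file:///backups/testdb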

Did you try restoring the backup of your 50GB db that only took 3 minutes to create? And what is your cluster configuration? Adding more logs during a restore can speed up performance, as the parallel loading of key range data could easily be log limited with too few logs. This is also true of the new restore process available in v6.3.
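
(The relevant settings, including redundancy mode, logs, and proxies, show up under cluster.configuration in status json, e.g. with jq:)

    fdbcli --exec 'status json' | jq '.cluster.configuration'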


@lehu BTW, if you are running on AWS, did you try the snapshot-based backup and restore? It can be much easier and faster.
The only drawback is that it does not support point-in-time restore.

Steve, quite interesting.

For the test dbs, I made backups after the data loading finished. No changes to the db when backups were taking place. I assume that means there were no mutations during the backup window.

I first loaded 50GB of data, and tested backups and restores. Since the restores took 2+ hours, I shrank the db to 15GB by clearing ranges of keys. After that, I tested again.

Here is the process count report of my cluster config:

 CNT Role
---- -------------
   1 cluster_controller
   3 coordinator
   1 data_distributor
   5 log
   1 master
   3 proxy
   1 ratekeeper
   1 resolver
  30 storage

I have 5 Tx log processes on 5 different pods. Are they enough?

I created 6 backup pods with 6 agents on each pod, for a total of 36 agents. I checked the status json output, and all 36 agents are connected to the db:

"cluster" : {
    "clients" : {
        "count" : 1,
        "supported_versions" : [
            {
                "client_version" : "Unknown",
                "connected_clients" : [
                    {
                        "address" : "10.69.196.45:54386:tls",
                        "log_group" : "default"
                    },
                    ...... (35 more similar entries)
                ],
                "count" : 36,
                "protocol_version" : "Unknown",
                "source_version" : "Unknown"
            },
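
(A quick way to count the connected agents from that same status json output, assuming jq is available:)

    fdbcli --exec 'status json' \
        | jq '[.cluster.clients.supported_versions[].connected_clients[]] | length'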

Anything I should change?
Thanks.

Meng, we are not using AWS, because eBay doesn't like them. :)

Currently we deploy fdb to eBay's customized Kubernetes clusters, with local-ssd PVCs as data storage. I'm not sure if the local-ssd supports snapshots.

In general, what are the prerequisites for snapshot-based backup and restore, in terms of filesystem, local vs. remote storage, etc.?

Question about fdb's point-in-time restore: what's the time window of “point in time”? Is it the same as the backup window, or larger? For example, if I started my backup at 1am today and it finished at 3am, is the time window from 1am to 3am today, from which I can choose a time point like 2am to restore? Or can I choose a time point before the backup started, like 9pm yesterday?

Thank you.

Haha, I see.

The snapshot-based backup and restore requires a filesystem that supports snapshots. It basically snapshots the storage and tLog files on each disk for backup, and restores those files later for restore.

About point-in-time restore, its contract is: once you have a restorable point (which is represented as a version number), you can restore to any version equal to or higher than that restorable point, assuming you don't delete backup files that have data after the restorable point.
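
For example, you can check the restorable version of a backup with fdbbackup describe and then ask fdbrestore for a specific version at or after it. The URL and version below are placeholders, and the target-version flag name may differ slightly by release, so check fdbrestore --help:

    # Show the backup's snapshots and the version range it can be restored to.
    fdbbackup describe -d file:///backups/proddb

    # Restore to a chosen version >= the restorable point.
    fdbrestore start --dest_cluster_file /etc/foundationdb/fdb.cluster \
        -r file:///backups/proddb -v <TARGET_VERSION> -w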

It takes a while (hours) to reach the restorable point. The backup agents have to use transactions to take a snapshot of the entire key space. Because a transaction has a size limit, each transaction can only snapshot a sub-keyspace. The time depends on how large your DB is and how many backup agents you have: the more backup agents, the more snapshots you can take in parallel, and the faster you can snapshot the entire keyspace.

Note that the snapshots of the different subspaces are taken at different versions, so simply concatenating them does not produce a consistent view of the entire keyspace. That's why we also need to capture all mutations once the backup starts.

Just to clarify, for reading key range snapshots it's the transaction time limit that matters, which I guess is sort of a size, just in seconds. :)

Correct, your mutation log would be nearly empty, and your restore speed should go as fast as your cluster can write.

Based on 50GB and 2 hours, and a negligible amount of mutations, you are getting around 7 MB/s. This is far too slow; I'm not sure what the issue is. A few things to check/try:

  • Look at cluster status during the restore: are any processes CPU bound?
  • Check your hosts during the restore: are any of them over-utilized? For example, if you have 10 processes on a 4-core system, no single process will show high CPU, but only because there isn't enough CPU to go around.
  • Check that your processes are successfully using unbuffered Linux async IO. You can tell this by looking for TraceEvents called AsyncFileKAIOOpen and AsyncFileKAIOOpenFailed. You should have many of the former and none of the latter.
  • Switch to single or double replication during the restore.
  • Try more proxies and more logs, perhaps 5 proxies and 8 logs. (See the example commands after this list.)
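
A rough sketch of what the last three checks/changes look like in practice (the trace-log path is the default and may differ in your deployment; the fdbcli syntax is for 6.2):

    # KAIO check: expect many AsyncFileKAIOOpen events and no AsyncFileKAIOOpenFailed events.
    grep -c 'Type="AsyncFileKAIOOpen"'       /var/log/foundationdb/trace.*.xml
    grep -c 'Type="AsyncFileKAIOOpenFailed"' /var/log/foundationdb/trace.*.xml

    # Temporarily drop to single replication for the restore, then switch back afterwards.
    fdbcli --exec 'configure single'
    # ... run the restore ...
    fdbcli --exec 'configure triple'

    # Add more proxies and logs.
    fdbcli --exec 'configure proxies=5 logs=8'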

Hi Steve, I went through the items in your list one by one. We are fine with the first 4 items, but I expanded proxies and Tx logs, and I've made some progress with restores. For the 50GB db,

  • It used to be 2+ hours a few days ago.
  • Yesterday's restore was 1 hr 10 min. I didn't change anything for this restore itself, but we happened to upgrade the Kubernetes cluster by 3 minor versions 2 days ago. The upgrade could have improved the speed.
  • Today it was 55 min, after adding more Tx logs and proxies to the fdb cluster following your advice.

I found something interesting on our Tess internal wiki page about K8s volumes: local-dynamic is 3 times slower in IO performance than local-ssd (10K IOPS for local-dynamic vs. 30K for local-ssd). In the test K8s cluster I have used local-dynamic; for the big prod db we use local-ssd. So the test db has slower SSDs than the prod db.

What a coincidence! Could the 3x slower performance of local-dynamic in the test env be (at least partially) responsible for our restore performance being roughly 3x slower than Apple's 50 MB/sec?
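
A quick way to double-check those volume numbers on a given PVC is a small random-read fio run (the test-file path is just a placeholder on the data volume):

    # 4K random reads with direct IO against the data volume.
    fio --name=randread --filename=/var/lib/foundationdb/data/fio-test \
        --rw=randread --bs=4k --size=1G --runtime=60 --time_based \
        --ioengine=libaio --direct=1 --iodepth=32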

I'll find out when doing restore testing with the prod db. We have also planned the K8s upgrade for that prod K8s cluster, so I will test after the upgrade.

Thank you.

Leo