Design and Implementation of a Performant Restore System in FDB

Note:
(1) This is a design doc for an upcoming draft PR on GitHub. After we figure out where to put design docs in a collaborative site, we will move this design doc there.
(2) This design doc is a draft version and omits a lot of details. It will be revised as we work towards a better implementation.

Abstract

This is a draft design of a performant restore system that restores backup data to an FDB cluster. The new restore aims to be 2x faster than the current partial-parallel restore system in FDB. The new restore system follows the design of the Spark system: it has one coordinator (also called the master), multiple loaders, and multiple appliers. The coordinator assigns the loader or applier role to each fdb process and dispatches workload to them. Loaders parse backup data from blob storage into mutations and send them to appliers. Each applier is responsible for a range of keys, sorts its mutations in increasing order of mutation version, and applies them in order.

Problem statement

A database administrator should be able to restore the database to a point in time in the event of a mis-operation. The restore procedure has the following requirements:
(1) The restored database must be consistent: each restored state (i.e., key-value pair) at a version v must match the original state at version v;
(2) The administrator should be able to restore to any given version after a restorable version v’, i.e., the version at which the backup procedure has taken a snapshot of the entire key space and recorded all mutations made during the snapshot;
(3) The restore procedure should be performant: the bottleneck should be the FDB cluster’s write throughput instead of the restore procedure’s output throughput to the FDB cluster.

Design challenges

The consistency requirement raises several challenges for the restore procedure:
(1) The backup data includes not only the key-value pairs of the database’s state but also the mutations made on the database during the backup procedure. To restore to a consistent state, the restore procedure must apply the mutations in the same order as they were originally made;
(2) Mutations include atomic operations, which are not idempotent. The restore procedure must apply each atomic operation exactly once in a distributed system environment where network packets may be delivered more than once and out of order;
(3) Memory is not large enough to hold all mutations before they are applied. The restore procedure must apply some mutations before all backup data are processed.

Design (Sketch)

System components. The fast restore system has 1 master, 1 to p loaders, and 1 to q appliers.

The master is responsible for the following functionalities: (1) discover the restore processes and assign the loader or applier role to each restore process; (2) collect backup file information; (3) distribute backup files as workload to loaders; (4) split the key space into q key ranges and assign a key range to an applier; (5) control the progress of loaders and appliers, such as driving them to a new processing phase and synchronizing their progress.

A loader is responsible for the following functionalities: (1) parse range files and log files into groups of mutations with versions; (2) send the parsed mutations to appliers.

An applier collects mutations whose keys belong to the applier’s key range. The applier performs the following functionalities: (1) sorts the mutations sent from loaders in increasing order of version; (2) applies the mutations to the database in increasing order of version.
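As a rough illustration of the applier’s bookkeeping described above, here is a minimal Python sketch (the real implementation is a Flow actor in C++); the mutation objects and the apply_to_db callback are hypothetical placeholders:

from collections import defaultdict

class Applier:
    """Buffers mutations for one key range and applies them in version order."""

    def __init__(self):
        # version -> list of mutations received at that version
        self.mutations_by_version = defaultdict(list)

    def receive(self, version, mutation):
        # Loaders send mutations for this applier's key range; they may
        # arrive out of order, so they are only buffered here.
        self.mutations_by_version[version].append(mutation)

    def apply(self, apply_to_db):
        # Apply all buffered mutations in increasing version order, then
        # drop them from memory (see the later discussion on version batches).
        for version in sorted(self.mutations_by_version):
            for mutation in self.mutations_by_version[version]:
                apply_to_db(version, mutation)
        self.mutations_by_version.clear()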

System control flow. The restore system has five major phases: (1) setting up roles, which decides which restore process takes which restore role; (2) collecting, which collects the descriptions (e.g., filename, file size) of all backup files from the backup URL; (3) sampling, which samples a fraction of the backup files to decide how to divide the key space into q ranges for the q appliers based on the access frequency of the key space; (4) loading, which uses p loaders to parse backup files into mutations in parallel and sends the mutations to the q appliers; (5) applying, which asks the q appliers to apply the received mutations to the database in increasing order of the mutations’ versions.

Setting up roles phase. Users start multiple fdbserver processes as restore processes. To decide which process is the master, each restore process tries to write its own restore interface into a special system key (say, masterKey). The one that succeeds becomes the master, and the rest become restore workers. Once the master is selected, the workers register their restore interfaces into a system key range (say, the workers key range). The master reads all keys in the workers key range and gets the restore interfaces of all workers. Based on the user-defined ratio of loaders to appliers, the master decides the number of loaders and appliers and assigns a role to each worker.
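The election and registration described above can be sketched with the FDB Python bindings (the real code is a Flow actor and uses reserved system keys; the key names and interface encoding below are made up for illustration):

import fdb

fdb.api_version(630)
db = fdb.open()

# Illustrative keys only; the actual design uses special system keys.
MASTER_KEY = b'restore/masterKey'
WORKERS_PREFIX = b'restore/workers/'

@fdb.transactional
def try_become_master(tr, my_interface):
    # The first restore process to write MASTER_KEY wins the election;
    # FDB's conflict resolution guarantees there is exactly one winner.
    if tr[MASTER_KEY].present():
        return False
    tr[MASTER_KEY] = my_interface
    return True

@fdb.transactional
def register_worker(tr, worker_id, my_interface):
    # Losers of the election register themselves as workers.
    tr[WORKERS_PREFIX + worker_id] = my_interface

@fdb.transactional
def list_workers(tr):
    # The master reads the whole workers key range to discover all workers.
    return [(kv.key, kv.value) for kv in tr.get_range_startswith(WORKERS_PREFIX)]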

Collecting phase. The master uses the user-provided URL to locate the root folder of the backup data. By listing all files recursively under the root folder, the master collects all backup files, groups them into range files and log files based on the filename, and sorts the files by their begin_version.
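A simplified sketch of this grouping and sorting step (the filename pattern used here is illustrative only; the real backup naming scheme carries more fields):

import re
from collections import namedtuple

BackupFile = namedtuple('BackupFile', 'name size begin_version is_range')

def collect_backup_files(listing):
    """listing: iterable of (filename, size) pairs from recursively listing
    the backup URL. Returns range files and log files, each sorted by
    begin_version."""
    range_files, log_files = [], []
    for name, size in listing:
        m = re.search(r'(range|log),(\d+)', name)
        if not m:
            continue  # skip files that are neither range nor log data
        kind, begin_version = m.group(1), int(m.group(2))
        f = BackupFile(name, size, begin_version, kind == 'range')
        (range_files if f.is_range else log_files).append(f)
    range_files.sort(key=lambda f: f.begin_version)
    log_files.sort(key=lambda f: f.begin_version)
    return range_files, log_files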

Sampling phase. The following steps are performed in the sampling phase: (1) the master picks an applier as the master-applier, which receives the sampled mutations; (2) the master picks one data block every sample_step bytes, where sample_step is sample_rate * total_backup_data_size, and asks an idle loader to parse the data block into a group of mutations. The master repeats this step until all loaders are busy. When the busy loaders become idle again, the master repeats step (2); (3) once a loader receives the description of a data block (i.e., filename, file offset to read, and data size to read), it parses the data block into mutations and sends each parsed mutation to the master-applier. The master-applier counts the number of operations for each key in the mutations it receives; (4) once the master reaches the end of the last backup file, it notifies the master-applier to calculate key ranges such that the sampled operations are evenly distributed across these key ranges; (5) the master notifies each worker of each applier’s key range.
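Step (4) above boils down to turning per-key operation counts into q roughly equal-weight ranges. A minimal sketch, assuming byte-string keys and ignoring corner cases such as having fewer sampled keys than appliers:

def split_key_ranges(key_counts, q):
    """key_counts: {key: number of sampled operations on that key}.
    Returns q contiguous (begin, end) key ranges whose sampled operation
    counts are roughly equal."""
    keys = sorted(key_counts)
    target = sum(key_counts.values()) / q
    boundaries = [b'']            # range i is [boundaries[i], boundaries[i+1])
    running = 0
    for key in keys:
        running += key_counts[key]
        if running >= target and len(boundaries) < q:
            boundaries.append(key)
            running = 0
    boundaries.append(b'\xff')    # end of the user key space
    return list(zip(boundaries[:-1], boundaries[1:]))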

Loading phase. The following steps are performed in the loading phase: (1) the master picks the unparsed file with the smallest begin_version and notifies an idle loader to parse the file; (2) the loader decodes the data blocks in the file into a set of mutations with versions. For each mutation, the loader uses the mutation’s key to determine which applier should receive it, and sends the mutation together with its version to that applier; (3) the applier puts each received mutation into a sorted map indexed by mutation version. The value at each version in the map is a vector of the mutations that happened at that version; (4) steps (1) to (3) are repeated until all files are processed.
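Step (2)’s routing decision is just a range lookup over the boundaries computed in the sampling phase. A sketch, reusing the Applier sketch above and assuming a hypothetical mutation object that exposes a .key attribute:

import bisect

def route_mutation(range_begins, appliers, version, mutation):
    """range_begins: sorted begin keys of the q applier ranges (the first
    entry is b''). appliers: one Applier per range, in the same order.
    Sends the mutation to the applier whose range contains mutation.key."""
    idx = bisect.bisect_right(range_begins, mutation.key) - 1
    appliers[idx].receive(version, mutation)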

Applying phase. The master notifies all appliers to apply mutations in the increasing order of versions.

Discussion of alternative design: Disk-level snapshot based backup and recovery

This solution is proposed by Snowflake in PR1246.

This solution has two appealing strengths: (i) backup and recovery can be very fast because taking a snapshot of a disk or restoring a disk snapshot is very fast; (ii) the implementation does not need to consider the data format stored on a disk, which significantly simplifies the restore process.

However, this solution also has some obvious limitations, as acknowledged by the author: (i) it cannot be used in a system whose disks do not support snapshots, which limits its applicability; (ii) it can restore the system only to the version when a snapshot was taken. Since there is an arbitrary time interval between two snapshots, it cannot support point-in-time restore; (iii) it does not support an FDB system that has multiple files (e.g., tLog files and storage server files) on the same disk.

If an FDB system falls into any of the following categories, it should consider the restore system proposed in this article:
(1) it runs on disks without snapshot capability; (2) it requires point-in-time restore; (3) a disk may have more than one FDB file.

Did I somehow miss the explanation of how the proposed design accomplishes that? Overall, you don’t seem to discuss the failure cases much, which I’d be interested in hearing about.

Do you mean as --role restore or --class restore? Starting multiple fdbserver processes on different machines feels like a difficult tool to use. Would it be better to just recruit these as roles on stateless processes?

Does this effectively start to degrade into having a single applier at a time if the workload against the database largely writes sequentially increasing keys (like versionstamps)? (This might tie into my confusion about are there multiple iterations of this process.)

… and then we repeat the loading and applying phase with the next set of mutation files? I’m missing how this doesn’t end up with each applier having the entire copy of its part of the key range in memory at once.

It does support it, it just complicates the design. It required being able to temporarily disable popping of transaction logs, but works after doing that.

No, you didn’t miss it. I haven’t written up how to handle faults yet.
Right now, we only handle network packet delay (or temporary loss). We haven’t considered node failures yet. We want to get a non-fault-tolerant working version first to see how good the performance can be, before we make it more complex.

It’s --class restore. The restore system is separate from the DB cluster on purpose. The restore system can be viewed as a crazy client that dumps data to the cluster at full speed.

To reduce the workload of the cluster controller, we do not make it a process monitored by the cluster controller.

In the future, we may consider having the master restore process monitored by the cluster controller so that it is easier to handle node failures.

The current version cannot parallelize the operations on the same key onto multiple appliers. This is a possibly valid point for future improvement. The benefit will depend on how frequently a single key is written. It is better to make that decision after we have a working version and some performance profiling on real data.

The sampling has multiple iterations. I didn’t mention that we actually process backup data in version batches. Each version batch is the backup data in a version range. In each version batch, we sample the data.

The concern I have is not that we don’t sample the workload for multiple iterations but the opposite. Sampling takes time. The performance overhead of multiple rounds of sampling may be more than the performance benefit. Again, this needs some performance profiling on real data.

Once an applier applies the mutations to DB, it can delete its cached mutations.
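Putting the version-batch answer and the memory point above together, the outer loop looks roughly like the sketch below; sample, load, and apply_to_db are hypothetical placeholders for the phases described in the design section, and Applier.apply() (see the earlier sketch) frees the cached mutations:

def restore(version_batches, loaders, appliers, master_applier, q, apply_to_db):
    # sample() and load() stand in for the sampling and loading phases.
    # Each version batch is fully processed before the next one starts,
    # which bounds how many mutations an applier holds in memory.
    for batch in version_batches:
        key_ranges = sample(batch, loaders, master_applier, q)  # sampling phase
        load(batch, loaders, appliers, key_ranges)              # loading phase
        for applier in appliers:
            # applying phase; apply() also clears the cached mutations
            applier.apply(apply_to_db)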

Yes, when the disk-snapshot based restore supports multiple files on the same disk, I will update my document.

The discussion of the alternative design is more to help people choose which solution to use than to defend which one is better. :wink:


Hi, all!

Right now the restore process is too slow, but it supports point-in-time restore.

New restore aims to be 2x faster than the current partial-parallel restore system in FDB
But 2x faster than now is still too slow.

So I have been thinking about faster alternatives.

  1. If the disk is without snapshot capability, then there are software solutions that provide such capability, for example LVM or disk virtualization.

Unfortunately, all snapshot solutions require a separate disk area for data differences. Because fdb rewrites all data blocks often, the size of this area becomes almost equal to the snapshot size. If we need several snapshots (e.g., daily), the additional disk space becomes too large.

  2. Disk-level snapshots are not the only possible solution. We can use file-level backup and restore, which is slower than snapshots but still much faster than fdbrestore. For example, copying the files or tar-gzipping them.

  3. If we have a 24*365 working production fdb cluster with constant data changes, then it is difficult to make a consistent copy of all files from all nodes at the same moment. But in the most common case there is a DR cluster applying mutations from the production cluster. My idea is to pause the DR cluster, make a consistent copy from it (using any technique, including a snapshot or a filesystem copy), and resume the DR cluster.

I’ve made some performance tests with fdb built from master.

I have a three-node fdb cluster with

  Sum of key-value sizes - 20.440 GB
  Disk space used        - 51.472 GB

I’ll call this cluster the primary.
I have a three-node DR fdb cluster.
I have another three nodes where I want to restore the copy of the primary cluster. I’ll call them the clone cluster.
Each of the 9 nodes has one fdbserver process and two backup agents running.

All 9 nodes in my test are virtual machines on the same host machine with a single NVMe SSD drive. The backup storage was on an external disk array attached via NFS.

I compared the time of backing up and restoring to the clone in different ways. Reading from and writing to the single physical SSD was the bottleneck in all test cases.

  1. File-level backup/restore
    1.1. Backup made from the DR cluster
fdbdr pause -s fdb-primary.cluster -d fdb-dr.cluster
ansible -i inventories/osamarin3-virt/hosts fdb_dr -m shell -a "systemctl stop foundationdb"
ansible -i inventories/osamarin3-virt/hosts fdb_dr -m shell -a 'tar -cv --use-compress-program=pigz -f /mnt/backup/fdb/cold/9/`hostname`.tar.gz -C /fdb/data .'
ansible -i inventories/osamarin3-virt/hosts fdb_dr -m shell -a "systemctl start foundationdb"
fdbdr resume -s fdb-primary.cluster -d fdb-dr.cluster

took 4 minutes 40 seconds. The total backup size (three tar.gz files) is about 12.2 GB.

1.2. Restore (I need to create symlinks for the backup files to match clone hostnames)

ansible -i inventories/osamarin3-virt/hosts fdb_clone -m shell -a "systemctl stop foundationdb"
ansible -i inventories/osamarin3-virt/hosts fdb_clone -m shell -a 'rm -rf /fdb/data/*'
ansible -i inventories/osamarin3-virt/hosts fdb_clone -m shell -a 'tar -xv --use-compress-program=pigz -f /mnt/backup/fdb/cold/9/`hostname`.tar.gz -C /fdb/data'
ansible -i inventories/osamarin3-virt/hosts fdb_clone -m shell -a "systemctl start foundationdb"
fdbdr abort -s fdb-primary.cluster -d fdb-clone.cluster --dstonly

took 5 minutes 44 seconds

  2. fdb-level backup/restore
    2.1. backup
fdbbackup start -C fdb-primary.cluster -d file:///mnt/synology/nfs/backup/fdb/hot/9 -w

took 6 minutes 35 seconds. The total size of the backup is about 20.4 GB.

2.2 restore

fdbrestore start --dest_cluster_file fdb-clone.cluster -r file:///mnt/synology/nfs/backup/fdb/hot/9/backup-2020-07-14-20-20-58.753553 -w

I hit an issue ('Corrupted backup data' in fdbrestore · Issue #3522 · apple/foundationdb · GitHub) when restoring, so I could not measure the exact restore time, but I can approximate it: the total backup size is 21900 blocks and the restore speed is about 500 blocks per minute, so restoring all 21900 blocks should take about 44 minutes.

44 minutes is much longer than 5 minutes 44 seconds.

So my proposal is:

  1. Make a continuous backup (we need only the mutation log part of it)
  2. Make restore points in any fast way (snapshots, tar-gz, etc.) on a periodic basis (daily, weekly, and so on)
  3. When we need a point-in-time recovery:
    3.1. Do a fast restore from the latest restore point before the target point in time.
    3.2. Apply the mutations made between the restore point and the required point in time from the continuous backup.
    The restore time will be much shorter than applying all the kv-pairs as mutations.

To implement this idea, fdbrestore should have, at a minimum, a special option --fromversion that forces it not to restore kvs from the backup but only to apply the mutation log from the specified version. At a maximum, it should detect that the destination database is a copy of the DR cluster at a certain version and only apply logs from that version.

@osamarin Thank you very much for the interesting idea and detailed description!

I will comment from the high level down to the details:

  1. Your proposal makes sense. I had a similar thought and filed an issue here: https://github.com/apple/foundationdb/issues/2127. I guess we are on the same page. :wink:

  2. I think the DR-based backup approach is an interesting idea and I personally like it (at least it is a good direction to explore).

However, it does have one major concern:
FDB now has a multi-region configuration (also called the fearless configuration), which no longer has a separate FDB cluster in the remote DC. The multi-region configuration will become the recommended configuration for high-availability services.
The DR-based backup and restore solution won’t work out of the box for the multi-region configuration.

  3. About the speed of file-based backup and restore in your evaluation: it is 12.2 GB of backup data and the restore time is 5m44s, so the restore speed is 12.2 * 1024 MB / 5m44s ~= 36 MB/s. The speed itself is still slow for a big backup, and the current fdb restore can reach up to 100 MB/s (IIRC).

Intuitively, I think the file-based restore should be much faster than the measured speed. Did you happen to test where the time is spent and if it is scalable to make the proposed restore faster?

  3. About the speed of file-based backup and restore in your evaluation: it is 12.2 GB of backup data and the restore time is 5m44s, so the restore speed is 12.2 * 1024 MB / 5m44s ~= 36 MB/s. The speed itself is still slow for a big backup, and the current fdb restore can reach up to 100 MB/s (IIRC). Intuitively, I think the file-based restore should be much faster than the measured speed. Did you happen to test where the time is spent and if it is scalable to make the proposed restore faster?

12.2 GB is the total size of the three tar.gz files. The total size of the uncompressed fdb files is about 80 GB.

  1. All nine nodes share common storage - a single SSD (all of them are virtual machines running on my workstation). The write speed to it was ~550 MB/s, and it limited the restore speed. If all the nodes had separate SSDs, then the solution would be faster. Unfortunately, I do not have a sufficient number of separate SSDs for testing scalability.
  2. The sizes of the files on the nodes are not equal, so the restore of two nodes finished earlier than that of the last one. 5m 44s is the restore time of the largest node.

Does it mean that the configuration with a separate DR cluster will no longer be supported?

The speed for 80GB (I assume it is GBytes instead of Gbits) is better than the current restore for sure.

Copying files should be much faster than writing transactions.

I assume DR will still be supported. I don’t know if it will be supported indefinitely.

We can run a multi-region configuration with DR for the same primary cluster. The cost is that supporting DR requires writing mutations to the system keyspace, which means 2x write amplification.

So theoretically, for a multi-region cluster, you can still use a temporary DR site to backup the database at the cost of reduced throughput (say with 30% write bandwidth overhead). The caveat is how fast can the DR backup finish without affecting the cluster’s normal traffic, especially when a big cluster is running at the edge of its read and write throughput.

I used decompression with tar x instead of plain copying. Decompression is slower, but otherwise the single network interface of my workstation becomes the bottleneck.

So theoretically, for a multi-region cluster, you can still use a temporary DR site to backup the database at the cost of reduced throughput (say with 30% write bandwidth overhead). The caveat is how fast can the DR backup finish without affecting the cluster’s normal traffic, especially when a big cluster is running at the edge of its read and write throughput.

If the backups are done on a regular basis, then this DR must be permanent rather than temporary.

Another restriction of DR-based backups is that the number of nodes and the distribution of fdbserver processes should be the same on the primary and DR clusters; otherwise it is impossible to do a file-based or snapshot-based restore.

But making a continuous backup also requires writing mutations to the system keyspace. Currently these two mutation streams are separate, which brings a significant overhead.

According to https://github.com/alexmiller-apple/foundationdb/blob/1547cbcef3e2916d742d37f17641eb84871ee51c/design/tlog-spilling.md.html
there is a capability to write the transaction log only once and retrieve it for all destinations by index.
It would be great if backup and DR could both use this mechanism without writing mutations to the system keyspace.

Actually, since FDB 5.2 the streams are no longer separate. Saved mutation streams for Backup and DR operations that cover the same key ranges are only stored once and are gradually cleaned up as all consumers are finished with the data.

I have some good news and bad news. The good news is that FDB 6.3 has a new feature for Backup which uses the log system to write the mutation stream to the backup, skipping the temporary copy staged in the database. The bad news is this feature has not been added to DR, so running a new-style backup and a DR will still incur the same overhead of saving the mutation log to the database (same overhead since the two streams were already shared, not duplicated).

Why not add this feature to DR?


This is a great idea! I think it is possible.

I will follow up internally to see if we can make this a release feature.

Adding this feature to DR needs a significant amount of work, including

  • the capability for one FDB cluster to write to another (so that mutations are not saved in the source cluster, only temporarily stored in its transaction logs);
  • combining multiple mutation streams into one at the destination cluster so that mutations are applied in version order. This step is actually more complicated, since the merging of mutation streams needs to handle duplicated and overlapped version ranges.

I am not saying we shouldn’t do it. It’s something not on the priority list yet.
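To make the second bullet above concrete, merging several version-ordered streams while skipping duplicated and already-applied versions could look like the sketch below; this is only an illustration of the problem, not the planned design, and each stream is assumed to yield (version, sub_sequence, mutation) tuples in increasing order:

import heapq

def merge_mutation_streams(streams, last_applied_version):
    # Merge the per-log streams by (version, sub_sequence) and drop anything
    # already applied or already seen from an overlapping version range.
    # A real implementation would bound the 'seen' set, e.g. per version.
    seen = set()
    merged = heapq.merge(*streams, key=lambda t: (t[0], t[1]))
    for version, sub, mutation in merged:
        if version <= last_applied_version:
            continue
        if (version, sub) in seen:
            continue
        seen.add((version, sub))
        yield version, sub, mutation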

the capability for one FDB cluster to write to another (so that mutations are not saved in the source cluster, only temporarily stored in its transaction logs);

How does fdbbackup work? Does the fdb cluster write mutations to the backup itself, or does it write them to backup agents?

I’d implement a ‘pull’ approach, where dr_agent would read the mutation log from the source transaction log instead of a system keyspace.

combining multiple mutation streams into one at the destination cluster so that mutations are applied in version order. This step is actually more complicated, since the merging of mutation streams needs to handle duplicated and overlapped version ranges

I think this is a good task for dr_agent: to pull the transaction logs from all source nodes and combine them.

This is a great suggestion! FDB does in fact have this feature coming: the implementation is done in 6.3 and our evaluation shows it works as expected. The details are in this issue: Backup the mutation log using the log router tags · Issue #1003 · apple/foundationdb · GitHub.

fdbbackup tells the cluster to perform a backup. Then the cluster starts recording mutations and writing them to its key space, and backup agents start taking key space snapshots and copying the recorded mutations to blob storage (e.g., S3).

Note the pull approach breaks an abstraction – transaction logs are never intended to be exposed to clients. By design, clients only talk to Proxies and Storage Servers for writes and reads, respectively. There are technical reasons as well:

  • Mutations are tagged with their destinations by Proxies. Transaction logs do the bookkeeping of which tags have been processed (popped) at what version. So pulling means assigning tags external to the cluster, which raises the question of how to maintain them.
  • If external tags are introduced, the immediate question is what happens if external processes are not pulling data fast enough, which can cause transaction logs to spill to disk. This may eventually cause transaction logs to fill up their disks and have a performance impact.

fdbbackup tells the cluster to perform a backup. Then the cluster starts recording mutations and writing them to its key space, and backup agents start taking key space snapshots and copying the recorded mutations to blob storage (e.g., S3).

Are you talking about writing to the system keyspace? But it contradicts the following statement:

FDB 6.3 has a new feature for Backup which uses the log system to write the mutation stream to the backup, skipping the temporary copy staged in the database

Does it mean that the backup agent reads transaction logs from the log system instead of reading the temporary storage in the system keyspace?