Design and Implementation of a Performant Restore System in FDB


(Meng Xu) #1

Note:
(1) This is a design doc. for an upcoming draft PR on GitHub. After we figure out where we should put the design docs in a collaborative site, we will move this design doc. there.
(2) This design doc. is a draft version and omitted a lot of details. It will also be revised as we work towards a better implementation.

Abstract

This is a draft design of a performant restore system that restores backup data to a FDB cluster. The new restore aims to be 2x faster than the current partial-parallel restore system in FDB. The new restore system follows the design idea of Spark system: it has one coordinator (also called master), multiple loaders and multiple appliers. The coordinator assign loader or applier role to fdb process and dispatch workload to them. Loaders parse backup data into mutations from blob storage and send them to appliers. Each applier is responsible for a range of keys, sort the mutations in increasing order of mutation version and apply mutations in order.

Problem statement

Database administrator should be able to restore the database to a point in time at the event of mis-operation. The restore procedure has the following requirement:
(1) The restored database must be consistent: each restored state (i.e., key-value pair) at a version v must match the original state at version v;
(2) The administrator should be able to restore to any given version after a restorable version v’, the version when the backup procedure takes a snapshot of the entire key space and records all mutations during the snapshot;
(3) The restore procedure should be performant: the bottleneck should be the FDB cluster’s write throughput instead of the restore procedure’s output throughput to the FDB cluster.

Design challenges

The consistent requirement raises several challenges to the restore procedure:
(1) The backup data includes not only the key-value pairs of the database’s state but also the mutations on the database during the backup procedure. To restore to a consistent state, the restore procedure must apply the mutations in the same order when the mutations were taken;
(2) Mutations include atomic operations, which are not idempotent. The restore procedure must apply each atomic operation exactly once in a distributed system environment where network packet may be delivered more than once and out of order;
(3) Memory is not large enough to hold all mutations before they are applied. The restore procedure must apply some mutations before all backup data are processed.

Design (Sketch)

System components. The fast restore system has 1 master, 1 to p loaders, and 1 to q appliers.

The master is responsible for the following functionalities: (1) discover the restore process and assign loader or applier role to each restore process; (2) collect backup files information; (3) distribute backup files as workload for loaders; (4) split the key space into q key ranges and assign a key range to a applier; (5) control the progress of loaders and appliers, such as driving them to a new processing phase, synchronizing nodes’ progress.

A loader is responsible for the following functionalities: (1) parses the range file and log files into a group of mutations at versions; (2) sends parsed mutations to appliers.

An applier collects mutations whose key belong to the applier’s responsible key range. The applier performs the following functionalities: (1) sorts mutations sent from loader in increasing order of versions; (2) applies mutations to database in increasing order of versions.

System control flow . The restore system has five major phases: (1) Setting up roles, which decides which restore process takes which restore role; (2) collecting, which collects the description (e.g., filename, file size) of all backup files from the backup url; (3) sampling, which samples a fraction of backup files to decide how to divide the key space into q ranges for the q appliers based on the access frequency of the key space; (4) loading, which uses p loaders to parse backup files into mutations in parallel and sends mutations to q appliers; (5) applying, which asks q appliers to apply received mutations to database in increasing order of mutations’ versions.

Setting up roles phase. Users start multiple fdbserver processes as restore processes. To decide which process is the master, each restore process tries to write its own restore interface into a special system key (say, masterKey). The succeeded one becomes the master, and the rest become the restore workers. Once the master is selected, the workers register their restore interfaces into a system key range (say, workers key range). The master reads all keys in the workers key range and gets the restore interfaces of all workers. Based on the user-defined ratio of loaders over appliers, the master decides the number of loaders and appliers and assign a role to each worker.

Collecting phase. The master uses the user-provided url to locate the root folder of the backup data. By listing all files recursively under the root folder, the master collects all backup files, group them to range files and log files based on the filename, and sort the files based on the begin_version of the files.

Sampling phase. The following steps are performed in the sampling phase: (1) the master picks an applier as the master-applier which receives the sampled mutations; (2) the master picks a data block in every sample_step, which is sample_rate * total_backup_data_size, and asks an idle loader to parse the data block into a group of mutations. The master repeats this step until all loaders are busy. When all busy loaders become idle again, the master repeats the step (2) ; (3) Once a loader receives the description of the data block (i.e., filename, file offset to read, and data size to read), it parses the data block into mutations, and sends each parsed mutation to the master-applier. The master-applier counts the number of operations for each key in its received mutations; (4) Once the master reaches the end of the last backup file, the master notifies the applier to calculate the key ranges such that the number of sampled operations is evenly distributed in these key ranges; (5) The master notifies each worker the key range of each applier.

Loading phase. The following steps are performed in the loading phase: (1) the master picks the unparsed file with the smallest begin_version, and notifies an idle loader to parse the file; (2) the loader decodes data blocks in the file into a set of mutations with versions. For each mutation, the loader uses the mutation’s key to determine which applier should receives the mutation, and the loader sends each mutation and the mutation’s version to an applier; (3) the applier puts each received mutation into a sorted map, which is indexed by mutations’ versions. The value at each version in the map is a vector of mutations happened at that version; (4) we repeats step (1) to (3) until all files are processed.

Applying phase. The master notifies all appliers to apply mutations in the increasing order of versions.

Discussion of alternative design: Disk-level snapshot based backup and recovery

This solution is proposed by Snowflake in PR1246.

This solution has two appealing strengths: (i) backup and recover can be very fast because taking a snapshot on a disk or restoring a disk snapshot is very fast; (ii) implementation does not need to consider the data format stored on a disk, which significantly simply the restore process.

However, this solution also has some obvious limitations, as acknowledge by the author: (i) it cannot be used in a system whose disk does not support snapshot. This limits its applicability; (ii) it can restore the system only to the version when the snapshot is taken. There exists an arbitrary time interval between two snapshots. So it cannot support point-in-time restore; (iii) it does not support FDB system that has multiple files (e.g., tLog files and storage server files) on the same disk.

If a FDB system fells in any of the following category, it should consider the proposed restore system in this article:
(1) it runs on disks without snapshot capability; (2) it requires point-in-time restore; (3) a disk may have more than one FDB files.


(Alex Miller) #2

Did I somehow miss the explanation of how the proposed design accomplishes that? Overall, you don’t seem to discuss the failure cases much, which I’d be interested in hearing about.

Do you mean as --role restore or --class restore ? Starting multiple fdbserver processes on different machines feels like a difficult tool to use. Would it be better to just recruit these as roles on stateless processes?

Does this effectively start to degrade into having a single applier at a time if the workload against the database largely writes sequentially increasing keys (like versionstamps)? (This might tie into my confusion about are there multiple iterations of this process.)

… and then we repeat the loading and applying phase with the next set of mutation files? I’m missing how this doesn’t end up with each applier having the entire copy of its part of the key range in memory at once.

It does support it, it just complicates the design. It required being able to temporarily disable popping of transaction logs, but works after doing that.


(Meng Xu) #3

No you didn’t miss it. I didn’t write how to handle the fault yet.
Right now, we only handle network package delay (or temporary loss). We haven’t considered the node failure yet. We want to get a non-fault-tolerant working version first to see how good performance we can have, before we make it more complex.

It’s --class restore. The restore system is separate from the DB cluster on purpose. The restore system can be viewed as a crazy client that dump data to the cluster at full speed.

For the purpose of reducing the workload of cluster controller, we do not make it a process monitored by cluster controller.

In the future, we may consider put the master restore process monitored by cluster controller so that it is easier to handle node failure.

The current version cannot parallelize the operations on the same key onto multiple appliers. This is a possibly valid point for future improvement. The benefit will depend on how frequent a single key can be written. It is better to make the decision after we have a working version and have some performance profiling on real data.

The sampling has multiple iterations. I didn’t mention that we actually process backup data in version batches. Each version batch is the backup data in a version range. In each version batch, we sample the data.

The concern I have is not we don’t sample the workload for multiple iteration but the opposite. Sampling takes time. The performance overhead of multiple sampling may be more than the performance benefits. Again, this needs some performance profiling on real data.

Once an applier applies the mutations to DB, it can delete its cached mutations.

Yes, when the disk-snapshot based restore supports the multiple files on the same disk, I will update my document.

The discussion of alternative design is more for people to choose which solution to use that defending which one is better. :wink: