Differentiating between primary cluster and cluster restored from snapshot

If cluster A is periodically taking snapshots, at some point in time we may want to restore one of these snapshots to cluster B. By the design of the snapshot feature, the restored cluster’s keyspace looks exactly like that of the original cluster. Unlike with traditional backups, even the system keyspace is identical on cluster B. This means we cannot use any system keys to differentiate between a cluster being restored from a snapshot and a cluster that is going through recovery while being snapshotted. There are several reasons we’d like to distinguish between these cases:

  • The system keyspace used by backup agents is no longer relevant to cluster B, and should be cleared before backup agents start reading this keyspace (Issue 3873)
  • We need to do additional setup (e.g. applying an incremental backup) to cluster B before it is ready to handle client workload. Ideally we could lock cluster B in the initial recovery transaction, but we also need to avoid locking cluster A if it happens to go through a recovery while snapshotting.

To get around the above two issues, our current approach is to solve these problems at the operational level rather than natively in FoundationDB. However, we were wondering if anyone has ideas for handling this differentiation natively in FoundationDB? Thank you.

I assume that when an operator wants to restore the snapshot backup to cluster B (the restore destination cluster), they need to copy the backup files (i.e., the SS, tLog, and coordinator files) to cluster B.

If so, could the operator also create a dummy file on each host of cluster B? When an fdbserver worker starts, it scans the files on disk. If it sees such a file, it knows the data came from a backup.
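A minimal sketch of that marker-file check, in Python for illustration (the marker filename and the helper function are assumptions for this example; the real check would live in the fdbserver worker's startup path):

```python
import os

# Hypothetical marker filename; assumed to be created by the operator
# alongside the copied backup files on every host of cluster B.
RESTORE_MARKER = "RESTORED_FROM_SNAPSHOT"

def is_restored_from_snapshot(data_dir: str) -> bool:
    """Return True if this worker's data directory contains the
    operator-created marker file indicating a snapshot restore."""
    return os.path.exists(os.path.join(data_dir, RESTORE_MARKER))
```

On startup, a worker that finds the marker could then clear the backup agents' system keyspace and lock the cluster before accepting client traffic, while a primary cluster going through an ordinary recovery would never see the file.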

Following the same idea, you could also add a prefix to the backup files to distinguish them.
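The prefix variant could be sketched the same way (the prefix string and the helper are assumptions for this example, not an existing convention):

```python
import os

# Hypothetical prefix the operator would prepend to every file
# when copying a snapshot backup onto the hosts of cluster B.
BACKUP_PREFIX = "restored-"

def files_are_from_backup(data_dir: str) -> bool:
    """Return True if the data directory is non-empty and every file
    in it carries the restore prefix, i.e. the data was copied in
    from a snapshot backup rather than written by a live cluster."""
    files = os.listdir(data_dir)
    return bool(files) and all(f.startswith(BACKUP_PREFIX) for f in files)
```

One trade-off versus the dummy file: the worker would also need to strip or ignore the prefix when opening its data files, so the marker-file approach is probably simpler operationally.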

That is one solution that could work. Unfortunately, I think there’s no way around doing something at the operational level.

> there’s no way around doing something at the operational level.

Did you mean “there’s no way around without doing something at the operational level.” ?

If you need to copy backup files from one place (say S3) to your destination cluster, that is an operation you cannot avoid anyway. Adding one more dummy file per host does not seem like too much operational overhead on top of that.

Yes, I meant there’s no way to avoid doing something at the operational level. I agree, this should not be too much overhead, since there’s already a lot of operational effort involved in snapshot-restore.