Backup/restore fdb

We have a 3-node fdb cluster on version 6.1.8. We have configured backups to run on one of the nodes to the same local file system, and set up a cron job to run nightly backups and copy them to S3. We noticed that on some days the backups are not being taken. When I run the same script from the command line, I find the same issue: sometimes it backs up, sometimes I just see this:
[root@ip-10-152-41-88 backup]# /data/fdb_backup_script.sh
Submitted and now waiting for the backup on tag `default' to complete.
[root@ip-10-152-41-88 backup]# cd /data/

and it returns to the prompt with no backup.

Any idea what is happening here?

Are you seeing that some days the backup is correctly being taken?

We tried a couple more times and found that each time the backup is taken on one of the other nodes, not on the node from which we run the script.

On another occasion, when we started the backup from one node, we saw that it was created on the other 2 nodes with different KV ranges.

2nd node

/data/backup/fdb/backup-2019-09-19-20-47-19.201742/kvranges/snapshot.000024439621753268/0

total 3288
-rw-r--r--. 1 foundationdb foundationdb 1216640 Sep 19 20:47 range,24439621837969,8a1aefc523fb2abc0a3b2fd577ba4a64,1048576
-rw-r--r--. 1 foundationdb foundationdb 555264 Sep 19 20:47 range,24439621869660,fea39762bd513c2201554085620b905f,1048576
-rw-r--r--. 1 foundationdb foundationdb 1162069 Sep 19 20:47 range,24439621869660,3ccc375de709992ba2d7703cd857228a,1048576
-rw-r--r--. 1 foundationdb foundationdb 425289 Sep 19 20:47 range,24439621890133,403b4447d6778f7d0616634c1b0d36a6,1048576

3rd node

/data/backup/fdb/backup-2019-09-19-20-47-19.201742/kvranges/snapshot.000024439621753268/0
[root@ip-10-152-42-103 0]# ls -ltr
total 3328
-rw-r--r--. 1 foundationdb foundationdb 341880 Sep 19 20:47 range,24439621839159,a16530ad88b1024de7c74a9b5315bb5b,1048576
-rw-r--r--. 1 foundationdb foundationdb 562432 Sep 19 20:47 range,24439621859742,8bcd6f38fae68080920e188435836280,1048576
-rw-r--r--. 1 foundationdb foundationdb 499468 Sep 19 20:47 range,24439621892688,e39df635cc2bfa0784a1e221a6a6f4fc,1048576
-rw-r--r--. 1 foundationdb foundationdb 1996036 Sep 19 20:47 range,24439621859742,a3b53c7b6b036eb941cfe773de2270b7,1048576

You probably have a backup_agent running on each node, and thus you’re emitting backup files on the different nodes as they randomly claim responsibility for (parts of) the backup. It’d be best to just archive the backup files from all of your nodes together, so that you can get parallelism in your backup and not overload one disk with storing the whole database. If you need the backup emitted entirely in one node, then you’d want to run one or more backup_agent processes only on that one node.
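If you go the "archive the files from all nodes together" route, the collection step could look roughly like the sketch below. It assumes you have already copied each node's version of the backup folder to one machine; `merge_backup_dirs` and the directory names are hypothetical, not part of any FoundationDB tool.

```python
import shutil
from pathlib import Path


def merge_backup_dirs(node_dirs, merged_dir):
    """Copy every file from each node's copy of the backup folder into one tree.

    Returns the relative paths that were copied. Raises ValueError if two
    nodes hold a file with the same relative path but a different size,
    because then the copies disagree and should be inspected by hand.
    """
    merged = Path(merged_dir)
    copied = []
    for node_dir in map(Path, node_dirs):
        for src in sorted(p for p in node_dir.rglob("*") if p.is_file()):
            rel = src.relative_to(node_dir)
            dst = merged / rel
            if dst.exists():
                if dst.stat().st_size != src.stat().st_size:
                    raise ValueError(f"conflicting copies of {rel}")
                continue  # same file already collected from another node
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)
            copied.append(rel.as_posix())
    return copied
```

The merged tree is then the thing to archive, since a restore needs to see all of the files the backup wrote.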

So you mean that if I want the backup to run on only one node, I need to stop the backup agents on the other nodes? And in that case, will I have the entire data backup on that node?

Yes and yes.

Your fdb_backup_script.sh is requesting that a backup should be taken, but backup_agent is what actually takes the backup, done in chunks by different backup agent processes in parallel. If you need all of the backup files to be emitted in one node, then you need to only run backup_agents on that one node.
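To make the "request" half concrete, here is a minimal sketch of what a script like fdb_backup_script.sh might be doing, written in Python. It assumes the script calls `fdbbackup start` with `-d` (destination URL), `-t` (tag), `-w` (wait) and optionally `-C` (cluster file); the function names and the example URL are hypothetical, and I have not seen the actual script.

```python
import subprocess


def backup_start_command(dest_url, tag="default", cluster_file=None):
    """Build the argv for submitting a backup and waiting for it to finish."""
    cmd = ["fdbbackup", "start", "-d", dest_url, "-t", tag, "-w"]
    if cluster_file:
        cmd += ["-C", cluster_file]
    return cmd


def run_backup(dest_url, **kwargs):
    """Submit the backup; the backup_agent processes do the actual writing.

    Returns True if fdbbackup exited with status 0. Which node the files
    land on depends on where the agents run, not on where this is called.
    """
    result = subprocess.run(backup_start_command(dest_url, **kwargs))
    return result.returncode == 0
```

For example, `run_backup("file:///data/backup/fdb")` only submits and waits; with a `file://` destination, each agent writes to that path on its own node.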

OK thanks, will try that.

Also, I have some questions about restore. We take nightly backups and store them in S3. Is each backup a full, restorable backup, or is it incremental?

When I tried restoring from one backup that I had taken in the night, I saw these errors,
and then at the end I also saw 'Restored to version …

So does that mean it has restored completely, or is anything missing?

Tag: default UID: 7e3d4ff576a75bde06a30d9ca53bdd7e State: running Blocks: 10/10 BlocksInProgress: 0 Files: 7 BytesWritten: 5749811 ApplyVersionLag: 1341944 LastError: ''File not found' on 'restore_range_data'' 59s ago.
Tag: default UID: 7e3d4ff576a75bde06a30d9ca53bdd7e State: completed Blocks: 10/10 BlocksInProgress: 0 Files: 7 BytesWritten: 5749811 ApplyVersionLag: 0 LastError: ''File not found' on 'restore_range_data'' 60s ago.
Restored to version 24177538741943

Any suggestions on the restore error?

"Restored to version" is the version the DB will be restored to.
Only after you run your backup agents for a while, will the backup become restorable.
You can try the describe option of fdbbackup, the tool that triggers the backup.
The option will output a description of the backup. The restorable version is the minimum version your DB can be restored to.
The DB can be restored to any version after the restorable version, assuming you are not deleting the backup.

@SteavedHams Correct me if I’m wrong.

This is correct, and also if you are using fdbbackup expire to remove older data then the change in restorable version range will be reflected in subsequent fdbbackup describe operations.
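For a nightly job, one option is to check restorability before shipping the backup anywhere. The sketch below runs `fdbbackup describe -d <URL>` and looks for a `Restorable: true` line. The field names (`Restorable`, `MinRestorableVersion`, `MaxRestorableVersion`) are my assumption about the output layout and should be checked against what your 6.1.8 tools actually print; the function names are hypothetical.

```python
import subprocess


def parse_describe(text):
    """Collect 'Key: value' lines from describe output into a dict.

    Assumes one field per line; only the first occurrence of a key is kept.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields.setdefault(key.strip(), value.strip())
    return fields


def is_restorable(text):
    """True if the (assumed) 'Restorable' field says the backup is usable."""
    return parse_describe(text).get("Restorable", "").lower() == "true"


def describe(backup_url):
    """Run fdbbackup describe against a backup URL and return its stdout."""
    out = subprocess.run(
        ["fdbbackup", "describe", "-d", backup_url],
        capture_output=True, text=True, check=True,
    )
    return out.stdout
```

A script could then do `if is_restorable(describe(url)): ...` before uploading, and re-run it after any `fdbbackup expire`.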

@Minaxi Please see https://apple.github.io/foundationdb/backups.html for more details on how backup/restore and their tools work.

The error is saying that one of the backup_agent processes failed to open a file from the restore set at one point in time. Apparently - since the restore did succeed later - it or another backup_agent process was able to open the file eventually.

To be clear, the message indicating completion means the restore was completely successful, regardless of any errors that were encountered during the restore process.
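If you want a script to make the same judgment, it can key off the `State:` field of the status line rather than `LastError`. This is a minimal sketch that parses lines shaped like the two status lines posted above; the function names are hypothetical and the regexes only cover the fields visible in that output.

```python
import re


def parse_restore_status(line):
    """Pull the state, block counts and apply lag out of one status line."""
    state = re.search(r"State:\s*(\w+)", line)
    blocks = re.search(r"Blocks:\s*(\d+)/(\d+)", line)
    lag = re.search(r"ApplyVersionLag:\s*(\d+)", line)
    return {
        "state": state.group(1) if state else None,
        "blocks_done": int(blocks.group(1)) if blocks else None,
        "blocks_total": int(blocks.group(2)) if blocks else None,
        "apply_version_lag": int(lag.group(1)) if lag else None,
    }


def restore_finished(line):
    """A restore counts as done once State is 'completed', whatever LastError says."""
    return parse_restore_status(line)["state"] == "completed"
```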

If you would like to know which file could not be accessed, look at the trace logs from your backup_agent processes, but note that by default the agents do not write trace logs so you may not be generating any.

OK, good to know, that helps. Thanks all.

Just one question: what do you mean by “Only after you run your backup agents for a while, will the backup become restorable.”, @mengxu?

Does it mean after we take many backups?

No. It means that backups take time to finish writing enough data into the backup destination so that they are restorable. The work of backups is done by the backup agents.

I was trying to test the following scenario: taking local backups in a cluster and uploading them to S3. The entire backup folder is uploaded as a tar.gz. Then I downloaded this tar onto another cluster and tried restoring it; that's when I got the errors above. I heard from a colleague that there will be missing metadata when local backups are uploaded to S3 and later used for a restore.

Can you please explain this further? Also, when we upload to S3, we upload the complete backup with all folders tarred and zipped.

Backup destinations specified with file:// and blobstore:// are not compatible with each other. Someone recently asked about writing backups to local folders and then transplanting them to S3 and restoring from there (with a blobstore:// URL, I assumed). Currently that is not possible, and there are no plans to make it possible as you can and should backup directly to S3 instead.

So to clarify, you can NOT do this:

  1. Create a backup with a file:// URL
  2. Upload the directory and all of its files to S3
  3. Restore the backup using a blobstore:// URL

But what you CAN do, which is the situation you describe, is

  1. Create a backup with a file:// URL
  2. Upload the directory and all of its files to S3
  3. Download the directory from S3 to some local folder
  4. Restore the backup with a file:// URL that points to the local folder downloaded to in step 3.

In this scenario, the fact that the files were copied somewhere and back makes no difference; restore just needs to see all of the files it wrote during the backup in whatever input directory you give it.
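The four steps that do work can be sanity-checked with something like the sketch below: pack the backup folder, unpack it elsewhere, confirm the file list is unchanged, and build the `file://` URL to hand to the restore. The S3 upload and download are left out, all function names are hypothetical, and it assumes a POSIX path layout.

```python
import tarfile
from pathlib import Path


def list_files(root):
    """Relative paths of every file under root, sorted for comparison."""
    root = Path(root)
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file())


def pack_backup(backup_dir, archive_path):
    """Write backup_dir into a gzip-compressed tar, keeping its folder name."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(backup_dir, arcname=Path(backup_dir).name)
    return archive_path


def unpack_backup(archive_path, dest_dir):
    """Extract the archive under dest_dir and return the restored folder."""
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        top = tar.getnames()[0].split("/")[0]
        tar.extractall(dest_dir)
    return Path(dest_dir) / top


def file_url(path):
    """file:// URL for an absolute local path (POSIX paths assumed)."""
    return "file://" + str(Path(path).resolve())
```

Comparing `list_files()` of the original and the unpacked folder before restoring is a cheap way to rule out a missing file.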

Thanks. I was trying to do some testing with backup agents … but I was unable to invoke the backup_agent command line tool:
[root@ip-10-148-41-166 foundationdb]# backup_agent

bash: backup_agent: command not found

[root@ip-10-148-41-166 foundationdb]# where is backup_agent

bash: where: command not found

[root@ip-10-148-41-166 foundationdb]# which backup_agent

/usr/bin/which: no backup_agent in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/bin)

[root@ip-10-148-41-166 foundationdb]#

To stop the backup agents I already commented them out in the conf file and restarted the fdb process, so the backup agent is not running anymore. I wanted to check how to start the backup agent on only one node through the backup_agent command line tool as mentioned in the documentation, but I am unable to find it.

Add /usr/local/foundationdb/ to $PATH.
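If the binary is still not found after that, a small search over likely install directories can help. In the sketch below, `/usr/local/foundationdb` is the directory from the reply above, while `/usr/lib/foundationdb/backup_agent` is only my assumption about where Linux packages put it; the `command` line you commented out in the conf file should show the real path. `find_tool` is a hypothetical helper.

```python
import os
import shutil

CANDIDATE_DIRS = [
    "/usr/local/foundationdb",             # directory suggested in the reply above
    "/usr/lib/foundationdb/backup_agent",  # assumption: Linux package layout
]


def find_tool(name, extra_dirs=CANDIDATE_DIRS):
    """Look for an executable on $PATH first, then in the extra directories.

    Returns the full path to the executable, or None if it is not found.
    """
    dirs = [d for d in os.environ.get("PATH", "").split(os.pathsep) if d]
    dirs += list(extra_dirs)
    return shutil.which(name, path=os.pathsep.join(dirs))
```

`find_tool("backup_agent")` returns the full path if the tool exists in any of those places, which you can then run directly or add to `$PATH`.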