Fdbbackup expire with blobstore returning error, not deleting from S3


(Ray Jenkins) #1

Hi Folks,

Running into some problems with fdbbackup expire and looking for some guidance on debugging.

Setup:
15 instance cluster running 6.0.16
3 i3.large running 2 fdbserver processes with transaction class running log
3 c5.xlarge running 4 fdbserver processes with stateless class running proxy/master/cluster controller
9 i3.xlarge running 4 fdbserver processes with storage class

backup_agent is running on all nodes and fdbbackup is running in continuous mode.

fdbbackup start -d <blobstore_url> -s 86400 -z

So far so good, backups have been running fine.

$ fdbbackup describe -d blobstore://xxxxxxxxxxxxxxxxs3.us-west-2.amazonaws.com/blah?bucket=some_bucket
URL: blobstore://xxxxxxxxxxxxxxxx@s3.us-west-2.amazonaws.com/blah?bucket=some_bucket
Restorable: true
Snapshot: startVersion=680672277778 endVersion=681349765701 totalBytes=258515087913 restorable=true
Snapshot: startVersion=681359352824 endVersion=767745402472 totalBytes=261944092554 restorable=true
Snapshot: startVersion=767899470014 endVersion=854238123473 totalBytes=268954145233 restorable=true
Snapshot: startVersion=854255297744 endVersion=940607774184 totalBytes=282034102462 restorable=true
Snapshot: startVersion=940685990059 endVersion=1027047586694 totalBytes=295028242383 restorable=true
Snapshot: startVersion=1027144233746 endVersion=1113480511478 totalBytes=307961176975 restorable=true
Snapshot: startVersion=1113638682379 endVersion=1200023371426 totalBytes=321032543807 restorable=true
SnapshotBytes: 1995469391327
MinLogBeginVersion: 680672000491
ContiguousLogEndVersion: 1241861148586
MaxLogEndVersion: 1241861148586
MinRestorableVersion: 681349765701
MaxRestorableVersion: 1241861148585

In preparing to move to production I’m performing some tests, including expiring backups. I began by attempting to delete backups prior to 02-06-2019 using the expire_before_timestamp feature.

fdbbackup expire -d blobstore://xxxxx@s3.us-west-2.amazonaws.com/blah?bucket=some_bucket --expire_before_timestamp 2019-02-06.10:00:00 All data before version 680671939982 is deleted.

Afterwards I inspected the bucket on S3 and no data was deleted. I then attempted to delete data before 2019-02-07.10:00:00 and received the following error. (Note: I have also tried with the --force flag as well with the same results)

$ fdbbackup expire -d blobstore://xxxxxxxxx@s3.us-west-2.amazonaws.com/blah?bucket=some_bucket --expire_before_timestamp 2019-02-07.10:00:00 ERROR: HTTP response code not received or indicated failure Fatal Error: HTTP response code not received or indicated failure

Performing a fdbbackup describe now shows some of the backups are no longer restorable, as expected, and the MinLogBeginVersion and MinRestorableVersion has increased but the size of the S3 bucket remains steady and growing due to continuous backups.

$ fdbbackup describe -d blobstore://xxxxxxxxxxxxxxxx@s3.us-west-2.amazonaws.com/blah?bucket=some_bucket
URL: blobstore://xxxxxxxxx@s3.us-west-2.amazonaws.com/blah?bucket=some_bucket
Restorable: true
Snapshot: startVersion=680672277778 endVersion=681349765701 totalBytes=258515087913 restorable=false
Snapshot: startVersion=681359352824 endVersion=767745402472 totalBytes=261944092554 restorable=false
Snapshot: startVersion=767899470014 endVersion=854238123473 totalBytes=268954145233 restorable=true
Snapshot: startVersion=854255297744 endVersion=940607774184 totalBytes=282034102462 restorable=true
Snapshot: startVersion=940685990059 endVersion=1027047586694 totalBytes=295028242383 restorable=true
Snapshot: startVersion=1027144233746 endVersion=1113480511478 totalBytes=307961176975 restorable=true
Snapshot: startVersion=1113638682379 endVersion=1200023371426 totalBytes=321032543807 restorable=true
SnapshotBytes: 1995469391327
MinLogBeginVersion: 767441226070
ContiguousLogEndVersion: 1243110292855
MaxLogEndVersion: 1243110292855
MinRestorableVersion: 854238123473
MaxRestorableVersion: 1243110292854

Additionally I can see the delete requests being received in Cloudwatch

And it appears the properties folder has been updated.

It appears nothing from the kvranges folder has been deleted and there is no expiration date or expiration rule on the objects in the kvranges/snapshot folders. Additionally I have inspected the trace files but haven’t noticed anything pertaining to my manual attempts at fdbbackup expire.

Any tips on debugging?


Backup using disk snapshots
(Steve Atherton) #2

There were several bugs recently fixed in backup in 6.0.18 and the prior version, including a particularly nasty HTTP request bug. While I’m not sure exactly what bad behavior you might be hitting, try these steps first:

  1. Upgrade to 6.0.18
  2. Delete the 3 files you see in the “properties” folder. I assume you can do this in the web UI where you took the screenshot.
  3. Run fdbbackup describe on your backup URL.
  4. Run fdbbackup expire on your backup URL as you did before.

In case this was a point of confusion, please note that the byte counts you see in the fdbbackup describe output are not based on listing and summing up all of the kvrange files for each snapshot, as snapshots can contain millions of these so that could be very slow. The byte counts are instead based on metadata files, one per snapshot, which are written after each snapshot’s completion and which are deleted in expire operations after all of the kvrange files for the snapshot are deleted.


(Ray Jenkins) #3

Thanks @SteavedHams, I will upgrade tomorrow and follow your instructions and let you know the results.


(Ray Jenkins) #4

Hiya @SteavedHams I’ve rebuilt the cluster with new instances and on 6.0.18 and deleted the 3 files in the properties folder. Issuing a describe shows

fdbbackup describe -d blobstore://xxxxxx@s3.us-west-2.amazonaws.com/blah?bucket=some_bucket
URL: blobstore://xxxxxxxxxx@s3.us-west-2.amazonaws.com/blah?bucket=some_bucket
Restorable: true
Snapshot: startVersion=38380127694 (maxLogEnd -14.02 days) endVersion=38801987336 (maxLogEnd -14.02 days) totalBytes=343824424885 restorable=true
Snapshot: startVersion=680672277778 (maxLogEnd -6.59 days) endVersion=681349765701 (maxLogEnd -6.58 days) totalBytes=258515087913 restorable=false
Snapshot: startVersion=681359352824 (maxLogEnd -6.58 days) endVersion=767745402472 (maxLogEnd -5.58 days) totalBytes=261944092554 restorable=false
Snapshot: startVersion=767899470014 (maxLogEnd -5.58 days) endVersion=854238123473 (maxLogEnd -4.58 days) totalBytes=268954145233 restorable=false
Snapshot: startVersion=854255297744 (maxLogEnd -4.58 days) endVersion=940607774184 (maxLogEnd -3.58 days) totalBytes=282034102462 restorable=false
Snapshot: startVersion=940685990059 (maxLogEnd -3.58 days) endVersion=1027047586694 (maxLogEnd -2.58 days) totalBytes=295028242383 restorable=false
Snapshot: startVersion=1027144233746 (maxLogEnd -2.58 days) endVersion=1113480511478 (maxLogEnd -1.58 days) totalBytes=307961176975 restorable=false
Snapshot: startVersion=1113638682379 (maxLogEnd -1.58 days) endVersion=1200023371426 (maxLogEnd -0.58 days) totalBytes=321032543807 restorable=false
SnapshotBytes: 2339293816212
MinLogBeginVersion: 38379644395 (maxLogEnd -14.02 days)
ContiguousLogEndVersion: 61000828852 (maxLogEnd -13.76 days)
MaxLogEndVersion: 1249978824514 (maxLogEnd -0.00 days)
MinRestorableVersion: 38801987336 (maxLogEnd -14.02 days)
MaxRestorableVersion: 61000828851 (maxLogEnd -13.76 days)

I then tried to run some expires.

fdbbackup expire -d blobstore://xxxxxx@s3.us-west-2.amazonaws.com/blah?bucket=some_bucket --expire_before_timestamp 2019-02-07.10:00:00
Final metadata update…
All data before version 0 has been deleted.

Notably trying --expire_before_version attempts to List files and then errors

$ fdbbackup expire -d blobstore://xxxxxx@s3.us-west-2.amazonaws.com/blah?bucket=some_bucket --expire_before_version 854255297744
Listing files…ERROR: HTTP response code not received or indicated failure
Fatal Error: HTTP response code not received or indicated failure

A fdbbackup describe now shows

$ fdbbackup describe -d blobstore://xxxxxxxxxxxx@s3.us-west-2.amazonaws.com/blah?bucket=some_bucket
URL: blobstore://xxxxx@s3.us-west-2.amazonaws.com/blah?bucket=some_bucket
Restorable: true
Snapshot: startVersion=38380127694 (maxLogEnd -14.02 days) endVersion=38801987336 (maxLogEnd -14.02 days) totalBytes=343824424885 restorable=false
Snapshot: startVersion=680672277778 (maxLogEnd -6.59 days) endVersion=681349765701 (maxLogEnd -6.58 days) totalBytes=258515087913 restorable=false
Snapshot: startVersion=681359352824 (maxLogEnd -6.58 days) endVersion=767745402472 (maxLogEnd -5.58 days) totalBytes=261944092554 restorable=false
Snapshot: startVersion=767899470014 (maxLogEnd -5.58 days) endVersion=854238123473 (maxLogEnd -4.58 days) totalBytes=268954145233 restorable=false
Snapshot: startVersion=854255297744 (maxLogEnd -4.58 days) endVersion=940607774184 (maxLogEnd -3.58 days) totalBytes=282034102462 restorable=true
Snapshot: startVersion=940685990059 (maxLogEnd -3.58 days) endVersion=1027047586694 (maxLogEnd -2.58 days) totalBytes=295028242383 restorable=true
Snapshot: startVersion=1027144233746 (maxLogEnd -2.58 days) endVersion=1113480511478 (maxLogEnd -1.58 days) totalBytes=307961176975 restorable=true
Snapshot: startVersion=1113638682379 (maxLogEnd -1.58 days) endVersion=1200023371426 (maxLogEnd -0.58 days) totalBytes=321032543807 restorable=true
SnapshotBytes: 2339293816212
UnreliableEndVersion: 854247018002 (maxLogEnd -4.58 days)
MinLogBeginVersion: 854247018002 (maxLogEnd -4.58 days)
ContiguousLogEndVersion: 1249978824514 (maxLogEnd -0.00 days)
MaxLogEndVersion: 1249978824514 (maxLogEnd -0.00 days)
MinRestorableVersion: 940607774184 (maxLogEnd -3.58 days)
MaxRestorableVersion: 1249978824513 (maxLogEnd -0.00 days)

Inspecting the S3 bucket the size has not decreased. The properties file now contains.


(Steve Atherton) #5

This line:

ContiguousLogEndVersion: 61000828852 (maxLogEnd -13.76 days)

suggests that you likely hit a rare pre-6.0.18 bug where one of your log files was not successfully uploaded, breaking the continuity of the mutation stream at that point. The logs after that point appear to be contiguous, however, so you can still restore to points in time after the next snapshot after that log discontinuity.

When you began the expire to version 854255297744, the process got far enough that it marked everything prior to that version as being “unreliable”, meaning that files prior to that version could be missing because an expiration process began on that range. Describe assumes that files before the Unreliable version are missing, so it will not list/scan them and they will not be included in the cached logBegin / logEnd metadata which indicates what range of log files has been verified to exist at some point in time. (Some context here… Since backups can contain millions of files, the backup file scheme uses metadata files to indicate which snapshots were completely written and which log file ranges were verified to contiguously exist). Your last describe then scanned after that point and found log continuity up to the latest log file present, shown here:

ContiguousLogEndVersion: 1249978824514 (maxLogEnd -0.00 days)
MaxLogEndVersion: 1249978824514 (maxLogEnd -0.00 days)

I can’t explain why your expire operations seem to be unable to actually delete files and are returning errors. Please add the --log flag to fdbbackup to get a trace file and look for errors containing the strings FileBackup, BackupContainer, BlobStore, or HTTP.

Also, a note about using timestamps with backup operations. This should probably be made more clear in the docs/tooling, but converting from a Timestamp to a Version requires access to the cluster from which the backup data originated. Without that, either absolute version or relative decimal days from the last log present in the backup (which will be converted to an approximately accurate version using an assumption of 1e6 versions per second) must be used. You specified --expire_before_timestamp 2019-02-07.10:00:00 and the tool came up with version 0 for that timestamp. Is it possible that the operation used a different cluster for the lookup than the source of the backup, possibly a cluster file in a default location such as ./fdb.cluster?


(Ray Jenkins) #6

@SteavedHams thank you for the detailed response. The --log option was very helpful for debugging, it turns out this was just an overlooked permissions problem on our end and after fixing expires are now working.

<Event Severity="20" Time="1550101467.915273" Type="BlobStoreEndpointRequestFailed" ID="0000000000000000" SuppressedEventCount="0" ResponseCode="403" ConnectionEstablished="1" RemoteEndpoint="52.218.216.224:443:tls" Verb="DELETE" Resource="/some_bucket/data/foo/snapshots/snapshot,681359352824,767745402472,261944092554" ThisTry="1" Machine="0.0.0.0:0" LogGroup="" />

I think it would be worthwhile to consider adding the HTTP response code to the output of fdbbackup when an error occurs or perhaps mention --log in the backup documentation. I will open an issue on GH.

Thanks again for your help.


(Meng Xu) #7

Perhaps, you can also update the document with a PR on GitHub. :slight_smile: It does not have to be big changes.


(Ray Jenkins) #8

@SteavedHams after resolving the expire issue we rebuilt the cluster and kicked off a new fdbbackup. Each night at 2:00AM PST there is a job that runs and expires any backups over 7 days old. It’s currently showing the last couple days of backups as restorable=false.

fdbbackup describe -d blobstore://xxxxxxxxxxxxx@s3.us-west-2.amazonaws.com/blah?bucket=some_bucket
URL: blobstore://xxxxxxxxxxxx@s3.us-west-2.amazonaws.com/blah?bucket=some_bucket
Restorable: true
Snapshot: startVersion=353447944273 (maxLogEnd -2.94 days) endVersion=353448585130 (maxLogEnd -2.94 days) totalBytes=50749583 restorable=true
Snapshot: startVersion=413355786099 (maxLogEnd -2.24 days) endVersion=413356599485 (maxLogEnd -2.24 days) totalBytes=63749211 restorable=false
Snapshot: startVersion=413900113933 (maxLogEnd -2.24 days) endVersion=499582938750 (maxLogEnd -1.25 days) totalBytes=11386890817 restorable=false
Snapshot: startVersion=499640399747 (maxLogEnd -1.25 days) endVersion=585963699169 (maxLogEnd -0.25 days) totalBytes=19205766899 restorable=false
SnapshotBytes: 30707156510
ExpiredEndVersion: 0 (maxLogEnd -7.03 days)
UnreliableEndVersion: 0 (maxLogEnd -7.03 days)
MinLogBeginVersion: 353447805204 (maxLogEnd -2.94 days)
ContiguousLogEndVersion: 353467805204 (maxLogEnd -2.94 days)
MaxLogEndVersion: 607311681429 (maxLogEnd -0.00 days)
MinRestorableVersion: 353448585130 (maxLogEnd -2.94 days)
MaxRestorableVersion: 353467805203 (maxLogEnd -2.94 days)

A fdbbackup status shows the following.

fdbbackup status
The backup on tag `default’ is restorable but continuing to blobstore://xxxxxxx@s3.us-west-2.amazonaws.com/blah?bucket=some_bucket.
Snapshot interval is 86400 seconds. Current snapshot progress target is 25.32% (>100% means the snapshot is supposed to be done)

Details:
LogBytes written - 19919852683
RangeBytes written - 35519783194
Last complete log version and timestamp - 607832819857, 02/22/19 15:59:41
Last complete snapshot version and timestamp - 585963699169, 02/22/19 09:55:11
Current Snapshot start version and timestamp - 585964113623, 02/22/19 09:55:11
Expected snapshot end version and timestamp - 672364113623, 02/23/19 09:55:12
Backup supposed to stop at next snapshot completion - No
Older Errors
1.82 day(s) ago : ‘Task execution stopped due to timeout, abort, or completion by another worker’ on ‘file_backup_write_logs_5.2’

I am a bit confused on how to interpret the describe restorable=false, particularly in lieu of the status showing.

Last complete log version and timestamp - 607832819857, 02/22/19 15:59:41
Last complete snapshot version and timestamp - 585963699169, 02/22/19 09:55:11

Does this mean if I were to attempt a restore the past few days data would be lost as the Max RestorableVersion shows?

MaxRestorableVersion: 353467805203 (maxLogEnd -2.94 days).

If so do you have any tips on debugging or actions to take (i.e. wipe out backups and start over or should it correct itself)? I am also curious if the first expire that was executed before there was 7 days of backups might have somehow affected restorability?

— snip job log —
info expiring data before 2019-02-13 10:00:00.778124264 +0000 UTC m=-545335.225315229
[ server ] 02-20 04:00:01 ERROR - 2019/02/20 10:00:01 info expire output is All data before version 0 is deleted.


(Meng Xu) #9

@SteavedHams Just a follow-up question to @rjenkins question:

Suppose the original backup allows us to do point-in-time recovery.
Now we expire all backup before time t.
Can we still do the point-in-time recovery to any time after t?


(Steve Atherton) #10

You can recover to any point starting from the end of the first snapshot which began on or after t.

To recover, you need
a) A snapshot, captured over some period of time I will call T
b) Logs from the entire duration T
c) Logs after T until the point you want to restore to.

So basically, when expiring it makes sense to choose snapshot start versions as expiration cutoff versions. But for a large cluster and say a 1 week snapshot interval, rather than deleting say 3 million files every 7 days you can delete 1/7th of that each day, for example.


(Steve Atherton) #11

Your describe output suggests that a log file is missing. The log file would have had a begin version of 353467805203.

There was a bug in FDB < 6.0.18 which could fail to complete uploads successfully in the presence of certain other HTTP errors due to incorrect reuse of the TCP connection to the blob service. It is possible that you hit this bug and that’s why the log file is missing - were you running a backup agent at a version < 6.0.18 around 3 days ago?


(Steve Atherton) #12

To answer your question about restorability - Describe does a non-exhaustive search to find max restorability, it looks for a single mutation log sequence, unbroken from the oldest log file that exists in the backup. This stops 2.94 days ago (as of the time you can the command) because of the missing log file. But unless there are log files missing after that, you should be able to still restore to your current version, it’s just that the default version that restore will use will not be what you want, as it will stop at the missing log file.

You can verify restorability (in version 6.0.18) - which will verify via list operations each needed file in the source URL - by running

fdbrestore start --dryrun -r <url> -version <MaxLogEndVersion from describe output>

If this works, and you just have one missing log file (which I’d still like to determine the cause of) then you can just expire data before 2.9 days ago and now your earliest log version in the backup will have version continuity up to the present version. You can do this by
fdbbackup expire -d <url> --delete_before_days 2.9


(Ray Jenkins) #13

Hi @SteavedHams. I see thank you for the explanation and clarification about the “default” version. I believe all these backups have occurred on 6.0.18.

$ fdbserver --version
FoundationDB 6.0 (v6.0.18)
source version 48d84faa3e6174deb7f852ef4d314f7bad1dfa57
protocol fdb00a570010001

Thanks for the suggestions, following your instructions here is the output now, all showing restorable.

$ fdbrestore start --dryrun -r xxxxxx@s3.us-west-2.amazonaws.com/blah?bucket=some_bucket --version 633345286610
Backup can be used to restore to version 633345286610

fdbbackup expire -d blobstore://xxxxx@s3.us-west-2.amazonaws.com/blah?bucket=some_bucket --delete_before_days 3.24
Final metadata update…
All data before 259200000000 versions (3 days) prior to latest backup log has been deleted.

fdbbackup describe -d blobstore://xxxxx@s3.us-west-2.amazonaws.com/blah?bucket=some_bucket
URL: blobstore://xxxxx@s3.us-west-2.amazonaws.com/blah?bucket=some_bucket
Restorable: true
Snapshot: startVersion=413355786099 (maxLogEnd -2.55 days) endVersion=413356599485 (maxLogEnd -2.55 days) totalBytes=63749211 restorable=true
Snapshot: startVersion=413900113933 (maxLogEnd -2.54 days) endVersion=499582938750 (maxLogEnd -1.55 days) totalBytes=11386890817 restorable=true
Snapshot: startVersion=499640399747 (maxLogEnd -1.55 days) endVersion=585963699169 (maxLogEnd -0.55 days) totalBytes=19205766899 restorable=true
SnapshotBytes: 30656406927
ExpiredEndVersion: 374505286610 (maxLogEnd -3.00 days)
UnreliableEndVersion: 374505286610 (maxLogEnd -3.00 days)
MinLogBeginVersion: 413355575165 (maxLogEnd -2.55 days)
ContiguousLogEndVersion: 633725286610 (maxLogEnd -0.00 days)
MaxLogEndVersion: 633725286610 (maxLogEnd -0.00 days)
MinRestorableVersion: 413356599485 (maxLogEnd -2.55 days)
MaxRestorableVersion: 633725286609 (maxLogEnd -0.00 days)


(Steve Atherton) #14

Glad you got things cleaned up, though I am a bit concerned by there being a missing log file on 6.0.18.