FoundationDB

Fdbbackup expire with blobstore returning error, not deleting from S3


(Ray Jenkins) #1

Hi Folks,

Running into some problems with fdbbackup expire and looking for some guidance on debugging.

Setup:
15-instance cluster running 6.0.16
3 i3.large instances running 2 fdbserver processes each with transaction class (logs)
3 c5.xlarge instances running 4 fdbserver processes each with stateless class (proxy/master/cluster controller)
9 i3.xlarge instances running 4 fdbserver processes each with storage class

backup_agent is running on all nodes and fdbbackup is running in continuous mode.

fdbbackup start -d <blobstore_url> -s 86400 -z
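
For reference, -s sets the snapshot interval in seconds and -z keeps the backup running continuously rather than stopping once it becomes restorable. An annotated form of the same command, with a placeholder URL:

# -d: backup destination (placeholder credentials/bucket shown)
# -s 86400: write a new snapshot roughly every 86400 seconds (one day)
# -z: continuous mode; keep backing up after the backup first becomes restorable
fdbbackup start -d 'blobstore://<key>:<secret>@s3.us-west-2.amazonaws.com/blah?bucket=some_bucket' -s 86400 -z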

So far so good, backups have been running fine.

$ fdbbackup describe -d blobstore://xxxxxxxxxxxxxxxx@s3.us-west-2.amazonaws.com/blah?bucket=some_bucket
URL: blobstore://xxxxxxxxxxxxxxxx@s3.us-west-2.amazonaws.com/blah?bucket=some_bucket
Restorable: true
Snapshot: startVersion=680672277778 endVersion=681349765701 totalBytes=258515087913 restorable=true
Snapshot: startVersion=681359352824 endVersion=767745402472 totalBytes=261944092554 restorable=true
Snapshot: startVersion=767899470014 endVersion=854238123473 totalBytes=268954145233 restorable=true
Snapshot: startVersion=854255297744 endVersion=940607774184 totalBytes=282034102462 restorable=true
Snapshot: startVersion=940685990059 endVersion=1027047586694 totalBytes=295028242383 restorable=true
Snapshot: startVersion=1027144233746 endVersion=1113480511478 totalBytes=307961176975 restorable=true
Snapshot: startVersion=1113638682379 endVersion=1200023371426 totalBytes=321032543807 restorable=true
SnapshotBytes: 1995469391327
MinLogBeginVersion: 680672000491
ContiguousLogEndVersion: 1241861148586
MaxLogEndVersion: 1241861148586
MinRestorableVersion: 681349765701
MaxRestorableVersion: 1241861148585

In preparing to move to production I’m performing some tests, including expiring backups. I began by attempting to delete backups prior to 02-06-2019 using the expire_before_timestamp feature.

fdbbackup expire -d blobstore://xxxxx@s3.us-west-2.amazonaws.com/blah?bucket=some_bucket --expire_before_timestamp 2019-02-06.10:00:00
All data before version 680671939982 is deleted.

Afterwards I inspected the bucket on S3 and found that no data had been deleted. I then attempted to delete data before 2019-02-07.10:00:00 and received the following error. (Note: I have also tried with the --force flag, with the same results.)

$ fdbbackup expire -d blobstore://xxxxxxxxx@s3.us-west-2.amazonaws.com/blah?bucket=some_bucket --expire_before_timestamp 2019-02-07.10:00:00
ERROR: HTTP response code not received or indicated failure
Fatal Error: HTTP response code not received or indicated failure

Performing an fdbbackup describe now shows that some of the backups are no longer restorable, as expected, and that the MinLogBeginVersion and MinRestorableVersion have increased, but the size of the S3 bucket has not decreased (it continues to grow due to the continuous backups).

$ fdbbackup describe -d blobstore://xxxxxxxxxxxxxxxx@s3.us-west-2.amazonaws.com/blah?bucket=some_bucket
URL: blobstore://xxxxxxxxx@s3.us-west-2.amazonaws.com/blah?bucket=some_bucket
Restorable: true
Snapshot: startVersion=680672277778 endVersion=681349765701 totalBytes=258515087913 restorable=false
Snapshot: startVersion=681359352824 endVersion=767745402472 totalBytes=261944092554 restorable=false
Snapshot: startVersion=767899470014 endVersion=854238123473 totalBytes=268954145233 restorable=true
Snapshot: startVersion=854255297744 endVersion=940607774184 totalBytes=282034102462 restorable=true
Snapshot: startVersion=940685990059 endVersion=1027047586694 totalBytes=295028242383 restorable=true
Snapshot: startVersion=1027144233746 endVersion=1113480511478 totalBytes=307961176975 restorable=true
Snapshot: startVersion=1113638682379 endVersion=1200023371426 totalBytes=321032543807 restorable=true
SnapshotBytes: 1995469391327
MinLogBeginVersion: 767441226070
ContiguousLogEndVersion: 1243110292855
MaxLogEndVersion: 1243110292855
MinRestorableVersion: 854238123473
MaxRestorableVersion: 1243110292854

Additionally, I can see the delete requests being received in CloudWatch.

And it appears the properties folder has been updated.

It appears nothing from the kvranges folder has been deleted, and there is no expiration date or expiration rule on the objects in the kvranges/snapshots folders. Additionally, I have inspected the trace files but haven’t noticed anything pertaining to my manual attempts at fdbbackup expire.

Any tips on debugging?


(Steve Atherton) #2

There were several bugs recently fixed in backup in 6.0.18 and the prior version, including a particularly nasty HTTP request bug. While I’m not sure exactly what bad behavior you might be hitting, try these steps first:

  1. Upgrade to 6.0.18
  2. Delete the 3 files you see in the “properties” folder. I assume you can do this in the web UI where you took the screenshot. (A CLI sketch of steps 2–4 follows this list.)
  3. Run fdbbackup describe on your backup URL.
  4. Run fdbbackup expire on your backup URL as you did before.
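
A sketch of steps 2–4 as CLI commands, for anyone without web UI access. This assumes the AWS CLI is available; the bucket and data/foo prefix come from the (anonymized) trace event later in this thread and will differ in your backup, so list the prefix first to confirm it before deleting anything:

# step 2: confirm and then remove the files under the backup's properties/ prefix
aws s3 ls s3://some_bucket/data/foo/properties/
aws s3 rm s3://some_bucket/data/foo/properties/ --recursive

# steps 3 and 4: re-run describe, then retry the expire
fdbbackup describe -d 'blobstore://<key>:<secret>@s3.us-west-2.amazonaws.com/blah?bucket=some_bucket'
fdbbackup expire -d 'blobstore://<key>:<secret>@s3.us-west-2.amazonaws.com/blah?bucket=some_bucket' --expire_before_timestamp 2019-02-06.10:00:00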

In case this was a point of confusion, please note that the byte counts you see in the fdbbackup describe output are not based on listing and summing up all of the kvrange files for each snapshot; snapshots can contain millions of these files, so that could be very slow. The byte counts are instead based on metadata files, one per snapshot, which are written after each snapshot completes and which are deleted during expire operations only after all of the snapshot’s kvrange files have been deleted.
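
These per-snapshot metadata files can be inspected directly. A sketch with the AWS CLI, using the anonymized path that appears in the trace event later in this thread (the prefix will differ per backup); each file is named snapshot,<startVersion>,<endVersion>,<totalBytes>:

# list the snapshot metadata files that describe reads its byte counts from
aws s3 ls s3://some_bucket/data/foo/snapshots/
# e.g. snapshot,681359352824,767745402472,261944092554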


(Ray Jenkins) #3

Thanks @SteavedHams, I will upgrade tomorrow, follow your instructions, and let you know the results.


(Ray Jenkins) #4

Hiya @SteavedHams, I’ve rebuilt the cluster on new instances running 6.0.18 and deleted the 3 files in the properties folder. Issuing a describe shows:

fdbbackup describe -d blobstore://xxxxxx@s3.us-west-2.amazonaws.com/blah?bucket=some_bucket
URL: blobstore://xxxxxxxxxx@s3.us-west-2.amazonaws.com/blah?bucket=some_bucket
Restorable: true
Snapshot: startVersion=38380127694 (maxLogEnd -14.02 days) endVersion=38801987336 (maxLogEnd -14.02 days) totalBytes=343824424885 restorable=true
Snapshot: startVersion=680672277778 (maxLogEnd -6.59 days) endVersion=681349765701 (maxLogEnd -6.58 days) totalBytes=258515087913 restorable=false
Snapshot: startVersion=681359352824 (maxLogEnd -6.58 days) endVersion=767745402472 (maxLogEnd -5.58 days) totalBytes=261944092554 restorable=false
Snapshot: startVersion=767899470014 (maxLogEnd -5.58 days) endVersion=854238123473 (maxLogEnd -4.58 days) totalBytes=268954145233 restorable=false
Snapshot: startVersion=854255297744 (maxLogEnd -4.58 days) endVersion=940607774184 (maxLogEnd -3.58 days) totalBytes=282034102462 restorable=false
Snapshot: startVersion=940685990059 (maxLogEnd -3.58 days) endVersion=1027047586694 (maxLogEnd -2.58 days) totalBytes=295028242383 restorable=false
Snapshot: startVersion=1027144233746 (maxLogEnd -2.58 days) endVersion=1113480511478 (maxLogEnd -1.58 days) totalBytes=307961176975 restorable=false
Snapshot: startVersion=1113638682379 (maxLogEnd -1.58 days) endVersion=1200023371426 (maxLogEnd -0.58 days) totalBytes=321032543807 restorable=false
SnapshotBytes: 2339293816212
MinLogBeginVersion: 38379644395 (maxLogEnd -14.02 days)
ContiguousLogEndVersion: 61000828852 (maxLogEnd -13.76 days)
MaxLogEndVersion: 1249978824514 (maxLogEnd -0.00 days)
MinRestorableVersion: 38801987336 (maxLogEnd -14.02 days)
MaxRestorableVersion: 61000828851 (maxLogEnd -13.76 days)

I then tried to run some expires.

fdbbackup expire -d blobstore://xxxxxx@s3.us-west-2.amazonaws.com/blah?bucket=some_bucket --expire_before_timestamp 2019-02-07.10:00:00
Final metadata update…
All data before version 0 has been deleted.

Notably, trying --expire_before_version attempts to list files and then errors:

$ fdbbackup expire -d blobstore://xxxxxx@s3.us-west-2.amazonaws.com/blah?bucket=some_bucket --expire_before_version 854255297744
Listing files…ERROR: HTTP response code not received or indicated failure
Fatal Error: HTTP response code not received or indicated failure

An fdbbackup describe now shows:

$ fdbbackup describe -d blobstore://xxxxxxxxxxxx@s3.us-west-2.amazonaws.com/blah?bucket=some_bucket
URL: blobstore://xxxxx@s3.us-west-2.amazonaws.com/blah?bucket=some_bucket
Restorable: true
Snapshot: startVersion=38380127694 (maxLogEnd -14.02 days) endVersion=38801987336 (maxLogEnd -14.02 days) totalBytes=343824424885 restorable=false
Snapshot: startVersion=680672277778 (maxLogEnd -6.59 days) endVersion=681349765701 (maxLogEnd -6.58 days) totalBytes=258515087913 restorable=false
Snapshot: startVersion=681359352824 (maxLogEnd -6.58 days) endVersion=767745402472 (maxLogEnd -5.58 days) totalBytes=261944092554 restorable=false
Snapshot: startVersion=767899470014 (maxLogEnd -5.58 days) endVersion=854238123473 (maxLogEnd -4.58 days) totalBytes=268954145233 restorable=false
Snapshot: startVersion=854255297744 (maxLogEnd -4.58 days) endVersion=940607774184 (maxLogEnd -3.58 days) totalBytes=282034102462 restorable=true
Snapshot: startVersion=940685990059 (maxLogEnd -3.58 days) endVersion=1027047586694 (maxLogEnd -2.58 days) totalBytes=295028242383 restorable=true
Snapshot: startVersion=1027144233746 (maxLogEnd -2.58 days) endVersion=1113480511478 (maxLogEnd -1.58 days) totalBytes=307961176975 restorable=true
Snapshot: startVersion=1113638682379 (maxLogEnd -1.58 days) endVersion=1200023371426 (maxLogEnd -0.58 days) totalBytes=321032543807 restorable=true
SnapshotBytes: 2339293816212
UnreliableEndVersion: 854247018002 (maxLogEnd -4.58 days)
MinLogBeginVersion: 854247018002 (maxLogEnd -4.58 days)
ContiguousLogEndVersion: 1249978824514 (maxLogEnd -0.00 days)
MaxLogEndVersion: 1249978824514 (maxLogEnd -0.00 days)
MinRestorableVersion: 940607774184 (maxLogEnd -3.58 days)
MaxRestorableVersion: 1249978824513 (maxLogEnd -0.00 days)

Inspecting the S3 bucket, I can see the size has not decreased, though the properties file has again been updated.


(Steve Atherton) #5

This line:

ContiguousLogEndVersion: 61000828852 (maxLogEnd -13.76 days)

suggests that you likely hit a rare pre-6.0.18 bug where one of your log files was not successfully uploaded, breaking the continuity of the mutation stream at that point. The logs after that point appear to be contiguous, however, so you can still restore to points in time after the next snapshot after that log discontinuity.

When you began the expire to version 854255297744, the process got far enough that it marked everything prior to that version as “unreliable”, meaning that files prior to that version could be missing because an expiration process began on that range. Describe assumes that files before the unreliable version are missing, so it will not list/scan them, and they will not be included in the cached logBegin/logEnd metadata which indicates what range of log files has been verified to exist at some point in time. (Some context here: since backups can contain millions of files, the backup file scheme uses metadata files to indicate which snapshots were completely written and which log file ranges were verified to contiguously exist.) Your last describe then scanned after that point and found log continuity up to the latest log file present, shown here:

ContiguousLogEndVersion: 1249978824514 (maxLogEnd -0.00 days)
MaxLogEndVersion: 1249978824514 (maxLogEnd -0.00 days)

I can’t explain why your expire operations seem unable to actually delete files and are returning errors. Please add the --log flag to fdbbackup to get a trace file, and look for errors containing the strings FileBackup, BackupContainer, BlobStore, or HTTP.
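
For example, assuming the trace files land in the working directory (fdbbackup also accepts --logdir to put them elsewhere), something like:

# scan the fdbbackup trace files for the relevant error events
grep -E 'FileBackup|BackupContainer|BlobStore|HTTP' trace.*.xml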

Also, a note about using timestamps with backup operations. This should probably be made clearer in the docs/tooling, but converting from a timestamp to a version requires access to the cluster from which the backup data originated. Without that, you must use either an absolute version or relative decimal days from the last log present in the backup (which will be converted to an approximately accurate version using an assumption of 1e6 versions per second). You specified --expire_before_timestamp 2019-02-07.10:00:00 and the tool came up with version 0 for that timestamp. Is it possible that the operation used a different cluster for the lookup than the source of the backup, perhaps via a cluster file in a default location such as ./fdb.cluster?
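
To rule that out, the lookup can be pinned to the source cluster explicitly, or the conversion avoided entirely. A sketch, assuming the -C cluster-file flag and the --delete_before_days option present in this era of fdbbackup (verify against fdbbackup expire --help for your build):

# pin the timestamp-to-version lookup to the backup's source cluster
fdbbackup expire -d 'blobstore://...' -C /etc/foundationdb/fdb.cluster --expire_before_timestamp 2019-02-07.10:00:00

# or skip the lookup: expire by absolute version, or by days relative to the newest log
fdbbackup expire -d 'blobstore://...' --expire_before_version 854255297744
fdbbackup expire -d 'blobstore://...' --delete_before_days 7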


(Ray Jenkins) #6

@SteavedHams thank you for the detailed response. The --log option was very helpful for debugging; it turns out this was just an overlooked permissions problem on our end, and after fixing it, expires are now working.

<Event Severity="20" Time="1550101467.915273" Type="BlobStoreEndpointRequestFailed" ID="0000000000000000" SuppressedEventCount="0" ResponseCode="403" ConnectionEstablished="1" RemoteEndpoint="52.218.216.224:443:tls" Verb="DELETE" Resource="/some_bucket/data/foo/snapshots/snapshot,681359352824,767745402472,261944092554" ThisTry="1" Machine="0.0.0.0:0" LogGroup="" />
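
For anyone hitting the same thing: a ResponseCode="403" on a DELETE like the event above means the blobstore credentials lack delete rights on those keys (s3:DeleteObject in IAM terms). A quick way to check, sketched with the AWS CLI against a throwaway key under the same (anonymized) prefix:

# create and then delete a scratch object to confirm delete permission
aws s3api put-object --bucket some_bucket --key data/foo/delete-permission-check
aws s3api delete-object --bucket some_bucket --key data/foo/delete-permission-check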

I think it would be worthwhile to consider adding the HTTP response code to the output of fdbbackup when an error occurs, or perhaps mentioning --log in the backup documentation. I will open an issue on GH.

Thanks again for your help.


(Meng Xu) #7

Perhaps you could also update the documentation with a PR on GitHub. :slight_smile: It doesn’t have to be a big change.