Sanity check: FoundationDB backup blobstore credentials permissions errors

Summary

Hello all, I’m Kyle over at IBM Cloudant. I’m currently configuring/testing backups/restores against our test infrastructure.

We use IBM Cloud Object Store (COS) as our blobstore. The FoundationDB cluster itself is not currently configured for intracluster TLS although you will see I’ve provided the necessary TLS configuration to backup CLI commands and backup agents so they can work with IBM COS over TLS. All that is working well and I can trigger backups.

The issue surfaces when I try to fdbbackup describe or attempt to restore a backup from our blobstore. I’m seeing some kind of permissions error in the logs. They look filesystemy although I’m betting it’s to do with the COS requests. I believe it’s an issue trying to read data back from our store. I can however read objects just fine using the same HMAC API keys configured for backup, but with our ibmcloud CLI tool.

I’m going to continue troubleshooting – pull the logs from our COS friends and see if it gives insight into the request against COS that fails. I’ve read through some of the codebase relevant to backup management and most of the dots are connecting, but I’m left stumped as to why we fail to read the snapshot from the blobstore. Any tips on understanding this error are appreciated.

Error summary

I will restate my test process below but here’s the highlight. Please note that we have some scripts such as sfdbbackup which are just the familiar fdbbackup etc. commands but with TLS/blobcreds flags set. Here’s the kernel for a describe attempt:

<Event Severity="10" Time="1581545857.363042" Type="BackupContainerDescribe2" ID="0000000000000000" URL="blobstore://936027fd996b447197a674452f421752@s3.private.us-south.cloud-object-storage.appdomain.cloud/ktest-feb-12-01?bucket=cloudant-wp2-backups" LogStartVersionOverride="-1" ExpiredEndVersion="-1" UnreliableEndVersion="-1" LogBeginVersion="-1" LogEndVersion="-1" Machine="0.0.0.0:0" LogGroup="" />
<Event Severity="30" Time="1581545857.363042" Type="BackupContainerMetadataInvalid" ID="0000000000000000" URL="blobstore://936027fd996b447197a674452f421752@s3.private.us-south.cloud-object-storage.appdomain.cloud/ktest-feb-12-01?bucket=cloudant-wp2-backups" ExpiredEndVersion="-1" UnreliableEndVersion="-1" LogBeginVersion="-1" LogEndVersion="-1" Machine="0.0.0.0:0" LogGroup="" />
<Event Severity="30" Time="1581545857.479831" Type="AbsolutePathError" ID="0000000000000000" Path="snapshots/snapshot,8016609928,8016609928,42" UnixErrorCode="d" UnixError="Permission denied" Error="platform_error" ErrorDescription="Platform error" ErrorCode="1500" Machine="0.0.0.0:0" LogGroup="" />
<Event Severity="40" Time="1581545857.479831" Type="StopAfterError" ID="0000000000000000" Error="platform_error" ErrorDescription="Platform error" ErrorCode="1500" Backtrace="addr2line -e fdbbackup.debug -p -C -f -i 0xaa1d3f 0xaa114f 0x4d74c3 0x4d76ae 0x4f8349 0x4cd4f1 0x4cd66e 0x5b3441 0x5b38e1 0x4f8349 0x4f8349 0x593fde 0x5be1d9 0x42851c 0x5b7990 0x5e0f15 0x5e1047 0x5d9551 0x4fa0f8 0x5e9785 0x5eaced 0x5eafa1 0x5e51f8 0x5e5497 0x5e575c 0x5e0ab8 0x75d87f 0x75e0c1 0x75bd00 0x75ce57 0x75d083 0x75d11c 0x75b190 0x75ba8d 0x4fa0f8 0x4cd15f 0x75aad8 0x75bc0f 0x4fa060 0xae4660 0x8391ea 0x4b6e5d 0x7f69703fc2e1" Machine="0.0.0.0:0" LogGroup="" />

Here is the API key reading that object just fine:

ibmcloud cos config list
Key                     Value   
Last Updated            Wednesday, February 12 2020 at 16:11:14   
Default Region          us-south   
Download Location       /Users/ksnavely@us.ibm.com/Downloads   
CRN                        
AccessKeyID             936027fd996b447197a674452f421752   
SecretAccessKey         ************************************************   
Authentication Method   HMAC   
URL Style               VHost   

ibmcloud cos get-object --bucket cloudant-wp2-backups --key 'data/ktest-feb-12-01/snapshots/snapshot,8016609928,8016609928,42' get_test
OK
Successfully downloaded 'data/ktest-feb-12-01/snapshots/snapshot,8016609928,8016609928,42' from bucket 'cloudant-wp2-backups'
176 B downloaded.

Testing process

What follows is the complete process of my attempt to describe a backup. Restores fail in the same way.

First, seed a key/value.

fdb> writemode on
fdb> get ktest
`ktest': not found
fdb> set ktest foo
Committed (7575052530)

Next, start a backup covering this keyrange.

sudo -u foundationdb sfdbbackup start -d 'blobstore://936027fd996b447197a674452f421752@s3.private.us-south.cloud-object-storage.appdomain.cloud/ktest-feb-12-01?bucket=cloudant-wp2-backups&sc=1' --log -t ktest-feb-12-01 -k "ktest ktest/xff"
The backup on tag `ktest-feb-12-01' was successfully submitted.

There were no errors in the trace logs for the above command. I checked the status:

fdbbackup status -t ktest-feb-12-01
The previous backup on tag `ktest-feb-12-01' at blobstore://936027fd996b447197a674452f421752@s3.private.us-south.cloud-object-storage.appdomain.cloud/ktest-feb-12-01?bucket=cloudant-wp2-backups&sc=1 completed at version 8036461619.
BackupUID: a6cef066c8e245fff9eb2a3fd651bd84
BackupURL: blobstore://936027fd996b447197a674452f421752@s3.private.us-south.cloud-object-storage.appdomain.cloud/ktest-feb-12-01?bucket=cloudant-wp2-backups&sc=1

It had completed quickly as expected.

I next pulled all the logs for all backup agents from all cluster nodes. I could not find any error or metadata information in the logs. All I see is a brief few logs denoting success of the ktest backup.

Next I inspect the state of the backup.

sudo -u foundationdb sfdbbackup list -b 'blobstore://936027fd996b447197a674452f421752@s3.private.us-south.cloud-object-storage.appdomain.cloud/?bucket=cloudant-wp2-backups&sc=1'
blobstore://936027fd996b447197a674452f421752@s3.private.us-south.cloud-object-storage.appdomain.cloud/ktest-feb-12-01?bucket=cloudant-wp2-backups

sudo -u foundationdb sfdbbackup describe -d blobstore://936027fd996b447197a674452f421752@s3.private.us-south.cloud-object-storage.appdomain.cloud/ktest-feb-12-01?bucket=cloudant-wp2-backups -C /etc/foundationdb/fdb.cluster --log
ERROR: Platform error
Fatal Error: Platform error

The logs from the describe attempt:

<Event Severity="10" Time="1581545857.236408" Type="BlobStoreEndpointReusingConnected" ID="0000000000000000" SuppressedEventCount="0" RemoteEndpoint="10.1.129.83:443:tls" ExpiresIn="119.88" Machine="0.0.0.0:0" LogGroup="" />
<Event Severity="10" Time="1581545857.363042" Type="BackupContainerDescribe2" ID="0000000000000000" URL="blobstore://936027fd996b447197a674452f421752@s3.private.us-south.cloud-object-storage.appdomain.cloud/ktest-feb-12-01?bucket=cloudant-wp2-backups" LogStartVersionOverride="-1" ExpiredEndVersion="-1" UnreliableEndVersion="-1" LogBeginVersion="-1" LogEndVersion="-1" Machine="0.0.0.0:0" LogGroup="" />
<Event Severity="30" Time="1581545857.363042" Type="BackupContainerMetadataInvalid" ID="0000000000000000" URL="blobstore://936027fd996b447197a674452f421752@s3.private.us-south.cloud-object-storage.appdomain.cloud/ktest-feb-12-01?bucket=cloudant-wp2-backups" ExpiredEndVersion="-1" UnreliableEndVersion="-1" LogBeginVersion="-1" LogEndVersion="-1" Machine="0.0.0.0:0" LogGroup="" />
<Event Severity="30" Time="1581545857.479831" Type="AbsolutePathError" ID="0000000000000000" Path="snapshots/snapshot,8016609928,8016609928,42" UnixErrorCode="d" UnixError="Permission denied" Error="platform_error" ErrorDescription="Platform error" ErrorCode="1500" Machine="0.0.0.0:0" LogGroup="" />
<Event Severity="40" Time="1581545857.479831" Type="StopAfterError" ID="0000000000000000" Error="platform_error" ErrorDescription="Platform error" ErrorCode="1500" Backtrace="addr2line -e fdbbackup.debug -p -C -f -i 0xaa1d3f 0xaa114f 0x4d74c3 0x4d76ae 0x4f8349 0x4cd4f1 0x4cd66e 0x5b3441 0x5b38e1 0x4f8349 0x4f8349 0x593fde 0x5be1d9 0x42851c 0x5b7990 0x5e0f15 0x5e1047 0x5d9551 0x4fa0f8 0x5e9785 0x5eaced 0x5eafa1 0x5e51f8 0x5e5497 0x5e575c 0x5e0ab8 0x75d87f 0x75e0c1 0x75bd00 0x75ce57 0x75d083 0x75d11c 0x75b190 0x75ba8d 0x4fa0f8 0x4cd15f 0x75aad8 0x75bc0f 0x4fa060 0xae4660 0x8391ea 0x4b6e5d 0x7f69703fc2e1" Machine="0.0.0.0:0" LogGroup="" />

By the way, we are using FoundationDB 6.2.10.

First thing to try is your describe command again with --knob_http_verbose_level=2 which will print the HTTPS requests and responses to standard output.

Just an FYI, the BackupContainerMetadataInvalid trace event is not actually a problem in this situation. It is referring to metadata which is populated and updated incrementally by backup management commands such as describe and expire. This metadata describes version ranges of files which are either known to have been complete at some point in time or known to potentially be incomplete, and this allows describe commands to not have to list/scan the eventual millions of files a backup will contain during each backup management operation, as that would take a long time.

My guess as to what is happening here is the blob requests are working fine but something in the blob backup destination code is using a function in Platform.cpp called abspath() for convenience, while not expecting the path to actually exist on disk. I can’t actually find a code path that does this, but the logs are evidence that there is one.

Might there be an unreadable snapshots folder in fdbbackup's working directory?

Steve thank you for your response. I’d forgotten all about that HTTP verboseness knob!

To add some intrigue to the mix the IBM COS folks are saying they can’t find e.g. a 403 in our access logs. I’ll revisit this as soon as I’m able and share more information.