Sanity check: FoundationDB backup blobstore credentials permissions errors

Summary

Hello all, I’m Kyle over at IBM Cloudant. I’m currently configuring and testing backups and restores against our test infrastructure.

We use IBM Cloud Object Storage (COS) as our blobstore. The FoundationDB cluster itself is not currently configured for intra-cluster TLS, although, as you’ll see, I’ve provided the necessary TLS configuration to the backup CLI commands and backup agents so they can work with IBM COS over TLS. All of that is working well and I can trigger backups.

The issue surfaces when I try to fdbbackup describe or attempt to restore a backup from our blobstore. I’m seeing some kind of permissions error in the logs. The errors look filesystem-related, although I’m betting they have to do with the COS requests; I believe something fails when reading data back from our store. I can, however, read the same objects just fine with our ibmcloud CLI tool, using the same HMAC API keys configured for backup.

I’m going to continue troubleshooting – pull the logs from our COS friends and see if they give insight into which request against COS fails. I’ve read through some of the codebase relevant to backup management and most of the dots are connecting, but I’m left stumped as to why we fail to read the snapshot from the blobstore. Any tips on understanding this error are appreciated.

Error summary

I will restate my test process below, but here’s the highlight. Please note that we have some scripts such as sfdbbackup, which are just the familiar fdbbackup etc. commands with TLS/blob-credential flags set. Here’s the kernel of a describe attempt:

<Event Severity="10" Time="1581545857.363042" Type="BackupContainerDescribe2" ID="0000000000000000" URL="blobstore://936027fd996b447197a674452f421752@s3.private.us-south.cloud-object-storage.appdomain.cloud/ktest-feb-12-01?bucket=cloudant-wp2-backups" LogStartVersionOverride="-1" ExpiredEndVersion="-1" UnreliableEndVersion="-1" LogBeginVersion="-1" LogEndVersion="-1" Machine="0.0.0.0:0" LogGroup="" />
<Event Severity="30" Time="1581545857.363042" Type="BackupContainerMetadataInvalid" ID="0000000000000000" URL="blobstore://936027fd996b447197a674452f421752@s3.private.us-south.cloud-object-storage.appdomain.cloud/ktest-feb-12-01?bucket=cloudant-wp2-backups" ExpiredEndVersion="-1" UnreliableEndVersion="-1" LogBeginVersion="-1" LogEndVersion="-1" Machine="0.0.0.0:0" LogGroup="" />
<Event Severity="30" Time="1581545857.479831" Type="AbsolutePathError" ID="0000000000000000" Path="snapshots/snapshot,8016609928,8016609928,42" UnixErrorCode="d" UnixError="Permission denied" Error="platform_error" ErrorDescription="Platform error" ErrorCode="1500" Machine="0.0.0.0:0" LogGroup="" />
<Event Severity="40" Time="1581545857.479831" Type="StopAfterError" ID="0000000000000000" Error="platform_error" ErrorDescription="Platform error" ErrorCode="1500" Backtrace="addr2line -e fdbbackup.debug -p -C -f -i 0xaa1d3f 0xaa114f 0x4d74c3 0x4d76ae 0x4f8349 0x4cd4f1 0x4cd66e 0x5b3441 0x5b38e1 0x4f8349 0x4f8349 0x593fde 0x5be1d9 0x42851c 0x5b7990 0x5e0f15 0x5e1047 0x5d9551 0x4fa0f8 0x5e9785 0x5eaced 0x5eafa1 0x5e51f8 0x5e5497 0x5e575c 0x5e0ab8 0x75d87f 0x75e0c1 0x75bd00 0x75ce57 0x75d083 0x75d11c 0x75b190 0x75ba8d 0x4fa0f8 0x4cd15f 0x75aad8 0x75bc0f 0x4fa060 0xae4660 0x8391ea 0x4b6e5d 0x7f69703fc2e1" Machine="0.0.0.0:0" LogGroup="" />

Here is the API key reading that object just fine:

ibmcloud cos config list
Key                     Value   
Last Updated            Wednesday, February 12 2020 at 16:11:14   
Default Region          us-south   
Download Location       /Users/ksnavely@us.ibm.com/Downloads   
CRN                        
AccessKeyID             936027fd996b447197a674452f421752   
SecretAccessKey         ************************************************   
Authentication Method   HMAC   
URL Style               VHost   

ibmcloud cos get-object --bucket cloudant-wp2-backups --key 'data/ktest-feb-12-01/snapshots/snapshot,8016609928,8016609928,42' get_test
OK
Successfully downloaded 'data/ktest-feb-12-01/snapshots/snapshot,8016609928,8016609928,42' from bucket 'cloudant-wp2-backups'
176 B downloaded.

Testing process

What follows is the complete process of my attempt to describe a backup. Restores fail in the same way.

First, seed a key/value.

fdb> writemode on
fdb> get ktest
`ktest': not found
fdb> set ktest foo
Committed (7575052530)

Next, start a backup covering this keyrange.

sudo -u foundationdb sfdbbackup start -d 'blobstore://936027fd996b447197a674452f421752@s3.private.us-south.cloud-object-storage.appdomain.cloud/ktest-feb-12-01?bucket=cloudant-wp2-backups&sc=1' --log -t ktest-feb-12-01 -k "ktest ktest/xff"
The backup on tag `ktest-feb-12-01' was successfully submitted.

There were no errors in the trace logs for the above command. I checked the status:

fdbbackup status -t ktest-feb-12-01
The previous backup on tag `ktest-feb-12-01' at blobstore://936027fd996b447197a674452f421752@s3.private.us-south.cloud-object-storage.appdomain.cloud/ktest-feb-12-01?bucket=cloudant-wp2-backups&sc=1 completed at version 8036461619.
BackupUID: a6cef066c8e245fff9eb2a3fd651bd84
BackupURL: blobstore://936027fd996b447197a674452f421752@s3.private.us-south.cloud-object-storage.appdomain.cloud/ktest-feb-12-01?bucket=cloudant-wp2-backups&sc=1

It had completed quickly as expected.

I next pulled the logs for all backup agents from all cluster nodes. I could not find any error or metadata information in them; all I saw was a few brief log lines denoting the success of the ktest backup.

Next, I inspected the state of the backup.

sudo -u foundationdb sfdbbackup list -b 'blobstore://936027fd996b447197a674452f421752@s3.private.us-south.cloud-object-storage.appdomain.cloud/?bucket=cloudant-wp2-backups&sc=1'
blobstore://936027fd996b447197a674452f421752@s3.private.us-south.cloud-object-storage.appdomain.cloud/ktest-feb-12-01?bucket=cloudant-wp2-backups

sudo -u foundationdb sfdbbackup describe -d blobstore://936027fd996b447197a674452f421752@s3.private.us-south.cloud-object-storage.appdomain.cloud/ktest-feb-12-01?bucket=cloudant-wp2-backups -C /etc/foundationdb/fdb.cluster --log
ERROR: Platform error
Fatal Error: Platform error

The logs from the describe attempt:

<Event Severity="10" Time="1581545857.236408" Type="BlobStoreEndpointReusingConnected" ID="0000000000000000" SuppressedEventCount="0" RemoteEndpoint="10.1.129.83:443:tls" ExpiresIn="119.88" Machine="0.0.0.0:0" LogGroup="" />
<Event Severity="10" Time="1581545857.363042" Type="BackupContainerDescribe2" ID="0000000000000000" URL="blobstore://936027fd996b447197a674452f421752@s3.private.us-south.cloud-object-storage.appdomain.cloud/ktest-feb-12-01?bucket=cloudant-wp2-backups" LogStartVersionOverride="-1" ExpiredEndVersion="-1" UnreliableEndVersion="-1" LogBeginVersion="-1" LogEndVersion="-1" Machine="0.0.0.0:0" LogGroup="" />
<Event Severity="30" Time="1581545857.363042" Type="BackupContainerMetadataInvalid" ID="0000000000000000" URL="blobstore://936027fd996b447197a674452f421752@s3.private.us-south.cloud-object-storage.appdomain.cloud/ktest-feb-12-01?bucket=cloudant-wp2-backups" ExpiredEndVersion="-1" UnreliableEndVersion="-1" LogBeginVersion="-1" LogEndVersion="-1" Machine="0.0.0.0:0" LogGroup="" />
<Event Severity="30" Time="1581545857.479831" Type="AbsolutePathError" ID="0000000000000000" Path="snapshots/snapshot,8016609928,8016609928,42" UnixErrorCode="d" UnixError="Permission denied" Error="platform_error" ErrorDescription="Platform error" ErrorCode="1500" Machine="0.0.0.0:0" LogGroup="" />
<Event Severity="40" Time="1581545857.479831" Type="StopAfterError" ID="0000000000000000" Error="platform_error" ErrorDescription="Platform error" ErrorCode="1500" Backtrace="addr2line -e fdbbackup.debug -p -C -f -i 0xaa1d3f 0xaa114f 0x4d74c3 0x4d76ae 0x4f8349 0x4cd4f1 0x4cd66e 0x5b3441 0x5b38e1 0x4f8349 0x4f8349 0x593fde 0x5be1d9 0x42851c 0x5b7990 0x5e0f15 0x5e1047 0x5d9551 0x4fa0f8 0x5e9785 0x5eaced 0x5eafa1 0x5e51f8 0x5e5497 0x5e575c 0x5e0ab8 0x75d87f 0x75e0c1 0x75bd00 0x75ce57 0x75d083 0x75d11c 0x75b190 0x75ba8d 0x4fa0f8 0x4cd15f 0x75aad8 0x75bc0f 0x4fa060 0xae4660 0x8391ea 0x4b6e5d 0x7f69703fc2e1" Machine="0.0.0.0:0" LogGroup="" />

By the way, we are using FoundationDB 6.2.10.

The first thing to try is your describe command again with --knob_http_verbose_level=2, which will print the HTTPS requests and responses to standard output.

Just an FYI, the BackupContainerMetadataInvalid trace event is not actually a problem in this situation. It refers to metadata that is populated and updated incrementally by backup management commands such as describe and expire. This metadata describes version ranges of files that are either known to have been complete at some point in time or known to potentially be incomplete; it allows describe commands to avoid listing/scanning the eventual millions of files a backup will contain on each backup management operation, as that would take a long time.

My guess as to what is happening here is that the blob requests are working fine, but something in the blobstore backup destination code is using a function in Platform.cpp called abspath() for convenience while not expecting the path to actually exist on disk. I can’t actually find a code path that does this, but the logs are evidence that there is one.
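As a quick illustration of that guess (a sketch, not FoundationDB code): resolving a path on Linux fails with “Permission denied” when a component of the path exists on disk but cannot be traversed, which would match the AbsolutePathError above even though the snapshot path itself is a remote object name.

```shell
# Sketch (not FoundationDB code): realpath(3)-style resolution fails with
# EACCES ("Permission denied") when a path component exists but cannot be
# traversed. Run as a non-root user; root bypasses permission checks.
d=$(mktemp -d)
mkdir "$d/locked"
chmod 000 "$d/locked"
# The snapshot file below does not exist; the unreadable parent is what trips:
realpath "$d/locked/snapshots/snapshot,1,1,42" || echo "resolution failed"
chmod 700 "$d/locked"
rm -r "$d"
```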

Might there be an unreadable snapshots folder in fdbbackup's working directory?

Steve, thank you for your response. I’d forgotten all about that HTTP verbosity knob!

To add some intrigue to the mix, the IBM COS folks are saying they can’t find, e.g., a 403 in our access logs. I’ll revisit this as soon as I’m able and share more information.

That was great to be reminded of.

sudo -u foundationdb sfdbbackup start -d 'blobstore://936027fd996b447197a674452f421752@s3.private.us-south.cloud-object-storage.appdomain.cloud/ktest-feb-18-01?bucket=cloudant-wp2-backups&sc=1' --log -t ktest-feb-18-01 -k "ktest ktest/xff"
The backup on tag `ktest-feb-18-01' was successfully submitted.
sudo -u foundationdb sfdbbackup describe -d blobstore://936027fd996b447197a674452f421752@s3.private.us-south.cloud-object-storage.appdomain.cloud/ktest-feb-18-01?bucket=cloudant-wp2-backups -C /etc/foundationdb/fdb.cluster --log --knob_http_verbose_level=2
[a733810bc81535efcc61efad34748808] HTTP starting HEAD /cloudant-wp2-backups/backups/ktest-feb-18-01 ContentLen:0
[a733810bc81535efcc61efad34748808] HTTP code=200 early=0, time=0.103528s HEAD /cloudant-wp2-backups/backups/ktest-feb-18-01 contentLen=0 [277 out, response content len 0]
[a733810bc81535efcc61efad34748808] HTTP starting HEAD /cloudant-wp2-backups/data/ktest-feb-18-01/properties/expired_end_version ContentLen:0
[7f0345d1b2675c981a9798e11597e2d0] HTTP starting HEAD /cloudant-wp2-backups/data/ktest-feb-18-01/properties/log_end_version ContentLen:0
[15dd7e7d000b425d2eefadd0857a8a1d] HTTP starting HEAD /cloudant-wp2-backups/data/ktest-feb-18-01/properties/log_begin_version ContentLen:0
[de0061b79bebbf4b8486eee0c52584f7] HTTP starting HEAD /cloudant-wp2-backups/data/ktest-feb-18-01/properties/unreliable_end_version ContentLen:0
[15dd7e7d000b425d2eefadd0857a8a1d] HTTP code=404 early=0, time=0.075547s HEAD /cloudant-wp2-backups/data/ktest-feb-18-01/properties/log_begin_version contentLen=0 [303 out, response content len 331]
[7f0345d1b2675c981a9798e11597e2d0] HTTP code=404 early=0, time=0.083543s HEAD /cloudant-wp2-backups/data/ktest-feb-18-01/properties/log_end_version contentLen=0 [301 out, response content len 329]
[de0061b79bebbf4b8486eee0c52584f7] HTTP code=404 early=0, time=0.174761s HEAD /cloudant-wp2-backups/data/ktest-feb-18-01/properties/unreliable_end_version contentLen=0 [308 out, response content len 336]
[a733810bc81535efcc61efad34748808] HTTP code=404 early=0, time=0.435969s HEAD /cloudant-wp2-backups/data/ktest-feb-18-01/properties/expired_end_version contentLen=0 [305 out, response content len 333]
[15dd7e7d000b425d2eefadd0857a8a1d] HTTP starting GET /cloudant-wp2-backups/?max-keys=1000&prefix=data/ktest-feb-18-01/snapshots/&delimiter=/&marker= ContentLen:0
[7f0345d1b2675c981a9798e11597e2d0] HTTP starting GET /cloudant-wp2-backups/?max-keys=1000&prefix=data/ktest-feb-18-01/logs/&delimiter=/&marker= ContentLen:0
[7f0345d1b2675c981a9798e11597e2d0] HTTP code=200 early=0, time=0.006408s GET /cloudant-wp2-backups/?max-keys=1000&prefix=data/ktest-feb-18-01/logs/&delimiter=/&marker= contentLen=0 [321 out, response content len 393]
[de0061b79bebbf4b8486eee0c52584f7] HTTP starting GET /cloudant-wp2-backups/?max-keys=1000&prefix=data/ktest-feb-18-01/logs/0000/&delimiter=/&marker= ContentLen:0
[de0061b79bebbf4b8486eee0c52584f7] HTTP code=200 early=0, time=0.063112s GET /cloudant-wp2-backups/?max-keys=1000&prefix=data/ktest-feb-18-01/logs/0000/&delimiter=/&marker= contentLen=0 [326 out, response content len 403]
[a733810bc81535efcc61efad34748808] HTTP starting GET /cloudant-wp2-backups/?max-keys=1000&prefix=data/ktest-feb-18-01/logs/0000/0000/&delimiter=/&marker= ContentLen:0
[15dd7e7d000b425d2eefadd0857a8a1d] HTTP code=200 early=0, time=0.080596s GET /cloudant-wp2-backups/?max-keys=1000&prefix=data/ktest-feb-18-01/snapshots/&delimiter=/&marker= contentLen=0 [326 out, response content len 701]
[a733810bc81535efcc61efad34748808] HTTP *ERROR*=operation_cancelled early=0, time=0.011054s GET /cloudant-wp2-backups/?max-keys=1000&prefix=data/ktest-feb-18-01/logs/0000/0000/&delimiter=/&marker= contentLen=0 [331 out]
ERROR: Platform error
Fatal Error: Platform error

I’m just looking into this new message. It looks like the GET /cloudant-wp2-backups/?max-keys=1000&prefix=data/ktest-feb-18-01/logs/0000/0000/&delimiter=/&marker= request is cancelled after 11 ms?

I performed similar commands with our IBM Cloud CLI tool.

ic cos list-objects --bucket cloudant-wp2-backups --max-items 10 --page-size 20 --prefix 'data/ktest-feb-18-01/logs/0000/0000' --delimiter '' --marker ''
OK
Found 1 object in bucket 'cloudant-wp2-backups':

Name                                                                                                       Last Modified              Object Size   
data/ktest-feb-18-01/logs/0000/0000/log,81944689829,81964689829,6dafb1ab8a0370b35d513aaf10e3d307,1048576   Feb 18, 2020 at 22:18:18   0 B   

From the verbose logs for that command I can see the successful request:

REQUEST: [2020-02-18T15:36:47-07:00]
GET /?delimiter=&marker=&max-keys=10&prefix=data%2Fktest-feb-18-01%2Flogs%2F0000%2F0000 HTTP/1.1
Host: cloudant-wp2-backups.s3.us-south.cloud-object-storage.appdomain.cloud
...
RESPONSE: [2020-02-18T15:36:48-07:00] Elapsed: 338ms
HTTP/1.1 200 OK

I’ll keep thinking on this one. The 404s for the data/ktest-feb-18-01/properties/ objects – are those OK with the client?

Oh, also: no, there is no local snapshots directory.

The request is being cancelled because something is encountering a platform_error exception that is handled at the outermost level, which cancels program execution and stops all in-progress operations. As I said before, something is calling abspath(), but I’m not sure what. You can see from this trace event

<Event Severity="30" Time="1581545857.479831" Type="AbsolutePathError" ID="0000000000000000" Path="snapshots/snapshot,8016609928,8016609928,42" UnixErrorCode="d" UnixError="Permission denied" Error="platform_error" ErrorDescription="Platform error" ErrorCode="1500" Machine="0.0.0.0:0" LogGroup="" />
that something is calling abspath() with a backup file path, and that’s resulting in a platform error.

I can’t find a code path that could lead to this.

Is the working directory perhaps not listable or readable by the user running the process?

I found the issue.

BackupContainerFileSystem, the parent class of BackupContainerBlobStore, uses a method called basename() in several places to extract a filename from a path string, and basename() first normalizes the given path using abspath().

Obviously it makes no sense to look for remote paths on the local filesystem; basename() was just being used as a convenience because it also works for paths that do not exist at all. But I’m guessing it throws a platform error if something in the path does exist but just isn’t readable.

Here’s the local permissions layout. The foundationdb user is used to run the commands due to secrets-file ownership (hence the sfdbbackup alias).

My working dir for a describe attempt should be wide open.

ls -alh .
total 16K
drwxrwxrwx 2 foundationdb foundationdb 4.0K Feb 18 23:01 .
drwxrwxrwx 3 foundationdb foundationdb 4.0K Feb 18 23:00 ..
-rw-r--r-- 1 foundationdb foundationdb 6.5K Feb 18 23:01 trace.0.0.0.0.0.1582066875.WRkUFF.0.1.xml

These were the important bits. Thanks!

Symbolicated backtrace at https://gist.github.com/alexmiller-apple/07ab8d3199f3629c23628eba911c1743

@SteavedHams It looks like he ends up in BackupContainerFileSystem::listKeyspaceSnapshots, which calls pathToKeyspaceSnapshotFile, which calls basename, which calls abspath.

EDIT: ARRRGH YOU BEAT ME

I see – so, looking around here regarding the actual-but-unintended filesystem check (for the 6.2 release, anyway): https://github.com/apple/foundationdb/blame/a41aa41816ec21351a1b823d2f8814a3db52757d/flow/Platform.cpp#L1976

Yes, we need to get the debug-symbol binaries into common usage here. We’ve been using the Ubuntu .debs, which don’t include them, last I checked.

I’m just surprised I’m the first to see this, I suppose.

Alright, I know why I’m special. Even though the working dir and its parent dir were accessible to the foundationdb user, they were still inside my home directory.

To test this, I ran the describe from a directory where the foundationdb user was able to read every directory from / down to the workdir, and it succeeded.

sudo -u foundationdb sfdbbackup describe -d blobstore://936027fd996b447197a674452f421752@s3.private.us-south.cloud-object-storage.appdomain.cloud/ktest-feb-18-01?bucket=cloudant-wp2-backups -C /etc/foundationdb/fdb.cluster --log 
URL: blobstore://936027fd996b447197a674452f421752@s3.private.us-south.cloud-object-storage.appdomain.cloud/ktest-feb-18-01?bucket=cloudant-wp2-backups
Restorable: true
Snapshot:  startVersion=81944848101 (maxLogEnd -0.00 days)  endVersion=81944848101 (maxLogEnd -0.00 days)  totalBytes=42  restorable=true  expiredPct=0.00
SnapshotBytes: 42
MinLogBeginVersion:      81944689829 (maxLogEnd -0.00 days)
ContiguousLogEndVersion: 81964689829 (maxLogEnd -0.00 days)
MaxLogEndVersion:        81964689829 (maxLogEnd -0.00 days)
MinRestorableVersion:    81944848101 (maxLogEnd -0.00 days)
MaxRestorableVersion:    81964689828 (maxLogEnd -0.00 days)

If you hadn’t already seen, on the Downloads page there’s an fdb_$version-debug.tar.gz. Copy the URL and edit it to the right version; the tarball should still be there for any previously published version.

Ah nice, that’s a great tip.

Would you guys like me to file a bug for this remote-blobstore local-filesystem-check behavior?

I’ve confirmed that using a workdir path where every directory in the path can be read and traversed (r/x) by the foundationdb user fixes my issue for remote blobstore restores as well.
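For anyone else hitting this, here’s a tiny hypothetical helper (my own naming, not an fdbbackup tool) that walks a path upward and reports any directory component the current user cannot traverse – run it as the same user that runs fdbbackup:

```shell
# Hypothetical helper: print each path component that exists but lacks
# the execute (traverse) permission for the current user.
check_traversal() {
    p=$1
    while [ -n "$p" ] && [ "$p" != "/" ]; do
        if [ -e "$p" ] && [ ! -x "$p" ]; then
            echo "not traversable: $p"
        fi
        p=$(dirname "$p")
    done
}

# Empty output means every component from / down to here is traversable.
check_traversal "$(pwd)"
```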

I’ve already fixed it by replacing basename with pure string manipulation that handles both slash types.
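Something like this pure-string version (a sketch with my own naming, not the actual patch) avoids the filesystem entirely while handling both separator styles:

```shell
# A filesystem-free basename: pure string manipulation handling both
# '/' and '\' separators, never touching the local filesystem.
strip_dirs() {
    f=${1##*/}     # drop everything through the last '/'
    f=${f##*\\}    # ...and through the last '\'
    printf '%s\n' "$f"
}

strip_dirs 'snapshots/snapshot,8016609928,8016609928,42'
# -> snapshot,8016609928,8016609928,42
```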

Glad you found a way to avoid the issue for now. The next release will have it fixed.
