Summary
Hello all, I’m Kyle over at IBM Cloudant. I’m currently configuring/testing backups/restores against our test infrastructure.
We use IBM Cloud Object Store (COS) as our blobstore. The FoundationDB cluster itself is not currently configured for intracluster TLS although you will see I’ve provided the necessary TLS configuration to backup CLI commands and backup agents so they can work with IBM COS over TLS. All that is working well and I can trigger backups.
The issue surfaces when I try to fdbbackup describe
or attempt to restore a backup from our blobstore. I’m seeing some kind of permissions error in the logs. They look filesystemy although I’m betting it’s to do with the COS requests. I believe it’s an issue trying to read data back from our store. I can however read objects just fine using the same HMAC API keys configured for backup, but with our ibmcloud
CLI tool.
I’m going to continue troubleshooting – pull the logs from our COS friends and see if it gives insight into the request against COS that fails. I’ve read through some of the codebase relevant to backup management and most of the dots are connecting, but I’m left stumped as to why we fail to read the snapshot from the blobstore. Any tips on understanding this error are appreciated.
Error summary
I will restate my test process below but here’s the highlight. Please note that we have some scripts such as sfdbbackup
which are just the familiar fdbbackup
etc. commands but with TLS/blobcreds flags set. Here’s the kernel for a describe attempt:
<Event Severity="10" Time="1581545857.363042" Type="BackupContainerDescribe2" ID="0000000000000000" URL="blobstore://936027fd996b447197a674452f421752@s3.private.us-south.cloud-object-storage.appdomain.cloud/ktest-feb-12-01?bucket=cloudant-wp2-backups" LogStartVersionOverride="-1" ExpiredEndVersion="-1" UnreliableEndVersion="-1" LogBeginVersion="-1" LogEndVersion="-1" Machine="0.0.0.0:0" LogGroup="" />
<Event Severity="30" Time="1581545857.363042" Type="BackupContainerMetadataInvalid" ID="0000000000000000" URL="blobstore://936027fd996b447197a674452f421752@s3.private.us-south.cloud-object-storage.appdomain.cloud/ktest-feb-12-01?bucket=cloudant-wp2-backups" ExpiredEndVersion="-1" UnreliableEndVersion="-1" LogBeginVersion="-1" LogEndVersion="-1" Machine="0.0.0.0:0" LogGroup="" />
<Event Severity="30" Time="1581545857.479831" Type="AbsolutePathError" ID="0000000000000000" Path="snapshots/snapshot,8016609928,8016609928,42" UnixErrorCode="d" UnixError="Permission denied" Error="platform_error" ErrorDescription="Platform error" ErrorCode="1500" Machine="0.0.0.0:0" LogGroup="" />
<Event Severity="40" Time="1581545857.479831" Type="StopAfterError" ID="0000000000000000" Error="platform_error" ErrorDescription="Platform error" ErrorCode="1500" Backtrace="addr2line -e fdbbackup.debug -p -C -f -i 0xaa1d3f 0xaa114f 0x4d74c3 0x4d76ae 0x4f8349 0x4cd4f1 0x4cd66e 0x5b3441 0x5b38e1 0x4f8349 0x4f8349 0x593fde 0x5be1d9 0x42851c 0x5b7990 0x5e0f15 0x5e1047 0x5d9551 0x4fa0f8 0x5e9785 0x5eaced 0x5eafa1 0x5e51f8 0x5e5497 0x5e575c 0x5e0ab8 0x75d87f 0x75e0c1 0x75bd00 0x75ce57 0x75d083 0x75d11c 0x75b190 0x75ba8d 0x4fa0f8 0x4cd15f 0x75aad8 0x75bc0f 0x4fa060 0xae4660 0x8391ea 0x4b6e5d 0x7f69703fc2e1" Machine="0.0.0.0:0" LogGroup="" />
Here is the API key reading that object just fine:
ibmcloud cos config list
Key Value
Last Updated Wednesday, February 12 2020 at 16:11:14
Default Region us-south
Download Location /Users/ksnavely@us.ibm.com/Downloads
CRN
AccessKeyID 936027fd996b447197a674452f421752
SecretAccessKey ************************************************
Authentication Method HMAC
URL Style VHost
ibmcloud cos get-object --bucket cloudant-wp2-backups --key 'data/ktest-feb-12-01/snapshots/snapshot,8016609928,8016609928,42' get_test
OK
Successfully downloaded 'data/ktest-feb-12-01/snapshots/snapshot,8016609928,8016609928,42' from bucket 'cloudant-wp2-backups'
176 B downloaded.
Testing process
What follows is the complete process of my attempt to describe
a backup. Restores fail in the same way.
First, seed a key/value.
fdb> writemode on
fdb> get ktest
`ktest': not found
fdb> set ktest foo
Committed (7575052530)
Next, start a backup covering this keyrange.
sudo -u foundationdb sfdbbackup start -d 'blobstore://936027fd996b447197a674452f421752@s3.private.us-south.cloud-object-storage.appdomain.cloud/ktest-feb-12-01?bucket=cloudant-wp2-backups&sc=1' --log -t ktest-feb-12-01 -k "ktest ktest/xff"
The backup on tag `ktest-feb-12-01' was successfully submitted.
There were no errors in the trace logs for the above command. I checked the status:
fdbbackup status -t ktest-feb-12-01
The previous backup on tag `ktest-feb-12-01' at blobstore://936027fd996b447197a674452f421752@s3.private.us-south.cloud-object-storage.appdomain.cloud/ktest-feb-12-01?bucket=cloudant-wp2-backups&sc=1 completed at version 8036461619.
BackupUID: a6cef066c8e245fff9eb2a3fd651bd84
BackupURL: blobstore://936027fd996b447197a674452f421752@s3.private.us-south.cloud-object-storage.appdomain.cloud/ktest-feb-12-01?bucket=cloudant-wp2-backups&sc=1
It had completed quickly as expected.
I next pulled all the logs for all backup agents from all cluster nodes. I could not find any error or metadata information in the logs. All I see is a brief few logs denoting success of the ktest backup.
Next I inspect the state of the backup.
sudo -u foundationdb sfdbbackup list -b 'blobstore://936027fd996b447197a674452f421752@s3.private.us-south.cloud-object-storage.appdomain.cloud/?bucket=cloudant-wp2-backups&sc=1'
blobstore://936027fd996b447197a674452f421752@s3.private.us-south.cloud-object-storage.appdomain.cloud/ktest-feb-12-01?bucket=cloudant-wp2-backups
sudo -u foundationdb sfdbbackup describe -d blobstore://936027fd996b447197a674452f421752@s3.private.us-south.cloud-object-storage.appdomain.cloud/ktest-feb-12-01?bucket=cloudant-wp2-backups -C /etc/foundationdb/fdb.cluster --log
ERROR: Platform error
Fatal Error: Platform error
The logs from the describe
attempt:
<Event Severity="10" Time="1581545857.236408" Type="BlobStoreEndpointReusingConnected" ID="0000000000000000" SuppressedEventCount="0" RemoteEndpoint="10.1.129.83:443:tls" ExpiresIn="119.88" Machine="0.0.0.0:0" LogGroup="" />
<Event Severity="10" Time="1581545857.363042" Type="BackupContainerDescribe2" ID="0000000000000000" URL="blobstore://936027fd996b447197a674452f421752@s3.private.us-south.cloud-object-storage.appdomain.cloud/ktest-feb-12-01?bucket=cloudant-wp2-backups" LogStartVersionOverride="-1" ExpiredEndVersion="-1" UnreliableEndVersion="-1" LogBeginVersion="-1" LogEndVersion="-1" Machine="0.0.0.0:0" LogGroup="" />
<Event Severity="30" Time="1581545857.363042" Type="BackupContainerMetadataInvalid" ID="0000000000000000" URL="blobstore://936027fd996b447197a674452f421752@s3.private.us-south.cloud-object-storage.appdomain.cloud/ktest-feb-12-01?bucket=cloudant-wp2-backups" ExpiredEndVersion="-1" UnreliableEndVersion="-1" LogBeginVersion="-1" LogEndVersion="-1" Machine="0.0.0.0:0" LogGroup="" />
<Event Severity="30" Time="1581545857.479831" Type="AbsolutePathError" ID="0000000000000000" Path="snapshots/snapshot,8016609928,8016609928,42" UnixErrorCode="d" UnixError="Permission denied" Error="platform_error" ErrorDescription="Platform error" ErrorCode="1500" Machine="0.0.0.0:0" LogGroup="" />
<Event Severity="40" Time="1581545857.479831" Type="StopAfterError" ID="0000000000000000" Error="platform_error" ErrorDescription="Platform error" ErrorCode="1500" Backtrace="addr2line -e fdbbackup.debug -p -C -f -i 0xaa1d3f 0xaa114f 0x4d74c3 0x4d76ae 0x4f8349 0x4cd4f1 0x4cd66e 0x5b3441 0x5b38e1 0x4f8349 0x4f8349 0x593fde 0x5be1d9 0x42851c 0x5b7990 0x5e0f15 0x5e1047 0x5d9551 0x4fa0f8 0x5e9785 0x5eaced 0x5eafa1 0x5e51f8 0x5e5497 0x5e575c 0x5e0ab8 0x75d87f 0x75e0c1 0x75bd00 0x75ce57 0x75d083 0x75d11c 0x75b190 0x75ba8d 0x4fa0f8 0x4cd15f 0x75aad8 0x75bc0f 0x4fa060 0xae4660 0x8391ea 0x4b6e5d 0x7f69703fc2e1" Machine="0.0.0.0:0" LogGroup="" />