Running FDB backup with AWS SDK

Hi FDB Community,
I’m trying to run FDB backup with aws_sdk flag enabled as we run FDB in K8s pods with IAM role for Service Account Authentication. However, my effort to get it running has failed so far. I’m using 7.3.43 for servers.

I’ve built the FDB image with WITH_AWS_BACKUP flag enabled. Here is the fdbstart command on a pod and the trace log.

I see that I get 403 error from S3, but I’ve double/triple checked my policies. I even passed the secret, and tokens to fdbbackup directly and they worked.

command:

fdbbackup start -C /var/dynamic-conf/fdb.cluster -w -z -s 3600 --log --tls_verify_peers=Check.Valid=0 --logdir . -t c2 -d "blobstore://s3.us-east-1.amazonaws.com:443/test-backups?region=us-east-1&sdk_auth=1&bucket=super-private-bucket-name"

Log:


<Event Severity="10" Time="1727560032.194388" DateTime="2024-09-28T21:47:12Z" Type="FBA_SubmitBackup" ID="0000000000000000" TagName="c2" StopWhenDone="0" UsePartitionedLog="0" OutContainer="blobstore://s3.us-east-1.amazonaws.com:443/test-backups?region=us-east-1&amp;sdk_auth=1&amp;bucket=super-private-bucket-name" ThreadID="10261369439410137752" Machine="10.10.69.117:57" LogGroup="default" ClientDescription="primary-7.3.43-13056699628119360000" />
<Event Severity="10" Time="1727560032.194388" DateTime="2024-09-28T21:47:12Z" Type="CodeCoverage" ID="0000000000000000" File="fdbclient/ReadYourWrites.actor.cpp" Line="1632" Condition="true" Covered="1" Comment="ReadYourWritesTransaction::get" ThreadID="10261369439410137752" Machine="10.10.69.117:57" LogGroup="default" ClientDescription="primary-7.3.43-13056699628119360000" />
<Event Severity="10" Time="1727560032.194388" DateTime="2024-09-28T21:47:12Z" Type="StartingRunLoopProfilingThread" ID="0000000000000000" Interval="0.125" ThreadID="10261369439410137752" Machine="10.10.69.117:57" LogGroup="default" ClientDescription="primary-7.3.43-13056699628119360000" />
<Event Severity="10" Time="1727560032.194388" DateTime="2024-09-28T21:47:12Z" Type="Net2Running" ID="0000000000000000" ThreadID="10261369439410137752" Machine="10.10.69.117:57" LogGroup="default" ClientDescription="primary-7.3.43-13056699628119360000" />
<Event Severity="10" Time="1727560032.194388" DateTime="2024-09-28T21:47:12Z" Type="ProcessTimeOffset" ID="0000000000000000" ProcessTime="1727560032.201446" SystemTime="1727560032.201447" OffsetFromSystemTime="-0.000000" ThreadID="10261369439410137752" Machine="10.10.69.117:57" LogGroup="default" ClientDescription="primary-7.3.43-13056699628119360000" />
<Event Severity="10" Time="1727560032.201496" DateTime="2024-09-28T21:47:12Z" Type="ConnectingTo" ID="0000000000000000" SuppressedEventCount="0" PeerAddr="172.20.247.149:4501" PeerReferences="3" FailureStatus="OK" ThreadID="10261369439410137752" Machine="10.10.69.117:57" LogGroup="default" ClientDescription="primary-7.3.43-13056699628119360000" />
<Event Severity="10" Time="1727560032.201496" DateTime="2024-09-28T21:47:12Z" Type="BlobStoreMetrics" ID="df17f05a0101c4d4" Elapsed="0" RequestsSuccessful="0 -1 0" RequestsFailed="0 -1 0" NewConnections="0 -1 0" ExpiredConnections="0 -1 0" ReusedConnections="0 -1 0" FastRetries="0 -1 0" GlobalConnectionPoolCount="0" GlobalConnectionPoolSize="0" ThreadID="10261369439410137752" Machine="10.10.69.117:57" LogGroup="default" ClientDescription="primary-7.3.43-13056699628119360000" TrackLatestType="Original" />
<Event Severity="10" Time="1727560032.201496" DateTime="2024-09-28T21:47:12Z" Type="ConnectionExchangingConnectPacket" ID="51d7a3fa8bfa5c6e" SuppressedEventCount="0" PeerAddr="172.20.247.149:4501" ThreadID="10261369439410137752" Machine="10.10.69.117:57" LogGroup="default" ClientDescription="primary-7.3.43-13056699628119360000" />
<Event Severity="10" Time="1727560032.204250" DateTime="2024-09-28T21:47:12Z" Type="ConnectionEstablished" ID="51d7a3fa8bfa5c6e" SuppressedEventCount="0" Peer="172.20.247.149:4501" ConnectionId="1" ThreadID="10261369439410137752" Machine="10.10.69.117:57" LogGroup="default" ClientDescription="primary-7.3.43-13056699628119360000" />
<Event Severity="10" Time="1727560032.204250" DateTime="2024-09-28T21:47:12Z" Type="ConnectedOutgoing" ID="0000000000000000" SuppressedEventCount="0" PeerAddr="172.20.247.149:4501" ThreadID="10261369439410137752" Machine="10.10.69.117:57" LogGroup="default" ClientDescription="primary-7.3.43-13056699628119360000" />
<Event Severity="10" Time="1727560032.232761" DateTime="2024-09-28T21:47:12Z" Type="S3BlobStoreEndpointNewConnectionSuccess" ID="0000000000000000" SuppressedEventCount="0" RemoteEndpoint="54.231.236.80:443:tls(fromHostname)" ExpiresIn="120" Proxy="" ThreadID="10261369439410137752" Machine="10.10.69.117:57" LogGroup="default" ClientDescription="primary-7.3.43-13056699628119360000" />
<Event Severity="10" Time="1727560032.232761" DateTime="2024-09-28T21:47:12Z" Type="AWSSDKInitSuccessful" ID="0000000000000000" ThreadID="10261369439410137752" Machine="10.10.69.117:57" LogGroup="default" ClientDescription="primary-7.3.43-13056699628119360000" />
<Event Severity="10" Time="1727560032.232761" DateTime="2024-09-28T21:47:12Z" Type="S3BlobStoreGotSdkCredentials" ID="0000000000000000" SuppressedEventCount="0" Duration="0.158162" ThreadID="10261369439410137752" Machine="10.10.69.117:57" LogGroup="default" ClientDescription="primary-7.3.43-13056699628119360000" />
<Event Severity="10" Time="1727560032.232761" DateTime="2024-09-28T21:47:12Z" Type="SlowTask" ID="0000000000000000" TaskID="8900" MClocks="348.959" Duration="0.158619" SampleRate="1" NumYields="1" ThreadID="10261369439410137752" Machine="10.10.69.117:57" LogGroup="default" ClientDescription="primary-7.3.43-13056699628119360000" />
<Event Severity="20" Time="1727560032.442914" DateTime="2024-09-28T21:47:12Z" Type="ParseS3XMLResponseNoError" ID="0000000000000000" Response="" ThreadID="10261369439410137752" Machine="10.10.69.117:57" LogGroup="default" ClientDescription="primary-7.3.43-13056699628119360000" />
<Event Severity="40" ErrorKind="Unset" Time="1727560032.442914" DateTime="2024-09-28T21:47:12Z" Type="StopAfterError" ID="0000000000000000" Error="backup_error" ErrorDescription="Backup error" ErrorCode="2300" ThreadID="10261369439410137752" Backtrace="addr2line -e fdbbackup.debug -p -C -f -i 0x14da7f7 0x14d9dd1 0x14d9f6d 0x6d5707 0x6d58bf 0x70f260 0x6f12fc 0x6f17bf 0x70f260 0x6f6742 0x6f70a1 0x70f260 0xa432a1 0xac7c3f 0x70f260 0x6cafd9 0x70f260 0x7f0e04 0x70f260 0xfd05fe 0xfc7eb0 0xfd018e 0xfdb1f0 0xff6519 0xff6974 0xff6baa 0xfdb0e0 0x1281638 0x128cbd8 0x7134c8 0x1290dd7 0x128d998 0x128db0d 0x7134c8 0x79a310 0x128c558 0x145ea78 0xcc43e2 0x68af46 0x7f5d0ab6f590" Machine="10.10.69.117:57" LogGroup="default" ClientDescription="primary-7.3.43-13056699628119360000" />
<Event Severity="20" Time="1727560032.442914" DateTime="2024-09-28T21:47:12Z" Type="S3BlobStoreEndpointRequestFailed" ID="0000000000000000" SuppressedEventCount="0" ResponseCode="403" S3ErrorCode="" ConnectionEstablished="1" ReusingConn="0" ConnID="82f5cf0fb20cb4f4" ConnectDuration="0.17573" ReqDuration="0.0518951" RemoteEndpoint="54.231.236.80:443:tls(fromHostname)" Verb="HEAD" Resource="/super-private-bucket-name" ThisTry="1" URI="/super-private-bucket-name" Proxy="" ThreadID="10261369439410137752" Machine="10.10.69.117:57" LogGroup="default" ClientDescription="primary-7.3.43-13056699628119360000" />
<Event Severity="10" Time="1727560032.442914" DateTime="2024-09-28T21:47:12Z" Type="MachineLoadDetail" ID="0000000000000000" User="144358039" Nice="50762" System="35703444" Idle="2035142928" IOWait="25796772" IRQ="0" SoftIRQ="3505726" Steal="178503" Guest="0" ThreadID="10261369439410137752" Machine="10.10.69.117:57" LogGroup="default" ClientDescription="primary-7.3.43-13056699628119360000" />
<Event Severity="10" Time="1727560032.442914" DateTime="2024-09-28T21:47:12Z" Type="ProcessMetrics" ID="0000000000000000" Elapsed="0.247243" CPUSeconds="0.021114" MainThreadCPUSeconds="0.01532" UptimeSeconds="0.248525" Memory="1512628224" ResidentMemory="25993216" UnusedAllocatedMemory="0" MbpsSent="0.0859397" MbpsReceived="0.114996" DiskTotalBytes="0" DiskFreeBytes="0" DiskQueueDepth="0" DiskIdleSeconds="0" DiskReads="0" DiskReadSeconds="0" DiskWrites="0" DiskWriteSeconds="0" DiskReadsCount="0" DiskWritesCount="0" DiskWriteSectors="0" DiskReadSectors="0" FileWrites="0" FileReads="0" CacheReadBytes="0" CacheFinds="0" CacheWritesBlocked="0" CacheReadsBlocked="0" CachePageReadsMerged="0" CacheWrites="0" CacheReads="0" CacheHits="0" CacheMisses="0" CacheEvictions="0" DCID="[not set]" ZoneID="[not set]" MachineID="[not set]" Version="[not set]" AIOSubmitCount="0" AIOCollectCount="0" AIOSubmitLag="0" AIODiskStall="0" CurrentConnections="5" ConnectionsEstablished="20.223" ConnectionsClosed="0" ConnectionErrors="0" TLSPolicyFailures="0" ThreadID="10261369439410137752" Machine="10.10.69.117:57" LogGroup="default" ClientDescription="primary-7.3.43-13056699628119360000" TrackLatestType="Original" />
<Event Severity="10" Time="1727560032.442914" DateTime="2024-09-28T21:47:12Z" Type="MemoryMetrics" ID="0000000000000000" TotalMemory16="0" ApproximateUnusedMemory16="0" ActiveThreads16="0" TotalMemory32="131072" ApproximateUnusedMemory32="0" ActiveThreads32="1" TotalMemory64="262144" ApproximateUnusedMemory64="0" ActiveThreads64="2" TotalMemory96="131040" ApproximateUnusedMemory96="0" ActiveThreads96="1" TotalMemory128="131072" ApproximateUnusedMemory128="0" ActiveThreads128="1" TotalMemory256="262144" ApproximateUnusedMemory256="0" ActiveThreads256="2" TotalMemory512="0" ApproximateUnusedMemory512="0" ActiveThreads512="0" TotalMemory1024="0" ApproximateUnusedMemory1024="0" ActiveThreads1024="0" TotalMemory2048="0" ApproximateUnusedMemory2048="0" ActiveThreads2048="0" TotalMemory4096="0" ApproximateUnusedMemory4096="0" ActiveThreads4096="0" TotalMemory8192="0" ApproximateUnusedMemory8192="0" ActiveThreads8192="0" TotalMemory16384="0" ApproximateUnusedMemory16384="0" ActiveThreads16384="0" HugeArenaMemory="0" DCID="[not set]" ZoneID="[not set]" MachineID="[not set]" ThreadID="10261369439410137752" Machine="10.10.69.117:57" LogGroup="default" ClientDescription="primary-7.3.43-13056699628119360000" />
<Event Severity="10" Time="1727560032.442914" DateTime="2024-09-28T21:47:12Z" Type="FastAllocMemoryUsage" ID="0000000000000000" TotalMemory="917472" UnusedMemory="0" Utilization="100.000000%" ThreadID="10261369439410137752" Machine="10.10.69.117:57" LogGroup="default" ClientDescription="primary-7.3.43-13056699628119360000" />

S3BlobStoreEndpointRequestFailed ResponseCode="403" means Forbidden. So your access to S3 has issues, probably with your credentials. You can add a knob --knob_http_verbose_level 3 to the command line to print out HTTP request/response pairs and debug from there.

1 Like

Thanks @jzhou, here is the log.

When, on the same pod, I use aws s3api with head-bucket, list-objects, etc., it works.

Note that I replaced the bucket name & parts of token for privacy reasons.

[bde63a1780dbbf20b9bd24b92e5c614f] HTTP starting HEAD /super-private-bucket-name ContentLen:0
Request Header: Accept: application/xml
Request Header: Authorization: AWS4-HMAC-SHA256 Credential=ASIAZURGGLSKWSNRSX33/20240930/us-east-1/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date;x-amz-security-token, Signature=0cf179f545ee24c32ecf771cdb90f68ac274fc493bb3b13a1808ccfad18f0952
Request Header: Content-Length: 0
Request Header: Host: s3.us-east-1.amazonaws.com
Request Header: x-amz-content-sha256: UNSIGNED-PAYLOAD
Request Header: x-amz-date: 20240930T184042Z
Request Header: x-amz-security-token: IQoJb3JpZ2luX2VjEDNaCXVzLWVhc3QtMSJIMEYCIQCYxMTfz3KUX+Na1b7OKEJgmdYdo8YS/NbHlgEVbLSCqAIhAIo/4uQeCtkJlUxZFsfVMhUcIXLTGSTsYJep+DxsxusxKrsFCHwQAxoMNjYyNTc4ODE0MTAxIgwYI3oLT4bQ6na/CHEqmAW78LS3gNVE9EJa6w/nuEN2Tw59Z1m/JdKWLbugPFnEGTSct81uC/TyvHvPK0tv60wRrv5GFxeDhv/2M2RCmnDtxCl4t7tkPqR9qlzweXqB5wLo+Cq/dhkmAtuiMSP3bRPVCbQ3hAZXioQVoZc3zTWchvV5dHZtjwOwcTOKFUPDHPsl01Qq6+n449cTj+/ZWkc3/Dmw6ZXwe4gEBIlMnffqcaUhxPe0OoGRf72BbrRljBbGPdSYQHO0kA031TetvyfSeNtnNnXcqDXtHW72jHpxutX6q0EyCqdwGtULPD21i+by6Z7+7nHDFKFvLxx3WwS8ueNn4Wl8xYxJf0D2/RtvkZvOBMj0P1WP0sMbp6IU/XCxC9CRSonaZapq06vuDJpvg2jQ3eQdn+fE2HaiAhX0fMYmCz9oO+0uGuhJWFLi4rw9ee3sN53Mr1gKGhCHTAk8MR4VlfyLUiBiW4JxhwIOxP2345wJYZhwU1pbKuHKGnKDurHtbP6dYtKpedGvC9efgzOnqwDlQt+x5dKB+fZCgaUJGWdNln+OXDc5ylBIiJriHZu743CmgduecXg/0DSestGOoabXFIiCLWj45jotzLhlt6z6sns2D0SP/Kiu8KiVVohdlGsdIvDA4P2kCYm+oSvYmVzfGZK6M+aV9YKnGMkI/NzLPpSjUIEJG0vbKID+3eqijIYRgaiWG/E6EiwAEWk8ErPwuBnCvYaKJSTQ9CHm/VhummBVOdjh/BhMWK5bcahEvkRtd9/mOsCYrkRNziCtqib0A8ufInSk25egRxUrNZp2p8ZTWeqNMPfoZ6E+cX9LIVSU+Ucw55zl07sy2bj9vskxunpMF8h3EcCUoulB9pYIeiTRsic7Ryn3h61rtODKbABpMIXc67cGOrABcxHZfVbp43G5QyAmsZFpK9l03lN30HNiJ23qmPU8GeQ/iNC4Tsalnz5DYoLK7lbnCnLY0cyXSkAb1hpbolXUxnuRp8u+I5g3mzVumRucd3Tvs1c1nQ/+5/bn6XwqZo8fkN8qtQ/INh+AMvssLPp33WilFZDQpXuQgab/6ySfaT3epLejgxk+d2WEdSKco/RdpJHmo79wl1Adz0OkaVWjgUrQ86v3KFJtnIJy/hcmJTU=
[bde63a1980dbbf20b9bd24b92e5a614f] HTTP code=403 early=false, time=0.028265714645385742 HEAD /super-private-bucket-name contentLen=0 [1780 out, response content len -1]
[bde63a1780dbbf20b9bd24b92e5c614f] HTTP RESPONSE:  HEAD /super-private-bucket-name
Response Code: 403
Response ContentLen: -1
Reponse Header: Content-Type: application/xml
Reponse Header: Date: Mon, 30 Sep 2024 18:40:41 GMT
Reponse Header: Server: AmazonS3
Reponse Header: x-amz-bucket-region: us-east-1
Reponse Header: x-amz-id-2: v7dE9/tEoWc4y/ISz93o2SGW53Kr5GDGBJ+m0LkL8WaePYciuH4kySmVXiyFmVRsPt7cG9MO4e8=
Reponse Header: x-amz-request-id: VQ934ZWQYJ40XXWF
-- RESPONSE CONTENT--

--------

ERROR: Could not create backup container: HTTP response code not received or indicated failure
ERROR: An error was encountered during submission
Fatal Error: Backup error

You could capture request/response from AWS S3API and compare with fdbbackup’s output to debug further.

I think I found the bug.

TLDR; It boils down to how AWS_SDK is built and linked to fdbbackup.

AWS_SDK depends on libcurl. As part of IAM role for Service Account Authentication, the SDK code has to connect to AWS STS over HTTPS, but the libcurl is built without https support or it is disabled. Therefore, the credential provider chain will try another form of authentication and gets some other credentials that do not have permission, hence 403 error.
Here is an excerpt of the log from aws_sdk that is linked to the fdbbackup agent. I collect it by invoking fdbbackup start…

[TRACE] 2024-10-03 20:32:46.185 CurlHttpClient [139659828168256] Making request to https://sts.us-east-1.amazonaws.com
[TRACE] 2024-10-03 20:32:46.185 CurlHttpClient [139659828168256] Including headers:
[TRACE] 2024-10-03 20:32:46.185 CurlHttpClient [139659828168256] content-length: 1481
[TRACE] 2024-10-03 20:32:46.185 CurlHttpClient [139659828168256] content-type: application/x-www-form-urlencoded
[TRACE] 2024-10-03 20:32:46.185 CurlHttpClient [139659828168256] host: sts.us-east-1.amazonaws.com
[TRACE] 2024-10-03 20:32:46.185 CurlHttpClient [139659828168256] user-agent: aws-sdk-cpp/1.9.331 Linux/6.1.109-118.189.amzn2023.x86_64 x86_64 GCC/11.4.1
[DEBUG] 2024-10-03 20:32:46.185 CurlHandleContainer [139659828168256] Attempting to acquire curl connection.
[DEBUG] 2024-10-03 20:32:46.185 CurlHandleContainer [139659828168256] No current connections available in pool. Attempting to create new connections.
[DEBUG] 2024-10-03 20:32:46.185 CurlHandleContainer [139659828168256] attempting to grow pool size by 2
[INFO] 2024-10-03 20:32:46.185 CurlHandleContainer [139659828168256] Pool grown by 2
[INFO] 2024-10-03 20:32:46.185 CurlHandleContainer [139659828168256] Connection has been released. Continuing.
[DEBUG] 2024-10-03 20:32:46.185 CurlHandleContainer [139659828168256] Returning connection handle 0x12d26b60
[DEBUG] 2024-10-03 20:32:46.185 CurlHttpClient [139659828168256] Obtained connection handle 0x12d26b60
[DEBUG] 2024-10-03 20:32:46.185 CURL [139659828168256] (Text) Rebuilt URL to: https://sts.us-east-1.amazonaws.com/

[DEBUG] 2024-10-03 20:32:46.185 CURL [139659828168256] (Text) Protocol "https" not supported or disabled in libcurl

[DEBUG] 2024-10-03 20:32:46.185 CURL [139659828168256] (Text) Closing connection -1

[ERROR] 2024-10-03 20:32:46.185 CurlHttpClient [139659828168256] Curl returned error code 1 - Unsupported protocol
[DEBUG] 2024-10-03 20:32:46.185 CurlHandleContainer [139659828168256] Destroy curl handle: 0x12d26b60
[DEBUG] 2024-10-03 20:32:46.185 CurlHandleContainer [139659828168256] Created replacement handle and released to pool: 0x12d2fb10
[ERROR] 2024-10-03 20:32:46.185 STSResourceClient [139659828168256] Http request to retrieve credentials failed
[WARN] 2024-10-03 20:32:46.185 STSResourceClient [139659828168256] Request failed, now waiting 0 ms before attempting again.
[TRACE] 2024-10-03 20:32:46.185 CurlHttpClient [139659828168256] Making request to https://sts.us-east-1.amazonaws.com
[TRACE] 2024-10-03 20:32:46.185 CurlHttpClient [139659828168256] Including headers:
[TRACE] 2024-10-03 20:32:46.185 CurlHttpClient [139659828168256] content-length: 1481
[TRACE] 2024-10-03 20:32:46.185 CurlHttpClient [139659828168256] content-type: application/x-www-form-urlencoded
[TRACE] 2024-10-03 20:32:46.185 CurlHttpClient [139659828168256] host: sts.us-east-1.amazonaws.com
[TRACE] 2024-10-03 20:32:46.185 CurlHttpClient [139659828168256] user-agent: aws-sdk-cpp/1.9.331 Linux/6.1.109-118.189.amzn2023.x86_64 x86_64 GCC/11.4.1
[DEBUG] 2024-10-03 20:32:46.185 CurlHandleContainer [139659828168256] Attempting to acquire curl connection.
[INFO] 2024-10-03 20:32:46.185 CurlHandleContainer [139659828168256] Connection has been released. Continuing.
[DEBUG] 2024-10-03 20:32:46.185 CurlHandleContainer [139659828168256] Returning connection handle 0x12d2fb10
[DEBUG] 2024-10-03 20:32:46.185 CurlHttpClient [139659828168256] Obtained connection handle 0x12d2fb10
[DEBUG] 2024-10-03 20:32:46.185 CURL [139659828168256] (Text) Rebuilt URL to: https://sts.us-east-1.amazonaws.com/

[DEBUG] 2024-10-03 20:32:46.185 CURL [139659828168256] **(Text) Protocol "https" not supported or disabled in libcurl**

[DEBUG] 2024-10-03 20:32:46.185 CURL [139659828168256] (Text) Closing connection -1

[ERROR] 2024-10-03 20:32:46.185 CurlHttpClient [139659828168256] Curl returned error code 1 - Unsupported protocol

That explains that why my connections to S3 using aws cli or my Go program works, but not when I use the fdbbackup with aws_sdk.

I recall that we hit a similar issue during our testing. I’m not sure about the status of the AWS SDK, but it might be worth to call out what a user can expect (e.g. “experimental”). It’s probably also worth to call out (in a doc?) that the AWS SDK support is fairly limited, e.g.: Add support for assuming a role in the provided AWS config · Issue #10835 · apple/foundationdb · GitHub, so some AWS configurations will not work with the current state of the AWS SDK in FDB. We ended up building a small sidecar that fetches the credentials and writes it into the expected blob store format for the backup agents.