Trying to restore a backup from S3

Hi Folks,

I’m trying to restore a backup from S3 to an empty FDB cluster. The context: I have an FDB cluster running on k8s with a backup process running on it. It uploads successfully to S3 via minIO.

fdbbackup list -b "blobstore://admin@127.0.0.1:9000?bucket=xxxx&sc=0" --blob_credentials=/etc/fdb/blob_cred.json
blobstore://admin@127.0.0.1:9000/blabla?bucket=xxxx&sc=0

Then I created another (empty) FDB cluster with the same settings. I’m able to reach S3 through minIO from the pod where I want to run fdbrestore. But when I run the following command:

fdbrestore start -w --log -r "blobstore://admin@127.0.0.1:9000/blabla?bucket=xxxx&sc=0" --blob_credentials=/etc/fdb/blob_cred.json --dest_cluster_file=/etc/foundationdb/fdb.cluster

I get the following output when I check the status of the restore:

Tag: default  UID: 22edfe20b449050db4d04d90a42efc71  State: starting  Blocks: 0/0  BlocksInProgress: 0  Files: 0  BytesWritten: 0  ApplyVersionLag: 0  LastError: ''HTTP response code not received or indicated failure' on 'restore_start'' 274s ago.
  URL: blobstore://admin@127.0.0.1:9000/blabla?bucket=xxxx&sc=0  Range: ''-'\xff'  AddPrefix: ''  RemovePrefix: ''  Version: 1885133112347

I suspect an issue with minIO, but I’m not sure how to debug it. I then checked the traces from when I started the restore and noticed some odd logs:

<Event Severity="10" Time="1609247923.979622" Type="ConnectedOutgoing" ID="0000000000000000" SuppressedEventCount="0" PeerAddr="10.128.140.100:4500" Machine="10.136.138.248:7359" LogGroup="default" />
<Event Severity="10" Time="1609247923.986947" Type="BlobStoreEndpointNewConnection" ID="0000000000000000" SuppressedEventCount="0" RemoteEndpoint="127.0.0.1:9000" ExpiresIn="120" Machine="10.136.138.248:7359" LogGroup="default" />
<Event Severity="10" Time="1609247923.987410" Type="AsyncFileOpened" ID="0000000000000000" SuppressedEventCount="0" Filename="/etc/fdb/blob_cred.json" Fd="11" Flags="2228225" Machine="10.136.138.248:7359" LogGroup="default" />
<Event Severity="10" Time="1609247923.987633" Type="AsyncFileClosed" ID="0000000000000000" SuppressedEventCount="0" Fd="11" Machine="10.136.138.248:7359" LogGroup="default" />
<Event Severity="10" Time="1609247924.048588" Type="BlobStoreEndpointReusingConnected" ID="0000000000000000" SuppressedEventCount="0" RemoteEndpoint="127.0.0.1:9000" ExpiresIn="119.939" Machine="10.136.138.248:7359" LogGroup="default" />
<Event Severity="10" Time="1609247924.142139" Type="BackupContainerDescribe2" ID="0000000000000000" URL="blobstore://admin@127.0.0.1:9000/blabla?bucket=xxxx&amp;sc=0" LogStartVersionOverride="-1" ExpiredEndVersion="-1" UnreliableEndVersion="-1" LogBeginVersion="1315974786096" LogEndVersion="1896493112348" Machine="10.136.138.248:7359" LogGroup="default" />
<Event Severity="10" Time="1609247925.998811" Type="ConnectionClosed" ID="2a781cf640f78cf0" Error="connection_unreferenced" ErrorDescription="No peer references for connection" ErrorCode="1048" SuppressedEventCount="0" PeerAddr="10.128.140.100:4500" Machine="10.136.138.248:7359" LogGroup="default" />
<Event Severity="10" Time="1609247925.998811" Type="PeerDestroy" ID="0000000000000000" Error="connection_unreferenced" ErrorDescription="No peer references for connection" ErrorCode="1048" SuppressedEventCount="0" PeerAddr="10.128.140.100:4500" Machine="10.136.138.248:7359" LogGroup="default" />
<Event Severity="20" Time="1609247925.998811" Type="N2_ReadProbeError" ID="2a781cf640f78cf0" SuppressedEventCount="0" ErrorCode="125" Message="Operation canceled" Machine="10.136.138.248:7359" LogGroup="default" />
<Event Severity="10" Time="1609247926.538969" Type="AsyncFileOpened" ID="0000000000000000" SuppressedEventCount="3" Filename="/etc/fdb/blob_cred.json" Fd="15" Flags="2228225" Machine="10.136.138.248:7359" LogGroup="default" />
<Event Severity="10" Time="1609247926.538969" Type="AsyncFileClosed" ID="0000000000000000" SuppressedEventCount="3" Fd="15" Machine="10.136.138.248:7359" LogGroup="default" />

Can those errors explain the restore failures? Does it mean my coordinator pods are not reachable? That would be odd, because the status details from fdbcli tell me that everything is OK.

If you have any tips for debugging, they would be appreciated. :bowing_man:

Those events are not errors. Most of them are informational (Severity 10) and one, N2_ReadProbeError, is a warning (Severity 20).
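To separate real problems from noise in a trace file, it can help to filter by the Severity attribute. The following is a minimal sketch, assuming the usual FoundationDB levels (10 info, 20 warning, 30 warning-always, 40 error) and the one-event-per-line XML layout shown in the traces above. The function name `events_at_or_above` is made up for illustration.

```shell
# Print "Severity Type" for every trace event at or above a minimum severity.
# Reads XML trace lines on stdin; assumes one <Event ... /> per line.
events_at_or_above() {
  # Pull the Severity and Type attributes out of each event line,
  # then keep only the rows whose severity reaches the threshold.
  sed -n 's/.*Severity="\([0-9]*\)".* Type="\([^"]*\)".*/\1 \2/p' |
    awk -v min="$1" '$1 + 0 >= min + 0 { print $1, $2 }'
}

# Hypothetical usage (the trace file path depends on your deployment):
#   events_at_or_above 40 < /path/to/trace.xml
```

Run with a threshold of 40 against the traces above, this would print nothing, which matches the point that none of those events is an error.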

I suspect you do not have any backup agents on your destination cluster.

fdbrestore only controls restore jobs; it does not do any of the restore work. As with backup, backup_agent does all the restore work too. You need one or more backup agents in your cluster with access to your backup data.
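One way to check whether the destination cluster has any agents is to look at the layer status that backup agents publish into `status json`. The sketch below assumes the agents report an `instances_running` counter there and that fdbcli pretty-prints one key per line. Both are assumptions about your version, so treat the result as a hint rather than proof. The helper name `backup_agents_running` is made up for illustration, and the paths are the ones from the question.

```shell
# Read `status json` output on stdin and print the first "instances_running"
# value found; prints 0 when the field is absent (no agent has reported in).
# Note: other layers (for example DR agents) may publish the same key.
backup_agents_running() {
  n=$(sed -n 's/.*"instances_running"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1)
  echo "${n:-0}"
}

# Assumed usage against the destination cluster:
#   fdbcli -C /etc/foundationdb/fdb.cluster --exec 'status json' | backup_agents_running
#
# If that prints 0, start at least one agent that can read the credentials:
#   backup_agent -C /etc/foundationdb/fdb.cluster --blob_credentials /etc/fdb/blob_cred.json
```

On k8s the agent would normally run as its own deployment rather than being started by hand, with the credentials file mounted into the agent pods.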

Indeed. I reapplied my charts, and once my agents were up I finally got the restore working.

Thanks for your help.