Hi Folks,
I’m trying to restore a backup from S3 into an empty FDB cluster. Here is the context: I have an FDB cluster running on k8s with a backup process running on it. It uploads successfully to S3 via MinIO:
fdbbackup list -b "blobstore://admin@127.0.0.1:9000?bucket=xxxx&sc=0" --blob_credentials=/etc/fdb/blob_cred.json
blobstore://admin@127.0.0.1:9000/blabla?bucket=xxxx&sc=0
Then I created another (empty) FDB cluster with the same settings. I’m able to reach S3 through MinIO from the pod where I want to run fdbrestore. But when I run the following command:
fdbrestore start -w --log -r "blobstore://admin@127.0.0.1:9000/blabla?bucket=xxxx&sc=0" --blob_credentials=/etc/fdb/blob_cred.json --dest_cluster_file=/etc/foundationdb/fdb.cluster
I get the following output when I check the status of the restore:
Tag: default UID: 22edfe20b449050db4d04d90a42efc71 State: starting Blocks: 0/0 BlocksInProgress: 0 Files: 0 BytesWritten: 0 ApplyVersionLag: 0 LastError: ''HTTP response code not received or indicated failure' on 'restore_start'' 274s ago.
URL: blobstore://admin@127.0.0.1:9000/blabla?bucket=xxxx&sc=0 Range: ''-'\xff' AddPrefix: '' RemovePrefix: '' Version: 1885133112347
I suspect an issue with MinIO, but I’m not sure how to debug it. I then checked the traces written when I started the restore and noticed some odd events:
<Event Severity="10" Time="1609247923.979622" Type="ConnectedOutgoing" ID="0000000000000000" SuppressedEventCount="0" PeerAddr="10.128.140.100:4500" Machine="10.136.138.248:7359" LogGroup="default" />
<Event Severity="10" Time="1609247923.986947" Type="BlobStoreEndpointNewConnection" ID="0000000000000000" SuppressedEventCount="0" RemoteEndpoint="127.0.0.1:9000" ExpiresIn="120" Machine="10.136.138.248:7359" LogGroup="default" />
<Event Severity="10" Time="1609247923.987410" Type="AsyncFileOpened" ID="0000000000000000" SuppressedEventCount="0" Filename="/etc/fdb/blob_cred.json" Fd="11" Flags="2228225" Machine="10.136.138.248:7359" LogGroup="default" />
<Event Severity="10" Time="1609247923.987633" Type="AsyncFileClosed" ID="0000000000000000" SuppressedEventCount="0" Fd="11" Machine="10.136.138.248:7359" LogGroup="default" />
<Event Severity="10" Time="1609247924.048588" Type="BlobStoreEndpointReusingConnected" ID="0000000000000000" SuppressedEventCount="0" RemoteEndpoint="127.0.0.1:9000" ExpiresIn="119.939" Machine="10.136.138.248:7359" LogGroup="default" />
<Event Severity="10" Time="1609247924.142139" Type="BackupContainerDescribe2" ID="0000000000000000" URL="blobstore://admin@127.0.0.1:9000/blabla?bucket=xxxx&sc=0" LogStartVersionOverride="-1" ExpiredEndVersion="-1" UnreliableEndVersion="-1" LogBeginVersion="1315974786096" LogEndVersion="1896493112348" Machine="10.136.138.248:7359" LogGroup="default" />
<Event Severity="10" Time="1609247925.998811" Type="ConnectionClosed" ID="2a781cf640f78cf0" Error="connection_unreferenced" ErrorDescription="No peer references for connection" ErrorCode="1048" SuppressedEventCount="0" PeerAddr="10.128.140.100:4500" Machine="10.136.138.248:7359" LogGroup="default" />
<Event Severity="10" Time="1609247925.998811" Type="PeerDestroy" ID="0000000000000000" Error="connection_unreferenced" ErrorDescription="No peer references for connection" ErrorCode="1048" SuppressedEventCount="0" PeerAddr="10.128.140.100:4500" Machine="10.136.138.248:7359" LogGroup="default" />
<Event Severity="20" Time="1609247925.998811" Type="N2_ReadProbeError" ID="2a781cf640f78cf0" SuppressedEventCount="0" ErrorCode="125" Message="Operation canceled" Machine="10.136.138.248:7359" LogGroup="default" />
<Event Severity="10" Time="1609247926.538969" Type="AsyncFileOpened" ID="0000000000000000" SuppressedEventCount="3" Filename="/etc/fdb/blob_cred.json" Fd="15" Flags="2228225" Machine="10.136.138.248:7359" LogGroup="default" />
<Event Severity="10" Time="1609247926.538969" Type="AsyncFileClosed" ID="0000000000000000" SuppressedEventCount="3" Fd="15" Machine="10.136.138.248:7359" LogGroup="default" />
Can these errors explain the restore failure? Do they mean my coordinator pods are not reachable? That would be odd, because status details in fdbcli tells me that everything is OK.
If you have any tips for debugging, they would be appreciated.
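
To narrow down which trace events matter, a minimal Python sketch like the one below could filter the trace lines for events that carry an Error attribute or a Severity of 20 or higher. The function names (parse_event, suspicious_events) and the threshold of 20 are assumptions made for illustration, and the sample lines are abridged from the trace above. It uses a regular expression rather than an XML parser because the pasted lines contain an unescaped & in the URL attribute.

```python
import re

# Matches Name="value" attribute pairs inside one <Event ... /> line.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')

# Sample lines abridged from the trace in this post.
SAMPLE = [
    '<Event Severity="10" Time="1609247923.986947" Type="BlobStoreEndpointNewConnection" RemoteEndpoint="127.0.0.1:9000" />',
    '<Event Severity="10" Time="1609247925.998811" Type="ConnectionClosed" Error="connection_unreferenced" ErrorCode="1048" PeerAddr="10.128.140.100:4500" />',
    '<Event Severity="20" Time="1609247925.998811" Type="N2_ReadProbeError" ErrorCode="125" Message="Operation canceled" />',
]


def parse_event(line):
    """Return the attributes of one <Event ... /> trace line as a dict."""
    return dict(ATTR_RE.findall(line))


def suspicious_events(lines, min_severity=20):
    """Yield events with Severity >= min_severity or with an Error attribute.

    The default threshold of 20 is an assumption, not an official cut-off.
    """
    for line in lines:
        if "<Event" not in line:
            continue  # skip anything that is not a trace event line
        event = parse_event(line)
        severity = int(event.get("Severity", "0"))
        if severity >= min_severity or "Error" in event:
            yield event


for event in suspicious_events(SAMPLE):
    # Print a compact summary of each event that passed the filter.
    print(event["Time"], event["Type"], event.get("Error", event.get("Message", "")))
```

Running the same filter over the full trace file (one line at a time) would show whether anything related to the blobstore endpoint at 127.0.0.1:9000 is logged at a higher severity than the connection events shown here.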