Error "signal: killed" when recovering from backup

When I try to restore my database from a backup using the Kubernetes operator, I get the following error in the operator log:

{"level":"info","ts":1631609197.321,"logger":"fdbclient","msg":"Running command","namespace":"verifier","cluster":"sample-cluster","path":"/usr/bin/fdb/6.3/fdbrestore","args":["/usr/bin/fdb/6.3/fdbrestore","start","-r","blobstore://s3-proxy-user@fdb-backup-s3-proxy:80/default?bucket=foundationdbbackup&secure_connection=0","--dest_cluster_file","/tmp/405498555","--log","--logdir","/var/log/fdb"]}
{"level":"error","ts":1631609207.3601973,"logger":"fdbclient","msg":"Error from FDB command","namespace":"verifier","cluster":"sample-cluster","code":-1,"stdout":"No restore target version given, will use maximum restorable version from backup description.\nUsing target restore version 1207990314200\n","stderr":"","error":"signal: killed","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/zapr@v0.2.0/zapr.go:132\nsigs.k8s.io/controller-runtime/pkg/log.(*DelegatingLogger).Error\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.8.3/pkg/log/deleg.go:144\ngithub.com/FoundationDB/fdb-kubernetes-operator/fdbclient.(*cliAdminClient).runCommand\n\t/workspace/fdbclient/admin_client.go:187\ngithub.com/FoundationDB/fdb-kubernetes-operator/fdbclient.(*cliAdminClient).StartRestore\n\t/workspace/fdbclient/admin_client.go:555\ngithub.com/FoundationDB/fdb-kubernetes-operator/controllers.StartRestore.Reconcile\n\t/workspace/controllers/start_restore.go:48\ngithub.com/FoundationDB/fdb-kubernetes-operator/controllers.(*FoundationDBRestoreReconciler).Reconcile\n\t/workspace/controllers/restore_controller.go:77\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.8.3/pkg/internal/controller/controller.go:298\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.8.3/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.8.3/pkg/internal/controller/controller.go:216\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.20.4/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.20.4/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.20.4/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.20.4/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.20.4/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.20.4/pkg/util/wait/wait.go:99"}
{"level":"error","ts":1631609207.3605845,"logger":"controller","msg":"Error in reconciliation","namespace":"verifier","restore":"sample-cluster","subReconciler":"controllers.StartRestore","requeueAfter":0,"error":"signal: killed","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/zapr@v0.2.0/zapr.go:132\ngithub.com/FoundationDB/fdb-kubernetes-operator/controllers.processRequeue\n\t/workspace/controllers/controllers.go:114\ngithub.com/FoundationDB/fdb-kubernetes-operator/controllers.(*FoundationDBRestoreReconciler).Reconcile\n\t/workspace/controllers/restore_controller.go:82\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.8.3/pkg/internal/controller/controller.go:298\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.8.3/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.8.3/pkg/internal/controller/controller.go:216\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.20.4/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.20.4/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.20.4/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.20.4/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.20.4/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.20.4/pkg/util/wait/wait.go:99"}
{"level":"error","ts":1631609207.3613954,"logger":"controller-runtime.manager.controller.foundationdbrestore","msg":"Reconciler error","reconciler group":"apps.foundationdb.org","reconciler kind":"FoundationDBRestore","name":"sample-cluster","namespace":"verifier","error":"signal: killed","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/zapr@v0.2.0/zapr.go:132\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.8.3/pkg/internal/controller/controller.go:302\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.8.3/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.8.3/pkg/internal/controller/controller.go:216\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.20.4/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.20.4/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.20.4/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.20.4/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.20.4/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.20.4/pkg/util/wait/wait.go:99"}

When I run the fdbrestore command manually (from the backup agent pod), the restore works without problems.

The s3-proxy-user@fdb-backup-s3-proxy part of the blobstore URL points to a proxy that exposes an S3-compatible API in front of Azure Blob Storage.

Any ideas what the problem might be here, or how I can debug this?

I would spin up a FoundationDB Pod with the same version, mount the cluster file, exec into the container, and run the same command; the trace logs should then contain more information. If you already collect the trace logs of the operator Pod, you can also look at those directly.

One thing that comes to mind (since this is a kill) is that the default timeout is being hit and the Go program is killing the fdbrestore process. If you can run the command in a separate Pod without any issues, I would assume that this is the problem. Given the ~10 seconds between the first log line and the second line with the kill (and you're right, the default timeout is 10 seconds), that's probably the case.

Feel free to create an issue on GitHub and file a PR that resolves it. The backup and restore feature is still in an alpha stage.
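For context on why the error reads exactly "signal: killed": when a Go program runs a subprocess through os/exec with a context that has a deadline, the child is sent SIGKILL once the deadline expires, and the returned error is the string "signal: killed". The snippet below is a minimal, self-contained sketch of that behavior, not the operator's actual code; the 10-second timeout mirrors the default mentioned above, and the sleep command is just a stand-in for a long-running fdbrestore invocation.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

func main() {
	// A 10-second deadline, mirroring the default CLI timeout discussed above.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Stand-in for a long-running fdbrestore call (sleeps longer than the deadline).
	cmd := exec.CommandContext(ctx, "sleep", "30")
	out, err := cmd.CombinedOutput()
	if err != nil {
		// Once the deadline expires, Go kills the child with SIGKILL and the
		// command's error reports "signal: killed" — matching the operator log.
		fmt.Printf("output: %q, error: %v\n", out, err)
	}
}
```

If the sketch matches what you see, the fix would be to allow the restore command more time than the default before the context kills it.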
