We have been experimenting with the Kubernetes operator in GCP, and it works great without TLS. When trying to follow the guidelines on setting up TLS, the controller ends up in an error loop, logging only:
2020-02-19T19:44:31.149Z ERROR controller-runtime.controller Reconciler error {"controller": "foundationdbcluster", "request": "default/<cluster-name>", "error": "open : no such file or directory"}
github.com/go-logr/zapr.(*zapLogger).Error
/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:258
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:232
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:211
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:88
This happens as long as the sidecar is happy with its TLS parameters (otherwise the sidecar fails and the controller never gets to this point). We would like to help out on the operator if possible, but a complicating factor is that the error gives no indication of where in the controller it happened; the stack trace seems to come entirely from the k8s controller framework.
Any hints on what the cause could be, or on how to investigate this efficiently, would be appreciated.
EDIT: I forked the operator and added some stack traces to the errors, using the package https://github.com/pkg/errors. So far this has told me that the operator is failing to get pod clients in https://github.com/FoundationDB/fdb-kubernetes-operator/blob/9c2cd190c88e4eae10e9a5bc4cab4fb50f66b0d4/controllers/generate_initial_cluster_file.go#L67
EDIT2: The reason it failed was that the controller needs to have the TLS environment variables set up when talking to a TLS-enabled cluster. It failed here: https://github.com/FoundationDB/fdb-kubernetes-operator/blob/9c2cd190c88e4eae10e9a5bc4cab4fb50f66b0d4/controllers/pod_client.go#L104 Maybe I missed something, but I can’t remember seeing any documentation saying that the controller needs to be able to talk to the pods over TLS, though in hindsight it seems obvious. I think improving the error messages here could go a long way toward reducing debug time.