Controller errors when enabling TLS with the Kubernetes operator

We have been experimenting with the Kubernetes operator on GCP, and it works great without TLS. When we try to follow the guidelines on setting up TLS, the controller ends up in an error loop, logging only:

2020-02-19T19:44:31.149Z        ERROR   controller-runtime.controller   Reconciler error        {"controller": "foundationdbcluster", "request": "default/<cluster-name>", "error": "open : no such file or directory"}
github.com/go-logr/zapr.(*zapLogger).Error
    /go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:258
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:232
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:211
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
    /go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
    /go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
    /go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:88

This happens as long as the sidecar is happy with its TLS parameters (otherwise the sidecar fails and the controller never gets to this point). We would like to help out on the operator if possible, but a complicating factor is that the error gives no indication of where in the controller it occurred; the stack trace seems to come entirely from the k8s controller framework.

Any hints on what could be the reason, or how to efficiently investigate this would be appreciated.

EDIT: I forked the operator and added some stack traces to the errors, using the package https://github.com/pkg/errors. So far this has told me that the operator is failing to get pod clients in https://github.com/FoundationDB/fdb-kubernetes-operator/blob/9c2cd190c88e4eae10e9a5bc4cab4fb50f66b0d4/controllers/generate_initial_cluster_file.go#L67

EDIT2: The reason it failed was that the controller needs to have the TLS environment variables set when talking to a TLS-enabled cluster. It failed here: https://github.com/FoundationDB/fdb-kubernetes-operator/blob/9c2cd190c88e4eae10e9a5bc4cab4fb50f66b0d4/controllers/pod_client.go#L104 Maybe I missed something, but I can’t remember seeing any documentation saying that the controller needs to be able to talk to the pods over TLS, though in hindsight it seems obvious. I think improving the error messages here could go a long way toward reducing debug time.
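
To illustrate why the message was so opaque, here is a small sketch of the failure mode as I understand it: an unset environment variable yields an empty path, and opening an empty path produces `open : no such file or directory` with nothing between "open" and the colon. The helper `readFileFromEnv` is hypothetical, and using `FDB_TLS_CERTIFICATE_FILE` here is an assumption based on the FoundationDB client's TLS variables, not a copy of the operator's code.

```go
package main

import (
	"fmt"
	"os"
)

// readFileFromEnv reads the file whose path is stored in the environment
// variable envName. It reports an unset variable explicitly instead of
// passing an empty path to os.ReadFile.
// Hypothetical helper, not the operator's actual implementation.
func readFileFromEnv(envName string) ([]byte, error) {
	path := os.Getenv(envName)
	if path == "" {
		return nil, fmt.Errorf("environment variable %s is not set", envName)
	}
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, fmt.Errorf("reading %s (from %s): %w", path, envName, err)
	}
	return data, nil
}

func main() {
	// Assumed variable name; if it is unset, os.Getenv returns "" and the
	// unchecked read fails with an error that names no file at all.
	_, err := os.ReadFile(os.Getenv("FDB_TLS_CERTIFICATE_FILE"))
	fmt.Println("unchecked:", err)

	// The checked version names the missing variable instead.
	_, err = readFileFromEnv("FDB_TLS_CERTIFICATE_FILE")
	fmt.Println("checked:  ", err)
}
```

A check along these lines would have pointed straight at the missing configuration.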

Thanks for that feedback. I agree that we should make both the error messages and the documentation clearer. We set up the TLS environment variables in the development environment, but not in the samples for running the operator in a real environment. The reason is that I don’t know of a general-purpose solution for getting the certs set up in a way that is secure and provides appropriate fields for restricting access to a cluster. It’s a big gap in our solution that we should make clear, since it’ll be common for people to want to run with TLS.