After rolling nodes in our Kubernetes cluster, we found ourselves in the following situation:
- Our FoundationDB cluster was unavailable. We are running FDB 6.3.x, operator v1.21.0, and an FDB cluster with `publicIPSource: service` (see the manifest sketch below).
- Some pods of the FoundationDB cluster were missing, particularly the one corresponding to the first coordinator in the connection string.
- The operator logged errors like the following and could not progress:
  Error determining public address: connect: Operation not permitted [system:1] ERROR: Unable to bind to network (1512) Unable to connect to cluster from `/tmp/19f5e8db-4d3f-4d40-9fa4-ebaac5f4c57e
We also saw similar errors in FDB trace logs.
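For context, this is roughly how that setting looks in the cluster manifest. It is only a sketch: the cluster name and the exact 6.3 patch version are placeholders, and the `spec.routing.publicIPSource` field path assumes the operator's v1beta2 API, so check it against your CRD version.

```yaml
apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  name: example-cluster        # placeholder name
spec:
  version: 6.3.25              # any 6.3.x release
  routing:
    publicIPSource: service    # expose processes via per-pod Services
```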
After some digging we found that the error came from the `determinePublicIPAutomatically` function inside the FDB client, which uses the address of the first coordinator to determine the client's public address. To do this, it creates a dummy UDP socket, "connects" it to the coordinator address, and reads back the local address the kernel picked for that socket. As I understand it, this trick does not require the first coordinator to actually be reachable, since no packets are sent: a UDP connect only fixes the peer address and selects a source address and route.
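This is roughly the idea, as a minimal Go sketch rather than FDB's actual C++ implementation; the coordinator address below is a placeholder:

```go
package main

import (
	"fmt"
	"net"
)

// localIPTowards mimics the idea behind FDB's determinePublicIPAutomatically:
// "connect" a UDP socket to the coordinator and read back the local address
// the kernel selected for it. No datagrams are sent; a UDP connect merely
// records the peer address and picks a source address/route.
func localIPTowards(coordinator string) (net.IP, error) {
	conn, err := net.Dial("udp", coordinator)
	if err != nil {
		// A failed connect (e.g. EPERM injected by an eBPF hook) surfaces here.
		return nil, err
	}
	defer conn.Close()
	return conn.LocalAddr().(*net.UDPAddr).IP, nil
}

func main() {
	// Placeholder: the address of the first coordinator in the connection string.
	ip, err := localIPTowards("10.96.12.34:4500")
	if err != nil {
		fmt.Println("error determining public address:", err)
		return
	}
	fmt.Println("public address candidate:", ip)
}
```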
This appears not to play well with Cilium's socket-based load balancing, the feature that lets Cilium provide Kubernetes networking without kube-proxy. In its default configuration, this feature (enabled by the `bpf-lb-sock` flag) breaks the algorithm above when the first coordinator pod is missing: an eBPF program rewires socket `connect` calls, replacing the Service address with the address of one of the Service's endpoints. In our case the Service had no endpoints, so the connect failed with an error the FDB client saw as `Operation not permitted`.
We fixed this problem by setting the Cilium agent flag `bpf-lb-sock-hostns-only` to `true`, which restricts socket-based load balancing to the host network namespace, so connects made from inside pod namespaces are left untouched.
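For completeness, this is a sketch of how the same setting can be expressed through Cilium's Helm chart; the key names are from recent chart versions and may differ in yours.

```yaml
# Helm values sketch (Cilium chart; key names vary between chart versions).
# socketLB.hostNamespaceOnly maps to the agent flag bpf-lb-sock-hostns-only,
# so connect() calls made from pod network namespaces are no longer rewritten.
socketLB:
  hostNamespaceOnly: true
```

The flag can also be set directly as `bpf-lb-sock-hostns-only: "true"` in the `cilium-config` ConfigMap; either way, the Cilium agents need to be restarted to pick it up.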
As far as I can see in recent FoundationDB code, the client now tries every coordinator rather than just the first one, so this situation should not arise with recent FoundationDB versions. We still wanted to share it in case anyone runs into the same problem.