Cluster stuck in broken state when using Cilium networking

After rolling nodes in our Kubernetes cluster, we found ourselves in the following situation:

  1. Our FoundationDB cluster was unavailable.
    We are running FDB 6.3.x with operator v1.21.0, and the FDB cluster is configured with publicIPSource: service.

  2. Some pods of the FoundationDB cluster were missing, in particular the one corresponding to the first coordinator in the connection string.

  3. The operator logged errors like these and could not make progress:

    Error determining public address: connect: Operation not permitted [system:1]
      ERROR: Unable to bind to network (1512)
      Unable to connect to cluster from `/tmp/19f5e8db-4d3f-4d40-9fa4-ebaac5f4c57e

    We also saw similar errors in FDB trace logs.

After some digging, we found that the error came from the determinePublicIPAutomatically function inside the FDB client, which uses the address of the first coordinator to determine the client’s public address. To do this, it creates a dummy UDP socket, connects it to the coordinator’s address, and reads back the local address that the OS picked for it. As I understand it, this trick does not require the first coordinator to be available, as no actual packets are sent.
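For readers unfamiliar with the trick, here is a minimal Python sketch of the same idea. It is not the FDB client code (which is C++); the function name local_address_towards and the 127.0.0.1:4500 address are only placeholders for illustration.

```python
import socket


def local_address_towards(peer_ip: str, peer_port: int) -> str:
    """Return the local IP the kernel would use to reach peer_ip:peer_port.

    Hypothetical helper, only meant to illustrate the UDP socket trick.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # connect() on a UDP socket only records the destination and makes
        # the kernel choose a route and source address; no packet is sent.
        sock.connect((peer_ip, peer_port))
        # getsockname() reports the source address the kernel chose.
        return sock.getsockname()[0]


if __name__ == "__main__":
    # 127.0.0.1:4500 stands in for the first coordinator's address.
    print(local_address_towards("127.0.0.1", 4500))
```

The important detail is that connect() is still a real system call, so anything that hooks connect (as described next) can make it fail even though nothing goes on the wire.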

This does not appear to play well with Cilium’s socket-based load-balancing feature, which enables Kubernetes networking without kube-proxy. In its default configuration, this feature (enabled by the bpf-lb-sock flag) breaks the above algorithm when the first coordinator pod is missing. The eBPF program intercepts socket connect calls and replaces the target address with a Kubernetes Service endpoint address. In our case the Service had no endpoints, which resulted in an error that the FDB client saw as Operation not permitted.

We fixed this problem by setting the Cilium agent flag bpf-lb-sock-hostns-only to true.

As far as I can see in recent FoundationDB code, the client now tries every coordinator, not just the first one, so this situation should not happen with recent FoundationDB versions. But we still want to share this in case anyone faces the same problem.
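To illustrate why trying every coordinator avoids the problem, here is a rough Python sketch of the fallback. It is an assumption-laden illustration, not the actual FDB implementation: the names determine_public_ip, udp_probe and eperm_for_first are made up, and eperm_for_first merely simulates the Operation not permitted error we saw for the first coordinator.

```python
import socket
from typing import Callable, Iterable, Tuple

Address = Tuple[str, int]


def udp_probe(addr: Address) -> str:
    """Connect a throwaway UDP socket and return the local IP chosen for it."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.connect(addr)
        return sock.getsockname()[0]


def determine_public_ip(
    coordinators: Iterable[Address],
    probe: Callable[[Address], str] = udp_probe,
) -> str:
    """Try each coordinator in order; return the first local IP that works."""
    last_error = None
    for addr in coordinators:
        try:
            return probe(addr)
        except OSError as err:
            # e.g. EPERM when connect() is rejected for this address
            last_error = err
    raise OSError("no coordinator address could be used") from last_error


def eperm_for_first(addr: Address) -> str:
    """Hypothetical probe: the first coordinator's Service has no endpoints."""
    if addr == ("10.0.0.1", 4500):
        raise PermissionError(1, "Operation not permitted")
    return "10.0.0.99"  # made-up local address


if __name__ == "__main__":
    coords = [("10.0.0.1", 4500), ("10.0.0.2", 4500)]
    print(determine_public_ip(coords, probe=eperm_for_first))
```

With only the first coordinator in the list, the same call raises instead, which corresponds to the behaviour we hit on 6.3.x.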

Thanks for sharing! Do you know why some of the Pods were missing? The operator should be creating them, assuming there are enough resources in the cluster.

Depending on your setup, once you have upgraded to FDB 7.0+, you could make use of DNS names in the cluster file instead of using a service for the static IP:

Yeah, the pods got deleted because we restarted our nodes, and the operator could not recreate them because it was stuck on these errors. Here are the operator logs:

Right! It’s on our roadmap. Thanks for the advice :slight_smile: