Cluster stuck in broken state when using Cilium networking

aikoven · December 6, 2023, 9:56am

After rolling nodes in our Kubernetes cluster, we found ourselves in the following situation:

Our FoundationDB cluster was unavailable.
We are using FDB 6.3.x, operator v1.21.0, FDB cluster with publicIPSource: service.
Some pods of the FoundationDB cluster were missing, in particular the one corresponding to the first coordinator in the connection string.

The operator logged errors like these and could not make progress:

Error determining public address: connect: Operation not permitted [system:1]
  ERROR: Unable to bind to network (1512)
  Unable to connect to cluster from `/tmp/19f5e8db-4d3f-4d40-9fa4-ebaac5f4c57e

We also saw similar errors in FDB trace logs.

After some digging we found that the error came from the determinePublicIPAutomatically function inside the FDB client, which uses the address of the first coordinator to determine the client’s public address. To do this, it creates a dummy UDP socket. As I understand it, this trick does not require the first coordinator to be available, as no actual packets are sent.
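The dummy-socket trick can be illustrated with a minimal Python sketch. This is not the FDB client's code, and the function name local_address_towards is hypothetical; it only shows the general technique of connecting a UDP socket and reading back the local address the kernel picked.

```python
import socket


def local_address_towards(peer_host: str, peer_port: int) -> str:
    """Return the local IP the kernel would use to reach the given peer.

    connect() on a UDP socket sends no packets; it only records the peer
    and selects a source address from the routing table.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.connect((peer_host, peer_port))
        # getsockname() returns (ip, port); only the IP is of interest here.
        return sock.getsockname()[0]
```

Because nothing is sent, the peer does not need to be reachable. The connect call itself can still fail, though, and in that case Python raises an OSError subclass (PermissionError for "Operation not permitted").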

This does not seem to play well with Cilium’s socket-based load-balancing feature, which enables Kubernetes networking without kube-proxy. In its default configuration, this feature (enabled by the bpf-lb-sock flag) breaks the above algorithm when the first coordinator pod is missing. The eBPF program intercepts socket connect calls and replaces the target address with a Kubernetes Service endpoint address. In our case the Service had no endpoints, which resulted in an error that the FDB client saw as Operation not permitted.

We fixed this problem by setting the Cilium agent flag bpf-lb-sock-hostns-only to true.

As far as I can see in recent FoundationDB code, the client now tries every coordinator, not just the first one, so this situation should not happen with recent FoundationDB versions. But we still want to share this in case anyone faces the same problem.
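A rough sketch of the "try every coordinator" idea, in Python. It is not taken from the FoundationDB source; the names udp_probe and determine_public_ip and the probe parameter are hypothetical, introduced only for illustration.

```python
import socket
from typing import Callable, Iterable, Optional, Tuple

Address = Tuple[str, int]


def udp_probe(addr: Address) -> str:
    """Connect a UDP socket (no packets sent) and return the chosen local IP."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.connect(addr)
        return sock.getsockname()[0]


def determine_public_ip(
    coordinators: Iterable[Address],
    probe: Callable[[Address], str] = udp_probe,
) -> str:
    """Return the local IP from the first coordinator whose probe succeeds."""
    last_error: Optional[OSError] = None
    for addr in coordinators:
        try:
            return probe(addr)
        except OSError as err:
            # For example PermissionError when connect() is rejected;
            # remember the error and move on to the next coordinator.
            last_error = err
    raise RuntimeError("no coordinator usable for address detection") from last_error
```

With this shape, one coordinator whose connect call is rejected no longer blocks address detection as long as another coordinator can be probed.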

johscheuer · December 6, 2023, 10:53am

Thanks for sharing! Do you know why some of the Pods were missing? The operator should be creating them, assuming there are enough resources in the cluster.

Depending on your setup, once you have upgraded to FDB 7.0+, you could make use of DNS in the cluster file instead of using a service for the static IP: https://github.com/FoundationDB/fdb-kubernetes-operator/blob/main/docs/manual/customization.md#using-dns

aikoven · December 6, 2023, 11:33am

Yeah, the pods got deleted because we restarted our nodes, and the operator could not recreate them because it was stuck on these errors. Here are the operator logs: https://pastebin.com/raw/jGxPm5PS

Right! It’s on our roadmap. Thanks for the advice.
