FDB pods continue connecting to stale IPs after pod restart → connection drops (Kubernetes + Operator + Cilium)

Hi team,

I’m running a FoundationDB cluster on Kubernetes using the official operator, and I’m seeing an issue where FDB processes continue attempting to connect to stale pod IPs after pod restarts, leading to persistent connection failures.

Environment

  • Kubernetes: 1.33 (EKS)

  • FDB Operator: v2.23.0 (also tested with v2.24.0 image)

  • CNI: Cilium (with network policies enabled)

  • Namespace: infino-fdb

Cluster Setup

Pods:

fdb-kubernetes-operator-controller-manager-649dcbb76c-f7sfr
infino-fdb-log-20923
infino-fdb-log-31566
infino-fdb-log-55394
infino-fdb-stateless-27316
infino-fdb-stateless-36530
infino-fdb-stateless-90352
infino-fdb-storage-22773
infino-fdb-storage-52998
infino-fdb-storage-74528

Observed Behavior

When a pod is recreated (and its IP changes), other FDB processes keep trying to connect to the old IP address.

From Cilium Hubble logs:

Apr  7 07:38:57.707: infino-fdb/infino-fdb-stateless-27316:39862 (ID:4217364) <> 10.0.0.20:4501 (world) Policy denied DROPPED (TCP Flags: SYN)
Apr  7 07:38:57.787: infino-fdb/infino-fdb-stateless-27316:47730 (ID:4217364) <> 10.0.3.240:4501 (world) policy-verdict:none EGRESS DENIED (TCP Flags: SYN)
Apr  7 07:38:57.787: infino-fdb/infino-fdb-stateless-27316:47730 (ID:4217364) <> 10.0.3.240:4501 (world) Policy denied DROPPED (TCP Flags: SYN)
Apr  7 07:38:57.962: infino-fdb/infino-fdb-log-31566:54080 (ID:4207400) <> 10.0.3.240:4501 (world) policy-verdict:none EGRESS DENIED (TCP Flags: SYN)
Apr  7 07:38:57.962: infino-fdb/infino-fdb-log-31566:54080 (ID:4207400) <> 10.0.3.240:4501 (world) Policy denied DROPPED (TCP Flags: SYN)
Apr  7 07:38:58.107: infino-fdb/infino-fdb-stateless-36530:47260 (ID:4221191) <> 10.0.0.85:4501 (world) policy-verdict:none EGRESS DENIED (TCP Flags: SYN)
Apr  7 07:38:58.107: infino-fdb/infino-fdb-stateless-36530:47260 (ID:4221191) <> 10.0.0.85:4501 (world) Policy denied DROPPED (TCP Flags: SYN)

These IPs correspond to previous pod IPs, not current ones.
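A quick way I sanity-checked this (a sketch only; the current-IP list below is a placeholder standing in for live kubectl output, not real data from my cluster):

```shell
# Destination IPs taken from the Hubble drop lines above.
dropped_ips='10.0.0.20
10.0.3.240
10.0.0.85'

# Placeholder values; on the live cluster this list would come from:
#   kubectl get pods -n infino-fdb -o wide --no-headers | awk '{print $6}'
current_ips='10.0.1.11
10.0.2.57
10.0.3.99'

# Print every drop target that no current pod owns, i.e. the stale IPs.
echo "$dropped_ips" | grep -vxF "$current_ips"
```

In my case all three dropped destinations fall out of this diff, which is why I believe the addresses are stale rather than the policy being misconfigured.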

Because of Cilium policies, these connections are dropped, causing:

  • Continuous SYN retries

  • Cluster communication degradation

  • Noise in network observability

Expectation

My understanding is:

  • FDB cluster file / coordinator list should get updated when pods change

  • Processes should eventually stop trying stale IPs

  • Or cluster should converge to only valid addresses
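On the first point: since my cluster file (pasted further below) already shows DNS names for the coordinators, I assume the operator is using DNS-based addressing there. My understanding, which I'd appreciate someone confirming, is that this is controlled by a routing flag along these lines (partial spec fragment; useDNSInClusterFile is the field I believe is relevant, not something I currently set explicitly):

```yaml
# Hypothetical/partial routing section of the FoundationDBCluster spec.
# useDNSInClusterFile is (as I understand it) what makes the operator write
# DNS names instead of pod IPs into fdb.cluster.
routing:
  defineDNSLocalityFields: true
  useDNSInClusterFile: true
  publicIPSource: pod
```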

Questions

  1. Is it expected that FDB processes continue attempting connections to old IPs after pod restarts?

  2. How does FDB propagate updated addresses internally in Kubernetes deployments?

  3. Is there a delay or mechanism (e.g. cache, failure detection) that explains this behavior?

  4. Could this be related to:

    • cluster file not being updated properly?

    • DNS vs IP-based addressing?

    • operator reconciliation lag?

  5. Is there a recommended way to:

    • avoid stale IP connections?

    • or make FDB more resilient in dynamic IP environments?

Additional Context

  • Using Cilium with strict egress policies

  • The stale connections are blocked by policy, which makes the issue very visible

  • Without network policies, this may go unnoticed

  • Also, the FDB pods are not all on the same node; they are often scheduled onto different nodes

This is what my cluster file looks like:

bash-5.1# cat /var/fdb/data/fdb.cluster
# DO NOT EDIT!
# This file is auto-generated, it is not to be edited by hand
infino_fdb:Q4JbW68z3WLm6LmHawZCVC4EVOCoBRKb@infino-fdb-log-20923.infino-fdb.infino-fdb.svc.cluster.local:4501,infino-fdb-log-31566.infino-fdb.infino-fdb.svc.cluster.local:4501,infino-fdb-log-55394.infino-fdb.infino-fdb.svc.cluster.local:4501
bash-5.1#
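For what it's worth, the coordinator list can be pulled out of that string mechanically, since a cluster file is just description:ID@addr1,addr2,... (hostnames shortened below for readability; my real file uses the full cluster-local FQDNs):

```shell
# Cluster string format: description:id@coordinator1,coordinator2,...
# (shortened hostnames; the real file uses *.infino-fdb.svc.cluster.local)
cluster_string='infino_fdb:Q4JbW68z3WLm6LmHawZCVC4EVOCoBRKb@log-20923:4501,log-31566:4501,log-55394:4501'

coords=${cluster_string#*@}            # strip "description:id@"
printf '%s\n' "$coords" | tr ',' '\n'  # one coordinator endpoint per line
```

Since the coordinators are DNS names, the stale IPs in the drops presumably come from somewhere other than the cluster file itself (e.g. cached connection state), which is partly what question 3 is about.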

Attaching the fdb.yml deployed into the Kubernetes namespace, to give an idea of how everything is stitched together:

fdb.yml
# FoundationDB Cluster for Infino Metadata Storage
# Defines the FoundationDB cluster and Cilium network policies.
#
# Prerequisites (handled by deploy-fdb Makefile target):
#   - FDB operator CRDs applied
#   - FDB operator deployment applied (from official repo)
#
# Required environment variables:
#   STORAGE_CLASS           - K8s storage class for PVCs (e.g., gp3-encrypted)
#   K8S_NAMESPACE_LABEL_KEY   - Namespace label key for Cilium ingress rules (default: io.kubernetes.pod.namespace for exact match)
#   K8S_NAMESPACE_LABEL_VALUE - Namespace label value (e.g., infino-core for exact match, or champagne for label-based wildcard)
#   K8S_DNS_PORT            - DNS port (usually 53)
#   FDB_NAMESPACE           - Namespace for FDB resources (default: infino-fdb)
#   FDB_CLUSTER_NAME        - Name of the FoundationDB cluster (default: infino-fdb)
#   K8S_WORKLOAD_TYPE       - Node workload label for pod scheduling (e.g., app, monitoring)
---
##
## Namespace for FoundationDB
##
apiVersion: v1
kind: Namespace
metadata:
  name: ${FDB_NAMESPACE}
  labels:
    app: foundationdb
    app.kubernetes.io/part-of: ${FDB_NAMESPACE}
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fdb-kubernetes-operator-controller-manager
  namespace: ${FDB_NAMESPACE}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fdb-kubernetes-operator-manager-clusterrole
rules:
  - apiGroups:
      - ""
    resources:
      - nodes
    verbs:
      - get
      - list
      - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fdb-kubernetes-operator-manager-role
rules:
  - apiGroups:
      - ""
    resources:
      - configmaps
      - events
      - persistentvolumeclaims
      - pods
      - secrets
      - services
    verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
  - apiGroups:
      - apps
    resources:
      - deployments
    verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
  - apiGroups:
      - apps.foundationdb.org
    resources:
      - foundationdbbackups
      - foundationdbclusters
      - foundationdbrestores
    verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
  - apiGroups:
      - apps.foundationdb.org
    resources:
      - foundationdbbackups/status
      - foundationdbclusters/status
      - foundationdbrestores/status
    verbs:
      - get
      - patch
      - update
  - apiGroups:
      - coordination.k8s.io
    resources:
      - leases
    verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: fdb-kubernetes-operator-manager-rolebinding
  namespace: ${FDB_NAMESPACE}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: fdb-kubernetes-operator-manager-role
subjects:
  - kind: ServiceAccount
    name: fdb-kubernetes-operator-controller-manager
    namespace: ${FDB_NAMESPACE}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: fdb-kubernetes-operator-manager-clusterrolebinding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: fdb-kubernetes-operator-manager-clusterrole
subjects:
  - kind: ServiceAccount
    name: fdb-kubernetes-operator-controller-manager
    namespace: ${FDB_NAMESPACE}
---
apiVersion: v1
kind: Secret
metadata:
  name: backup-credentials
  namespace: ${FDB_NAMESPACE}
type: Opaque
stringData:
  credentials: "${FDB_BACKUP_CREDENTIALS}"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: ${FDB_NAMESPACE}
  labels:
    app: fdb-kubernetes-operator-controller-manager
    control-plane: controller-manager
  name: fdb-kubernetes-operator-controller-manager
spec:
  replicas: 1
  selector:
    matchLabels:
      app: fdb-kubernetes-operator-controller-manager
  template:
    metadata:
      labels:
        app: fdb-kubernetes-operator-controller-manager
        control-plane: controller-manager
    spec:
      containers:
        - command:
            - /manager
          env:
            - name: WATCH_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: FDB_BLOB_CREDENTIALS
              value: /var/backup-credentials/credentials
          image: foundationdb/fdb-kubernetes-operator:v2.24.0
          name: manager
          ports:
            - containerPort: 8080
              name: metrics
          resources:
            limits:
              cpu: 500m
              memory: 256Mi
            requests:
              cpu: 500m
              memory: 256Mi
          securityContext:
            allowPrivilegeEscalation: false
            privileged: false
            readOnlyRootFilesystem: true
          volumeMounts:
            - mountPath: /tmp
              name: tmp
            - mountPath: /var/log/fdb
              name: logs
            - mountPath: /usr/bin/fdb
              name: fdb-binaries
            - mountPath: /var/backup-credentials
              name: backup-credentials
      initContainers:
        - args:
            - --copy-library
            - "7.3"
            - --copy-binary
            - fdbcli
            - --copy-binary
            - fdbbackup
            - --copy-binary
            - fdbrestore
            - --output-dir
            - /var/output-files
            - --mode
            - init
          image: foundationdb/fdb-kubernetes-monitor:7.3.69
          name: foundationdb-kubernetes-init-7-3
          volumeMounts:
            - mountPath: /var/output-files
              name: fdb-binaries
        - args:
            - --copy-library
            - "7.4"
            - --copy-binary
            - fdbcli
            - --copy-binary
            - fdbbackup
            - --copy-binary
            - fdbrestore
            - --output-dir
            - /var/output-files
            - --mode
            - init
          image: foundationdb/fdb-kubernetes-monitor:7.4.5
          name: foundationdb-kubernetes-init-7-4
          volumeMounts:
            - mountPath: /var/output-files
              name: fdb-binaries
      securityContext:
        fsGroup: 4059
        runAsGroup: 4059
        runAsUser: 4059
      serviceAccountName: fdb-kubernetes-operator-controller-manager
      terminationGracePeriodSeconds: 10
      volumes:
        - emptyDir: {}
          name: tmp
        - emptyDir: {}
          name: logs
        - emptyDir: {}
          name: fdb-binaries
        - name: backup-credentials
          secret:
            secretName: backup-credentials
---
##
## ServiceAccount for FDB pods (sidecar needs pod read/update permissions)
##
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fdb-kubernetes
  namespace: ${FDB_NAMESPACE}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: fdb-kubernetes
  namespace: ${FDB_NAMESPACE}
rules:
  - apiGroups: [""]
    resources: [pods]
    verbs: [get, watch, update, patch, list]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: fdb-kubernetes
  namespace: ${FDB_NAMESPACE}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: fdb-kubernetes
subjects:
  - kind: ServiceAccount
    name: fdb-kubernetes
    namespace: ${FDB_NAMESPACE}
---
##
## FoundationDB Cluster
##
apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  name: ${FDB_CLUSTER_NAME}
  namespace: ${FDB_NAMESPACE}
spec:
  version: 7.3.69

  processCounts:
    stateless: 3
    storage: 3
    log: 3

  processes:
    general:
      podTemplate:
        spec:
          serviceAccountName: fdb-kubernetes
          nodeSelector:
            workload: "${K8S_WORKLOAD_TYPE}"
          containers:
            - name: foundationdb
              securityContext:
                runAsUser: 0
              resources:
                requests:
                  cpu: 250m
                  memory: 512Mi
                limits:
                  cpu: 1000m
                  memory: 2Gi
            - name: foundationdb-kubernetes-sidecar
              securityContext:
                runAsUser: 0
              resources:
                requests:
                  cpu: 100m
                  memory: 128Mi
                limits:
                  cpu: 100m
                  memory: 128Mi
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: ${STORAGE_CLASS}
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 20Gi
    log:
      volumeClaimTemplate:
        spec:
          storageClassName: ${STORAGE_CLASS}
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 3Gi
    stateless:
      volumeClaimTemplate:
        spec:
          storageClassName: ${STORAGE_CLASS}
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 1Gi

  databaseConfiguration:
    redundancy_mode: double
    storage_engine: ssd-2

  routing:
    defineDNSLocalityFields: true
    publicIPSource: pod

  sidecarContainer:
    enableLivenessProbe: true
    enableReadinessProbe: false

  automationOptions:
    replacements:
      enabled: true

  faultDomain:
    key: foundationdb.org/none
---
# =============================================================================
# Cilium Network Policies for ${FDB_NAMESPACE} namespace
# =============================================================================

##
## DNS policy - allow all pods in ${FDB_NAMESPACE} to resolve DNS
##
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  namespace: ${FDB_NAMESPACE}
  name: allow-dns-policy
  labels:
    app.kubernetes.io/part-of: ${FDB_NAMESPACE}
spec:
  endpointSelector: {}
  egress:
    - toEndpoints:
        - matchLabels:
            "k8s:io.kubernetes.pod.namespace": kube-system
            "k8s:k8s-app": kube-dns
      toPorts:
        - ports:
            - port: "${K8S_DNS_PORT}"
              protocol: ANY
---
##
## FDB cluster pods network policy
##
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  namespace: ${FDB_NAMESPACE}
  name: fdb-${FDB_CLUSTER_NAME}-cluster-policy
  labels:
    app.kubernetes.io/part-of: ${FDB_NAMESPACE}
spec:
  endpointSelector:
    matchLabels:
      foundationdb.org/fdb-cluster-name: ${FDB_CLUSTER_NAME}
  ingress:
    # Allow FDB cluster internal traffic (pod-to-pod coordination)
    - fromEndpoints:
        - matchLabels:
            foundationdb.org/fdb-cluster-name: ${FDB_CLUSTER_NAME}
      toPorts:
        - ports:
            - port: "4500"
              protocol: TCP
            - port: "4501"
              protocol: TCP
    # Allow from operator (for management)
    - fromEndpoints:
        - matchLabels:
            app: fdb-kubernetes-operator-controller-manager
      toPorts:
        - ports:
            - port: "4500"
              protocol: TCP
            - port: "4501"
              protocol: TCP
    # Allow from data-server, auth-server, ai-server matching namespace label
    - fromEndpoints:
        - matchLabels:
            "k8s:${K8S_NAMESPACE_LABEL_KEY}": ${K8S_NAMESPACE_LABEL_VALUE}
            app: data-server
        - matchLabels:
            "k8s:${K8S_NAMESPACE_LABEL_KEY}": ${K8S_NAMESPACE_LABEL_VALUE}
            app: auth-server
        - matchLabels:
            "k8s:${K8S_NAMESPACE_LABEL_KEY}": ${K8S_NAMESPACE_LABEL_VALUE}
            app: ai-server
      toPorts:
        - ports:
            - port: "4500"
              protocol: TCP
            - port: "4501"
              protocol: TCP
    # Allow fdb-exporter from monitoring namespace to read cluster status
    - fromEndpoints:
        - matchLabels:
            "k8s:io.kubernetes.pod.namespace": infino-monitoring
            app: fdb-exporter
      toPorts:
        - ports:
            - port: "4500"
              protocol: TCP
            - port: "4501"
              protocol: TCP
  egress:
    # Allow FDB cluster internal coordination
    - toEndpoints:
        - matchLabels:
            foundationdb.org/fdb-cluster-name: ${FDB_CLUSTER_NAME}
      toPorts:
        - ports:
            - port: "4500"
              protocol: TCP
            - port: "4501"
              protocol: TCP
    # Allow sidecar to access Kubernetes API server (pod status updates)
    - toServices:
        - k8sService:
            serviceName: kubernetes
            namespace: default
      toPorts:
        - ports:
            - port: "443"
              protocol: TCP
---
##
## Operator network policy
##
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  namespace: ${FDB_NAMESPACE}
  name: fdb-${FDB_CLUSTER_NAME}-operator-policy
  labels:
    app.kubernetes.io/part-of: ${FDB_NAMESPACE}
spec:
  endpointSelector:
    matchLabels:
      app: fdb-kubernetes-operator-controller-manager
  egress:
    # Allow operator to manage FDB pods
    - toEndpoints:
        - matchLabels:
            foundationdb.org/fdb-cluster-name: ${FDB_CLUSTER_NAME}
      toPorts:
        - ports:
            - port: "4500"
              protocol: TCP
            - port: "4501"
              protocol: TCP
    # Allow operator to access Kubernetes API server
    - toServices:
        - k8sService:
            serviceName: kubernetes
            namespace: default
      toPorts:
        - ports:
            - port: "443"
              protocol: TCP
---

Will changing .spec.routing.publicIPSource in fdb.yml from pod to service help?
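For reference, the change I'm asking about would look like this in fdb.yml (my current value is pod):

```yaml
# Proposed change under consideration, not yet applied.
routing:
  defineDNSLocalityFields: true
  publicIPSource: service   # currently: pod
```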