Hi team,
I’m running a FoundationDB cluster on Kubernetes using the official operator, and I’m seeing an issue where FDB processes continue attempting to connect to stale pod IPs after pod restarts, leading to persistent connection failures.
Environment
- Kubernetes: 1.33 (EKS)
- FDB Operator: v2.23.0 (also tested with the v2.24.0 image)
- CNI: Cilium (with network policies enabled)
- Namespace: infino-fdb
Cluster Setup
Pods:
- fdb-kubernetes-operator-controller-manager-649dcbb76c-f7sfr
- infino-fdb-log-20923
- infino-fdb-log-31566
- infino-fdb-log-55394
- infino-fdb-stateless-27316
- infino-fdb-stateless-36530
- infino-fdb-stateless-90352
- infino-fdb-storage-22773
- infino-fdb-storage-52998
- infino-fdb-storage-74528
Observed Behavior
When a pod is recreated and its IP changes, other FDB processes keep trying to connect to the old IP address.
From Cilium Hubble logs:
Apr 7 07:38:57.707: infino-fdb/infino-fdb-stateless-27316:39862 (ID:4217364) <> 10.0.0.20:4501 (world) Policy denied DROPPED (TCP Flags: SYN)
Apr 7 07:38:57.787: infino-fdb/infino-fdb-stateless-27316:47730 (ID:4217364) <> 10.0.3.240:4501 (world) policy-verdict:none EGRESS DENIED (TCP Flags: SYN)
Apr 7 07:38:57.787: infino-fdb/infino-fdb-stateless-27316:47730 (ID:4217364) <> 10.0.3.240:4501 (world) Policy denied DROPPED (TCP Flags: SYN)
Apr 7 07:38:57.962: infino-fdb/infino-fdb-log-31566:54080 (ID:4207400) <> 10.0.3.240:4501 (world) policy-verdict:none EGRESS DENIED (TCP Flags: SYN)
Apr 7 07:38:57.962: infino-fdb/infino-fdb-log-31566:54080 (ID:4207400) <> 10.0.3.240:4501 (world) Policy denied DROPPED (TCP Flags: SYN)
Apr 7 07:38:58.107: infino-fdb/infino-fdb-stateless-36530:47260 (ID:4221191) <> 10.0.0.85:4501 (world) policy-verdict:none EGRESS DENIED (TCP Flags: SYN)
Apr 7 07:38:58.107: infino-fdb/infino-fdb-stateless-36530:47260 (ID:4221191) <> 10.0.0.85:4501 (world) Policy denied DROPPED (TCP Flags: SYN)
These IPs correspond to previous pod IPs, not current ones.
Because Cilium policies drop these connections, we see:
- Continuous SYN retries
- Degraded cluster communication
- Noise in network observability
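To confirm which denied destinations are actually stale, the pasted Hubble output can be cross-checked against the live pod IPs. A quick sketch (the regex assumes the exact line format shown above, and `current_pod_ips` is a placeholder you would populate from `kubectl get pods -n infino-fdb -o wide`):

```python
import re

# A few of the Hubble "Policy denied DROPPED" lines pasted above.
HUBBLE_LINES = """\
Apr 7 07:38:57.707: infino-fdb/infino-fdb-stateless-27316:39862 (ID:4217364) <> 10.0.0.20:4501 (world) Policy denied DROPPED (TCP Flags: SYN)
Apr 7 07:38:57.787: infino-fdb/infino-fdb-stateless-27316:47730 (ID:4217364) <> 10.0.3.240:4501 (world) Policy denied DROPPED (TCP Flags: SYN)
Apr 7 07:38:58.107: infino-fdb/infino-fdb-stateless-36530:47260 (ID:4221191) <> 10.0.0.85:4501 (world) Policy denied DROPPED (TCP Flags: SYN)
"""

# Destination "ip:port" of each dropped flow.
DROP_RE = re.compile(r"<> (\d+\.\d+\.\d+\.\d+):\d+ .*DROPPED")

def denied_destinations(lines: str) -> set[str]:
    """Collect the unique destination IPs from Hubble drop lines."""
    return {m.group(1) for m in DROP_RE.finditer(lines)}

# Placeholder values; fill in the real pod IPs from kubectl.
current_pod_ips = {"10.0.1.11", "10.0.2.42"}

stale = denied_destinations(HUBBLE_LINES) - current_pod_ips
print(sorted(stale))  # → ['10.0.0.20', '10.0.0.85', '10.0.3.240']
```

Every IP left in `stale` belongs to a pod that no longer exists, which is exactly what we observe.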
Expectation
My understanding is:
- The FDB cluster file / coordinator list should be updated when pods change
- Processes should eventually stop trying stale IPs
- The cluster should converge to only valid addresses
Questions
- Is it expected that FDB processes continue attempting connections to old IPs after pod restarts?
- How does FDB propagate updated addresses internally in Kubernetes deployments?
- Is there a delay or mechanism (e.g. a cache or failure detection) that explains this behavior?
- Could this be related to:
  - the cluster file not being updated properly?
  - DNS vs. IP-based addressing?
  - operator reconciliation lag?
- Is there a recommended way to:
  - avoid stale IP connections?
  - make FDB more resilient in dynamic-IP environments?
Additional Context
- Using Cilium with strict egress policies
- The stale connections are being blocked, which makes the issue very visible
- Without network policies, this might go unnoticed
- The FDB pods are not all on the same node; they are sometimes scheduled across different nodes
This is what my cluster file looks like:
bash-5.1# cat /var/fdb/data/fdb.cluster
# DO NOT EDIT!
# This file is auto-generated, it is not to be edited by hand
infino_fdb:Q4JbW68z3WLm6LmHawZCVC4EVOCoBRKb@infino-fdb-log-20923.infino-fdb.infino-fdb.svc.cluster.local:4501,infino-fdb-log-31566.infino-fdb.infino-fdb.svc.cluster.local:4501,infino-fdb-log-55394.infino-fdb.infino-fdb.svc.cluster.local:4501
bash-5.1#
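For reference, the connection string in fdb.cluster follows the documented `description:ID@addr1,addr2,...` format. Here is a small sketch to pull out the coordinator list so it can be compared against what the processes are actually dialing (the sample below is a truncated copy of the file above):

```python
def parse_cluster_file(contents: str) -> tuple[str, str, list[str]]:
    """Split an fdb.cluster connection string into (description, ID, coordinators).

    Per the FoundationDB docs the format is: description:ID@addr1,addr2,...
    Lines starting with '#' are comments and are skipped.
    """
    line = next(l for l in contents.splitlines() if l.strip() and not l.startswith("#"))
    ident, coords = line.split("@", 1)
    description, cluster_id = ident.split(":", 1)
    return description, cluster_id, coords.split(",")

# Truncated copy of the cluster file shown above (two of the three coordinators).
SAMPLE = """\
# DO NOT EDIT!
# This file is auto-generated, it is not to be edited by hand
infino_fdb:Q4JbW68z3WLm6LmHawZCVC4EVOCoBRKb@infino-fdb-log-20923.infino-fdb.infino-fdb.svc.cluster.local:4501,infino-fdb-log-31566.infino-fdb.infino-fdb.svc.cluster.local:4501
"""

desc, cluster_id, coordinators = parse_cluster_file(SAMPLE)
print(desc)               # → infino_fdb
print(len(coordinators))  # → 2
```

Notably, the coordinators here are DNS names rather than IPs, so the stale addresses presumably come from cached DNS resolution or previously learned process addresses rather than from the cluster file itself.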
I'm attaching the fdb.yml deployed in the Kubernetes namespace, to give an idea of how everything is stitched together.
fdb.yml
# FoundationDB Cluster for Infino Metadata Storage
# Defines the FoundationDB cluster and Cilium network policies.
#
# Prerequisites (handled by deploy-fdb Makefile target):
# - FDB operator CRDs applied
# - FDB operator deployment applied (from official repo)
#
# Required environment variables:
# STORAGE_CLASS - K8s storage class for PVCs (e.g., gp3-encrypted)
# K8S_NAMESPACE_LABEL_KEY - Namespace label key for Cilium ingress rules (default: io.kubernetes.pod.namespace for exact match)
# K8S_NAMESPACE_LABEL_VALUE - Namespace label value (e.g., infino-core for exact match, or champagne for label-based wildcard)
# K8S_DNS_PORT - DNS port (usually 53)
# FDB_NAMESPACE - Namespace for FDB resources (default: infino-fdb)
# FDB_CLUSTER_NAME - Name of the FoundationDB cluster (default: infino-fdb)
# K8S_WORKLOAD_TYPE - Node workload label for pod scheduling (e.g., app, monitoring)
---
##
## Namespace for FoundationDB
##
apiVersion: v1
kind: Namespace
metadata:
  name: ${FDB_NAMESPACE}
  labels:
    app: foundationdb
    app.kubernetes.io/part-of: ${FDB_NAMESPACE}
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fdb-kubernetes-operator-controller-manager
  namespace: ${FDB_NAMESPACE}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fdb-kubernetes-operator-manager-clusterrole
  namespace: ${FDB_NAMESPACE}
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - get
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fdb-kubernetes-operator-manager-role
  namespace: ${FDB_NAMESPACE}
rules:
- apiGroups:
  - ""
  resources:
  - configmaps
  - events
  - persistentvolumeclaims
  - pods
  - secrets
  - services
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - apps
  resources:
  - deployments
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - apps.foundationdb.org
  resources:
  - foundationdbbackups
  - foundationdbclusters
  - foundationdbrestores
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - apps.foundationdb.org
  resources:
  - foundationdbbackups/status
  - foundationdbclusters/status
  - foundationdbrestores/status
  verbs:
  - get
  - patch
  - update
- apiGroups:
  - coordination.k8s.io
  resources:
  - leases
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: fdb-kubernetes-operator-manager-rolebinding
  namespace: ${FDB_NAMESPACE}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: fdb-kubernetes-operator-manager-role
subjects:
- kind: ServiceAccount
  name: fdb-kubernetes-operator-controller-manager
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: fdb-kubernetes-operator-manager-clusterrolebinding
  namespace: ${FDB_NAMESPACE}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: fdb-kubernetes-operator-manager-clusterrole
subjects:
- kind: ServiceAccount
  name: fdb-kubernetes-operator-controller-manager
  namespace: ${FDB_NAMESPACE}
---
apiVersion: v1
kind: Secret
metadata:
  name: backup-credentials
  namespace: ${FDB_NAMESPACE}
type: Opaque
stringData:
  credentials: "${FDB_BACKUP_CREDENTIALS}"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: ${FDB_NAMESPACE}
  labels:
    app: fdb-kubernetes-operator-controller-manager
    control-plane: controller-manager
  name: fdb-kubernetes-operator-controller-manager
spec:
  replicas: 1
  selector:
    matchLabels:
      app: fdb-kubernetes-operator-controller-manager
  template:
    metadata:
      labels:
        app: fdb-kubernetes-operator-controller-manager
        control-plane: controller-manager
    spec:
      containers:
      - command:
        - /manager
        env:
        - name: WATCH_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: FDB_BLOB_CREDENTIALS
          value: /var/backup-credentials/credentials
        image: foundationdb/fdb-kubernetes-operator:v2.24.0
        name: manager
        ports:
        - containerPort: 8080
          name: metrics
        resources:
          limits:
            cpu: 500m
            memory: 256Mi
          requests:
            cpu: 500m
            memory: 256Mi
        securityContext:
          allowPrivilegeEscalation: false
          privileged: false
          readOnlyRootFilesystem: true
        volumeMounts:
        - mountPath: /tmp
          name: tmp
        - mountPath: /var/log/fdb
          name: logs
        - mountPath: /usr/bin/fdb
          name: fdb-binaries
        - mountPath: /var/backup-credentials
          name: backup-credentials
      initContainers:
      - args:
        - --copy-library
        - "7.3"
        - --copy-binary
        - fdbcli
        - --copy-binary
        - fdbbackup
        - --copy-binary
        - fdbrestore
        - --output-dir
        - /var/output-files
        - --mode
        - init
        image: foundationdb/fdb-kubernetes-monitor:7.3.69
        name: foundationdb-kubernetes-init-7-3
        volumeMounts:
        - mountPath: /var/output-files
          name: fdb-binaries
      - args:
        - --copy-library
        - "7.4"
        - --copy-binary
        - fdbcli
        - --copy-binary
        - fdbbackup
        - --copy-binary
        - fdbrestore
        - --output-dir
        - /var/output-files
        - --mode
        - init
        image: foundationdb/fdb-kubernetes-monitor:7.4.5
        name: foundationdb-kubernetes-init-7-4
        volumeMounts:
        - mountPath: /var/output-files
          name: fdb-binaries
      securityContext:
        fsGroup: 4059
        runAsGroup: 4059
        runAsUser: 4059
      serviceAccountName: fdb-kubernetes-operator-controller-manager
      terminationGracePeriodSeconds: 10
      volumes:
      - emptyDir: {}
        name: tmp
      - emptyDir: {}
        name: logs
      - emptyDir: {}
        name: fdb-binaries
      - name: backup-credentials
        secret:
          secretName: backup-credentials
---
##
## ServiceAccount for FDB pods (sidecar needs pod read/update permissions)
##
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fdb-kubernetes
  namespace: ${FDB_NAMESPACE}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: fdb-kubernetes
  namespace: ${FDB_NAMESPACE}
rules:
- apiGroups: [""]
  resources: [pods]
  verbs: [get, watch, update, patch, list]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: fdb-kubernetes
  namespace: ${FDB_NAMESPACE}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: fdb-kubernetes
subjects:
- kind: ServiceAccount
  name: fdb-kubernetes
---
##
## FoundationDB Cluster
##
apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  name: ${FDB_CLUSTER_NAME}
  namespace: ${FDB_NAMESPACE}
spec:
  version: 7.3.69
  processCounts:
    stateless: 3
    storage: 3
    log: 3
  processes:
    general:
      podTemplate:
        spec:
          serviceAccountName: fdb-kubernetes
          nodeSelector:
            workload: "${K8S_WORKLOAD_TYPE}"
          containers:
          - name: foundationdb
            securityContext:
              runAsUser: 0
            resources:
              requests:
                cpu: 250m
                memory: 512Mi
              limits:
                cpu: 1000m
                memory: 2Gi
          - name: foundationdb-kubernetes-sidecar
            securityContext:
              runAsUser: 0
            resources:
              requests:
                cpu: 100m
                memory: 128Mi
              limits:
                cpu: 100m
                memory: 128Mi
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: ${STORAGE_CLASS}
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 20Gi
    log:
      volumeClaimTemplate:
        spec:
          storageClassName: ${STORAGE_CLASS}
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 3Gi
    stateless:
      volumeClaimTemplate:
        spec:
          storageClassName: ${STORAGE_CLASS}
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 1Gi
  databaseConfiguration:
    redundancy_mode: double
    storage_engine: ssd-2
  routing:
    defineDNSLocalityFields: true
    publicIPSource: pod
  sidecarContainer:
    enableLivenessProbe: true
    enableReadinessProbe: false
  automationOptions:
    replacements:
      enabled: true
  faultDomain:
    key: foundationdb.org/none
---
# =============================================================================
# Cilium Network Policies for ${FDB_NAMESPACE} namespace
# =============================================================================
##
## DNS policy - allow all pods in ${FDB_NAMESPACE} to resolve DNS
##
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  namespace: ${FDB_NAMESPACE}
  name: allow-dns-policy
  labels:
    app.kubernetes.io/part-of: ${FDB_NAMESPACE}
spec:
  endpointSelector: {}
  egress:
  - toEndpoints:
    - matchLabels:
        "k8s:io.kubernetes.pod.namespace": kube-system
        "k8s:k8s-app": kube-dns
    toPorts:
    - ports:
      - port: "${K8S_DNS_PORT}"
        protocol: ANY
---
##
## FDB cluster pods network policy
##
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  namespace: ${FDB_NAMESPACE}
  name: fdb-${FDB_CLUSTER_NAME}-cluster-policy
  labels:
    app.kubernetes.io/part-of: ${FDB_NAMESPACE}
spec:
  endpointSelector:
    matchLabels:
      foundationdb.org/fdb-cluster-name: ${FDB_CLUSTER_NAME}
  ingress:
  # Allow FDB cluster internal traffic (pod-to-pod coordination)
  - fromEndpoints:
    - matchLabels:
        foundationdb.org/fdb-cluster-name: ${FDB_CLUSTER_NAME}
    toPorts:
    - ports:
      - port: "4500"
        protocol: TCP
      - port: "4501"
        protocol: TCP
  # Allow from operator (for management)
  - fromEndpoints:
    - matchLabels:
        app: fdb-kubernetes-operator-controller-manager
    toPorts:
    - ports:
      - port: "4500"
        protocol: TCP
      - port: "4501"
        protocol: TCP
  # Allow from data-server, auth-server, ai-server matching namespace label
  - fromEndpoints:
    - matchLabels:
        "k8s:${K8S_NAMESPACE_LABEL_KEY}": ${K8S_NAMESPACE_LABEL_VALUE}
        app: data-server
    - matchLabels:
        "k8s:${K8S_NAMESPACE_LABEL_KEY}": ${K8S_NAMESPACE_LABEL_VALUE}
        app: auth-server
    - matchLabels:
        "k8s:${K8S_NAMESPACE_LABEL_KEY}": ${K8S_NAMESPACE_LABEL_VALUE}
        app: ai-server
    toPorts:
    - ports:
      - port: "4500"
        protocol: TCP
      - port: "4501"
        protocol: TCP
  # Allow fdb-exporter from monitoring namespace to read cluster status
  - fromEndpoints:
    - matchLabels:
        "k8s:io.kubernetes.pod.namespace": infino-monitoring
        app: fdb-exporter
    toPorts:
    - ports:
      - port: "4500"
        protocol: TCP
      - port: "4501"
        protocol: TCP
  egress:
  # Allow FDB cluster internal coordination
  - toEndpoints:
    - matchLabels:
        foundationdb.org/fdb-cluster-name: ${FDB_CLUSTER_NAME}
    toPorts:
    - ports:
      - port: "4500"
        protocol: TCP
      - port: "4501"
        protocol: TCP
  # Allow sidecar to access the Kubernetes API server (pod status updates)
  - toServices:
    - k8sService:
        serviceName: kubernetes
        namespace: default
    toPorts:
    - ports:
      - port: "443"
        protocol: TCP
---
##
## Operator network policy
##
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  namespace: ${FDB_NAMESPACE}
  name: fdb-${FDB_CLUSTER_NAME}-operator-policy
  labels:
    app.kubernetes.io/part-of: ${FDB_NAMESPACE}
spec:
  endpointSelector:
    matchLabels:
      app: fdb-kubernetes-operator-controller-manager
  egress:
  # Allow operator to manage FDB pods
  - toEndpoints:
    - matchLabels:
        foundationdb.org/fdb-cluster-name: ${FDB_CLUSTER_NAME}
    toPorts:
    - ports:
      - port: "4500"
        protocol: TCP
      - port: "4501"
        protocol: TCP
  # Allow operator to access the Kubernetes API server
  - toServices:
    - k8sService:
        serviceName: kubernetes
        namespace: default
    toPorts:
    - ports:
      - port: "443"
        protocol: TCP
---
