FDB cluster deployed with operator automatically going down

We’re using FoundationDB in production, deployed on Kubernetes with fdb-kubernetes-operator. Recently we saw the whole cluster go down, with a lot of pods in the ‘Completed’ or ‘ContainerStatusUnknown’ state [screenshot]

I’ve verified that the storage nodes have enough storage space, and couldn’t find any other obvious issues. I’m not sure about a couple of things:

  1. When do pods reach (and stay in) the Completed state? I thought the operator should have rescheduled them.
  2. Why did the operator not kill the pods in ContainerStatusUnknown and schedule new pods to bring the cluster back up automatically?

Here are some more logs from the operator pod:

{"level":"info","ts":"2026-04-10T05:34:04Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"5a007819-4815-4f68-b5b7-f619788013e9","reconciler":"controllers.chooseRemovals"}
{"level":"info","ts":"2026-04-10T05:34:04Z","logger":"controller","msg":"Subreconciler finished run","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"5a007819-4815-4f68-b5b7-f619788013e9","reconciler":"controllers.chooseRemovals","duration_seconds":0.000013421}
{"level":"info","ts":"2026-04-10T05:34:04Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"5a007819-4815-4f68-b5b7-f619788013e9","reconciler":"controllers.excludeProcesses"}
{"level":"info","ts":"2026-04-10T05:34:04Z","logger":"controller","msg":"current exclusions","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"5a007819-4815-4f68-b5b7-f619788013e9","reconciler":"controllers.excludeProcesses","exclusions":[]}
{"level":"info","ts":"2026-04-10T05:34:04Z","logger":"controller","msg":"Subreconciler finished run","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"5a007819-4815-4f68-b5b7-f619788013e9","reconciler":"controllers.excludeProcesses","duration_seconds":0.00001296}
{"level":"info","ts":"2026-04-10T05:34:04Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"5a007819-4815-4f68-b5b7-f619788013e9","reconciler":"controllers.changeCoordinators"}
{"level":"info","ts":"2026-04-10T05:34:04Z","logger":"controller","msg":"Skip process with empty localities","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"5a007819-4815-4f68-b5b7-f619788013e9","reconciler":"controllers.changeCoordinators","process":""}
{"level":"info","ts":"2026-04-10T05:34:04Z","logger":"controller","msg":"Cluster has an unhealthy coordinator","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"5a007819-4815-4f68-b5b7-f619788013e9","reconciler":"controllers.changeCoordinators","address":"xyz-fdb-abcd-log-55193.xyz-fdb-abcd.xyz-fdb-abcd.svc.cluster.local:4501"}
{"level":"info","ts":"2026-04-10T05:34:04Z","logger":"controller","msg":"Cluster has not enough running coordinators","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"5a007819-4815-4f68-b5b7-f619788013e9","reconciler":"controllers.changeCoordinators","runningCoordinators":2,"desiredCount":3}
{"level":"info","ts":"2026-04-10T05:34:04Z","logger":"controller","msg":"Deferring coordinator change due to safety check","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"5a007819-4815-4f68-b5b7-f619788013e9","reconciler":"controllers.changeCoordinators","error":"cannot: change coordinators: cluster has 3 missing coordinators, clusters last recovery was 74.20 seconds ago, waiting until the last recovery was 120 seconds ago"}
{"level":"info","ts":"2026-04-10T05:34:04Z","logger":"controller","msg":"Subreconciler finished run","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"5a007819-4815-4f68-b5b7-f619788013e9","reconciler":"controllers.changeCoordinators","duration_seconds":0.00008847}
{"level":"info","ts":"2026-04-10T05:34:04Z","logger":"controller","msg":"Delaying requeue for sub-reconciler","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"5a007819-4815-4f68-b5b7-f619788013e9","reconciler":"controllers.changeCoordinators","message":"","delayedRequeueDuration":"10s","error":"cannot: change coordinators: cluster has 3 missing coordinators, clusters last recovery was 74.20 seconds ago, waiting until the last recovery was 120 seconds ago"}
{"level":"info","ts":"2026-04-10T05:34:04Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"5a007819-4815-4f68-b5b7-f619788013e9","reconciler":"controllers.bounceProcesses"}
{"level":"info","ts":"2026-04-10T05:34:04Z","logger":"controller","msg":"Ignoring process with missing localities","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"5a007819-4815-4f68-b5b7-f619788013e9","reconciler":"controllers.bounceProcesses","address":"10.0.27.119:4501"}
{"level":"info","ts":"2026-04-10T05:34:04Z","logger":"controller","msg":"ignore process group with missing process","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"5a007819-4815-4f68-b5b7-f619788013e9","reconciler":"controllers.bounceProcesses","processGroupID":"log-55193"}
{"level":"info","ts":"2026-04-10T05:34:04Z","logger":"controller","msg":"ignore process group with missing process","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"5a007819-4815-4f68-b5b7-f619788013e9","reconciler":"controllers.bounceProcesses","processGroupID":"stateless-10681"}
{"level":"info","ts":"2026-04-10T05:34:04Z","logger":"controller","msg":"ignore process group with missing process","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"5a007819-4815-4f68-b5b7-f619788013e9","reconciler":"controllers.bounceProcesses","processGroupID":"stateless-39362"}
{"level":"info","ts":"2026-04-10T05:34:04Z","logger":"controller","msg":"ignore process group with missing process","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"5a007819-4815-4f68-b5b7-f619788013e9","reconciler":"controllers.bounceProcesses","processGroupID":"stateless-40726"}
{"level":"info","ts":"2026-04-10T05:34:04Z","logger":"controller","msg":"Subreconciler finished run","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"5a007819-4815-4f68-b5b7-f619788013e9","reconciler":"controllers.bounceProcesses","duration_seconds":0.00003758}
{"level":"info","ts":"2026-04-10T05:34:04Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"5a007819-4815-4f68-b5b7-f619788013e9","reconciler":"controllers.maintenanceModeChecker"}
{"level":"info","ts":"2026-04-10T05:34:04Z","logger":"controller","msg":"Subreconciler finished run","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"5a007819-4815-4f68-b5b7-f619788013e9","reconciler":"controllers.maintenanceModeChecker","duration_seconds":0.00000259}
{"level":"info","ts":"2026-04-10T05:34:04Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"5a007819-4815-4f68-b5b7-f619788013e9","reconciler":"controllers.updatePods"}
{"level":"info","ts":"2026-04-10T05:34:04Z","logger":"controller","msg":"Subreconciler finished run","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"5a007819-4815-4f68-b5b7-f619788013e9","reconciler":"controllers.updatePods","duration_seconds":0.00028216}
{"level":"info","ts":"2026-04-10T05:34:04Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"5a007819-4815-4f68-b5b7-f619788013e9","reconciler":"controllers.removeProcessGroups"}
{"level":"info","ts":"2026-04-10T05:34:04Z","logger":"controller","msg":"Subreconciler finished run","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"5a007819-4815-4f68-b5b7-f619788013e9","reconciler":"controllers.removeProcessGroups","duration_seconds":0.00000603}
{"level":"info","ts":"2026-04-10T05:34:04Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"5a007819-4815-4f68-b5b7-f619788013e9","reconciler":"controllers.removeServices"}
{"level":"info","ts":"2026-04-10T05:34:04Z","logger":"controller","msg":"Subreconciler finished run","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"5a007819-4815-4f68-b5b7-f619788013e9","reconciler":"controllers.removeServices","duration_seconds":0.00000322}
{"level":"info","ts":"2026-04-10T05:34:04Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"5a007819-4815-4f68-b5b7-f619788013e9","reconciler":"controllers.updateStatus"}
{"level":"info","ts":"2026-04-10T05:34:04Z","logger":"controller","msg":"skip updating fault domain for process group with missing process in FoundationDB cluster status","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"5a007819-4815-4f68-b5b7-f619788013e9","reconciler":"controllers.updateStatus","processGroupID":"log-55193"}
{"level":"info","ts":"2026-04-10T05:34:04Z","logger":"controller","msg":"skip updating fault domain for process group with missing process in FoundationDB cluster status","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"5a007819-4815-4f68-b5b7-f619788013e9","reconciler":"controllers.updateStatus","processGroupID":"stateless-10681"}
{"level":"info","ts":"2026-04-10T05:34:04Z","logger":"controller","msg":"skip updating fault domain for process group with missing process in FoundationDB cluster status","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"5a007819-4815-4f68-b5b7-f619788013e9","reconciler":"controllers.updateStatus","processGroupID":"stateless-39362"}
{"level":"info","ts":"2026-04-10T05:34:04Z","logger":"controller","msg":"skip updating fault domain for process group with missing process in FoundationDB cluster status","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"5a007819-4815-4f68-b5b7-f619788013e9","reconciler":"controllers.updateStatus","processGroupID":"stateless-40726"}
{"level":"info","ts":"2026-04-10T05:34:04Z","logger":"controller","msg":"Disable taint feature","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"5a007819-4815-4f68-b5b7-f619788013e9","reconciler":"controllers.updateStatus","Disabled":true}
{"level":"info","ts":"2026-04-10T05:34:04Z","logger":"controller","msg":"Skip process with empty localities","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"5a007819-4815-4f68-b5b7-f619788013e9","reconciler":"controllers.updateStatus","process":""}
{"level":"info","ts":"2026-04-10T05:34:04Z","logger":"controller","msg":"Cluster has an unhealthy coordinator","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"5a007819-4815-4f68-b5b7-f619788013e9","reconciler":"controllers.updateStatus","address":"xyz-fdb-abcd-log-55193.xyz-fdb-abcd.xyz-fdb-abcd.svc.cluster.local:4501"}
{"level":"info","ts":"2026-04-10T05:34:04Z","logger":"controller","msg":"Cluster has not enough running coordinators","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"5a007819-4815-4f68-b5b7-f619788013e9","reconciler":"controllers.updateStatus","runningCoordinators":2,"desiredCount":3}
{"level":"info","ts":"2026-04-10T05:34:04Z","logger":"controller","msg":"Has unhealthy process group","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"5a007819-4815-4f68-b5b7-f619788013e9","reconciler":"controllers.updateStatus","method":"CheckReconciliation","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","processGroupID":"log-55193","state":"HasUnhealthyProcess","conditions":["MissingProcesses","PodFailing"]}
{"level":"info","ts":"2026-04-10T05:34:04Z","logger":"controller","msg":"Has unhealthy process group","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"5a007819-4815-4f68-b5b7-f619788013e9","reconciler":"controllers.updateStatus","method":"CheckReconciliation","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","processGroupID":"stateless-10681","state":"HasUnhealthyProcess","conditions":["MissingProcesses","PodFailing"]}
{"level":"info","ts":"2026-04-10T05:34:04Z","logger":"controller","msg":"Has unhealthy process group","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"5a007819-4815-4f68-b5b7-f619788013e9","reconciler":"controllers.updateStatus","method":"CheckReconciliation","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","processGroupID":"stateless-39362","state":"HasUnhealthyProcess","conditions":["MissingProcesses","PodFailing"]}
{"level":"info","ts":"2026-04-10T05:34:04Z","logger":"controller","msg":"Has unhealthy process group","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"5a007819-4815-4f68-b5b7-f619788013e9","reconciler":"controllers.updateStatus","method":"CheckReconciliation","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","processGroupID":"stateless-40726","state":"HasUnhealthyProcess","conditions":["MissingProcesses","PodFailing","PodPending"]}
{"level":"info","ts":"2026-04-10T05:34:04Z","logger":"controller","msg":"Not all process groups are reconciled","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"5a007819-4815-4f68-b5b7-f619788013e9","reconciler":"controllers.updateStatus","method":"CheckReconciliation","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","desiredProcessGroups":9,"reconciledProcessGroups":5}
{"level":"info","ts":"2026-04-10T05:34:04Z","logger":"controller","msg":"Pending coordinator change","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"5a007819-4815-4f68-b5b7-f619788013e9","reconciler":"controllers.updateStatus","method":"CheckReconciliation","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","state":"NeedsNewCoordinators"}
{"level":"info","ts":"2026-04-10T05:34:04Z","msg":"unknown field \"spec.processes.general.podTemplate.metadata.creationTimestamp\"","controller":"foundationdbcluster","controllerGroup":"apps.foundationdb.org","controllerKind":"FoundationDBCluster","FoundationDBCluster":{"name":"xyz-fdb-abcd","namespace":"xyz-fdb-abcd"},"namespace":"xyz-fdb-abcd","name":"xyz-fdb-abcd","reconcileID":"621fa184-2655-44c4-a706-312d4e0fddeb"}
{"level":"info","ts":"2026-04-10T05:34:04Z","msg":"unknown field \"spec.processes.log.volumeClaimTemplate.metadata.creationTimestamp\"","controller":"foundationdbcluster","controllerGroup":"apps.foundationdb.org","controllerKind":"FoundationDBCluster","FoundationDBCluster":{"name":"xyz-fdb-abcd","namespace":"xyz-fdb-abcd"},"namespace":"xyz-fdb-abcd","name":"xyz-fdb-abcd","reconcileID":"621fa184-2655-44c4-a706-312d4e0fddeb"}
{"level":"info","ts":"2026-04-10T05:34:04Z","msg":"unknown field \"spec.processes.stateless.volumeClaimTemplate.metadata.creationTimestamp\"","controller":"foundationdbcluster","controllerGroup":"apps.foundationdb.org","controllerKind":"FoundationDBCluster","FoundationDBCluster":{"name":"xyz-fdb-abcd","namespace":"xyz-fdb-abcd"},"namespace":"xyz-fdb-abcd","name":"xyz-fdb-abcd","reconcileID":"621fa184-2655-44c4-a706-312d4e0fddeb"}
{"level":"info","ts":"2026-04-10T05:34:04Z","msg":"unknown field \"spec.processes.storage.volumeClaimTemplate.metadata.creationTimestamp\"","controller":"foundationdbcluster","controllerGroup":"apps.foundationdb.org","controllerKind":"FoundationDBCluster","FoundationDBCluster":{"name":"xyz-fdb-abcd","namespace":"xyz-fdb-abcd"},"namespace":"xyz-fdb-abcd","name":"xyz-fdb-abcd","reconcileID":"621fa184-2655-44c4-a706-312d4e0fddeb"}
{"level":"info","ts":"2026-04-10T05:34:04Z","logger":"controller","msg":"Subreconciler finished run","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"5a007819-4815-4f68-b5b7-f619788013e9","reconciler":"controllers.updateStatus","duration_seconds":0.017100822}
{"level":"info","ts":"2026-04-10T05:34:04Z","logger":"controller","msg":"Delaying requeue for sub-reconciler","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"5a007819-4815-4f68-b5b7-f619788013e9","reconciler":"controllers.updateStatus","message":"cluster is not fully reconciled","delayedRequeueDuration":"10s","error":null}
{"level":"info","ts":"2026-04-10T05:34:04Z","logger":"controller","msg":"Cluster was not fully reconciled by reconciliation process","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"5a007819-4815-4f68-b5b7-f619788013e9","status":{"reconciled":1,"needsCoordinatorChange":1,"hasUnhealthyProcess":1},"CurrentGeneration":1,"OriginalGeneration":1,"DelayedRequeue":"10s"}
{"level":"info","ts":"2026-04-10T05:34:04Z","logger":"controller","msg":"Reconciliation run finished","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"5a007819-4815-4f68-b5b7-f619788013e9","duration_seconds":1.579783953,"cacheStatus":true}
{"level":"info","ts":"2026-04-10T05:34:04Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"ba36628c-a354-4bcb-934d-a5460d59755a","reconciler":"controllers.addPods"}
{"level":"info","ts":"2026-04-10T05:34:04Z","logger":"controller","msg":"Subreconciler finished run","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"ba36628c-a354-4bcb-934d-a5460d59755a","reconciler":"controllers.addPods","duration_seconds":0.00024448}
{"level":"info","ts":"2026-04-10T05:34:04Z","logger":"controller","msg":"Fetch machine-readable status for reconciliation loop","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"ba36628c-a354-4bcb-934d-a5460d59755a","cacheStatus":true}
{"level":"info","ts":"2026-04-10T05:34:04Z","logger":"controller.fdbclient","msg":"Fetch values from FDB","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"ba36628c-a354-4bcb-934d-a5460d59755a","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":"2026-04-10T05:34:05Z","logger":"controller.fdbclient","msg":"Done fetching values from FDB","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"ba36628c-a354-4bcb-934d-a5460d59755a","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":"2026-04-10T05:34:05Z","logger":"controller.fdbclient","msg":"database is unavailable","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"ba36628c-a354-4bcb-934d-a5460d59755a","status":{"client":{"coordinators":{"coordinators":[{"address":"xyz-fdb-abcd-log-48370.xyz-fdb-abcd.xyz-fdb-abcd.svc.cluster.local:4501","reachable":true},{"address":"xyz-fdb-abcd-log-55193.xyz-fdb-abcd.xyz-fdb-abcd.svc.cluster.local:4501"},{"address":"xyz-fdb-abcd-log-60478.xyz-fdb-abcd.xyz-fdb-abcd.svc.cluster.local:4501","reachable":true}],"quorum_reachable":true},"database_status":{}},"cluster":{"configuration":{"redundancy_mode":"double","storage_engine":"ssd-2","usable_regions":1,"logs":3,"proxies":3,"commit_proxies":2,"grv_proxies":1,"resolvers":1,"log_routers":-1,"remote_logs":-1,"log_spill":2,"storage_migration_type":"disabled","perpetual_storage_wiggle":0,"perpetual_storage_wiggle_locality":"0","perpetual_storage_wiggle_engine":"none"},"processes":{"345b3f029e79da96931a5aecc83fc3e5":{"address":"10.0.27.113:4501","class_type":"storage","command_line":"/usr/bin/fdbserver --cluster_file=/var/fdb/data/fdb.cluster --seed_cluster_file=/var/dynamic-conf/fdb.cluster --public_address=[10.0.27.113]:4501 --class=storage --logdir=/var/log/fdb-trace-logs --loggroup=xyz-fdb-abcd --datadir=/var/fdb/data/1 --locality_process_id=storage-10819-1 --locality_instance_id=storage-10819 --locality_machineid=xyz-fdb-abcd-storage-10819 --locality_zoneid=xyz-fdb-abcd-storage-10819 --listen_address=[10.0.27.113]:4501 
--locality_dns_name=xyz-fdb-abcd-storage-10819.xyz-fdb-abcd.xyz-fdb-abcd.svc.cluster.local","locality":{"dns_name":"xyz-fdb-abcd-storage-10819.xyz-fdb-abcd.xyz-fdb-abcd.svc.cluster.local","instance_id":"storage-10819","machineid":"xyz-fdb-abcd-storage-10819","process_id":"storage-10819-1","processid":"345b3f029e79da96931a5aecc83fc3e5","zoneid":"xyz-fdb-abcd-storage-10819"},"version":"7.3.69","uptime_seconds":700.004,"run_loop_busy":0.0391424,"roles":[{"role":"log","id":"36fc1d16f8af228d","data_lag":{},"kvstore_used_bytes":104861752,"kvstore_total_bytes":20940668928,"kvstore_free_bytes":20723617792,"kvstore_available_bytes":20723617792,"read_latency_statistics":{"count":null,"median":null,"p99":null},"commit_latency_statistics":{"count":null,"median":null,"p99":null},"grv_latency_statistics":{"batch":{"count":null,"median":null,"p99":null},"default":{"count":null,"median":null,"p99":null}}}]},"5177543970da0444bc5f4a6f4e0a6a97":{"address":"10.0.27.143:4501","class_type":"log","command_line":"/usr/bin/fdbserver --cluster_file=/var/fdb/data/fdb.cluster --seed_cluster_file=/var/dynamic-conf/fdb.cluster --public_address=[10.0.27.143]:4501 --class=log --logdir=/var/log/fdb-trace-logs --loggroup=xyz-fdb-abcd --datadir=/var/fdb/data/1 --locality_process_id=log-48370-1 --locality_instance_id=log-48370 --locality_machineid=xyz-fdb-abcd-log-48370 --locality_zoneid=xyz-fdb-abcd-log-48370 --listen_address=[10.0.27.143]:4501 
--locality_dns_name=xyz-fdb-abcd-log-48370.xyz-fdb-abcd.xyz-fdb-abcd.svc.cluster.local","locality":{"dns_name":"xyz-fdb-abcd-log-48370.xyz-fdb-abcd.xyz-fdb-abcd.svc.cluster.local","instance_id":"log-48370","machineid":"xyz-fdb-abcd-log-48370","process_id":"log-48370-1","processid":"5177543970da0444bc5f4a6f4e0a6a97","zoneid":"xyz-fdb-abcd-log-48370"},"version":"7.3.69","uptime_seconds":80.0002,"run_loop_busy":0.014618599999999999,"roles":[{"role":"data_distributor","id":"724c965074c79300","data_lag":{},"kvstore_used_bytes":null,"kvstore_total_bytes":null,"kvstore_free_bytes":null,"kvstore_available_bytes":null,"read_latency_statistics":{"count":null,"median":null,"p99":null},"commit_latency_statistics":{"count":null,"median":null,"p99":null},"grv_latency_statistics":{"batch":{"count":null,"median":null,"p99":null},"default":{"count":null,"median":null,"p99":null}}},{"role":"ratekeeper","id":"3d37276fb82b13c3","data_lag":{},"kvstore_used_bytes":null,"kvstore_total_bytes":null,"kvstore_free_bytes":null,"kvstore_available_bytes":null,"read_latency_statistics":{"count":null,"median":null,"p99":null},"commit_latency_statistics":{"count":null,"median":null,"p99":null},"grv_latency_statistics":{"batch":{"count":null,"median":null,"p99":null},"default":{"count":null,"median":null,"p99":null}}},{"role":"log","id":"1d35c76aeaae9209","data_lag":{},"kvstore_used_bytes":null,"kvstore_total_bytes":null,"kvstore_free_bytes":null,"kvstore_available_bytes":null,"read_latency_statistics":{"count":null,"median":null,"p99":null},"commit_latency_statistics":{"count":null,"median":null,"p99":null},"grv_latency_statistics":{"batch":{"count":null,"median":null,"p99":null},"default":{"count":null,"median":null,"p99":null}}},{"role":"coordinator","data_lag":{},"kvstore_used_bytes":null,"kvstore_total_bytes":null,"kvstore_free_bytes":null,"kvstore_available_bytes":null,"read_latency_statistics":{"count":null,"median":null,"p99":null},"commit_latency_statistics":{"count":null,"median":null,"p99
":null},"grv_latency_statistics":{"batch":{"count":null,"median":null,"p99":null},"default":{"count":null,"median":null,"p99":null}}},{"role":"log","id":"7f35d6b16861f03b","data_lag":{},"kvstore_used_bytes":104882352,"kvstore_total_bytes":3077349376,"kvstore_free_bytes":2812010496,"kvstore_available_bytes":2812010496,"read_latency_statistics":{"count":null,"median":null,"p99":null},"commit_latency_statistics":{"count":null,"median":null,"p99":null},"grv_latency_statistics":{"batch":{"count":null,"median":null,"p99":null},"default":{"count":null,"median":null,"p99":null}}}]},"9a1c56e284882ec31d8b9642e891f10b":{"address":"10.0.27.167:4501","class_type":"storage","command_line":"/usr/bin/fdbserver --cluster_file=/var/fdb/data/fdb.cluster --seed_cluster_file=/var/dynamic-conf/fdb.cluster --public_address=[10.0.27.167]:4501 --class=storage --logdir=/var/log/fdb-trace-logs --loggroup=xyz-fdb-abcd --datadir=/var/fdb/data/1 --locality_process_id=storage-16106-1 --locality_instance_id=storage-16106 --locality_machineid=xyz-fdb-abcd-storage-16106 --locality_zoneid=xyz-fdb-abcd-storage-16106 --listen_address=[10.0.27.167]:4501 
--locality_dns_name=xyz-fdb-abcd-storage-16106.xyz-fdb-abcd.xyz-fdb-abcd.svc.cluster.local","locality":{"dns_name":"xyz-fdb-abcd-storage-16106.xyz-fdb-abcd.xyz-fdb-abcd.svc.cluster.local","instance_id":"storage-16106","machineid":"xyz-fdb-abcd-storage-16106","process_id":"storage-16106-1","processid":"9a1c56e284882ec31d8b9642e891f10b","zoneid":"xyz-fdb-abcd-storage-16106"},"version":"7.3.69","uptime_seconds":340.007,"run_loop_busy":0.0313403,"roles":[{"role":"log","id":"2eb91fd600d2affd","data_lag":{},"kvstore_used_bytes":104861752,"kvstore_total_bytes":20940668928,"kvstore_free_bytes":20723617792,"kvstore_available_bytes":20723617792,"read_latency_statistics":{"count":null,"median":null,"p99":null},"commit_latency_statistics":{"count":null,"median":null,"p99":null},"grv_latency_statistics":{"batch":{"count":null,"median":null,"p99":null},"default":{"count":null,"median":null,"p99":null}}}]},"c41ff8d4f58a6d7a1295866c42ba4af6":{"address":"10.0.27.13:4501","class_type":"storage","command_line":"/usr/bin/fdbserver --cluster_file=/var/fdb/data/fdb.cluster --seed_cluster_file=/var/dynamic-conf/fdb.cluster --public_address=[10.0.27.13]:4501 --class=storage --logdir=/var/log/fdb-trace-logs --loggroup=xyz-fdb-abcd --datadir=/var/fdb/data/1 --locality_process_id=storage-18963-1 --locality_instance_id=storage-18963 --locality_machineid=xyz-fdb-abcd-storage-18963 --locality_zoneid=xyz-fdb-abcd-storage-18963 --listen_address=[10.0.27.13]:4501 
--locality_dns_name=xyz-fdb-abcd-storage-18963.xyz-fdb-abcd.xyz-fdb-abcd.svc.cluster.local","locality":{"dns_name":"xyz-fdb-abcd-storage-18963.xyz-fdb-abcd.xyz-fdb-abcd.svc.cluster.local","instance_id":"storage-18963","machineid":"xyz-fdb-abcd-storage-18963","process_id":"storage-18963-1","processid":"c41ff8d4f58a6d7a1295866c42ba4af6","zoneid":"xyz-fdb-abcd-storage-18963"},"version":"7.3.69","uptime_seconds":340.001,"run_loop_busy":0.0057557,"roles":[{"role":"consistency_scan","id":"55da43bacd8c0733","data_lag":{},"kvstore_used_bytes":null,"kvstore_total_bytes":null,"kvstore_free_bytes":null,"kvstore_available_bytes":null,"read_latency_statistics":{"count":null,"median":null,"p99":null},"commit_latency_statistics":{"count":null,"median":null,"p99":null},"grv_latency_statistics":{"batch":{"count":null,"median":null,"p99":null},"default":{"count":null,"median":null,"p99":null}}}]},"e140142d674b9df941bd73f1afe76727":{"address":"10.0.27.119:4501","class_type":"stateless","command_line":"/usr/bin/fdbserver --cluster_file=/var/fdb/data/fdb.cluster --seed_cluster_file=/var/dynamic-conf/fdb.cluster --public_address=[10.0.27.119]:4501 --class=stateless --logdir=/var/log/fdb-trace-logs --loggroup=xyz-fdb-abcd --datadir=/var/fdb/data/1 --locality_process_id=stateless-40726-1 --locality_instance_id=stateless-40726 --locality_machineid=xyz-fdb-abcd-stateless-40726 --locality_zoneid=xyz-fdb-abcd-stateless-40726 --listen_address=[10.0.27.119]:4501 
--locality_dns_name=xyz-fdb-abcd-stateless-40726.xyz-fdb-abcd.xyz-fdb-abcd.svc.cluster.local","version":"7.3.69","roles":[{"role":"master","id":"74013759391c5855","data_lag":{},"kvstore_used_bytes":null,"kvstore_total_bytes":null,"kvstore_free_bytes":null,"kvstore_available_bytes":null,"read_latency_statistics":{"count":null,"median":null,"p99":null},"commit_latency_statistics":{"count":null,"median":null,"p99":null},"grv_latency_statistics":{"batch":{"count":null,"median":null,"p99":null},"default":{"count":null,"median":null,"p99":null}}},{"role":"cluster_controller","id":"25d553c9b89bb9e7","data_lag":{},"kvstore_used_bytes":null,"kvstore_total_bytes":null,"kvstore_free_bytes":null,"kvstore_available_bytes":null,"read_latency_statistics":{"count":null,"median":null,"p99":null},"commit_latency_statistics":{"count":null,"median":null,"p99":null},"grv_latency_statistics":{"batch":{"count":null,"median":null,"p99":null},"default":{"count":null,"median":null,"p99":null}}},{"role":"commit_proxy","id":"bed406b911a58dd3","data_lag":{},"kvstore_used_bytes":null,"kvstore_total_bytes":null,"kvstore_free_bytes":null,"kvstore_available_bytes":null,"read_latency_statistics":{"count":null,"median":null,"p99":null},"commit_latency_statistics":{"count":null,"median":null,"p99":null},"grv_latency_statistics":{"batch":{"count":null,"median":null,"p99":null},"default":{"count":null,"median":null,"p99":null}}},{"role":"grv_proxy","id":"bd3b22dfd8643e01","data_lag":{},"kvstore_used_bytes":null,"kvstore_total_bytes":null,"kvstore_free_bytes":null,"kvstore_available_bytes":null,"read_latency_statistics":{"count":null,"median":null,"p99":null},"commit_latency_statistics":{"count":null,"median":null,"p99":null},"grv_latency_statistics":{"batch":{"count":null,"median":null,"p99":null},"default":{"count":null,"median":null,"p99":null}}},{"role":"resolver","id":"ddb04176fb27353c","data_lag":{},"kvstore_used_bytes":null,"kvstore_total_bytes":null,"kvstore_free_bytes":null,"kvstore_available_by
tes":null,"read_latency_statistics":{"count":null,"median":null,"p99":null},"commit_latency_statistics":{"count":null,"median":null,"p99":null},"grv_latency_statistics":{"batch":{"count":null,"median":null,"p99":null},"default":{"count":null,"median":null,"p99":null}}}]}},"data":{"total_kv_size_bytes":15271718,"moving_data":{},"state":{"healthy":true,"name":"healthy","min_replicas_remaining":2},"team_trackers":[{"primary":true,"state":{"healthy":true,"name":"healthy","min_replicas_remaining":2},"unhealthy_servers":0}],"total_disk_used_bytes":329105408},"full_replication":true,"generation":178,"clients":{},"layers":{"_valid":true,"backup":{}},"logs":[{"current":true,"log_fault_tolerance":1,"log_replication_factor":2},{"log_replication_factor":2}],"qos":{"limiting_durability_lag_storage_server":{"seconds":5.07766,"versions":5077658},"worst_data_lag_storage_server":{},"worst_durability_lag_storage_server":{"seconds":5.08494,"versions":5084936},"worst_queue_bytes_storage_server":9991865},"fault_tolerance":{},"recovery_state":{},"connection_string":"xyz_fdb_abcd:qfPNwpLRByg4F4gBo5k6G3CC6DA0dwZP@xyz-fdb-abcd-log-48370.xyz-fdb-abcd.xyz-fdb-abcd.svc.cluster.local:4501,xyz-fdb-abcd-log-55193.xyz-fdb-abcd.xyz-fdb-abcd.svc.cluster.local:4501,xyz-fdb-abcd-log-60478.xyz-fdb-abcd.xyz-fdb-abcd.svc.cluster.local:4501","messages":[{"name":"status_incomplete","description":"Unable to retrieve all status information."}],"bounce_impact":{"can_clean_bounce":false},"database_available":true,"active_primary_dc":"","database_lock_state":{"locked":false}}}}
{"level":"info","ts":"2026-04-10T05:34:05Z","logger":"controller","msg":"could not fetch machine-readable status and therefore didn't cache it","namespace":"xyz-fdb-abcd","cluster":"xyz-fdb-abcd","traceID":"ba36628c-a354-4bcb-934d-a5460d59755a","error":"fdb timeout: database is unavailable"}
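
For anyone skimming the excerpt above, here is a minimal Python sketch that condenses operator logs in this JSON-lines format into the two things that matter most: which process groups are reported unhealthy (with their conditions), and which sub-reconcilers logged a non-null error. The function name `summarize_operator_log` is made up for illustration; the field names (`msg`, `processGroupID`, `conditions`, `reconciler`, `error`) are taken from the log lines above. Lines that do not parse as JSON, such as wrapped fragments of the long status entry, are skipped.

```python
import json


def summarize_operator_log(lines):
    """Summarize JSON-lines operator logs.

    Returns (unhealthy, errors):
      unhealthy: dict mapping processGroupID -> list of conditions
      errors:    sorted list of (reconciler, error message) pairs
    """
    unhealthy = {}
    errors = set()
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # not a JSON object, e.g. a wrapped continuation line
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # truncated or wrapped entry
        if entry.get("msg") == "Has unhealthy process group":
            group = entry.get("processGroupID", "")
            unhealthy[group] = entry.get("conditions", [])
        if entry.get("error"):  # skips missing and null errors
            errors.add((entry.get("reconciler", ""), entry["error"]))
    return unhealthy, sorted(errors)

# Example use with a saved log file (the file name is hypothetical):
#   with open("operator.log") as f:
#       unhealthy, errors = summarize_operator_log(f)
```

Applied to the excerpt above, this would list log-55193, stateless-10681, stateless-39362 and stateless-40726 as unhealthy, and surface the deferred coordinator change as the only repeated error.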

Could you share some additional information about your setup (Kubernetes version, operator version)? Were all those pods running on the same underlying Kubernetes node? Have you checked the Kubernetes logs to see why those pods are in the Completed status? That should never happen. Was there an operation on the Kubernetes cluster, e.g. a Kubernetes upgrade?

The operator is not replacing those pods because the cluster is currently unavailable. In such a scenario the operator doesn’t do any replacements (since the exclusion would not work), and it doesn’t recreate the pods because that is not implemented yet (we haven’t seen this behaviour before).
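
To see what the logged status says about availability, a small Python helper can pull the relevant fields out of the `status` object from the "database is unavailable" log entry. This is only a sketch for reading that JSON: the function name `availability_summary` is made up, and treating a missing `available` or `reachable` field as false is an assumption (the logged JSON appears to omit false values). Which of these fields the operator actually consults is not stated in this thread.

```python
def availability_summary(status):
    """Extract availability-related fields from a machine-readable status dict.

    Missing boolean fields are treated as False (an assumption, see above).
    """
    client = status.get("client", {})
    cluster = status.get("cluster", {})
    coordinator_info = client.get("coordinators", {})
    coordinators = coordinator_info.get("coordinators", [])
    return {
        # Availability as seen from the client section of the status
        "client_database_available": client.get("database_status", {}).get("available", False),
        # Availability as reported in the cluster section of the status
        "cluster_database_available": cluster.get("database_available", False),
        "quorum_reachable": coordinator_info.get("quorum_reachable", False),
        # Coordinators without "reachable": true
        "unreachable_coordinators": [
            c.get("address", "") for c in coordinators if not c.get("reachable", False)
        ],
    }
```

For the status logged above, this reports an empty client-side `database_status`, `database_available: true` in the cluster section, a reachable quorum, and the log-55193 coordinator as the one that is not reachable.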

If you could share some more details about the setup and how this happened, we could create a test case and look more closely into this.

Thank you for the reply. Here are the details:

Kubernetes version: 1.33 (deployed through Amazon Elastic Kubernetes Service)
Operator version: v2.23.0

Were all those pods running on the same underlying Kubernetes node?

Yes. We’ve set pod affinity so all pods always run on the same node.

Was there an operation on the Kubernetes cluster, e.g. a Kubernetes upgrade?

Nope. There was nothing in progress.

Have you checked the Kubernetes logs to see why those pods are in the Completed status? That should never happen.

We checked the describe output of the pod but could not find anything interesting.

We’ve seen this issue a couple of times now. What specific things should we check, and which logs should we capture, the next time it happens?

Could you check the logs of the main container and paste them here?
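
As a rough checklist for the next occurrence, the Python sketch below builds the kubectl commands that would capture the pod description, the full pod object (whose `status.containerStatuses` normally carries the termination reason and exit code), the current and previous logs of the main container, and the namespace events. The helper names are made up, and the container name `foundationdb` is an assumption about how the operator names the main container; adjust it to whatever `kubectl describe pod` shows.

```python
import subprocess


def diagnostic_commands(namespace, pod, container="foundationdb"):
    """Return the kubectl invocations to run for one affected pod.

    The default container name is an assumption; check the pod spec.
    """
    base = ["kubectl", "-n", namespace]
    return [
        base + ["describe", "pod", pod],
        base + ["get", "pod", pod, "-o", "yaml"],
        base + ["logs", pod, "-c", container],
        # Logs of the previous container instance, if it was restarted
        base + ["logs", pod, "-c", container, "--previous"],
        base + ["get", "events", "--sort-by=.lastTimestamp"],
    ]


def capture(namespace, pod, container="foundationdb"):
    """Run each command and return its combined output keyed by command line."""
    results = {}
    for cmd in diagnostic_commands(namespace, pod, container):
        proc = subprocess.run(cmd, capture_output=True, text=True)
        results[" ".join(cmd)] = proc.stdout + proc.stderr
    return results

# Example (the pod name is inferred from the DNS names in the logs above):
#   output = capture("xyz-fdb-abcd", "xyz-fdb-abcd-log-55193")
```

Capturing this before deleting or recreating the affected pods matters, since the previous-container logs and events may not be retrievable afterwards.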

We haven’t seen it in a while now. We will reopen this when we see the errors again, and include the logs of the main container.