Multi DC Coordinators

When purposefully deleting coordinator pods in the primary DC, the coordinators are marked as unreachable, but no new coordinators are recruited, even though there are plenty of candidate processes available. Is this expected behavior?

Could you share some more information? Are you deleting only the Pods that host the coordinator processes, or all of them? Are you able to share the operator logs and the configuration of your setup? What operator version do you use? The logs of the operator(s) should contain information about why no new coordinators are chosen: fdb-kubernetes-operator/controllers/change_coordinators.go at main · FoundationDB/fdb-kubernetes-operator · GitHub.

We have a one-pod-per-node design, and I am deleting only the coordinator pods. Operator version: v1.34.0.

{"level":"info","ts":"2024-06-11T14:23:15Z","logger":"fdbclient","msg":"found cluster message(s) in the machine-readable status","namespace":"dc1","cluster":"fdb-cluster-1","messages":[{"name":"client_issues","description":"Some clients of this cluster have issues."}]}
{"level":"info","ts":"2024-06-11T14:23:15Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus"}
{"level":"info","ts":"2024-06-11T14:23:15Z","logger":"controller","msg":"skip updating fault domain for process group with missing process in FoundationDB cluster status","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","processGroupID":"dc1-storage-1"}
{"level":"info","ts":"2024-06-11T14:23:15Z","logger":"controller","msg":"skip updating fault domain for process group with missing process in FoundationDB cluster status","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","processGroupID":"dc1-storage-10"}
{"level":"info","ts":"2024-06-11T14:23:15Z","logger":"controller","msg":"skip updating fault domain for process group with missing process in FoundationDB cluster status","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","processGroupID":"dc1-storage-11"}
{"level":"info","ts":"2024-06-11T14:23:15Z","logger":"controller","msg":"Disable taint feature","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","Disabled":true}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Unable to build pod client","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","processGroupID":"dc1-storage-1","message":"waiting for pod dc1/fdb-cluster-1/fdb-cluster-1-storage-1 to be assigned an IP"}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Unable to build pod client","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","processGroupID":"dc1-storage-10","message":"waiting for pod dc1/fdb-cluster-1/fdb-cluster-1-storage-10 to be assigned an IP"}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Unable to build pod client","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","processGroupID":"dc1-storage-11","message":"waiting for pod dc1/fdb-cluster-1/fdb-cluster-1-storage-11 to be assigned an IP"}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Has unhealthy process group","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","method":"CheckReconciliation","namespace":"dc1","cluster":"fdb-cluster-1","processGroupID":"dc1-storage-1","state":"HasUnhealthyProcess","conditions":["MissingProcesses","SidecarUnreachable","IncorrectConfigMap","PodPending"]}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Has unhealthy process group","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","method":"CheckReconciliation","namespace":"dc1","cluster":"fdb-cluster-1","processGroupID":"dc1-storage-10","state":"HasUnhealthyProcess","conditions":["MissingProcesses","SidecarUnreachable","PodFailing","PodPending"]}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Has unhealthy process group","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","method":"CheckReconciliation","namespace":"dc1","cluster":"fdb-cluster-1","processGroupID":"dc1-storage-11","state":"HasUnhealthyProcess","conditions":["MissingProcesses","SidecarUnreachable","PodFailing","PodPending"]}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Not all process groups are reconciled","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","method":"CheckReconciliation","namespace":"dc1","cluster":"fdb-cluster-1","desiredProcessGroups":33,"reconciledProcessGroups":30}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","duration_seconds":1.129144975}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateLockConfiguration"}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateLockConfiguration","duration_seconds":0.000017201}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateConfigMap"}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateConfigMap","duration_seconds":0.000416808}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.checkClientCompatibility"}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.checkClientCompatibility","duration_seconds":0.000020701}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.deletePodsForBuggification"}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.deletePodsForBuggification","duration_seconds":0.041893981}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.replaceMisconfiguredProcessGroups"}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.replaceMisconfiguredProcessGroups","duration_seconds":0.00371097}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.replaceFailedProcessGroups"}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.replaceFailedProcessGroups","duration_seconds":0.0000191}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.addProcessGroups"}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.addProcessGroups","duration_seconds":0.0000218}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.addServices"}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.addServices","duration_seconds":0.0000226}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.addPVCs"}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.addPVCs","duration_seconds":0.000297105}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.addPods"}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.addPods","duration_seconds":0.001235123}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.generateInitialClusterFile"}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.generateInitialClusterFile","duration_seconds":0.0000078}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.removeIncompatibleProcesses"}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.removeIncompatibleProcesses","duration_seconds":0.000006}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateSidecarVersions"}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateSidecarVersions","duration_seconds":0.0000114}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updatePodConfig"}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Unable to generate pod client","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updatePodConfig","processGroupID":"dc1-storage-1","message":"waiting for pod dc1/fdb-cluster-1/fdb-cluster-1-storage-1 to be assigned an IP"}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updatePodConfig","duration_seconds":0.030892177}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Delaying requeue for sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updatePodConfig","message":"Waiting for Pod to receive ConfigMap update","error":null}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateMetadata"}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateMetadata","duration_seconds":0.001147721}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateDatabaseConfiguration"}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateDatabaseConfiguration","duration_seconds":0.000113002}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.chooseRemovals"}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.chooseRemovals","duration_seconds":0.000189004}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.excludeProcesses"}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"current exclusions","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.excludeProcesses","exclusions":[]}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.excludeProcesses","duration_seconds":0.000069601}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.changeCoordinators"}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.changeCoordinators","duration_seconds":0.003387363}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.bounceProcesses"}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"ignore process group with missing process","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.bounceProcesses","processGroupID":"dc1-storage-1"}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"ignore process group with missing process","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.bounceProcesses","processGroupID":"dc1-storage-10"}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"ignore process group with missing process","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.bounceProcesses","processGroupID":"dc1-storage-11"}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.bounceProcesses","duration_seconds":0.000312006}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.maintenanceModeChecker"}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.maintenanceModeChecker","duration_seconds":0.000011601}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updatePods"}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updatePods","duration_seconds":0.00429598}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.removeProcessGroups"}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.removeProcessGroups","duration_seconds":0.000063902}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.removeServices"}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.removeServices","duration_seconds":0.0000081}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus"}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"skip updating fault domain for process group with missing process in FoundationDB cluster status","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","processGroupID":"dc1-storage-1"}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"skip updating fault domain for process group with missing process in FoundationDB cluster status","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","processGroupID":"dc1-storage-10"}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"skip updating fault domain for process group with missing process in FoundationDB cluster status","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","processGroupID":"dc1-storage-11"}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Disable taint feature","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","Disabled":true}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Unable to build pod client","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","processGroupID":"dc1-storage-1","message":"waiting for pod dc1/fdb-cluster-1/fdb-cluster-1-storage-1 to be assigned an IP"}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Unable to build pod client","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","processGroupID":"dc1-storage-10","message":"waiting for pod dc1/fdb-cluster-1/fdb-cluster-1-storage-10 to be assigned an IP"}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Unable to build pod client","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","processGroupID":"dc1-storage-11","message":"waiting for pod dc1/fdb-cluster-1/fdb-cluster-1-storage-11 to be assigned an IP"}
{"level":"info","ts":"2024-06-11T14:23:18Z","logger":"controller","msg":"Has unhealthy process group","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","method":"CheckReconciliation","namespace":"dc1","cluster":"fdb-cluster-1","processGroupID":"dc1-storage-1","state":"HasUnhealthyProcess","conditions":["MissingProcesses","SidecarUnreachable","IncorrectConfigMap","PodPending"]}
{"level":"info","ts":"2024-06-11T14:23:18Z","logger":"controller","msg":"Has unhealthy process group","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","method":"CheckReconciliation","namespace":"dc1","cluster":"fdb-cluster-1","processGroupID":"dc1-storage-10","state":"HasUnhealthyProcess","conditions":["MissingProcesses","SidecarUnreachable","PodFailing","PodPending"]}
{"level":"info","ts":"2024-06-11T14:23:18Z","logger":"controller","msg":"Has unhealthy process group","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","method":"CheckReconciliation","namespace":"dc1","cluster":"fdb-cluster-1","processGroupID":"dc1-storage-11","state":"HasUnhealthyProcess","conditions":["MissingProcesses","SidecarUnreachable","PodFailing","PodPending"]}
{"level":"info","ts":"2024-06-11T14:23:18Z","logger":"controller","msg":"Not all process groups are reconciled","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","method":"CheckReconciliation","namespace":"dc1","cluster":"fdb-cluster-1","desiredProcessGroups":33,"reconciledProcessGroups":30}
{"level":"info","ts":"2024-06-11T14:23:18Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","duration_seconds":1.095778038}
{"level":"info","ts":"2024-06-11T14:23:18Z","logger":"controller","msg":"Cluster was not fully reconciled by reconciliation process","namespace":"dc1","cluster":"fdb-cluster-1","status":{"hasUnhealthyProcess":2},"CurrentGeneration":0,"OriginalGeneration":2,"DelayedRequeue":true}
{"level":"info","ts":"2024-06-11T14:23:18Z","logger":"controller","msg":"Reconciliation run finished","namespace":"dc1","cluster":"fdb-cluster-1","duration_seconds":2.583983514,"cacheStatus":true}
{"level":"info","ts":"2024-06-11T14:23:18Z","logger":"controller","msg":"Fetch machine-readable status for reconcilitation loop","namespace":"dc1","cluster":"fdb-cluster-1","cacheStatus":true}
{"level":"info","ts":"2024-06-11T14:23:18Z","logger":"controller","msg":"Trying connection options","namespace":"dc1","cluster":"fdb-cluster-1","connectionString":["fdb_cluster_1:oXYryIImmqVkRDlBiDhzT2xXUExHE3ng@10.113.181.140:4502:tls,10.113.181.144:4502:tls,10.113.181.145:4512:tls,10.113.181.146:4504:tls,10.113.181.148:4508:tls,10.113.181.149:4500:tls,10.113.181.150:4500:tls,10.113.181.158:4508:tls,10.113.181.202:4500:tls","fdb_cluster_1:hMJwS2BhvPLzaSB8aWKoHbITgOUSf6Rs@10.113.181.164:4500:tls,10.113.181.203:4500:tls,10.113.181.153:4500:tls,10.113.181.151:4500:tls,10.113.181.148:4500:tls"]}
{"level":"info","ts":"2024-06-11T14:23:18Z","logger":"controller","msg":"Attempting to get connection string from cluster","namespace":"dc1","cluster":"fdb-cluster-1","connectionString":"fdb_cluster_1:oXYryIImmqVkRDlBiDhzT2xXUExHE3ng@10.113.181.140:4502:tls,10.113.181.144:4502:tls,10.113.181.145:4512:tls,10.113.181.146:4504:tls,10.113.181.148:4508:tls,10.113.181.149:4500:tls,10.113.181.150:4500:tls,10.113.181.158:4508:tls,10.113.181.202:4500:tls"}
{"level":"info","ts":"2024-06-11T14:23:18Z","logger":"fdbclient","msg":"Fetch values from FDB","namespace":"dc1","cluster":"fdb-cluster-1","key":"\ufffd/coordinators"}
{"level":"info","ts":"2024-06-11T14:23:18Z","logger":"fdbclient","msg":"Done fetching values from FDB","namespace":"dc1","cluster":"fdb-cluster-1","key":"\ufffd/coordinators"}
{"level":"info","ts":"2024-06-11T14:23:18Z","logger":"controller","msg":"Chose connection option","namespace":"dc1","cluster":"fdb-cluster-1","connectionString":"fdb_cluster_1:oXYryIImmqVkRDlBiDhzT2xXUExHE3ng@10.113.181.140:4502:tls,10.113.181.144:4502:tls,10.113.181.145:4512:tls,10.113.181.146:4504:tls,10.113.181.148:4508:tls,10.113.181.149:4500:tls,10.113.181.150:4500:tls,10.113.181.158:4508:tls,10.113.181.202:4500:tls"}
{"level":"info","ts":"2024-06-11T14:23:18Z","logger":"fdbclient","msg":"Fetch values from FDB","namespace":"dc1","cluster":"fdb-cluster-1","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":"2024-06-11T14:23:18Z","logger":"fdbclient","msg":"Done fetching values from FDB","namespace":"dc1","cluster":"fdb-cluster-1","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":"2024-06-11T14:23:18Z","logger":"fdbclient","msg":"found cluster message(s) in the machine-readable status","namespace":"dc1","cluster":"fdb-cluster-1","messages":[{"name":"client_issues","description":"Some clients of this cluster have issues."}]}
{"level":"info","ts":"2024-06-11T14:23:18Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus"}
{"level":"info","ts":"2024-06-11T14:23:18Z","logger":"controller","msg":"skip updating fault domain for process group with missing process in FoundationDB cluster status","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","processGroupID":"dc1-storage-1"}
{"level":"info","ts":"2024-06-11T14:23:18Z","logger":"controller","msg":"skip updating fault domain for process group with missing process in FoundationDB cluster status","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","processGroupID":"dc1-storage-10"}
{"level":"info","ts":"2024-06-11T14:23:18Z","logger":"controller","msg":"skip updating fault domain for process group with missing process in FoundationDB cluster status","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","processGroupID":"dc1-storage-11"}
{"level":"info","ts":"2024-06-11T14:23:18Z","logger":"controller","msg":"Disable taint feature","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","Disabled":true}

I don’t see any log messages around Cluster has an unhealthy coordinator (fdb-kubernetes-operator/internal/locality/locality.go at main · FoundationDB/fdb-kubernetes-operator · GitHub). Based on the logs, you probably deleted dc1-storage-1, dc1-storage-10 and dc1-storage-11? Is this issue reproducible? If so, could you please provide either an e2e test or instructions on how to perform your test? Are the deleted pods stuck in Pending?
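For context, the unhealthy-coordinator check ultimately comes down to reading the coordinator reachability flags from the machine-readable status. Here is a minimal sketch of that idea in Python, assuming the usual `status json` layout where `client.coordinators.coordinators` lists each coordinator with an `address` and a `reachable` boolean; the helper name is mine, not the operator's:

```python
import json

def unreachable_coordinators(status_json: str) -> list[str]:
    """Return the addresses of coordinators flagged as unreachable
    in the machine-readable status ('status json' output)."""
    status = json.loads(status_json)
    coordinators = (
        status.get("client", {}).get("coordinators", {}).get("coordinators", [])
    )
    # A coordinator with reachable=false (or a missing flag) is reported.
    return [c["address"] for c in coordinators if not c.get("reachable", False)]

# Trimmed, illustrative status document (addresses are made up):
sample = json.dumps({
    "client": {"coordinators": {"coordinators": [
        {"address": "10.113.181.140:4502:tls", "reachable": True},
        {"address": "10.113.181.144:4502:tls", "reachable": False},
    ]}}
})
print(unreachable_coordinators(sample))  # ['10.113.181.144:4502:tls']
```

If the operator were seeing the same picture, you would expect the corresponding "unhealthy coordinator" log lines; their absence suggests the operator's view of the status differed from yours.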

Could you provide some more details on what you tested and what you observed? I created an e2e test case and I’m not able to reproduce the issue: Add e2e test for HA cluster to make sure coordinators are changed by johscheuer · Pull Request #2058 · FoundationDB/fdb-kubernetes-operator · GitHub

I am using a pod-per-node topology.
The previous operator logs were generated when storage servers were the coordinators; now I just let the log processes take this role for simplicity (sorry for using "simple" in an FDB forum :slight_smile: ).

  • When injecting a pod failure with chaos-mesh into the 3 log/coordinator pods, the coordinators are marked unreachable, but the operator does not react in any way; only if I delete the associated PVCs are storage servers recruited as coordinators.
  • The log pods are also left alone; I would assume the operator would try to create new log pods (a growth followed by a shrink). Also, after the previously injected errors are gone, the pod is marked as running again, but because the associated PVCs were deleted (and remain in Pending), the PVC is only recreated after the pods are restarted (it could be that I did not wait long enough for the PVC reconciler to do its thing…).
  • To reproduce, just replace the log pods in the file below:
kind: PodChaos
apiVersion: chaos-mesh.org/v1alpha1
metadata:
  namespace: dc1
  name: fdb-cluster-1 
spec:
  mode: all
  action: pod-failure
  duration: 120m
  selector:
    namespaces:
      - dc1
    pods:
      dc1:
      - fdb-cluster-1-log-1
      - fdb-cluster-1-log-2
      - fdb-cluster-1-log-3

Thanks for sharing the setup, that makes it easier to build an e2e test case for this. Have you configured the automatic replacements in some way, or do you use the default values? The default replacement time is 7200 seconds: fdb-kubernetes-operator/api/v1beta2/foundationdbcluster_types.go at main · FoundationDB/fdb-kubernetes-operator · GitHub. And can I assume that the cluster has enough pods running in that namespace, so that other pods could be recruited as coordinators?
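If you want to experiment with a shorter failure-detection window, the knob lives under the cluster spec's automation options. A hedged sketch of the relevant fragment, with field names as I understand the v1beta2 API (verify against your CRD version before applying):

```yaml
apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  name: fdb-cluster-1
  namespace: dc1
spec:
  automationOptions:
    replacements:
      enabled: true
      # Default is 7200 seconds (2 hours); lower it only for testing,
      # a too-aggressive value can trigger unnecessary replacements.
      failureDetectionTimeSeconds: 600
```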

I tried to reproduce that again with an e2e test case (here is the PR: Add another ha e2e test for pod failures by johscheuer · Pull Request #2061 · FoundationDB/fdb-kubernetes-operator · GitHub). In my case I don’t see any issues and the operator is recruiting new coordinators. The pods are not replaced in the test case as the time window is too small, but I can add that tomorrow.

If you could share some more logs and could verify that enough additional Pods are available, that would be helpful.

Hi @johscheuer, thanks for your help. Setting the replacement time to a lower value seems to fix our problem. Could you please share the reasoning behind the 2-hour default replacement time?

I spoke too early; it worked several times when the DB was empty…

There are enough pods (storage and stateless) for coordinators… I will provide logs tomorrow.

Based on my observations, the operator pod is not detecting that pods are down, hence no Cluster has an unhealthy coordinator messages; only if I restart the operator pod do messages about unhealthy coordinators start popping up, new coordinators get recruited (from storage servers), and new log pods get created.

Interesting, is the operator complaining about anything during startup? The operator should receive "events" when something changes on the managed Pods. I just checked the change log for the operator from 1.34 to the recent 1.40 and I don’t see any changes related to that logic, or changes in the controller-runtime :thinking: