Multi DC Coordinators

When purposefully deleting coordinator pods in the primary DC, the coordinators are marked as unreachable, but no new coordinators are recruited, even though there are plenty of candidate processes available. Is this expected behavior?

Could you share some more information? Are you deleting only the Pods that host the coordinator processes, or all of them? Are you able to share the operator logs and the configuration of your setup? What operator version do you use? The logs of the operator(s) should contain information about why no new coordinators are chosen: fdb-kubernetes-operator/controllers/change_coordinators.go at main · FoundationDB/fdb-kubernetes-operator · GitHub.

We have a one-pod-per-node design, and I am deleting only the coordinator pods. Operator version: v1.34.0.

{"level":"info","ts":"2024-06-11T14:23:15Z","logger":"fdbclient","msg":"found cluster message(s) in the machine-readable status","namespace":"dc1","cluster":"fdb-cluster-1","messages":[{"name":"client_issues","description":"Some clients of this cluster have issues."}]}
{"level":"info","ts":"2024-06-11T14:23:15Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus"}
{"level":"info","ts":"2024-06-11T14:23:15Z","logger":"controller","msg":"skip updating fault domain for process group with missing process in FoundationDB cluster status","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","processGroupID":"dc1-storage-1"}
{"level":"info","ts":"2024-06-11T14:23:15Z","logger":"controller","msg":"skip updating fault domain for process group with missing process in FoundationDB cluster status","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","processGroupID":"dc1-storage-10"}
{"level":"info","ts":"2024-06-11T14:23:15Z","logger":"controller","msg":"skip updating fault domain for process group with missing process in FoundationDB cluster status","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","processGroupID":"dc1-storage-11"}
{"level":"info","ts":"2024-06-11T14:23:15Z","logger":"controller","msg":"Disable taint feature","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","Disabled":true}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Unable to build pod client","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","processGroupID":"dc1-storage-1","message":"waiting for pod dc1/fdb-cluster-1/fdb-cluster-1-storage-1 to be assigned an IP"}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Unable to build pod client","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","processGroupID":"dc1-storage-10","message":"waiting for pod dc1/fdb-cluster-1/fdb-cluster-1-storage-10 to be assigned an IP"}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Unable to build pod client","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","processGroupID":"dc1-storage-11","message":"waiting for pod dc1/fdb-cluster-1/fdb-cluster-1-storage-11 to be assigned an IP"}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Has unhealthy process group","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","method":"CheckReconciliation","namespace":"dc1","cluster":"fdb-cluster-1","processGroupID":"dc1-storage-1","state":"HasUnhealthyProcess","conditions":["MissingProcesses","SidecarUnreachable","IncorrectConfigMap","PodPending"]}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Has unhealthy process group","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","method":"CheckReconciliation","namespace":"dc1","cluster":"fdb-cluster-1","processGroupID":"dc1-storage-10","state":"HasUnhealthyProcess","conditions":["MissingProcesses","SidecarUnreachable","PodFailing","PodPending"]}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Has unhealthy process group","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","method":"CheckReconciliation","namespace":"dc1","cluster":"fdb-cluster-1","processGroupID":"dc1-storage-11","state":"HasUnhealthyProcess","conditions":["MissingProcesses","SidecarUnreachable","PodFailing","PodPending"]}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Not all process groups are reconciled","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","method":"CheckReconciliation","namespace":"dc1","cluster":"fdb-cluster-1","desiredProcessGroups":33,"reconciledProcessGroups":30}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","duration_seconds":1.129144975}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateLockConfiguration"}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateLockConfiguration","duration_seconds":0.000017201}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateConfigMap"}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateConfigMap","duration_seconds":0.000416808}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.checkClientCompatibility"}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.checkClientCompatibility","duration_seconds":0.000020701}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.deletePodsForBuggification"}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.deletePodsForBuggification","duration_seconds":0.041893981}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.replaceMisconfiguredProcessGroups"}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.replaceMisconfiguredProcessGroups","duration_seconds":0.00371097}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.replaceFailedProcessGroups"}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.replaceFailedProcessGroups","duration_seconds":0.0000191}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.addProcessGroups"}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.addProcessGroups","duration_seconds":0.0000218}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.addServices"}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.addServices","duration_seconds":0.0000226}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.addPVCs"}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.addPVCs","duration_seconds":0.000297105}
{"level":"info","ts":"2024-06-11T14:23:16Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.addPods"}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.addPods","duration_seconds":0.001235123}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.generateInitialClusterFile"}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.generateInitialClusterFile","duration_seconds":0.0000078}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.removeIncompatibleProcesses"}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.removeIncompatibleProcesses","duration_seconds":0.000006}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateSidecarVersions"}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateSidecarVersions","duration_seconds":0.0000114}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updatePodConfig"}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Unable to generate pod client","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updatePodConfig","processGroupID":"dc1-storage-1","message":"waiting for pod dc1/fdb-cluster-1/fdb-cluster-1-storage-1 to be assigned an IP"}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updatePodConfig","duration_seconds":0.030892177}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Delaying requeue for sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updatePodConfig","message":"Waiting for Pod to receive ConfigMap update","error":null}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateMetadata"}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateMetadata","duration_seconds":0.001147721}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateDatabaseConfiguration"}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateDatabaseConfiguration","duration_seconds":0.000113002}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.chooseRemovals"}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.chooseRemovals","duration_seconds":0.000189004}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.excludeProcesses"}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"current exclusions","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.excludeProcesses","exclusions":[]}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.excludeProcesses","duration_seconds":0.000069601}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.changeCoordinators"}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.changeCoordinators","duration_seconds":0.003387363}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.bounceProcesses"}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"ignore process group with missing process","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.bounceProcesses","processGroupID":"dc1-storage-1"}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"ignore process group with missing process","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.bounceProcesses","processGroupID":"dc1-storage-10"}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"ignore process group with missing process","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.bounceProcesses","processGroupID":"dc1-storage-11"}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.bounceProcesses","duration_seconds":0.000312006}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.maintenanceModeChecker"}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.maintenanceModeChecker","duration_seconds":0.000011601}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updatePods"}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updatePods","duration_seconds":0.00429598}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.removeProcessGroups"}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.removeProcessGroups","duration_seconds":0.000063902}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.removeServices"}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.removeServices","duration_seconds":0.0000081}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus"}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"skip updating fault domain for process group with missing process in FoundationDB cluster status","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","processGroupID":"dc1-storage-1"}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"skip updating fault domain for process group with missing process in FoundationDB cluster status","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","processGroupID":"dc1-storage-10"}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"skip updating fault domain for process group with missing process in FoundationDB cluster status","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","processGroupID":"dc1-storage-11"}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Disable taint feature","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","Disabled":true}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Unable to build pod client","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","processGroupID":"dc1-storage-1","message":"waiting for pod dc1/fdb-cluster-1/fdb-cluster-1-storage-1 to be assigned an IP"}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Unable to build pod client","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","processGroupID":"dc1-storage-10","message":"waiting for pod dc1/fdb-cluster-1/fdb-cluster-1-storage-10 to be assigned an IP"}
{"level":"info","ts":"2024-06-11T14:23:17Z","logger":"controller","msg":"Unable to build pod client","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","processGroupID":"dc1-storage-11","message":"waiting for pod dc1/fdb-cluster-1/fdb-cluster-1-storage-11 to be assigned an IP"}
{"level":"info","ts":"2024-06-11T14:23:18Z","logger":"controller","msg":"Has unhealthy process group","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","method":"CheckReconciliation","namespace":"dc1","cluster":"fdb-cluster-1","processGroupID":"dc1-storage-1","state":"HasUnhealthyProcess","conditions":["MissingProcesses","SidecarUnreachable","IncorrectConfigMap","PodPending"]}
{"level":"info","ts":"2024-06-11T14:23:18Z","logger":"controller","msg":"Has unhealthy process group","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","method":"CheckReconciliation","namespace":"dc1","cluster":"fdb-cluster-1","processGroupID":"dc1-storage-10","state":"HasUnhealthyProcess","conditions":["MissingProcesses","SidecarUnreachable","PodFailing","PodPending"]}
{"level":"info","ts":"2024-06-11T14:23:18Z","logger":"controller","msg":"Has unhealthy process group","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","method":"CheckReconciliation","namespace":"dc1","cluster":"fdb-cluster-1","processGroupID":"dc1-storage-11","state":"HasUnhealthyProcess","conditions":["MissingProcesses","SidecarUnreachable","PodFailing","PodPending"]}
{"level":"info","ts":"2024-06-11T14:23:18Z","logger":"controller","msg":"Not all process groups are reconciled","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","method":"CheckReconciliation","namespace":"dc1","cluster":"fdb-cluster-1","desiredProcessGroups":33,"reconciledProcessGroups":30}
{"level":"info","ts":"2024-06-11T14:23:18Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","duration_seconds":1.095778038}
{"level":"info","ts":"2024-06-11T14:23:18Z","logger":"controller","msg":"Cluster was not fully reconciled by reconciliation process","namespace":"dc1","cluster":"fdb-cluster-1","status":{"hasUnhealthyProcess":2},"CurrentGeneration":0,"OriginalGeneration":2,"DelayedRequeue":true}
{"level":"info","ts":"2024-06-11T14:23:18Z","logger":"controller","msg":"Reconciliation run finished","namespace":"dc1","cluster":"fdb-cluster-1","duration_seconds":2.583983514,"cacheStatus":true}
{"level":"info","ts":"2024-06-11T14:23:18Z","logger":"controller","msg":"Fetch machine-readable status for reconcilitation loop","namespace":"dc1","cluster":"fdb-cluster-1","cacheStatus":true}
{"level":"info","ts":"2024-06-11T14:23:18Z","logger":"controller","msg":"Trying connection options","namespace":"dc1","cluster":"fdb-cluster-1","connectionString":["fdb_cluster_1:oXYryIImmqVkRDlBiDhzT2xXUExHE3ng@10.113.181.140:4502:tls,10.113.181.144:4502:tls,10.113.181.145:4512:tls,10.113.181.146:4504:tls,10.113.181.148:4508:tls,10.113.181.149:4500:tls,10.113.181.150:4500:tls,10.113.181.158:4508:tls,10.113.181.202:4500:tls","fdb_cluster_1:hMJwS2BhvPLzaSB8aWKoHbITgOUSf6Rs@10.113.181.164:4500:tls,10.113.181.203:4500:tls,10.113.181.153:4500:tls,10.113.181.151:4500:tls,10.113.181.148:4500:tls"]}
{"level":"info","ts":"2024-06-11T14:23:18Z","logger":"controller","msg":"Attempting to get connection string from cluster","namespace":"dc1","cluster":"fdb-cluster-1","connectionString":"fdb_cluster_1:oXYryIImmqVkRDlBiDhzT2xXUExHE3ng@10.113.181.140:4502:tls,10.113.181.144:4502:tls,10.113.181.145:4512:tls,10.113.181.146:4504:tls,10.113.181.148:4508:tls,10.113.181.149:4500:tls,10.113.181.150:4500:tls,10.113.181.158:4508:tls,10.113.181.202:4500:tls"}
{"level":"info","ts":"2024-06-11T14:23:18Z","logger":"fdbclient","msg":"Fetch values from FDB","namespace":"dc1","cluster":"fdb-cluster-1","key":"\ufffd/coordinators"}
{"level":"info","ts":"2024-06-11T14:23:18Z","logger":"fdbclient","msg":"Done fetching values from FDB","namespace":"dc1","cluster":"fdb-cluster-1","key":"\ufffd/coordinators"}
{"level":"info","ts":"2024-06-11T14:23:18Z","logger":"controller","msg":"Chose connection option","namespace":"dc1","cluster":"fdb-cluster-1","connectionString":"fdb_cluster_1:oXYryIImmqVkRDlBiDhzT2xXUExHE3ng@10.113.181.140:4502:tls,10.113.181.144:4502:tls,10.113.181.145:4512:tls,10.113.181.146:4504:tls,10.113.181.148:4508:tls,10.113.181.149:4500:tls,10.113.181.150:4500:tls,10.113.181.158:4508:tls,10.113.181.202:4500:tls"}
{"level":"info","ts":"2024-06-11T14:23:18Z","logger":"fdbclient","msg":"Fetch values from FDB","namespace":"dc1","cluster":"fdb-cluster-1","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":"2024-06-11T14:23:18Z","logger":"fdbclient","msg":"Done fetching values from FDB","namespace":"dc1","cluster":"fdb-cluster-1","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":"2024-06-11T14:23:18Z","logger":"fdbclient","msg":"found cluster message(s) in the machine-readable status","namespace":"dc1","cluster":"fdb-cluster-1","messages":[{"name":"client_issues","description":"Some clients of this cluster have issues."}]}
{"level":"info","ts":"2024-06-11T14:23:18Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus"}
{"level":"info","ts":"2024-06-11T14:23:18Z","logger":"controller","msg":"skip updating fault domain for process group with missing process in FoundationDB cluster status","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","processGroupID":"dc1-storage-1"}
{"level":"info","ts":"2024-06-11T14:23:18Z","logger":"controller","msg":"skip updating fault domain for process group with missing process in FoundationDB cluster status","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","processGroupID":"dc1-storage-10"}
{"level":"info","ts":"2024-06-11T14:23:18Z","logger":"controller","msg":"skip updating fault domain for process group with missing process in FoundationDB cluster status","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","processGroupID":"dc1-storage-11"}
{"level":"info","ts":"2024-06-11T14:23:18Z","logger":"controller","msg":"Disable taint feature","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","Disabled":true}

I don’t see any log messages around Cluster has an unhealthy coordinator (fdb-kubernetes-operator/internal/locality/locality.go at main · FoundationDB/fdb-kubernetes-operator · GitHub). Based on the logs, you probably deleted dc1-storage-1, dc1-storage-10 and dc1-storage-11? Is this issue reproducible? If so, could you please provide either an e2e test or instructions on how to perform your test? Are the deleted pods stuck in Pending?
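For context, the unhealthy-coordinator check ultimately comes down to reading the coordinator reachability flags from the machine-readable status. Here is a minimal sketch of that idea in Python, assuming the usual `status json` layout where `client.coordinators.coordinators` lists each coordinator with an `address` and a `reachable` boolean; the helper name is mine, not the operator's:

```python
import json

def unreachable_coordinators(status_json: str) -> list[str]:
    """Return the addresses of coordinators flagged as unreachable
    in the machine-readable status ('status json' output)."""
    status = json.loads(status_json)
    coordinators = (
        status.get("client", {}).get("coordinators", {}).get("coordinators", [])
    )
    # A coordinator with reachable=false (or a missing flag) is reported.
    return [c["address"] for c in coordinators if not c.get("reachable", False)]

# Trimmed, illustrative status document (addresses are made up):
sample = json.dumps({
    "client": {"coordinators": {"coordinators": [
        {"address": "10.113.181.140:4502:tls", "reachable": True},
        {"address": "10.113.181.144:4502:tls", "reachable": False},
    ]}}
})
print(unreachable_coordinators(sample))  # ['10.113.181.144:4502:tls']
```

If the operator were seeing the same picture, you would expect the corresponding "unhealthy coordinator" log lines; their absence suggests the operator's view of the status differed from yours.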

Could you provide some more details on what you tested and what you observed? I created an e2e test case and I’m not able to reproduce the issue: Add e2e test for HA cluster to make sure coordinators are changed by johscheuer · Pull Request #2058 · FoundationDB/fdb-kubernetes-operator · GitHub

I am using a pod-per-node topology.
The previous operator logs were generated when storage servers were the coordinators; now I just let the log processes take this role for simplicity (sorry for using "simple" in an FDB forum :slight_smile: ).

  • When injecting a pod failure with chaos-mesh into the 3 log/coordinator pods, the coordinators are marked unreachable, but the operator does not react in any way; only if I delete the associated PVCs are storage servers recruited as coordinators.
  • The log pods are also left alone; I would assume the operator would try to create new log pods (a growth followed by a shrink). Also, after the previously injected errors are gone, the pod is marked as running again, but because the associated PVCs were deleted (and remain in Pending), the PVC is only recreated after the pods are restarted (it could be that I did not wait long enough for the PVC reconciler to do its thing…).
  • To reproduce, just replace the log pods in the file below:
kind: PodChaos
apiVersion: chaos-mesh.org/v1alpha1
metadata:
  namespace: dc1
  name: fdb-cluster-1 
spec:
  mode: all
  action: pod-failure
  duration: 120m
  selector:
    namespaces:
      - dc1
    pods:
      dc1:
      - fdb-cluster-1-log-1
      - fdb-cluster-1-log-2
      - fdb-cluster-1-log-3

Thanks for sharing the setup, that makes it easier to build an e2e test case for this. Have you configured the automatic replacements in some way, or do you use the default values? The default replacement time is 7200 seconds: fdb-kubernetes-operator/api/v1beta2/foundationdbcluster_types.go at main · FoundationDB/fdb-kubernetes-operator · GitHub. And can I assume that the cluster has enough pods running in that namespace, so that other pods could be recruited as coordinators?
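If you want to experiment with a shorter failure-detection window, the knob lives under the cluster spec's automation options. A hedged sketch of the relevant fragment, with field names as I understand the v1beta2 API (verify against your CRD version before applying):

```yaml
apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  name: fdb-cluster-1
  namespace: dc1
spec:
  automationOptions:
    replacements:
      enabled: true
      # Default is 7200 seconds (2 hours); lower it only for testing,
      # a too-aggressive value can trigger unnecessary replacements.
      failureDetectionTimeSeconds: 600
```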

I tried to reproduce that again with an e2e test case (here is the PR: Add another ha e2e test for pod failures by johscheuer · Pull Request #2061 · FoundationDB/fdb-kubernetes-operator · GitHub). In my case I don’t see any issues and the operator is recruiting new coordinators. The pods are not replaced in the test case as the time window is too small, but I can add that tomorrow.

If you could share some more logs and could verify that enough additional Pods are available, that would be helpful.

Hi @johscheuer, thanks for your help. Setting the replacement time to a lower value seems to fix our problem. Could you please share the reasoning behind the 2-hour default replacement time?

I spoke too early; it worked several times when the DB was empty…

There are enough pods (storage and stateless) for coordinators… I will provide logs tomorrow.

Based on my observations, the operator pod is not detecting that pods are down, hence no Cluster has an unhealthy coordinator messages; only if I restart the operator pod do messages about unhealthy coordinators start popping up, new coordinators get recruited (from storage servers), and new log pods get created.

Interesting, is the operator complaining about anything during startup? The operator should receive "events" when something changes on the managed Pods. I just checked the change log for the operator from 1.34 to the recent 1.40 and I don’t see any changes related to that logic, or changes in the controller-runtime :thinking: