Multi-region cluster, "regular" cluster elections: how to debug?

I have deployed two sets of clusters that span multiple regions, and in both of them I'm regularly (i.e. 3 to 5 times a day) seeing issues with the cluster controller.
When I looked further into one cluster, I noticed this:

{"date":1745275934.36894,"Severity":"10","Roles":"CC","Type":"LeaderNoHeartbeat","ID":"xxx","ThreadID":"xxx","Coordinator":"10.1.2.3:4500","Machine":"10.1.2.3:4500","LogGroup":"fdb-cluster"}

More precisely, I have 6 messages like that in a row, which leads me to think that the controller failed to receive the heartbeat from 6 out of 9 coordinators (I suspect the ones in the remote region).
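One way to check that suspicion is to count the LeaderNoHeartbeat events per coordinator address and see whether they cluster on the remote-region coordinators. The following is a minimal sketch, assuming the trace logs are emitted as one JSON object per line with the `Type` and `Coordinator` fields shown in the sample above; the function name `count_no_heartbeat` is made up for illustration.

```python
import json
from collections import Counter


def count_no_heartbeat(lines, event_type="LeaderNoHeartbeat"):
    """Count events of the given Type per Coordinator address.

    Assumes one JSON object per line, with "Type" and "Coordinator"
    fields as in the sample log line; other lines are skipped.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a JSON trace line
        if not isinstance(event, dict):
            continue
        if event.get("Type") == event_type:
            counts[event.get("Coordinator", "unknown")] += 1
    return counts


# Example usage (the file name is hypothetical):
# with open("trace.json") as f:
#     for coordinator, n in count_no_heartbeat(f).most_common():
#         print(n, coordinator)
```

If the counts concentrate on the coordinators of one region, that points at the network or name resolution path towards that region rather than at the controller itself.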

I'm wondering if you have any clue on where to look for more details, i.e. whether the coordinator will also log something about not being able to send to the cluster controller.

Are you using DNS in the cluster file with the K8s operator? I’ve had a similar problem in the past, see this thread. For me it was related to DNS that didn’t return within the heartbeat timeout.
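To see whether slow DNS answers are a plausible cause, it can help to time the lookups of the coordinator hostnames from inside a pod. This is a rough sketch, not an FDB tool: the hostnames and the threshold are placeholders you would supply yourself (the actual heartbeat timeout is not stated here), and port 4500 is taken from the log line above. Note that a local resolver cache can hide slow upstream answers.

```python
import socket
import time


def time_lookup(host, port=4500, resolver=socket.getaddrinfo):
    """Resolve host once; return (elapsed seconds, sorted addresses).

    An empty address list means the lookup failed.
    """
    start = time.perf_counter()
    try:
        infos = resolver(host, port)
        addresses = sorted({info[4][0] for info in infos})
    except socket.gaierror:
        addresses = []
    elapsed = time.perf_counter() - start
    return elapsed, addresses


def slow_lookups(hosts, threshold_s, samples=5, resolver=socket.getaddrinfo):
    """Return {host: worst elapsed seconds} for hosts where any lookup
    failed or the slowest lookup took longer than threshold_s."""
    slow = {}
    for host in hosts:
        worst = 0.0
        failed = False
        for _ in range(samples):
            elapsed, addresses = time_lookup(host, resolver=resolver)
            worst = max(worst, elapsed)
            failed = failed or not addresses
        if failed or worst > threshold_s:
            slow[host] = worst
    return slow


# Example usage (hostname and threshold are hypothetical):
# print(slow_lookups(["coordinator-0.example.internal"], threshold_s=0.5))
```

Running this periodically and comparing the timestamps of slow lookups with the election times would show whether the two line up.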

We run multi-zone FDB clusters, and with our cloud provider (Azure) we have decided that DNS is our best course of action.

It doesn't always manifest with the LeaderNoHeartbeat log entry though, or maybe not right away. It seems that sometimes it starts with a GetLeaderReply whose nominee is set to 00000000, followed by some additional events that I'm not 100% clear on: I also see MonitorLeaderAndGetClientInfoLeaderChange and EndpointNotFound.

Today I deployed CoreDNS (https://coredns.io/) as a per-node cache (with a DaemonSet) and used Cilium to route DNS calls to the CoreDNS pod running on the same node. Instead of seeing 10+ elections, I haven't seen a single one since deploying this cache.

(Yes, it was DNS. It’s always DNS.)