Multi region cluster, "regular" cluster elections: how to debug?

mpatou_openai · April 22, 2025, 4:09am

I have deployed two sets of clusters that span multiple regions, and in both of them I’m regularly (i.e. 3 to 5 times a day) seeing issues with the cluster controller.
When I looked further at one cluster I noticed this:

{"date":1745275934.36894,"Severity":"10","Roles":"CC","Type":"LeaderNoHeartbeat","ID":"xxx","ThreadID":"xxx","Coordinator":"10.1.2.3:4500","Machine":"10.1.2.3:4500","LogGroup":"fdb-cluster"}

More precisely, I have 6 messages like that in a row, which leads me to think that the controller failed to receive the heartbeat from 6 out of 9 coordinators (I suspect the ones in the remote region).
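One way to check that suspicion is to count the LeaderNoHeartbeat events per coordinator address. Below is a minimal sketch, assuming the trace events are available as one JSON object per line in the same shape as the entry above (with `Type` and `Coordinator` fields). The function name and the sample addresses are hypothetical.

```python
import json
from collections import Counter


def count_missed_heartbeats(lines):
    """Count LeaderNoHeartbeat events per coordinator address."""
    counts = Counter()
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip blank or non-JSON lines
        if not isinstance(event, dict):
            continue
        if event.get("Type") == "LeaderNoHeartbeat":
            counts[event.get("Coordinator", "unknown")] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample; in practice, iterate over the lines of a log file.
    sample = [
        '{"date":1745275934.36,"Type":"LeaderNoHeartbeat","Coordinator":"10.1.2.3:4500"}',
        '{"date":1745275934.37,"Type":"LeaderNoHeartbeat","Coordinator":"10.9.8.7:4500"}',
        '{"date":1745275934.38,"Type":"SomethingElse","Coordinator":"10.1.2.3:4500"}',
    ]
    for coordinator, n in count_missed_heartbeats(sample).most_common():
        print(coordinator, n)
```

If the addresses that show up most often all belong to the remote region, that would support the theory; if they are spread evenly, the problem is probably not specific to cross-region links.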

I’m wondering if you have any clue on where to look for more details, i.e. whether the coordinators will also log something about not being able to reach the cluster controller.

jlemaes · April 24, 2025, 11:06am

Are you using DNS in the cluster file with the K8s operator? I’ve had a similar problem in the past, see this thread. For me it was related to DNS lookups that didn’t return within the heartbeat timeout.
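A quick way to test for slow DNS is to time repeated lookups of the coordinator hostnames from inside a pod. This is only a rough sketch: the hostnames and the threshold are placeholders (the actual heartbeat timeout is not stated here), and `socket.getaddrinfo` goes through whatever resolver and caching the host is configured with.

```python
import socket
import time


def time_lookup(host, port=4500, resolver=socket.getaddrinfo):
    """Resolve host once; return (succeeded, elapsed_seconds)."""
    start = time.monotonic()
    try:
        resolver(host, port)
        ok = True
    except OSError:  # socket.gaierror is a subclass of OSError
        ok = False
    return ok, time.monotonic() - start


def slow_lookups(hosts, threshold_s, attempts=5, resolver=socket.getaddrinfo):
    """Return (host, succeeded, elapsed) for lookups that failed or were slow."""
    slow = []
    for host in hosts:
        for _ in range(attempts):
            ok, elapsed = time_lookup(host, resolver=resolver)
            if not ok or elapsed > threshold_s:
                slow.append((host, ok, elapsed))
    return slow


if __name__ == "__main__":
    # Hypothetical hostname; use the names from your own cluster file.
    hosts = ["coordinator-0.example.internal"]
    # Assumed threshold, pick one below your heartbeat timeout.
    for host, ok, elapsed in slow_lookups(hosts, threshold_s=0.5):
        print(f"{host}: ok={ok} elapsed={elapsed:.3f}s")
```

Running this in a loop over a day and correlating the slow or failed lookups with the election timestamps should show whether DNS is involved.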

mpatou_openai · April 24, 2025, 3:28pm

We run multi-zone FDB clusters, and with our cloud provider (Azure) we have decided that DNS is our best course of action.

mpatou_openai · April 25, 2025, 5:06am

It’s not always manifesting with the log entry LeaderNoHeartbeat though, or maybe not right away. It seems that sometimes it starts with a GetLeaderReply whose nominee is set to 00000000, along with some additional events that I’m not 100% clear about: I noticed that I’m getting MonitorLeaderAndGetClientInfoLeaderChange and also EndpointNotFound.
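To see which of these events comes first, it can help to pull them into a single ordered timeline. A minimal sketch, assuming every event is a JSON line with `date` and `Type` fields like the LeaderNoHeartbeat entry earlier in the thread; the function name and the sample data are hypothetical.

```python
import json

# Event types mentioned in this thread.
WATCHED_TYPES = {
    "LeaderNoHeartbeat",
    "GetLeaderReply",
    "MonitorLeaderAndGetClientInfoLeaderChange",
    "EndpointNotFound",
}


def build_timeline(lines, watched=WATCHED_TYPES):
    """Return (seconds_since_first_event, Type) pairs sorted by time."""
    events = []
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip blank or non-JSON lines
        if isinstance(event, dict) and event.get("Type") in watched and "date" in event:
            events.append((float(event["date"]), event["Type"]))
    events.sort()
    if not events:
        return []
    first = events[0][0]
    return [(round(ts - first, 3), kind) for ts, kind in events]


if __name__ == "__main__":
    # Hypothetical sample lines, deliberately out of order.
    sample = [
        '{"date":1745275934.5,"Type":"LeaderNoHeartbeat"}',
        '{"date":1745275934.0,"Type":"GetLeaderReply"}',
        '{"date":1745275934.25,"Type":"EndpointNotFound"}',
    ]
    for offset, kind in build_timeline(sample):
        print(f"+{offset:.3f}s {kind}")
```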

Today I deployed CoreDNS (https://coredns.io/) as a per-node cache (with a DaemonSet) and used Cilium to route DNS calls to the CoreDNS pod running on the same node. Instead of seeing 10+ elections, I haven’t seen one since deploying this cache.

tudor · May 6, 2025, 1:28am

(Yes, it was DNS. It’s always DNS.)
