We are migrating our FDB clusters from GCE instances to regional GKE with the k8s operator. The operator-based clusters are working fine and are really performant in our benchmarks.
However, the new clusters on k8s have regular recoveries, without the pods/processes changing. Our old GCE-based clusters keep a stable generation for weeks/months; these new operator-based clusters have a recovery every few hours, or even multiple times per hour.
Is there anything we could have misconfigured that is causing this? There are no crashing processes/roles. From the logs it looks like the coordinators elect a new leader, but I cannot find the reason why.
Have you checked whether the operator is performing any changes to the FDB cluster, like coordinator changes or database configuration changes? Or exclusions; those could cause a recovery. Based on your comment that the pods are the same, it’s probably not an exclusion.
I’m not too familiar with GKE (at least not anymore). Is there a dashboard for network latencies?
Yes, the operator itself isn’t doing anything around the time the recoveries happen.
I’ve tried some of those queries you linked; it’s always the cluster controller that triggers the recovery by releasing its leadership. Then a new cluster controller is elected, which recruits a new master, which is the recovery. Looking at the code, the heartbeats that cause the cluster controller to release its leadership go to the coordinators, and for some reason they don’t reply within 2s (the default POLLING_FREQUENCY). I cannot find a reason for them not replying in the logs.
I had a look at network latencies between the VMs and did not find any noticeable spikes around the timestamps where the cluster controller heartbeats fail. Median latency is consistently less than 1 ms.
There are no logs with high severity; the only warning-level events are an occasional Net2RunLoopTrace and SlowSSLoopx100.
I suspect the Cluster Controller is doing some CPU-intensive work that takes a long time, thus delaying the processing of LeaderHeartbeatReply. Because these replies are not processed in time, the CC “thought” the heartbeat failed.
To validate this hypothesis, you can check the CPU usage of the CC around the time the recoveries happened. Additionally, if there are Net2RunLoopTrace events in the CC’s log, you can check the call stack of the trace, which can give clues about where the CC is spending CPU time. Finally, we know status json processing can take a long time on the CC, so it’s worth checking if some client was issuing status json at the time.
SlowSSLoopx100 is a storage server event. Is the CC colocated with a storage server (SS)? If so, it might be that the SS was CPU-busy and didn’t give many CPU cycles to the CC. The CC should be placed on stateless processes, not with SSes.
I’ll remove the CPU limits to make sure, but I also see this on an empty cluster without any clients (except for status json from monitoring every 10s). There is no CPU usage on the CC and definitely no CPU throttling according to our cAdvisor metrics. Is it possible that the status json every 10s has an impact on an empty cluster without it showing up in CPU usage? I can try stopping the monitoring to test.
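For reference, this is roughly how I plan to drop the CPU limit through the operator’s pod template (a sketch against the v1beta2 FoundationDBCluster CRD; the cluster name and resource values are just placeholders for our setup):

apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  name: example-cluster
spec:
  processes:
    general:
      podTemplate:
        spec:
          containers:
            - name: foundationdb
              resources:
                requests:
                  cpu: "4"
                  memory: 8Gi
                limits:
                  # no cpu entry here, so the container is never CFS-throttled;
                  # memory stays bounded
                  memory: 8Gi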
All roles have their own process; however, the coordinators are on storage pods, so the CC is its own pod. Could coordinator (+storage) slowness result in this behaviour?
Ok, I’ll try this out. I thought they needed to be on pods with disks (storage or log), so I’d have to add disks to all stateless processes, which the operator doesn’t do by default. Again, on an empty cluster without any storage pod CPU usage I see this behaviour. Edit: see below, the CC is separate on stateless pods.
A screenshot of how “unstable” the CC is, although in the 12h before this there were no changes in roles.
CC should be placed on stateless processes, not with SSes.
Oops, ignore (part of) my previous comment, I was confused with the coordinators. The CC is its own pod and runs on stateless pods (without disks). Coordinators do live with the SSes (which I assume is ok according to the operator & FDB docs).
A Net2RunLoopTrace that often comes back on the CC contains this backtrace: addr2line -e fdbserver.debug -p -C -f -i 0x793bf44a3630 0x793bf44a0aa1 0x43cbdc4 0x43cda55 0xdd29f6 0x793bf40e8555 0xe399d2
bash-4.2# addr2line -e /usr/bin/fdbserver -p -C -f -i 0x793bf44a3630 0x793bf44a0aa1 0x43cbdc4 0x43cda55 0xdd29f6 0x793bf40e8555 0xe399d2
?? ??:0
?? ??:0
N2::ASIOReactor::sleep(double) at ??:?
(inlined by) boost::asio::basic_deadline_timer<boost::posix_time::ptime, boost::asio::time_traits<boost::posix_time::ptime>, boost::asio::any_io_executor>::cancel() at /opt/boost_1_78_0/include/boost/asio/basic_deadline_timer.hpp:348
(inlined by) N2::ASIOReactor::sleep(double) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/Net2.actor.cpp:2089
N2::Net2::run() at ??:?
(inlined by) MetricHandle<ContinuousMetric<bool> >::operator=(bool const&) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/TDMetric.actor.h:1374
(inlined by) N2::Net2::run() at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/Net2.actor.cpp:1516
main at ??:?
?? ??:0
_start at ??:?
Not sure how to interpret this …
At this point the only differences from our previous architecture are the VM types (N1 vs C3/N4), the disks (pd-ssd vs hyperdisk-balanced), and the use of DNS in the cluster file.
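For context, DNS in the cluster file is something we turn on through the operator’s routing options, roughly like this (a sketch; field names assuming the v1beta2 CRD, with the operator’s headless service providing the pod DNS names):

spec:
  routing:
    headlessService: true       # gives each pod a stable DNS name via the service
    useDNSInClusterFile: true   # list coordinators by hostname instead of IP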
The problem is related to DNS: for each heartbeat to the coordinators, the CC does both an A and an AAAA query for the coordinator hostnames. The problem occurs when the A record does return but, for some reason, the AAAA does not (the AAAA response is always empty anyway, as we have an IPv4-only cluster).
Looking at the code, there should be a retry after HOSTNAME_RESOLVE_INIT_INTERVAL in resolveWithRetryImpl, but that doesn’t seem to work in this case. I can send the tcpdump in case anyone is interested.
I will try whether RESOLVE_PREFER_IPV4_ADDR = true fixes this issue in our environment.
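In case it helps someone else: I’m setting the knob through the operator’s customParameters, roughly like this (a sketch, assuming the v1beta2 CRD; as far as I understand, the operator turns each entry into a --knob_... argument for fdbserver):

spec:
  processes:
    general:
      customParameters:
        # passed to fdbserver as --knob_resolve_prefer_ipv4_addr=true
        - "knob_resolve_prefer_ipv4_addr=true"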
RESOLVE_PREFER_IPV4_ADDR = true didn’t fix the problem. It’s also not related to A vs AAAA queries, but that does complicate the issue a bit, as there’s a higher probability for the hostname resolve to fail/hang.
I was able to fix this by adding a dnsConfig option to all pods:
dnsConfig:
  options:
    - name: timeout
      value: "1"
The default in GKE is a timeout of 2 (seconds).
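Since the operator creates the pods, I applied this through the pod template in the cluster spec so every FDB pod picks it up (a sketch, assuming the v1beta2 CRD):

spec:
  processes:
    general:
      podTemplate:
        spec:
          dnsConfig:
            options:
              # merged into /etc/resolv.conf of every FDB pod; a hung
              # query now times out after 1s and the resolver retries
              - name: timeout
                value: "1"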
This does raise a few questions:
I was under the assumption that FDB itself should be responsible for retrying these DNS requests if it doesn’t get a reply within HOSTNAME_RESOLVE_INIT_INTERVAL, see resolveWithRetryImpl: foundationdb/flow/Hostname.actor.cpp at release-7.1 · apple/foundationdb · GitHub. I guess my assumption was wrong: FDB only retries after the delay if the DNS resolve fails, not when it hangs.
Unsure about this, but I cannot explain it otherwise: because these heartbeats happen sequentially and one DNS query can hang, it looks like the rest of the heartbeats don’t happen. This causes the heartbeat quorum not to be reached and the CC to release its leadership. So a single failed (hanging) heartbeat results in a recovery, since the rest of the heartbeats are potentially not executed.
Thanks for the report. I haven’t looked at the DNS resolution path in detail, but I will take some time to do so. I thought that the DNS entries are only used once during the cluster connection and that after that only IP addresses should be used, so I’m a bit surprised that the DNS records are queried multiple times. CC @jzhou