We are migrating our FDB clusters from GCE instances to regional GKE with the k8s operator. The operator-based clusters are working fine and are really performant in our benchmarks.
However, the new clusters on k8s have regular recoveries, without the pods/processes changing. Our old GCE-based clusters keep a stable generation for weeks/months; these new operator-based clusters have a recovery every few hours, or even multiple times per hour.
Is there anything we could have misconfigured that is causing this? There are no crashing processes/roles. From the logs it looks like the coordinators elect a new leader, but I cannot find the reason why.
Have you checked whether the operator is performing any changes to the FDB cluster, like coordinator changes or database configuration changes? Or exclusions; those could cause a recovery. Based on your comment that the pods are the same, it’s probably not an exclusion.
I’m not too familiar with GKE (at least not anymore). Is there a dashboard for network latencies?
Yes, the operator itself isn’t doing anything around the time the recoveries happen.
I’ve tried some of those queries you linked; it’s always the cluster controller that triggers the recovery by releasing its leadership. Then a new cluster controller is elected, which recruits a new master, which is the recovery. Looking at the code, the heartbeats that cause the cluster controller to release its leadership go to the coordinators, and for some reason they don’t reply within 2s (the default POLLING_FREQUENCY). I cannot find a reason for them not replying in the logs.
I had a look at network latencies between the VMs and did not find any noticeable spikes around the timestamps where the cluster controller heartbeats fail. Median latency is consistently less than 1 ms.
There are no logs with high severity; the only warning-level events are an occasional Net2RunLoopTrace and SlowSSLoopx100.
I suspect the Cluster Controller is doing some CPU-intensive work that takes a long time, thus delaying the processing of LeaderHeartbeatReply. Because these replies are not processed in time, the CC “thought” the heartbeat failed.
To validate this hypothesis, you can check the CPU usage of the CC around the time the recoveries happened. Additionally, if there are Net2RunLoopTrace events in the CC’s log, you can check the call stack of the trace, which can give clues about where the CC is spending CPU time. Finally, we know status json processing can take a long time on the CC, so it’s worth checking if some client was issuing status json at the time.
SlowSSLoopx100 is a storage server event. Is the CC colocated with a storage server (SS)? If so, it might be that the SS was CPU-busy and didn’t give many CPU cycles to the CC. The CC should be placed on stateless processes, not with SSes.
I’ll remove the CPU limits to make sure, but I also see this on an empty cluster without any clients (except for status json from monitoring every 10s). There is no CPU usage on the CC and definitely no CPU throttling according to our cAdvisor metrics. Is it possible that the status json every 10s has an impact on an empty cluster without it showing up in CPU usage? I can try stopping the monitoring to test.
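For reference, this is roughly how I plan to drop the CPU limit through the operator’s pod template (a sketch against the v1beta2 FoundationDBCluster CRD; the cluster name and resource values are just placeholders for our setup):

apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  name: example-cluster
spec:
  processes:
    general:
      podTemplate:
        spec:
          containers:
            - name: foundationdb
              resources:
                requests:
                  cpu: "4"
                  memory: 8Gi
                limits:
                  # no cpu entry here, so the container is never CFS-throttled;
                  # memory stays bounded
                  memory: 8Gi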
All roles have their own process; however, the coordinators are on storage pods, so the CC is its own pod. Could coordinator (+storage) slowness result in this behaviour?
Ok, I’ll try this out. I thought they needed to be on pods with disks (storage or log), so I’d have to add disks to all stateless processes, which the operator doesn’t do by default. Again, on an empty cluster without any storage pod CPU usage I see this behaviour. Edit: see below, the CC is separate on stateless pods.
A screenshot of how “unstable” the CC is, although in the 12h before this there were no changes in roles.
CC should be placed on stateless processes, not with SSes.
Oops, ignore (part of) my previous comment, I was confused with the coordinators. The CC is its own pod and runs on stateless pods (without disks). Coordinators do live with the SSes (which I assume is ok according to the operator & FDB docs).
A Net2RunLoopTrace that often comes back on the CC contains this backtrace: addr2line -e fdbserver.debug -p -C -f -i 0x793bf44a3630 0x793bf44a0aa1 0x43cbdc4 0x43cda55 0xdd29f6 0x793bf40e8555 0xe399d2
bash-4.2# addr2line -e /usr/bin/fdbserver -p -C -f -i 0x793bf44a3630 0x793bf44a0aa1 0x43cbdc4 0x43cda55 0xdd29f6 0x793bf40e8555 0xe399d2
?? ??:0
?? ??:0
N2::ASIOReactor::sleep(double) at ??:?
(inlined by) boost::asio::basic_deadline_timer<boost::posix_time::ptime, boost::asio::time_traits<boost::posix_time::ptime>, boost::asio::any_io_executor>::cancel() at /opt/boost_1_78_0/include/boost/asio/basic_deadline_timer.hpp:348
(inlined by) N2::ASIOReactor::sleep(double) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/Net2.actor.cpp:2089
N2::Net2::run() at ??:?
(inlined by) MetricHandle<ContinuousMetric<bool> >::operator=(bool const&) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/TDMetric.actor.h:1374
(inlined by) N2::Net2::run() at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/Net2.actor.cpp:1516
main at ??:?
?? ??:0
_start at ??:?
Not sure how to interpret this …
At this point the only differences from our previous architecture are the VM types (N1 vs C3/N4), the disks (pd-ssd vs hyperdisk-balanced), and the use of DNS in the cluster file.
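For context, DNS in the cluster file is something we turn on through the operator’s routing options, roughly like this (a sketch; field names assuming the v1beta2 CRD, with the operator’s headless service providing the pod DNS names):

spec:
  routing:
    headlessService: true       # gives each pod a stable DNS name via the service
    useDNSInClusterFile: true   # list coordinators by hostname instead of IP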
The problem is related to DNS: for each heartbeat to the coordinators, the CC does both an A and an AAAA query for the coordinator hostnames. The problem occurs when the A record does return but, for some reason, the AAAA does not (the AAAA response is always empty anyway, as we have an IPv4-only cluster).
Looking at the code, there should be a retry after HOSTNAME_RESOLVE_INIT_INTERVAL in resolveWithRetryImpl, but that doesn’t seem to work in this case. I can send the tcpdump in case anyone is interested.
I will try whether RESOLVE_PREFER_IPV4_ADDR = true fixes this issue in our environment.
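In case it helps someone else: I’m setting the knob through the operator’s customParameters, roughly like this (a sketch, assuming the v1beta2 CRD; as far as I understand, the operator turns each entry into a --knob_... argument for fdbserver):

spec:
  processes:
    general:
      customParameters:
        # passed to fdbserver as --knob_resolve_prefer_ipv4_addr=true
        - "knob_resolve_prefer_ipv4_addr=true"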
RESOLVE_PREFER_IPV4_ADDR = true didn’t fix the problem. It’s also not related to A vs AAAA queries, but that does complicate the issue a bit, as there’s a higher probability for the hostname resolve to fail/hang.
I was able to fix this by adding a dnsConfig option to all pods:
dnsConfig:
  options:
    - name: timeout
      value: "1"
The default in GKE is a timeout of 2 (seconds).
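Since the operator creates the pods, I applied this through the pod template in the cluster spec so every FDB pod picks it up (a sketch, assuming the v1beta2 CRD):

spec:
  processes:
    general:
      podTemplate:
        spec:
          dnsConfig:
            options:
              # merged into /etc/resolv.conf of every FDB pod; a hung
              # query now times out after 1s and the resolver retries
              - name: timeout
                value: "1"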
This does raise a few questions:
I was under the assumption that FDB itself should be responsible for retrying these DNS requests if it doesn’t get a reply within HOSTNAME_RESOLVE_INIT_INTERVAL, see resolveWithRetryImpl: foundationdb/flow/Hostname.actor.cpp at release-7.1 · apple/foundationdb · GitHub. I guess my assumption was wrong: FDB only retries after the delay if the DNS resolve fails, not when it hangs.
Unsure about this, but I cannot explain it otherwise: because these heartbeats happen sequentially and one DNS query can hang, it looks like the rest of the heartbeats don’t happen. This causes the heartbeat quorum not to be reached and the CC to release its leadership. So a single failed (hanging) heartbeat results in a recovery, since the rest of the heartbeats are potentially not executed.
Thanks for the report. I haven’t looked at the DNS resolution path in detail, but I will take some time to do so. I thought that the DNS entries are only used once during the cluster connection and that after that only IP addresses should be used, so I’m a bit surprised that the DNS records are queried multiple times. CC @jzhou