FDB regular recovery with operator

jlemaes · August 20, 2024, 6:22pm

Setting RESOLVE_PREFER_IPV4_ADDR = true didn’t fix the problem. The issue is also not caused by A vs. AAAA queries as such, although they do complicate it a bit, since there is a higher probability that the hostname resolution fails or hangs.

I was able to fix this by adding a dnsConfig option to all pods:

dnsConfig:
  options:
    - name: timeout
      value: "1"

The default in GKE is a timeout of 2 (seconds).

This does raise a few questions:

I was under the assumption that FDB itself is responsible for retrying these DNS requests if it has no reply after HOSTNAME_RESOLVE_INIT_INTERVAL; see resolveWithRetryImpl in foundationdb/flow/Hostname.actor.cpp at release-7.1 (apple/foundationdb on GitHub). I guess my assumption was wrong, and FDB only retries after the delay if the DNS resolve fails, not if it hangs.
I’m unsure about this, but I cannot explain it otherwise: because these heartbeats happen sequentially and a single DNS query can hang, it looks like the remaining heartbeats don’t happen. As a result, the heartbeat quorum is not reached and the CC releases its leadership. So a single failed (hanging) heartbeat results in a recovery, because the rest of the heartbeats are potentially never executed.
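
To make the second point concrete, here is a toy model of what I think is happening. It is not FDB's actual code: the function names, the coordinator names, and all timings are made up for illustration, and it uses a simulated clock instead of real DNS. It only shows how one slow lookup at the front of a sequential loop can use up the whole heartbeat deadline, and how capping each lookup lets the remaining heartbeats go out.

```python
def sequential_heartbeats(lookup_seconds, deadline, per_lookup_timeout=None):
    """Simulate heartbeating coordinators one after another.

    lookup_seconds: dict of host -> simulated DNS lookup duration in seconds.
    deadline: total time budget for the whole heartbeat round.
    per_lookup_timeout: optional cap on a single lookup (hypothetical knob,
        playing the role of a short resolver timeout).
    Returns the hosts whose heartbeat went out before the deadline.
    """
    clock = 0.0
    reached = []
    for host, cost in lookup_seconds.items():
        if per_lookup_timeout is not None and cost > per_lookup_timeout:
            # Give up on this lookup after the cap and move on to the next host.
            clock += per_lookup_timeout
            continue
        clock += cost
        if clock > deadline:
            # The round ran out of time; later hosts are never contacted.
            break
        reached.append(host)
    return reached


def has_quorum(reached, total):
    """A strict majority of coordinators answered."""
    return len(reached) > total // 2


# Hypothetical timings: the first lookup hangs, the other two are fast.
lookups = {"coord-0": 5.0, "coord-1": 0.01, "coord-2": 0.01}

blocked = sequential_heartbeats(lookups, deadline=1.0)
capped = sequential_heartbeats(lookups, deadline=1.0, per_lookup_timeout=0.5)

print(blocked, has_quorum(blocked, len(lookups)))
print(capped, has_quorum(capped, len(lookups)))
```

In this model the uncapped run reaches no coordinator at all, while the capped run still reaches two out of three. That would match what I see after lowering the resolver timeout, but whether the real heartbeat path behaves like this is exactly what I am asking.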
