FDB regular recovery with operator

RESOLVE_PREFER_IPV4_ADDR = true didn’t fix the problem. It’s also not related to A vs AAAA queries but that does complicates the issue a bit as there’s higher probability for the hostname resolve to fail/hang.

I was able to fix this by adding a dnsconfig option to all pods:

dnsConfig:
  options:
    - name: timeout
      value: "1"

The default in GKE is a timeout of 2 (seconds).

This does raise a question a few questions:

  • I was under the assumption the FDB should be responsible itself to retry these DNS requests if it doesn’t have a reply after HOSTNAME_RESOLVE_INIT_INTERVAL, see resolveWithRetryImpl: foundationdb/flow/Hostname.actor.cpp at release-7.1 · apple/foundationdb · GitHub I guess my assumption was wrong and only if the dns resolve fails, FDB will retry it after the delay.
  • Unsure about this but I cannot explain it otherwise: Because these heartbeats happen sequentially and one dns query can hang, it looks like the rest of the heartbeats don’t happen. This causes the quorum of heartbeats not being reached and CC releasing its leadership. So a single failed(hanging) heartbeat results in a recovery, the rest of the heartbeats are potentially not executed.