Proposal: Don't identify a coordinator based on its IP

I could be wrong, but I think FDB already solves the specific problem you are trying to solve with unique IDs.

  1. fdbserver roles other than coordinators only expect their IP address to be stable while the process is actually running. Every time they start up, they register the IP address needed to reach them with the coordinators, the cluster controller service and/or the system keyspace. To the extent that their identities actually need to persist, they use various unique identifiers.

  2. #1 is tested in simulation by swapping data folders between processes (which should be isomorphic to swapping IP addresses).

  3. Coordinators need to have stable IP addresses, because that’s the only way we have to reach them via the cluster file. If the cluster file doesn’t contain enough correct IP addresses, an fdbserver won’t be able to join the cluster. (Perhaps just one correct address is enough, if a coordinator change has already been made to reflect the new IP addresses?) I don’t think this can be improved by using more unique identifiers, but see below.

  4. I think it is sound, however, for coordinators to swap or change IP addresses. A unique identifier is used to identify the cluster for which coordination information is stored, and I don’t think the consensus algorithm cares about the identity of different coordinators as long as the same coordinator isn’t reached by more than one address. As long as IP address changes occur sufficiently slowly, you could keep the cluster working indefinitely by just doing periodic coordinator changes.

  5. I’m not sure if we are adequately testing #4, however. The tricky part would be to not break availability in simulations by running into #3. @Evan?
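
To make #3 and #4 concrete, here is a rough Python sketch of the bookkeeping an operator might do: read the coordinator addresses out of the cluster file contents, see which ones no longer match where the coordinators actually live, and build an `fdbcli` coordinator change if any are stale. The cluster file layout (`description:ID@IP:PORT,IP:PORT,...`) and the `coordinators` command in `fdbcli` are the real ones as far as I know; the function names and the `current_addrs` input (wherever you get it from) are made up for illustration. The sketch only builds the command, it does not run it.

```python
def parse_cluster_file(contents):
    """Split 'description:ID@addr,addr,...' into (description, id, [addr, ...])."""
    line = contents.strip()
    prefix, sep, addr_part = line.partition("@")
    if not sep:
        raise ValueError("cluster file contents have no '@' separator")
    description, _, cluster_id = prefix.partition(":")
    # Addresses are kept as opaque strings; no attempt is made to validate them.
    addrs = [a.strip() for a in addr_part.split(",") if a.strip()]
    return description, cluster_id, addrs


def stale_coordinators(contents, current_addrs):
    """Addresses listed in the cluster file that are not in current_addrs."""
    _, _, listed = parse_cluster_file(contents)
    current = set(current_addrs)
    return [a for a in listed if a not in current]


def coordinator_change_command(contents, current_addrs):
    """Return an fdbcli argv for a coordinator change, or None if nothing is stale."""
    if not stale_coordinators(contents, current_addrs):
        return None
    return ["fdbcli", "--exec", "coordinators " + " ".join(current_addrs)]


cluster = "mycluster:abc123@10.0.0.1:4500,10.0.0.2:4500,10.0.0.3:4500"
current = ["10.0.0.1:4500", "10.0.0.2:4500", "10.0.0.9:4500"]
print(stale_coordinators(cluster, current))
print(coordinator_change_command(cluster, current))
```

The catch is the one from #3: `fdbcli` itself has to reach the cluster through the existing cluster file, so this only helps if addresses change slowly enough that the change can still be issued.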

The feature I would like to see for integration with external service discovery is a way to pass cluster file contents, or just coordinator IP addresses obtained from your service discovery system, to FDB servers and clients as a command line parameter, environment variable or (client) network option. FDB would attempt to contact these coordinators in parallel with the ones in the cluster file, and would then update its cluster file (as it does today) if a more up-to-date configuration is found. If service discovery is down or its information is out of date, it does no harm. Ideally, your FDB cluster then keeps working as long as either your service discovery is up or enough of your coordinators have kept their IP addresses. I would still recommend setting up coordinators with stable IP addresses if your orchestration system permits it, but this should make the best of a given situation.
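
Roughly, the behavior I have in mind could be sketched like this in Python. None of this is an existing FDB API: `probe` stands in for whatever "contact a coordinator" really means, and the `(version, contents)` reply is a placeholder for however FDB decides that one configuration is more up to date than another.

```python
from concurrent.futures import ThreadPoolExecutor


def candidate_coordinators(cluster_file_addrs, discovered_addrs):
    """Union of both sources, cluster file entries first, duplicates dropped."""
    seen, out = set(), []
    for addr in list(cluster_file_addrs) + list(discovered_addrs):
        if addr not in seen:
            seen.add(addr)
            out.append(addr)
    return out


def probe_all(addrs, probe):
    """Call probe(addr) for every address in parallel; keep non-None replies."""
    with ThreadPoolExecutor(max_workers=max(1, len(addrs))) as pool:
        replies = list(pool.map(probe, addrs))
    return {a: r for a, r in zip(addrs, replies) if r is not None}


def newest_config(replies):
    """Pick the contents with the highest (hypothetical) version, or None."""
    if not replies:
        return None
    return max(replies.values(), key=lambda reply: reply[0])[1]


# Toy data: one coordinator kept its address, one moved and is only known
# to service discovery, one cluster file entry is dead.
from_file = ["10.0.0.1:4500", "10.0.0.2:4500"]
from_discovery = ["10.0.0.1:4500", "10.0.0.9:4500"]
fake_network = {
    "10.0.0.1:4500": (1, "c:id@10.0.0.1:4500,10.0.0.2:4500"),
    "10.0.0.9:4500": (2, "c:id@10.0.0.1:4500,10.0.0.9:4500"),
}

addrs = candidate_coordinators(from_file, from_discovery)
replies = probe_all(addrs, fake_network.get)  # unreachable address -> None
print(newest_config(replies))
```

With an empty `from_discovery` this degrades to exactly what the cluster file alone would give you, which is the "does no harm" property.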