Proposal: Don't identify a coordinator based on its IP

As many people have pointed out, deploying FoundationDB with a container-orchestration tool such as Kubernetes or Docker Swarm can be problematic for multiple reasons. One of the main problems is the unstable nature of IP addresses. Kubernetes provides stable network IDs with its StatefulSet, but when a Pod restarts, its IP address will almost certainly change.

There is a suggestion to use hostnames instead and thus rely on DNS to uniquely identify nodes. However, although it would solve the problem on Kubernetes/Swarm, this change would introduce another potential source of failure (see wwilson's answer).

My proposal is to uniquely identify nodes with an arbitrary string that would either be generated by FoundationDB when a node starts for the first time or be defined in the config file.
The main benefit of such a solution is that it does not rely on moving parts, such as IP addresses or DNS names, to identify a node. I think that what defines a node’s identity is the data it carries, no matter its IP address or hostname. Moreover, because it is an arbitrary string, it would be flexible enough to still allow people to use IP addresses or DNS names if they would like to.

Now one of the problems is how to reach these nodes. Here is an example of an fdb.cluster file:

<desc>:<id>@urn:node:<uuid>:<ip>:<port>:(<tls>),urn:node:<uuid>:<ip>:<port>:(<tls>)

Since the IP, port, and TLS suffix can change, the canonical ID would be urn:node:<uuid>.
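
To make this concrete, here is a minimal sketch (in Go, purely illustrative; the urn:node: scheme is my proposal and nothing FDB supports today) of splitting one entry into its stable canonical ID and its mutable address part:

```go
package main

import (
	"fmt"
	"strings"
)

// peerEntry holds one entry of the proposed format
// urn:node:<uuid>:<ip>:<port>. The optional TLS suffix and IPv6
// addresses are ignored for brevity.
type peerEntry struct {
	CanonicalID string // urn:node:<uuid> — the stable identity
	IP          string // mutable
	Port        string // mutable
}

// parseEntry separates the stable part from the mutable parts.
func parseEntry(entry string) (peerEntry, error) {
	parts := strings.Split(entry, ":")
	if len(parts) < 5 || parts[0] != "urn" || parts[1] != "node" {
		return peerEntry{}, fmt.Errorf("malformed entry: %q", entry)
	}
	return peerEntry{
		CanonicalID: strings.Join(parts[:3], ":"), // urn:node:<uuid>
		IP:          parts[3],
		Port:        parts[4],
	}, nil
}

func main() {
	e, _ := parseEntry("urn:node:1b4e28ba-2fa1-11d2-883f-0016d3cca427:192.168.1.1:4500")
	fmt.Println(e.CanonicalID, e.IP, e.Port)
}
```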

I am not familiar with the internal implementation of FoundationDB, so I cannot assess the impact of such a change, but I am aware that this would introduce new challenges. For example, what happens when node B takes over the IP address of node A, which just left the cluster? Requests between nodes would have to contain this unique identifier so the recipient can be verified.

I would be interested to hear what the community thinks about this, whether they run FoundationDB in a container or not.

Related issues/threads:

I could be wrong, but I think FDB already solves the specific problem you are trying to solve with unique IDs.

  1. fdbserver roles other than coordinators only expect their IP address to be stable while the process is actually running. Every time they start up, they register the IP address needed to reach them with the coordinators, the cluster controller service, and/or the system keyspace, and to the extent that their identities actually need to persist, they use various unique identifiers.

  2. #1 is tested in simulation by swapping data folders (which should be isomorphic to swapping IP addresses)

  3. Coordinators need to have stable IP addresses, because that’s the only way we have to reach them via the cluster file. If the cluster file doesn’t contain enough correct IP addresses (perhaps just one if a coordinator change has occurred to reflect the new IP addresses?), an fdbserver won’t be able to join the cluster. I don’t think this can be improved by using more unique identifiers, but see below.

  4. I think it is sound, however, for coordinators to swap or change IP addresses. A unique identifier is used to identify the cluster for which coordination information is stored, and I don’t think the consensus algorithm cares about the identity of different coordinators as long as the same coordinator isn’t reached by more than one address. As long as IP address changes occur sufficiently slowly, you could keep the cluster working indefinitely by just doing periodic coordinator changes.

  5. I’m not sure if we are adequately testing #4, however. The tricky part would be to not break availability in simulations by running into #3. @Evan?

The feature that I would like to see for integration with external service discovery is to make it so that, as a command line parameter, environment variable or (client) network option you can pass cluster file contents or just coordinator IP addresses obtained from your service discovery system to FDB servers and clients. FDB attempts to contact these coordinators in parallel with the ones in the cluster file, and then updates its cluster file (as it does today) if a more up to date configuration is found. If service discovery is down or the information is out of date it does no harm. Ideally then your FDB cluster keeps working if either your service discovery is up or enough of your coordinators have kept their IP addresses. I would still recommend setting up coordinators with stable IP addresses if your orchestration system permits this, but this should make the best of a given situation.

First off, thanks for your prompt answer @dave!

I would like to rectify my proposal. When I mentioned nodes, I was actually referring to coordinators. I deployed a 3-node cluster, and because all of the nodes act both as regular nodes and coordinators, I mixed up the wording. I have changed the title.

Coordinators need to have stable IP addresses, because that’s the only way we have to reach them via the cluster file. […] I don’t think this can be improved by using more unique identifiers.

Basically I see two cases where the IP addresses should be handled differently:

  1. Contact the cluster coordinators (when a node starts)
  2. Ongoing communication among peers

Case 1
I agree that unique identifiers won’t solve the problem in this specific context. We need to rely on stable IP addresses at some point to contact at least one coordinator to join the cluster.

That means the fdb.cluster file could probably stay as is: <desc>:<id>@192.168.1.1:4500,192.168.1.2:4500

Case 2
Here is the part where I think unique identifiers would solve a problem. But first I would like to stress that I’m fairly new to FoundationDB, so feel free to correct me if I’m wrong here.

I made the assumption that a node relies solely on the fdb.cluster file to find coordinators in the cluster.

Scenario: a cluster of 4 nodes, 3 of which are coordinators.
If two Pods (coordinators) are restarted by Kubernetes (e.g. after an update or a crash), they will re-appear with different IP addresses and the cluster is now blocked. I cannot update the list of coordinators anymore because there is no quorum, and a force eviction won’t work because it is a coordinator. As far as I am aware, Kubernetes does not allow me to assign a static IP to fix the issue. The cluster is basically locked forever.

Using an arbitrary string to uniquely identify a coordinator would solve this problem. When a node or a coordinator (re-)joins a cluster, it would share its IP/port and ID with the cluster. The cluster would then share a table with all peers that would look like this:

| ID | Type | IP          | Port |
|----|------|-------------|------|
| 1  | C    | 192.168.1.1 | 4500 |
| 2  | C    | 192.168.1.2 | 4500 |
| 3  | C    | 192.168.1.3 | 4500 |
| 4  | N    | 192.168.1.4 | 4500 |

This table would be continuously shared among all peers in the cluster, so that when a coordinator leaves the cluster and joins back again with a different IP address, all peers would be informed and thus not affected by an IP/Port change.
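
To illustrate the shape of the idea, here is a minimal sketch of such a peer table (in Go, purely illustrative; none of these types exist in FoundationDB):

```go
package peers

// PeerType distinguishes coordinators from regular nodes
// in the proposed table.
type PeerType string

const (
	Coordinator PeerType = "C"
	Node        PeerType = "N"
)

// Peer is one row of the table: ID is the stable identity,
// while IP and Port are free to change across restarts.
type Peer struct {
	ID   string
	Type PeerType
	IP   string
	Port int
}

// Registry is the table that every member keeps up to date;
// looking up a peer by its ID always yields its current address.
type Registry map[string]Peer

// Update records the latest address announced by a (re-)joining peer.
func (r Registry) Update(p Peer) {
	r[p.ID] = p
}
```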

The feature that I would like to see for integration with external service discovery is to make it so that, as a command line parameter, environment variable or (client) network option you can pass cluster file contents or just coordinator IP addresses obtained from your service discovery system to FDB servers and clients. FDB attempts to contact these coordinators in parallel with the ones in the cluster file, and then updates its cluster file (as it does today) if a more up to date configuration is found.

That is a good idea! The fdb.cluster file is definitely not ideal. Is it even necessary to keep it?

Using a command line parameter, environment variable, or anything else should be enough to contact the cluster. Then the cluster would continuously share a transient table as I explained above.

Ideally then your FDB cluster keeps working if either your service discovery is up or enough of your coordinators have kept their IP addresses. I would still recommend setting up coordinators with stable IP addresses if your orchestration system permits this, but this should make the best of a given situation.

Obviously the idea is to keep IP addresses as stable as possible, but when shit hits the fan, FoundationDB (deployed with a container-orchestration tool) should be able to recover from it.

I think one of the main arguments for the cluster file is that it provides a durable local record of the current connection information for the cluster and doesn’t have any dependencies on external services.

While a process is connected to the cluster, the cluster can undergo all kinds of changes to its members. If, for example, you migrate your cluster to new hosts (i.e. replace every process in the cluster), then the connection information you used when your process started up before the migration will not be valid after the migration. The processes (both client and server) that are connected to a cluster will update their cluster files in response to these changes, ensuring they can reconnect if they die.

Without the cluster file, if your process dies and tries to reconnect with the same connection string it did originally, it would be unable to in the scenario I described above. Or, if you were relying on some external service discovery to provide you with up-to-date information, you are now dependent on that service being up and functioning in order for your processes to be able to reconnect.

Thanks for your answer.

I realised after I replied to @dave that he already explained it. It makes sense to keep a fallback in case the external service you’d be relying upon is down.

I don’t believe that would entirely solve the issue? While an FDB process is running, one could have e.g. Kubernetes restart all of the coordinators and assign them different IPs. It would be expected for the cluster to continue working as long as their data volumes are the same and DNS now resolves them to the new IPs. I don’t think that would be the case with this proposed feature, as there’s no way to re-resolve what was provided from service discovery?

I said IP addresses, but there is no obvious reason why you couldn’t allow DNS names or even an arbitrary command that outputs addresses. The cluster file would still contain IPs, and we would only need to poll resolution when not successfully connected to a quorum of coordinators.
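
As a rough illustration of the resolution step (a Go sketch only; the StatefulSet hostnames are hypothetical and error handling is minimal), turning coordinator hostnames into the IPs that get written back into the cluster file could be as simple as:

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// resolveCoordinators turns host:port pairs into ip:port pairs,
// skipping anything that does not currently resolve.
func resolveCoordinators(hosts []string) []string {
	var out []string
	for _, h := range hosts {
		host, port, err := net.SplitHostPort(h)
		if err != nil {
			continue
		}
		addrs, err := net.LookupHost(host)
		if err != nil || len(addrs) == 0 {
			continue
		}
		out = append(out, net.JoinHostPort(addrs[0], port))
	}
	return out
}

func main() {
	// Only re-resolved when we cannot reach a quorum of coordinators.
	coords := resolveCoordinators([]string{
		"fdb-0.fdb.default.svc.cluster.local:4500",
		"fdb-1.fdb.default.svc.cluster.local:4500",
		"fdb-2.fdb.default.svc.cluster.local:4500",
	})
	fmt.Println("mycluster:ID123abc@" + strings.Join(coords, ","))
}
```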

You still need a mechanism for doing coordination changes automatically when needed. And honestly this probably just scratches the surface of problems with running anything reliable on Kubernetes. How do you handle changes to the cluster size safely? How do you run a cluster across multiple datacenters (and hence Kubernetes clusters)? I dunno.

allow DNS names or /…/ arbitrary command that outputs addresses /…/ The cluster file would still contain IPs

I don’t see how this solves the issue of setting new coordinator servers when there isn’t a quorum of coordinator servers available. Or, in @basgys's words:

I cannot update the list of coordinators anymore because there is no quorum and a force eviction won’t work because it is a coordinator. As far as I am aware, Kubernetes does not allow me to assign a static IP to fix the issue. The cluster is basically locked forever.

If the coordinators are identified by a DNS name instead of an IP, then no quorum change is needed; nodes only need to re-query the DNS server on startup or if a coordinator is unreachable. This would allow a completely dead cluster to recover.

I suppose the issue could also be fixed if there were an option to force new coordination servers to be set (without a quorum), but this seems like a dangerous approach.

(@basgys: I did find a workaround in k8s by giving each coordinator pod its own service (with a static IP); ugly, but it seems to work)

There are a couple of pieces of trouble hidden in those steps, though:

  1. How would a coordinator know that it’s a coordinator?
  2. What happens if DNS is unavailable?

For (1), we could have each process resolve the hostname and check against its own IP. This seems like it’d run into trouble with multi-IP’d hosts or NAT’d situations. We could have each process connect to each hostname and request an ID that it compares with a locally randomly generated (or assigned) one. This starts to sound very similar to @basgys’s proposal.
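
To make the multi-IP concern concrete, the naive version of that check would look something like this (a Go sketch only; the hostname is hypothetical, and this is exactly the approach that breaks down on multi-homed or NAT'd hosts, where the address peers resolve may never appear on a local interface):

```go
package main

import (
	"fmt"
	"net"
)

// amICoordinator naively checks whether any address carried by this
// host's interfaces matches what the coordinator hostname resolves to.
func amICoordinator(coordinatorHost string) (bool, error) {
	resolved, err := net.LookupHost(coordinatorHost)
	if err != nil {
		return false, err
	}
	ifaceAddrs, err := net.InterfaceAddrs()
	if err != nil {
		return false, err
	}
	for _, ra := range resolved {
		for _, ia := range ifaceAddrs {
			if ipNet, ok := ia.(*net.IPNet); ok && ipNet.IP.String() == ra {
				return true, nil
			}
		}
	}
	return false, nil
}

func main() {
	ok, err := amICoordinator("fdb-0.fdb.default.svc.cluster.local")
	fmt.Println(ok, err)
}
```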

Considering DNS unavailability, this would mean that a process's knowledge that it is a coordinator would depend on DNS. It would probably be wise to also cache the IP resolved from DNS for a coordinator in the cluster file, and I believe we should have the ability to store a tiny bit of state about whether we were formerly a coordinator in the pidfile where we store other preferences as well.

Is it just the case that if you’re running in kubernetes and DNS is unavailable, then so many other things would be broken and unavailable that FDB unavailability wouldn’t even comparatively be a problem?

(2) I agree that DNS availability adds another failure point. However, I think the discussion is not about whether DNS should be mandatory but whether it could be a supported alternative; if so, then everyone can decide whether DNS is an acceptable point of failure given their respective infrastructure. In a K8s environment I would argue that depending on DNS is reasonable.

I’d agree with @Zatte. I believe it makes sense for FDB to optionally support DNS resolution of coordinator host names. To answer your question @alexmiller: yes, DNS is extremely robust in Kubernetes. If that mechanism ceases to work, there are bigger problems to worry about.

I believe we need another feature based on DNS: support for “seeds identified by hostnames”.

That way, you do not have to “copy” the fdb.cluster file in Kubernetes when adding new nodes. You just define one or a few seed nodes (probably coordinators), and these seed nodes provide the necessary topology information to new nodes and applications.

What are your thoughts?

Is there any progress on this? We would prefer to run FDB on K8s, but currently it gets quite difficult if we, for example, delete all pods and they change IPs.

Was there any news during the conference?

Nope. None of the talks were about how to cloudify FDB.

There have been a few iterations of discussion about DNS support in FDB being necessary for Kubernetes, but no one has signed up to do the work yet.

Hey @alexmiller, my team and I are somewhat interested in making these changes. We’re especially interested since we’d like to run FDB as a Kubernetes StatefulSet. A couple of assumptions:

  • The DNS server will be run in a highly available configuration. If DNS is down, we’ve got more major problems than FDB being down.
  • StatefulSets ensure that each pod is assigned a stable network identity. I.e. there will be hosts fdb-0.cluster, fdb-1.cluster, fdb-2.cluster, etc…
  • For a StatefulSet’s network identity, there will only be a single IP address it resolves to at any one time.

I had a quick perusal over the code, and it seems that the best way to make this change might be to:

  1. Extend the NetworkAddress class to be able to take a hostname.
  2. Change connect in network.cpp to resolve the host before connecting.
  3. Fix up all other usages of NetworkAddress to handle the hostname.

Does this sound like a reasonable change to make?

Just a quick heads up: I looked into making this code change and decided it was a bit too difficult, and that it would introduce some issues with ser/de back compat. Instead, I opted for the following strategy:

  • There’s a process that is responsible for monitoring the Pods that are members of the FDB StatefulSet, and also for running the fdbmonitor command.
  • When it detects a change in the IP addresses of the Pods that are members of the StatefulSet, it stops the local fdbmonitor process (if one was previously started), generates a new fdb.cluster file with the new IP addresses, and restarts the fdbmonitor process with this new cluster file (a rough sketch of this loop follows below).
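
Roughly, the core of that loop looks like the following (a simplified Go sketch; getPodIPs stands in for the Kubernetes API watch, port 4500 is assumed, and the real code handles fdbmonitor's configuration and shutdown more carefully):

```go
package watcher

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
	"time"
)

// run regenerates fdb.cluster and (re)starts fdbmonitor whenever the
// set of pod IPs changes. clusterID is the <desc>:<id> prefix of the
// cluster file; getPodIPs is a placeholder for the Kubernetes API watch.
func run(clusterID, clusterFile string, getPodIPs func() []string) error {
	var current []string
	var fdbmonitor *exec.Cmd

	for {
		ips := getPodIPs()
		if !equal(ips, current) {
			// Stop the previously started fdbmonitor, if any.
			if fdbmonitor != nil {
				_ = fdbmonitor.Process.Kill()
				_ = fdbmonitor.Wait()
			}
			// Write a cluster file pointing at the new addresses.
			var coords []string
			for _, ip := range ips {
				coords = append(coords, ip+":4500")
			}
			content := fmt.Sprintf("%s@%s\n", clusterID, strings.Join(coords, ","))
			if err := os.WriteFile(clusterFile, []byte(content), 0644); err != nil {
				return err
			}
			// Restart fdbmonitor; it reads its usual foundationdb.conf,
			// which points at clusterFile.
			fdbmonitor = exec.Command("fdbmonitor")
			if err := fdbmonitor.Start(); err != nil {
				return err
			}
			current = ips
		}
		time.Sleep(5 * time.Second)
	}
}

func equal(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}
```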

I’ve got this working for a small cluster of 3 FDB nodes and am attempting to see if I can break it.

Obviously, this strategy is pretty naive and results in a small window of downtime in the case that a Pod is rescheduled to another Node. After I get it working, I’m going to attempt some simple strategies to reduce/eliminate this downtime such as having the Pods that are alive co-ordinate their restarts with neighbouring Pods that they can reach.

I will likely continue to assume that the number of Pods that make up the StatefulSet is fixed (i.e. there is a fixed number of fdbmonitor processes, and therefore a fixed number of co-ordinators) for the foreseeable future.

If I’m successful, and if people are interested, I’m happy to release this code as a Go library that others can use.

Sorry for being slow in getting back to you. (Vacation, long weekend, nontrivial question, and all that.)

This is an area that we had really hoped the community would be able to contribute to, being the ones more actively hitting these struggles. I would hope that with some help and guidance, you (and your team) would be able to make better progress on fixing this within FDB properly.

As we’re now discussing implementation details, I’d like to shuffle this conversation back to within GitHub on #354, and I’ll go start typing out the rest of my thoughts there.

There are a couple of other areas of concerns with IPs changing that we may want to address with a different kind of process identifier. Exclusions and kills are both done by IP today. For kills, it’s feasible to look up the IP immediately before killing the process, but it would be better if we could avoid this work in administrative tooling by supplying another identifier that we use instead. For excludes, I think there’s a bigger safety concern. It’s plausible that a process could get rescheduled after it is excluded, but before it is fully evacuated, and I think this would cause it to rejoin the cluster and start taking on new data. This could lead the person or system doing the exclusion to falsely believe it is safe to take down the instance. If this happens to multiple instances, that could lead to data loss. This seems like an area where we’ll need an identifier that we can track across the lifetime of an instance.

Caveat: I’ve never seen the exclusion problem I described above, so it’s possible it’s not a problem in practice.

I see good discussion on k8s here, hence I am posting my requirements. Please correct me if I should post this somewhere else.

I have 2 separate requirements for deploying fdb on k8s as a StatefulSet, in which each pod has its own dedicated Persistent Volume.

Deploying/Scaling on k8s
I will first illustrate how Cassandra is deployed/scaled on k8s.
Cassandra has a concept of a “seed”, where you can point a new node at any existing node of the cluster and the new node will join the cluster.
The steps are:

  • “New node” contacts “Seed Node” and gets cluster info (effectively the cluster file)
  • “New Node” makes this info durable (effectively creating cluster file locally)
  • “New Node” joins the cluster

Note: Here, the “Seed Node” is not required to be a special node (e.g. a coordinator).

The first k8s pod is self-seeded and the next ones are seeded by the previous ones (a StatefulSet has ordered pods).
In general, one should try to have 3 designated seed nodes (per datacenter), so that we are resilient to pod failures. For this, one can treat the initial nodes as trusted seeds.

It’s important to note that your seed node might have been excluded from the cluster, so your new node might get old cluster information (a stale cluster file); that is something that needs to be managed (I will not go into the details, as this needs to be managed in any approach).

Now let’s go to fdb’s doc: Adding machines to a cluster

  1. Copy an existing cluster file from a server in your cluster to the new machine, overwriting the existing fdb.cluster file.
  2. Restart FoundationDB on the new machine so that it uses the new cluster file

How should I do this automatically while deploying fdb as a StatefulSet?
Please note that when I say automatically, I mean that I should just be able to use the “kubectl scale statefulset” command and my cluster should scale (assuming all of my pods are identical by default).

There are workarounds:

  • Use a shared Persistent Volume across all pods to share cluster file
  • Fix coordinator pods (a statefulset is ordered)
  • Use kubectl cp inside pod
  • Enable ssh across pods to get cluster file
  • …

But all workarounds have drawbacks.

Feature request: Could we have fdb support such seeding as part of inter-node communication?

Recovering a pod after failure
When a pod in a StatefulSet fails, k8s allocates a new pod with the same DNS name, but it can be at a different IP. Note that the data it had persists in the attached Persistent Volume.

How should I make the fdb cluster let such a node rejoin after it comes back with its data (assume here that it has the cluster file and can effectively connect to the cluster)?

Feature request: Could we have fdb support identification of nodes in a cluster by some generated ID, independent of IP/host?

Some more info

Please note that here I have not requested features like:

  • identify coordinator by DNS hostname
  • support DNS hostnames in the cluster file

I see the “cluster file” as a seed in a generic way, and I don’t care whether it has DNS names or IPs inside it. I know my pods in the cluster (DNS name and current IP) and just want generic seeding to be available in fdb.

Similarly, I don’t want a node identifier based on the DNS name. It would help me in a StatefulSet deployment, but I believe that it’s not something fdb should care about. Fdb should implement this in a generic way.

Note: I know that writing a custom k8s controller (with a custom scale operation) is one possible solution here. But maintaining that is another overhead for me.

Do you have enough of an understanding of Kubernetes to know if vending cluster files via a custom resource is an appropriate solution to this?