Proposal: Don't identify a coordinator based on its IP

I said IP addresses, but there is no obvious reason why you couldn’t allow DNS names or even an arbitrary command that outputs addresses. The cluster file would still contain IPs, and we would only need to poll resolution when not successfully connected to a quorum of coordinators.

You still need a mechanism for doing coordination changes automatically when needed. And honestly this probably just scratches the surface of problems with running anything reliable on Kubernetes. How do you handle changes to the cluster size safely? How do you run a cluster across multiple datacenters (and hence Kubernetes clusters)? I dunno.

allow DNS names or /…/ arbitrary command that outputs addresses /…/ The cluster file would still contain IPs

I don’t see how this solves the issue of setting new coordinator servers when there isn’t a quorum of coordinator servers available. Or, in @basgys’s words:

I cannot update the list of coordinators anymore because there is no quorum and a force eviction won’t work because it is a coordinator. As far as I am aware, Kubernetes does not allow me to assign a static IP to fix the issue. The cluster is basically locked forever.

If the coordinators are identified by a DNS name instead of an IP, then no quorum change is needed; nodes need only re-query the DNS server on startup or if a coordinator is unreachable. This would allow a completely dead cluster to recover.

I suppose the issue could also be fixed if there were an option to force new coordination servers to be set (without a quorum), but this seems like a dangerous approach.

(@basgys: I did find a workaround in k8s by giving each coordinator pod its own service (with a static IP); ugly, but it seems to work.)

There are a couple of pieces of trouble hidden in those steps, though:

  1. How would a coordinator know that it’s a coordinator?
  2. What happens if DNS is unavailable?

For (1), we could have each process resolve the hostname and check against its own IP. This seems like it’d run into trouble with multi-IP’d hosts or NAT’d situations. We could have each process connect to each hostname and request an ID that it compares with a locally randomly generated (or assigned) one. This starts to sound very similar to @basgys’s proposal.
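
To make the first option concrete, here is a rough Go sketch of the “resolve the hostname and compare with my own addresses” check. The hostname is a made-up StatefulSet-style example, and, as noted above, multi-IP’d or NAT’d hosts can defeat this:

```go
package main

import (
	"fmt"
	"net"
)

// amICoordinator resolves a coordinator hostname and checks whether any of the
// returned addresses matches one of this host's own interface addresses.
func amICoordinator(coordinatorHost string) (bool, error) {
	resolved, err := net.LookupHost(coordinatorHost)
	if err != nil {
		return false, err
	}
	ifaceAddrs, err := net.InterfaceAddrs()
	if err != nil {
		return false, err
	}
	local := map[string]bool{}
	for _, a := range ifaceAddrs {
		if ipNet, ok := a.(*net.IPNet); ok {
			local[ipNet.IP.String()] = true
		}
	}
	for _, ip := range resolved {
		if local[ip] {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	// Hypothetical StatefulSet pod DNS name; not from the FDB codebase.
	isCoord, err := amICoordinator("fdb-0.fdb.default.svc.cluster.local")
	fmt.Println(isCoord, err)
}
```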

Considering DNS unavailability, this would mean that a process’s knowledge that it is a coordinator would rely on DNS. It would probably be wise to also cache the IP resolved from DNS for each coordinator in the cluster file, and I believe we should have the ability to store a tiny bit of state about whether we were formerly a coordinator in the pidfile where we store other preferences.

Is it just the case that if you’re running in kubernetes and DNS is unavailable, then so many other things would be broken and unavailable that FDB unavailability wouldn’t even comparatively be a problem?

(2) I agree that DNS availability adds another failure point. However, I think the question is not whether DNS should be mandatory, but whether it could be a supported alternative. If so, then everyone can decide whether DNS is an acceptable point of failure given their respective infrastructure. In a K8s environment I would argue that depending on DNS is reasonable.

I’d agree with @Zatte. I believe it makes sense for FDB to optionally support DNS resolution of coordinator host names. To answer your question @alexmiller: yes, DNS is extremely robust in Kubernetes. If that mechanism ceases to work, there are bigger problems to worry about.

I believe we need another feature based on DNS: support for “seeds identified by hostnames”.

That way, you do not have to “copy” the fdb.cluster file in Kubernetes to add new nodes. You just define one or a few seed nodes (probably coordinators), and these seed nodes provide the necessary topology information to new nodes and applications.

What are your thoughts?

Is there any progress on this? We would prefer to run FDB on K8s, but currently it gets quite difficult if, for example, we delete all pods and they change IPs.

Was there any news during the conference?

Nope. None of the talks were about how to cloudify FDB.

There have been a few iterations of discussion about how DNS support in FDB will be necessary for Kubernetes, but no one has signed up to do the work yet.

Hey @alexmiller, my team and I are somewhat interested in making these changes. We’re especially interested in it since we’d like to run FDB as a Kubernetes StatefulSet. A couple assumptions:

  • The DNS server will be run in a highly available configuration. If DNS is down, we’ve got bigger problems than FDB being down.
  • StatefulSets ensure that each pod is assigned a stable network identity. I.e. there will be hosts fdb-0.cluster, fdb-1.cluster, fdb-2.cluster, etc…
  • For a StatefulSet’s network identity, there will only be a single IP address it resolves to at any one time.

I had a quick perusal of the code, and it seems that the best way to make this change might be to:

  1. Extend the NetworkAddress class to be able to take a hostname.
  2. Change connect in network.cpp to resolve the host before connecting (a rough sketch of the idea follows this list).
  3. Fix up all other usages of NetworkAddress to handle the hostname.
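
For step 2, the idea is just “resolve the hostname right before dialing”. Here’s a minimal Go sketch of that idea only; the real change would of course live in FDB’s C++ networking code, and the hostname and port below are made-up examples:

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// dialHostname resolves a hostname just before connecting, so a coordinator
// whose pod IP has changed is still reachable by name.
func dialHostname(host, port string) (net.Conn, error) {
	addrs, err := net.LookupHost(host)
	if err != nil {
		return nil, err
	}
	var lastErr error
	for _, addr := range addrs {
		conn, err := net.DialTimeout("tcp", net.JoinHostPort(addr, port), 2*time.Second)
		if err == nil {
			return conn, nil
		}
		lastErr = err
	}
	if lastErr == nil {
		lastErr = fmt.Errorf("no addresses found for %s", host)
	}
	return nil, lastErr
}

func main() {
	// Hypothetical StatefulSet hostname; 4500 is the usual fdbserver port.
	conn, err := dialHostname("fdb-0.fdb.default.svc.cluster.local", "4500")
	if err != nil {
		fmt.Println("connect failed:", err)
		return
	}
	defer conn.Close()
	fmt.Println("connected to", conn.RemoteAddr())
}
```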

Does this sound like a reasonable change to make?

Just a quick heads up: I looked into making this code change and decided it was a bit too difficult, and that it would introduce some issues with ser/de back compat. Instead, I opted for the following strategy:

  • There’s a process that is responsible for monitoring the Pods that are members of the FDB StatefulSet, and also for running the fdbmonitor command.
  • When it detects a change in the IP addresses of the Pods that are members of the StatefulSet, it stops the local fdbmonitor process (if one was previously started), generates a new fdb.cluster file with the new IP addresses, and restarts the fdbmonitor process with this new cluster file (a rough sketch of this loop is below).
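
For anyone curious, a very rough Go sketch of that loop. The namespace, label selector, description:ID, and file paths are all assumptions; real code would also need proper error handling and coordination between replicas:

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/exec"
	"sort"
	"strings"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	var lastAddrs string
	var monitor *exec.Cmd

	for {
		// List the Pods belonging to the FDB StatefulSet (label is an assumption).
		pods, err := client.CoreV1().Pods("fdb").List(context.TODO(),
			metav1.ListOptions{LabelSelector: "app=fdb"})
		if err == nil {
			var addrs []string
			for _, p := range pods.Items {
				if p.Status.PodIP != "" {
					addrs = append(addrs, p.Status.PodIP+":4500")
				}
			}
			sort.Strings(addrs)
			joined := strings.Join(addrs, ",")
			if len(addrs) > 0 && joined != lastAddrs {
				// The description:ID prefix must stay stable across rewrites; "fdb:abc123" is made up.
				content := fmt.Sprintf("fdb:abc123@%s\n", joined)
				if err := os.WriteFile("/var/fdb/fdb.cluster", []byte(content), 0o644); err == nil {
					// Restart fdbmonitor with the regenerated cluster file.
					if monitor != nil {
						_ = monitor.Process.Kill()
						_ = monitor.Wait()
					}
					monitor = exec.Command("fdbmonitor",
						"--conffile", "/etc/foundationdb/foundationdb.conf",
						"--lockfile", "/var/run/fdbmonitor.pid")
					monitor.Stdout, monitor.Stderr = os.Stdout, os.Stderr
					_ = monitor.Start()
					lastAddrs = joined
				}
			}
		}
		time.Sleep(10 * time.Second)
	}
}
```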

I’ve got this working for a small cluster of 3 FDB nodes and am attempting to see if I can break it.

Obviously, this strategy is pretty naive and results in a small window of downtime in the case that a Pod is rescheduled to another Node. After I get it working, I’m going to attempt some simple strategies to reduce/eliminate this downtime such as having the Pods that are alive co-ordinate their restarts with neighbouring Pods that they can reach.

I will likely continue to assume that the number of Pods that make up the StatefulSet is fixed (i.e. there is a fixed number of fdbmonitor processes, and therefore a fixed number of co-ordinators) for the foreseeable future.

If I’m successful, and if people are interested, I’m happy to release this code as a Go library that others can use.

Sorry for being slow in getting back to you. (Vacation, long weekend, nontrivial question, and all that.)

This is an area that we had really hoped the community would be able to contribute to, being the ones more actively hitting these struggles. I would hope that with some help and guidance, you (and your team) would be able to make better progress on fixing this within FDB properly.

As we’re now discussing implementation details, I’d like to shuffle this conversation back to within GitHub on #354, and I’ll go start typing out the rest of my thoughts there.

There are a couple of other areas of concerns with IPs changing that we may want to address with a different kind of process identifier. Exclusions and kills are both done by IP today. For kills, it’s feasible to look up the IP immediately before killing the process, but it would be better if we could avoid this work in administrative tooling by supplying another identifier that we use instead. For excludes, I think there’s a bigger safety concern. It’s plausible that a process could get rescheduled after it is excluded, but before it is fully evacuated, and I think this would cause it to rejoin the cluster and start taking on new data. This could lead the person or system doing the exclusion to falsely believe it is safe to take down the instance. If this happens to multiple instances, that could lead to data loss. This seems like an area where we’ll need an identifier that we can track across the lifetime of an instance.

Caveat: I’ve never seen the exclusion problem I described above, so it’s possible it’s not a problem in practice.

I see good discussion on k8s here, hence posting my requirements. Please correct me if I should post it somewhere else.

I have 2 separate requirements for deploying fdb on k8s as a StatefulSet, in which each pod has its own dedicated Persistent Volume.

Deploying/Scaling on k8s
I will first illustrate how Cassandra is deployed/scaled on k8s.
Cassandra has a concept of “seed”, where you can provide any node of the cluster to a new node and the new node will join the cluster.
The steps are:

  • “New node” contacts “Seed Node” and gets cluster info (effectively the cluster file)
  • “New Node” makes this info durable (effectively creating cluster file locally)
  • “New Node” joins the cluster

Note: Here, “Seed Node” is not required to be a special node (ex: a coordinator).

The first k8s pod is self-seeded and the next ones are seeded by the previous ones (a StatefulSet has ordered pods).
In general, one should try to have 3 designated seed nodes (per datacenter), so that we are resilient to pod failures. For this, one can manage the initial nodes as trusted seeds.

It’s important to note that your seed node might have been excluded from the cluster, so your new node might get old cluster information (a stale cluster file); that is something that needs to be managed (I will not go into the details, as this needs to be managed in any approach).

Now let’s go to fdb’s doc: Adding machines to a cluster

  1. Copy an existing cluster file from a server in your cluster to the new machine, overwriting the existing fdb.cluster file.
  2. Restart FoundationDB on the new machine so that it uses the new cluster file

How should I do this automatically while I deploy fdb as a StatefulSet?
Please note that when I say automatically, I mean that I should just use the “kubectl scale statefulset” command and my cluster should scale (assuming all of my pods are identical by default).

There are workarounds:

  • Use a shared Persistent Volume across all pods to share cluster file
  • Fix coordinator pods (a StatefulSet is ordered)
  • Use kubectl cp inside pod
  • Enable ssh across pods to get cluster file

But all workarounds have drawbacks.

Feature request: Could we have fdb support such seeding as part of inter-node communication?

Recovering pod after failure
When a pod in a StatefulSet fails, k8s allocates a new pod with the same DNS name, but possibly at a different IP. Note that the data it had persists in the attached Persistent Volume.

How should I make the fdb cluster accept such a node rejoining after it comes back with its data (assume here that it has the cluster file and connects to the cluster successfully)?

Feature request: Could we have fdb support identification of nodes in a cluster by some generated ID, independent of IP/host?

Some more info

Please note that here I have not requested features like

  • identify coordinator by DNS hostname
  • support DNS hostnames in the cluster file

I see the “cluster file” as a seed in a generic way, and I don’t care if it has DNS names or IPs inside it. I know my pods in the cluster (DNS name and current IP) and just want generic seeding to be available in fdb.

Similarly, I don’t want a node identifier based on a DNS name. It would help me in a StatefulSet deployment, but I believe that it’s not something fdb should care about; fdb should implement this in a generic way.

Note: I know that writing a custom k8s controller (with a custom scale operation) is one of the solutions here. But maintaining that is another overhead for me.

Do you have enough of an understanding of Kubernetes to know if vending clusterfiles via a custom resource is an appropriate solution to this?

I started looking into k8s a month ago. Writing a custom resource is powerful, but I can’t take a call on whether this is an “appropriate” solution here.

To answer this, I looked at different databases and how they scale.

  • Cassandra: supports seeding from any node and uses StatefulSet. See doc
  • CockroachDB: supports seeding from any node (Scale the cluster) and uses StatefulSet. See: cockroachdb-statefulset.yaml
  • CouchBase: Supports seeding from any node (server-add) and has a custom k8s controller: Couchbase Operator.
  • Redis: Supports seeding from any node (See Adding a new node section here) and there are custom controllers available. This can also be deployed as a simple StatefulSet.

Some clarifications

  • A few dbs above have a second step of enabling data rebalancing, which I have not discussed
  • redis.conf provided to every node is essentially a static config (except self-port) and comes from k8s ConfigMap.

The cluster file, in contrast, is a dynamic file.

Some other findings
CouchBase has a concept of multi-dimensional scaling (see the Multi Dimensional Scaling section here). Dividing fdb pods into 3 categories, namely stateless, log, and storage, and then using similar multi-dimensional scaling, where each dimension is a pod type, would be great for scaling in different situations. This is where a custom controller would be great.

Conclusions as of now
I am not sure whether vending the cluster file is a good use case for a custom controller or not. But implementing and maintaining a custom controller is another effort in itself. And many dbs support seeding in general and do not take the custom controller route for “seeding based on a dynamic file”.

Next Steps

  • To start with, could we simply have “fdbmonitor” take a “seed” node (host:port) of an existing cluster to effectively get the cluster file?
  • We need to resolve the changing-IP issue for coordinators. Could you suggest an action plan for this?

If you wouldn’t mind digging into how TiDB/TiKV and Yugabyte are deployed on Kubernetes, I’d appreciate it. Both of them have a model that is closer to FDB’s model where they have one component (Placement Driver and YB masters, respectively) which all of the other pieces connect to much like how all FDB processes connect to coordinators. I think FDB is still in a more difficult situation here, as we expect coordinators to be able to be identified individually, whereas PD and yb-master have a raft leader, so contacting any of the members probably works?

What I’ve vaguely heard of suggests that maintaining something to run FDB on Kubernetes (ie. the “operator” pattern) is a thing that will likely eventually happen by someone. I’m still not sure exactly what support is needed in FDB for this.

Allowing a client to pull a cluster file from any FDB worker wouldn’t be difficult to write. I’m not clear how this solves the bootstrapping problem of starting many FDB processes at the same time, when a cluster does not yet exist? You’re still going to need to provide a cluster file to get the initial set of processes running?

Thanks for pointing me in the right direction. I looked at Yugabyte. I will summarize it based on this doc, skipping the Admin UI stuff.

Yugabyte is deployed as two (stateful) sets: yb-master set and yb-tserver set.
Each set is accompanied by a (headless) service, which provides the hostname + current IP of all its members. The yb-master set’s service is used (1) by a master to discover other masters and (2) by a tserver to initialize/join the cluster.

I simply visualized yb-master as the fdb coordinator and tried to answer the question “If I deploy coordinators as a separate (stateful) set, can I use its (headless) service to seed the cluster file to other nodes?” The answer is no, as I need the ID of the cluster file as well.

description:ID@ip1:port1,ip2:port2,ip3:port3

And as per this post, it’s not any random ID.
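
For example, something like the following Go sketch can resolve a headless service to the current coordinator pod IPs, but the description:ID prefix still has to come from somewhere else (the service name and “fdb:abc123” below are made up):

```go
package main

import (
	"fmt"
	"net"
	"sort"
	"strings"
)

func main() {
	// A headless service returns one A record per ready pod behind it.
	// The service name here is an assumption.
	ips, err := net.LookupHost("fdb-coordinators.default.svc.cluster.local")
	if err != nil {
		panic(err)
	}
	sort.Strings(ips)
	for i, ip := range ips {
		ips[i] = ip + ":4500"
	}
	// The description:ID part cannot be derived from DNS; it must be
	// distributed out-of-band (by hand, a ConfigMap, an operator, ...).
	fmt.Printf("fdb:abc123@%s\n", strings.Join(ips, ","))
}
```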

A k8s StatefulSet starts pods one by one (with a pod index) and ensures that all previous pods are sane before adding another pod. Hence no 2 pods start simultaneously during initialization (except edge-case scenarios arising from uncontrolled pod restarts). This ability could be used to initialize the cluster at a particular step of deployment.
A short example: self-seed the first pod (initialize a single-node cluster), keep seeding the next pods from the first pod, and then when the 6th pod initializes, reconfigure the cluster with the first 5 pods as coordinators.
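
A hypothetical Go sketch of that ordering logic, deriving the pod ordinal from the StatefulSet hostname; the names, the threshold of 5 coordinators, and the printed actions are purely illustrative:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	hostname, _ := os.Hostname() // e.g. "fdb-3" inside a StatefulSet pod
	parts := strings.Split(hostname, "-")
	ordinal, err := strconv.Atoi(parts[len(parts)-1])
	if err != nil {
		panic(err)
	}

	// Decide this pod's bootstrap role from its ordinal.
	switch {
	case ordinal == 0:
		fmt.Println("self-seed: create a new single-node cluster")
	case ordinal < 5:
		fmt.Println("seed from fdb-0: copy its fdb.cluster and join")
	case ordinal == 5:
		fmt.Println("promote: reconfigure coordinators to fdb-0 .. fdb-4")
	default:
		fmt.Println("seed from any earlier pod and join")
	}
}
```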

Next steps
I didn’t go further into “how to deploy fdb by statefulset”, as I now know that the fdb team has plans to explore a k8s custom resource. This is good both for vending cluster files and for maintaining multiple sets within a Deployment, like stateless, log, storage, etc.
I liked the idea of a multi-set deployment. CouchBase provides a custom resource for this, whereas YB tells you to use 2 StatefulSets. With this, you can independently scale whichever roles you want more of.

I am now looking into k8s custom resources. My current understanding is that the problem of “coordinator being identified by IP” can also be solved by a custom operator. We have to maintain the set of coordinators independently in the fdb Deployment, and our custom operator can ensure that all (healthy) members of this set are coordinators of the cluster (a k8s operator is essentially a feedback loop). If not, it reconfigures the cluster with those coordinators. Will see where it goes :slight_smile:

I would suggest not interpreting my question that strongly. I was just parroting back a question that I’ve received before and didn’t have an answer to. AFAIK, the three companies that have an FDB dev team all do not run it via kubernetes, so there’s an overall lack of kubernetes best practices knowledge to know exactly what improvements need to be made.

This was particularly helpful. So DNS in cluster files and fetching coordinators from a headless service provide the same thing, but neither solves the full problem of how to propagate the description:ID part. DNS in cluster files makes this easier, as it’s something that could still be written by hand/automation and then distributed, but it doesn’t make the problem go away.

Please do keep us updated on your kubernetes digging. :slight_smile:

In a containerized deployment (be it docker compose, docker swarm, consul+nomad, kubernetes), you generally assume that service discovery is a critical component. That’s why service discovery implementations (like nomad or etcd) are highly-available and run their own consensus protocols. If service discovery is not available, your entire cluster should be considered unavailable.

In other words, relying on service discovery being up and functioning is not a problem.

I agree with this. Once you start getting into hundreds of nodes, being able to update 1 consul or etcd cluster is much more scalable. If a node can’t connect to etcd, there are more than likely bigger problems to be concerned with. For Kubernetes at least, the orchestration of nodes itself relies on such services.