Proposal: Don't identify a coordinator based on its IP

I started looking into k8s a month ago. Writing a custom resource is powerful, but I can’t decide whether it is an “appropriate” solution here.

To answer this, I looked at different databases and how they scale.

  • Cassandra: supports seeding from any node and uses StatefulSet. See doc
  • CockroachDB: supports seeding from any node (Scale the cluster) and uses StatefulSet. See: cockroachdb-statefulset.yaml
  • Couchbase: supports seeding from any node (server-add) and has a custom k8s controller: Couchbase Operator.
  • Redis: supports seeding from any node (see the Adding a new node section here), and there are custom controllers available. It can also be deployed as a simple StatefulSet.

Some clarifications

  • A few of the DBs above have a second step of enabling data rebalancing, which I have not discussed.
  • The redis.conf provided to every node is essentially a static config (except for the node’s own port) and comes from a k8s ConfigMap.

The FDB cluster file, in contrast, is a dynamic file.

Some other findings
Couchbase has a concept of multi-dimensional scaling (see the Multi Dimensional Scaling section here). Dividing fdb pods into 3 categories, namely stateless, log and storage, and then using similar multi-dimensional scaling, where each dimension is a pod type, would be great for scaling in different situations. This is where a custom controller would be great.

Conclusions as of now
I am not sure whether vending the cluster file is a good use case for a custom controller. Implementing and maintaining a custom controller is a significant effort in itself, and many DBs support seeding in general rather than taking the custom controller route for “seeding based on a dynamic file”.

Next Steps

  • To start with, could we simply have “fdbmonitor” take a “seed” node (host:port) of an existing cluster and use it to fetch the cluster file?
  • We need to resolve the issue of coordinators’ changing IPs. Could you suggest an action plan for this?

If you wouldn’t mind digging into how TiDB/TiKV and Yugabyte are deployed on Kubernetes, I’d appreciate it. Both of them have a model that is closer to FDB’s model where they have one component (Placement Driver and YB masters, respectively) which all of the other pieces connect to much like how all FDB processes connect to coordinators. I think FDB is still in a more difficult situation here, as we expect coordinators to be able to be identified individually, whereas PD and yb-master have a raft leader, so contacting any of the members probably works?

What I’ve vaguely heard suggests that maintaining something to run FDB on Kubernetes (i.e. the “operator” pattern) is something that someone will likely build eventually. I’m still not sure exactly what support is needed in FDB for this.

Allowing a client to pull a cluster file from any FDB worker wouldn’t be difficult to write. I’m not clear how this solves the bootstrapping problem of starting many FDB processes at the same time, when a cluster does not yet exist? You’re still going to need to provide a cluster file to get the initial set of processes running?

Thanks for pointing me in the right direction. I looked at Yugabyte. I will summarize it based on this doc, skipping the Admin UI stuff.

Yugabyte is deployed as two (stateful) sets: the yb-master set and the yb-tserver set.
Each set is accompanied by a (headless) service, which provides the hostname + current IP of all its members. The yb-master service is used (1) by a master to discover other masters and (2) by a tserver to initialize/join the cluster.
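
For illustration, here is a minimal Go sketch of that discovery step (the service name yb-masters.default.svc.cluster.local is an assumption for the example); resolving a headless Service’s DNS name returns one A record per ready pod:

    package main

    import (
        "fmt"
        "net"
    )

    func main() {
        // A headless Service has no cluster IP; resolving its DNS name
        // returns one A record per ready pod, i.e. the current members.
        // The service name below is assumed for this example.
        ips, err := net.LookupHost("yb-masters.default.svc.cluster.local")
        if err != nil {
            fmt.Println("discovery failed:", err)
            return
        }
        for _, ip := range ips {
            fmt.Println("member:", ip)
        }
    }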

I simply visualized yb-master as the fdb coordinator and tried to answer the question “If I deploy coordinators as a separate (stateful) set, can I use its (headless) service to seed the cluster file to other nodes?” The answer is no, as I also need the ID part of the cluster file:

description:ID@ip1:port1,ip2:port2,ip3:port3

And as per this post, it’s not just any random ID.

A k8s StatefulSet starts pods one by one (with a pod index) and ensures that all previous pods are healthy before adding another pod. Hence no two pods start simultaneously during initialization (except in edge cases arising from uncontrolled pod restarts). This ability could be used to initialize the cluster at a particular step of the deployment.
A short example: self-seed the first pod (initialize a single-node cluster), keep seeding subsequent pods from the first pod, and then, when the 6th pod initializes, configure the cluster with the first 5 pods as coordinators. A rough sketch of this sequence follows below.
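
A rough Go sketch of that bootstrap sequence, assuming the pod hostname ends in its StatefulSet ordinal (e.g. fdb-3), fdbcli is on the PATH with a default cluster file, and fetchClusterFileFromPod0 is a made-up placeholder for however pod 0’s fdb.cluster would actually be distributed (shared volume, init container, etc.):

    package main

    import (
        "fmt"
        "os"
        "os/exec"
        "strconv"
        "strings"
    )

    func main() {
        // Derive the StatefulSet ordinal from the pod hostname, e.g. "fdb-3" -> 3.
        host, _ := os.Hostname()
        parts := strings.Split(host, "-")
        ordinal, _ := strconv.Atoi(parts[len(parts)-1])

        switch {
        case ordinal == 0:
            // First pod: self-seed by creating a single-node database.
            run("fdbcli", "--exec", "configure new single ssd")
        case ordinal < 5:
            // Pods 1-4: join using the cluster file seeded by pod 0.
            fetchClusterFileFromPod0()
        default:
            // 6th pod onward: promote a proper coordinator set
            // ("coordinators <addr:port> ..." listing the first 5 pods would
            // also work instead of "auto").
            run("fdbcli", "--exec", "coordinators auto")
        }
    }

    func run(name string, args ...string) {
        out, err := exec.Command(name, args...).CombinedOutput()
        fmt.Println(string(out), err)
    }

    // Placeholder: copy /etc/foundationdb/fdb.cluster from pod 0,
    // e.g. via a shared volume or an init container.
    func fetchClusterFileFromPod0() {}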

Next steps
I didn’t go further into “how to deploy fdb with a StatefulSet”, as I now know that the fdb team has plans to explore a k8s custom resource. This is good both for vending cluster files and for maintaining multiple sets within a deployment, like stateless, log, storage, etc.
I liked the idea of a multi-set deployment; Couchbase provides a custom resource for this, whereas YB tells you to use 2 StatefulSets. This way, you can independently scale whichever roles you need more of.

I am now looking into k8s custom resources. My current understanding is that the problem of “coordinators being identified by IP” can also be solved by a custom operator. We would maintain the set of coordinators independently in the fdb deployment, and our custom operator can ensure that all (healthy) members of this set are coordinators of the cluster (a k8s operator is essentially a feedback loop). If not, it initializes the cluster with those coordinators. Will see where it goes :slight_smile:
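
To make the feedback-loop idea concrete, here is a rough Go sketch of the reconcile step, not a real operator; the helper functions are hypothetical stand-ins for Kubernetes API and fdbcli calls:

    package main

    import (
        "log"
        "time"
    )

    func main() {
        // An operator is essentially a reconciliation loop: observe, compare, correct.
        for range time.Tick(30 * time.Second) {
            if err := reconcileCoordinators(); err != nil {
                log.Println("reconcile failed:", err)
            }
        }
    }

    func reconcileCoordinators() error {
        desired := listHealthyCoordinatorPods() // healthy members of the coordinator set
        actual := currentCoordinators()         // coordinators in the live cluster file

        if !sameAddresses(desired, actual) {
            // Re-point the cluster at the healthy coordinator pods,
            // e.g. via `fdbcli --exec "coordinators <addr:port> ..."`.
            return setCoordinators(desired)
        }
        return nil // converged, nothing to do this pass
    }

    // Hypothetical helpers; a real operator would back these with the
    // Kubernetes API and fdbcli.
    func listHealthyCoordinatorPods() []string { return nil }
    func currentCoordinators() []string        { return nil }
    func setCoordinators(addrs []string) error { return nil }

    func sameAddresses(a, b []string) bool {
        if len(a) != len(b) {
            return false
        }
        seen := map[string]bool{}
        for _, addr := range a {
            seen[addr] = true
        }
        for _, addr := range b {
            if !seen[addr] {
                return false
            }
        }
        return true
    }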

I would suggest not interpreting my question that strongly. I was just parroting back a question that I’ve received before and didn’t have an answer to. AFAIK, the three companies that have an FDB dev team all do not run it via kubernetes, so there’s an overall lack of kubernetes best practices knowledge to know exactly what improvements need to be made.

This was particularly helpful. So DNS in clusterfiles and fetching coordinators from a headless service provide the same thing, but neither solve the full problem of how to propagate the description:ID part. DNS in clusterfiles makes this easier, as it’s something that could still be written by hand/automation and then distributed, but doesn’t make the problem go away.
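
For concreteness, a cluster file using stable DNS names instead of IPs might look like the line below (the hostnames are made up); the stable names sidestep changing pod IPs, but the description:ID prefix still has to be generated and distributed somehow:

description:ID@fdb-coordinator-0.fdb-coordinators:4500,fdb-coordinator-1.fdb-coordinators:4500,fdb-coordinator-2.fdb-coordinators:4500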

Please do keep us updated on your kubernetes digging. :slight_smile:

In a containerized deployment (be it docker compose, docker swarm, consul+nomad, kubernetes), you generally assume that service discovery is a critical component. That’s why service discovery implementations (like Consul or etcd) are highly available and run their own consensus protocols. If service discovery is not available, your entire cluster should be considered unavailable.

In other words, relying on service discovery being up and functioning is not a problem.


I agree with this. Once you start getting into hundreds of nodes, being able to update 1 consul or etcd cluster is much more scalable. If a node can’t connect to etcd, there are more than likely bigger problems to be concerned with. For kubernetes at least, the orchestration of nodes itself relies on such services.