Versioning for the Kubernetes Operator

Over time, the Kubernetes Operator will accumulate technical debt. Paying down this technical debt will sometimes require backward incompatible changes, like removing deprecated fields or dropping support for old versions of FoundationDB. We have to balance our desire to pay down this technical debt against the need for stability in the tools that people are using to manage production deployments. To achieve that balance, I propose the following process for evolving the Kubernetes Operator. Most of the changes described here would only take effect once we reach version 1.0, but the process for transitioning between major versions will also apply to the move from 0.x to 1.0.

We will version the Kubernetes Operator using SemVer (major.minor.patch). Backward incompatible changes will only be introduced in a new major version of the operator. Some of the possible backward incompatible changes are:

  • Dropping support for old versions of FDB
  • Dropping support for old versions of Kubernetes
  • Removing deprecated fields from the cluster resource
  • Changing the default values when fields are omitted in the spec
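As a rough illustration of the rule above, here is a minimal Go sketch that parses a major.minor.patch string and reports whether an upgrade is allowed to carry backward incompatible changes. The `Version` type and the `ParseVersion` and `MayBreak` names are invented for this example and are not part of the operator.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Version is a parsed major.minor.patch version (hypothetical helper type).
type Version struct {
	Major, Minor, Patch int
}

// ParseVersion parses strings such as "1.4.2" or "v1.4.2".
// Pre-release and build suffixes are deliberately not handled in this sketch.
func ParseVersion(s string) (Version, error) {
	parts := strings.Split(strings.TrimPrefix(s, "v"), ".")
	if len(parts) != 3 {
		return Version{}, fmt.Errorf("expected major.minor.patch, got %q", s)
	}
	nums := make([]int, 3)
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return Version{}, fmt.Errorf("invalid component %q in %q", p, s)
		}
		nums[i] = n
	}
	return Version{Major: nums[0], Minor: nums[1], Patch: nums[2]}, nil
}

// MayBreak reports whether an upgrade between two operator versions is
// allowed to contain backward incompatible changes under the proposal:
// only when the major version increases.
func MayBreak(from, to Version) bool {
	return to.Major > from.Major
}
```

Under this sketch, going from 0.x to 1.0.0 counts as a major transition, while going from 1.0.0 to 1.3.2 does not.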

Minor or patch versions of the operator may introduce behavior changes that break people’s usage of it, but in general we will try to introduce feature flags to control that kind of change. This should allow people to opt in to new behavior, and give us an opportunity to merge something into a release while limiting the risk of problems with new features.
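One way such a feature flag could work is an optional boolean that falls back to a per-release default when it is left unset. The sketch below assumes that shape; `FeatureFlags`, `UseNewReplacementStrategy`, and `Enabled` are made-up names for illustration, not fields or functions in the operator.

```go
package main

// FeatureFlags is a hypothetical set of opt-in switches in the cluster spec.
type FeatureFlags struct {
	// UseNewReplacementStrategy is an invented example flag.
	// A nil pointer means the user did not set it.
	UseNewReplacementStrategy *bool
}

// Enabled returns the user's choice when the flag is set, and otherwise the
// default for the running release. A minor release would ship new behavior
// with defaultValue=false; a later major release could flip it to true.
func Enabled(flag *bool, defaultValue bool) bool {
	if flag == nil {
		return defaultValue
	}
	return *flag
}
```

Keeping "unset" distinct from "false" is what lets a major version change the default without overriding users who made an explicit choice.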

There will be at least 12 months between each major version. This should ensure that the work of supporting a new major version doesn’t create a continuous drain on people running the operator, while also limiting how much technical debt we will accrue between major versions.

If we need to remove a field from the cluster resource, we will release a new version of the CRD. If we need to add a field to the cluster resource, we will do that without changing the CRD version. We will develop a tool for people to scan their clusters for any use of deprecated fields in preparation for a major upgrade.
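A scanning tool along these lines could be quite small. The sketch below assumes the cluster spec has already been decoded into nested maps (as a YAML or JSON decoder would produce) and checks it against a list of deprecated field paths. The path list is illustrative only: `volumeSize` is borrowed from the discussion in this thread, and `exampleSection.oldField` is invented to show a nested path.

```go
package main

import (
	"sort"
	"strings"
)

// deprecatedPaths is an illustrative list, not the operator's real one.
var deprecatedPaths = []string{
	"volumeSize",
	"storageClass",
	"exampleSection.oldField", // invented nested field
}

// isSet walks a dotted path through nested maps and reports whether the
// final key is present.
func isSet(spec map[string]interface{}, path string) bool {
	current := spec
	keys := strings.Split(path, ".")
	for i, key := range keys {
		value, ok := current[key]
		if !ok {
			return false
		}
		if i == len(keys)-1 {
			return true
		}
		next, ok := value.(map[string]interface{})
		if !ok {
			return false
		}
		current = next
	}
	return false
}

// FindDeprecated returns the deprecated paths used by the spec, sorted.
func FindDeprecated(spec map[string]interface{}) []string {
	var found []string
	for _, path := range deprecatedPaths {
		if isSet(spec, path) {
			found = append(found, path)
		}
	}
	sort.Strings(found)
	return found
}
```

Running this over every cluster resource before a major upgrade would produce a list of fields that need attention.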

A new major version of the operator will support at least the following versions of FDB:

  • All patches for the most recent minor version of FDB
  • The most recent patch for the second-most recent minor version of FDB

We may also retain support for older versions of FDB, depending on what is required to retain that support.
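To make the minimum support rule concrete, here is a small Go sketch that takes a list of released FDB versions and returns the set the rule requires: every patch of the newest minor version, plus the newest patch of the minor version before it. It treats a "minor version" as a major.minor pair, which is an assumption on my part, and the `FDBVersion` and `MinimumSupported` names are invented for the example.

```go
package main

import "sort"

// FDBVersion is a hypothetical major.minor.patch triple for an FDB release.
type FDBVersion struct {
	Major, Minor, Patch int
}

// less orders versions by major, then minor, then patch.
func less(a, b FDBVersion) bool {
	if a.Major != b.Major {
		return a.Major < b.Major
	}
	if a.Minor != b.Minor {
		return a.Minor < b.Minor
	}
	return a.Patch < b.Patch
}

// sameMinor reports whether two versions share a major.minor line.
func sameMinor(a, b FDBVersion) bool {
	return a.Major == b.Major && a.Minor == b.Minor
}

// MinimumSupported returns, in ascending order, the newest patch of the
// second-most recent minor version followed by all patches of the most
// recent minor version.
func MinimumSupported(released []FDBVersion) []FDBVersion {
	if len(released) == 0 {
		return nil
	}
	sorted := append([]FDBVersion(nil), released...)
	sort.Slice(sorted, func(i, j int) bool { return less(sorted[i], sorted[j]) })

	newest := sorted[len(sorted)-1]
	// Find the index of the first release in the newest minor line.
	first := len(sorted) - 1
	for first > 0 && sameMinor(sorted[first-1], newest) {
		first--
	}

	var supported []FDBVersion
	if first > 0 {
		// The release just before the newest line is the latest patch
		// of the second-most recent minor version.
		supported = append(supported, sorted[first-1])
	}
	return append(supported, sorted[first:]...)
}
```

Anything older than what this returns would fall under the "may also retain support" clause.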

Once a major version is released, bug fixes for older versions of the operator will be offered on a case-by-case basis.

The supported versions of FDB, Kubernetes, and the CRD for each major version of the operator will be documented in a version compatibility guide in the operator documentation.

Does this sound like a reasonable upgrade story?

Broadly yes; though I can suggest a couple of tweaks.

  • rather than scanning for deprecated fields, offer a tool to read in version N of the CRD and output version N+1 in undeprecated form. For instance, the volumeSize + storageClass can be upgraded to a volumeClaimTemplate by generating one in the processes hash under ‘general’ with only those fields populated; or if that is already filled, discarding those keys. The logic to do this already exists in the operator, by definition, so this should be a low-effort task - particularly if we refactor the operator’s code to upgrade on parsing rather than upgrading on evaluating, the way it does now.

  • support FDB versions according to the FDB support policy; if there isn’t one - perhaps that should be a lemma that is solved first? Something along the lines of ‘if FDB is still taking patches for a branch and publishing binaries, the operator should support it’ is what I’m thinking.
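The volumeSize + storageClass rewrite from the first suggestion could look roughly like the Go sketch below. The types are heavily simplified stand-ins and do not match the operator's real API types; `ClusterSpec`, `ProcessSettings`, `VolumeClaimTemplate`, and `Upgrade` are all assumptions made for the example.

```go
package main

// VolumeClaimTemplate is a simplified stand-in for a real claim template.
type VolumeClaimTemplate struct {
	StorageClass string
	Size         string
}

// ProcessSettings is a simplified stand-in for per-process-class settings.
type ProcessSettings struct {
	VolumeClaimTemplate *VolumeClaimTemplate
}

// ClusterSpec is a simplified stand-in for the cluster resource spec.
type ClusterSpec struct {
	VolumeSize   string // deprecated in this sketch
	StorageClass string // deprecated in this sketch
	Processes    map[string]ProcessSettings
}

// Upgrade returns a copy of the spec in undeprecated form. If "general" has
// no volume claim template, one is generated from the deprecated fields;
// if it already has one, the deprecated fields are simply discarded.
func Upgrade(spec ClusterSpec) ClusterSpec {
	out := spec
	// Copy the map so the caller's spec is left untouched.
	out.Processes = make(map[string]ProcessSettings, len(spec.Processes)+1)
	for name, settings := range spec.Processes {
		out.Processes[name] = settings
	}

	general := out.Processes["general"]
	hasDeprecated := spec.VolumeSize != "" || spec.StorageClass != ""
	if general.VolumeClaimTemplate == nil && hasDeprecated {
		general.VolumeClaimTemplate = &VolumeClaimTemplate{
			StorageClass: spec.StorageClass,
			Size:         spec.VolumeSize,
		}
		out.Processes["general"] = general
	}

	out.VolumeSize = ""
	out.StorageClass = ""
	return out
}
```

Because the function returns a new spec rather than editing in place, the same code could back both an offline conversion tool and an upgrade-on-parse step inside the operator.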

I think that giving a proposed replacement spec seems like a reasonable course. We’re pretty close to being at a point where the operator never updates the cluster spec, so at that point we should be able to have a single action for rewriting deprecated fields and then have the rest of the reconciliation read that without a risk of introducing conflicts with the spec as the user has written it.

I think in the long term we would want the operator to follow the same support model as the main project, and it’s worth discussing how to codify that. The operator may have reason to move more quickly in dropping older versions, since the Kubernetes solution is evolving more rapidly, and there are going to be features in FDB that dramatically change how we run things (e.g. supporting DNS names in cluster files, and the RPC layer). Hopefully it will be less painful with the operator, since it automates a decent chunk of the work of doing upgrades.

I can certainly see the pressure to move faster right now, but in my experience longer term the exact same pressures that keep folk on older versions will still apply in K8s environments: the fleet management aspect is managed - yes, but the client upgrades in all the things using fdb, the workload testing, confidence building, testing of replacements for deprecated features etc - these don’t go away.