Running Kubernetes Operator e2e tests

I’m trying to get the end-to-end tests to run in my own environment (at least the basic test_operator suite). Being able to do this seems important for contributing.

This is how far I got:
When I run it with the default settings, the test fails quickly: the default operator image it tries to load does not understand the replace-on-security-context-change flag that the test sets, so the operator doesn’t start up.
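
One way to confirm it really is the flag is to look at the operator pod directly; the namespace and label selector below are assumptions about how the e2e framework deploys the operator, not something taken from its code:

# Sketch of the check (namespace and label are illustrative assumptions).
kubectl -n "$OPERATOR_NAMESPACE" get pods -l app=fdb-kubernetes-operator
# The log tail shows an unknown-flag error for replace-on-security-context-change.
kubectl -n "$OPERATOR_NAMESPACE" logs -l app=fdb-kubernetes-operator --tail=20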

I can convince the tests to load a test candidate from my own container registry while pulling the FDB unified image containers from docker.io.
The tests run against a Kubernetes cluster (self-hosted by my cloud infra department on AWS). I also have to specify the storage class explicitly; gp3 is tagged as default in my cluster, but perhaps not in the way the e2e test expects. With those tweaks, the test runs.

ENABLE_CHAOS_TESTS=false STORAGE_CLASS=gp3 REGISTRY= OPERATOR_IMAGE=<redacted>.dkr.ecr.us-east-1.amazonaws.com/fdb-kubernetes-operator:v1.54.0 UNIFIED_FDB_IMAGE=docker.io/foundationdb/fdb-kubernetes-monitor:7.1.57 make -kj -C e2e test_operator.run
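
For completeness, this is roughly how I check whether gp3 is actually marked as the default class; whether the test framework keys off exactly this annotation is an assumption on my part:

# "(default)" in this output comes from the storageclass.kubernetes.io/is-default-class annotation.
kubectl get storageclass
# Show the annotations on gp3 explicitly.
kubectl get storageclass gp3 -o jsonpath='{.metadata.annotations}'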

Even with the tests checked out exactly at v1.54.0, I can’t get them all to pass.

[ReportAfterSuite] Autogenerated ReportAfterSuite for --junit-report
autogenerated by Ginkgo
[ReportAfterSuite] PASSED [6.913 seconds]
------------------------------

Summarizing 15 Failures:
  [FAIL] Operator when increasing the number of log Pods by one [It] should increase the count of log Pods by one [e2e, pr]
  /home/ingowalther/fork/fdb-kubernetes-operator/e2e/test_operator/operator_test.go:1116
  [FAIL] Operator when Replacing a Pod with PVC stuck in Terminating state [BeforeEach] should replace the PVC stuck in Terminating state [e2e, pr]
  /home/ingowalther/fork/fdb-kubernetes-operator/e2e/fixtures/fdb_cluster.go:686
  [FAIL] Operator when using the buggify option to ignore a process during the restart [BeforeEach] should not restart the process on the ignore list [e2e, pr]
  /home/ingowalther/fork/fdb-kubernetes-operator/e2e/fixtures/fdb_cluster.go:1306
  [FAIL] Operator when a process group has no address assigned and should be removed [BeforeEach] when automatic replacements are disabled should not remove the Pod as long as it is unschedulable [e2e, pr]
  /home/ingowalther/fork/fdb-kubernetes-operator/e2e/test_operator/operator_test.go:1464
  [FAIL] Operator when a process group has no address assigned and should be removed [BeforeEach] when automatic replacements are enabled should remove the Pod [e2e, pr]
  /home/ingowalther/fork/fdb-kubernetes-operator/e2e/test_operator/operator_test.go:1464
  [FAIL] Operator when a process is in the maintenance zone [BeforeEach] should not replace the process group [e2e, pr]
  /home/ingowalther/fork/fdb-kubernetes-operator/e2e/fixtures/chaos_common.go:167
  [FAIL] Operator when adding and removing a test process [BeforeEach] should create the test Pod [e2e, pr]
  /home/ingowalther/fork/fdb-kubernetes-operator/e2e/fixtures/fdb_cluster.go:378
  [FAIL] Operator when using proxies instead of grv and commit proxies [BeforeEach] should configure the database to run with GRV and commit proxies but keep the proxies in the status field of the FoundationDB resource [e2e, pr]
  /home/ingowalther/fork/fdb-kubernetes-operator/e2e/test_operator/operator_test.go:1842
  [FAIL] Operator when running with tester processes when there is a unidirectional partition between the tester and the rest of the cluster [BeforeEach] should show the status without any messages [e2e, pr]
  /home/ingowalther/fork/fdb-kubernetes-operator/e2e/fixtures/chaos_common.go:167
  [FAIL] Operator when the cluster makes use of DNS in the cluster file when all Pods are deleted [It] should recreate all Pods and bring the cluster into a healthy state again [e2e, pr]
  /home/ingowalther/fork/fdb-kubernetes-operator/e2e/test_operator/operator_test.go:2101
  [FAIL] Operator [AfterEach] when setting a locality that is using an environment variable should update the locality with the substituted environment variable [e2e, pr]
  /home/ingowalther/fork/fdb-kubernetes-operator/e2e/fixtures/fdb_cluster.go:1448
  [FAIL] Operator [AfterEach] when enabling the node watch feature should have enabled the node watch feature on all Pods [e2e, pr]
  /home/ingowalther/fork/fdb-kubernetes-operator/e2e/fixtures/fdb_cluster.go:1448
  [FAIL] Operator [AfterEach] when the Pod is set into isolate mode should shutdown the fdbserver processes of this Pod [e2e, pr]
  /home/ingowalther/fork/fdb-kubernetes-operator/e2e/test_operator/operator_test.go:122
  [FAIL] Operator when a new knob for storage servers is rolled out to the cluster [It] should update the locality with the substituted environment variable [e2e, pr]
  /home/ingowalther/fork/fdb-kubernetes-operator/e2e/test_operator/operator_test.go:2665
  [FAIL] Operator when a process is marked for removal and was excluded [BeforeEach] when the process gets included again should be excluded a second time [e2e, pr]
  /home/ingowalther/fork/fdb-kubernetes-operator/e2e/test_operator/operator_test.go:2696

Ran 41 of 56 Specs in 47246.212 seconds
FAIL! -- 26 Passed | 15 Failed | 9 Pending | 6 Skipped
--- FAIL: TestOperator (47253.13s)
FAIL
FAIL    github.com/FoundationDB/fdb-kubernetes-operator/e2e/test_operator       47253.275s
FAIL
  • Has anyone managed to run the tests in their own environment? Any experience to share on how you got them to pass?
  • Is this kind of test duration expected for the test_operator suite (13 hours), or is my cluster control plane just unusually slow?
  • Do those tests normally pass, or is a certain amount of flakiness to be expected?
  • Do those errors look like timeouts or flakes? I can’t make sense of them yet.
  • Is anyone using “kind” (the local Kubernetes-in-Docker tool) to run the e2e tests? With a few tweaks to how images are loaded, I managed to run the tests in kind, but they didn’t pass (with different errors that looked like timeouts).

Thanks for any help!

Hello,

Regarding the failure with the latest image: we had a bug in our GitHub Actions that prevented the latest image from being updated: Fix PR action for docker build by johscheuer · Pull Request #2225 · FoundationDB/fdb-kubernetes-operator · GitHub. The latest tag should now be pushed whenever new changes are merged into main.

I’m not sure about the storage class setup in your Kubernetes cluster, but the test framework makes use of this annotation: Storage Classes | Kubernetes.
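
If gp3 should be picked up automatically, marking it as the default class should be enough; a minimal sketch, assuming the standard default-class annotation is what applies in your cluster:

# Mark gp3 as the default storage class via the standard Kubernetes annotation.
kubectl annotate storageclass gp3 storageclass.kubernetes.io/is-default-class=true --overwrite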

  • Has anyone managed to run the tests in their own environment? Any experience to share on how you got them to pass?
  • Is this kind of test duration expected for the test_operator suite (13 hours), or is my cluster control plane just unusually slow?

We run all tests in the test suite that are labeled nightly (the pr-labeled ones run during PRs against the operator repo). Most of the time those tests take about 3 hours to complete.
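
If you want to reproduce roughly what CI runs, you can filter by those Ginkgo labels directly. This bypasses the Makefile and still needs the same environment variables as your make invocation, so treat it as a sketch rather than a documented workflow:

# Run only the specs labeled "pr" (standard Ginkgo v2 label filtering).
cd e2e/test_operator && ginkgo --label-filter="pr" --timeout=4h .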

  • Do those tests normally pass, or is a certain amount of flakiness to be expected?

Most tests pass consistently, but depending on the underlying infrastructure we have seen cases where EKS or something else caused a flaky test. We try to make the test suite as reliable as possible, but if you observe any flakiness, feel free to open a GitHub issue with some additional information.

  • Do those errors look like timeouts or flakes? I can’t make sense of them yet.

We have seen both in the past, depending on what the test suite is validating.

  • Is anyone using “kind” (the local Kubernetes-in-Docker tool) to run the e2e tests? With a few tweaks to how images are loaded, I managed to run the tests in kind, but they didn’t pass (with different errors that looked like timeouts).

We don’t use kind right now.

This is helpful. I experimented with a more standard cluster (a plain AWS EKS cluster). The tests failed in a very similar way, but now I can experiment with the cluster’s settings. Using larger nodes (r6a.4xlarge instead of the default) got a lot more tests to pass, and test cases take on average about a quarter of the time they did with the standard node sizes that EKS picks. However, I end up with only two nodes in the cluster; yesterday one of them failed during a test run, the cluster lost 3 of its 5 coordinators, and the test got stuck. So maybe there’s some middle ground.
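
As a concrete idea of that middle ground: more but somewhat smaller nodes, so that losing a single node cannot take out three of five coordinators. The values below are guesses, not a tested recommendation:

# Hypothetical EKS node group sized to spread 5 coordinators across failure domains.
eksctl create nodegroup \
  --cluster my-e2e-cluster \
  --name fdb-e2e-nodes \
  --node-type r6a.2xlarge \
  --nodes 4 --nodes-min 4 --nodes-max 6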

I still can’t get all the tests to pass, but the remaining issues look like timing problems or issues with the PVCs. I won’t be able to work on this for the next week; after that, I’ll see if I can get it to work. If someone would like to share their storage setup (it looks like others use EKS, too?), that would be much appreciated. In the meantime, I wrote up some of my findings in Improvements to e2e tests and documentation by iwalther · Pull Request #2233 · FoundationDB/fdb-kubernetes-operator · GitHub in case someone finds it useful.
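
For comparison, the kind of storage setup I have in mind is a plain gp3 StorageClass like the one below; the provisioner assumes the EBS CSI driver add-on is installed, and the parameters are illustrative rather than something I have verified to make the suite pass:

# Hypothetical gp3 StorageClass for EKS (EBS CSI driver); values are illustrative.
kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
EOF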