I’m trying to get the end-to-end tests to run in my own environment (at least the basic test_operator suite). Being able to do that seems important for contributing.
This is how far I got:
When I run it with default settings, the test fails quickly: the default operator image it tries to load does not understand the flag replace-on-security-context-change, which the test sets, so the operator doesn’t start up.
I can convince the tests to load a test candidate from my own container registry, while loading the FDB unified image containers from docker.io.
The test starts in a Kubernetes cluster (self-hosted by my cloud infra department in the AWS cloud). I also have to specify the storage class explicitly: gp3 is tagged as default in my cluster, but maybe not exactly in the way the e2e test expects. With those tweaks, the test runs.
ENABLE_CHAOS_TESTS=false STORAGE_CLASS=gp3 REGISTRY= OPERATOR_IMAGE=<redacted>.dkr.ecr.us-east-1.amazonaws.com/fdb-kubernetes-operator:v1.54.0 UNIFIED_FDB_IMAGE=docker.io/foundationdb/fdb-kubernetes-monitor:7.1.57 make -kj -C e2e test_operator.run
Even with tests checked out exactly at the v1.54.0 version, I can’t get them to pass.
[ReportAfterSuite] Autogenerated ReportAfterSuite for --junit-report
autogenerated by Ginkgo
[ReportAfterSuite] PASSED [6.913 seconds]
------------------------------
Summarizing 15 Failures:
[FAIL] Operator when increasing the number of log Pods by one [It] should increase the count of log Pods by one [e2e, pr]
/home/ingowalther/fork/fdb-kubernetes-operator/e2e/test_operator/operator_test.go:1116
[FAIL] Operator when Replacing a Pod with PVC stuck in Terminating state [BeforeEach] should replace the PVC stuck in Terminating state [e2e, pr]
/home/ingowalther/fork/fdb-kubernetes-operator/e2e/fixtures/fdb_cluster.go:686
[FAIL] Operator when using the buggify option to ignore a process during the restart [BeforeEach] should not restart the process on the ignore list [e2e, pr]
/home/ingowalther/fork/fdb-kubernetes-operator/e2e/fixtures/fdb_cluster.go:1306
[FAIL] Operator when a process group has no address assigned and should be removed [BeforeEach] when automatic replacements are disabled should not remove the Pod as long as it is unschedulable [e2e, pr]
/home/ingowalther/fork/fdb-kubernetes-operator/e2e/test_operator/operator_test.go:1464
[FAIL] Operator when a process group has no address assigned and should be removed [BeforeEach] when automatic replacements are enabled should remove the Pod [e2e, pr]
/home/ingowalther/fork/fdb-kubernetes-operator/e2e/test_operator/operator_test.go:1464
[FAIL] Operator when a process is in the maintenance zone [BeforeEach] should not replace the process group [e2e, pr]
/home/ingowalther/fork/fdb-kubernetes-operator/e2e/fixtures/chaos_common.go:167
[FAIL] Operator when adding and removing a test process [BeforeEach] should create the test Pod [e2e, pr]
/home/ingowalther/fork/fdb-kubernetes-operator/e2e/fixtures/fdb_cluster.go:378
[FAIL] Operator when using proxies instead of grv and commit proxies [BeforeEach] should configure the database to run with GRV and commit proxies but keep the proxies in the status field of the FoundationDB resource [e2e, pr]
/home/ingowalther/fork/fdb-kubernetes-operator/e2e/test_operator/operator_test.go:1842
[FAIL] Operator when running with tester processes when there is a unidirectional partition between the tester and the rest of the cluster [BeforeEach] should show the status without any messages [e2e, pr]
/home/ingowalther/fork/fdb-kubernetes-operator/e2e/fixtures/chaos_common.go:167
[FAIL] Operator when the cluster makes use of DNS in the cluster file when all Pods are deleted [It] should recreate all Pods and bring the cluster into a healthy state again [e2e, pr]
/home/ingowalther/fork/fdb-kubernetes-operator/e2e/test_operator/operator_test.go:2101
[FAIL] Operator [AfterEach] when setting a locality that is using an environment variable should update the locality with the substituted environment variable [e2e, pr]
/home/ingowalther/fork/fdb-kubernetes-operator/e2e/fixtures/fdb_cluster.go:1448
[FAIL] Operator [AfterEach] when enabling the node watch feature should have enabled the node watch feature on all Pods [e2e, pr]
/home/ingowalther/fork/fdb-kubernetes-operator/e2e/fixtures/fdb_cluster.go:1448
[FAIL] Operator [AfterEach] when the Pod is set into isolate mode should shutdown the fdbserver processes of this Pod [e2e, pr]
/home/ingowalther/fork/fdb-kubernetes-operator/e2e/test_operator/operator_test.go:122
[FAIL] Operator when a new knob for storage servers is rolled out to the cluster [It] should update the locality with the substituted environment variable [e2e, pr]
/home/ingowalther/fork/fdb-kubernetes-operator/e2e/test_operator/operator_test.go:2665
[FAIL] Operator when a process is marked for removal and was excluded [BeforeEach] when the process gets included again should be excluded a second time [e2e, pr]
/home/ingowalther/fork/fdb-kubernetes-operator/e2e/test_operator/operator_test.go:2696
Ran 41 of 56 Specs in 47246.212 seconds
FAIL! -- 26 Passed | 15 Failed | 9 Pending | 6 Skipped
--- FAIL: TestOperator (47253.13s)
FAIL
FAIL github.com/FoundationDB/fdb-kubernetes-operator/e2e/test_operator 47253.275s
FAIL
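To get a first impression of where the failures cluster, the summary can be tallied by Ginkgo node type ([BeforeEach], [It], [AfterEach]). This is only a rough sketch: the function name summarize_failures and the log file name are made up for illustration, and it assumes the test output was saved to a file and that each summary line contains exactly one of those markers.

```shell
# Hypothetical helper: count Ginkgo "[FAIL]" summary lines per node type.
# Assumes the e2e output was saved to a file, e.g. with `... 2>&1 | tee e2e.log`.
summarize_failures() {
  awk '/\[FAIL\]/ {
         # Pick out the node type marker that Ginkgo prints in the summary line.
         if (match($0, /\[(BeforeEach|AfterEach|It)\]/)) {
           count[substr($0, RSTART, RLENGTH)]++
         }
       }
       END { for (k in count) print count[k], k }' "$1" | sort -rn
}

# Usage (file name is an assumption):
#   summarize_failures e2e.log
```

Applied to the summary above, this puts 9 of the 15 failures in [BeforeEach] blocks, that is, in test setup rather than in the assertions themselves.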
- Has anyone managed to run the tests in their own environment? Any experience to share on how you got them to pass?
- Is this kind of test duration expected for the test_operator suite (13 hours), or is my cluster control plane just unusually slow?
- Do those tests normally pass, or is a certain amount of flakiness to be expected?
- Do those errors look like timeouts or flakes? I can’t make sense of them yet.
- Is anyone using the “kind” local Kubernetes simulator to run the e2e tests? With a few tweaks to image reload policies, I managed to run the tests in kind, but they didn’t pass (with different errors that looked like timeouts).
Thanks for any help!