FDB Forums: Prototyping simulation runs on Kubernetes

(Austin Seipp) #1

Hello *,

I’ve been working on improving the FoundationDB packaging for NixOS, and in the course of testing various new things for FoundationDB 6.1, I wanted to run simulation tests on a larger, more convenient scale. This is important for making sure the binaries shipped to users are reliable to some extent, and generally speaking this applies to any 3rd party packaging. Out of curiosity I set out to do this using Kubernetes. I’d like any feedback on this.

The result is a set of tooling available here. TL;DR: if you’re impatient, you can read the README and get a feel for it:

There are many details in the README and, if you follow the instructions, it should Pretty Much Work™.

It’s quite easy to build Docker images using Nix, which is the basis of this tooling. Essentially, I create a Docker image out of the FoundationDB Nix packages, equipped with a shell script to run simulation tests. The tests come directly from the source repository and are packaged into the distribution; they are the same ones run with ctest in the CMake build system. The list of included tests is here:

The wrapper script gives you a little tool for running any simulation test a number of times right out of the Docker image. For example, you can run the fast/AtomicOps.txt test like so:

$ docker run --rm foundationdb:6.1.5pre4879_91547e7 simulate fast/AtomicOps 10
NOTE: simulation test fast/AtomicOps (10 rounds)...
NOTE: running simulation #1 (seed = 0x136276c4)... ok
NOTE: running simulation #2 (seed = 0x2b807568)... ok
NOTE: running simulation #3 (seed = 0x8b3e79e3)... ok
NOTE: running simulation #4 (seed = 0x11581fb5)... ok
NOTE: running simulation #5 (seed = 0x88159bf0)... ok
NOTE: running simulation #6 (seed = 0xa27d94fb)... ok
NOTE: running simulation #7 (seed = 0x5081e040)... ok
NOTE: running simulation #8 (seed = 0x9ad8c268)... ok
NOTE: running simulation #9 (seed = 0x741db05f)... ok
NOTE: running simulation #10 (seed = 0x462fd316)... ok
NOTE: finished fast/AtomicOps; 10 total sim rounds, 10/10 successful sim runs
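The core of the wrapper is just a seed-per-round loop around fdbserver. Here is a minimal sketch of that shape (not the actual script from the repository; the tests directory path, log format, and the assumption of an fdbserver binary on PATH are all illustrative):

```shell
# Sketch of the "simulate" wrapper: run a named test N times, each round
# with a fresh random 32-bit seed, and tally successes. TESTS_DIR and the
# fdbserver invocation are assumptions for illustration.
TESTS_DIR="${TESTS_DIR:-/usr/share/foundationdb/tests}"

random_seed() {
  # 32-bit random seed, formatted like the 0x-prefixed seeds in the log above
  printf '0x%08x' "$(od -An -N4 -tu4 /dev/urandom | tr -d ' \n')"
}

simulate() {
  test_name="$1"; rounds="${2:-1}"; ok=0
  echo "NOTE: simulation test $test_name ($rounds rounds)..."
  i=1
  while [ "$i" -le "$rounds" ]; do
    seed=$(random_seed)
    printf 'NOTE: running simulation #%d (seed = %s)... ' "$i" "$seed"
    if fdbserver -r simulation -f "$TESTS_DIR/$test_name.txt" -s "$seed" \
         >/dev/null 2>&1; then
      echo "ok"; ok=$((ok + 1))
    else
      echo "FAILED"
    fi
    i=$((i + 1))
  done
  echo "NOTE: finished $test_name; $rounds total sim rounds, $ok/$rounds successful sim runs"
}
```

Each round gets its own seed so repeated rounds explore different schedules rather than replaying the same one.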

The intention here is that you then create a K8S batch job out of every simulation test, possibly with tweaked concurrency/memory limits, and then throw it all into a cluster a number of times to test things out. The remaining tooling is built on top of this basic image, e.g. a concurrent, 100-round simulation of fast/AtomicOps.txt is packaged into a job, like so:

$ cat result/simulation-fast-atomicops.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: simulation-fast-atomicops
  labels:
    group: simulation
    test: rare-fast-atomicops
spec:
  parallelism: 2
  completions: 4
  template:
    metadata:
      name: sim-fast-atomicops
      labels:
        group: simulation
        test: fast-atomicops
    spec:
      containers:
      - name: sim-fast-atomicops
        image: foundationdb:6.1.5pre4879_91547e7
        args: [ "simulate", "fast/AtomicOps", "25" ]
        resources:
          limits:
            memory: 768M
          requests:
            memory: 128M
      restartPolicy: Never
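Once the per-test jobs are generated, driving them is ordinary kubectl usage. A sketch (the label selectors follow the manifest above; KUBECTL defaults to a dry-run echo here so the shape is visible without a cluster):

```shell
# Sketch of submitting and monitoring the generated simulation jobs.
# KUBECTL defaults to printing the commands; set KUBECTL=kubectl to run
# them against a real cluster.
KUBECTL="${KUBECTL:-echo kubectl}"

$KUBECTL apply -f result/                       # submit every generated job
$KUBECTL get jobs -l group=simulation           # watch completions tick up
$KUBECTL logs -l test=fast-atomicops --tail=20  # spot-check one test's output
```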

This project isn’t done yet and is currently only used to smoke test my own FoundationDB packages. In the future it would be nice to:

  • Let people specify a source directory of their own (e.g. their own local working copy) to build and package into the image. It’ll be compiled, packaged, etc. automatically by Nix.
  • Adjust the limits, memory requirements, etc for each job to make the K8S scheduler’s life easier. Right now the estimates are all very conservative.

In the long run, this could perhaps help serve as a basis for an open way to run large scale simulation tests for FoundationDB builds, CI systems, etc (something that is, to my knowledge, only done by people like Apple, Snowflake, etc).

Some questions for feedback:

  • Does this seem like a remotely sane method of doing tests, or am I doing something horribly wrong?
    • E.g. is there a better way to achieve this with K8S?
    • Am I missing anything critical when running the simulation mode? (something like fdbserver -r simulation -f "${PATH_TO_TEST_FILE}" -s "${SEED_NUMBER}")
  • What kind of simulation numbers should we be aiming at for “reliable” testing? What numbers does Apple use? For example, you could in theory run every test thousands or tens of thousands of times if you desired with this framework. 1,000 runs per-test? 50,000? A mix depending on the test category (fast/slow/rare)?
  • Are these sets of tests I’ve chosen (in test-list.txt) correct and/or reasonable?
  • Would it be possible to update the CMake build system to install the .txt simulation test files? I do this in the NixOS package for FoundationDB, but it’s a bespoke change. The files are very small, and it might be nice to have the build system officially ship the set of tests that are expected to work reliably.

(Alex Miller) #2

I recently filed a set of issues about developer infrastructure. In particular, you’re welcome to show up on “Provide a tool for scaleable correctness testing”, as I’m totally willing to believe that hooking simulation up to batch Kubernetes jobs might be the most generally usable solution.

There’s also now “Centralize on tests/TestRunner”, as we already have some divergence in scripting for running simulation tests. If you can leverage TestRunner, I’d recommend doing so: it already supports correctly running restarting tests, and it will be maintained through any future changes to simulation.

If you run through enough simulations, you’ll probably find a failure sooner or later. Whenever you do, please feel free to file an issue. (git hash, buggify state, test seed, test file, unseed) is what we would need to be able to hunt it down.

The source code is filled with BUGGIFY macros that conditionally cause simulation to do something antagonistic to FDB. They’re controlled by the -b flag, so adding a mix of -b on and -b off to your tests will give more interesting results.

--crash is sometimes convenient, as it turns ASSERT() failures into crashes.
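Putting the two together, a small run matrix over buggify states might look like the following sketch (sim_cmd is a hypothetical helper that only prints the invocation; the -b and --crash flags are the ones described above, and the tests path is an assumption):

```shell
# Hypothetical helper: print (rather than run) an fdbserver simulation
# invocation for a given test, seed, and buggify state.
sim_cmd() {
  echo "fdbserver -r simulation" \
       "-f /usr/share/foundationdb/tests/$1.txt" \
       "-s $2 -b $3 --crash"
}

# Cover both buggify states for the same test and seed:
for buggify in on off; do
  sim_cmd fast/AtomicOps 0x136276c4 "$buggify"
done
```

Printing the commands like this makes it easy to feed the matrix to xargs or a job generator instead of running it inline.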

Restarting tests need a little bit of special handling. They’re run in pairs where, ideally, you run -1.txt with an older version of FDB, and then -2.txt with the current version, to test upgrading with old on-disk files. At minimum, you need to pass --restarting when running -2.txt, otherwise it won’t try to re-use the existing on-disk files, and you lose the whole benefit of the restarting tests.
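Driving such a pair might look like this sketch (the binary locations and test name are assumptions, and DRYRUN defaults to echo so the commands are printed rather than run; the essential details are an older fdbserver for the -1 half and --restarting on the -2 half):

```shell
# Sketch of a restarting-test pair. Run both halves from the same working
# directory so the second half sees the on-disk files the first left behind.
DRYRUN="${DRYRUN:-echo}"               # set DRYRUN= to actually execute
OLD_FDB=/opt/fdb-6.0/bin/fdbserver     # older version runs the -1 half
NEW_FDB=/opt/fdb-6.1/bin/fdbserver     # current version runs the -2 half

$DRYRUN "$OLD_FDB" -r simulation -f tests/restarting/SomeTest-1.txt
# --restarting makes the second half re-use the existing on-disk files;
# without it, the whole benefit of the restarting test is lost.
$DRYRUN "$NEW_FDB" -r simulation -f tests/restarting/SomeTest-2.txt --restarting
```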

Apple does runs where a random (test, seed) pair is chosen and then run. ~30k runs is the cutoff of what to let correctness grind to before posting a PR, but our correctness test running system runs to 100k by default. Snowflake does 48 runs per test, which puts them in the same ballpark.

The correct way to generate this list would be to pull it from tests/CMakeLists.txt, as there’s already a check in CMake that every testspec in tests/ exists as an entry in that file. You’d want to scrape out all the ones that don’t have IGNORE.
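Until a build target exists for this, a rough scrape might look like the following (a sketch that assumes one add_fdb_test(TEST_FILES ...) entry per line in that file, with IGNORE appended for excluded tests):

```shell
# Hypothetical scraper for the runnable test list: keep every
# add_fdb_test(TEST_FILES ...) entry that is not marked IGNORE, and
# print just the testspec path.
extract_tests() {
  grep 'add_fdb_test(TEST_FILES' "$1" \
    | grep -v 'IGNORE' \
    | sed -E 's/.*TEST_FILES[[:space:]]+([^ )]+).*/\1/'
}

# e.g.: extract_tests tests/CMakeLists.txt > test-list.txt
```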

You’re more than welcome to file an issue (or submit a PR) for a target to build the test list file.