From the FDB Forums: prototyping simulation runs on Kubernetes

I recently filed a set of issues about developer infrastructure. In particular, you’re welcome to chime in on “Provide a tool for scaleable correctness testing”, as I’m entirely willing to believe that hooking simulation up to batch Kubernetes jobs might be the most generally usable solution.

There’s also now “Centralize on tests/TestRunner”, as we already have some divergence in the scripting for running simulation tests. If you can leverage TestRunner, I’d recommend doing so: it already supports correctly running restarting tests and will be maintained through any future changes to simulation.

If you run enough simulation tests, you’ll probably find a failure at some point. When you do, please feel free to file an issue. (git hash, buggify state, test seed, test file, unseed) is what we’d need to hunt it down.

The source code is filled with BUGGIFY macros that conditionally cause simulation to do something antagonistic to FDB. They’re controlled by the -b flag, so adding a mix of -b on and -b off to your tests will give more interesting results.

--crash is sometimes convenient, as it turns ASSERT() failures into crashes.
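As a sketch of what a batch of such runs might look like, here’s a small helper that builds simulation invocations mixing -b on and -b off (the binary path and test file are placeholders; the flag names follow the standard fdbserver simulation workflow):

```python
import random

def sim_command(test_file, seed, buggify):
    """Build one fdbserver simulation invocation.

    The binary path and test file used below are placeholders; adjust
    them for your build tree."""
    return [
        "bin/fdbserver",                    # placeholder path to your build
        "-r", "simulation",                 # run the simulator role
        "-f", test_file,                    # testspec to execute
        "-s", str(seed),                    # seed, for reproducibility
        "-b", "on" if buggify else "off",   # enable/disable BUGGIFY
        "--crash",                          # turn ASSERT() failures into crashes
    ]

# A batch mixing -b on and -b off across several seeds.
for seed in range(1, 6):
    cmd = sim_command("tests/fast/AtomicOps.toml", seed, random.random() < 0.5)
    print(" ".join(cmd))
```

Flipping a coin per run (rather than per batch) keeps both buggify modes represented in every batch.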

Restarting tests need a little bit of special handling. They’re run in pairs where, ideally, you run -1.txt with an older version of FDB, and then -2.txt with the current version, to test upgrading with old on-disk files. At minimum, you need to pass --restarting when running -2.txt; otherwise it won’t try to reuse the existing on-disk files, and you lose the whole benefit of the restarting tests.
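The pairing above can be sketched as follows (binary paths are hypothetical placeholders; the -1.txt/-2.txt naming and --restarting flag are as described):

```python
def restarting_pair(test_base, seed,
                    old_binary="bin/fdbserver-old",  # placeholder: an older release's binary
                    new_binary="bin/fdbserver"):     # placeholder: the current build
    """Build the two invocations of a restarting-test pair: the -1.txt
    half runs under the older binary, the -2.txt half under the current
    one with --restarting so it reuses the on-disk files."""
    first = [old_binary, "-r", "simulation",
             "-f", f"{test_base}-1.txt", "-s", str(seed)]
    second = [new_binary, "-r", "simulation",
              "-f", f"{test_base}-2.txt", "-s", str(seed),
              "--restarting"]  # without this, the existing on-disk state is discarded
    return first, second
```

Both halves must run in the same working directory, since the second half picks up the files the first half left behind.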

Apple does runs where a random (test, seed) pair is chosen, and then run. ~30k runs is roughly the threshold to let correctness grind through before posting a PR, though our correctness test running system runs to 100k by default. Snowflake does 48 runs per test, which puts them in the same ballpark.
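A minimal sketch of that random scheduling, assuming each run draws a test uniformly and a fresh 31-bit seed (the seed width is an assumption, not something the workflow mandates):

```python
import random

def pick_runs(tests, n_runs, rng=None):
    """Draw n_runs random (test, seed) pairs: test chosen uniformly,
    seed drawn fresh per run from a 31-bit range (an assumed width)."""
    rng = rng or random.Random()
    return [(rng.choice(tests), rng.randrange(2**31)) for _ in range(n_runs)]

# e.g. a tiny batch over two hypothetical testspecs
batch = pick_runs(["fast/AtomicOps.toml", "slow/SwizzledCycleTest.txt"], 5)
```

Recording the chosen (test, seed) pair for each run is what lets you report reproducible failures later.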

The correct way to generate this list would be to pull it from tests/CMakeLists.txt, since CMake already checks that every testspec in tests/ has an entry in that file. You’d want to scrape out all the entries that aren’t marked IGNORE.
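A rough sketch of that scrape, assuming the entries in tests/CMakeLists.txt look like `add_fdb_test(TEST_FILES <path> [IGNORE])` (verify the exact form against your checkout):

```python
import re

def runnable_tests(cmake_text):
    """Pull testspec paths out of tests/CMakeLists.txt text, skipping
    any entry marked IGNORE. Assumes entries of the form
    add_fdb_test(TEST_FILES <path> [IGNORE])."""
    tests = []
    for match in re.finditer(r"add_fdb_test\(([^)]*)\)", cmake_text):
        args = match.group(1).split()
        if "IGNORE" in args:
            continue  # ignored tests don't go in the run list
        if "TEST_FILES" in args:
            idx = args.index("TEST_FILES")
            # Keep the paths after TEST_FILES, skipping ALL-CAPS keywords.
            tests.extend(a for a in args[idx + 1:] if not a.isupper())
    return tests

sample = """
add_fdb_test(TEST_FILES fast/AtomicOps.toml)
add_fdb_test(TEST_FILES broken/BadTest.txt IGNORE)
"""
print(runnable_tests(sample))  # -> ['fast/AtomicOps.toml']
```

Because the list comes straight from CMakeLists.txt, it stays in sync as tests are added or retired.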

You’re more than welcome to file an issue (or submit a PR) for a target to build the test list file.