Each test file (more or less) contains a different scenario, and because each test is randomized, each test run (with a given file) is also, in some sense, a different scenario. (Each run also reports a seed that you can supply at startup to re-run the exact scenario that surfaced an error.)
Ideally, the developer would run each of the tests many times (with different seeds) before submitting a PR. We also run these tests ourselves, so we’d almost certainly catch something introduced in a PR before it was released into the wild, but it would be harder to track down at that point. If you have some idea of which part of the code you are affecting, you could probably get away with running only a subset of the tests, at least initially. (For example, if you made a change to the backup code, you might want to run the tests that have “Backup” in the name.)
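To make the “run only the backup tests” idea concrete, here’s a small sketch. The `tests/**/ *Backup*` glob and the `fdbserver -r simulation -f <file> -s <seed> -b on` invocation follow the repo layout and the documented simulation command, but treat the exact paths and flags as assumptions and double-check `fdbserver --help` for your build:

```python
# Sketch (hypothetical paths): build one simulation command line per
# test file with "Backup" in its name, each with its own random seed.
import glob
import random

def backup_test_commands(test_root="tests"):
    """Return fdbserver command lines for every Backup-named test.

    If a run fails, re-running the same file with the same -s value
    reproduces the exact scenario.
    """
    cmds = []
    for path in sorted(glob.glob(f"{test_root}/**/*Backup*", recursive=True)):
        seed = random.randrange(2**32)
        cmds.append(f"fdbserver -r simulation -f {path} -s {seed} -b on")
    return cmds

if __name__ == "__main__":
    for cmd in backup_test_commands():
        print(cmd)
```

Run from the root of a source checkout, this prints the commands rather than executing them, so you can eyeball the list (or pipe it to `sh`) before committing a night to it.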
Hm, I may be mistaken. I think you also want to run the status tests, but I could be wrong. The other directories are: (1) python_tests, which is only about testing the Python bindings, so it doesn’t really make sense to run under simulation; (2) restarting, which requires that you first run the server at an older version, stop it, and then start it at a newer version (these are good to run, especially if you change anything about the on-disk format, but they require that you have multiple fdbserver versions lying around); and (3) the tests that sit directly in the tests directory, which are generally more about performance than correctness, so they don’t make all that much sense to run in simulation.
I guess it depends on what you mean. Most of our documentation is oriented towards someone trying to use FoundationDB rather than develop it, so there aren’t a lot of docs. The page you found about simulation and testing is probably the closest thing we had to docs, I suppose. But there isn’t, to my knowledge, a comprehensive guide on which tests do what.
For simulation, it’s essentially what I outlined: a test harness picks a random test and a random seed, runs the simulation test, and then sees whether it succeeds or fails. This happens in a loop, so over the course of a night you can get many, many runs. For the performance tests, we run the built-in performance tester that has come up elsewhere on the forums. It’s just a matter of setting up a real (not simulated) FDB cluster, running the load tester that’s in the binaries, and collecting statistics.
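That harness loop can be sketched in a few lines. Here, `run_one` shells out to `fdbserver -r simulation` with a random test and seed (the flag names follow the documented simulation invocation, but verify them against your build); the demo at the bottom substitutes a stub command so the sketch runs without an FDB build:

```python
# Sketch of the pick-a-random-test-and-seed loop described above.
import random
import subprocess

def run_one(fdbserver, test_file, seed):
    """Run a single simulation test; return True if it passed."""
    result = subprocess.run(
        [fdbserver, "-r", "simulation", "-f", test_file,
         "-s", str(seed), "-b", "on"],
        capture_output=True,
    )
    return result.returncode == 0

def harness(fdbserver, test_files, runs):
    """Pick a random test and seed each iteration; collect failing seeds.

    Each (test_file, seed) pair in the returned list can be reproduced
    exactly by re-running that file with the same -s seed.
    """
    failures = []
    for _ in range(runs):
        test_file = random.choice(test_files)
        seed = random.randrange(2**32)
        if not run_one(fdbserver, test_file, seed):
            failures.append((test_file, seed))
    return failures

if __name__ == "__main__":
    # Stand-in for the real binary so the sketch is runnable anywhere:
    # `true` ignores its arguments and always exits 0.
    print(harness("true", ["tests/fast/CycleTest.txt"], runs=3))
```

With the `true` stub every run “passes”, so the demo prints an empty failure list; pointing `fdbserver` at a real binary and `test_files` at a real test directory gives you the overnight loop.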
Generally, pass/fail should be enough. We have some API stability tests already to make sure that the external contract is preserved (which is generally the behavior that we care the most about keeping consistent across versions), though that could definitely be improved, so if a change might affect the external API, there might be slightly more work to be done to verify that the contract hasn’t changed. (Sorry if this answer is somewhat vague.)