Not totally sure where to ask this, but I thought the FDB crowd might have ideas is anyone familiar with other interesting systems that use variants of simulation / deterministic testing? Will Wilson’s talk (https://www.youtube.com/watch?v=4fFDFbi3toc) remains one of my favorites, and I’m curious to gather more data points in how to practically integrate those ideas into applications.
I found this while researching if the Go scheduler could do this kind of thing. According to the author’s LinkedIn profile, he worked at LeftHand Networks from 2006-2008, which made a SAN product and was acquired by HP.
It is actually a bit funny how close the description of the benefits for FDB from various people match the author’s descriptions here. A SAN and a database are pretty similar if you squint, so it makes sense why they would take this approach.
This is more difficult than just implementing deterministic scheduling – all sources of nondeterminism must be eliminated. I worked on a complex distributed block-level storage product (written in C) that supported this kind of pseudo-random testing. We had to simulate (mock) all interactions with the outside world, including networking, file access and, most importantly, time itself. The network and files were simulated using memory buffers. The entire simulated cluster of nodes ran within a single executable (one address space, one scheduler), which was single-threaded (we didn’t use threads since we had no deterministic thread scheduler, among other reasons). Each fake network message and file IO access would appear to take a (pseudo) random variable amount of time. The scheduler (which was specific to this test mode) would use a priority queue to determine at what time the next event (network message arrival, whatever) occurs, and then simply advance the simulated clock to that time. So simulation time usually moved faster than real time!
Every failure was guaranteed to be reproducible. It was amazing. We ran this test (with different random number generator starting seeds) on racks of systems 24 hours a day. It found many unbelievably subtle timing-window bugs.
One enormous benefit of doing all this is that, when a failure is found, you can enable (or add) arbitrary amounts of tracing then re-run that seed. You can also add massive amounts of checking / audit code and possibly catch the bug earlier. These changes have no effect on simulated time. (I’m sure you’ve all seen the Heisenberg Uncertainty Principle at work when you try to reproduce bugs on a real system with tracing enabled.)
Your comment, rsc, about not knowing if you’ve fixed the bug is very astute. If necessary to get around this problem, I would actually “time-bomb” the fix, such that it would only kick in just at the (simulated) moment the wrong decision was made (I knew when that was from tracing output), so the run would reproduce to that exact state, and then I could see it continue from that point okay (or not okay if the fix didn’t work or had some bad side-effect).
Since becoming interested in Go, I’ve thought about this very idea, but I agree that it’s a huge project, plus it takes significant forethought when designing the code. It would be great, though.
https://github.com/osrg/namazu also exists, which can do sort of simulation-like testing of systems that weren’t written for it.
There’s a few random database/storage startup people that I’ve talked to over time that have built some amount of a simulation-like system, but none of them open and released or something I could point at.
It is relatively common to build something simulation-like for verifying paxos/raft implementations, but it’s generally not extended to the whole system.