A precise test oracle for FoundationDB simulation workloads

Goal: be able to detect database property failures in simulation very precisely even when running very weird randomized workloads, which might sometimes deliberately weaken serializability or just be nondeterministic in nature. See for example issue 126 which I guess must have slipped through a crack in existing workloads?

The basic idea is that rather than try to determine if an arbitrary result is consistent with some legal behavior of the database, which is a challenging search problem in general, we take advantage of the fact that FoundationDB’s implementation actually constructs a (supposedly) valid serialization order to check a strictly stronger property.

Start with an arbitrary concurrent workload, running in simulation, and using the API pretty freely (but I guess staying away from system keys and management APIs). The distribution of behavior, of course, matters and there is plenty of room for creativity.

Have each simulated workload client keep (out of band, safe from simulated failures, and above the level of the RYW API) a record of all its read and write transactions, including: the true simulation times before and after each attempt, an identifier unique to each transaction attempt (retry), the snapshot version and (where commit() succeeds) the commit version, and the details of reads and writes. It’s probably OK to store only a hash of the results for reads.
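
For concreteness, a minimal sketch of what each client might keep per transaction attempt; all of these types and field names are just illustrative, not existing FDB classes:

#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// One logged operation; reads store only a hash of their result.
struct LoggedOp {
    enum Type { READ, SNAPSHOT_READ, SET, CLEAR, ATOMIC_OP } type;
    std::string begin, end;            // key, or [begin, end) for range operations
    std::optional<std::string> param;  // value / atomic operand, if any
    uint64_t resultHash = 0;           // for reads
};

// One transaction attempt (retry), recorded out of band.
struct LoggedTransaction {
    std::string uid;                              // unique per attempt
    double simTimeBefore = 0, simTimeAfter = 0;   // true simulation times around the attempt
    int64_t snapshotVersion = -1;                 // read version used
    int64_t commitVersion = -1;                   // set only if commit() returned successfully
    bool sawNotCommitted = false;                 // commit() threw not_committed
    std::vector<LoggedOp> ops;                    // reads and writes, in issue order
    bool hasWrites() const {
        for (auto& op : ops)
            if (op.type != READ && op.type != SNAPSHOT_READ) return true;
        return false;
    }
};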

Augment each transaction with a versionstamped operation (either key or value) that includes the unique identifier.

After the workload is complete, from looking at the versionstamped operation log we know the final serialization order of all transactions that were attempted by any client: if the transaction was read only, it serializes at (the end of?) the snapshot version. If it was read/write and does not appear in the operation log, it did not happen; if it does appear, it should serialize in exactly that order.
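
For illustration, a minimal sketch of turning the logs into that order. It assumes the LoggedTransaction type sketched above, and that the versionstamped-operation log has already been decoded into a per-uid serialization point (version plus order within the version); the names are made up:

#include <algorithm>
#include <climits>
#include <map>
#include <string>
#include <vector>

struct SerialPoint { int64_t version; int order; };   // decoded from a 10-byte versionstamp

std::vector<const LoggedTransaction*> serializationOrder(
        const std::vector<LoggedTransaction>& logs,
        const std::map<std::string, SerialPoint>& stampByUid /* from the versionstamped-op log */) {
    std::vector<std::pair<std::pair<int64_t, int>, const LoggedTransaction*>> order;
    for (const auto& t : logs) {
        auto it = stampByUid.find(t.uid);
        if (it != stampByUid.end())
            order.push_back({{it->second.version, it->second.order}, &t});  // read/write: at its versionstamp
        else if (!t.hasWrites() && t.snapshotVersion != -1)
            order.push_back({{t.snapshotVersion, INT_MAX}, &t});  // read only: end of its snapshot version
        // else: a writing transaction missing from the log did not happen; skip it here
        // (check (1) below separately flags it if the client saw commit() succeed)
    }
    std::sort(order.begin(), order.end(),
              [](const auto& a, const auto& b) { return a.first < b.first; });
    std::vector<const LoggedTransaction*> result;
    for (auto& p : order) result.push_back(p.second);
    return result;
}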

Combining this information with the out of band logs from each client, we can check, separately:
(1) that the serialization order is compatible with the before and after times recorded for each transaction (external consistency), that every transaction the client saw commit() succeed for did serialize (durability), and that transactions for which the client received not_committed() did not serialize, and
(2) that every read and the final contents of the database are (isolation, atomicity, and many other properties) exactly what would be predicted by executing each operation sequentially in the serialization order (using MemoryKeyValueStore, and perhaps ideally an independent implementation of the atomic mutations as in the specific workload for that?). A sketch of this sequential replay follows below.
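
Here is a rough sketch of check (2): a std::map stands in for MemoryKeyValueStore, hashRange() is a hypothetical stand-in for however the workload hashed its read results, and atomic mutations plus the snapshot-read subtleties discussed later in this thread are glossed over. It assumes the LoggedOp/LoggedTransaction types sketched above:

#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: hash the contents of [begin, end) in the in-memory store.
uint64_t hashRange(const std::map<std::string, std::string>& db,
                   const std::string& begin, const std::string& end);

void replayAndCheck(const std::vector<const LoggedTransaction*>& order,
                    const std::map<std::string, std::string>& actualFinalContents) {
    std::map<std::string, std::string> db;   // stands in for MemoryKeyValueStore
    for (const LoggedTransaction* t : order) {
        for (const LoggedOp& op : t->ops) {
            switch (op.type) {
            case LoggedOp::READ:
            case LoggedOp::SNAPSHOT_READ:    // snapshot reads turn out to need separate treatment (see below)
                assert(op.resultHash == hashRange(db, op.begin, op.end));
                break;
            case LoggedOp::SET:
                db[op.begin] = op.param.value();
                break;
            case LoggedOp::CLEAR:            // single-key clears are logged with end = keyAfter(key)
                db.erase(db.lower_bound(op.begin), db.lower_bound(op.end));
                break;
            case LoggedOp::ATOMIC_OP:
                // ideally an independent implementation of each atomic mutation goes here
                break;
            }
        }
    }
    assert(db == actualFinalContents);       // the final contents must also match
}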

Together these properties should more or less precisely validate the database’s contract. The only restriction on workloads is that a transaction can’t perform other writes without also including the versionstamped write, but it is hard for me to imagine a realistic bug that couldn’t manifest under that restriction.

It would be interesting to implement this technique as a wrapper on the transaction API, so that it could easily be reused by different workloads.

Thoughts?

I really like this test idea Dave.

To catch issue 126 this would have to be implemented above the read your writes transaction, but that would make it more difficult to validate because we would have to re-implement read your writes at that level. My thoughts would be to have a separate test that would catch that issue, and put this code at the NativeAPI level.

I don’t think you need to re-implement read your writes to get it under test. The whole idea of read your writes is to exactly emulate the behavior of carrying out each operation sequentially. So you loop over transactions sequentially and then over operations within the transaction sequentially and execute them one at a time to find the correct answer. I think the serialization order for RYW is supposed to be exactly the issue order. Right?

For this to catch the class of problems found in #126, I believe one would have to additionally check that all transactions stored with the same versionstamp don’t conflict? Otherwise we’d still have nothing that covers testing of adding read or write conflicts.

It’s interesting to note that this would handle replayed commits correctly.

This is, evidently, a good question, because I’ve flip-flopped on whether you are right several times in the course of responding.

I think that my proposal as written works for #126 (as detailed below), but in fact has a closely related soundness bug: it will falsely complain about the behavior of regular snapshot reads in a read/write transaction (since I said that a read/write transaction serializes at its commit versionstamp, and after all snapshot reads are not serializable). Trying to fix that in the obvious way by treating snapshot reads as taking place in a separate read only transaction at the snapshot version then leads to the problem you are concerned about.

I would keep looking for a more precise and elegant solution, but maybe it is necessary to look at the read conflict ranges and promote snapshot reads that have a read conflict range to serializable for the purposes of the oracle. (That still might not be 100% precise for cases involving extra conflict ranges. It’s possible to order two particular transactions through conflict ranges that don’t involve any actual reads or writes, but if we don’t detect this then the oracle wouldn’t promote the reads to serializable and so might not validate that it worked.)

Another strategy, which I think you allude to, would be to add a third class of check that the serialization order is consistent with the conflict ranges of each transaction.
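
A minimal sketch of that third class of check, using a deliberately slow list of (write conflict range, commit version) pairs rather than anything resembling the real resolver; ConflictTxn and the range representation are made-up types, and the input is assumed to be the committed transactions in serialization order:

#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

struct Range { std::string begin, end; };   // half-open [begin, end)
struct ConflictTxn {
    int64_t readVersion, commitVersion;
    std::vector<Range> readConflictRanges, writeConflictRanges;
};

static bool intersects(const Range& a, const Range& b) {
    return a.begin < b.end && b.begin < a.end;
}

void checkConflictRanges(const std::vector<ConflictTxn>& committedInOrder) {
    std::vector<std::pair<Range, int64_t>> writes;   // (write conflict range, commit version) seen so far
    for (const ConflictTxn& t : committedInOrder) {
        for (const Range& r : t.readConflictRanges)
            for (const auto& w : writes)
                // Something that committed after our read version wrote into one of our
                // read conflict ranges, so the database should have rejected this commit.
                assert(!(w.second > t.readVersion && intersects(r, w.first)));
        for (const Range& w : t.writeConflictRanges)
            writes.push_back({w, t.commitVersion});
    }
}

It’s quadratic, but that’s fine for simulation-sized histories, and since it looks nothing like the resolver’s skip list it is unlikely to share its bugs.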


If the workload does the exact operations (with the same real-time ordering) that Alec described in issue #126, because of the bug the transactions tr1 and tr2 do not conflict, and they serialize in order. Let’s say the read snapshots of both transactions are version 100, the commit version of tr1 is 101 and the commit version of tr2 is 102.

When the test oracle replayed the operations, it would do so in the serialization order (not the real-time ordering), ignoring versions and conflict ranges entirely (and skipping transactions which have no place in the serialization order, though there are none in this example). Because the serialization order generated by the database broke the contract, it won’t see the same results:

(tr1) read(k) -> 1
(tr1) set(k,2)

(tr2) read(k) -> 2 (!!! ERROR !!! Got 2, expected 1!)

Ah, yes, I agree that your proposal would cover #126, as the specific bug dealt with operations that overlapped with added conflict ranges, and that it covers checking if the serializable schedule recorded on the tlogs is actually a correct schedule for the operations performed. And in the ~5min our team had previously spent discussing your comment on #126, we agreed that it’s a thing that would be good to have a test over.

I had interpreted #126 as general commentary that we lack testing over manually added conflict ranges. To illustrate the pathological case, I could create a workload that performs operations, and whose only conflicts come from manually added conflict ranges. Any ordering of these transactions would be valid according to their reads and writes, but just checking the serializable schedule against the operations wouldn’t validate that the serializable schedule was a correctly accepted transaction schedule. I’m unsure if there’s any way to solve this without the oracle checking conflict ranges as well.

And, as long as I’m being pedantic, doing validation without checking conflict ranges means that we’d be asserting that the produced schedule is view-serializable and not conflict-serializable, the latter of which is stricter and what FDB claims to produce. This would leave a gap in coverage, where the resolvers could incorrectly accept schedules that are only view-serializable. However, their doing so would, essentially by definition, not be observable by the client (unless they stare very very closely at the transaction schedule). If there were a way FDB could be view-serializable, I’d fully support it, so this doesn’t actually bother me at all.

Your outlined oracle approach follows what I’ve previously seen of validating strictly serializable systems’ transaction schedules, so it seems like a good overall approach to me.


re: soundness bug, I believe I don’t follow your concerns, unless you’re specifically talking about snapshot reads done without adding conflict ranges, which I don’t think you are. There are a couple of pieces I’m confused about.

I specifically don’t follow the “snapshot reads aren’t serializable” part. I would claim that they are serializable, and are serialized at the end of their read version, as you had noted previously. This would complicate the implementation slightly by having to make sure to get them ordered in with the recorded commits correctly, but not the soundness.

For read-write transactions, suppose we’re checking a schedule of transactions [T1 {r:0, w:1}, T2{r:0, w:2}, T3{r:1,w:3}], where r: specifies its read version and w: specifies its commit version. Suppose that there’s a key K that was written by T1. When validating a read of key K in T3, we’ll have two cases to consider: either K was written by T2, or K was not written by T2. If K was not written by T2, then the most recently written version of K would be T1’s write. If K was written by T2, then T3 should not have committed, and any error reported would be correctly reported. So I don’t see a soundness issue there either.

This is all to say, I believe that I’m misunderstanding your point; would you mind illustrating your comment a little more?

I exactly mean snapshot reads done without adding conflict ranges. They need to be checked at the snapshot version (as if in a read only transaction of their own), not at the commit version. This is pretty obvious in hindsight, but I didn’t actually say it in my original proposal; that’s the soundness bug. (I don’t actually know if “soundness” and “precision” are the best words for the two types of errors but I think it’s clear from context what I mean)

And if you check all snapshot reads only at the snapshot version, then #126 isn’t detected, and as you say just checking if the read is upgraded still doesn’t seem to be precise.

So I think I support the idea of explicitly checking that the serialization history respects the specified conflict ranges, in addition to checking snapshot reads at the snapshot version and explicitly serializable reads at the commit version. (The latter is kind of optional, I guess, but a valuable belt and suspenders that could detect some unknown unknown that breaks the isolation property we actually want while somehow following the rules about conflicts. And I don’t think there’s any intended API feature for “downgrading” a read once you have made it serializable.)

Unfortunately I thought of another source of complexity. Evan’s point that RYW is complicated to model possibly applies to snapshot reads in this design as well. Snapshot reads are supposed to see previous writes in the same transaction, and I guess we would have to model that if we are “moving” them to the snapshot version.

Let me know if this is clear.

Regarding view serializability, for what it’s worth I guess the two methods this oracle can use to identify serialization order are evidence that the exact serialization order is nearly always observable in FDB :-)

Hm, well, I guess that since the oracle is supposed to say whether the transactions were consistent with the supposed serial order (provided by versionstamps), it isn’t surprising that it is complicated to make it work with the parts of the API that allow the user to massage that guarantee (i.e., add snapshot reads so that transactions might not actually be serializable, or add conflict ranges so that certain serial orderings are actually invalid). It might be the case that there is value in a test that doesn’t worry about those complexities and just verifies that if you use the database with only serializable reads and writes, the serialized order actually is correct.

That being said, I think the following modifications would allow this test to encompass the whole API (without requiring it explicitly check conflict ranges):

  1. When doing a snapshot read, one should also do it in a fresh transaction at the same read version without the RYW behavior. Record that value at the read version as a serializable read. Then also record the value that was actually read at the commit version as a snapshot read. Then verify that the two values are the same when the test has completed (except accounting for mutations that have happened within the same transaction, i.e., reimplement RYW and use it here, unless the disableSnapshotRYW option has been set).
  2. For every manual read conflict range, perform a read in that other fresh transaction with the same read version, but serialize it as the first thing that happened within the transaction at the commit version (or at least before all writes). (I believe this would catch #126.)
  3. The in-memory data structure being used to validate correctness should keep, in addition to the value of a key, the version at which it was last modified (see the sketch after this list). For range clears, you will need a tombstone that includes the end key, and during sets, you might have to split the tombstones apart.
  4. When a write conflict range is added manually, only the “last modified version” of keys in the range needs to be updated. (This also protects against problems where a transaction sets a key to the same value that it had before, which might slip through the cracks of the original test, if we want to call that a deficiency.)
  5. When validating a read, one should check that the value read is the same as the value in the in-memory validating structure, but also, one should check that the last modified version of those keys is less than or equal to the read version.
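
A rough sketch of the validating structure from points 3 through 5, simplified to track only keys that are already present in the map (a full version would need the range tombstones point 3 describes); the names are illustrative:

#include <cassert>
#include <cstdint>
#include <map>
#include <optional>
#include <string>

struct Entry {
    std::optional<std::string> value;   // nullopt == cleared (a tombstone, not an erased entry)
    int64_t lastModified = 0;           // version of the most recent modification
};

struct ValidatingStore {
    std::map<std::string, Entry> db;

    void set(const std::string& key, const std::string& value, int64_t version) {
        db[key] = Entry{value, version};
    }
    void clearRange(const std::string& begin, const std::string& end, int64_t version) {
        for (auto it = db.lower_bound(begin); it != db.lower_bound(end); ++it)
            it->second = Entry{std::nullopt, version};   // tombstone rather than erase
    }
    // Point 4: a manually added write conflict range only bumps last-modified versions.
    void touchWriteConflictRange(const std::string& begin, const std::string& end, int64_t version) {
        for (auto it = db.lower_bound(begin); it != db.lower_bound(end); ++it)
            it->second.lastModified = version;
    }
    // Point 5: the value must match, and the key must not have been modified after the read version.
    void validateRead(const std::string& key, const std::optional<std::string>& observed,
                      int64_t readVersion) const {
        auto it = db.find(key);
        std::optional<std::string> expected;
        if (it != db.end()) expected = it->second.value;
        assert(observed == expected);
        if (it != db.end())
            assert(it->second.lastModified <= readVersion);
    }
};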

I think (think) that this should verify the contract. Note that some of those steps are essentially equivalent to just checking the conflict ranges, but I think this might be as close as one can get before giving up and just re-implementing the resolver (maybe less optimized, under the assumption that it’s less error-prone to write less optimized code, and maybe without the need for efficient truncation).

At present, I believe it only validates that the purported serialized order is a valid serialized order, but it does not validate that the order isn’t too pessimistic. (In other words, I think a transaction resolver that just says “no” to all transactions passes the test.) To fix that, if you can get the versionstamp associated with uncommitted transactions (which exists but isn’t exposed in our API), then you could store somewhere in memory on the side all reads (including read conflict ranges but not snapshot reads) and verify that something in the committed serial sequence has changed (again, using the “last modified version” plus value combo). (Except that that doesn’t work in the presence of multiple resolvers, because they actually are more pessimistic than is strictly necessary even without bugs. Hmph.)


I will also add that this test doesn’t have to be done in simulation (or, rather, it doesn’t only have to be done at the key and value level). If an application developer has a layer that can be modeled in memory, and they want to validate that their operations, in the presence of concurrent actors, still work correctly, then they can use their layer while also transactionally maintaining a log of their operations. This is particularly useful if they do things like mess with conflict ranges in order to decrease contention and want to validate that they still produce operations that ultimately serialize correctly. (For example, if you had an application that popped things from a queue by reading the first n elements at snapshot isolation level and then picking one randomly, you could validate that each queue item is popped exactly once with a scheme like this.) It is probably hard to make a test like this validate that the conflict ranges aren’t too big (i.e., that there weren’t operations that “could” have been committed but weren’t), but often it’s probably good enough just to validate that they aren’t too small (i.e., that there aren’t any operations that shouldn’t have been committed but were).

A slowish implementation of range conflict checking shouldn’t be too hard; in fact there is a SlowConflictSet implemented in a few lines in Skiplist.cpp for use in (apparently defunct) unit tests there!

I’m not sure, but Alec, I think your proposal largely boils down to mixing the conflict checking (in the form of verifying last-modified versions) in with verifying reads.

Rather than duplicate RYW (yuck) I think I might try to support (single) checkpoint/rollback of the fake kv store with a simple undo log. So you do all reads and writes of a transaction in order at its snapshot version, then roll them back. Then at the commit version you do serializable reads, writes, and conflict checking. Would that work? I think it’s unlikely to have the exact same bugs as the real database.

Yeah, that sounds right. I suppose there’s an argument somewhere that it is better than just running a SlowConflictSet on the side, in that you are verifying that your read conflict ranges aren’t stale by actually reading them, which is a more direct test than relying on the SlowConflictSet, but maybe that’s marginal.

Hm, I think that would work. Given that FDB doesn’t use undo logs anywhere else, an undo log does seem likely to have a different set of bugs than the real database, so maybe that’s a plus, but it does seem annoying that one would have to implement “undo a clear range” to make it work. And your writes here would either have to exclude SET_VERSIONSTAMPED_KEY and SET_VERSIONSTAMPED_VALUE mutations (or include them, with reads then just verifying that none of your reads looked at them, à la unreadable ranges in the RYW cache).

One benefit that the test verifier has over the real database when it comes to implementing RYW is that it doesn’t have to be concerned with also being the read cache or with being efficient, so reading a range with RYW could even be implemented as (1) read the range at the old version (i.e., without your writes) and then (2) apply all mutations in order. I claim (without proof) that that shouldn’t be too hard to implement (at least easier than the real RYW cache). You could then implement the single-checkpoint in-memory DB you mentioned as one map containing the current DB state plus a list of uncommitted mutations, where “rollback” is just forgetting the mutation list.
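
As a sketch of that idea (made-up types; atomic ops and the versionstamp caveats above are ignored):

#include <algorithm>
#include <map>
#include <string>
#include <vector>

struct Mutation {
    enum Type { SET, CLEAR_RANGE } type;
    std::string begin, end;   // key, or [begin, end) for clears
    std::string value;        // for SET
};

// RYW range read = read the committed map without your writes, then apply this
// transaction's own (uncommitted) mutations, in issue order, on top of the result.
std::map<std::string, std::string> rywGetRange(
        const std::map<std::string, std::string>& committed,
        const std::vector<Mutation>& uncommitted,
        const std::string& begin, const std::string& end) {
    std::map<std::string, std::string> result(committed.lower_bound(begin),
                                              committed.lower_bound(end));
    for (const Mutation& m : uncommitted) {
        if (m.type == Mutation::SET) {
            if (m.begin >= begin && m.begin < end) result[m.begin] = m.value;
        } else if (m.begin < end && begin < m.end) {   // the clear intersects the queried range
            result.erase(result.lower_bound(std::max(m.begin, begin)),
                         result.lower_bound(std::min(m.end, end)));
        }
    }
    return result;
}

“Rollback” of the single-checkpoint store is then just clearing the uncommitted mutation list, and commit replays that list into the committed map.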

By conflict checking, do you mean that you could check all (manually- and possibly also automatically-added) read conflict ranges against a SlowConflictSet and add all write conflict ranges at the end to the same set? I do think that that would work, and it would probably be necessary if you don’t keep most-recently-modified version information with all of the keys in the in-memory representation to catch certain classes of bugs. I guess I’m somewhat concerned that the SlowConflictSet might have the same set of bugs as the main resolution skip list, which is partially why I suggested keeping most-recently-modified metadata around.

I suppose one other benefit is that if you had uncommitted transactions somewhere, then you can simulate what is “supposed” to happen with multiple resolvers by using multiple SlowConflictSets and dirtying them with uncommitted transactions that failed only on the other resolvers. But at this point, the tester is trying pretty hard to simulate the database itself rather than just verify its contract. Hm.

I think that continuing to check that serialized reads are serializable is a decent defense against bugs (which probably mean conceptual errors) that affect both SlowConflictSet and the real resolution machinery. I think you are down to problems that (1) like issue #126, don’t affect regular serializable reads, AND (2) affect the determination of conflict ranges in the oracle’s transaction wrapper as well as in the client implementation. This is not totally inconceivable, especially when we are talking about poorly specified new features, but I can’t really think of a way to do better.

An undo log for this purpose isn’t too hard, I don’t think. Unlike in the real database, amortized performance is fine. Ignoring lifetimes to an extent, basically something like

void beginTransaction() {
    ASSERT( !inTxn );
    inTxn = true;
}
void commitTransaction() {
    ASSERT( inTxn );
    undoLog.clear();
    inTxn = false;
}
void rollbackTransaction() {
    ASSERT( inTxn );
    inTxn = false;  // cleared first so the restores below don't re-log undo entries
    for( auto it = undoLog.rbegin(); it != undoLog.rend(); ++it )
        if (it->value.present())
            set( KeyValueRef( it->key, it->value.get() ) );          // restore the previous value
        else
            clear( KeyRangeRef( it->key, keyAfter( it->key ) ) );    // the key didn't exist before
    undoLog.clear();
}
void set( KeyValueRef kv ) {
    if (inTxn) {
        Optional<ValueRef> v = get( kv.key );
        undoLog.push_back( KeyMaybeValueRef( kv.key, v ) );          // remember the old value (or its absence)
    }
    map[kv.key] = kv.value;
}
void clear( KeyRangeRef range ) {
    if (inTxn)
        for(auto const& it : getRange( range ))
            undoLog.push_back( KeyMaybeValueRef( it.first, it.second ) );   // remember everything about to be erased
    map.erase( map.lower_bound( range.begin ), map.lower_bound( range.end ) );   // erase [begin, end)
}

etc. A separate unit test is probably unnecessary given that it will be tested very extensively by the oracle test itself.

I guess ideally you might want to have this fake K/V store support unreadable ranges to deal with the SET_VERSIONSTAMPED_KEY issue, or else check that separately.
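
For what it’s worth, a tiny sketch of what unreadable ranges might look like in the fake K/V store; this is purely illustrative and much simpler than what the real RYW cache does:

#include <cassert>
#include <string>
#include <utility>
#include <vector>

// A key written with SET_VERSIONSTAMPED_KEY has an unknown final key within the
// transaction, so any read that could overlap it is flagged rather than answered.
struct UnreadableRanges {
    std::vector<std::pair<std::string, std::string>> ranges;   // [begin, end)

    void addVersionstampedWrite(const std::string& begin, const std::string& end) {
        ranges.push_back({begin, end});   // the range the final key could land in
    }
    void checkRead(const std::string& begin, const std::string& end) const {
        for (auto& r : ranges)
            assert(!(begin < r.second && r.first < end));   // read touched an unreadable range
    }
    void clear() { ranges.clear(); }      // at commit or rollback
};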