How are you testing your layers?

I think we didn’t bother since we kind of expect client applications to make heavy use of threads, and even if the client is single-threaded the ordering of events (i.e. calling callbacks when futures become ready) still depends on the order in which the server responds to requests. That said, I don’t think we’d be opposed to, say, a network option that sets the state of the random number generator.

My personal pet client testing crusade has to do with “in-flight” commits. Usually, by the time a commit future becomes ready, the fdb client has taken great care to ensure that the commit is no longer in flight. For transaction_timed_out and cluster_version_changed, though, the commit might still be in flight! A reasonable user might do the following (sketched in code after the list):

  1. Attempt to commit a transaction
  2. That commit future fails with transaction_timed_out
  3. Read a unique key that was to be set in that transaction
  4. Observe that the key is absent and conclude incorrectly that the transaction did not commit and will never commit
  5. The commit (which has been “in-flight” this whole time) succeeds after the read in step 4
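To make that concrete, here’s roughly what that sequence looks like with the Python bindings (the key name and the 2-second timeout are made up for illustration; 1031 is the transaction_timed_out error code):

```python
import fdb

fdb.api_version(710)
db = fdb.open()

tr = db.create_transaction()
tr.options.set_timeout(2000)      # 2s timeout, so transaction_timed_out is possible
tr[b'jobs/123/done'] = b'1'       # hypothetical unique key written only by this transaction
try:
    tr.commit().wait()            # steps 1-2: attempt the commit, which fails
except fdb.FDBError as e:
    if e.code == 1031:            # transaction_timed_out
        # Step 3: read the key back in a fresh transaction.
        check = db.create_transaction()
        if not check[b'jobs/123/done'].present():
            # Step 4: concluding "it never committed and never will" is wrong --
            # the original commit may still be in flight and can land after this read.
            pass
```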

For transaction_timed_out this isn’t so bad, as the default retry loop (i.e. on_error) does not consider transaction_timed_out to be retryable. It does, however, consider cluster_version_changed to be retryable.
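To see where that distinction bites, here’s a stripped-down version of the default retry loop the bindings implement (the run helper is hypothetical, Python again): on_error resets the transaction and waits out a backoff for retryable errors, and re-raises non-retryable ones like transaction_timed_out.

```python
import fdb

fdb.api_version(710)
db = fdb.open()

def run(db, body):
    # Bare-bones equivalent of the bindings' default retry loop.
    tr = db.create_transaction()
    while True:
        try:
            result = body(tr)
            tr.commit().wait()
            return result
        except fdb.FDBError as e:
            # Retryable errors (not_committed, commit_unknown_result,
            # cluster_version_changed, ...) reset the transaction and return
            # after a backoff; transaction_timed_out is re-raised here.
            tr.on_error(e).wait()

def body(tr):
    tr[b'hello'] = b'world'

run(db, body)
```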

Client buggify does attempt to simulate this situation if you have a timeout set on your transaction. I suppose we could do the same with cluster_version_changed if you’re using the multiversion client (or maybe just unconditionally?)

Developer Guide — FoundationDB 7.1 has some more detail about this (and should probably be updated to mention cluster_version_changed). It also mentions that cancelling a commit future does not stop that commit from being in flight.

Here are a few more popular ways of writing bugs that don’t show up in local testing:

  1. Attempting to maintain an in-memory data structure that is derived from database state. This turns out to be tricky to get right. Importantly, if you set $key to some $value in a transaction, and commit that transaction with $commit_version, you cannot assume that $key is set to $value at $commit_version. This is because commits are done in batches, and the entire batch of commits gets the same commit version, so a transaction appearing later in the batch may have performed a blind write to $key. It’s much easier if you maintain your in-memory data structure based only on reads. If you’re making decisions about what keys to write based on the state of this in-memory data structure, this gets even more complicated. I would consider just actually performing the necessary reads to re-derive the data structure in every transaction, but this might cause a read hotspot. Otherwise you could carefully track the read version and keys your data structure is derived from, use that read version as the read snapshot of your transaction, and add those keys to your read conflict range (there’s a sketch of this right after this list). If you need to avoid a read hotspot and need linearizability (i.e. a fresh read version) then you can look into a scheme using \xff/metadataVersionKey (again also tricky).

  2. Performing operations on a transaction outside of a retry loop. Most bindings come with some kind of doTransaction function that accepts a lambda and implements the default retry loop, so if you use that you probably don’t need to worry about this (there’s a sketch of this shape further below).

  3. Reading at snapshot isolation. Snapshot reads don’t add read conflict ranges, so you may be missing a read conflict you actually wanted, and you won’t notice unless you test with concurrency.

  4. Interacting with mutable data structures that have a lifetime longer than the body of your retry loop. E.g. if you append to the same list every time you retry.
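Here’s a minimal sketch of the “track the read version and keys” approach from item 1, using the Python bindings. The CachedFlag class, key names, and derived value are all made up, and retry handling is omitted; in practice the write would live inside your retry loop and you’d rebuild the cache when the commit fails with not_committed.

```python
import fdb

fdb.api_version(710)
db = fdb.open()

class CachedFlag(object):
    """Hypothetical in-memory state derived from one key, remembering the
    read version and keys it was derived from."""
    def __init__(self, db):
        tr = db.create_transaction()
        self.flag_set = tr[b'config/flag'].present()
        self.read_version = tr.get_read_version().wait()
        self.keys = [b'config/flag']

def write_derived(db, cache):
    tr = db.create_transaction()
    # Read at the same snapshot the cache was derived from. (This only works
    # while that version is recent; versions older than ~5 seconds fail with
    # transaction_too_old.)
    tr.set_read_version(cache.read_version)
    # Declare the cache's inputs as read conflicts, so a concurrent write to
    # any of them makes this commit fail with not_committed instead of
    # silently writing stale derived data.
    for k in cache.keys:
        tr.add_read_conflict_key(k)
    tr[b'derived/flag_mirror'] = b'1' if cache.flag_set else b'0'
    tr.commit().wait()
```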

If you’re using the default retry loop, using snapshot isolation sparingly, testing with concurrency, checking invariants in your data model, and making sure your transactions actually do a lap around the retry loop sometimes, you’re probably in pretty good shape.
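And to tie items 2–4 together, here’s the shape I’d aim for with the Python bindings’ @fdb.transactional decorator (their doTransaction equivalent); the keyspace here is made up:

```python
import fdb

fdb.api_version(710)
db = fdb.open()

@fdb.transactional
def store_batch(tr, items):
    # @fdb.transactional wraps this body in the default retry loop (commit +
    # on_error), so it may run more than once: keep anything it mutates local
    # to the body (item 4) rather than reusing state across retries.
    stored = []                                  # fresh list on every attempt
    for name, value in items:
        key = fdb.tuple.pack(('batch', name))    # hypothetical keyspace
        tr[key] = value
        stored.append(key)
    # Snapshot reads do not add read conflict ranges (item 3); only use them
    # where you genuinely don't want this transaction to conflict on that key.
    last_flush = tr.snapshot[fdb.tuple.pack(('batch', 'last_flush'))]
    return stored

store_batch(db, [('a', b'1'), ('b', b'2')])
```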

Btw, if you’re using client buggify, I would recommend introducing a few bugs intentionally and tweaking the “section_activated” and “section_fired” probabilities until you are actually catching those bugs, but still completing transactions.
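For reference, those knobs are network options, which have to be set before the network starts (i.e. before the first fdb.open()). In the Python bindings I believe the generated setters look like the following, with the probabilities given as integer percentages; double-check the exact names against fdb.options in your binding version.

```python
import fdb

fdb.api_version(710)

# Assumed generated option names; check fdb.options in your binding.
fdb.options.set_client_buggify_enable()
fdb.options.set_client_buggify_section_activated_probability(25)
fdb.options.set_client_buggify_section_fired_probability(25)

db = fdb.open()  # starts the network with client buggify enabled
```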