Deferred value checks as an alternative to the metadataVersion key for caching in layers

After having implemented caching in the Directory Layer and several other complex layers using the \xff/metadataVersion key, and having used it for a while, I ended up scrapping this idea, mainly for three reasons:

  • Relying on a single global key creates too much noise from other sources in the cluster.

  • Adding another key to the Directory Layer (used to test if the change is indeed coming from the DL or elsewhere) is a breaking change that requires a new version of the DL implementation in all bindings.

  • Both the global metadata key and the “local” key used by layers will totally miss any admin operation performed via fdbcli (or an outside script) (see Should mutations performed via fdbcli always update the \xff/metadataVersion key?). It can be argued that “it should not happen”, but if it does, the result is catastrophic (all running instances will keep using their cache and data corruption will occur).

My feeling is that the most robust implementation for caching would be the ability for a transaction to set read conflict ranges that can “go back in time”, ie: instead of setting a conflict range starting at the current transaction’s read-version, it could specify an older value (probably the read-version at which the cache was first created).

This feature does not exist, and I’m not sure if it could be done in such a way that a cache could live longer than 5 seconds anyway.

In the meantime, I had to find another solution that 1) can be retrofitted onto existing layers without requiring versioning, and 2) would be robust enough to not be vulnerable to admin mistakes, such as performing a clear range in fdbcli.

I ended up adding a “deferred value-checks” feature in the binding itself: a “value-check” is an outstanding asynchronous read, paired with an expected result value, that runs concurrently with the execution of the transaction handler. Just before commit (for r/w transactions) or before the handler completes (for read-only transactions), the binding waits for all these outstanding reads to complete, and verifies that they all returned the expected values. If any read fails or returns a different value, the attempt is failed (by simulating a not_committed error), and the retry loop retries the execution.

Of course, on the next attempt, the layer code must be able to know that a value-check failed (or risk looping forever). The binding offers another API that can be used to check if something failed in the previous attempt. The layer can then decide to drop any previous cache, and re-read everything.

This is very similar to the way FDB implements optimistic locking: conflicts induce retries, forcing the code to run again. Except that in this case, the layer code must notice this happening and react accordingly.
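To make the mechanism concrete, here is a minimal, self-contained sketch of such a retry loop. This is NOT the actual binding code: the database is a plain dict, reads are synchronous, and all the names (`Context`, `run_transaction`, …) are hypothetical.

```python
# Minimal sketch of the "deferred value-checks" retry loop described above.
# The database is a plain dict and reads are synchronous; in the real binding
# the checks are asynchronous reads resolved just before commit.

class Context:
    """Tracks deferred value-checks across attempts of one retry loop."""

    def __init__(self):
        self.pending = []          # (tag, key, expected) recorded this attempt
        self.failed_tags = set()   # tags that failed during the previous attempt

    def add_value_check(self, tag, key, expected):
        # The real binding starts the read immediately; here we only record
        # the check, and resolve it just before "commit".
        self.pending.append((tag, key, expected))

    def value_check_failed_in_previous_attempt(self, tag):
        return tag in self.failed_tags


def run_transaction(db, handler, max_attempts=10):
    ctx = Context()
    for _ in range(max_attempts):
        ctx.pending = []
        result = handler(db, ctx)
        # Just before commit: verify that every deferred check still holds.
        failed = {tag for (tag, key, expected) in ctx.pending
                  if db.get(key) != expected}
        if not failed:
            return result           # all checks passed: the commit goes through
        ctx.failed_tags = failed    # simulate not_committed, then retry
    raise RuntimeError("too many retries")


# Demo: the cached value is stale, so the first attempt fails its check,
# and the second attempt drops the cache and re-reads the key.
cache = {"prefix": b"old"}

def handler(db, ctx):
    if ctx.value_check_failed_in_previous_attempt("acmeLayerId"):
        cache["prefix"] = db["check_key"]   # drop the cache and re-read
    ctx.add_value_check("acmeLayerId", "check_key", cache["prefix"])
    return cache["prefix"]

result = run_transaction({"check_key": b"new"}, handler)   # takes 2 attempts
```

The key point is that the handler itself never awaits the checks: they are verified by the loop, and the handler only learns about a failure on the next attempt, via the tag.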

Pros:

  • Layers can check multiple keys in parallel without having to run the reads sequentially (ex: the Directory Layer can “revalidate” all the nodes from the root of a deeply nested path in parallel instead of sequentially).

  • Layers can have their own set of keys that are not shared with anyone else.

  • Layers can keep a cache around when creating new entries (ex: the DL only tracks subspaces opened previously, so creating or deleting a directory that is not in the cache will not bust the cache!)

  • Layers can reuse existing keys for the value-checks (ex: the DL reuses the keys that map names to prefixes)

  • This method allows multiple layers to work on the same transaction without interference, and to easily decide if their own cache is bad, or if it’s someone else’s.

  • If the data is changed by any external tool or script, the layer will observe it. (ex: a clear-range in fdbcli will make all value checks fail, and the DL will notice it on the next attempt).

Cons:

  • Compared to the global metadata version key, this method requires at least two “hops” to the cluster, because we still need to wait for the read-version, and then wait for the outstanding reads to complete, before calling commit. It is possible that the two hops could be merged into one if the transaction only writes.

  • The implementation of caching is more complex and error-prone because it works “across” retries.

  • It is currently done at the binding layer, but this has some limitations. It would work better if done at a lower level.

These are the methods I have added to the .NET API:

note: the “context” class is the type that manages the lifetime of a transaction in the .NET binding. It is accessed via the tr.Context property, which “hides” these APIs away from the view of “normal” users, since this is essentially an expert-level API!

public class FdbOperationContext
{
    //...

    public void AddValueCheck(string tag, Slice key, Slice expectedValue);
    public void AddValueChecks(string tag, IEnumerable<KeyValuePair<Slice, Slice>> items);

    public bool? ValueCheckFailedInPreviousAttempt(string tag);
    public bool HasAtLeastOneFailedValueCheck { get; }
}

During the first attempt, the layer calls tr.Context.AddValueCheck(...) or tr.Context.AddValueChecks(...) with a list of key/value pairs; the call returns immediately (‘void’) and does not block. Each value-check is paired with a tag, which is an opaque string.

On the next retry (meaning that at least one value-check failed), the layer can call tr.Context.ValueCheckFailedInPreviousAttempt(...) with the same tag and get a result. If the method returns true, the layer knows that the value changed at some point in the past, and throws out the cache.

Here is a somewhat contrived example of code using these checks:

async Task SomeLayerMethod(IFdbTransaction tr, ....)
{
     CacheContainer? cachedMetadata = .....;
     if (cachedMetadata != null)
     { // we have to re-validate the cache!
        if (tr.Context.ValueCheckFailedInPreviousAttempt("acmeLayerId") == true)
        { // we know from the previous attempt that something changed!
            cachedMetadata = null; // drop the cache
        }
        else
        { // optimistically use the cached data, but add a value-check for this transaction.
            tr.Context.AddValueCheck("acmeLayerId", cachedMetadata.CheckKey, cachedMetadata.CheckValue);
        }
     }

     if (cachedMetadata == null)
     { // we must read all the metadata from the database
        cachedMetadata = await GetMetadata(tr, ...); // reload metadata from the db
        // keep this around _only_ if the transaction commits with success
        tr.Context.OnSuccess((ctx, _) => { /* store the cached metadata somewhere */ });
     }

     // run the rest of the transaction as normal
     tr.Set(.....);
     await tr.GetAsync(....);
}

There are several performance issues, due to the fact that I’m currently handling this at the binding level:

  • I don’t dedup value checks for the same key in a transaction: I rely on the fact that the C binding already does that (merging multiple concurrent reads on the same key), but I still need to allocate tasks and futures.

  • When a value-check is created, I start the read immediately. If the same transaction changes the value after that, the value-check should not see the updated value. But if the code creates a new value-check for the same key, then that read should see the updated value. This alone means that I cannot easily “merge” checks for the same key, because it is difficult to know if the value was changed locally or not. The C binding does know that, and could easily implement this optimization!

  • Most layers have to check multiple values at once. I re-used the fake “GetValues(…)” method of the .NET binding that creates multiple futures (one for each key) but only calls back into the .NET world once (with the array of results). If this feature is more heavily used, it would be nice if the C binding could provide such a bulk-read method, to reduce the number of allocations and native interop.
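The read-your-writes subtlety in the second bullet can be illustrated with a small standalone sketch (hypothetical names, plain dicts instead of a real transaction): two checks on the same key, created before and after a local write, must observe different values, which is why they cannot be naively merged.

```python
# Sketch of why two value-checks on the same key cannot be naively merged:
# each check must observe the transaction's view of the key at the moment
# the check was created. (In the real binding the reads are asynchronous;
# here they are simplified to synchronous lookups.)

class SketchTransaction:
    def __init__(self, db):
        self.db = dict(db)      # committed state
        self.writes = {}        # local, uncommitted writes of this transaction

    def get(self, key):
        # Read-your-writes: local writes take precedence over the db.
        return self.writes.get(key, self.db.get(key))

    def set(self, key, value):
        self.writes[key] = value


class Checks:
    def __init__(self, tr):
        self.tr = tr
        self.checks = []        # (key, value observed at creation, expected)

    def add_value_check(self, key, expected):
        # The read is "started" now: snapshot the current view of the key.
        self.checks.append((key, self.tr.get(key), expected))

    def all_passed(self):
        return all(seen == expected for (_, seen, expected) in self.checks)


tr = SketchTransaction({"k": b"v1"})
checks = Checks(tr)
checks.add_value_check("k", b"v1")   # observes b'v1'
tr.set("k", b"v2")                   # must NOT affect the first check
checks.add_value_check("k", b"v2")   # a new check observes the local write
```

Merging both checks into a single read would necessarily return the wrong value for one of them; only a layer that knows about local writes (i.e. the C client) can safely perform this optimization.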

After using this for a while, I found a few cases that need special care when using this technique.

The main issue is that the retry loop handler may fail because the application code throws errors/exceptions due to invalid assumptions: it (or code in a layer it uses) assumed that the value of a key was something, but subsequent reads did not return the expected result, and the layer or application code interpreted that as some invalid situation (corrupted data? invalid state in the workflow?).

Real life example: layer code is using a cached directory subspace’s prefix to read keys. The directory subspace was just deleted and then re-created with a new prefix. The old prefix is still in the cache, and points to either empty data, or WORSE, new data because the prefix was reused somehow! This is very frequent in unit test suites that repeatedly delete and re-create entire sub-trees between each test.

When this happens, and if there is AT LEAST one failed value-check, the retry loop HAS to retry the handler.

  • If the exception was due to an invalid assumption, the layer or app code should see that the check failed, reload everything from the cluster, get the correct data, and not throw again.
  • If the exception was unrelated, it should (probably) fail again.

This makes this even more difficult to implement well, because if any layer code mixes things up in the value-check retry logic, it may end up retrying forever…
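To make the policy concrete, here is a small standalone sketch (hypothetical names, not the actual .NET binding) of a retry loop that retries after an exception only when at least one value-check failed, and surfaces the error otherwise:

```python
# Sketch of the retry policy: when the handler throws, the loop still verifies
# the deferred value-checks; if at least one failed, the exception is presumed
# to come from an invalid cached assumption and the handler is retried;
# otherwise the exception is rethrown.

def run_with_value_checks(db, handler, max_attempts=10):
    failed_tags = set()
    for _ in range(max_attempts):
        pending = []   # (tag, key, expected) checks recorded by this attempt
        error = None
        result = None
        try:
            result = handler(db, pending, failed_tags)
        except Exception as e:
            error = e
        failed_tags = {tag for (tag, key, expected) in pending
                       if db.get(key) != expected}
        if error is not None:
            if failed_tags:
                continue       # invalid assumption? give the handler another chance
            raise error        # unrelated failure: surface it immediately
        if not failed_tags:
            return result      # commit succeeds
    raise RuntimeError("too many retries")


# Demo: the first attempt throws because of an optimistic assumption, the
# failed check triggers a retry, and the second attempt does the real work.
def handler(db, pending, failed_tags):
    if "foo" not in failed_tags:
        # Optimistic path: assume the db is initialized, and defer the check.
        pending.append(("foo", "Foo", b"Ready"))
        if db.get("Bar") is None:
            raise RuntimeError("Oh noes! There is some corruption!")
        return "fast path"
    # Slow path: check the state for real and initialize if needed.
    if db.get("Foo") != b"Ready":
        db["Foo"] = b"Ready"
        db["Bar"] = b"Something"
    return "initialized"

db = {"Foo": b"NotReady"}
outcome = run_with_value_checks(db, handler)   # "initialized", after 2 attempts
```

Note that an unrelated exception (one thrown in an attempt where all checks pass) escapes the loop immediately instead of being retried, which is what prevents the infinite-retry trap described above.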

But anyway, with that in place, here is an example of this in action:

The layer code is very simplistic and assumes that it has a "Foo" key that is either "NotReady" or "Ready". The value of Foo becomes "Ready" after some (slow) initialization step that also sets the value of a key "Bar" to some important value, which should NOT be null!

So the initial state of the database may be:

  • Foo = NotReady
  • Bar does not exist.

After executing the slow initialization process, the database SHOULD be:

  • Foo = Ready
  • Bar = Something important for the application

The transaction code expects to be in the “Ready” state, and will optimistically assume that it is true without doing any reads, and rely on retries to discover the ugly truth (that someone forgot to run the init script!)

First transaction (that discovers something is wrong):

Transaction #10 (read/write, 17 operations, '#' = 0.5 ms, started 12:26:15.0039736Z [1590236775.003], ended 12:26:15.0139512Z [1590236775.013])
┌  oper. ┬─────────────────────┬──── start ──── end ── duration ──┬─ sent  recv ┐
│ 0   // │                     │ T+  0.013                        │     -     - │ // APP: I'm feeling lucky! Let's assume the db is already initialized
│:0   G °│ ##$                 │ T+  0.022 ~   1.066 (  1,045 µs) │    19     8 │ Get ("value_checks", "Foo") => 'NotReady'
│:0   // │ `                   │ T+  0.035                        │     -     - │ // APP: The value of 'Bar' better not be empty...
│:0   G °│ ##$                 │ T+  0.036 ~   1.046 (  1,009 µs) │    19     0 │ Get ("value_checks", "Bar") => not_found
│ 1   // │ __`                 │ T+  1.060                        │     -     - │ // APP: UH OH... something's wrong! let's throw an exception!!
│ 2   // │ ___`                │ T+  1.186                        │     -     - │ // Handler failed with error: [InvalidOperationException] Oh noes! There is some corruption in the database!
│:2   // │ ___`                │ T+  1.208                        │     -     - │ // Failed value-check 'foo' for ("value_checks", "Foo"): expected 'Ready', actual 'NotReady'
│:2   Er°│ ___X###########+    │ T+  1.232 ~   5.709 (  4,478 µs) │             │ OnError NotCommitted (1020)
├────────┼─────────────────────┼──────────────────────────────────┼─────────────┤ == Attempt #2 == 12:26:15.0097157Z (1590236775.009)
│ 3   // │ °°°°°°°°°°°°°°°`    │ T+  5.742                        │     -     - │ // SetOption(Timeout, 15000)
│:3   // │ °°°°°°°°°°°°°°°`    │ T+  5.854                        │     -     - │ // Previous attempt failed because of the following failed value-check(s): foo
│:3   // │ °°°°°°°°°°°°°°°`    │ T+  5.872                        │     -     - │ // APP: doing the actual work to check the state of the db, and initialize the schema if required...
│:3   G °│ °°°°°°°°°°°°°°°.##$ │ T+  5.878 ~   6.982 (  1,105 µs) │    19     8 │ Get ("value_checks", "Foo") => 'NotReady'
│ 4   // │ °°°°°°°°°°°°°°°___` │ T+  6.991                        │     -     - │ // APP: initializing the database!
│:4   s  │ °°°°°°°°°°°°°°°___` │ T+  7.010 ~   7.017 (      7 µs) │    24       │ Set ("value_checks", "Foo") = 'Ready'
│ 5   s  │ °°°°°°°°°°°°°°°___` │ T+  7.019 ~   7.020 (      1 µs) │    28       │ Set ("value_checks", "Bar") = 'Something'
│ 6   // │ °°°°°°°°°°°°°°°___` │ T+  7.020                        │     -     - │ // APP: The value of 'Bar' better not be empty...
│:6   G  │ °°°°°°°°°°°°°°°___` │ T+  7.022 ~   7.038 (     16 µs) │    19     9 │ Get ("value_checks", "Bar") => 'Something'
└────────┴─────────────────────┴──────────────────────────────────┴─────────────┘
> Read 25 bytes in 9.976 ms and 2 attempt(s)

The code first optimistically assumes that the layer is “Ready”, and expects a non-null value in “Bar”. If it is null, then it assumes that there is some corruption in the database, and throws an exception.

The retry loop handler sees the exception, but also sees the failed value-check, and retries.

On the second attempt, the layer code does not assume anything, reads “Foo”, sees that it’s “NotReady” and then does the actual work to initialize the db.

All the following transactions show the case where the value-checks are successful.

Transaction #11 (read/write, 4 operations, '#' = 0.5 ms, started 12:26:15.0141316Z [1590236775.014], ended 12:26:15.0159212Z [1590236775.015])
┌  oper. ┬─────┬──── start ──── end ── duration ──┬─ sent  recv ┐
│ 0   // │     │ T+  0.008                        │     -     - │ // APP: I'm feeling lucky! Let's assume the db is already initialized
│:0   G °│ ### │ T+  0.014 ~   1.696 (  1,682 µs) │    19     5 │ Get ("value_checks", "Foo") => 'Ready'
│:0   // │ `   │ T+  0.023                        │     -     - │ // APP: The value of 'Bar' better not be empty...
│:0   G °│ ### │ T+  0.024 ~   1.683 (  1,659 µs) │    19     9 │ Get ("value_checks", "Bar") => 'Something'
└────────┴─────┴──────────────────────────────────┴─────────────┘
> Read 14 bytes in 1.787 ms and 1 attempt(s)

Transaction #12 (read/write, 4 operations, '#' = 0.5 ms, started 12:26:15.0160095Z [1590236775.016], ended 12:26:15.0169962Z [1590236775.016])
┌  oper. ┬───┬──── start ──── end ── duration ──┬─ sent  recv ┐
│ 0   // │   │ T+  0.007                        │     -     - │ // APP: I'm feeling lucky! Let's assume the db is already initialized
│:0   G  │ # │ T+  0.011 ~   0.945 (    934 µs) │    19     5 │ Get ("value_checks", "Foo") => 'Ready'
│:0   // │ ` │ T+  0.016                        │     -     - │ // APP: The value of 'Bar' better not be empty...
│:0   G  │ # │ T+  0.017 ~   0.935 (    918 µs) │    19     9 │ Get ("value_checks", "Bar") => 'Something'
└────────┴───┴──────────────────────────────────┴─────────────┘
> Read 14 bytes in 0.986 ms and 1 attempt(s)

What we can observe is that both the reads for “Foo” and “Bar” start and complete at roughly the same time, even though, logically, the application should first read “Foo” and then, if the value is “Ready”, read the value of “Bar”:

  • Start “Foo” read
  • Await “Foo” read
  • Start “Bar” read
  • Await “Bar” read

In the above examples, though, both reads are merged; or rather, the read for “Foo” is not awaited immediately, but deferred until the very end.

So the actual order of operation is:

  • Start “Foo” read
  • Start “Bar” read
  • Await “Bar” read
  • Await “Foo” read

It looks like the fdb client multiplexes both reads so they are essentially merged into a single network round-trip! If you look carefully, you can see that the end time for the “Foo” read is a few µs later than the end time for “Bar”.
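This deferred-await pattern can be reproduced with any async runtime. Here is a tiny asyncio sketch (hypothetical names, a `sleep(0)` standing in for the network) where both reads are started up front, but the “Foo” check is only awaited at the very end:

```python
# asyncio sketch of the deferred await: both reads start immediately, the
# handler awaits "Bar" when it needs the value, and the "Foo" value-check is
# only awaited at the very end, so both requests overlap in time.

import asyncio

async def fake_read(key, completed):
    await asyncio.sleep(0)        # stand-in for the network round-trip
    completed.append(key)
    return b"value of " + key.encode()

async def transaction(completed):
    foo = asyncio.ensure_future(fake_read("Foo", completed))  # started, not awaited
    bar = asyncio.ensure_future(fake_read("Bar", completed))
    value_of_bar = await bar      # the handler needs Bar's value right away
    # ... rest of the handler runs here ...
    await foo                     # deferred value-check, resolved before commit

completed = []
asyncio.run(transaction(completed))
```

Both reads complete in a single pass over the event loop, which is the sequential-code equivalent of the merged round-trip visible in the logs above.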

This is what it looks like in a “real life” transaction that combines the Directory Layer and a Document Store layer.

The transaction queries an index in a collection of Books; both the directory subspaces and the metadata for the collection are cached from previous requests. Both the DL and the Document Store layer use value-checks to ensure that their caches are still valid.

Querying Genre == 'Science Fiction' ...
Transaction #21 (read-only, 17 operations, '#' = 0.5 ms, started 14:21:20.5156647Z [1590502880.515], ended 14:21:20.5171854Z [1590502880.517])
┌  oper. ┬─────┬──── start ──── end ── duration ──┬─ sent  recv ┐
│ 0   // │     │ T+  0.404                        │     -     - │ // Unpacking Book documents...
│:0   ?? │ ##+ │ T+  0.430 ~   1.176 (    746 µs) │    13     1 │ CheckValue (|Directory|, `<FE>`, 0, "Tests") =?= <14> => <14> [Success]
│:0   ?? │ ##. │ T+  0.442 ~   1.099 (    657 µs) │    18     3 │ CheckValue (0, |Directory|, `<14><FE>`, 0, "ACME") =?= <14><15><16> => <14><15><16> [Success]
│:0   ?? │ ##. │ T+  0.445 ~   1.096 (    652 µs) │    20     3 │ CheckValue (0, |Directory|, `<14><15><16>`, 0, "MUTSUMI") =?= <14><15>5 => <14><15>5 [Success]
│:0   ?? │ &#. │ T+  0.447 ~   1.093 (    646 µs) │    23     4 │ CheckValue (0, |Directory|, `<14><15>5`, 0, "BookFacts") =?= <14><16><15><11> => <14><16><15><11> [Success]
│:0   ?? │ &#. │ T+  0.452 ~   1.087 (    635 µs) │    20     4 │ CheckValue (0, |Directory|, `<14><16><15><11>`, 0, "Books") =?= <14><16><16>b => <14><16><16>b [Success]
│:0   // │ `   │ T+  0.469                        │     -     - │ // Checking schema snapshot 19228081243440 for collection /Tests/Acme/MUTSUMI/BookFacts/Books[Doxense:DocStore:Collection] at <14><16><16>b against content of directory at DirectorySubspace(path=/Tests/Acme/MUTSUMI/BookFacts/Books[Doxense:DocStore:Collection], prefix=(0, 5730))
│:0   // │ `   │ T+  0.471                        │     -     - │ // Schema cache likely still valid for collection '/Tests/Acme/MUTSUMI/BookFacts/Books[Doxense:DocStore:Collection]'.
│:0   ?? │ &#; │ T+  0.472 ~   1.149 (    677 µs) │    15     5 │ CheckValue (0, 5730, "Name") =?= 'Books' => 'Books' [Success]
│:0   ?? │ $#; │ T+  0.492 ~   1.147 (    655 µs) │    20     1 │ CheckValue (0, 5730, "SchemaVersion") =?= <01> => <01> [Success]
│:0   ?? │ $#; │ T+  0.498 ~   1.144 (    646 µs) │    20    10 │ CheckValue (0, 5730, "Type") =?= 'Collection' => 'Collection' [Success]
│:0   ?? │ $#; │ T+  0.500 ~   1.141 (    641 µs) │    20     6 │ CheckValue (0, 5730, "Metadata") =?= 0<C5><F3><E2>|<11> => 0<C5><F3><E2>|<11> [Success]
│:0   // │ `   │ T+  0.566                        │     -     - │ // Lookup value ("science fiction",) from non-unique index [Books.IX_Genres]
│:0   R  │ +#: │ T+  0.625 ~   1.136 (    511 µs) │    54    30 │ GetRange fGE{(0, 5730, 1, 2, 0, "science fiction").<00>} <= k < fGE{(0, 5730, 1, 2, 0, "science fiction").<FF>}, WantAll, Keys => 1 result(s)
│ 10  R  │ __= │ T+  1.209 ~   1.371 (    162 µs) │    20   299 │ GetRange fGE{(0, 5730, 0, 88038).<00>} <= k < fGE{(0, 5730, 0, 88038).<FF>}, WantAll, Values => 1 result(s)
│ 11  // │ ___ │ T+  1.500                        │     -     - │ // Verifying 9 pending value-check(s)
│:11  // │ ___ │ T+  1.512                        │     -     - │ // All value-checks passed
└────────┴─────┴──────────────────────────────────┴─────────────┘
> Read 366 bytes in 1.520 ms and 1 attempt(s)

The first 5 value checks are from the Directory Layer, and the next 4 checks are for the metadata cache of the collection itself. The only “real” operations are the GetRange(…) calls used to query the index.

note: the first column of this log displays the “operation index”, which is “the number of I/Os that have completed before this operation”. All the lines that start with :0 started while the previous operations were still pending. The second GetRange shows 10, which means that when it was issued, the 10 operations before it had completed. So in short: all the lines that start with a ':' are multiplexed with other operations; only lines that increment this number are sequential.

We can see that the first 8 value-check reads have been merged with the first GetRange(…), so they don’t add to the overall transaction latency. Without this feature, and without any cache, the first 5 reads from the DL would have been sequential, and only the 4 reads from the collection metadata would have been done in parallel, so a total of 5 round-trips would have been required before issuing the first GetRange.