How to safely add a metadata caching "layer" on top of existing layers?

I’m currently tackling the issue of adding cached metadata to a lot of already-written layers in our application, using the “new” metadataVersion key.

These layers predate the introduction of this feature, and most of them either completely ignored the issue (caching data like subspace prefixes and praying that they NEVER change at runtime) or did not implement any cache and have a lot of latency that could be reduced.

My goals are:

  • Have a global “API” that looks the same across all layers (in my case in C#/.NET but could apply to any other language) that deals with the caching of state shared between multiple transactions.
  • Make sure that multiple layers can participate in the SAME transaction, so if they all have their own cached state, they can still be sure that they are all on the same page.
  • Be sure that the global cache implementation is bulletproof so that all the rest of the code can rely on it without doing its own checks.
  • Include the Directory Layer in this too, because it also has a cached state (the prefixes of all the subspaces, and we use them a lot).

The end result being that we could reduce the latency of transactions as much as possible.

My initial naive approach is to have some way for each layer to register a lambda “init” function that is called to create a new “state”. This state would include any data read from the database that changes infrequently, and the function would return an object that encapsulates all of it. All other transactions would then obtain a reference to this state and use it.

From experience, I know that dealing with caches and transactions can be complex, because the cache must only be updated if the transaction commits successfully!

For example, take a cold start with an empty cache and an initial metadataVersion (“MV”) equal to 123.

T1 reads the metadataVersion (123) and checks the cache, which is empty. It then invokes the “init” lambda of the layer, which reads all the metadata from the database and creates a new “state”. The transaction can use this state, but it cannot be published in the cache yet (only after a commit).

Once T1 commits, the cache is updated with the state at MV=123.

T2 starts later, reads the MV which is still 123, finds a state in the cache with the same MV, and can use that state (created by another transaction).
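The T1/T2 sequence above can be sketched as follows. This is only a minimal in-memory illustration, not tied to any real binding: `LayerCache`, `CachedState`, `get_state` and `on_commit_success` are hypothetical names, and the metadata read is faked by a plain function.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple


@dataclass(frozen=True)
class CachedState:
    metadata_version: bytes  # MV observed by the transaction that built the state
    data: Any                # whatever the layer's "init" lambda returned


class LayerCache:
    """Hypothetical per-layer cache; publishes a state only after a commit."""

    def __init__(self, init: Callable[[], Any]) -> None:
        self._init = init
        self._published: Optional[CachedState] = None

    def get_state(self, observed_mv: bytes) -> Tuple[CachedState, bool]:
        """Return (state, is_new) for a transaction that read `observed_mv`."""
        cached = self._published
        if cached is not None and cached.metadata_version == observed_mv:
            return cached, False
        # Cache miss: build a transaction-local state; it is NOT published yet.
        return CachedState(observed_mv, self._init()), True

    def on_commit_success(self, state: CachedState) -> None:
        """Call only after the transaction that built `state` has committed."""
        self._published = state


init_calls = []

def load_metadata() -> dict:
    init_calls.append(1)  # stands in for reading the metadata from the database
    return {"prefix": b"\x15\x01"}

cache = LayerCache(load_metadata)
t1_state, t1_is_new = cache.get_state(b"123")  # T1: cold start, runs "init"
cache.on_commit_success(t1_state)              # T1 committed: publish
t2_state, t2_is_new = cache.get_state(b"123")  # T2: same MV, reuses T1's state
```

The sketch deliberately ignores concurrency and retries; it only shows the separation between building a state inside a transaction and publishing it after the commit.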

First questions:

  • What if the transaction is read-only, but the “init” code for the layer needs to write some things (to populate missing things)? Does this mean that the “init” code can only read metadata, and all pre-initialization should be performed elsewhere?
  • What happens if multiple transactions (T1a, T1b) both start at roughly the same time, and the cache is still empty? How do I know which state to insert in the cache? If they were both constructed at the same metadataVersion, they should be identical? Or should I use the one with the more recent read or commit version?
  • How do I deal with long-running transactions that are reset multiple times (e.g. bulk reads that read for 5s, then reset and continue reading from the previous cursor)?

Now, some other application or unrelated code makes a change in another part of the cluster, and bumps the metadataVersion to MV = 456, without any change to our layer’s metadata:

T3 starts and reads the MV, which is 456. This does not match the state in the cache. Before dropping the state entirely, let’s say we have our own private “version” key that can be read quickly. We find that this key has not changed, so we can still safely use the state from MV=123, even if we are now at MV=456 in the cluster. The transaction can continue after only a few quick reads, instead of having to re-read the entire state from the db.
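The revalidation done by T3 could look roughly like this. It is a sketch under assumptions: `resolve_state`, `read_layer_version` and `load_state` are hypothetical names, and the two callbacks stand in for a single read of the layer's private version key and a full reload of the metadata.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class CachedState:
    metadata_version: bytes  # global MV the state was last validated against
    layer_version: bytes     # value of the layer's own private "version" key
    data: Any


def resolve_state(
    cached: Optional[CachedState],
    observed_mv: bytes,
    read_layer_version: Callable[[], bytes],
    load_state: Callable[[], Any],
) -> CachedState:
    """Pick the state a transaction should use, reading as little as possible."""
    if cached is not None and cached.metadata_version == observed_mv:
        return cached  # fast path: nothing changed anywhere, no extra read
    current_lv = read_layer_version()  # one quick read of the private key
    if cached is not None and cached.layer_version == current_lv:
        # The global MV moved but our own version did not: keep the data
        # and only re-tag it with the newly observed MV.
        return CachedState(observed_mv, current_lv, cached.data)
    # Our own metadata changed (or the cache was empty): full reload.
    return CachedState(observed_mv, current_lv, load_state())


reads = []

def read_layer_version() -> bytes:
    reads.append("version")
    return b"v1"

def load_state() -> dict:
    reads.append("full")
    return {"tables": ["users"]}

s123 = resolve_state(None, b"123", read_layer_version, load_state)  # cold start
s456 = resolve_state(s123, b"456", read_layer_version, load_state)  # T3
```

The state returned for T3 is local to that transaction. The sketch takes no position on when it may be published to the shared cache.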

Questions:

  • When is it safe to update the cache? Do I need to also wait for a successful commit, or can I immediately update the cache to say that the state is also valid for MV=456? Let’s say T4 starts right after that, but before T3 commits: can it then safely use the cache?

Next, how do I deal with multiple rapid changes to the metadataVersion that are observed at different times by multiple concurrent transactions? Say some transactions lag far behind: by the time they have read the new state (which is already stale), transactions that started after them but committed faster have already updated the cache with the most recent state, and the laggards then attempt to update the cache too. What kind of parameters do I have to provide to the cache so that it can safely accept or discard the state proposed by the callbacks of all these concurrent transactions?

TBD: I need to draw a graph for this situation as well, but it’s getting late, sorry!

Questions:

  • Is there a universal way to safely update the cache given the tuple (read_version, metadata_version, commit_version) of a transaction that will deal with concurrent / lagging transactions? I have a feeling that this will always be the same question for anyone wanting to do such caching.
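One candidate rule, offered as a sketch rather than a verified answer: let the cache accept a proposed state only if its metadata version is strictly newer than the one already published. This assumes that metadataVersion values are versionstamps and therefore sort in commit order when compared bytewise, and that every schema change bumps the key. `VersionedSlot`, `Proposal` and `try_publish` are hypothetical names, and the sketch uses only the metadata version, not the read or commit versions.

```python
import threading
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class Proposal:
    metadata_version: bytes  # MV observed by the proposing transaction
    data: Any


class VersionedSlot:
    """Hypothetical cache slot that never moves backwards in metadata version."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._current: Optional[Proposal] = None

    def try_publish(self, proposal: Proposal) -> bool:
        """Publish `proposal` unless an equal or newer state is already there."""
        with self._lock:
            cur = self._current
            # ASSUMPTION: bytewise order of the MV matches commit order.
            if cur is not None and proposal.metadata_version <= cur.metadata_version:
                return False  # lagging or duplicate proposal: discard it
            self._current = proposal
            return True

    def current(self) -> Optional[Proposal]:
        with self._lock:
            return self._current


def mv(n: int) -> bytes:
    """Fake 10-byte metadata version, for illustration only."""
    return n.to_bytes(10, "big")


slot = VersionedSlot()
fast_accepted = slot.try_publish(Proposal(mv(456), "recent state"))
slow_accepted = slot.try_publish(Proposal(mv(123), "stale state"))  # laggard
```

With this rule a slow transaction that built its state at MV=123 cannot overwrite a state already published for MV=456, regardless of the order in which the two commit callbacks run.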

So, after attempting to add a cache on top of the directory layer that can compose well with other layers (when used in the same transaction), the answer is that it is not very easy to do: it requires care and also has a few restrictions.

  1. Reading the \xff/metadataVersion key will FAIL after it has been changed in the same transaction. And it will also doom the transaction when it attempts to commit. The only sensible way to deal with this is to prevent the read and return ‘null’ when this is the case. Ideally, this should be addressed at the binding layer with a dedicated API, because it is difficult to implement correctly!

  2. When a layer A observes a ‘null’ value for the metadata version, it means that another layer B changed something somewhere earlier in the current transaction. This does not mean that the cache of layer A is unusable, but layer A cannot know that and has to check again.

  3. Even if layer A detects a change and updates its own cache context, it cannot use that context in the next transaction, because the current transaction can fail to commit, or another layer can interfere and prevent layer A from knowing which metadataVersion value is linked to that context. The only way is to check right before the commit and, if the metadata version was changed, discard the newly constructed cache context. Only transactions where nobody changed the metadata version can publish a new cache context for the next ones.

  4. Attempting to build a “smart” cache that observes local schema mutations in the same transaction is very difficult if the same transaction can be accessed from multiple threads. The best bet is to enforce mutual exclusion between operations that use the cache and operations that mutate the schema. In particular, resources obtained from the cache before calling methods that change the schema are suspect and should be read again!

  5. Each layer needs at least a local “version” key on top of the global metadataVersion key, which has to be updated every time the schema is changed, and can then be used to quickly revalidate the cache (ideally with a single read). If the layer uses a versionStamp for that key (which is sensible), then it will fall into the same trap as in 1) and has to be extra careful not to read that key again in the same transaction (locking is required if layer code is multi-threaded!)

  6. Any layer at level N in the stack of layers SHOULD NOT cache any data obtained from the cache of the layer at level N-1. Instead it should request the data every time it needs it, and rely on that layer’s cache to be efficient. Ideally, any cached resource returned to the outside should have a “self destruct” option that the layer can trigger if the previous cache context is invalidated, enforcing the rule of “don’t put the cached resource in a static somewhere!”

If these requirements are met, then it looks like it is possible to stack multiple levels of caching on top of each other, and they should be efficient for transactions that do not mutate the schema of any of these layers.
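Rules 1 and 3 from the list can be illustrated with a small wrapper that remembers whether the metadata version was touched in the current transaction. This is a sketch only: `TrackedTransaction` and its methods are hypothetical, and a plain dict stands in for the database snapshot instead of a real binding.

```python
from typing import Dict, Optional

METADATA_VERSION_KEY = b"\xff/metadataVersion"


class TrackedTransaction:
    """Hypothetical wrapper; a dict stands in for the transaction's snapshot."""

    def __init__(self, snapshot: Dict[bytes, bytes]) -> None:
        self._snapshot = snapshot
        self._mv_touched = False

    def touch_metadata_version(self) -> None:
        # A real binding would issue a versionstamped write to the key here.
        self._mv_touched = True

    def get_metadata_version(self) -> Optional[bytes]:
        # Rule 1: never read the key after it was changed in this transaction;
        # report "unknown" (None) instead of performing a read that would fail.
        if self._mv_touched:
            return None
        return self._snapshot.get(METADATA_VERSION_KEY)

    def may_publish_cache(self) -> bool:
        # Rule 3: check right before commit; a context built in a transaction
        # that changed the metadata version must be discarded, not published.
        return not self._mv_touched


tr = TrackedTransaction({METADATA_VERSION_KEY: b"123"})
before = tr.get_metadata_version()   # layer A reads the MV normally
tr.touch_metadata_version()          # layer B changes its schema
after = tr.get_metadata_version()    # layer A now observes "null"
can_publish = tr.may_publish_cache()
```

A layer that gets `None` back falls into rule 2: it cannot tell whether its own cache is still valid and has to revalidate, for example through its local “version” key.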

So, as an example, consider combining the Directory Layer with a hypothetical Record Layer that uses directory subspaces to store the content of tables and indexes, and has complex metadata that needs to be parsed in memory to be efficient:

  • The Directory Layer has a TryOpenCached(path) API that returns a new subspace instance but uses a cache context to store the prefix of each subspace. In most cases, no reads will be performed (except the initial GRV, but that is inevitable).

  • The Record Layer has a GetCachedTable(tableName) method that returns a new “Table” instance where the metadata (schema, indexes, …) comes from the cache context, but this layer HAS to call the “TryOpenCached(…)” method on the DL every time it wants to read/write keys from the tables or indexes. It MUST NOT store the subspace instance obtained in the transaction in its own cache, because it has NO WAY to know whether the Directory Layer’s cache has been invalidated since.

Ideally, the cached subspace instance returned by the Directory Layer has a pointer to the original cache context, and it will always check that this context is still active before encoding or decoding keys. If the original cache context is destroyed by the DL, then that instance will be poisoned and throw errors instead.
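A minimal sketch of such a “poisoned” subspace, assuming raw byte prefixes instead of tuple encoding. `CacheContext`, `CachedSubspace` and `CacheInvalidatedError` are hypothetical names, not part of any existing Directory Layer API.

```python
class CacheInvalidatedError(RuntimeError):
    """Raised when a resource outlives the cache context that produced it."""


class CacheContext:
    """Hypothetical cache context owned by the Directory Layer."""

    def __init__(self) -> None:
        self._active = True

    @property
    def active(self) -> bool:
        return self._active

    def invalidate(self) -> None:
        # Called by the owning layer when it drops this context.
        self._active = False


class CachedSubspace:
    """Subspace that checks its originating context before every use."""

    def __init__(self, prefix: bytes, context: CacheContext) -> None:
        self._prefix = prefix
        self._context = context

    def _ensure_alive(self) -> None:
        if not self._context.active:
            raise CacheInvalidatedError("cache context was invalidated")

    def pack(self, suffix: bytes) -> bytes:
        self._ensure_alive()
        return self._prefix + suffix

    def unpack(self, key: bytes) -> bytes:
        self._ensure_alive()
        if not key.startswith(self._prefix):
            raise ValueError("key does not belong to this subspace")
        return key[len(self._prefix):]


ctx = CacheContext()
users = CachedSubspace(b"\x15\x01", ctx)
key = users.pack(b"alice")   # works while the context is active
ctx.invalidate()             # the Directory Layer drops the context
try:
    users.pack(b"bob")
    poisoned = False
except CacheInvalidatedError:
    poisoned = True
```

A subspace instance stashed in a static by an upper layer then fails loudly on its next use instead of silently encoding keys with a stale prefix.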