This may be controversial, but this is something that could prevent a lot of issues and headaches.
tl;dr: fdbcli should - by default - update the \xff/metadataVersion key after committing any mutation to the cluster, so that it does not break layers that use that key to cache metadata.
I’ve added support for caching metadata to all my layers, using the \xff/metadataVersion key to decide when they should throw away any previously cached data. All my layers (including the Directory Layer of the .NET binding) use this to cache data that does not change frequently and is expensive to query on every transaction (prefixes of directory subspaces, schema and index definitions of a collection of documents, etc.). Everything relies on the \xff/metadataVersion key being changed whenever an admin updates the schema.
My main issue is that whenever I use fdbcli to quickly patch something - most frequently by performing a large clearrange on the db - it completely breaks that contract and borks all the layers that use caching, because the metadata version was not changed!
Welcome to the fdbcli. For help, type `help'.
fdb> writemode on
fdb> get \xff/metadataVersion
`\xff/metadataVersion' is `\x00\x00\x0f[F\x93p\x09\x00\x00'
fdb> clearrange \x00 \xFF
Committed (16884943757645)
fdb> get \xff/metadataVersion
`\xff/metadataVersion' is `\x00\x00\x0f[F\x93p\x09\x00\x00'
fdb>
In this example, I have completely wiped the database, including the directory layer, but the metadata version has not changed. After this command, any layer that had already cached metadata will blissfully ignore the wipe and continue writing data into the keyspace, at least for a while, until some tool or command changes the value.
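To make the failure mode concrete, here is a minimal in-memory sketch of the caching contract and how a clearrange slips past it. It does not use the real FoundationDB bindings: the dict standing in for the database, the MetadataCache class, the load_directory_prefixes helper and the dir/ key prefix are all hypothetical names made up for illustration.

```python
from typing import Callable, Dict, Optional

METADATA_VERSION_KEY = b"\xff/metadataVersion"

# Hypothetical stand-in for the database: a plain dict of key -> value.
Store = Dict[bytes, bytes]


class MetadataCache:
    """Reloads cached metadata only when the metadata version key changes."""

    def __init__(self, load: Callable[[Store], Dict[bytes, bytes]]) -> None:
        self._load = load
        self._seen_version: Optional[bytes] = None
        self._cached: Optional[Dict[bytes, bytes]] = None

    def get(self, store: Store) -> Dict[bytes, bytes]:
        current = store.get(METADATA_VERSION_KEY)
        if self._cached is None or current != self._seen_version:
            # Version changed (or first use): do the expensive reload.
            self._cached = self._load(store)
            self._seen_version = current
        return self._cached


def load_directory_prefixes(store: Store) -> Dict[bytes, bytes]:
    # Hypothetical "expensive" query: collect every directory entry.
    return {k: v for k, v in store.items() if k.startswith(b"dir/")}


store: Store = {
    METADATA_VERSION_KEY: b"\x00" * 9 + b"\x01",
    b"dir/users": b"\x15\x01",
}
cache = MetadataCache(load_directory_prefixes)
cache.get(store)  # warms the cache

# Simulate `clearrange \x00 \xFF`: user keys go, the \xff/ system key stays.
for key in [k for k in store if b"\x00" <= k < b"\xff"]:
    del store[key]

stale = cache.get(store)  # version unchanged, so the old metadata is returned

# Only once something changes the version key does the cache notice.
store[METADATA_VERSION_KEY] = b"\x00" * 9 + b"\x02"
fresh = cache.get(store)
```

Here stale still contains the wiped dir/users entry, while fresh is empty, which is the window during which a layer keeps writing into a keyspace that no longer exists.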
This very situation happened to me minutes before a demo, and it took me some time to figure out what had happened (the fix was simply to bounce all the pods in the Kubernetes cluster… not great).
I think that, by default, any mutation performed by fdbcli should touch the \xff/metadataVersion key at commit time. This would ensure that any layer using caching would reload its cache and see the change.
If there are situations where the admin using fdbcli does not want this to happen, maybe there could be a different argument to writemode, like writemode unsafe? I think it should be made very clear that randomly setting or clearing ranges in the keyspace without touching the metadata version can be very dangerous!
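As a rough illustration of the behavior I am proposing, here is a small in-memory sketch, not fdbcli's actual implementation. FakeCluster, clear_range and the touch_metadata_version flag are hypothetical names; the flag plays the role of the suggested writemode unsafe opt-out, and the 10-byte value written on commit only mimics the shape of the value shown in the session above.

```python
import struct
from typing import Callable, Dict, List

METADATA_VERSION_KEY = b"\xff/metadataVersion"

Mutation = Callable[[Dict[bytes, bytes]], None]


class FakeCluster:
    """Hypothetical in-memory stand-in for a cluster, for illustration only."""

    def __init__(self) -> None:
        self.data: Dict[bytes, bytes] = {}
        self.commit_version = 0

    def commit(self, mutations: List[Mutation],
               touch_metadata_version: bool = True) -> int:
        self.commit_version += 1
        for apply in mutations:
            apply(self.data)
        if touch_metadata_version:
            # Proposed default: every commit bumps the metadata version.
            # 8-byte commit version + 2-byte order, 10 bytes in total.
            self.data[METADATA_VERSION_KEY] = struct.pack(
                ">QH", self.commit_version, 0)
        return self.commit_version


def clear_range(begin: bytes, end: bytes) -> Mutation:
    def apply(data: Dict[bytes, bytes]) -> None:
        for key in [k for k in data if begin <= k < end]:
            del data[key]
    return apply


cluster = FakeCluster()
old_version = b"\x00" * 10
cluster.data = {b"dir/users": b"\x15\x01", METADATA_VERSION_KEY: old_version}

# Default mode: the clear also touches the metadata version.
cluster.commit([clear_range(b"\x00", b"\xff")])
safe_version = cluster.data[METADATA_VERSION_KEY]

# Opt-out mode (the "writemode unsafe" idea): version is left alone.
cluster.data[b"dir/users"] = b"\x15\x01"
cluster.commit([clear_range(b"\x00", b"\xff")], touch_metadata_version=False)
unsafe_version = cluster.data[METADATA_VERSION_KEY]
```

With the default, safe_version differs from old_version, so any caching layer reloads on its next transaction; with the opt-out, unsafe_version is identical to safe_version and caches stay stale, which is exactly today's behavior.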
Any opinions on this?
Has anyone wondered why the caching feature of a layer sometimes did not work properly (maybe inducing data corruption), and can now make the connection with a random “clearrange” issued via fdbcli while troubleshooting?