Expose serialization of MutationRef in api?

There are a few use cases I can think of where it might be useful.

  • Read your own mutations (from the special key space)? Then you could do your own change data capture layer by reading all your mutations right before you commit, and then writing them back with a versionstamped key. It could also be useful for logging/debugging.

  • Inspect backup files in preparation for a point-in-time restore. E.g. I accidentally issued a clearrange “” \xff, but I had an incremental backup running. Now I’d like to restore the database to the point right before that mutation. You would need a way to identify the accidental mutation.
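To make the second use case concrete: if decoded backup mutations were available as (version, type, param1, param2) records, finding the accidental clear could be a simple scan. This is only a sketch. The `CLEAR_RANGE` code, the record shape, and the helper name `find_restore_version` are my assumptions for illustration, not an existing FoundationDB API, and the log is assumed to be in version order.

```python
from typing import Iterable, NamedTuple, Optional

CLEAR_RANGE = 1  # assumed type code for a clear range; check MutationRef for the real value


class LoggedMutation(NamedTuple):
    version: int   # commit version the mutation was applied at
    type: int      # mutation type code
    param1: bytes  # key, or range start for a clear range
    param2: bytes  # value, or range end for a clear range


def find_restore_version(log: Iterable[LoggedMutation],
                         begin: bytes, end: bytes) -> Optional[int]:
    """Return the version just before the first clear range covering [begin, end).

    Assumes `log` is ordered by version. Returns None if no such clear is found.
    """
    for m in log:
        if m.type == CLEAR_RANGE and m.param1 <= begin and m.param2 >= end:
            return m.version - 1
    return None


# Hypothetical decoded log containing the accidental clearrange "" \xff
log = [
    LoggedMutation(100, 0, b"x", b"hi"),
    LoggedMutation(205, CLEAR_RANGE, b"", b"\xff"),
    LoggedMutation(300, 0, b"y", b"late"),
]
restore_version = find_restore_version(log, b"", b"\xff")
print(restore_version)
```

The returned version is what you would hand to a point-in-time restore as the target.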

The format is (type, param1, param2), where type is an int and param1 and param2 are strings. The type is already exposed to clients here: https://github.com/apple/foundationdb/blob/master/fdbclient/vexillographer/fdb.options#L292. (It’s missing set and clear, but we could add those.)

As far as I can tell, this has never been changed in a backwards-incompatible way, only new types have been added.

What do people think of exposing “mutation = (type, param1, param2)” in the api?

Just to make sure I understand… are you talking about exposing just the current transaction’s mutations via a virtual key in \xff? Or are you talking about exposing the recent mutations that have been committed to the database?

I suppose my question could be put more succinctly as “if we were to expose the current transaction’s mutations via a special key in \xff\xff, what should that look like?”

I think that could be a useful primitive to have. It probably also has debugging value, in the same way that turning on logTransaction dumps mutations, though making it accessible through the API would let someone write a tool that also interprets those keys in a way that is meaningful for their use case.

I guess the ask would be that, if this is done, deserialization of the mutation blobs should probably also be exposed in some way, either through the bindings or through some kind of utility API with tools for dealing with them.

Yeah I can see this being useful. I’ve been thinking of making some experimental changes to the nodejs bindings to support realtime change feeds of key ranges. With a feature like this it would be much simpler to do stuff like this:

  • The bindings can expose a beforeCommit hook to user code. This is the only change we need in the bindings.
  • A realtime change feed extension can then query back the set of local mutations and write metadata to a more durable operation log, or whatever.

As for encoding, the simplest implementation would be to just have the mirrored set of key/value pairs under \xff\xff that were written. So if a user does tn.set('x', 'hi') then I should later be able to tn.getRange('\xff\xff' + 'mutations') (or whatever) and get back [['\xff\xff' + 'mutations' + 'x', 'hi'], ...]. Then I can just strip off the prefix and read back the key/value pairs I set.
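Here is roughly what the "strip off the prefix" step would look like on the client side. It is a sketch only: the `\xff\xff` + `mutations` prefix is the placeholder name from the paragraph above, and `strip_mirror_prefix` is a hypothetical helper, not something the bindings provide.

```python
PREFIX = b"\xff\xff" + b"mutations"  # hypothetical special-key prefix ("or whatever")


def strip_mirror_prefix(pairs):
    """Turn mirrored (prefix + key, value) pairs back into the (key, value) pairs that were set."""
    result = []
    for key, value in pairs:
        if not key.startswith(PREFIX):
            raise ValueError(f"unexpected key outside mirror range: {key!r}")
        result.append((key[len(PREFIX):], value))
    return result


# What a range read over the mirror might return after tn.set('x', 'hi')
mirrored = [(PREFIX + b"x", b"hi")]
recovered = strip_mirror_prefix(mirrored)
print(recovered)
```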

The other way to do it would be to have a single key contain all the key/value pair data encoded as tuples, since all bindings have a tuple encoder implementation already.

Oh wait, that won’t work, because not every mutation is a plain key/value pair. The set of mutations the client issues includes:

  • set(key, val)
  • clear(key)
  • clearRange(start, end)
  • atomics (add / max / min / bitAnd / bitOr / etc)
  • setVersionstampedKey / setVersionstampedValue

Maybe just a single \xff\xffMutations key, generated dynamically on read, whose value is a tuple-encoded list of all the mutations issued against the transaction.
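As a sketch of what that list could contain, here is a stand-in that records each write as a (type, param1, param2) triple in issue order. `MutationRecorder` is hypothetical and the numeric type codes are assumptions for illustration. The actual tuple encoding is left out to keep the example dependency-free.

```python
SET_VALUE, CLEAR_RANGE, ADD_VALUE = 0, 1, 2  # assumed type codes, for illustration only


class MutationRecorder:
    """Records each write as a (type, param1, param2) triple, in issue order."""

    def __init__(self):
        self.mutations = []

    def set(self, key: bytes, value: bytes) -> None:
        self.mutations.append((SET_VALUE, key, value))

    def clear(self, key: bytes) -> None:
        # A single-key clear expressed as the range [key, key + b"\x00")
        self.mutations.append((CLEAR_RANGE, key, key + b"\x00"))

    def clear_range(self, begin: bytes, end: bytes) -> None:
        self.mutations.append((CLEAR_RANGE, begin, end))

    def add(self, key: bytes, operand: bytes) -> None:
        # Atomic add: param2 carries the operand, not the resulting value
        self.mutations.append((ADD_VALUE, key, operand))


tn = MutationRecorder()
tn.set(b"x", b"hi")
tn.clear(b"y")
tn.clear_range(b"a", b"b")
tn.add(b"counter", (1).to_bytes(8, "little"))
for mutation in tn.mutations:
    print(mutation)
```

Unlike the mirrored key/value layout, this shape has room for clears and atomics, since param2 is interpreted according to the type.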

I think this is a good ask. However, I know there are two different serialization formats in the current code base:

  1. type, param1_len, param1, param2_len, param2. See MutationRef::serialize(). This is generated at Proxy and passed to TLogs, LogRouters, and BackupWorkers.

  2. type, param1_len, param2_len, param1, param2. The backup mutations at \xff/blog/UID/<hash><Version> => Mutations are saved in this format. Specifically, MutationListRef::push_back_deep packs a MutationRef as <len><type><p1len><p2len><p1><p2>.

So you can see that we have two different serialization formats, and both are persisted. I guess the easy way forward is to standardize on one format, migrate the other to match, and then expose the API.
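To show how the two field orderings differ on the wire, here is a rough sketch of both. The field widths are my assumption: every integer is written as a 4-byte little-endian value purely for illustration, and the leading `<len>` of the backup format is omitted. The real widths and byte order should be checked against MutationRef::serialize() and MutationListRef::push_back_deep.

```python
import struct


def pack_interleaved(mtype: int, p1: bytes, p2: bytes) -> bytes:
    """Layout 1: type, param1_len, param1, param2_len, param2."""
    return (struct.pack("<I", mtype)
            + struct.pack("<I", len(p1)) + p1
            + struct.pack("<I", len(p2)) + p2)


def pack_lengths_first(mtype: int, p1: bytes, p2: bytes) -> bytes:
    """Layout 2: type, param1_len, param2_len, param1, param2."""
    return struct.pack("<III", mtype, len(p1), len(p2)) + p1 + p2


def unpack_interleaved(buf: bytes):
    mtype, p1_len = struct.unpack_from("<II", buf, 0)
    offset = 8
    p1 = buf[offset:offset + p1_len]
    offset += p1_len
    (p2_len,) = struct.unpack_from("<I", buf, offset)
    offset += 4
    p2 = buf[offset:offset + p2_len]
    return mtype, p1, p2


def unpack_lengths_first(buf: bytes):
    mtype, p1_len, p2_len = struct.unpack_from("<III", buf, 0)
    p1 = buf[12:12 + p1_len]
    p2 = buf[12 + p1_len:12 + p1_len + p2_len]
    return mtype, p1, p2


a = pack_interleaved(0, b"x", b"hi")
b = pack_lengths_first(0, b"x", b"hi")
print(a.hex())
print(b.hex())
```

The same mutation produces different bytes under each layout, so a reader has to know which one it is looking at. That is the argument for settling on one before exposing anything.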