Cannot commit transaction that reads the metadataVersion key after changing it

In a transaction, if I update the \xff/metadataVersion key and then attempt to read it again, I get a “Read or wrote an unreadable key” error (1036). That was a bit surprising at first, but it seems to be a side effect of using versionstamps, whose values are not known until after commit, and is the expected behavior.

Looking at the implementation in the record layer, it seems to specifically catch this error, and return the null versionstamp instead. I changed my code to do the same thing if the read fails.

My surprise is that if I then attempt to commit the transaction, the commit also fails with the same error … ? Using snapshot isolation or not has no impact. The error is thrown by fdb_transaction_commit, and is rethrown by fdb_transaction_on_error.

Transaction #5 (read/write, 4 operations, '#' = 0.5 ms, started 22:24:26.1274739Z, ended 22:24:26.1584759Z)
┌  oper. ┬─────────────────────────────────────────────────────────────────┬──── start ──── end ── duration ──┬─ sent  recv ┐
│ 0   a  │ X                                                               │ T+  1.841 ~   2.196 (    355 µs) │    31       │ Atomic_VersionStampedValue <FF>/metadataVersion, <00 00 00 00 00 00 00 00 00 00>
│ 1   mv°│ :###`                                                           │ T+  2.199 ~   3.746 (  1,547 µs) │             │ GetMetadataVersion => [AccessedUnreadable] => <null>
│ 2  !Co*│ _______$###################################################X    │ T+  5.270 ~  30.141 ( 24,871 µs) │             │ Commit => [AccessedUnreadable] Read or wrote an unreadable key
│ 3  !Er°│ ____________________________________________________________=## │ T+ 30.516 ~  31.707 (  1,192 µs) │             │ OnError AccessedUnreadable (1036) => [AccessedUnreadable] Read or wrote an unreadable key

note: timings are weird because it’s a unit test and the JIT has to compile all the code on the fly!
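In Java-flavored pseudocode (METADATA_VERSION_KEY and ZERO_STAMP are placeholder constants, and error handling is elided), the sequence that reproduces this is roughly:

```
// inside a transaction retry loop
tr.mutate(SET_VERSIONSTAMPED_VALUE, METADATA_VERSION_KEY, ZERO_STAMP);

try {
    tr.get(METADATA_VERSION_KEY).join(); // throws 1036 (accessed_unreadable)
} catch (FDBException e) {
    // swallow the error, like the record layer does, and use null instead
}

tr.commit().join(); // ...but this also throws 1036!
```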

What is then the point of catching the error when reading the key, if it will inevitably throw the same error at commit time? This means that any combination of layers (one that mutates the schema, another that attempts to use a cache) will systematically throw at commit time.

The record layer attempts to work around this with a dirtyMetaDataVersionStamp flag: once the key has been changed, it does not even attempt to read it again (which would doom the transaction). But:

  1. it only works locally for the record-layer, not if another layer changes the key separately.
  2. it seems racy (no locks) if multiple threads use the same transaction.
  3. if the read still happens somehow and the error is caught, the code sets the flag to true, but the transaction will still fail to commit later, so … ??

Either I’m missing something critical, or it means that it is impossible to compose two layers that use the metadataVersion key for caching in the same read/write transaction?

I could maybe see moving this “dirty versionstamp” protection logic from layer code up to the binding itself, so that no one can slip past, but then I’m not sure if this would be race-proof, with both a read and a write happening on the metadataVersion key in the same transaction.

I believe the idea is that if your transaction ever reads one of these keys, then it is assumed that there is some logic error, so one can’t rely on the value one computed. I’m…not sure that’s necessarily correct all of the time, and perhaps I’m also missing something in the rationale. Perhaps @Evan has more on that?

Right. But the benefit is that if, somehow, the transaction is not committed because, for example, the user decides what they really wanted was for it to be a read-only operation, then at least their reads succeed. It’s not a huge benefit, but it’s there. (And it sets dirtyMetaDataVersionStamp to true so that at least if future callers call the function, then they don’t need to make a JNI hop to determine that they will just get an error anyway.)

I’ll also say that when I wrote it, I, too, thought that the error would be isolated to the read, and then by catching the error, I would still be able to commit the transaction. When that ended up being not the case, it seemed worthwhile to keep it in (because of what I mentioned in the above paragraph), though it’s definitely not as useful as one that allowed a full read-write transaction to take place.

Yeah, a little bit. I believe the actual race is:

  1. A user calls getMetaDataVersionStamp, which checks if the version stamp is dirty. Let’s say that the value is currently false, so it moves on, but is halted before it begins to actually read.
  2. A different thread then calls setMetaDataVersionStamp, which mutates the key.
  3. The original thread resumes. It then reads the (mutated) key and hits the error.

I think one could actually patch the race by: (1) setting dirtyMetaDataVersionStamp to true and actually setting the key in one synchronized block (or critical section or what have you) and (2) checking the value of the flag and then spawning the future to read the key in another synchronized block (but not waiting on the future necessarily). I think that would work, but it depends on the FDB operation being placed onto the network thread before returning, which I think is true of the client. (So then if a write to the meta-data version key comes in while a read is outstanding, it will be applied after the read completes. Or I think those are the semantics, without having actually tested them.)
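The two synchronized blocks described above could be sketched like this. This is a hypothetical stand-in, not the real FDB client: the single-thread executor simulates the network thread that serializes operations in order, and `storedVersion` stands in for the key’s value.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for a transaction: the single-thread executor plays
// the role of the FDB network thread, which serializes all operations.
class MetadataVersionGuard {
    private final ExecutorService networkThread = Executors.newSingleThreadExecutor();
    private volatile String storedVersion = "@5625762147352-0"; // stands in for \xff/metadataVersion
    private boolean dirty = false;                              // guarded by `this`

    // (1) set the flag and enqueue the mutation in ONE critical section.
    public synchronized void setMetadataVersion() {
        if (dirty) return; // already changed in this transaction
        dirty = true;
        networkThread.submit(() -> { storedVersion = null; }); // the versionstamped write
    }

    // (2) check the flag and enqueue the read in the SAME critical section,
    // so a mutation can never slip in between the check and the read.
    public synchronized Future<String> getMetadataVersionAsync() {
        if (dirty) return CompletableFuture.completedFuture(null);
        return networkThread.submit(() -> storedVersion);
    }

    // Convenience blocking wrapper (waits on the future OUTSIDE the lock).
    public String getMetadataVersion() {
        try {
            return getMetadataVersionAsync().get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public void shutdown() {
        networkThread.shutdown();
    }
}
```

A read that was enqueued before a write will always observe the pre-write value, because the executor runs them in submission order; that is the property being assumed of the FDB network thread above.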

I was aware of the race when I wrote this code, but I suppose I hadn’t quite thought through how one could make it work by relying on serializing steps onto the FDB network thread. I’ll also note that this race only means that some things that should have succeeded (by correctly ordering their accesses to this key) might fail with error 1036. That’s not fantastic, but it doesn’t corrupt the meta-data.

Yeah, I suspect it’s impossible (for certain definitions of compose) without bubbling down this dirty versionstamp protection either to the bindings or to some kind of common library that is shared by all layers. It would be great if this were race-proof (possibly realized by the logic I outlined above with a few extra critical sections); if not, the protection even in its racy form can go a long way.

My thinking is: if nobody checks the error code when reading the key after changing it, then the transaction handler will fail with a non-retryable error, so the layer developer will notice the issue.

But if the layer developer DOES try and catch that specific code, and continues up until commit, maybe it should be allowed? The only way for the commit to throw that error is if a read failed but the error was caught and ignored. Wouldn’t this be a hint that the layer developer is OK with it?

I can see that logic for any random key, but maybe, specifically for the metadataVersion key, there could be an exception to the rule; or maybe the layer must set a transaction option as an explicit opt-out of this safety check?

I’ve attempted to move the logic in the binding itself, and I’m getting something that looks a bit more usable:

The following test:

await logged.ReadWriteAsync(async tr =>
{
	// We can read the version before
	var before = await tr.GetMetadataVersionAsync();
	Log($"Before: {before}");
	Assert.That(before, Is.Not.Null);

	// Another read attempt should return the cached value
	var cached = await tr.GetMetadataVersionAsync();
	Log($"Cached: {cached}");
	Assert.That(cached, Is.Not.Null.And.EqualTo(before));

	// change the version from inside the transaction

	// we should not be able to get the version anymore (should return null)
	var after = await tr.GetMetadataVersionAsync();
	Log($"After: {after}");
	Assert.That(after, Is.Null, "Should not be able to get the version right after changing it from the same transaction.");

}, this.Cancellation);

Produces the following result and log:

Before: @5626280078355-0
Cached: @5626280078355-0

Transaction #14 (read/write, 5 operations, '#' = 0.5 ms, started 16:34:48.8935839Z, ended 16:34:48.8968885Z)
┌  oper. ┬────────┬──── start ──── end ── duration ──┬─ sent  recv ┐
│ 0   G  │ $      │ T+  0.003 ~   0.438 (    435 µs) │             │ Get <FF>/metadataVersion => @5625762147352-0
│ 1   G  │ `      │ T+  0.455 ~   0.456 (      1 µs) │             │ Get <FF>/metadataVersion => @5625762147352-0
│ 2   a  │ `      │ T+  0.491 ~   0.497 (      6 µs) │    31       │ Atomic_VersionStampedValue <FF>/metadataVersion, <00 00 00 00 00 00 00 00 00 00>
│ 3   G  │ `      │ T+  0.497 ~   0.498 (      1 µs) │             │ Get <FF>/metadataVersion => <null>
│ 4   Co°│ .##### │ T+  0.502 ~   3.287 (  2,784 µs) │             │ Commit
> Committed 31 bytes in 3.304 ms and 1 attempt(s)

The first read is passed through to the native binding, the second read is cached by the .NET binding (no native interop), and the last read attempt after the change is dropped by the binding and null is returned instead. That way, the native binding never sees the last read, and commit won’t fail.

I’m also using a lock to serialize the atomic operation and the async read; hopefully this is not racy, but I can’t prove it easily. I hope that a slowdown in the network thread won’t mean that I’m locking the transaction for too long!

So, if I understand correctly, the logic in setMetadataVersionStamp() would be:

lock {
    if (dirtyMetaDataVersionStamp) return; // already done!

    // set the flag
    dirtyMetaDataVersionStamp = true;

    // start an async read, but we do not care for the result

    // change the version
    tr.setVersionstampedValue(MetadataVersionKey, MetadataVersionValue);
}

And in getMetadataVersionStamp():

future = null;
lock {
    if (dirtyMetaDataVersionStamp) return null;
    // start the async read inside the lock
    future = tr.getAsync(MetadataVersionKey);
}
// wait for result outside the lock
return future.getResult().parseVersionStamp();

I’m not exactly sure then what the fire-and-forget read done in setMetadataVersionStamp() accomplishes?

Do you mean maybe that calling atomic(…) on a key does not impact any read started previously (but not yet completed), and that’s how we can ensure that a pending call to getMetadataVersionStamp() will not collide with the mutation ?

Oh, sorry. I only meant to imply that the “get” request just needs to be in the same critical section as the read of dirtyMetaDataVersionStamp in the getMetaDataVersionStamp() method. I think that’s necessary basically so that one never calls setVersionstampedValue between the read of the variable and the read of the key. But you shouldn’t need to do any reads in the set method.

Right, yeah, that was the idea. I think that the way FDB serializes operations onto the network thread, the read for the key will always be executed before the atomic operation as long as .get() is called before .setVersionstampedValue, and I believe you’ll get the value back prior to calling .setVersionstampedValue. But it would be worth testing, etc.

So the correct way to deal with such versionstamps would look like, in pseudo code:

void setMetadataVersionStamp() {
    lock(...) {
        if (self.dirtyMetaDataVersionStamp) return; // already done!

        // prohibit this transaction from reading the key again
        self.dirtyMetaDataVersionStamp = true;

        // change the version
        self.setVersionstampedValue(MetadataVersionKey, MetadataVersionValue);
    }
}

versionstamp getMetadataVersionStamp() {
    future = null;
    lock(...) {
        // don't do it if it has already been changed!
        if (self.dirtyMetaDataVersionStamp) return null;
        // start the async read inside the lock
        future = self.getAsync(MetadataVersionKey);
    }
    // wait for result outside the lock
    return future.getResult().parseVersionStamp();
}

I’m also thinking that the issue is not specific to the \xff/metadataVersion key. Basically, any layer wanting to implement its own cache will probably need a versionstamped key to track changes to its own schema, and it will be confronted with the very same issue. This means that this method should be made generic and accept a ‘key’ argument (with the default value meaning the global metadataVersion key), and instead of a single boolean flag, should keep a map of keys that have been marked as “tombstoned” …
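A minimal sketch of that generic per-key variant might look like this. All names are illustrative (not the actual binding API), and the database read is simulated by a constant:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: a cache of already-read versionstamp values, plus a
// set of keys "tombstoned" by a mutation in this transaction.
class VersionStampCache {
    private final Map<String, String> cached = new HashMap<>(); // key -> last value read
    private final Set<String> tombstoned = new HashSet<>();     // keys mutated in this tx

    // Stands in for the actual read through the native binding.
    private String readFromDatabase(String key) {
        return "@5700013521295-0";
    }

    public synchronized String getMetadataVersion(String key) {
        if (tombstoned.contains(key)) return null; // reading it for real would doom the tx
        return cached.computeIfAbsent(key, this::readFromDatabase); // pass-through once, then cached
    }

    public synchronized void touchMetadataVersion(String key) {
        tombstoned.add(key); // all later reads of this key must return null
        cached.remove(key);
        // here the real binding would issue the versionstamped atomic mutation
    }
}
```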

Looks like a lot of memory allocation just to protect against the commit failing!

I think I need to write a bunch of unit tests exploring the limits of what happens when concurrent reads/writes happen on the same transaction, especially when involving versionstamps.

I added a generic API to deal with versionstamped metadata keys, which, if called with no key, will simply read \xff/metadataVersion instead; it helps layers build a caching infrastructure without having to duplicate all the logic above.

The test creates two versionstamped keys, Foo and Bar, and a third key Baz that does not exist yet. The transaction reads all 3 keys, mutates the key Foo, and reads all 3 keys again. Only the key Foo returns null; the other 2 still use the cached result.

This API serves two purposes:

  • Prevent the application developer from nuking the transaction and having to implement complex logic everywhere
  • Add a cache for these keys that does not need to call into the native binding every time (JNI call for Java, Interop call for .NET, etc…), which can speed things up a bit.

I think that the second point alone can justify the existence of this API, even if the behavior of the fdb binding changes in the future regarding commits failing with 1036.

Only downside: if the application developer calls the regular Get(…) / GetRange(…) API to read these keys, then the code will fail with error 1036. Hopefully there will be some documentation somewhere that explains the situation and the workaround (using the dedicated binding API)!

The following unit tests:

// changing the metadata version and then reading it back from the same transaction CANNOT WORK!
await logged.ReadWriteAsync(async tr =>
{

	// Read all three keys before

	var before1 = await tr.GetMetadataVersionAsync(foo);
	Log($"BeforeFoo: {before1}");
	Assert.That(before1, Is.Not.Null);

	var before2 = await tr.GetMetadataVersionAsync(bar);
	Log($"BeforeBar: {before2}");
	Assert.That(before2, Is.Not.Null.And.Not.EqualTo(before1));

	var before3 = await tr.GetMetadataVersionAsync(baz);
	Log($"BeforeBaz: {before3}");
	Assert.That(before3, Is.EqualTo(new VersionStamp()));

	// Change the version of first key from inside the transaction


	// Read all three keys again

	var after1 = await tr.GetMetadataVersionAsync(foo);
	Log($"AfterFoo: {after1}");
	Assert.That(after1, Is.Null, "Should not be able to get the version right after changing it from the same transaction.");

	var after2 = await tr.GetMetadataVersionAsync(bar);
	Log($"AfterBar: {after2}");
	Assert.That(after2, Is.Not.Null.And.EqualTo(before2));

	var after3 = await tr.GetMetadataVersionAsync(baz);
	Log($"AfterBaz: {after3}");
	Assert.That(after3, Is.EqualTo(new VersionStamp()));

}, this.Cancellation);

Produces the following result:

BeforeFoo: @5700013585814-0
BeforeBar: @5700013521295-0
BeforeBaz: @0-0
AfterBar: @5700013521295-0
AfterBaz: @0-0

Transaction #7 (read/write, 8 operations, '#' = 0.5 ms, started 13:12:21.2760796Z, ended 13:12:21.2800773Z)
┌  oper. ┬─────────┬──── start ──── end ── duration ──┬─ sent  recv ┐
│ 0   G  │ #       │ T+  0.003 ~   0.506 (    503 µs) │     9    10 │ Get <02>TEST<00>Foo => @5700013585814-0
│ 1   G  │ `:      │ T+  0.516 ~   0.613 (     97 µs) │     9    10 │ Get <02>TEST<00>Bar => @5700013521295-0
│ 2   G  │ _:      │ T+  0.632 ~   0.724 (     92 µs) │     9    10 │ Get <02>TEST<00>Baz => @0-0
│ 3   a  │ _`      │ T+  0.735 ~   0.741 (      6 µs) │    23       │ Atomic_VersionStampedValue <02>TEST<00>Foo, <00 00 00 00 00 00 00 00 00 00>
│ 4   G  │ _`      │ T+  0.742 ~   0.743 (      1 µs) │     9    10 │ Get <02>TEST<00>Foo => <null>
│ 5   G  │ _`      │ T+  0.745 ~   0.746 (      1 µs) │     9    10 │ Get <02>TEST<00>Bar => @5700013521295-0
│ 6   G  │ _`      │ T+  0.756 ~   0.757 (      1 µs) │     9    10 │ Get <02>TEST<00>Baz => @0-0
│ 7   Co°│ _=##### │ T+  0.765 ~   3.671 (  2,907 µs) │             │ Commit
> Read 60 bytes and Committed 23 bytes in 3.682 ms and 1 attempt(s)