Layer for read-write transactions lasting longer than 5 seconds

#1

(This is mostly just out of curiosity)

Suppose I want read-write transactions lasting longer than 5 seconds. This could in theory be done in a layer by maintaining an index from version to write conflict range in a special subspace. A transaction would get an initial read version r0, perform reads at this version, and then have a long think about things (maybe it’s a user deciding what to do). To commit the transaction you could get a second read version r1, accumulate the write conflict range from r0 to r1 by reading the index at r1, manually check if the accumulated write conflict range intersects your read conflict range, and if not submit the transaction with read version r1 and the union of all reads performed at r0 and r1 as your read conflict range. If the index at r0 has been purged by some background cleaner job you have to abort the transaction. Basically you check for read-write conflicts from r0 to r1 manually in a layer, and then the resolver checks for read-write conflicts after r1.
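The commit-time check described above can be sketched in a few lines. This is a toy, in-memory stand-in for the (commit version -> write conflict range) index subspace; the names `validate` and `ranges_overlap` are illustrative, not a real API, and conflict ranges are modeled as half-open `[begin, end)` pairs of byte strings.

```python
def ranges_overlap(a, b):
    """True if half-open ranges a=[a0, a1) and b=[b0, b1) intersect."""
    return a[0] < b[1] and b[0] < a[1]

def validate(version_index, r0, r1, read_ranges):
    """Manually check for read-write conflicts between versions r0 and r1.

    version_index maps commit version -> list of write conflict ranges.
    Returns True if no write committed in (r0, r1] intersects our reads.
    """
    accumulated = []
    for version, write_ranges in version_index.items():
        if r0 < version <= r1:
            accumulated.extend(write_ranges)
    return not any(
        ranges_overlap(w, r) for w in accumulated for r in read_ranges
    )

# Example: a write to [b"k2", b"k3") committed at version 105 conflicts
# with a read of [b"k1", b"k5") performed at r0 = 100.
index = {105: [(b"k2", b"k3")], 210: [(b"x", b"y")]}
assert validate(index, r0=100, r1=200, read_ranges=[(b"a", b"b")])
assert not validate(index, r0=100, r1=200, read_ranges=[(b"k1", b"k5")])
```

If the check passes, the transaction is submitted with read version r1 and the union of all reads at r0 and r1 as its read conflict range, and the resolver covers everything after r1.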

A couple questions:

  1. Does this seem sound? I think it still achieves serializability but maybe I’m missing something.
  2. It would be useful to read your transaction’s read conflict range and write conflict range before committing. It would be possible to keep track of this manually, but the native API is already keeping track of it, so it’d be nice to just reuse that. Maybe there could be \xff\xff keys for this.
(Markus Pilman) #2

Of course this would work - you are basically just proposing to materialize conflicts. The main problems you would run into:

  1. Abort rate could be quite high if there’s enough contention. Your transaction could conflict on the version index itself, as many transactions will potentially have the same read version. You could improve this by using an index like (version, randomUID) -> write-set.
  2. If your transactions run for a very long time, this index would become rather large, so your range query could potentially time out. There would clearly be a limit on how long you can have a transaction running.
  3. Garbage collection: when do you delete entries from this index? This is basically a distributed-systems problem. How do you know when you can truncate at version x? How do you handle client failures? The easiest solution here would be to only allow transactions to run for y seconds (and you can configure y).
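The (version, randomUID) suggestion in point 1 can be realized with a simple key layout: pack the version as a big-endian 8-byte integer followed by a random UID, so concurrent transactions at the same version write distinct keys while the index still sorts by version for range reads. This is a hypothetical sketch; in a real layer the FDB tuple layer would serve the same purpose.

```python
import os
import struct

def index_key(version: int, uid: bytes = None) -> bytes:
    """Key for the version index: 8-byte big-endian version + random UID.

    Big-endian packing makes byte-wise key order match numeric version
    order; the UID suffix keeps same-version writers off the same key.
    """
    uid = uid if uid is not None else os.urandom(16)
    return struct.pack(">Q", version) + uid

# Keys sort by version first, so a range read over the index still
# returns exactly the entries committed in a given version interval.
keys = sorted(index_key(v) for v in (300, 100, 200))
versions = [struct.unpack(">Q", k[:8])[0] for k in keys]
assert versions == [100, 200, 300]
```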

The overhead you would introduce would be quite high. Alternatively, you could simply set some knobs and increase the 5 second limit (while this is not tested, there is no conceptual reason why you wouldn’t be able to do it). This would be much cheaper.

#3

The index here would be commit version -> write-set. For each range read from r0 to r1, any subsequent writes to the index are going to have a key > r1. I don’t think there are issues with contention on this index.

This is definitely an issue. The other interesting part is how to actually perform the accumulate-write-ranges-and-check-intersection computation efficiently if the index is large.

Yup. But no matter what policy you use for purging old versions from the index, as long as you can read r0 when you try to commit you’re safe serializability-wise.

Edit:

That’s true. I guess resolver memory might be a limit here. Also, if we’re using the resolver, then a recovery would abort all in-flight transactions, whereas with the commit version -> write conflict range index, in-flight transactions can span a recovery.

(Markus Pilman) #4

How do you write this index? You don’t know your own commit version until you have committed… Also, if you have batches of 1000 transactions, all of them would try to write to the same key…

I think what you are trying to do is a more expensive variant of:

  1. make resolvers store much more data so they can verify transactions over a very long period of time (by for example using a disk based storage)
  2. whenever a client receives a past_version exception from a storage server, get a new read version and retry within the same transaction.
  3. as an optimization you could always get the max-version of every read and if one is larger than your initial read-version you can abort as you already know that you will conflict.
  4. at commit time send your initial read-version to the proxy and let the resolver figure this out for you.
#5

https://apple.github.io/foundationdb/api-python.html#fdb.tuple.Versionstamp explains how to handle these issues.
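The idea behind versionstamps, emulated in pure Python: the client writes a key containing a 10-byte placeholder, and at commit time FDB substitutes the 8-byte commit version plus a 2-byte transaction-batch order. The client never needs to know its commit version in advance, and the 1000 transactions in a batch each get distinct, ordered keys. `fill_versionstamp` here just emulates the substitution FDB performs; it is not the real API.

```python
import struct

PLACEHOLDER = b"\x00" * 10  # stand-in for the incomplete versionstamp

def fill_versionstamp(key: bytes, commit_version: int, batch_order: int) -> bytes:
    """Emulate FDB filling a versionstamp placeholder at commit time."""
    stamp = struct.pack(">QH", commit_version, batch_order)  # 8 + 2 bytes
    return key.replace(PLACEHOLDER, stamp, 1)

# Two transactions in the same commit batch (same commit version):
k1 = fill_versionstamp(b"idx/" + PLACEHOLDER, commit_version=500, batch_order=0)
k2 = fill_versionstamp(b"idx/" + PLACEHOLDER, commit_version=500, batch_order=1)
assert k1 != k2 and k1 < k2  # distinct keys, ordered within the batch
```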

That would mean resolvers are no longer a stateless role, or that in-flight transactions don’t survive recoveries. Either way it’d be a non-trivial foundationdb core change, so not available to users currently.

From https://apple.github.io/foundationdb/known-limitations.html#long-running-transactions,

> If an application wants long transactions because of an external process in the loop, it can perform optimistic validation itself at a higher layer

So I guess ‘optimistic validation at a higher layer’ is the current best practice here. I’m just wondering how to actually do that.

(Alex Miller) #6

If we assume a perfect and scalable implementation of changefeeds, I suppose one could potentially try to subscribe to read keys to monitor them for writes that would cause conflicts later.

If you wanted to be really fancy, you’d then do something like Repairing Conflicts among MVCC Transactions to be able to fix your transaction and advance its read version rather than abort.

(Markus Pilman) #7

Ah I forgot about this feature. But my second point still remains - you would need something like a tuple of version and UID to make sure you don’t conflict on this key.

I don’t think in-flight transactions can ever survive recoveries. I also don’t believe that this would be a very important feature, as recoveries are relatively uncommon.

I don’t know :wink: maybe we should request some documentation from the people who wrote this?

(Alec Grieser) #8

I think you could actually get away without storing any additional data persistently at all if you did the following:

  1. Keep track of all read conflicts (maybe adding them to an \xff\xff key range to read them back from the client). Any mutations are stored in a separate data structure.
  2. Every, let’s say, 3 seconds, you issue an empty “commit” against FDB that validates all of the read ranges haven’t changed yet. (At present, one will need to include a dummy write conflict range in the transaction to have the transaction actually get committed. Or one could imagine adding a fairly trivial commit_read_only option.)
  3. You resume your transaction at a read version that is equal to the commit version of the above transaction (uh, or commit version minus 1, actually) and add in all read conflict ranges from the previous transaction.
  4. When you’re ready to commit, apply all of the mutations at once.
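The renewal loop above can be modeled with a toy, in-memory versioned store in place of FDB. This is only a sketch of the control flow, under the assumption that `renew` plays the role of the empty "commit": it fails if any read key changed since the last validated version, and otherwise advances the transaction’s read version. Read conflict ranges are simplified to a set of individual keys.

```python
class ToyStore:
    """Trivial versioned store standing in for FDB."""
    def __init__(self):
        self.version = 0
        self.last_write = {}  # key -> version of its last write
        self.data = {}

    def write(self, key, value):
        self.version += 1
        self.data[key] = value
        self.last_write[key] = self.version

class LongTransaction:
    def __init__(self, store):
        self.store = store
        self.read_version = store.version
        self.read_keys = set()  # stand-in for read conflict ranges (step 1)
        self.mutations = {}     # buffered writes, applied only at commit

    def read(self, key):
        self.read_keys.add(key)
        return self.store.data.get(key)

    def renew(self):
        """Steps 2-3: validate all reads, then resume at the current version."""
        for key in self.read_keys:
            if self.store.last_write.get(key, 0) > self.read_version:
                return False  # a read key changed: conflict, must abort
        self.read_version = self.store.version
        return True

    def commit(self):
        """Step 4: one final validation, then apply the buffered mutations."""
        if not self.renew():
            return False
        for key, value in self.mutations.items():
            self.store.write(key, value)
        return True

store = ToyStore()
store.write(b"a", 1)
txn = LongTransaction(store)
txn.read(b"a")
txn.mutations[b"b"] = 2
assert txn.renew()            # nothing changed: "lease" renewed
store.write(b"c", 3)          # an unrelated write does not conflict
assert txn.commit()
txn2 = LongTransaction(store)
txn2.read(b"b")
store.write(b"b", 99)         # someone overwrites a key we read
assert not txn2.renew()       # conflict detected at renewal time
```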

I think you would need to either implement RYW yourself here or make your “separate data structure for storing mutations” a different FDB transaction or something.

I, um, don’t think this is necessarily a good idea. For one, it creates a lot of resolver traffic from sending read conflict ranges back and forth, and I think there are a lot of subtle things to get wrong.

I also suspect that the abort rate would be higher than desired not because of conflicts with the version history range but just because longer transactions, as they do more work, are more likely to touch more keys and therefore touch a key that changes later.

I think for long running transactions like this, what you’d really want is the ability to acquire locks and just use pessimistic concurrency control. I think this could also be accomplished with a layer, but then you’d have to worry about the cost of persisting locks to FDB, which could be high if you’re taking a lot of them.
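A minimal sketch of that pessimistic alternative, assuming a lock table keyed by resource with an owner and an expiry (so a crashed client doesn’t hold locks forever). The in-memory dict here stands in for a lock subspace; in a real layer each acquire and release would be its own short FDB transaction, which is exactly the persistence cost mentioned above. All names are illustrative.

```python
import time

class LockTable:
    def __init__(self, ttl=10.0):
        self.ttl = ttl
        self.locks = {}  # key -> (owner, expires_at)

    def acquire(self, key, owner, now=None):
        """Take the lock if it is free, ours, or its lease has expired."""
        now = time.monotonic() if now is None else now
        holder = self.locks.get(key)
        if holder is not None and holder[0] != owner and holder[1] > now:
            return False  # held by someone else and not yet expired
        self.locks[key] = (owner, now + self.ttl)
        return True

    def release(self, key, owner):
        """Drop the lock, but only if we still own it."""
        if self.locks.get(key, (None,))[0] == owner:
            del self.locks[key]

table = LockTable(ttl=10.0)
assert table.acquire(b"row/1", "alice", now=0.0)
assert not table.acquire(b"row/1", "bob", now=5.0)   # still held by alice
assert table.acquire(b"row/1", "bob", now=15.0)      # alice's lease expired
```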

(Markus Pilman) #9

I think you are right, what you’re proposing sounds correct to me.

In practice most long-running transactions are read-only, so I don’t think you would want to have pessimistic concurrency control for those. What we really need for this to work is a versioned storage (like Redwood V2).