I have a scenario where I have a range of keys k1, k2, ..., kn, and a transaction selects a key from this range arbitrarily (say kx), reads the value of kx, and then writes some value back to kx.
There is an additional constraint: only one transaction may be modifying this entire range at a given time (i.e. concurrent modifications in this range, even if on different keys, are disallowed). To achieve this, each of the above transactions could add either an explicit ReadConflict or a WriteConflict on the entire range of keys (k*).
Should I prefer explicit ReadConflicts or WriteConflicts for achieving the above “locking” semantics? Are there any performance, correctness, or other considerations when choosing between the two for this kind of situation (where each transaction is doing a read as well as a write on a key, and we want to lock an entire key range)?
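For concreteness, here is a minimal sketch of the two options using the Python binding. The API version, range bounds, and key/function names below are placeholders I made up for illustration, not anything from the question:

```python
import fdb

fdb.api_version(710)   # assumption: any recent API version behaves the same here
db = fdb.open()

RANGE_BEGIN = b"k"     # placeholder: start of the k* keyspace
RANGE_END = b"k\xff"   # placeholder: end of the k* keyspace

# Option 1: read-modify-write kx, plus an explicit ReadConflict on the whole range.
@fdb.transactional
def update_with_range_read_conflict(tr, kx, new_value):
    tr.add_read_conflict_range(RANGE_BEGIN, RANGE_END)
    _old = tr[kx]          # the read adds a read conflict on kx (subsumed by the range above)
    tr[kx] = new_value     # the write adds the default write conflict on kx

# Option 2: read-modify-write kx, plus an explicit WriteConflict on the whole range.
@fdb.transactional
def update_with_range_write_conflict(tr, kx, new_value):
    tr.add_write_conflict_range(RANGE_BEGIN, RANGE_END)
    _old = tr[kx]          # default read conflict on kx
    tr[kx] = new_value
```

Either function would be called as, e.g., `update_with_range_read_conflict(db, b"k3", b"new-value")`; the `@fdb.transactional` decorator handles retries.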
Somebody else could correct me if my reasoning is wrong.
Serializable isolation roughly means there is a global order across all transactions. Say the database is at state x, and there are 3 transactions A, B, C that happen concurrently, all of which the database allows to commit. That basically means there is at least one order such that, if you apply the transactions one by one in that order, you get the new database state y.
1. If all of them read/write different sets of keys, then the database can order them in any way and the final state will be the same.
2. Assume key k1 is 1, and let's say A reads k1 and sets it to 2 while B reads k1 and sets it to 3. There is no way we can allow both transactions: once we apply A, the value of k1 changes, so we can no longer allow B (k1 is no longer 1, which invalidates one of the reads done by B), and vice versa.
3. Assume key k1 is 1, and let's say A blindly writes k1 as 2 and B blindly writes k1 as 3. The database is still free to allow both of them; it can select either AB or BA as the order.
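In FoundationDB terms, cases 2 and 3 could look something like the following sketch with the Python binding (key names are placeholders). Two concurrent calls to the read-modify-write version cannot both commit, while two concurrent blind writes can:

```python
import fdb

fdb.api_version(710)
db = fdb.open()

K1 = b"k1"   # placeholder key

# Case 2: each transaction reads k1 before writing it, so each carries a read
# conflict on k1. If two of these run concurrently, only one commits; the
# other fails resolution (error 1020, not_committed) and is retried by the
# @fdb.transactional decorator.
@fdb.transactional
def read_modify_write(tr, new_value):
    _current = tr[K1]
    tr[K1] = new_value

# Case 3: blind writes. Neither transaction reads k1, so there is no read
# conflict to invalidate; both can commit, and the resolver is free to order
# them as AB or BA.
@fdb.transactional
def blind_write(tr, new_value):
    tr[K1] = new_value
```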
To answer your question: setting only a write conflict will cause the issue I outlined in 3, so you have to set a read conflict on the whole range.
As for the write conflict, you have two choices:
- You can set it on the single key you modify (this is done by default, so there is no need to do it explicitly). This still allows concurrent reads on the range (except for the single key you modify); concurrent writes are not allowed on any key in the range.
- You can set it on the whole range, which will not allow any concurrent reads or writes on the range.
If you consider the scenario I described, I need to lock the entire “key range”, not just the specific key (under the specific scenario where the transaction is doing both a read and a write on a key from that range). Under that requirement, I believe that both of the options I described will lead to the desired results in a consistent manner (no anomalies).
I am looking for opinions on whether one of the two options (Read/Write Conflict Ranges) should be preferred, or whether they are equivalent in all respects.
I have a scenario where I have a range of keys k1, k2, … kn, and a transaction selects a key from this range arbitrarily (say kx), reads the value of kx, and then writes some value back to kx.
You might want to elaborate a bit more on what else you’re doing.
It isn’t clear that having a range-wide conflict (of any kind!) will affect the semantics of the system. If you’re just reading kx, running some pure function on it, and writing the result back to the database, then there would be no way to ever tell what order you processed the keys in. You could turn off all conflicts.
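As a hedged illustration of that point (Python binding; `pure_fn` and the key are placeholders): a snapshot read adds no read conflict, so nothing another writer does can fail this transaction, and only the default per-key write conflict from the set remains.

```python
import fdb

fdb.api_version(710)
db = fdb.open()

def pure_fn(value):
    # placeholder for some pure function of the old value
    return value

@fdb.transactional
def conflict_free_update(tr, kx):
    old = tr.snapshot[kx]                            # snapshot read: no read conflict is added
    old_bytes = bytes(old) if old.present() else b""
    tr[kx] = pure_fn(old_bytes)                      # still adds the default write conflict on kx
```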
If there are any other processes active in this key range, then what they want to do will have an impact on what conflict you’d want to use. If this key range is heavily read, then you would want to avoid unnecessary write conflicts.
I tried to keep the application details abstract so as not to get distracted by them; here are more details that should clarify the scenario:
There are two types of transactions on this range: (a) Update Transactions, which <read, modify, write> a set of keys in each transaction; and (b) Read-Only Transactions, which only perform read operations on one or more keys from this range.
Additionally, there are some range-wide invariants that need to be maintained by each Update Transaction. So, I am fine with giving up concurrency when updating any key in this range (there could be smarter ways to lock only a smaller subrange of this entire range, but for the moment we can treat that as an independent optimization).
However, I do not want to block the Read-Only Transactions (of type (b)). These should read a consistent view of the database.
I too had earlier considered this to be a potential factor, namely WriteConflicts conflicting with other read-only operations. But Read-Only Transactions do not perform conflict checks at commit time (or rather, commit() is a no-op for them, unless one adds explicit conflict ranges to them), and hence they will not be conflicted by the WriteConflicts added by Update Transactions.
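For illustration, a Read-Only Transaction of type (b) could look like the sketch below (Python binding; range bounds are placeholders). It reads at a single read version, writes nothing, and so carries no write conflicts and has nothing to check at commit:

```python
import fdb

fdb.api_version(710)
db = fdb.open()

RANGE_BEGIN = b"k"       # placeholder range bounds
RANGE_END = b"k\xff"

@fdb.transactional
def read_only_scan(tr):
    # Reads everything in the range at one read version (a consistent view).
    # With no writes, commit is effectively a no-op, so range-wide write
    # conflicts added by Update Transactions can never fail this transaction.
    return [(kv.key, kv.value) for kv in tr.get_range(RANGE_BEGIN, RANGE_END)]
```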
From a performance point of view, they are…roughly equivalent. The way they are implemented, adding a read conflict range or write conflict range to a transaction first just serializes the endpoints of the ranges and includes them in the commit message. This is sent to the resolver (through a proxy), and each read conflict range is checked against the write history, then (if the transaction doesn’t have any conflicting ranges) each write conflict range is added to the resolver’s in-memory history of cluster writes. So the low-level details are slightly different…but not much.
I think the bigger concern would be what effect this has on other operations. If you use a write conflict range, then all outstanding read-write transactions that read a key in the range will be failed, even if the actual key they read didn't change. This may or may not be a problem depending on your application's write patterns. But if you use a read conflict range, then the operation this transaction is performing is more likely to fail.
Thinking it over, I think I would go with a read conflict range for the whole range and write conflicts only on the keys that are actually being modified. For this specific read/write pattern (where every modifier homogeneously adds the appropriate conflict ranges), I don't think it matters from either a correctness or a performance point of view, but using a read conflict range seems more natural for the following reason. Your read conflict set should, in some sense, be the set of things whose values need to stay constant between transaction start time and commit time for your invariants to hold, and the write conflict set is then the set of things whose values are updated as a result of the current values in the read conflict set. In this model, the things that need to stay constant are the values in the range, and the things that are updated are the key within the range plus the other keys that are written as part of maintaining the invariants.
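To make that concrete, an Update Transaction under this scheme might look roughly like the following (Python binding; the range bounds, the invariant key, and compute_new_value are made-up placeholders):

```python
import fdb

fdb.api_version(710)
db = fdb.open()

RANGE_BEGIN = b"k"              # placeholder range bounds
RANGE_END = b"k\xff"
INVARIANT_KEY = b"k0/summary"   # placeholder key written while maintaining the invariant

@fdb.transactional
def update_transaction(tr, kx, compute_new_value):
    # Read conflict on the whole range: if any committed write lands anywhere
    # in the range between this transaction's read version and its commit,
    # this transaction fails resolution and is retried.
    tr.add_read_conflict_range(RANGE_BEGIN, RANGE_END)
    old = tr[kx]
    new = compute_new_value(old)
    # Only the keys actually written get write conflicts (added by default by
    # the sets), so concurrent readers of other keys in the range are untouched.
    tr[kx] = new
    tr[INVARIANT_KEY] = new     # stand-in for whatever bookkeeping the invariant needs
```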
Depending on the nature of these invariants, it might actually be unnecessary to add either a read or a write conflict range over the whole range. I'm not sure what the invariants are, but I would check whether they can be maintained with something like an atomic operation, which allows multiple clients to update the same key concurrently without conflict. If not, it's also possible that you could rely on the read-modify-write pattern on those range-wide keys instead of explicit read and write conflict ranges on the keys themselves. The benefit there is that if, one day, you figure out how to update those keys without conflicts, then you get concurrency “for free” without also having to modify your reads and writes to stop adding manual ranges.
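For example, if the range-wide invariant were something like a running total over the range, an atomic ADD could maintain it without adding any read conflict at all. A sketch with the Python binding, where the keys and encoding are assumptions on my part:

```python
import fdb
import struct

fdb.api_version(710)
db = fdb.open()

TOTAL_KEY = b"k_total"   # placeholder: a key holding the sum of all values in the range

@fdb.transactional
def update_value_and_total(tr, kx, new_value, delta):
    tr[kx] = new_value
    # Atomic ADD: the mutation is applied server-side, so this transaction
    # never reads TOTAL_KEY and concurrent updaters do not conflict on it.
    tr.add(TOTAL_KEY, struct.pack("<q", delta))
```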