Thanks Alex for your reply.
I went through that post earlier and it did help me in getting some of the basics. However it did opened up some more questions which I raised in this post as well. Thats a positive thing to happen.
If you can help me understand this, that will be great. Thanks a lot for your time to reply me.
So, a bit on our write pipeline, which might be useful. The order of operations is:
A client sends transaction to a proxy.
The proxy asks the master for a commit version.
The master sends back a commit version that is higher than any commit version seen before.
The proxy sends the read and write conflict ranges to the resolver(s) with the commit version included.
The resolver responds back with whether the transaction has any conflicts with previous transactions.
If there are conflicts, the proxy responds back to the client with a not_committed error.
If there are no conflicts, the proxy sends the mutations and commit version of this transaction to the transaction logs.
Once the mutations are durable on the logs, the proxy responds back success to the user.
So, to answer your questions:
Yep, that’s exactly what it does. In particular, each transaction says what its read and write conflict ranges were (or, roughly, which keys it read and which keys it wrote). This it gets from the client. But each transaction also includes its commit version.* This allows the resolver to sort the transactions by this version, which then becomes the serial order in which the transactions could be rearranged. (In other words, if you know each transaction’s commit version, you can pretend that all transactions were executed in serial in the same order as their commit versions.) That is to say, this is how we guarantee “serializability”, and any transaction which has a conflict is rejected as that would break our serializability guarantee.
So, as you can tell by the flow listed above, the tlogs don’t actually have the latest version. In particular, any version that is associated with any transaction that has begun but not finished the commit pipeline will not be known to the transaction log. For example, let’s say transaction A is sent to a proxy and (hypothetically) were to get its commit version by asking the tlogs for a version. It then sends information to the resolver and is waiting to hear back. Transaction B then comes along and also asks the transaction logs for a version. They haven’t heard that A has been committed, so there’s nothing to stop them from giving B a version that is the same as the version given A.
Now, the transaction logs could somehow track the latest version they’ve given out…or something. But at this point, they are basically all having to do what the master is doing, so it ends up being easier just to elect a singleton, which can trivially enforce monotonicity of versions. It’s also important for us to know for each commit version what the previous commit version was, because that’s how we know on the resovlers and the logs that there aren’t any gaps. Because the master is the only one giving these out, it knows that the last one was, too.
Read versions are never communicated to the tlogs–only commit versions. The commit version is included along with the mutations of each transaction. The master already has a commit version at least as new by the nature of the commit pipeline.
When the proxy gets a request for a new read version, it asks all of the other proxies what was the latest commit version of the most recent transaction they’ve responded “success” to. The proxy takes the max of all responses (plus the max of the answer to that question asked to itself) to produce a version that is greater than or equal to any commit version that had been responded successfully to the user at the time the get read version request was issued. This guarantees causal consistency, in that the version the client receives is therefore guaranteed to include any commit from any transaction that heard back “success” before the client made the request.
In theory, the proxies could actually talk to the logs instead. If they asked each transaction log what the most recent version they had committed was, then the proxy could produce a similar version. I believe the math here is that the proxy should take the minimum of all versions it receives back from the transaction logs as that’s the greatest value that the user could have heard back success from.**
* I’m being a little sloppy here. It actually includes its commit version and a two-byte “batch version” for ordering transactions within a single commit version. But if you pretend each commit version contains exactly one transaction, then this is equivalent to what I said.
** This assumes you aren’t running with anti-quorums enabled. If you are, you can ignore the k smallest if you anti-quorum is k.
Thanks Alec for the reply. I have tried to dipict my understanding here. Can you please check if my understanding is correct? I to have 2 questions that.
Is commit version and read versions are same
What is the significance of read version?
My understanding is in Serializable optimiztic MVCC, the client receives the read version, Read time stamp ( RTS) and the sends back the mutation, read version RTS and WTS.
then the Resolver check for any conflict action between RTS and WTS for that key/key range based on commit version for the read version.
If any conflict actions then marks it as conflict otherwise signal to go ahead with the transaction.
Is there a possibility of conflict post Resolver send a success to Proxy and while executing at Tlog and storage system? If so how that is being taken care?
Thank you all guys for creating a Wonderfull platform to interact and help each other to understand better.
Both of these numbers, in a sense, originate from the master, but at any point in time, the read version and commit version won’t be the same. The read version is the smallest number that is equal than or greater to any commit that has been acknowledged as committed by FDB. The commit version is a number that is greater than any previously granted commit version. The commit version should be greater than the read version, and this difference in versions represents the transactions that are “in flight” and FDB is working to commit but hasn’t fully durably committed yet.
One could just reply with the commit version when asked for a read version, but that would mandate waiting for every pipelined commit to be finished and applied on storage servers before a read could be done, which would be higher than necessary latency.
Your chart looks mostly correct, with a few corrections:
In bullet point 2, the claim is made that the cluster controller is one of the coordinators. This isn’t actually true; the cluster controller is elected from members of the cluster, which usually includes the coordinators but it doesn’t have to.*
Step 6 (getting a read version) and step 7 precede step 5. (In other words, it’s: (4) client gets read version from proxies, (5) proxies talk to each other to determine read version and send it to the client, (6) client sends transaction information to the proxies, (7) the proxies get a commit version from the master.)
In step 7 (in the labeling in the chart), I don’t think the read time stamp or the write time stamp are included in the message being passed. Or if they are, they aren’t important for the flow. All that’s important are (1) the read version, (2) the read conflict ranges, (3) the write conflict ranges, and (4) the list of mutations. Unlike a few other distributed databases, FoundationDB doesn’t rely on timestamps because of not wanting to have to deal with clock (a)synchronization issues.
In step 8, the read time stamp and write time stamp are again omitted. The read and commit versions are both sent along, as are the set of read and write conflict ranges. (The resolver uses the read version and the read conflict ranges to make sure there haven’t been any modifications to any data that the transaction relied upon. It then uses the write conflict range set to update its data structure. It uses the commit version to order incoming requests, and it updates its internal data structure of mutations with entries using the new commit version when it’s done.)
For step 10, I wouldn’t quite say “execute” is the correct term. A few of our atomic operations require our transactions to do things that are a little bit “execution”-y, e.g., if you include an ADD mutation in your commit, then the storage server will (1) read the key from disk, (2) add your value, and (3) write it out to disk, which kind of feels like “executing” the query. But it doesn’t, e.g., execute a user-created script. I feel like “apply mutations” is a better term, and that doesn’t happen on the tlogs–it just writes them down and makes them durable.
I don’t really know how the chart would communicate this, but steps 11 and 12 actually happen in parallel, i.e., we communicate success to the user after the transaction has been made durable by all of the transaction logs, but the storage servers start picking up mutations from the transaction logs before that.**
I guess I’ll also say that this chart leaves out the read path, i.e., everything the client does between getting a read version and sending a transaction out to be committed. That’s probably fine to omit; I just wanted to make sure it was known so that we’re all clear that there’s more to the FDB-to-client interaction flow than is diagramed here.
The response from @alexmiller goes into more detail about what the difference between, in some sense, the “cluster’s” current read version (to the extent that it’s a thing) and the “cluster’s” given commit version. For any given transaction, the semantics are slightly different. In particular, the “read version” is a version the client gets from the proxies at the beginning of the transaction. Whenever the client asks the storage server for the value associated with some key (or range of keys), it gets the keys as they existed at that version. That is to say, when it reads those keys, it won’t see the effects of any transaction that was committed with a version greater than the transaction’s commit version. Then when it’s ready to commit, it needs to get a commit version, which is essentially the version at which point this transaction will become visible, i.e., when other transactions will start seeing its effects.
I tried to go into more detail about the read version’s significance when it gets sent to the resolver in my post above, but if that’s unclear, let me know.
* It also is only somewhat arbitrary. It will choose a “stateless” class process if one is available, for example. In newer versions of FoundationDB, it also takes the locality information into account so that you choose a cluster controller in a favorable datacenter if you run a multi-datacenter cluster. But this is somewhat advanced. Knowing that it will pick a process in the cluster is probably sufficient for most users to know starting out.
** Actually, it might be before they’re durable. But we never make a transaction’s mutations durable on a storage server until it is durable on the transaction logs, which means that we can always roll back anything that hasn’t been committed yet but wiping in-memory state on the storage servers.
Thanks Alex and Alec for your patience and detailed explanation to all of us on this technical topic. It is helping us a lot.
I would like to get some more clarification on the reply.
Just need a clarification on point 4 and 5 that you mentioned. How come client gets a read version in point from Proxies and then in point 5 they talk to each to determine a read version. Isn’t it to be reverse? Like Proxies talk to each other and determine a read version and in point 5, Client gets this determined read version? May be I am wrong here, please let me know if my understanding is wrong.
Since FDB supports Serializable Optimistic Snapshot MVCC, and as per optimistic approach, shouldn’t client send both Read time stamp and write time stamp. I understand your point of clock synchronization between the systems. Just waned to clear my doubt.
Can you please brief me a bit little more on this. I will add that to this diagram so that it can be a little more detailed than what it is now.
Hm, maybe my newly proposed step 4 should be “client requests a read version from the proxies”. (It won’t receive it back or “get” it until the proxies have talked to each other, if that was the point of confusion.)
FoundationDB doesn’t use timestamps for any part of the multi-version concurrency scheme. In particular, it only ever uses the read version and commit version. Now, these are isomorphic in many ways to “read time stamp” and “write time stamp” in other optimistic MVCC schemes (in fact, you could imagine replacing read versions and write versions with timestamps if you wanted to assuming you could somehow synchronize clocks). In fact, they are so closely related, that if you look around the forums, you might see people referring to the “read timestamp” of a transaction when what the really mean is “read version”. (And the same with commit version and write timestamp.) In a context like that, I’d hesitate to say it’s “wrong” to use the word timestamp (as really just code for “version”), but in any context that is already using the term “version”, I think also throwing around timestamp isn’t quite right.
Sure. Once the client already has a read version, in order to read a key or key range from the database, it needs to know which storage servers contain the “shard” (i.e., a contiguous range of key-value pairs that is a primitive of data distribution) the key is in. The clients keep a local cache of this information, but if the shard location isn’t in the local cache, the client will ask for the location from the proxy (I think–I’m somewhat rusty at this part). After determining which storage servers contain the data, the client will directly query them to perform the read.
Thanks Alex for the detailed reply.
I tried incorporating the comments. If you see any issues with this, please let us all know.
Now I understand why Read transaction in “Technical overview of the database” has a direct communication with the storage. It basically is step # 5 in the below diagram.