I am trying to understand the FoundationDB distributed transaction protocol, but unfortunately I cannot find any document or paper explaining its details. The only reference seems to be this video by Evan, which does not cover many of the details:
I have questions, especially about the transaction logs and proxy:
Can the transaction log be sharded? If yes, when we have more than one shard, does a proxy run 2PC to append the write-set of a transaction to them? Otherwise, how is atomicity guaranteed, e.g. when a proxy crashes after writing to some transaction log shards but before writing to the rest?
Shouldn’t only committed transactions be appended to the transaction logs? If yes, then why, when we have transaction logs whose highest versions are 400, 410, and 420, does he say that only 400 could have been returned as successful to the client? So 410 and 420 are appended to the transaction logs but reported as failed to the client?
And I have other questions like this; I have encountered other people with the same questions. The documentation provided on the website is very basic and seems to target a general audience.
There is some internal sharding that happens when you have a sufficient number of logs.
It doesn’t use 2PC. Commits are versioned with monotonically increasing versions and applied to each log in order, and I believe the way recovery works when a commit is in flight and written to only a subset of the logs is that we recover to the most recent version present on all (or enough?) logs. The proxy won’t tell a client that a commit is successful until it’s written to all the logs, so any later versions have not been reported durable and can legally be discarded.
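To make that concrete, here is a minimal sketch of that recovery rule, assuming each surviving log reports the highest version it has made durable (Python for illustration only; the names and the `required_copies` parameter are hypothetical, not FoundationDB's actual interface):

```python
# Hypothetical sketch: the recoverable version is the highest version that is
# durable on every log (or, with replication, on "enough" of them).
# Names and the required_copies knob are illustrative, not FoundationDB's API.

def recovery_version(durable_versions, required_copies=None):
    """durable_versions: highest durable version reported by each surviving log."""
    if required_copies is None or required_copies >= len(durable_versions):
        # The commit must be on every log, so only versions <= min() are safe.
        return min(durable_versions)
    # Otherwise take the highest version that at least `required_copies`
    # of the logs have reached.
    return sorted(durable_versions, reverse=True)[required_copies - 1]

# Logs have made versions 400, 410, and 420 durable. Only 400 is on all of
# them, so only 400 can have been acknowledged to a client; 410 and 420 were
# never reported durable and can legally be discarded during recovery.
print(recovery_version([400, 410, 420]))  # -> 400
```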
Transactions will only be appended to the logs if we intend to commit them, but the commit isn’t complete until the proxy replies to the client. In other words, the commit could still fail after writing to a log if, for example, processes die and we have to do a recovery. As discussed in your previous question, that may mean that you’ve started writing commits with higher versions but not finished, so they could be present on some logs and not others.
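As a rough sketch of that commit path (illustrative pseudocode, not the real proxy/TLog interface), the proxy pushes the versioned batch to every log and only reports success once all of them have acknowledged it:

```python
# Rough sketch of the commit path described above (illustrative only).

class CommitUnknownResult(Exception):
    """The commit may or may not survive recovery; the client can't tell."""

class FakeTLog:
    def __init__(self):
        self.durable = {}               # version -> mutations
    def append(self, version, mutations):
        self.durable[version] = mutations
        return True                     # acknowledge durability

def commit_batch(tlogs, version, mutations):
    # Push the versioned batch to every transaction log.
    acks = [log.append(version, mutations) for log in tlogs]
    # Only after *all* logs acknowledge does the proxy reply "committed".
    # If the proxy or some logs die before that point, recovery may discard
    # this version, and the client never saw a success for it.
    if all(acks):
        return "committed"
    raise CommitUnknownResult(version)

print(commit_batch([FakeTLog(), FakeTLog()], 410, {b"k": b"v"}))  # committed
```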
This is certainly true, and there’s a lot we could do to improve this. There is at least some existing documentation of these details, including a more technical version of my rough description, that can be seen in these pages (neither of which has made it into our website documentation):
I am trying to understand the details of the recovery process from the document you provided. I have a question regarding the KnownCommittedVersion.
The example says:
“Consider an old generation with three TLogs: A, B, C . Their durable versions are 100, 110, 120 , respectively, and their knownCommittedVersion are at 80, 90, 95 , respectively.”
Is it for a single-shard transaction log, i.e. we have only one shard with three replicas A, B, C? If yes, then when the minimum of the durable versions is 100, that means all replicas have all versions <= 100. So why do we need knownCommittedVersion? Why do we have to copy versions 96 to 100 to the new transaction logs?
The old logs have 0-100, and the new logs have > 100. Storage servers first finish up the old logs, and then start consuming from the new logs.
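If it helps, the version arithmetic in the quoted example works out like this (a toy sketch based on my reading of the recovery document; the variable names are mine):

```python
# A small sketch of the version arithmetic in the quoted example
# (illustrative only; based on my reading of the recovery document).

durable_versions         = {"A": 100, "B": 110, "C": 120}
known_committed_versions = {"A": 80,  "B": 90,  "C": 95}

# Recovery can only keep what is durable on every log of the old generation.
recovery_version = min(durable_versions.values())           # 100

# Everything at or below the highest knownCommittedVersion is already known
# to be fully replicated; the span above it, up to the recovery version, may
# exist on only a subset of the old logs, so it is copied to the new logs.
max_kcv    = max(known_committed_versions.values())          # 95
copy_range = (max_kcv + 1, recovery_version)                 # (96, 100)

print(recovery_version, copy_range)   # 100 (96, 100)
```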
If this example is not single-shard, could you please tell me how many shards we have and how many replicas each has?
I also have a more fundamental question: when a transaction log crashes, is it possible that storage servers that used to read from it now start consuming from a different old transaction log?
For example, in the example above, suppose transaction log A permanently fails. Suppose there was a storage server S that was reading from A, and it was so far behind that the last offset it read was 20. The new transaction logs start from 96, i.e. they don’t have any offset less than 96, and A is permanently gone. So from which transaction log will server S get offsets 21-96? I guess from the old B or C. Is that true?
It did get added to the source for the website, so when 6.3 is pushed, I think it should go live then. For now, one will need to take a look at the source to see the more detailed architecture information.
This part was a little confusing to me when I was trying to understand the transaction subsystem.
While not every transaction gets appended to the logs, every transaction batch does get appended. When it’s the turn of a proxy to submit the next batch, that proxy effectively has a cluster-wide lock; nothing else can commit before it.
Same thing happens with resolver batches.
This has implications for the scalability of the cluster. If we make the commit batches larger to accommodate more proxies, the batch submission rate can be kept constant, but the variance of batch tardiness will increase, causing availability dips.
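A toy sketch of that ordering constraint (not how the real commit pipeline is implemented): batches can arrive out of order from different proxies, but they are appended strictly in version order, so the proxy whose batch is next effectively blocks everything behind it:

```python
# Toy illustration of the ordering constraint described above (not real
# FoundationDB code): commit batches carry monotonically increasing version
# ranges, and a batch cannot be logged until all earlier batches have been.

import heapq

class SerialBatchLogger:
    def __init__(self):
        self.next_expected = 1
        self.pending = []                 # min-heap of (first_version, batch)

    def submit(self, first_version, batch):
        # Batches may arrive from different proxies out of order...
        heapq.heappush(self.pending, (first_version, batch))
        logged = []
        # ...but they are appended strictly in version order, so the proxy
        # whose batch is next effectively holds a cluster-wide turn: no
        # later batch can be logged before it.
        while self.pending and self.pending[0][0] == self.next_expected:
            v, b = heapq.heappop(self.pending)
            logged.append((v, b))
            self.next_expected = v + len(b)
        return logged

log = SerialBatchLogger()
print(log.submit(3, ["tx3", "tx4"]))   # [] -- waiting for the batch at version 1
print(log.submit(1, ["tx1", "tx2"]))   # both batches now flush, in order
```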