It doesn’t. Any transactions clients have in flight will be aborted with transaction_too_old if a recovery happens before they commit. Resolvers only need to conflict-check transactions started after the recovery, which is exactly the set of transactions started as of when the resolver was created.
If FoundationDB tried to allow recoveries without aborting client transactions, then something like what you proposed would need to be implemented. Given that recoveries are a significantly rarer event than commits, it hasn’t seemed like the right trade-off to prioritize.
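From the client’s point of view, the abort surfaces as a retryable error, and the usual retry loop in the bindings simply restarts the transaction against the post-recovery resolvers. A toy sketch of that behavior (pure Python with hypothetical names; the real bindings implement this inside their transactional retry loop, e.g. `@fdb.transactional` in the Python bindings):

```python
class TransactionTooOld(Exception):
    """Stand-in for FDB error 1007 (transaction_too_old)."""

class ToyDatabase:
    """Hypothetical database whose first commit attempt straddles a recovery."""
    def __init__(self):
        self.recovering = True

    def commit(self, writes):
        if self.recovering:
            self.recovering = False    # recovery finishes; the next attempt is fresh
            raise TransactionTooOld()  # the in-flight transaction is aborted
        return dict(writes)            # committed state

def run_transactionally(db, writes, max_retries=5):
    """Retry loop: restart the transaction whenever a recovery aborts it."""
    for _ in range(max_retries):
        try:
            return db.commit(writes)
        except TransactionTooOld:
            continue  # re-submit; post-recovery resolvers have a clean slate
    raise RuntimeError("retries exhausted")
```

So from application code the recovery mostly shows up as one extra round through the retry loop, i.e. a latency blip rather than a surfaced error.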
Thanks for the details, Alex! I wanted to reconfirm something that @junius mentioned above:
If the write succeeds at n1 but fails to reach n2 and n3 (e.g. a network timeout), the transaction subsystem will be restarted because the proxy fails to persist the write. Will the logging subsystem be restarted?
Is the above statement correct — that is, does the whole transaction subsystem abort and re-initialize on a single mutation failure? Does that make the transaction subsystem fragile to temporary network errors, in practice? What is the downtime cost of tearing down and recreating the transaction subsystem?
Got it. Thanks, Alex! If some transactions are persisted to the logs, a client read will wait until the storage server sees the latest transaction version from the logs. Then the client will submit transactions to the proxy.
One more question: when the proxy writes a transaction to the logs, the transaction will not be split, right? That is, all changed keys in the transaction are written to the same set of nodes. If the transaction succeeds on node1 but fails on the other two nodes, how do we handle it after the logging subsystem restarts?
Pretty much. The only way for a “single mutation failure” to occur would be for a transaction log to crash or become network partitioned — a failure either way. Random packet drops, connection breaks, or delayed packets will mostly just introduce additional latency, and not generally a long enough delay that the failure monitor will kick in. If you’re running on a network that has frequent O(seconds) stutters in its ability to deliver packets, then yes, the transaction subsystem will seem pretty fragile. Thankfully, that’s not the sort of environment most people have.
Recoveries typically complete in a few hundred milliseconds. If a process crashes and restarts, the failure is recognized immediately, so a recovery can begin right away.
For power loss or network partitions, nodes are considered failed if they don’t send a heartbeat within one second, so you’ll probably see ~1.5s commit latency spikes whenever there’s such a recovery. If you already have a read version, storage servers are likely to continue happily serving reads, as they’re not involved in the recovery process at all.
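That ~1.5s figure is roughly the detection window (up to one missed heartbeat interval) plus the recovery itself. The detection side can be sketched as a toy failure monitor (pure Python; the one-second timeout mirrors the heartbeat interval mentioned above, and none of the names here correspond to actual FDB knobs):

```python
HEARTBEAT_TIMEOUT_S = 1.0  # assumed: a node is considered failed after this much silence

class ToyFailureMonitor:
    """Tracks when each node was last heard from and flags silent nodes as failed."""
    def __init__(self, timeout=HEARTBEAT_TIMEOUT_S):
        self.timeout = timeout
        self.last_heartbeat = {}

    def heartbeat(self, node, now):
        self.last_heartbeat[node] = now

    def failed_nodes(self, now):
        # Any node silent for longer than the timeout is declared failed,
        # which is what triggers a recovery of the transaction subsystem.
        return {n for n, t in self.last_heartbeat.items() if now - t > self.timeout}

monitor = ToyFailureMonitor()
monitor.heartbeat("log1", now=0.0)
monitor.heartbeat("log2", now=0.0)
monitor.heartbeat("log1", now=0.9)  # log1 keeps heartbeating; log2 goes silent
# At t=1.5s, log2 has been silent past the timeout and is flagged failed.
```

This also shows why a partitioned node costs roughly a second of detection before the few-hundred-millisecond recovery even starts, while a crashed-and-restarted process skips the detection window entirely.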
A recovery version is chosen, and the database restores itself to a consistent snapshot at that version. We have to choose a recovery version that’s greater than or equal to any commit version for which we might have replied “commit succeeded” to a client. In this case, it’d be correct to recover either to a database reflecting node1’s acceptance of the transaction, or to node2/node3’s state of never having received or made durable the transaction. As it was only persisted on one transaction log, a proxy wouldn’t have told a client the transaction was committed.
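The safety argument can be modeled simply: a commit could only have been acknowledged to a client if it was durable on every required log, so any recovery version at or above the highest such version is safe. A simplified sketch (pure Python; real recovery also involves log epochs and quorum details that this deliberately ignores):

```python
def max_acknowledged_version(durable_versions, replication_factor):
    """
    durable_versions: each log's highest durable commit version, e.g. {"node1": 107, ...}

    A commit at version v can only have been acknowledged to a client if it was
    durable on all `replication_factor` logs, i.e. v is at most the
    replication_factor-th highest per-log durable version.
    """
    ranked = sorted(durable_versions.values(), reverse=True)
    return ranked[replication_factor - 1]

# node1 accepted a transaction at version 107, but node2/node3 only made 105 durable.
durable = {"node1": 107, "node2": 105, "node3": 105}
# With triple replication, nothing past 105 can have been acknowledged to a client,
# so recovering at 105 (discarding node1's copy of 107) is safe — and so is
# recovering at 107, since either outcome is consistent with what clients saw.
```

The key point is that the proxy’s acknowledgement, not any single log’s durability, is what constrains the recovery version.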
To reiterate, there’s no allowance in the system for a transaction log to reject a commit. It either accepts a commit, makes it durable, and sends an acknowledgement, or it dies. If it dies, then we go find someone else that’s alive and working to be a transaction log.
One last thing: is there a knob that could be used to control the tolerance for missed/delayed heartbeats, to accommodate environments where packet delays are more common? Are there any unwanted side effects of increasing that tolerance, other than higher transaction latency when a node fails?