Technical overview of the database

The proxy will send each resolver its respective key ranges; if any resolver detects a conflict, then the transaction is not committed.

This has the flaw that if only one of the two resolvers detects a conflict, the other resolver will still think the transaction succeeded and may fail future transactions based on its write conflict ranges.

In practice, a well-designed workload will only have a very small percentage of conflicts, so this amplification will not affect performance.

Because of this effect, we generally recommend increasing the number of resolvers only if you are running into a performance bottleneck because of the resolver.
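
To make that concrete, here is a minimal Python sketch of the kind of check a resolver performs, assuming each resolver keeps the recent (last few seconds of) write versions for the keys it owns. The class, function names, and data structures are illustrative only, not FoundationDB's actual code.

# Illustrative sketch of optimistic conflict checking, not FoundationDB's real resolver.
class Resolver:
    def __init__(self):
        self.last_write = {}  # key -> commit version of the last write this resolver accepted

    def check(self, read_version, commit_version, reads, writes):
        # Conflict if any key this transaction read was written after its read version.
        if any(self.last_write.get(k, 0) > read_version for k in reads):
            return False
        # The writes are recorded as soon as this resolver answers "no conflict",
        # even if another resolver (checking a different key range of the same
        # transaction) rejects it -- the false-conflict amplification described above.
        for k in writes:
            self.last_write[k] = commit_version
        return True

# The proxy splits the transaction's conflict ranges by resolver and commits
# only if every resolver answers "no conflict".
def proxy_resolve(resolvers_and_ranges, read_version, commit_version):
    answers = [r.check(read_version, commit_version, reads, writes)
               for r, (reads, writes) in resolvers_and_ranges]
    return all(answers)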

Hi,

I have a query about the above workflow and its failure tolerance. Looking at the transaction commit flow (and I’ve dug through the code a bit too to check this), it looks like transaction commit doesn’t increment the version number through a paxos round, instead just relying on the master/tlogs to coordinate the version number. That seems to offer a notable performance improvement of course, but I’m wondering how it handles failure.

The query I have is - isn’t this unsafe? If a master, an associated proxy and a set of tlog servers gets partitioned by something that causes them to “hang” (as opposed to losing network connectivity), is it possible that when they recover a client could commit a write through this “old” set, and not the newly elected set of master/tlogs (which the coordinators would handle re-recruiting), thereby breaking ACID?

I assume this case is very much handled! But I’m not sure I understand how it’s handled, and I’d love to understand better.

The short version is that any later replication epoch (as ordered by full consensus on the coordinated state) must recover from, and lock against further writes from this epoch through, a quorum of logs guaranteed to overlap with any quorum of logs that can complete a commit in this epoch. So any commit to the “old” transaction subsystem will be unable to complete, and any logs that do accept it will already have been removed permanently from the official chain of custody of version history.

This comment on HN, though answering a slightly different objection, gives some insight into how the happy case synchronous replication and consensus work together to provide ACID.

Ah, that’s great! Thanks. I had missed “the new set of tlogs has to quorum overlap with the old one”.

I’m still not 100% clear on how this works fully. What I’m imagining is a worked example talking through what happens in the bad case:

t0: cluster is running normally, latest version is 0, epoch is 0
t1: proxy A, master B, and tlogs C, D, E become partitioned from the coordinators (F,G,H)
t2: client X starts a new transaction (T1) on proxy A
t3: coordinators (F,G,H) elect a new master (J) and set of tlogs (K, L, M). There’s another proxy already running (N)
t4: client Y starts a new transaction (T2) on proxy N, getting the read version from master J
t5: client X commits their transaction (using master B)
t6: client Y commits their transaction (using master J)
t7: now state is diverged

From my understanding, I think that t3 is prevented - the coordinators can’t elect a new set of tlogs, because there’s no overlap with a quorum of tlogs from the previous set. So client X can still commit its transaction, but Y wouldn’t be able to (assuming it’s on the “coordinators” side of the partition).

Is that accurate?

I think this means that the master and tlogs can always make progress even if they can’t talk to the coordinators. Is that correct too? I assume at some point maybe they fail after not being able to talk to the coordinators, but haven’t seen that obviously in the code yet (but it’s a big and daunting codebase).

You are on the right track. Recovery won’t even try to write new tlogs into the coordinated state until it’s successfully locked the old ones. And yes, if all coordinators are partitioned away you might still be able to commit transactions for a while, though I don’t think the system offers any particular guarantees about this.

The recovery process starts here. The following comment from that function might be illuminating, though it looks to be a little out of date in its references to names in the source:

// Multiple masters prevent conflicts between themselves via CoordinatedState (self->cstate)
//  1. If SetMaster succeeds, then by CS's contract, these "new" Tlogs are the immediate
//     successors of the "old" ones we are replacing
//  2. logSystem->recoverAndEndEpoch ensured that a co-quorum of the "old" tLogs were stopped at
//     versions <= self->lastEpochEnd, so no versions > self->lastEpochEnd could be (fully) committed to them.
//  3. No other master will attempt to commit anything to our "new" Tlogs
//     because it didn't recruit them
//  4. Therefore, no full commit can come between self->lastEpochEnd and the first commit
//     we made to the new Tlogs (self->recoveryTransactionVersion), and only our own semi-commits can come between our
//     first commit and the next new TLogs

Just to be pedantically clear: you must lock a set of old tlogs such that it overlaps with any quorum of old tlogs that could be used to commit before starting a new set of tlogs. The new tlogs and old tlogs have no quorum interactions.
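
As a toy illustration of that intersection property (with made-up numbers; the real commit and locking rules are more involved than "any N - F of N logs"): if a commit could have completed on any N - F of the old tlogs, then locking any F + 1 of them is enough to intersect every possible commit quorum.

# Purely an illustration of quorum intersection, not FDB's recovery logic.
from itertools import combinations

def overlaps_every_commit_quorum(locked, n_logs, f):
    """True if `locked` intersects every possible commit quorum of size n_logs - f."""
    for quorum in combinations(range(n_logs), n_logs - f):
        if not locked & set(quorum):
            return False
    return True

# With 5 old tlogs and F = 2, locking any 3 (that is, F + 1) of them suffices:
print(overlaps_every_commit_quorum({0, 1, 2}, n_logs=5, f=2))  # True
print(overlaps_every_commit_quorum({0, 1}, n_logs=5, f=2))     # False: {2, 3, 4} could still commit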

I have some questions about the resolvers. Assume keys are sharded among resolvers. The previous comment says that the resolver holds the last 5 seconds of committed writes in memory. How do we ensure resolvers get updated once the transaction writes are persisted to the distributed logs?

For example, transaction1 updates 2 keys, keyA and keyB. Assume there are 2 resolvers, with keyA owned by resolver1 and keyB owned by resolver2. Transaction1 passes the resolver check and gets persisted to the distributed logs. Transaction2 updates keyA and keyB again, and actually conflicts with transaction1. How do we ensure that resolver1 and resolver2 check transaction2 against the values committed by transaction1?

How does the proxy persist the transaction to the distributed logs? Is there a leader in the logging subsystem, so the proxy sends the transaction to the leader? Or does the proxy write directly to the log servers using something like two-phase commit?

How do proxies coordinate with each other? Following the same example above, where both transaction1 and transaction2 update the same keyA and keyB: is it possible that transaction1 is sent to proxy1 while transaction2 is sent to proxy2, or will both be routed to the same proxy server?
For the case where transaction1 is sent to proxy1 and transaction2 to proxy2, assume proxy1 passes the resolver check and starts persisting to the logs. How do we ensure transaction2 on proxy2 waits until transaction1 completes, and is then rejected by the resolver?

I could be wrong about this, as my understanding is purely based on reading the descriptions in these forums: if a resolver determines that a transaction passes the conflict-resolution checks (based on the transaction's read/write conflict ranges sent to it), it will just “assume” that all of the transaction’s write-conflict keys sent to it were committed (without any regard for whether the transaction eventually succeeded at the proxy or not).

I observe this based on the comment made here

Proxies send the transaction key-values to all the required tlogs and wait for all of those to acknowledge before responding back to the client.

What I am not sure of is: who keeps track of which key ranges are handled by which tlogs? And what happens if some of the tlog servers fail to commit the transaction (say one out of the two tlogs failed to commit the data)? How do storage servers ignore the data from the tlogs that were able to commit it?

Resolvers would serialize the transactions.

Even if different proxies receive transactions t1 and t2 concurrently, operating on the same two keys, each proxy would contact a common resolver for a given key range. This ensures that at most one of the two transactions succeeds: if t1 (or t2) is resolved by both resolvers before t2 (or t1), the later one is rejected against the earlier one's writes; and if resolver1 resolves keyA from t1 while resolver2 concurrently resolves keyB from t2, then both transactions should get rejected when resolution is attempted on the second key from each transaction.

But please wait for someone from the FDB team to verify or correct my observations.

Thanks for the reply! If the resolver assumes the transactions are committed, the resolver would cache the results in memory. Assume transaction1 is handled by client1. If the proxy crashes before the transaction is persisted to the distributed logs, client1 could retry the transaction. Would the resolver pass it because it sees the same transaction id (time)?
If client1 also crashes, transaction1 will never get persisted to the distributed logs, and the resolver will make a wrong decision because it assumes transaction1 is committed. So is it possible that some decisions are wrong in the 5s window?

The messages that are sent from the proxy to the transaction logs are a list of (tags, mutation), where a ‘tag’ is a short identifier for a particular storage server. Each storage server is assigned one transaction log as the place that it can receive all the data that it requires, and then additional copies of the data, those required to fulfill the requested replication factor, are spread randomly across the other transaction logs as a form of load balancing.
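
A rough sketch of that routing, with a made-up assignment policy and numbers standing in for the real tag-to-tlog logic:

import random

N_TLOGS = 6
TLOG_REPLICATION = 3

# Each storage server (identified by its tag) has one fixed tlog it pulls all of
# its data from; this modulo assignment is just a placeholder for the real policy.
def primary_tlog_for_tag(tag):
    return tag % N_TLOGS

def route_mutation(tags, mutation):
    """Return the set of tlogs that must durably store this (tags, mutation) pair."""
    tlogs = {primary_tlog_for_tag(tag) for tag in tags}
    # Extra copies needed to reach the tlog replication factor are spread across
    # the remaining tlogs as a form of load balancing.
    remaining = [t for t in range(N_TLOGS) if t not in tlogs]
    while len(tlogs) < TLOG_REPLICATION and remaining:
        tlogs.add(remaining.pop(random.randrange(len(remaining))))
    return tlogs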

For a tlog to fail to commit, it would need to crash, or throw an io_error, which would lead to tlog death as well. If a transaction log dies, then the entire transaction subsystem gets destroyed and replaced.

The transactions are serialized via the commit timestamp received from the master, and resolvers use that timestamp to know which messages from different proxies to process first. But the major part here is the resolver logic, which you described correctly.
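
A tiny sketch of that ordering, reusing the same kind of in-memory conflict map as in the earlier resolver sketch (illustrative only):

# Resolvers apply conflict checks in commit-version order, so whichever of two
# concurrent transactions is assigned the later commit version is checked against
# the earlier one's writes and rejected.
recent_writes = {}  # key -> commit version of the last accepted write

def resolve(read_version, commit_version, keys):
    if any(recent_writes.get(k, 0) > read_version for k in keys):
        return "rejected"
    for k in keys:
        recent_writes[k] = commit_version
    return "accepted"

print(resolve(read_version=100, commit_version=110, keys=["keyA", "keyB"]))  # accepted (transaction1)
print(resolve(read_version=100, commit_version=120, keys=["keyA", "keyB"]))  # rejected (transaction2 read at 100, but keyA/keyB were written at 110)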

A client would have received commit_result_unknown, and then retried the transaction. It’s completely permissible for FDB to commit both the first attempt, and the second retry, as commit_result_unknown means exactly that, your commit may or may not have happened. This is why it’s strongly recommended that transactions should be idempotent, so that they handle commit_result_unknown correctly.
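
For example, one common pattern is to make the transaction idempotent by recording a client-chosen id inside the same transaction and checking it on retry. Here is a sketch using the Python bindings; the API version number and the 'txn_ids/'/'accounts/' key layout are just assumptions for this example:

import uuid

import fdb
fdb.api_version(630)  # assumes a 6.3-compatible client is installed
db = fdb.open()

@fdb.transactional
def add_once(tr, account_key, amount, txn_id):
    # If an earlier attempt already committed, its marker key is visible and we do
    # nothing, so a retry after commit_result_unknown cannot apply the change twice.
    if tr[b'txn_ids/' + txn_id].present():
        return
    current = tr[account_key]
    balance = int(current) if current.present() else 0
    tr[account_key] = str(balance + amount).encode()
    tr[b'txn_ids/' + txn_id] = b''  # marker committed atomically with the balance change

# The client picks the id once, so the retry loop in @fdb.transactional is safe.
add_once(db, b'accounts/alice', 10, uuid.uuid4().bytes)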

In the case of any failures in {proxy, resolver, tlog}, the entire subsystem is torn down and recreated. Resolvers don’t have any durable state, so after a transaction log fails, they won’t know anything about conflicts in the re-recovered state. Recoveries also “fast forward” time by 90s, which would kill any in-progress client transactions as well.

Thanks Alex! So in case any failure happens at a proxy or resolver, the entire transaction processing subsystem will be restarted. Resolvers will need to load the last 5s of transactions from the transaction logging subsystem. All writes will have to wait until the transaction processing subsystem comes back.

How does the resolver load the last transactions from the logging system? One key would be logged to multiple nodes, such as 3, to tolerate a single node/disk failure. For example, say the logging system has 6 nodes and key1 is written to n1, n2 and n3. Will the proxy directly send the writes to these 3 nodes? If the write succeeds at n1 and fails to reach n2 and n3 (such as a network timeout), the transaction subsystem will be restarted as the proxy fails to persist the write. Will the logging subsystem be restarted? If not, how does the resolver know whether it should load the key1 change at n1? Is there a primary node for one key?

It doesn’t. All ongoing transactions by clients will be aborted with transaction_too_old if a recovery happens before they commit. Resolvers only need to conflict check transactions started after the recovery, which is exactly the same as transactions started as of when the resolver was created.

If FoundationDB tried to allow recoveries to not cause client transaction aborts, then something like what you proposed would need to be implemented. Given that recoveries are a significantly rarer event than committing, it hasn’t seemed like the right trade off to prioritize.

Thanks for the details Alex! I wanted to reconfirm something that @junius mentioned above:

If the write succeeds at n1 and fails to reach n2 and n3 (such as a network timeout), the transaction subsystem will be restarted as the proxy fails to persist the write. Will the logging subsystem be restarted?

Is the above statement correct - that is, does the whole transaction sub-system abort and re-init on a single mutation failure? Does that make the transaction sub-system fragile to temporary network errors in practice? What is the downtime cost of tearing down and recreating the transaction subsystem?

Got it. Thanks Alex! If some transactions are persisted to the logs, client reads will wait until the storage node sees the latest transaction version from the logs. Then the client will submit transactions to the proxy.

One more question: when the proxy writes the transaction to the logs, the whole transaction will not be split, i.e. all changed keys in the transaction are written to the same set of nodes, right? If the transaction succeeds on node1 but fails on the other 2 nodes, how do we handle it after the logging subsystem restarts?

Pretty much. The only way for a “single mutation failure” to occur would be for a transaction log to crash or become network partitioned – a failure either way. Random packet drops, connection breaks, or delayed packets will mostly just introduce additional latency, but not generally a long enough delay that the failure monitor will kick in. If you’re running on a network that has frequent O(seconds) stutters in its ability to send packets, then yes, the transaction subsystem will seem pretty fragile. Thankfully, that’s not the sort of environment most people have.

Recoveries are typically done in a few hundred milliseconds. If a process crashes and restarts, then one can immediately recognize that a process failed and a recovery is needed.
For power loss or network partitions, nodes are considered failed if they don’t heartbeat once a second, so you’ll probably see ~1.5s commit latency spikes whenever there’s a recovery. If you already have a read version, then storage servers are likely to continue to happily serve reads, as they’re not involved in the whole recovery process.
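
As a toy picture of that failure-detection timing (the one-second figure is just the estimate above; the real failure monitor in FoundationDB is not this code):

import time

HEARTBEAT_TIMEOUT = 1.0  # seconds without a heartbeat before a process is considered failed
last_heartbeat = {}      # process id -> time the last heartbeat was received

def on_heartbeat(process_id):
    last_heartbeat[process_id] = time.monotonic()

def failed_processes():
    now = time.monotonic()
    return [p for p, t in last_heartbeat.items() if now - t > HEARTBEAT_TIMEOUT]

# Any failed transaction-subsystem process (proxy, resolver, tlog, master) triggers
# a recovery, which typically completes in a few hundred milliseconds.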

Correct.

A recovery version is chosen, and the database restores itself to a consistent snapshot at that version. We have to choose a recovery version that’s greater than or equal to any commit version for which we might have replied “commit succeeded” to a client. In this case, it’d be correct to recover to either a database with node1’s acceptance of the transaction, or node2/node3’s state of not having received or made durable the transaction. As it was only persisted on one transaction log, a proxy wouldn’t have informed a client that the transaction was committed.
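
A toy illustration of why that is safe, using the node1/node2/node3 example from above (this is not the actual recovery-version algorithm):

# A proxy replies "committed" only after every tlog it wrote to has made the
# version durable, so any version durable on just a subset of those tlogs can
# never have been acknowledged to a client and may safely be discarded.
def may_have_been_acked(version, durable_versions, required_tlogs):
    return all(durable_versions[t] >= version for t in required_tlogs)

durable = {'node1': 120, 'node2': 110, 'node3': 110}
print(may_have_been_acked(120, durable, ['node1', 'node2', 'node3']))  # False: safe to recover to 110
print(may_have_been_acked(110, durable, ['node1', 'node2', 'node3']))  # True: the recovery version must be >= 110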

To reiterate, there’s no allowance in the system for a transaction log to reject a commit. It either accepts a commit, makes it durable, and sends an acknowledgement, or it dies. If it dies, then we go find someone else that’s alive and working to be a transaction log.
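
A minimal sketch of that contract, with a placeholder network call and a timeout standing in for failure detection:

import asyncio

async def tlog_append(tlog, version, mutations):
    ...  # placeholder: send the commit to one tlog and await its durable acknowledgement

async def commit_version(tlogs, version, mutations):
    # Every tlog must acknowledge; there is no "reject" reply in the protocol.
    try:
        await asyncio.wait_for(
            asyncio.gather(*(tlog_append(t, version, mutations) for t in tlogs)),
            timeout=1.0)
    except asyncio.TimeoutError:
        # A tlog that cannot acknowledge is treated as dead: the transaction
        # subsystem is torn down and a recovery recruits a replacement.
        raise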

Thanks Alex! So the logging subsystem does have a recovery mechanism to ensure consistency. This is great.

That explains a lot!

One last thing: is there a knob that could be used to control the tolerance to missed/delayed heartbeats, to accommodate environments where packet delays could be more common? Are there any unwanted side effects of increasing the tolerance to missed/delayed heartbeats, other than higher transaction latency when a node fails?

For anyone who is interested, my presentation at the FoundationDB summit gives a pretty good overview of how FoundationDB works
