FoundationDB

Technical overview of the database


#1

:wave:

Is it possible to find some more information about the database internals which goes deeper into how it works than the technical overview and if so, where?

For example, the transaction processing page says that the database “uses a sophisticated data-parallel and multithreaded algorithm to optimize conflict-checking” but leaves the reader wondering what it actually means. Is it in any way similar to SSI? And, a bit off-topic but, does the database somehow rely on NTP servers?

Then the fault tolerance page talks about “teams” and how using them the “chance of data unavailability is reduced to about 0.5%” … But leaves the already buffled reader wondering again: what these teams actually are, how they work, and how the new probability of “0.5%” (is it really 0.005?) is calculated.

TL;DR docs are great, but could they go deeper into the nitty-gritty?


Design document of internals & storage?
#2

I don’t work on FDB, but https://news.ycombinator.com/item?id=16877586 has a summary of the internals that may be more helpful than the documentation.


About transaction
(Evan Tschannen) #3


Bandwidth consumption from one fdb client to fdbserver process
Understanding inter communication
(Evan Tschannen) #4

Besides diagrams I posted above, I added some documentation that just has not been built yet.


Understanding inter communication
(David Scherer) #5

The calculation regarding teams is just: 450/(40 choose 4) = .0049. There are 40 choose 4 sets of 4 machines that could be failed. This will only lead to data loss if the set is one of the 450 teams of 4. Thus the probability above.

More generally, teams are a compromise between the random distribution model, where losing more than the fault tolerance level machines essentially guarantees data loss, and the “replica set” model where each machine is part of a single replica set, the probability of having any data loss from a multiple failure is minimized, but even a single failure is very slow to recover from because only a small number of machines have data from the failed replica set.

If Apple wanted to put my old architecture talk on YouTube I wouldn’t mind, though it is probably a bit out of date now.


(Alex Miller) #6

Copysets: Reducing the Frequency of Data Loss in Cloud Storage presents the same overall idea, and would be worth the read for a deeper understanding.


#7

A transaction conflicts if it reads a key that has been written between the transaction’s read version and commit version. The resolver does this by holding the last 5 seconds of committed writes in memory, and comparing a new transaction’s reads against this set of commits.

What happens when there are multiple resolvers handling the sharded key-space and keys involved in tx are found to be owned by multiple resolvers? How do resolvers agree on that decision?


(Evan Tschannen) #8

The proxy will send each resolver their respective key ranges, if either resolver detects a conflict then the transaction is not committed.

This has the flaw that if only one of the two resolvers detects a conflict, the other resolver will still think the transaction succeeded and may failure future transactions based on its write conflict ranges.

In practice, a well designed workload will only have a very small percentage of conflicts, so this amplification will not effect performance.

Because of this effect we generally only recommend increasing the number of resolvers if you are running into a performance bottleneck because of the resolver.


(Tom Crayford) #9

Hi,

I have a query about the above workflow and it’s failure tolerance. Looking at the transaction commit flow (and I’ve dug through the code a bit too to check this), it looks like transaction commit doesn’t increment the version number through a paxos round, instead just relying on the master/tlogs to coordinate the version number. That seems to offer a notable performance improvement of course, but I’m wondering how it handles failure.

The query I have is - isn’t this unsafe? If a master, an associated proxy and a set of tlog servers gets partitioned by something that causes them to “hang” (as apposed to losing network connectivity), is it possible that when they recover a client could commit a write through this “old” set, and not the newly elected set of master/tlogs (which the coordinators would handle re-recruiting), thereby breaking ACID?

I assume this case is very much handled! But I’m not sure I understand how it’s handled, and I’d love to understand better.


(David Scherer) #10

The short version is that any later replication epoch (as ordered by full consensus on the coordinated state) must recover from, and lock against further writes from this epoch through, a quorum of logs guaranteed to overlap with any quorum of logs that can complete a commit in this epoch. So any commit to the “old” transaction subsystem will be unable to complete, and any logs that do accept it will already have been removed permanently from the official chain of custody of version history.

This comment on HN, though answering a slightly different objection, gives some insight into how the happy case synchronous replication and consensus work together to provide ACID.


(Tom Crayford) #11

Ah, that’s great! Thanks. I had missed “the new set of tlogs has to quorum overlap with the old one”.

I’m still not 100% clear on how this works fully. What I’m imagining is a worked example talking through what happens in the bad case:

t0: cluster is running normally, latest version is 0, epoch is 0
t1: proxy A, master B, and tlogs C, D, E become partitioned from the coordinators (F,G,H)
t2: client X starts a new transaction (T1) on proxy A
t3: coordinators (F,G,H) elect a new master (J) and set of tlogs (K, L, M). There’s another proxy already running (N)
t4: client Y starts a new transaction (T2) on proxy N, getting the read version from master J
t5: client X commits their transaction (using master B)
t6: client Y commits their transaction (using master J)
t7: now state is diverged

From my understanding, I think that t3 is prevented - the coordinators can’t elect a new set of tlogs, because there’s no overlap with a quorum of tlogs from the previous set. So client X can still commit it’s transaction, but Y wouldn’t be able to (assuming it’s on the “coordinators” side of the partition).

Is that accurate?

I think this means that the master and tlogs can always make progress even if they can’t talk to the coordinators. Is that correct too? I assume at some point maybe they fail after not being able to talk to the coordinators, but haven’t seen that obviously in the code yet (but it’s a big and daunting codebase).


(David Scherer) #12

You are on the right track. Recovery won’t even try to write new tlogs into the coordinated state until it’s successfully locked the old ones. And yes, if all coordinators are partitioned away you might still be able to commit transactions for a while, though I don’t think the system offers any particular guarantees about this.

The recovery process starts here. The following comment from that function might be illuminating, though it looks to be a little out of date in its references to names in the source:

// Multiple masters prevent conflicts between themselves via CoordinatedState (self->cstate)
//  1. If SetMaster succeeds, then by CS's contract, these "new" Tlogs are the immediate
//     successors of the "old" ones we are replacing
//  2. logSystem->recoverAndEndEpoch ensured that a co-quorum of the "old" tLogs were stopped at
//     versions <= self->lastEpochEnd, so no versions > self->lastEpochEnd could be (fully) committed to them.
//  3. No other master will attempt to commit anything to our "new" Tlogs
//     because it didn't recruit them
//  4. Therefore, no full commit can come between self->lastEpochEnd and the first commit
//     we made to the new Tlogs (self->recoveryTransactionVersion), and only our own semi-commits can come between our
//     first commit and the next new TLogs

(Alex Miller) #13

Just to be pedantically clear: you must lock a set of old tlogs such that it overlaps with any quorum of old tlogs that could be used to commit before starting a new set of tlogs. The new tlogs and old tlogs have no quorum interactions.


Is there more detailed design documents?
(junius) #14

Has some questions about the resolvers. Assume keys are sharded among resolvers. The previous comment says that the resolver holds the last 5 seconds of committed writes in memory. How do we ensure resolvers get updated once the transaction writes are persisted to the distributed logs?

For example, transaction1 updates 2 keys, keyA and keyB. Assume there are 2 resolvers, and keyA owned by resolver1 and keyB owned by resolver2. Transaction1 passes the resolver check and gets persisted to the distributed logs. Transaction2 updates keyA and keyB again, and actually conflicts with transaction1. How do we ensure the resolver1 and resolver2 check transaction2 against the value that are committed by transaction1?


(junius) #15

How does proxy persist the transaction to the distributed logs? Is there a leader in the logging subsystem? So proxy sends the transaction to leader. Or proxy directly writes to the log servers using like two-phase commit?

How does proxy coordinate with each other? Following the same example above that both transaction1 and transaction2 update the same keyA and keyB. Is it possible that transaction1 is sent to proxy1, while transaction2 is sent to proxy2? Or both will be routed to the same proxy server?
For the case that transaction1 sent to proxy1 and transaction2 to proxy2, assume proxy1 passes the resolver check and starts persisting to the logs. How do we ensure transaction2 on proxy2 wait till transaction1 completes, and then get rejected by the resolver?


(gaurav) #16

I could be wrong about this, as my understanding is purely based on reading the descriptions in forums: if a resolver determines the transaction to pass the conflict-resolution checks (based on transaction 's read/write conflict ranges sent to it), it would just “assume” all of the transaction’s write-conflict keys that were sent to it as committed (without any regards for whether the transaction eventually succeeded at proxy or not).

I observe this based on the comment made here


(gaurav) #17

Proxies send the transaction key-values to all the required tlogs and wait for all of those to acknowledge before responding back to the client.

What I am not sure of is who keeps track of which key ranges are tracked by which tlogs? And what happens if some of the tlog servers fail to commit the transactions (sau one out of the two tlogs failed to commit the data)? How do storage servers ignore the data from tlogs that were able to commit it?

Resolvers would serialize the transactions.

Even if different proxies receive transactions t1 and t2 concurrently, operating on the same two keys, each proxy would contact a common resolver for a given key range. This would ensure that either only one of the transactions succeed and other is rejected - if t1 (or t2), was able to get resolved by both resolvers before t2 (or t1), or if in a case when resolver1 resolved keyA from t1 and resolver2 resolved keyB from t2, concurrently - then both these transactions should get rejected, when resolution is attempted on the second key from each transaction.

But, please wait for someone from FDB team to verify, or correct my observations.


(junius) #18

Thanks for the reply! If the resolver assumes the transactions as committed, the resolver would cache the results in-memory. Assume transaction1 is handled by client1. If the proxy crashes before the transaction is persisted to the distributed logs, client1 could retry the transaction. The resolver would be able to pass as it sees the same transaction id (time)?
If client1 also crashes, transaction1 will never get persisted to the distributed logs. And resolver will make wrong decision as it assumes transaction1 is committed. So it is possible that some decisions are wrong in the 5s window?


(Alex Miller) #19

The messages that are sent from the proxy to the transaction logs are a list of (tags, mutation), where a ‘tag’ is a short identifier for a particular storage server. Each storage server is assigned one transaction log as the place that it can receive all the data that it requires, and then additional copies of the data, those required to fulfill the requested replication factor, are spread randomly across the other transaction logs as a form of load balancing.

For a tlog to fail to commit, it would need to crash, or throw an io_error would would lead to tlog death as well. If a transaction log dies, then the entire transaction subsystem gets destroyed and replaced.

The transactions are serialized via the commit timestamp received from the master, and resolvers use that timestamp to know which message from different proxies to process first. But the major part here is the resolver logic, which was correct.

A client would have received commit_result_unknown, and then retried the transaction. It’s completely permissible for FDB to commit both the first attempt, and the second retry, as commit_result_unknown means exactly that, your commit may or may not have happened. This is why it’s strongly recommended that transactions should be idempotent, so that they handle commit_result_unknown correctly.

In the case of any failures in {proxy, resolver, tlog}, the entire subsystem is torn down and recreated. Resolvers don’t have any durable state, so after a transaction log fails, they won’t know anything about conflicts in the re-recovered state. Recoveries also “fast forward” time by 90s, which would kill any in-progress client transactions as well.


(junius) #20

Thanks Alex! So in case of any failure happens at any proxy or resolver, the entire transaction processing subsystem will be restarted. Resolvers will need to load the last 5s transactions from the transaction logging subsystem. All writes will have to wait till the transaction processing subsystem comes back.

How does the resolver load the last transactions from the logging system? One key would be logged to multiple nodes such as 3 to tolerate the single node/disk failure. For example, the logging system has 6 nodes. key1 is written to n1, n2 and n3. Proxy will directly send the writes to these 3 nodes? If the write succeeds at n1 and fails to reach n2 and n3 (such as network timeout), the transaction subsystem will be restarted as proxy fails to persist the write. Will the logging subsystem be restarted? If not, how does resolver know whether it should load key1 change at n1? Is there a primary node for one key?