Why there is no document explaining FoundationDB internals?

roohitavaf · August 11, 2020, 4:37am

Hi,

I am trying to understand the FoundationDB distributed transaction protocol, but unfortunately, I cannot find any document or paper explaining the details of the protocol. The only reference seems to be this video by Evan that does not cover many details:

I have questions, especially about the transaction logs and proxy:

Can the transaction log be sharded? If yes, when we have more than one shard does a proxy run a 2PC to append the write-set of a transaction to them. Otherwise, how atomicity is guaranteed, e.g. when proxy crashes after writing to some transaction log and before writing to the rest of shards?
Shouldn’t only committed transactions be appended to the transaction logs? If Yes, then why when we have transaction logs with highest timestamps 400, 410, and 420, he says only 400 could have been returned success to the client? So 410 and 420 are appended to the transaction logs but returned failure to the client?

And other questions like this. I have encountered other people having the same questions. The document provided on the website is very basic and seems to target the general audience.

Thank you,
Mohammad

ajbeamon · August 11, 2020, 3:35pm

There is some internal sharding that happens when you have a sufficient number of logs.

It doesn’t use 2PC. Commits are versioned with monotonically increasing versions and applied to each log in order, and I believe the way recovery works when there was a commit is in flight and only written to a subset of the logs is that we recover to the most recent version present on all (or enough?) logs. The proxy won’t tell a client that a commit is successful until it’s written to all the logs, so any later versions have not been reported durable and can be legally discarded.

Transactions will only be appended to the logs if we intend to commit them, but the commit isn’t complete until we tell the proxy replies to the client. In other words, the commit could still fail after writing to a log if, for example, processes die and we have to do a recovery. As discussed in your previous question, that may mean that you’ve started writing commits with higher versions but not finished, so they could be present on some logs and not others.

This is certainly true, and there’s a lot we could do to improve this. There is at least some existing documentation of these details, including a more technical version of my rough description, that can be seen in these pages (neither of which has made it into our website documentation):

github.com

apple/foundationdb/blob/main/design/recovery-internals.md

# FDB Recovery Internals

FDB uses recovery to handle various failures, such as hardware and network failures. When the current transaction system no longer works properly due to failures, recovery is automatically triggered to create a new generation of the transaction system.

This document explains at the high level how the recovery works in a single cluster. The audience of this document includes both FDB developers who want to have a basic understanding of the recovery process and database administrators who need to understand why a cluster fails to recover. This document does not discuss the complexity introduced to the recovery process by the multi-region configuration.

## Background

## `ServerDBInfo` data structure

This data structure contains transient information which is broadcast to all workers for a database, permitting them to communicate with each other. It contains, for example, the interfaces for cluster controller (CC), master, ratekeeper, and resolver, and holds the log system's configuration. Only part of the data structure, such as `ClientDBInfo` that contains the list of GRV proxies and commit proxies, is available to the client.

Whenever a field of the `ServerDBInfo`is changed, the new value of the field, say new master's interface, will be sent to the CC and CC will propagate the new `ServerDBInfo` to all workers in the cluster.

## When will recovery happen?
Failure of certain roles in FDB can cause recovery. Those roles are cluster controller, master, GRV proxy, commit proxy, transaction logs (tLog), resolvers, log router, and backup workers.

Network partition or failures can make CC unable to reach some roles, treating those roles as dead and causing recovery. If CC cannot connect to a majority of coordinators, it will be treated as dead by coordinators and recovery will happen.

Better master exists event can trigger recoveries. Better master exists event is the cluster changes such that there is a better location for some already recruited processes (say master role).

This file has been truncated. show original

roohitavaf · August 16, 2020, 4:15am

Thank you for your response.

I am trying to understand the details of the recovery process from the document you provided. I have a question regarding the KnownCommittedVersion.

The example says:
“Consider an old generation with three TLogs: A, B, C . Their durable versions are 100, 110, 120 , respectively, and their knownCommittedVersion are at 80, 90, 95 , respectively.”

Is it for a single-shard transaction log, i.e. we have only one shard with three replicas A, B, C? If yes, then when the minimum of durable versions is 100, that means all replicas do have the versions <=100. So why we need knownCommittedVersion? Why we have to copy from 96 to 100 to the new transaction logs?

The old logs have 0-100, and new logs have > 100. Storage servers first finish up the old logs, and then start consuming from new logs.

If this example is not single shard, could you please tell me how many shards we have and each have how many replicas?

roohitavaf · August 16, 2020, 10:27pm

I have also a more fundamental question: When a transaction log crashes, is it possible that storage servers that used to read from it, now start consuming from a different old transaction log?

For example, in the example above, suppose transaction log A permanently fails. Suppose there was a storage server S that was reading from A, and it was way behind such that the last offset it read was 20. The new transaction logs start from 96, i.e. they don’t have any offset less than 96, and A is permanently died. So from which transaction log serer S will get offsets 21-96? I guess from old B or C. Is that true?

alexmiller · August 18, 2020, 6:53pm

It did get added to the source for the website, so when 6.3 is pushed, I think it should then go live. One will need to take a look at the source for now to see the more detailed architecture information.

random-uuid · November 10, 2020, 10:58pm

This part was a little confusing to me when I was trying to understand the transaction subsystem.

While not every transaction gets appended to the logs, every transaction batch does get appended. When it’s the turn of a proxy to submit the next batch, that proxy effectively has a cluster-wide lock; nothing else can commit before it.

Same thing happens with resolver batches.

This has implications on scalability of the cluster. If we make the commit batches larger to accommodate for more proxies, the batch submission rate can be kept constant but the variance of batch tardiness will increase, causing availability dips.

Topic		Replies	Views
Technical overview of the database Using FoundationDB	26	12822	January 11, 2019
Understanding inter communication Using FoundationDB	11	3566	October 11, 2018
Transaction Log files Info and three_data_center data replication model Using FoundationDB	2	1080	July 2, 2020
Is fdb's recovery may drop committed transaction? FoundationDB Core	3	666	June 30, 2021
Transaction implement related doubt FoundationDB Core	4	1092	October 20, 2018

Why there is no document explaining FoundationDB internals?

Related topics