This is a first draft proposal for how we can maintain read availability in a partitioned data center. It will also generally improve read availability of clusters during failures. The 6.0 release of FoundationDB (currently being developed on the master branch) adds support for asynchronous replication between data centers within a single cluster. These changes depend on that feature.
Goals
Allow any DC to acquire a read version at less than WAN latency.
Read-only transactions work within a data center even when it is disconnected from the other DCs.
Allow clients to get read versions in the primary DC during master recoveries.
Allow storage servers in a remote data center to all reach a consistent version when disconnected from the primary.
Reads in a remote DC are causally consistent with writes done in the same DC.
Proposed Design
Each data center has its own local coordinators and local cluster controller.
Workers register only with their local cluster controller.
Clients only connect to their local cluster controller.
The local cluster controllers use the global coordinators to elect one of them to be responsible for electing the master.
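As a rough illustration of this two-level election (a sketch only, not FoundationDB's actual Flow code; the Coordinator class and elect_lead_controller function are made-up names), each local cluster controller could ask the global coordinators for votes, with the first candidate to gather a majority becoming responsible for electing the master:

    class Coordinator:
        """A global coordinator that votes for the first candidate it sees in each generation."""
        def __init__(self):
            self.votes = {}  # generation -> candidate id

        def request_vote(self, generation, candidate):
            # Vote for the first candidate that asks in this generation.
            self.votes.setdefault(generation, candidate)
            return self.votes[generation] == candidate

    def elect_lead_controller(candidate, coordinators, generation):
        """Return True if `candidate` wins a majority of the global coordinators."""
        votes = sum(1 for c in coordinators if c.request_vote(generation, candidate))
        return votes > len(coordinators) // 2

    # Three local cluster controllers (one per DC) race for leadership; exactly one wins.
    coordinators = [Coordinator() for _ in range(5)]
    winners = [cc for cc in ["cc_dc1", "cc_dc2", "cc_dc3"]
               if elect_lead_controller(cc, coordinators, generation=1)]
    assert len(winners) == 1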
Failure monitoring data is shared between cluster controllers.
The master asks the local cluster controllers in remote data centers for workers to use for remote logs and log routers.
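A minimal sketch of this recruitment step, assuming a hypothetical get_available_workers interface on the local cluster controllers (not the real worker registration API):

    def recruit_remote_workers(remote_cluster_controllers, logs_per_dc, routers_per_dc):
        """remote_cluster_controllers: dict of dc_id -> local cluster controller."""
        recruited = {}
        for dc_id, controller in remote_cluster_controllers.items():
            # Each remote DC's local cluster controller knows its own workers,
            # so the master never has to discover them over the WAN.
            workers = controller.get_available_workers()
            recruited[dc_id] = {
                "remote_logs": workers[:logs_per_dc],
                "log_routers": workers[logs_per_dc:logs_per_dc + routers_per_dc],
            }
        return recruited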
After a master is fully recovered, it will attempt to write the coordinated state to all of the local coordinators in each data center.
The master registers with all cluster controllers; if a local cluster controller dies, the master must resend its registration to the newly elected local cluster controller.
Tlog rejoins are processed by the local cluster controllers.
The current proxies are split into two separate roles: commit proxies are responsible for processing client commits, and version proxies are responsible for providing read versions and answering key server location requests.
Version proxies are elected by the cluster controller in each data center independently of the master.
In the configuration we choose an extra number of log replicas reserved for read availability during log failures.
To provide a version to a client, the version proxy asks all tlogs for their committed version. The version proxy waits for replies until all but [extra log replicas] logs have responded. The minimum version from the replies is the read version.
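A minimal sketch of that quorum logic, assuming each tlog exposes a call that returns its committed version (the names and threading model here are illustrative, not the real tlog interface):

    import concurrent.futures

    def get_read_version(tlogs, extra_log_replicas):
        """tlogs: callables that each return one tlog's committed version."""
        required = len(tlogs) - extra_log_replicas
        replies = []
        with concurrent.futures.ThreadPoolExecutor(max_workers=len(tlogs)) as pool:
            futures = [pool.submit(get_committed_version) for get_committed_version in tlogs]
            # Collect replies in completion order and stop once all but
            # [extra log replicas] logs have responded.
            for future in concurrent.futures.as_completed(futures):
                replies.append(future.result())
                if len(replies) >= required:
                    break
        # The minimum of the collected replies is safe to hand out: at least
        # `required` logs have reported a committed version at or beyond it.
        return min(replies)

    # Example: five tlogs, one extra log replica reserved for read availability,
    # so the version proxy waits for the fastest four replies.
    fake_tlogs = [lambda v=v: v for v in [105, 103, 104, 101, 102]]
    print(get_read_version(fake_tlogs, extra_log_replicas=1))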
Master recovery treats the extra log replicas similarly to an anti-quorum.
Version proxies each connect to their own copy of ratekeeper (multiple ratekeepers are not needed for the first version).
Commits from remote data centers wait for their local version proxies to reach the commit version before returning success to the client.
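A sketch of that rule, with hypothetical commit_proxy and local_version_proxy objects standing in for the real interfaces:

    import time

    def wait_for_local_visibility(commit_version, local_version_proxy,
                                  poll_interval=0.01, timeout=5.0):
        """Block until the local version proxy hands out versions >= commit_version."""
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            if local_version_proxy.get_read_version() >= commit_version:
                return True
            time.sleep(poll_interval)
        return False

    def commit_from_remote_dc(mutations, commit_proxy, local_version_proxy):
        # The commit itself is still processed by a commit proxy in the primary DC.
        commit_version = commit_proxy.commit(mutations)
        # Only report success once a read in the client's own DC would see the write.
        if not wait_for_local_visibility(commit_version, local_version_proxy):
            raise TimeoutError("commit not yet visible in the local data center")
        return commit_version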
At the end of master recovery, after the final write to the coordinated state but before notifying the cluster controller that recovery is complete, all old tlogs still in the coordinated state are read-locked, which prevents them from providing read versions.
Read-locking is done for remote data centers at a similar point: after the remote tlogs are written to the coordinated state, but before the cluster controller is notified about the new generation of remote logs.
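The ordering constraint in the two points above can be summarized as a sketch (all interfaces hypothetical):

    def finish_recovery(coordinators, old_tlogs, cluster_controller, new_state):
        # 1. Make the new generation durable in the coordinated state.
        coordinators.write_state(new_state)
        # 2. Read-lock every old tlog still referenced by the coordinated state so
        #    it stops answering committed-version requests from version proxies.
        for tlog in old_tlogs:
            tlog.read_lock()
        # 3. Only now notify the cluster controller that recovery is complete, so
        #    version proxies move to the new generation of logs.
        cluster_controller.notify_recovered(new_state)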
A new client API, callable on a transaction after its commit succeeds, waits for the committed version to be readable in all data centers. This would be implemented by having the client poll all of the remote version proxies until every reported version has reached the commit version of the transaction.
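A sketch of how such an API could be implemented on the client, assuming each remote version proxy exposes a get_read_version call (the function name wait_for_global_visibility is made up):

    import time

    def wait_for_global_visibility(commit_version, remote_version_proxies,
                                   poll_interval=0.05, timeout=30.0):
        deadline = time.monotonic() + timeout
        pending = list(remote_version_proxies)
        while pending and time.monotonic() < deadline:
            # Keep polling only the data centers that have not caught up yet.
            pending = [proxy for proxy in pending
                       if proxy.get_read_version() < commit_version]
            if pending:
                time.sleep(poll_interval)
        if pending:
            raise TimeoutError("commit not yet readable in every data center")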
The read proxies peek the TXS tag from the local logs and are responsible for answering getKeyServerLocations requests.
Add read proxies, which clients use when requesting data from a storage server across the WAN. This is a pure proxy role, just forwarding requests so that clients do not need to keep lots of connections open cross-dc.
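A minimal sketch of the read proxy role (the storage connection objects and their read method are assumptions for illustration):

    class ReadProxy:
        """A pure forwarder: one pool of cross-DC connections shared by many clients."""
        def __init__(self, storage_connections):
            # storage_connections: storage server id -> an already-open WAN connection.
            self.storage_connections = storage_connections

        def read(self, storage_server_id, key, version):
            # No caching or reinterpretation of the result, just request forwarding.
            return self.storage_connections[storage_server_id].read(key, version)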
Storage servers tell the cluster controller their interface, and clients receive the list of all storage server interfaces from the cluster controller. This is needed because storage servers in remote data centers may not be able to commit their new interface to the database.
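A sketch of that registration flow, with made-up class and method names:

    class LocalClusterController:
        def __init__(self):
            self.storage_interfaces = {}  # storage server id -> network address

        def register_storage_server(self, server_id, address):
            # Storage servers register here instead of committing their new
            # interface to the database, which may be impossible while partitioned.
            self.storage_interfaces[server_id] = address

        def get_storage_interfaces(self):
            # Clients fetch the full list from their local cluster controller.
            return dict(self.storage_interfaces)

    cc = LocalClusterController()
    cc.register_storage_server("ss-1", "10.0.0.12:4500")
    print(cc.get_storage_interfaces())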
Negatives of the Design
We must replicate mutations to more logs than strictly necessary. If we give out read versions after receiving replies from all but one tlog, and the one missing tlog has a version less than the version we gave out, that tlog cannot help us in a recovery.
Slightly longer cluster controller recruitment time, because two leader elections must take place before recruiting a master.
If a master fails after read-locking the transaction logs, but before notifying the cluster controller about the new generation, the cluster will be read-unavailable until the next recovery succeeds.
Our current read versions only need responses from [log replicas] tlogs; this design requires responses from [tlogs] - [extra log replicas] tlogs, so it will have slightly worse average latency. For example, with 8 tlogs, triple replication, and 2 extra log replicas, we currently wait for the fastest 3 replies but would wait for the fastest 6 under this design.
Positives of the Design
Removes the n-squared communication between proxies currently required to get read versions.
Does not add any additional latency to commits or read version requests.
Read versions provided in the primary data center maintain the same causal consistency guarantees.
Minimizes network communication between data centers for worker registration and failure monitoring.