Maintain Read Availability in Partitioned Data Centers

This is a first draft proposal for how we can maintain read availability in a partitioned data center. It will also generally improve read availability of clusters during failures. The 6.0 release of FoundationDB (currently being developed on the master branch) adds support for asynchronous replication between data centers within a single cluster. These changes are dependent on that feature.

Goals

Allow any DC to acquire a read version with less than WAN latency.

Read-only transactions continue to work within a data center even when it is disconnected from other DCs.

Allow for clients to get read versions during master recoveries in the primary DC.

Allow storage servers in a remote data center to all reach a consistent version when disconnected from the primary.

Reads in a remote DC are causally consistent with writes done in the same DC.

Proposed Design

Each data center has its own local coordinators and local cluster controller.

Workers register only with their local cluster controller.

Clients only connect to their local cluster controller.

The local cluster controllers use the global coordinators to elect one of them to be responsible for electing the master.

Failure monitoring data is shared between cluster controllers.

The master asks the local cluster controllers in remote data centers for workers to use for remote logs and log routers.

After a master is fully recovered, it will attempt to write the coordinated state to all of the local coordinators in each data center.

The master registers with all cluster controllers; if a local cluster controller dies, the master must resend its registration to the newly elected local cluster controller.

Tlog rejoins are processed by the local cluster controllers.

The current proxies are split into two separate roles: commit proxies, which are responsible for processing client commits, and version proxies, which are responsible for providing read versions and answering key server location requests.

Version proxies are elected by the cluster controller in each data center independently of the master.

The configuration specifies an extra number of log replicas reserved for read availability during log failures.

To provide a read version to a client, the version proxy asks all tlogs for their committed versions. The version proxy waits until all but [extra log replicas] of the logs have responded. The minimum version among the replies is the read version.
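
A minimal sketch of this quorum rule, with invented names (pick_read_version, the reply list); the real version proxy would gather these replies asynchronously from the tlog interfaces:

```python
# Sketch only: models the rule above; not the actual proxy code.

def pick_read_version(committed_versions, total_tlogs, extra_log_replicas):
    """committed_versions holds one reply per tlog that has answered so far.

    Returns a read version once all but [extra log replicas] tlogs have
    replied, or None if we still have to wait."""
    required_replies = total_tlogs - extra_log_replicas
    if len(committed_versions) < required_replies:
        return None  # not enough replies yet, keep waiting
    # The minimum of the replies is durable on every tlog that answered,
    # so it is safe to hand out as a read version.
    return min(committed_versions)


# Example: 5 tlogs, 1 extra log replica. Four tlogs have replied, one is
# slow or failed, and we can still serve a read version of 107.
print(pick_read_version([107, 110, 109, 108], total_tlogs=5, extra_log_replicas=1))
```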

Master recovery treats the extra log replicas similarly to an anti-quorum.

Version proxies each connect to their own copy of ratekeeper. (Multiple ratekeepers are not needed for the first version.)

Commits from remote data centers wait for their version proxies to have their commit version before returning success to the client.
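
As a rough illustration of that wait, here is a toy sketch; latest_local_version is an invented stand-in for whatever mechanism reports the version the local version proxy can currently serve:

```python
import time

# Sketch only: the remote side holds the client's reply until the local
# version proxy has advanced to the commit version, so a read version
# requested in this data center right after the commit will see the write.

def reply_when_locally_visible(commit_version, latest_local_version,
                               poll_interval=0.01):
    """latest_local_version is a callable returning the version the local
    version proxy can currently serve."""
    while latest_local_version() < commit_version:
        time.sleep(poll_interval)
    return "commit visible locally; reply success to the client"
```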

At the end of master recovery, after the final write to the coordinated state but before notifying the cluster controller about being fully recovered, all old tlogs still in the coordinated state are read-locked, which prevents them from providing read versions.

Read-locking is done for remote data centers at a similar point: after the remote tlogs are written to the coordinated state, but before the cluster controller is notified about the new generation of remote logs.
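
The ordering matters more than the mechanics here, so the sketch below just records the sequence of steps; all names are invented and nothing talks to real tlogs:

```python
# Sketch only: the recovery-end ordering described above, for one primary
# data center and one remote data center.

def finish_recovery_order(old_primary_tlogs, old_remote_tlogs):
    steps = []
    # Primary side: final write to the coordinated state first...
    steps.append("write final state to coordinated state")
    # ...then read-lock the old logs so they stop serving read versions...
    steps += ["read-lock old tlog %s" % t for t in old_primary_tlogs]
    # ...and only then announce the recovery to the cluster controller.
    steps.append("notify cluster controller: fully recovered")

    # Remote side follows the same pattern, per data center.
    steps.append("write remote tlogs to coordinated state")
    steps += ["read-lock old remote tlog %s" % t for t in old_remote_tlogs]
    steps.append("notify cluster controller: new generation of remote logs")
    return steps

for step in finish_recovery_order(["tlog-a", "tlog-b"], ["tlog-r1"]):
    print(step)
```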

A new client API can be called on a transaction after a commit succeeds, which waits for the committed version to be readable in all data centers. This would be implemented by having the client poll all of the remote version proxies until all of the returned versions are greater than or equal to the commit version of the transaction.
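
A sketch of what that API might look like from the client's side; none of these names exist in the current bindings, and get_remote_committed_version stands in for however a client would poll a data center's version proxies:

```python
import time

# Hypothetical sketch of the proposed post-commit API; all names are invented.

def wait_for_version_in_all_dcs(commit_version, data_centers,
                                get_remote_committed_version,
                                poll_interval=0.1):
    """Block until every data center can serve reads at commit_version or later."""
    remaining = set(data_centers)
    while remaining:
        for dc in list(remaining):
            if get_remote_committed_version(dc) >= commit_version:
                remaining.discard(dc)
        if remaining:
            time.sleep(poll_interval)

# Intended usage after a successful commit, e.g.
#   wait_for_version_in_all_dcs(tr.get_committed_version(), ["dc1", "dc2"], probe)
# where probe is whatever the client library would expose for polling remote
# version proxies.
```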

The version proxies peek the txs tag from the local logs and are responsible for answering getKeyServerLocations requests.

Add read proxies, which clients use when requesting data from a storage server across the WAN. This is a pure proxy role, just forwarding requests so that clients do not need to keep lots of connections open cross-DC.

Storage servers tell the cluster controller their interface, and clients receive the list of all storage server interfaces from the cluster controller. This is needed because storage servers in remote data centers may not be able to commit their new interface to the database.

Negatives of the Design

We must replicate mutations to more logs than strictly necessary. For example, if we give out read versions after receiving replies from all but one tlog, and the missing tlog has a version less than the one we gave out, that tlog cannot help us in a recovery.

Slightly longer cluster controller recruitment time, because two leader elections must take place before recruiting a master.

If a master fails after read-locking the transaction logs, but before notifying the cluster controller about the new generation, we will be read-unavailable until the next recovery succeeds.

Our current read versions only need responses from [log replicas] tlogs; this design requires responses from [tlogs] - [extra log replicas] tlogs, so it will have slightly worse average latency. For example, with 8 tlogs, a replication factor of 3, and 1 extra log replica, a read version today needs 3 replies while this design needs 7.

Positives of the Design

Removes the n-squared communication between proxies that is currently required to get read versions.

Does not add any additional latency to commits or read version requests.

Read versions provided in the primary data center maintain the same causal consistency guarantees.

Minimizes network communication between data centers for worker registration and failure monitoring.

There’s a lot here; I will have to think about it more! When you say “After a master is fully recovered, it will attempt to write the coordinated state to all of the local coordinators in each data center,” I assume you mean treating all the coordinators everywhere as a single disk paxos group?

For the moment I will take your word for the properties of the design.

On that basis:

  1. I don’t think there should be global database configuration / modes where external consistency is compromised. Transactions should opt in to even subtly lower consistency levels with transaction options, as they can today with options like CAUSAL_READ_RISKY. The default behavior in an asynchronously replicated datacenter should be to go to the active datacenter for a read version, slow as that might be. Of course it is acceptable to batch such requests in the local datacenter to reduce WAN traffic.

  2. You need to argue for the benefits of “Reads in a remote DC are causally consistent with writes done in the same DC.” It sounds like the kind of property that is just close enough to what you actually need to fool you into doing something dangerous :-). How much better is this than just opting in to CAUSAL_READ_DISABLE or whatever for reads that can accept being slightly stale?

  3. Similarly I’m a little bit worried by the “wait for all datacenters after write” functionality; it seems like any application that relies on that will effectively be down as soon as one datacenter fails.

  4. A slightly different idea, which I don’t know if you already plan, is to try to speed up read/write transactions initiated from an asynchronous (but obviously not partitioned) DC to a single global RTT: the transaction gets its read version with CAUSAL_READ_DISABLE, which picks a read version that will be immediately readable in the local datacenter but might be a few hundred milliseconds behind external consistency. It can then do its reads efficiently against local replicas. If it commits successfully (in the normal way, over the WAN to the active DC) then it can be serialized at its write version so the fact that its read version wasn’t externally consistent is immaterial. If this works, to make it easier to use correctly I would suggest an option something like CAUSAL_READ_DEFER which does the same thing as CAUSAL_READ_DISABLE for read/write transactions, but in the case that you actually make the transaction read only, commit() will either commit it anyway (with reads but zero writes) so that it reestablishes causal consistency, or just throw an error.
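
For concreteness, here is a toy sketch of what the CAUSAL_READ_DEFER behavior described above could look like from the client's side; the class below is made up for illustration and is not the fdb binding:

```python
# Toy model of a transaction that took a possibly-stale local read version,
# as CAUSAL_READ_DISABLE would allow. All names here are invented.

class CausalityError(Exception):
    pass

class DeferredCausalReadTransaction:
    def __init__(self):
        self.writes = {}

    def set(self, key, value):
        self.writes[key] = value

    def commit(self, submit_to_active_dc):
        if self.writes:
            # Read/write path: the commit goes over the WAN to the active DC
            # and is serialized at its write version, so the stale read
            # version does not compromise external consistency.
            return submit_to_active_dc(self.writes)
        # Read-only path: either force a write-free commit to re-establish
        # causal consistency, or refuse so the caller notices. This sketch
        # implements the "refuse" branch.
        raise CausalityError("transaction ended up read-only with a deferred read version")
```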

Hey Dave, thanks for the feedback!

The global coordinators are responsible for what the current coordinators do today. The next generation of master will need to get the last set of tlogs from them. The local coordinators do not need the new set of logs for correctness; the version proxies in the associated data center just will not be able to give out new versions until the logs are written there. Because each data center can be up or down independently, we do independent disk paxos to each set of local coordinators.

  1. Currently the only method of fast reads across data centers is for clients to cache read versions. The main problem with caching read versions is that we cannot bound when an individual commit is visible to all users. You are right about the need to explicitly opt into using the version proxies. I intend to require a transaction option before using the local version proxy.

  2. The main (and maybe only) benefit is reading your own writes. After a client commits, it is expected that their next transaction will see the result of the previous one. This could be done by locally caching read versions, but that does not save you from restarting a client. The user also has to explicitly set their data center as an option, so it is very clear which groups of clients have causal consistency with each other, in case there is a reason to care across multiple clients.

  3. One use case for this feature is related to migrating data. Consider using FoundationDB to store metadata about the location of files, and suppose you want to move a file. With this feature you can first write the file in the new location, then update the metadata in FoundationDB, then wait for all data centers to see the updated metadata, and only then delete the original file (see the sketch after this list). If a data center is down it is generally okay (and maybe preferred) that this process is halted.

  4. I really like this idea. I will need to think about it more, but it seems like something that could co-exist with this design. The version proxies have a lot of benefits even when run with just one data center. A large class of use cases care a lot more about a 1 second interruption in reads than a 1 second interruption in writes, so having the ability to decouple providing read versions from the transaction subsystem is a major benefit.
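
Regarding the migration workflow in point 3 above, here is a sketch; everything in it is a placeholder for application code, and wait_for_all_data_centers stands in for the proposed post-commit API:

```python
# Sketch only: the file-migration workflow, with placeholder helpers passed
# in so nothing here pretends to be a real API.

def migrate_file(file_id, old_location, new_location,
                 copy_file, delete_file,
                 update_metadata, wait_for_all_data_centers):
    copy_file(old_location, new_location)                     # 1. write the file in its new location
    commit_version = update_metadata(file_id, new_location)   # 2. update the metadata in FoundationDB
    wait_for_all_data_centers(commit_version)                 # 3. every data center now sees the new metadata
    delete_file(old_location)                                 # 4. only then delete the original
```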