Multi DC replication

I see that the roadmap for 6.0 includes multi-datacenter replication, where a number of remote DCs can participate in a cluster either synchronously or asynchronously.

I am asking because I also saw that one goal is to give clients the ability to read from storage in their local DC, which is related to a use case I’m interested in.

Are there plans to allow certain data to be replicated globally and others to only be replicated within that region?

Presenting an entirely flat keyspace globally and providing that feature sounds… difficult and slow, so I think it would be fine to have each region expose only non-overlapping prefixes of its keyspace globally. So, for example, eu-west/ and us-east/ would be replicated to each other, but the rest of the keyspace would only be replicated within each region. The consistency model would be like DR, where reads see a database version that is consistent as of some point in the recent past. If you want to do a read-write transaction against a remote region’s keyspace you have to be prepared for very high latency, more conflicts, etc. I’m sure it would also add complications for transactions spanning the local and global prefixes. :smile:

This would be a killer feature for complying with data protection and data sovereignty regulations while still sharing some data across countries where it is needed and lawful to transfer.

What’s planned for 6.0 is databases that are (each) fast in one region at a time, and can quickly fail over to a different region while preserving full transaction durability. This is, in itself, revolutionary; I’m not aware of any other system of any kind with these properties!

The way to do the sort of thing you are trying to do is to operate multiple FDB databases, each with a different fast region, and connect to each of them from your client. Initially you will have to manage two-phase commit for transactions between them at a higher layer, with advisory locks, but I imagine that as we learn more about the right approach there might be FDB features to make that faster or more transparent. Having the component databases extremely (multi-region) fault tolerant does help a lot, though.
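As a very rough sketch of what that looks like from the client side (cluster file paths and keys here are just placeholders), each database gets its own cluster file and its own, fully independent transactions:

```python
import fdb

fdb.api_version(520)  # whichever API version your client libraries support

# One cluster file per FDB database; each database is fast in a different region.
db_us = fdb.open('/etc/foundationdb/us-east.cluster')   # placeholder path
db_eu = fdb.open('/etc/foundationdb/eu-west.cluster')   # placeholder path

@fdb.transactional
def set_key(tr, key, value):
    # Ordinary ACID transaction, but only within a single database.
    tr[key] = value

@fdb.transactional
def key_exists(tr, key):
    return tr[key].present()

# Local, low-latency, fully transactional writes:
set_key(db_us, b'us-east/users/42', b'...')

# Anything that has to touch both databases atomically needs coordination
# (advisory locks, two-phase commit, etc.) implemented above FDB; these are
# two separate transactions against two separate databases.
set_key(db_eu, b'eu-west/users/42', b'...')
assert key_exists(db_us, b'us-east/users/42')
```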

Over the years I have spent a fair amount of time thinking about how to push these concerns below the FDB abstraction, because obviously it would be very attractive. Unfortunately I haven’t been able to come up with a satisfactory abstraction.

So while I still hope for a great idea for how to do it at the KV-store level, and welcome discussion of that, for the moment I think geographically aware partitioning has to be a layer feature.

I completely understand. It does seem like my solution is only a fit for a certain class of applications, and therefore needs to be at the application level to make sense in this context.

I did some more thinking about this and I think the existing DR system could be modified to provide this without any additional features on the FDB side.

The DR system locks the target to prevent accidental use.

What happens if I just… unlock it? Modify the key in the system keyspace underneath the DR system so that clients can read. Combine that with DRing a different prefix from each region, hook DR up in both directions, and go.
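Concretely, I’m picturing something like the following (I haven’t double-checked the fdbdr flags, so treat the exact options and key-range syntax as a sketch rather than working commands):

```
# Replicate only the eu-west/ prefix from the EU cluster into the US cluster,
# and the us-east/ prefix in the opposite direction. Paths are placeholders.
fdbdr start -s /etc/foundationdb/eu-west.cluster -d /etc/foundationdb/us-east.cluster -k 'eu-west/ eu-west0'
fdbdr start -s /etc/foundationdb/us-east.cluster -d /etc/foundationdb/eu-west.cluster -k 'us-east/ us-east0'
```

The part I’m hand-waving is then unlocking the destination so local clients can read the replicated prefix.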

One could obviously implement a custom transaction log on top, but the existing DR system is very attractive just because it provides ACI out of the box for all the existing FDB features.

I’m sure there’s some limitation I’m overlooking here that would cause the database to blow up if I break the lock, but conceptually I just want ACI from remote regions and ACID locally. A generalized global database that doesn’t compromise somewhere on the feature set doesn’t seem realistic. A layer that adds non-interactive transactions for operating on the remote region would be a great layer feature, but it isn’t right for FDB core.

Hi David,

Do you have more details about the design for quickly failing over to a different region while preserving full transaction durability?

Thanks!

The high-level idea is to ensure that, for each mutation sent to the tlogs, one replica of that mutation is stored on a tlog that’s in a separate failure domain from your main DC. We’ve called this nearby separate failure domain a satellite. In the case of a single DC or satellite failure, there will always exist a transaction log with all the mutations that we’ve promised clients are durable. This is largely just extending the current write pipeline to have transaction logs spread across failure domains.

Concretely, using AWS as an example, one could have a database set up such that two FDB clusters’ worth of processes are provisioned in us-west-1 and us-east-1, and a set of processes running in us-west-2 and us-east-2 is configured to be recruited only as transaction logs, functioning as the satellites of the clusters in us-west-1 and us-east-1 respectively. If us-west-1 goes down, the cluster in us-east-1 can still read the last second of mutations from the transaction logs running in us-west-2.
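To sketch roughly what that setup could look like once the configuration surface settles (field names and modes here are illustrative, not final documentation), you would describe the two sides and their satellites in a region configuration file and hand it to fdbcli. Something like regions.json:

```json
{
  "regions": [
    {
      "datacenters": [
        { "id": "us-west-1", "priority": 1 },
        { "id": "us-west-2", "priority": 1, "satellite": 1 }
      ],
      "satellite_redundancy_mode": "one_satellite_double"
    },
    {
      "datacenters": [
        { "id": "us-east-1", "priority": 0 },
        { "id": "us-east-2", "priority": 1, "satellite": 1 }
      ],
      "satellite_redundancy_mode": "one_satellite_double"
    }
  ]
}
```

and then, from fdbcli:

```
fdb> fileconfigure regions.json
fdb> configure usable_regions=2
```

The region with the higher priority acts as the primary (fast) side, and the satellite entries mark the DCs that only host transaction logs.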

Additionally, work was done to minimize WAN traffic by sending each mutation over the WAN only once, and then re-indexing and replicating the mutation on the non-active side. A stronger notion of locality was introduced so that clients understand that DC-local entities exist, that storage servers should pull from DC-local tlogs, etc. There are configuration options to control the number of satellites, the replication factor within a satellite, and anti-quorums over satellites.
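On the process side, that locality is expressed by telling each fdbserver which DC it lives in, e.g. in foundationdb.conf (option names here are my recollection of the current locality settings; the satellite processes are additionally pinned to log work via their process class):

```ini
# A process in the primary DC (us-west-1)
[fdbserver.4500]
locality_dcid = us-west-1

# A process in the satellite (us-west-2): transaction-log work only
[fdbserver.4500]
class = transaction
locality_dcid = us-west-2
```

(The two stanzas above would of course live in the conf files of different machines.)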

I think @Evan was planning to write the docs for the new DR method soon, as the initial cut of it is a part of the 6.0 release, but if you have any specific questions we can probably answer them sooner.

  1. Does satellite automatically repair primary (and vice versa)?
  2. Is there a change in requirements for satellites, given that they run no storage servers (e.g., less than 4 GB RAM per fdbserver)?
  3. How much space would the combination of single and one_satellite_single modes (2 DCs) save compared to double mode (1 DC)?

The satellites don’t hold a full copy of the data. At a high level they’re equivalent to witness replicas in more quorum-based systems. They only exist to hold the mutations that have been applied in one DC (the primary) but not yet applied on all replicas (the secondary).

No, the only major difference is that you’ll need significantly fewer processes in a satellite location, as you only need to run transaction logs there.

Having two DCs set up with `configure single one_satellite_single` would cost ever so slightly more than just one DC with double replication. In both cases, you’ll end up storing two copies of each mutation and two copies of your full database. With satellites, you additionally need to store up to the past 5 seconds of changes one extra time. That’s likely not a lot of data, so it’s a minimal cost to gain the ability to recover from the loss of a DC.
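As a made-up example to put numbers on that: with 1 TB of key-value data and a sustained write rate of 20 MB/s, double replication in one DC stores roughly 2 TB, and single replication in each of two DCs also stores roughly 2 TB in total; the satellite transaction logs then only add on the order of 5 s × 20 MB/s ≈ 100 MB on top of that.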