Cross Datacenter/Region?

Sorry if this is covered elsewhere; I must’ve missed it! FoundationDB sounds really exciting, but I was wondering whether it’s something we could use cross-region or cross-datacenter like Cassandra, or whether it’s designed to have a separate database in each region. Is there a limit to the latency between nodes? Could we have a world-wide cluster?

Thanks for the help!

There are two options for multiple data centers in 5.1:

  • You can set up separate clusters in each datacenter and use fdbdr to configure one cluster as the primary and the other as a read-only replica of the primary. I wrote a little more about this here: Indication in ChangeLog to identify when FDB went private

  • You can use the three_datacenter configuration, which allows a single cluster to span three datacenters. In this mode your commit latencies will include a round-trip latency between the datacenters. If you are willing to give up causal consistency, your read-only transactions can still be very fast if you cache read versions (see the sketch just below), but you should really understand what you are doing before going this route.
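
As a rough sketch of the read-version caching idea (the helper names and the staleness window here are made up, and a real client would want proper retries and error handling), something like this with the Python bindings:

```python
import time

import fdb

fdb.api_version(510)    # the 5.1 API discussed in this thread
db = fdb.open()         # default cluster file

_cached_version = None  # process-local cache, just for this sketch
_cached_at = 0.0
MAX_STALENESS = 2.0     # seconds of staleness you are willing to tolerate

def stale_read(key):
    """Read-only lookup that reuses a recently obtained read version.

    Reusing a version skips the get-read-version round trip (the part
    that crosses datacenters in three_datacenter mode), at the cost of
    causal consistency: you may read data up to MAX_STALENESS old.
    """
    global _cached_version, _cached_at
    tr = db.create_transaction()
    if _cached_version is not None and time.time() - _cached_at < MAX_STALENESS:
        tr.set_read_version(_cached_version)   # reuse the cached version
        return tr[key].wait()
    value = tr[key].wait()                     # fetches a fresh read version
    _cached_version = tr.get_read_version().wait()
    _cached_at = time.time()
    return value
```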

I forgot to mention that 6.0 will include additional options for cross-region deployments. I will write up more about the new options soon.


I’m interested to hear more about the cross-region features slated for 6.0, but in the meantime: what is the required cluster geometry of the three_datacenter redundancy mode? In testing, I found that I couldn’t achieve a healthy cluster unless I had at least two instances running in each of three datacenters (for example, five datacenters with one instance each gave warnings in “status”). I’d also originally assumed that this mode required exactly three datacenters, but I found that it seemed to work with four. I couldn’t find any discussion of this in the documentation, though it’s possible I didn’t know exactly what I was looking for.
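
For reference, here is roughly how I was poking at the cluster layout, as a quick Python sketch that reads the \xff\xff/status/json special key (the field names are from memory, so treat it as approximate):

```python
import json

import fdb

fdb.api_version(510)
db = fdb.open()

@fdb.transactional
def read_status(tr):
    # The machine-readable status document lives at this special key.
    return json.loads(tr[b'\xff\xff/status/json'].wait())

status = read_status(db)
cluster = status.get("cluster", {})

# Configured redundancy mode etc. (field layout may differ by version).
print("configuration:", cluster.get("configuration"))

# Distinct datacenter ids the cluster currently sees.
dcids = {
    proc.get("locality", {}).get("dcid")
    for proc in cluster.get("processes", {}).values()
}
print("datacenters:", sorted(d for d in dcids if d is not None))
```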

I’m also curious to know what the exact replication semantics are in that mode. The docs state that

FoundationDB attempts to replicate data across two datacenters and will stay up with only two available. Data is triple replicated.

and my interpretation of this is that data is replicated to a total of three independent FDB instances, with care taken to place two copies in one datacenter and one in another. To me, this implies that there is no guarantee that reads are serviced from a client’s local datacenter: if you’re unlucky enough to be in the third datacenter, which has no copies of the data, it appears as though you won’t get fast reads whether you cache read versions or not. Moreover, the probability of being in this pickle seems to increase as more datacenters are added to the cluster (assuming a three_datacenter cluster can consist of more than three datacenters; see above).

Is my understanding of all these points correct?

I believe the docs for three_datacenter mode might be out of date. They were originally written for an older incarnation of three_datacenter mode; that mode has since been changed, and it doesn’t look like the docs have been updated to reflect the change (at least according to git blame).

The current three_datacenter mode actually sextuple-replicates data, with two copies per DC. Each client can read from one of its local replicas, and if a storage server dies, its data can be re-replicated to other storage servers in the same DC from the remaining DC-local copy, which is why there are two copies per DC instead of just one (if both die, the data must be replicated over the WAN, but that should be relatively rare). The transaction subsystem is configured so that it is replicated across only two DCs at any given time, but it will re-configure itself to use the two remaining DCs if either of those two goes down.

The three datacenter mode is designed to support operations for clusters spanning exactly three datacenters, and I’m actually not sure what happens if you add more. If you can add more, that’s interesting, but it might not quite give you the semantics you are after.

The issue with running a three_datacenter configuration with more than three DCs is that, in the event of a DC failure, all the data will be moved to the three remaining DCs. This means you need to budget disk space such that everything will fit in three of the four datacenters, and you need to be okay with the heavy WAN traffic involved in moving the data during and after the failure.

Is it possible to use the three_datacenter mode with “2.5” datacenters? By this I mean two equivalent datacenters, and a third location with a smaller number of nodes for use only in case of emergency?

In other words, is it possible to say that DC1 and DC2 should always be the two active DCs when everything is running fine, and to fall back on DC3 (the smaller one) only if either DC1 or DC2 goes down? And as soon as the failed DC comes back up, DC3 should step down (not the other, bigger DC).

I’m in a situation with a lot of short-lived keys, so a lot of reads and writes but not a large volume of data, and I’d like to ensure that applications can still read and write when either DC1 or DC2 goes down. I’m okay with a bit of performance degradation while DC3 is active.

No, but what you want is possible with the multi-DC work that’s coming in release 6.0. You can specify priorities for which datacenter acts as the commit location, so you could place DC1 and DC2 at equal priority and DC3 at a lower priority; DC3 would then only be used when neither of the other two is available.
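
As a very rough sketch of what that could look like (the JSON schema here is from the in-progress 6.0 region-configuration work and may still change before release, and the DC ids are made up), you would write out a document with per-datacenter priorities and hand it to fdbcli’s fileconfigure command:

```python
import json
import subprocess

# Hypothetical DC ids. A higher priority wins the commit-location role;
# dc3 would only take over when neither dc1 nor dc2 is available. Whether
# each DC gets its own region entry like this may change before release.
region_config = {
    "regions": [
        {"datacenters": [{"id": "dc1", "priority": 1}]},
        {"datacenters": [{"id": "dc2", "priority": 1}]},
        {"datacenters": [{"id": "dc3", "priority": 0}]},
    ]
}

with open("regions.json", "w") as f:
    json.dump(region_config, f, indent=2)

# Apply it with fdbcli (exact invocation depends on your deployment).
subprocess.run(["fdbcli", "--exec", "fileconfigure regions.json"], check=True)
```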

Looks like what I need. I think I already know the answer, but what is the expected time frame for 6.0? :sunglasses:

One question about using the DR agent to synchronize one datacenter with another: if I understand correctly, there is one active cluster at a time, which ships its logs to a secondary (inactive) cluster via the dr_agent tool. When the first DC “dies”, the second one takes over (with a possible loss of the last few transactions).

What happens if the network connectivity between the two DCs goes down? Is it possible for the secondary cluster to assume that the other one is dead and start accepting writes while the other DC is still alive and accepting writes too? Or does the secondary cluster need a more explicit signal to take over than simply being unable to talk to the primary?

How do the clients know that they need to switch to the new coordinators in the secondary cluster? Do we need to give them two cluster files, or only one with all the coordinators of both clusters?

With the current DR setup, no, you wouldn’t see the secondary start up automatically. There is a manual command (invokable through the fdbdr command-line utility) that can switch the primary and secondary in a running DR pair without any data loss, as well as one that stops a DR so that the secondary is unlocked (in the case of, say, a meteor strike hitting a datacenter).

The general advice is that clients should live inside (or near) the DC where the cluster is located, so something like running two sets of clients, one in each DC, where each client only has the cluster file that connects to its side, might be reasonable. (You can also detect whether a cluster is up by seeing if you get the “database_locked” error back when trying to read from it.) If you’re in a situation where this is easy, you could also conceivably spin up the secondary side only once you’ve noted that you need to fail over.

Alternatively, you could provide clients with both cluster files and then create one Database object per cluster. Again, you could conceivably notice which one is hot by seeing which one returns “database_locked” when reading; a rough sketch is below. There is also a system key that (if ye be brave or fool enough) you could read to see whether a cluster is locked, if that’s more your speed.
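
Something like this, say (the cluster-file paths and probe key are made up, and a real client would wrap the read in a proper retry loop):

```python
import fdb

fdb.api_version(510)

# One Database object per side of the DR pair (hypothetical paths).
primary = fdb.open("/etc/foundationdb/primary.cluster")
secondary = fdb.open("/etc/foundationdb/secondary.cluster")

def is_unlocked(db, probe_key=b"probe"):
    """Return True if an ordinary read works, False if the cluster is locked.

    A locked DR secondary fails plain reads with the database_locked error
    (code 1038).
    """
    tr = db.create_transaction()
    tr.options.set_timeout(5000)   # don't hang forever if the DC is unreachable
    try:
        tr[probe_key].wait()
        return True
    except fdb.FDBError as e:
        if e.code == 1038:         # database_locked
            return False
        raise

# Route traffic to whichever side is currently accepting reads.
active_db = primary if is_unlocked(primary) else secondary
```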

In the new DR configuration currently being developed on the master branch, there is only really one cluster with one cluster file; operations from the “cold” DC will be slower, but they should still be possible. The system will usually be configured to fail over automatically, and if you do nothing, everything should continue to work with degraded performance. It would be better if, in the case of a failure, you also switched to using app servers closer to the hot side, but that’s not strictly required. Split brain is avoided by (1) electing a master in a DC (chosen based on priority) that can talk to a quorum of coordinators and (2) recruiting the other components of the transaction subsystem in that same region. The quorum requirement on the coordinators stops more than one master from being elected.