Highly available solution with 2.5 datacenters

Hello!

I’m looking for an efficient highly available solution for two datacenters plus a small third site that hosts only one coordinator (a quorum site).

The requirements:

  1. Only two datacenters are available, and they are in the same region.
  2. No data may be placed at the quorum site.
  3. Active/passive: under normal conditions the first datacenter serves the data and the second one keeps the replica.
  4. When the whole second datacenter fails or is taken down for maintenance, the first datacenter should continue working without any downtime. Some performance penalty is acceptable.
  5. When the whole first datacenter fails, the second datacenter should be activated automatically and start working without any manual intervention. A small downtime (< 10 s) is acceptable.
  6. No data loss when switching between datacenters (all committed transactions must be present).
  7. The roles of the two datacenters can be switched manually for maintenance without any data loss. A small downtime and a manual reconfiguration are acceptable.
  8. The number of nodes at each site should be as small as possible, but disabling a single node must not trigger a datacenter switch.

My approach is to use the following topology with 13 nodes:

  1. Each of the two datacenters has 3 big nodes (primary) and 3 small nodes (satellite).
  2. There is one tiny node at the quorum site.
  3. There are 7 coordinators: all 6 big primary nodes plus the tiny node at the quorum site.
  4. Using a multi-region configuration with two logical regions and four logical datacenters. Each logical region corresponds to one physical datacenter.
  5. Each primary has two satellites: one in the same physical datacenter and one in the other physical datacenter. So each satellite datacenter is a satellite for two primaries.
  6. Using the two_satellite_safe satellite_redundancy_mode.
           Quorum site 

      Primary1------Primary2 
      |         \  /       | 
      |         /  \       | 
      |       /      \     | 
      Satellite1-------Satellite2 
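
For the logical datacenters to exist, each fdbserver process has to be told which logical datacenter it belongs to, for example via datacenter_id in foundationdb.conf. A minimal sketch (the ports are placeholders; the dc ids match the file below):

  # foundationdb.conf fragment (sketch only; ports are placeholders)
  # A "primary" process in physical datacenter 1, logical dc11:
  [fdbserver.4500]
  datacenter_id = dc11

  # A "satellite" process, also in physical datacenter 1, logical dc12:
  [fdbserver.4501]
  datacenter_id = dc12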

The region configuration file looks like this:

{
  "regions": [
    {
      "datacenters": [
        { "id": "dc11", "priority": 2 },
        { "id": "dc12", "priority": 0, "satellite": 1 },
        { "id": "dc22", "priority": 0, "satellite": 1 }
      ],
      "satellite_redundancy_mode": "two_satellite_safe"
    },
    {
      "datacenters": [
        { "id": "dc21", "priority": 1 },
        { "id": "dc22", "priority": 0, "satellite": 1 },
        { "id": "dc12", "priority": 0, "satellite": 1 }
      ],
      "satellite_redundancy_mode": "two_satellite_safe"
    }
  ]
}

So each of the satellite datacenters is present in both regions.
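
For completeness, this is roughly how such a file gets applied in fdbcli (a sketch; regions.json is just a placeholder name, and usable_regions=2 is what actually enables replication to the second region):

  fdb> fileconfigure regions.json
  fdb> configure usable_regions=2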

A small test shows that this configuration seems to work.

I understand that if the connectivity between the two datacenters goes down right before the first datacenter fails, then switching without data loss is not possible. However, I assume this case has a very low probability.

My doubts and questions are as follows:

  1. I couldn’t find any examples in the FDB documentation of one satellite being shared between two primaries. Is this a supported configuration?
  2. What happens in two_satellite_safe mode if the connectivity between the two datacenters works but is sometimes unstable: some IP packets are lost, so transfers require a lot of retransmissions? Will it switch the synchronous log transfer off, or will it cause performance and latency issues? Are there any knobs controlling this?

From what I recall of the multi-region implementation, I’d expect this to work. I do not believe this exact setup is tested by simulation, though.

Synchronous replication to satellites is required in multi-region. Thus, an unreliable network will result in performance/latency issues. If you have two satellites and an unreliable network, using two_satellite_fast seems like a good idea, as hopefully the slowness of the two satellites is uncorrelated.
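
For reference, relative to the file posted above that is only a change of the mode string; a sketch of the first region entry:

  {
    "datacenters": [
      { "id": "dc11", "priority": 2 },
      { "id": "dc12", "priority": 0, "satellite": 1 },
      { "id": "dc22", "priority": 0, "satellite": 1 }
    ],
    "satellite_redundancy_mode": "two_satellite_fast"
  }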

I presume that your choice of “unreliable” was intentional, so as not to include unavailability. Exceedingly slow or unreliable network links are an area that FDB currently does not handle well. As long as the network is reliable enough for some communication to proceed, FDB will not recognize that a link slowing down to 100 KB/s is effectively the same as unavailability and initiate failure handling accordingly. There’s work going on in this area (or was scheduled? I’ve lost track of it) to handle such situations better, but I’m not aware that any of it has been merged.

Also note that TCP has rather unfriendly behavior towards frequent packet loss, so you’ll likely see the effective bandwidth between your two datacenters drop drastically. I’ve previously encouraged evaluations of changing TCP’s congestion control algorithm to BBR, which better handles networks with a low degree of loss that’s not indicative of congestion, but I’ve yet to hear the results of any such evaluation.
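
For example, switching the algorithm on a Linux host looks roughly like this (assuming a kernel that ships the tcp_bbr module):

  # Check which congestion control algorithms are available, load BBR,
  # and make it the default for new connections.
  sysctl net.ipv4.tcp_available_congestion_control
  modprobe tcp_bbr
  sysctl -w net.ipv4.tcp_congestion_control=bbr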

Thank you very much, @alexmiller, for your response.

Unfortunately, using two_satellite_fast is not an acceptable solution. Because the network latency between the two sites is always greater than within the same site, this configuration means that commits will never wait for the logs to be transferred to the satellite on the other site. If the whole first site goes down, the last commits will not be present at the second site, which contradicts requirement 6 (zero data loss).

two_satellite_fast effectively gives synchronous log transfer to the satellite on the same site and asynchronous log transfer to the other site. What we’d like is a solution that transfers data synchronously to both satellite datacenters while the network is reasonably “good” and switches to asynchronous transfer when the network becomes “bad”, with the ability to tune the good/bad criteria.

Yeah, multi-region just isn’t that level of smart yet, especially around gray failures.

Do note the fallback text here:

two_satellite_safe mode

Keep two copies of the mutation log in each of the two satellite datacenters with the highest priorities, for a total of four copies of each mutation. This mode will protect against the simultaneous loss of both the primary and one of the satellite datacenters. If only one satellite is available, it will fall back to only storing two copies of the mutation log in the remaining datacenter.

If Satellite2 in your diagram is unavailable, then FDB would continue to run with the mutation log being stored only in the primary and Satellite1. That still satisfies FDB’s availability promises, as it would take losing Primary1 and Satellite1 on top of the existing Satellite2 unavailability to cause FDB unavailability or data loss, but you might care slightly more if you are concerned about losing {Primary1, Satellite1} simultaneously while Satellite2 is flakily unavailable. On the other hand, I suppose this does satisfy, to some degree, your desire to remove Satellite2 when it’s being “excessively” bad while still allowing FDB to continue running.
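
If it helps, one way to see which datacenters currently hold the mutation log is to look at the log roles reported by status json, e.g. something like the following (exact field paths may vary between FDB versions):

  fdbcli --exec 'status json' \
    | jq -r '.cluster.processes[]
             | select(any(.roles[]?; .role == "log"))
             | .locality.dcid' \
    | sort | uniq -c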

Indeed. But the probability of two simultaneous failures is lower than the probability of either failure alone.

I thought about using two_satellite_fast mode and adding some artificial network delay from Primary1 to Satellite1 (and the same delay from Primary2 to Satellite2). In this case, when the network is “good”, the logs would usually be written to the satellite on the other site earlier than to the one on the same site, so Satellite2 would almost always contain the logs of all committed transactions. Only when the network is unavailable or “bad” would the logs start being written synchronously to Satellite1.
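
For what it’s worth, the “artificial delay” part could be prototyped with Linux netem; a rough sketch on a primary-side host (the interface name and the satellite subnet are placeholders, and with the default prio map some unrelated traffic may also land in the delayed band, so this is only an illustration):

  # Add ~5 ms of extra egress latency toward the same-site satellite subnet.
  tc qdisc add dev eth0 root handle 1: prio
  tc qdisc add dev eth0 parent 1:3 handle 30: netem delay 5ms
  tc filter add dev eth0 parent 1: protocol ip prio 3 u32 \
      match ip dst 10.0.12.0/24 flowid 1:3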