Data loss when doing failover with a multi-region cluster?

Hello there,
In the series of tests I have been conducting, one is related to failover with a multi-region FDB cluster.

Here is the configuration:

Configuration:
  Redundancy mode        - double
  Storage engine         - ssd-2
  Log engine             - ssd-2
  Encryption at-rest     - disabled
  Coordinators           - 9
  Desired Commit Proxies - 3
  Desired GRV Proxies    - 5
  Desired Resolvers      - 2
  Desired Logs           - 3
  Desired Remote Logs    - 3
  Desired Log Routers    - 2
  Usable Regions         - 2
  Regions:
    Remote -
        Datacenter                    - trt0
        Satellite datacenters         - trt4
        Satellite Redundancy Mode     - one_satellite_double
        Satellite Logs                - 2
    Primary -
        Datacenter                    - trt1
        Satellite datacenters         - trt5
        Satellite Redundancy Mode     - one_satellite_double
        Satellite Logs                - 2

In the region of datacenter trt0, I deleted the operator and then all the pods related to FDB, which correctly led to a failover. However, I noticed that there was no data movement after the failover, which seemed like a sign that something was not going well.
When I connected to a pod and ran fdbcli, I got:

  Warning: the database may have data loss and availability loss. Please restart following tlog interfaces, otherwise storage servers may never be able to catch up.
  Old log epoch: 416 begin: 1469148180768 end: 1469914254878, missing log interfaces(id,address): 5ca7231f5aa0c2b1, 13a9b489bdb7574a, d5bf8a8d39512900,

This came as somewhat of a surprise: we are supposed to have a multi-region setup with a primary and a satellite, so in theory anything committed to the primary should also be committed to the satellite and copied over when there is a failover, but somehow that seems not to be the case?

Is this a real issue? Also, how do you map the IDs to pods or log servers?

Maybe this is related to: 3DC2regions--Simulating Primary Datacenter Failure

AFAIK, this is not a real issue.

When datacenter trt0 completely failed, FDB failed over to trt1. Because trt0 failed completely, there won't be any cross-region data movement traffic.

The fdbcli output is warning about the old primary DC trt0, where data loss has happened (I bet these interfaces belong to trt0). However, because trt1 has replicas, the database as a whole did not lose data. Because satellite trt4 is available, the new primary trt1 has all the mutations from before the failover and can therefore catch up without data loss. The warning is a bit confusing, but it errs on the safe side.

A tip on the operation:

Usually when this happens, we manually issue a region failover via fdbcli by updating the DC priorities so that the primary switches to the original secondary, i.e., trt1 in your case.
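
For concreteness, here is a rough sketch of what that priority change could look like, assuming the standard region-configuration JSON accepted by fdbcli's fileconfigure command. The file name regions.json is just an example, the priority numbers are illustrative, and the satellite settings are copied from the status output above; start from your current cluster.configuration.regions in status json and change only the priorities. Also note that for an operator-managed cluster you would normally make the equivalent change in the FoundationDBCluster spec's databaseConfiguration so the operator does not revert it.

  regions.json (only the priorities differ from the current configuration;
  raise trt1 above trt0, or give trt0 a negative priority so it is never
  chosen as primary again):
  {
    "regions": [
      { "datacenters": [ { "id": "trt1", "priority": 2 },
                         { "id": "trt5", "priority": 1, "satellite": 1 } ],
        "satellite_redundancy_mode": "one_satellite_double",
        "satellite_logs": 2 },
      { "datacenters": [ { "id": "trt0", "priority": 1 },
                         { "id": "trt4", "priority": 1, "satellite": 1 } ],
        "satellite_redundancy_mode": "one_satellite_double",
        "satellite_logs": 2 }
    ]
  }

  fdb> fileconfigure regions.json

The region whose datacenter has the highest priority is chosen as primary, so after this change FDB keeps trt1 as the primary even once trt0 comes back.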

If such a manual failover is not issued, then when the original primary trt0 (tlogs and storage servers) is brought back, FDB will try to use the original primary DC again. However, at that point many storage servers are lagging behind, so Ratekeeper should throttle traffic until those storage servers have caught up. You can verify this by checking that the database is already past the accepting_commits recovery stage (i.e., write available) and by looking at Ratekeeper's RkUpdate trace events (the reason ID tells you why traffic is being throttled).
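
In case it is useful, a small sketch of how to check both signals from a pod; the trace-log path is the Kubernetes operator's default and may differ in your deployment, and the jq filters assume the usual status json layout:

  # recovery state: "accepting_commits" or a later stage means writes are accepted
  fdbcli --exec 'status json' | jq -r '.cluster.recovery_state.name'

  # Ratekeeper's RkUpdate trace events (XML trace files); the reason fields
  # explain why the current TPS limit was chosen
  grep -h 'Type="RkUpdate"' /var/log/fdb-trace-logs/trace.*.xml | tail -n 3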

Had the manual failover been issued, the behavior would be different: no throttling from Ratekeeper, and the database is available right after bringing the old primary DC back. The old primary will gradually catch up. Once it is sufficiently caught up, a manual failback can be issued and the old primary becomes the primary again. If the failback is issued before the old primary has caught up, the database will wait until the DC lag is small enough and then switch regions (it stays available the whole time before the failback).
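
The failback itself is just the same priority change applied in reverse. To get a rough sense of how far behind the old primary still is before issuing it, recent FDB versions expose the lag in status json; the exact field is version-dependent, so treat this as an assumption and check it against your release:

  # approximate lag of the remote/old-primary region, in seconds and versions
  fdbcli --exec 'status json' | jq '.cluster.datacenter_lag'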

how do you map the IDs to pods or log servers?

We usually search for the ID string in the trace logs, e.g., in TLogMetrics events for tlogs.
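
As a concrete (hedged) sketch from a pod, using the IDs from the warning above; the trace-log path is again the operator's default, and recent versions also report role IDs in status json:

  # which trace files mention one of the missing tlog IDs
  grep -l 5ca7231f5aa0c2b1 /var/log/fdb-trace-logs/trace.*.xml

  # list role id / role / process address from status json, then match the
  # address back to a pod (e.g. kubectl get pods -o wide)
  fdbcli --exec 'status json' | jq -r '.cluster.processes[] | .address as $a
      | .roles[]? | select(.id != null) | "\(.id)  \(.role)  \($a)"'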
