Dataloss when doing failover with a multi region cluster?

Hello there,
In the series of tests I have been conducting, one is related to failover with a multi-region FDB cluster.

Here is the configuration:

Configuration:
  Redundancy mode        - double
  Storage engine         - ssd-2
  Log engine             - ssd-2
  Encryption at-rest     - disabled
  Coordinators           - 9
  Desired Commit Proxies - 3
  Desired GRV Proxies    - 5
  Desired Resolvers      - 2
  Desired Logs           - 3
  Desired Remote Logs    - 3
  Desired Log Routers    - 2
  Usable Regions         - 2
  Regions:
    Remote -
        Datacenter                    - trt0
        Satellite datacenters         - trt4
        Satellite Redundancy Mode     - one_satellite_double
        Satellite Logs                - 2
    Primary -
        Datacenter                    - trt1
        Satellite datacenters         - trt5
        Satellite Redundancy Mode     - one_satellite_double
        Satellite Logs                - 2

In the region of datacenter trt0, I deleted the operator and then all the pods related to FDB, which correctly led to a failover. However, I noticed that there was no data movement after the failover, which is a sign that something is not going well.
When I connected to a pod and ran fdbcli, I got:

  Warning: the database may have data loss and availability loss. Please restart following tlog interfaces, otherwise storage servers may never be able to catch up.
  Old log epoch: 416 begin: 1469148180768 end: 1469914254878, missing log interfaces(id,address): 5ca7231f5aa0c2b1, 13a9b489bdb7574a, d5bf8a8d39512900,

This somewhat came as a surprise. We are supposed to have multi-region with a primary and a satellite, so in theory anything committed to the primary should also be committed to the satellite and copied over when there is a failover, but somehow that seems not to be the case?

Is this a real issue? Also, how do you map the IDs to pods or log servers?

Maybe this is related to: 3DC2regions--Simulating Primary Datacenter Failure

AFAIK, this is not a real issue.

When datacenter trt0 completely failed, FDB failed over to trt1. Because of the complete failure of trt0, there won't be any data movement traffic across regions.

The fdbcli output is warning about the old primary DC trt0, where data loss has happened (I bet these interfaces belong to trt0). However, because trt1 has replicas, the database as a whole didn't have data loss. Because satellite trt4 is available, the new primary trt1 has all the mutation data from before the failover and thus can catch up without data loss. The warning is a bit confusing, but probably errs on the safe side.

A tip on the operation:

Usually when this happens, we will manually issue a region failover via fdbcli by updating DC priorities so that the primary will switch to be the original secondary, i.e., trt1 in your case.
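For example, with the trt0/trt1 layout above (where both regions already have satellites), such a manual failover is essentially a priority change in the regions JSON. This is only a rough sketch; the JSON and the priority values must mirror your actual region configuration from status json:

# Sketch: lower trt0's priority and raise trt1's so that trt1's region is preferred as primary.
fdbcli --exec "configure regions='[{\"datacenters\":[{\"id\":\"trt0\",\"priority\":0},{\"id\":\"trt4\",\"priority\":1,\"satellite\":1}],\"satellite_logs\":2,\"satellite_redundancy_mode\":\"one_satellite_double\"},{\"datacenters\":[{\"id\":\"trt1\",\"priority\":1},{\"id\":\"trt5\",\"priority\":1,\"satellite\":1}],\"satellite_logs\":2,\"satellite_redundancy_mode\":\"one_satellite_double\"}]'"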

If such a manual failover is not issued, then when the original primary trt0 (tlogs and storage servers) is brought back, FDB will try to use the original primary DC again. However, at this point many storage servers are lagging behind, so Ratekeeper should throttle traffic until these storage servers have caught up. You can validate this behavior by checking that the database is already past the accepting_commits stage (i.e., write available) and by looking at Ratekeeper's RkUpdate events (the reason ID tells you why it is throttling).

Had the manual failover been issued, the behavior would be different: no throttling from Ratekeeper, and the database is available after bringing the old primary DC back. The old primary will gradually catch up. When the old primary is sufficiently caught up, a manual failback can be issued and the old primary becomes the primary again. If the failback is issued but the old primary has not caught up yet, the database will wait until the DC lag is small enough and then switch regions (it remains available before the failback).

how do you map the id to pods or log server ?

We usually search the ID string in the logs, e.g., TLogMetrics events for tlogs.
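For reference, here is a minimal sketch of that search on a pod, assuming the default trace log location /var/log/foundationdb (adjust the path and file extension for your deployment; trace files may also be JSON):

# Find which trace files (and hence which processes/pods) mention a given tlog ID.
grep -l "5ca7231f5aa0c2b1" /var/log/foundationdb/trace.*.xml

# TLogMetrics events carry the tlog's ID along with the emitting process's address.
grep "TLogMetrics" /var/log/foundationdb/trace.*.xml | grep "5ca7231f5aa0c2b1" | head -n 1

# Similarly, Ratekeeper's RkUpdate events show the current limits and the throttling reason.
grep "RkUpdate" /var/log/foundationdb/trace.*.xml | tail -n 5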


@jzhou

This is my current configuration:

"regions" : [
                {
                    "datacenters" : [
                        {
                            "id" : "dc1",
                            "priority" : 1
                        },
                        {
                            "id" : "us-west-2",
                            "priority" : 1,
                            "satellite" : 1
                        }
                    ],
                    "satellite_logs" : 3,
                    "satellite_redundancy_mode" : "one_satellite_single"
                },
                {
                    "datacenters" : [
                        {
                            "id" : "dc2",
                            "priority" : 0
                        }
                    ]
                }
            ],

I killed dc1, so now I have the following:

  Regions:
    Remote -
        Datacenter                    - dc1
        Satellite datacenters         - us-west-2
        Satellite Redundancy Mode     - one_satellite_single
        Satellite Logs                - 3
    Primary -
        Datacenter                    - dc2

You mentioned issuing a manual failover, but that doesn't work. I swapped the priorities of dc1/dc2 so that dc2 is primary:

sh-5.1$ fdbcli --exec "configure  regions='[{\"datacenters\":[{\"id\":\"dc1\",\"priority\":0},{\"id\":\"us-west-2\",\"priority\":1,\"satellite\":1}],\"satellite_logs\":3,\"satellite_redundancy_mode\":\"one_satellite_single\"},{\"datacenters\":[{\"id\":\"dc2\",\"priority\":1}]}]'"
ERROR: These changes would make the configuration invalid

The configuration is invalid because the remote region does not have satellites configured, which is required for that region to become the primary.
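For illustration, a region definition that would allow dc2 to be promoted needs its own satellite entry, roughly like the sketch below. DC2_SATELLITE is a placeholder for whichever datacenter could act as dc2's satellite; it does not exist in your current setup:

{
    "datacenters" : [
        { "id" : "dc2", "priority" : 1 },
        { "id" : "DC2_SATELLITE", "priority" : 1, "satellite" : 1 }
    ],
    "satellite_logs" : 3,
    "satellite_redundancy_mode" : "one_satellite_single"
}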

I also tried swapping them so that dc2 has the satellite:

sh-5.1$ fdbcli --exec "configure double ssd-rocksdb-v1 usable_regions=2 logs=3 resolvers=1 log_routers=3 remote_logs=3 commit_proxies=2 grv_proxies=1 storage_migration_type=gradual perpetual_storage_wiggle=1 perpetual_storage_wiggle_locality=0 regions='[{\"datacenters\":[{\"id\":\"dc2\",\"priority\":1},{\"id\":\"us-west-2\",\"priority\":1,\"satellite\":1}],\"satellite_logs\":3,\"satellite_redundancy_mode\":\"one_satellite_single\"},{\"datacenters\":[{\"id\":\"dc1\",\"priority\":0}]}]'"
ERROR: These changes would make the configuration invalid

Right now, the failover from dc1 to dc2 after I killed dc1 has left the cluster in a bad state:

  Warning: the database may have data loss and availability loss. Please restart following tlog interfaces, otherwise storage servers may never be able to catch up.
  Old log epoch: 12 begin: 827334449 end: 1093552744, missing log interfaces(id,address): 61f46c21d2e4658d, 6c502d17604de6a5, 7e3b34298e52abf4,

  Server time            - 08/15/25 18:58:38

I saw the message earlier in this thread (Dataloss when doing failover with a multi region cluster?) saying that this might be benign.

But when I tried adding dc1 back, it is not recruiting any logs in dc1. A very similar thing happened when I tested killing dc2 with a healthy cluster – as soon as I added dc2 back, it was unable to recruit any logs there (and there weren't enough storage servers either). The log routers also moved to dc1 and did not move back to the remote dc2.

Seems like a fairly common failure-testing scenario, so I'm not sure what I'm doing wrong.

To add a bit more detail here: I am trying to test multi-region failover in a setup managed by the FDB Kubernetes operator.

Primary: dc1, satellite us-west-2
Remote: dc2

When I set skip=true in dc1 and deleted all of the dc1 pods, fdb failed over to dc2 fine, and when I set skip=false in dc1, everything came back fine and it automatically failed back to dc1. All good!

However, I also wanted to test the case where dc1 is completely unrecoverable and we need to bring it back. For this test, I uninstalled the FDB deployment in dc1 (which removed the FoundationDBCluster resource that holds all the pod/process class/PV information). It failed over to dc2 again as expected, but now I cannot figure out how to revive dc1.

I cannot change the fdb configuration via cli – literally every attempt leads to ERROR: These changes would make the configuration invalid.

I also tried re-installing the dc1 fdb deployment (with updated seed connection string), and although it can talk to the cluster, it cannot recover past the accepting_commits stage, and no tlogs are being recruited.

What is the suggested way here to bring this cluster back? ty!

Not sure if the FDB Kubernetes operator can do the following; you might need to do it manually:

  • After failing over to DC2, change usable_regions to 1, and then drop DC1 from the configuration. These configuration changes might need to be done in two steps.
  • Bring back a set of processes in DC1.
  • Add DC1 to the configuration, and then change usable_regions to 2.

Note DC2 will also need satellites when there are two regions and DC2 is the primary.
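A rough fdbcli sketch of those steps, using the dc1/dc2 names from this thread. The regions JSON must mirror your own layout, DC2_SATELLITE is a placeholder for whichever DC serves as DC2's satellite, and you may need the FORCE option while DC1 is unreachable (as in your later commands):

# 1. After the failover, reduce to one usable region, then drop DC1 from the regions JSON:
fdbcli --exec "configure usable_regions=1"
fdbcli --exec "configure regions='[{\"datacenters\":[{\"id\":\"dc2\",\"priority\":1}]}]'"

# 2. Bring back a set of processes in DC1 (outside fdbcli, e.g. via the operator).

# 3. Add DC1 back to the regions JSON (DC2's region now needs a satellite, per the note above),
#    then re-enable the second region:
fdbcli --exec "configure regions='[{\"datacenters\":[{\"id\":\"dc2\",\"priority\":1},{\"id\":\"DC2_SATELLITE\",\"priority\":1,\"satellite\":1}],\"satellite_logs\":3,\"satellite_redundancy_mode\":\"one_satellite_single\"},{\"datacenters\":[{\"id\":\"dc1\",\"priority\":0}]}]'"
fdbcli --exec "configure usable_regions=2"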

The operator can perform those steps, but they must be executed one by one.

Depending on your setup and the failures that you are trying to test, it might be worth taking a look at the kubectl-fdb plugin, which has support for recovering a multi-region cluster in such cases: fdb-kubernetes-operator/kubectl-fdb/cmd/recover_multi_region_cluster.go at main · FoundationDB/fdb-kubernetes-operator · GitHub. The intention of this command is a bit different, since the idea is to use it when the cluster is in such a bad state that you are not able to recover a majority of coordinators.

Another side note for the multi-region setup with the operator: since we don't have a global FoundationDBCluster spec (yet), you have to make sure to apply all configuration changes to all of the different FoundationDBCluster specs that form the multi-region cluster (I hope that makes sense). Otherwise the different operator instances might have a different view of the desired state and try to configure the cluster in different ways.

Thanks all. Wanted to follow up here in case anyone else is running into this issue.

This is my exact setup:

dc: ${DC}
usableRegions: 2
redundancyMode: double
storageCount: 3
logCount: 3
grvProxyCount: 1
commitProxyCount: 2
resolverCount: 1
seedConnectionString: ${SEED_CONNECTION_STRING}
regions:
  - datacenters:
      - id: ${PRIMARY}
        priority: 1
      - id: ${SATELLITE}
        satellite: 1
        priority: 1
    satellite_logs: 3
    satellite_redundancy_mode: one_satellite_single
  - datacenters:
      - id: ${REMOTE}
        priority: 0

Scenario 1: PRIMARY is unavailable. This works well and is the common case of failure – we can fail back to PRIMARY automatically after it comes back online. I mocked this by killing all the FDB pods in PRIMARY and setting skip on the operator so it doesn't revive them:

setSkipOnOperatorForDC "$PRIMARY" "true"

killPodsInDC "$PRIMARY"

# The 'false' here means don't check data distribution state here. It will be stuck saying "initializing" until we fail back over.
awaitFailover "$REMOTE" false
checkReconciliationForDC "$REMOTE"

print_green "Automated failover to "$REMOTE" complete!"
setSkipOnOperatorForDC "$PRIMARY" "false"

checkReconciliationForDC "$PRIMARY"
awaitFailover "$PRIMARY"

Scenario 2: PRIMARY is unavailable and completely unrecoverable – maybe the Kubernetes cluster went down and all data is lost. I mocked this by uninstalling the deployment in PRIMARY. dcs-final-failover.yaml contains the cluster in a failed-over state (i.e., REMOTE is now primary). Note that we need to manually force the failover into REMOTE – otherwise it complains that the configuration is invalid.

uninstallDC "$PRIMARY"

# The 'false' here means don't check data distribution state here. It will be stuck saying "initializing" until we fail back over.
awaitFailover "$REMOTE" false

setSkipOnOperatorForDC "$SATELLITE" "true"
setSkipOnOperatorForDC "$REMOTE" "true"

print_green "Manually forcing failover to $REMOTE"
execInPodInDC "$REMOTE" "configure FORCE double ssd-rocksdb-v1 usable_regions=2 logs=3 resolvers=1 log_routers=3 remote_logs=-1 commit_proxies=2 grv_proxies=1 regions=[{\\\"datacenters\\\":[{\\\"id\\\":\\\"$PRIMARY\\\",\\\"priority\\\":-1},{\\\"id\\\":\\\"$SATELLITE\\\",\\\"priority\\\":1,\\\"satellite\\\":1}],\\\"satellite_logs\\\":3,\\\"satellite_redundancy_mode\\\":\\\"one_satellite_single\\\"},{\\\"datacenters\\\":[{\\\"id\\\":\\\"$REMOTE\\\",\\\"priority\\\":1}]}] storage_migration_type=gradual perpetual_storage_wiggle=1 perpetual_storage_wiggle_locality=0"

print_green "Setting usable regions to 1"
execInPodInDC "$REMOTE" "configure  FORCE double ssd-rocksdb-v1 usable_regions=1 logs=3 resolvers=1 log_routers=3 remote_logs=-1 commit_proxies=2 grv_proxies=1 regions=[{\\\"datacenters\\\":[{\\\"id\\\":\\\"$PRIMARY\\\",\\\"priority\\\":-1},{\\\"id\\\":\\\"$SATELLITE\\\",\\\"priority\\\":1,\\\"satellite\\\":1}],\\\"satellite_logs\\\":3,\\\"satellite_redundancy_mode\\\":\\\"one_satellite_single\\\"},{\\\"datacenters\\\":[{\\\"id\\\":\\\"$REMOTE\\\",\\\"priority\\\":1}]}] storage_migration_type=gradual perpetual_storage_wiggle=1 perpetual_storage_wiggle_locality=0"

awaitHealthyDataDistributionInDC "$REMOTE"

print_green "Updating all DCs with the failed-over configuration and re-creating $PRIMARY."
connectionString=$(getConnectionStringForDC "$REMOTE")

installInDC "$REMOTE" "dcs-final-failover.yaml" "$connectionString"
installInDC "$SATELLITE" "dcs-final-failover.yaml" "$connectionString"
installInDC "$PRIMARY" "dcs-final-failover.yaml" "$connectionString"

setSkipOnOperatorForDC "$SATELLITE" "false"
setSkipOnOperatorForDC "$REMOTE" "false"

checkReconciliationForDC "$PRIMARY" 30 # This recon takes longer due to the addition of new DCs
checkReconciliationForDC "$SATELLITE"
checkReconciliationForDC "$REMOTE"

print_green "All datacenters reconciled successfully! Now manually failing back to $PRIMARY..."
installInDC "$REMOTE" "dcs-final.yaml" "$connectionString"
installInDC "$SATELLITE" "dcs-final.yaml" "$connectionString"
installInDC "$PRIMARY" "dcs-final.yaml" "$connectionString"

awaitFailover "$PRIMARY"