3 DCs, 2 Regions -- Simulating Primary Datacenter Failure

Hello,
My cluster deployment is 3 DCs across 2 regions. The FDB database version is 6.3.23.

region.json

{
  "regions": [{
    "datacenters": [{
      "id": "dc1",
      "priority": 1
    }, {
      "id": "dc2",
      "priority": 0,
      "satellite": 1,
      "satellite_logs": 2
    }],
    "satellite_redundancy_mode": "one_satellite_double"
  }, {
    "datacenters": [{
      "id": "dc3",
      "priority": 0
    }]
  }]
}
DC1 is the primary DC, DC2 is the satellite DC, and DC3 is the DC in the other region.
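
(For reference, a region configuration like this is applied with fdbcli's fileconfigure command, with usable_regions=2 so that DC3 holds a full replica. A minimal sketch, where the file name is just an example:

  # Apply the region layout from the JSON file above, then make both regions usable.
  fdbcli --exec 'fileconfigure regions.json'
  fdbcli --exec 'configure usable_regions=2'
)
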
fdbcli --exec status


I simulated a DC1 failure by stopping all processes deployed in the primary datacenter DC1, and then the status of the cluster changed.


It seems the primary datacenter has changed to DC3, but fault tolerance drops to -1, and replication health stays at "(Re)initializing automatic data distribution" for a long time. However, the cluster still accepts read and write requests normally.
fdbcli --exec 'status json' shows DC3 hasn't fully recovered yet.
And a stateless process deployed in DC3 shows the recovery_state is still accepting_commits.

Question:

  1. Are there any problems with my cluster configuration? The cluster status saying it "hasn't fully recovered yet" after failing over to the remote datacenter is not the behavior I expected. Should the recovery_state of the cluster become fully_recovered? What can I do to handle this problem?
  2. What should I do to recover the cluster after the failure of the primary datacenter, or after the failure of the satellite?
  3. I plan to simulate total data loss in DC1. The next step would be to change regions.json so that DC1's priority is -1 and then configure usable_regions=1 (see the sketch after this list). Is that a safe step in these situations?
    Thanks in advance!
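
For question 3, this is roughly the sequence I have in mind. The file name and contents are only an illustration of the plan, and I have not verified that it is safe, which is why I am asking:

  # regions_dc1_down.json: same layout as before, but DC1 gets a negative
  # priority so it can no longer be chosen as primary (illustrative only).
  #
  # {
  #   "regions": [
  #     { "datacenters": [
  #         { "id": "dc1", "priority": -1 },
  #         { "id": "dc2", "priority": 0, "satellite": 1, "satellite_logs": 2 } ],
  #       "satellite_redundancy_mode": "one_satellite_double" },
  #     { "datacenters": [ { "id": "dc3", "priority": 0 } ] }
  #   ]
  # }

  fdbcli --exec 'fileconfigure regions_dc1_down.json'   # demote the dead DC1
  fdbcli --exec 'configure usable_regions=1'            # then drop to a single region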

I have the same problem. I tried versions 6.3, 7.1, 7.3 and the latest 7.4 and got the same behavior: when I disconnect the main DC, the cluster switches to the second region and hangs in the initialization status.

{
  "regions":[
    {
        "datacenters":[
          {"id":"dc1","priority":2},
          {"id":"dc3","priority":1,"satellite":1}
        ],
        "satellite_redundancy_mode": "one_satellite_double",
        "satellite_logs":2
    },
    {
        "datacenters":[
          {"id":"dc2","priority":1},
          {"id":"dc4","priority":1,"satellite":1}
        ],
        "satellite_redundancy_mode": "one_satellite_double",
        "satellite_logs":2
    }
  ]
}
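
A quick way to confirm the cluster actually picked up this layout is to read the region configuration back from status json; a sketch, assuming the regions array is exposed under cluster.configuration (please verify the path on your version):

  fdbcli --exec 'status json' | jq '.cluster.configuration.regions'
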

fdb> status details

Configuration:
  Redundancy mode        - single
  Storage engine         - ssd-2
  Coordinators           - 3
  Usable Regions         - 2
  Regions:
    Remote -
        Datacenter                    - dc1
        Satellite datacenters         - dc3
        Satellite Redundancy Mode     - one_satellite_double
        Satellite Logs                - 2
    Primary -
        Datacenter                    - dc2
        Satellite datacenters         - dc4
        Satellite Redundancy Mode     - one_satellite_double
        Satellite Logs                - 2

Cluster:
  FoundationDB processes - 6
  Zones                  - 6
  Machines               - 6
  Memory availability    - 8.0 GB per process on machine with least available
  Fault Tolerance        - -1 machines

  Warning: the database may have data loss and availability loss. Please restart following tlog interfaces, otherwise storage servers may never be able to catch up.
  Old log epoch: 27 begin: 70577542348 end: 71344145316, missing log interfaces(id,address): 7b91b08e089a7e4e,

  Server time            - 11/20/24 15:19:03

Data:
  Replication health     - (Re)initializing automatic data distribution
  Moving data            - unknown (initializing)
  Sum of key-value sizes - unknown
  Disk space used        - 420 MB

Operating space:
  Storage server         - 187.0 GB free on most full server
  Log server             - 185.1 GB free on most full server

Workload:
  Read rate              - 162 Hz
  Write rate             - 9 Hz
  Transactions started   - 60 Hz
  Transactions committed - 9 Hz
  Conflict rate          - 0 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Process performance details:
  10.121.35.122:4500      (  0% cpu;  0% machine; 0.000 Gbps;  0% disk IO; 0.1 GB / 8.0 GB RAM  )
  10.121.35.200:4500      (  1% cpu;  1% machine; 0.000 Gbps;  6% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.121.35.210:4500      (  1% cpu;  1% machine; 0.000 Gbps;  6% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.121.35.230:4500      (  0% cpu;  0% machine; 0.000 Gbps;  0% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.121.35.240:4500      (  0% cpu;  0% machine; 0.000 Gbps;  0% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.121.35.245:4500      (  3% cpu;  3% machine; 0.001 Gbps;  6% disk IO; 0.3 GB / 8.0 GB RAM  )

This state means the database is available, i.e., read or write transactions can be performed.

the database may have data loss and availability loss. Please restart following tlog interfaces, otherwise storage servers may never be able to catch up.
Old log epoch: 27 begin: 70577542348 end: 71344145316, missing log interfaces(id,address): 7b91b08e089a7e4e,

This is saying that because of the DC failure, the new primary region can’t fetch the tail of the mutation logs from the failed tlogs. Until the failed tlogs are brought back up, some storage servers can’t catch up (because they have to apply mutation logs in order and are missing some of them).
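
If it helps, the tlog IDs in that warning also show up in status json, so you can list which interfaces the cluster is still waiting on. Roughly (the jq paths follow the status-json layout quoted below; double-check them on your version):

  # List each log epoch with the tlog interfaces status json marks unhealthy.
  fdbcli --exec 'status json' | jq '
    .cluster.logs[]
    | { epoch, possibly_losing_data,
        missing: [ .log_interfaces[] | select(.healthy == false) | .id ] }'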

The fault tolerance of your configuration is 1 DC + 1 satellite tlog. Did you tear down the whole primary region?

In the first region I turned off only the primary dc1; the satellite dc3 kept working. The new primary dc2 in the second region should be able to pick up the logs from the satellite dc3 of the first region, but the cluster state stays in initialization the whole time.

Why can’t the new primary region use the logs from the satellite? Why does it need logs from a failed datacenter?
If I turn off DC1, I see in status json:

"log_interfaces": [
      {
        "healthy": false,
        "id": "edb6db3124477c84"
      },
...
"possibly_losing_data": true,

And the cluster enters the mode

{
  "can_clean_bounce": false,
  "reason": "cluster hasn't fully recovered yet"
}
{
  "description": "(Re)initializing automatic data distribution",
  "name": "initializing"
}
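
For monitoring, the data-distribution state quoted above (and the overall recovery state) can be pulled directly from status json; roughly:

  # Show the recovery state and the data-distribution state.
  fdbcli --exec 'status json' | jq '{
    recovery_state: .cluster.recovery_state.name,
    recovery_description: .cluster.recovery_state.description,
    data_state: .cluster.data.state.name
  }'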

Status fix: “Fixed database status in multi-region mode when primary DC fails” by MarkSh1 · Pull Request #12071 · apple/foundationdb · GitHub

Beyond the status display issue, my main concern is data movement: it seems that while this is happening the data mover is not working, so if we start to see issues in the secondary region (e.g. a pod/node failing), the cluster will not try to rebalance to other storage servers, which seems riskier.

I see that there was a fix, Pull Request #12071 on apple/foundationdb (“Fixed database status in multi-region mode when primary DC fails” by MarkSh1), which fixed the database status. But the code still contains the warning about potential data loss. What is the point of the PR? It seems the original problem was not solved.

I am also seeing the exact same problem in 7.4 – kill the primary DC (which has a satellite), it fails over to the secondary region, then gets stuck:

 Warning: the database may have data loss and availability loss. Please restart following tlog interfaces, otherwise storage servers may never be able to catch up.
  Old log epoch: 12 begin: 827334449 end: 1093552744, missing log interfaces(id,address): 61f46c21d2e4658d, 6c502d17604de6a5, 7e3b34298e52abf4,

  Server time            - 08/15/25 13:25:03

Data:
  Replication health     - (Re)initializing automatic data distribution
  Moving data            - unknown (initializing)

I’ve done some tests; the results are in the table.

It turns out that data loss is only possible in the A3 test.
In summary: if remote_log_fault_tolerance is -1, then we report possible data loss.

I fixed it in the PR.

MarkSh1, thank you for providing an update. What is not clear is what your PR is actually doing. From my understanding, it doesn’t fix the situation: the warning is still there and there is no logic change. So, what is the point of the PR?

Regarding your table of tests: I see a 2-region configuration, but it’s not clear how the DCs are laid out or which tlogs belong to which DC. The original post uses at least 3 DCs: one for the primary, one for the satellite, and one for the secondary. There is also the option of a 4th DC for a secondary satellite, but let’s omit that for now.

So in your tests you still see the warning, but the main question is: can any real data loss actually happen? If yes, under what conditions? If no, then why do we have this warning at all?

If data loss is possible, then what is the point of having a primary and satellites at all? The main motivation for such a deployment is that the satellite replicates synchronously, which should allow the secondary to catch up during a DC failover. If that is not the case, then what is the point of having it?

Hi @jzhou

>>the database may have data loss and availability loss. Please restart following tlog interfaces, otherwise storage servers may never be able to catch up. Old log epoch: 27 begin: 70577542348 end: 71344145316, missing log interfaces(id,address): 7b91b08e089a7e4e,

>This is saying because of the DC failure, the new primary region can’t fetch the tail of mutation logs from the failed tlogs. Until the failed tlog is brought up, some storage servers can’t catch up (because they have to apply mutation logs in order and miss some of the logs).

We were seeing similar messages with a cluster that has a primary with a replication factor of 2 and 4 tlogs, plus a satellite with 2 tlogs in one_satellite_double mode.
What happened is that we shut down the main DC, so FDB no longer had a quorum of tlogs there, but we still had some copies in the satellite.

What if we can’t recover the original DC at all? Isn’t there a risk that the cluster can never catch up because we can’t apply the mutation logs, and if so, doesn’t that defeat the purpose of the satellite?

If the main DC is down, the database will automatically fail over to the remote region, and the remote region can pull mutation logs from the original satellite, so there is no data loss. What I meant by “Until the failed tlog is brought up, some storage servers can’t catch up” refers to the storage servers on the original primary side. After failover, the new primary side won’t have data loss.

After failover, the new primary side won’t have data loss

So do you mean that after switching from primary to secondary there won’t be any data loss? In that case, what is the point of having the warning and of disabling data distribution? Is it a bug?

For this diff, it looks like it fixes the warning only: now it may not be shown in fdbcli status. But the real problem is that FDB reports dataLoss == -1 and availLoss == -1, which has implications for the data mover (at least; maybe there are other consequences). So the complaint here is not just that there is a misleading warning, as discussed above, but that it has consequences for the whole FDB system: the system enters a “degraded” mode where at least data movement is disabled. There may be other consequences that are hidden and not shown in fdbcli.
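
For anyone who wants to check this on their own cluster, the -1 values appear to be the fault-tolerance counters in status json; roughly (the field names are my reading of the 6.3+ schema, so please verify):

  # Report how many zone failures the cluster believes it can survive
  # without losing data / availability; -1 is the "degraded" state above.
  fdbcli --exec 'status json' | jq '{
    without_losing_data: .cluster.fault_tolerance.max_zone_failures_without_losing_data,
    without_losing_availability: .cluster.fault_tolerance.max_zone_failures_without_losing_availability
  }'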

So the biggest issue is that this “degraded” mode is not expected for a failover from primary to secondary. The expectation is that the secondary should continue operating under normal conditions.