3 DCs, 2 regions -- Simulating Primary Datacenter Failure

Hello,
My cluster deployment is 3 DCs across 2 regions. The FDB database version is 6.3.23.

region.json

{
  "regions": [
    {
      "datacenters": [
        {
          "id": "dc1",
          "priority": 1
        },
        {
          "id": "dc2",
          "priority": 0,
          "satellite": 1,
          "satellite_logs": 2
        }
      ],
      "satellite_redundancy_mode": "one_satellite_double"
    },
    {
      "datacenters": [
        {
          "id": "dc3",
          "priority": 0
        }
      ]
    }
  ]
}
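
For reference, a region document like this is applied with fdbcli's fileconfigure command, after which usable_regions=2 enables the second region. A minimal sketch, assuming the JSON above is saved as region.json next to where fdbcli runs:

fdbcli --exec 'fileconfigure region.json'
fdbcli --exec 'configure usable_regions=2'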
DC1 is the primary DC, DC2 is its satellite DC, and DC3 is the DC in the other region.
fdbcli --exec status


I simulated a DC1 failure by stopping all processes deployed in the primary datacenter DC1, and the status of the cluster then changed.


It seems the primary datacenter has changed to DC3, but fault tolerance drops to -1 and replication health stays at "(Re)initializing automatic data distribution" for a long time. However, read and write requests can still be submitted to the cluster normally.
fdbcli --exec 'status json' shows DC3 hasn't fully recovered yet:
[screenshot: failure2]
A stateless process deployed in DC3 reports that recovery_state is still accepting_commits:
[screenshot: failure4]
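
For anyone trying to reproduce this, the two fields I am watching can be pulled straight out of status json. A rough sketch, assuming jq is available and that fdbcli --exec prints just the JSON document:

fdbcli --exec 'status json' | jq '{recovery: .cluster.recovery_state, data: .cluster.data.state}'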

Questions:

  1. Are there any problems with my cluster configuration? The remote datacenter reporting "cluster hasn't fully recovered yet" is not the behavior I expected. Should the recovery_state of the cluster become fully_recovered? What can I do to handle this problem?
  2. What should I do to recover the remote DC after a failure of the primary datacenter, or a failure of the satellite?
  3. I plan to simulate total data loss in DC1. The next step would be to edit regions.json to set DC1's priority to -1 and then configure usable_regions=1 (see the command sketch below). Is that a safe step in these situations?
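
The commands I have in mind for that step look roughly like this (just a sketch, not a tested procedure; regions.json here is the edited file with DC1's priority set to -1):

fdbcli --exec 'fileconfigure regions.json'   # apply the edited region file that demotes dc1
fdbcli --exec 'configure usable_regions=1'   # then drop back to a single usable region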
Thanks in advance!

I have the same problem. I tried versions 6.3, 7.1, 7.3, and the latest 7.4 and got the same behavior. When I disconnect the primary DC, the cluster switches to the second region and hangs in the initializing state.

{
  "regions":[
    {
        "datacenters":[
          {"id":"dc1","priority":2},
          {"id":"dc3","priority":1,"satellite":1}
        ],
        "satellite_redundancy_mode": "one_satellite_double",
        "satellite_logs":2
    },
    {
        "datacenters":[
          {"id":"dc2","priority":1},
          {"id":"dc4","priority":1,"satellite":1}
        ],
        "satellite_redundancy_mode": "one_satellite_double",
        "satellite_logs":2
    }
  ]
}

fdb> status details

Configuration:
  Redundancy mode        - single
  Storage engine         - ssd-2
  Coordinators           - 3
  Usable Regions         - 2
  Regions:
    Remote -
        Datacenter                    - dc1
        Satellite datacenters         - dc3
        Satellite Redundancy Mode     - one_satellite_double
        Satellite Logs                - 2
    Primary -
        Datacenter                    - dc2
        Satellite datacenters         - dc4
        Satellite Redundancy Mode     - one_satellite_double
        Satellite Logs                - 2

Cluster:
  FoundationDB processes - 6
  Zones                  - 6
  Machines               - 6
  Memory availability    - 8.0 GB per process on machine with least available
  Fault Tolerance        - -1 machines

  Warning: the database may have data loss and availability loss. Please restart following tlog interfaces, otherwise storage servers may never be able to catch up.
  Old log epoch: 27 begin: 70577542348 end: 71344145316, missing log interfaces(id,address): 7b91b08e089a7e4e,

  Server time            - 11/20/24 15:19:03

Data:
  Replication health     - (Re)initializing automatic data distribution
  Moving data            - unknown (initializing)
  Sum of key-value sizes - unknown
  Disk space used        - 420 MB

Operating space:
  Storage server         - 187.0 GB free on most full server
  Log server             - 185.1 GB free on most full server

Workload:
  Read rate              - 162 Hz
  Write rate             - 9 Hz
  Transactions started   - 60 Hz
  Transactions committed - 9 Hz
  Conflict rate          - 0 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Process performance details:
  10.121.35.122:4500      (  0% cpu;  0% machine; 0.000 Gbps;  0% disk IO; 0.1 GB / 8.0 GB RAM  )
  10.121.35.200:4500      (  1% cpu;  1% machine; 0.000 Gbps;  6% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.121.35.210:4500      (  1% cpu;  1% machine; 0.000 Gbps;  6% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.121.35.230:4500      (  0% cpu;  0% machine; 0.000 Gbps;  0% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.121.35.240:4500      (  0% cpu;  0% machine; 0.000 Gbps;  0% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.121.35.245:4500      (  3% cpu;  3% machine; 0.001 Gbps;  6% disk IO; 0.3 GB / 8.0 GB RAM  )

This state means the database is available, i.e., read or write transactions can be performed.

the database may have data loss and availability loss. Please restart following tlog interfaces, otherwise storage servers may never be able to catch up.
Old log epoch: 27 begin: 70577542348 end: 71344145316, missing log interfaces(id,address): 7b91b08e089a7e4e,

This is saying that, because of the DC failure, the new primary region can't fetch the tail of the mutation logs from the failed tlogs. Until the failed tlogs are brought back up, some storage servers can't catch up (because they have to apply mutation logs in order and are missing some of the logs).
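
To see exactly which tlog interfaces the new primary is still waiting on, you can look at the cluster.logs section of status json (the same fields quoted in the reply below). A sketch, assuming jq:

fdbcli --exec 'status json' | jq '.cluster.logs[] | {epoch, possibly_losing_data, unhealthy: [.log_interfaces[] | select(.healthy == false) | {id, address}]}'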

The fault tolerance of your configuration is 1 DC + 1 satellite tlog. Did you tear down the whole primary region?

In the first region I turned off only the primary dc1; the satellite dc3 continued to work. The new primary dc2 in the second region should be able to pick up the logs from the satellite dc3 of the first region, but the cluster stays in the initializing state the whole time.

Why can't the new primary region use the logs from the satellite? Why does it need logs from a failed datacenter?
If I turn off DC1, I see this in status json:

"log_interfaces": [
      {
        "healthy": false,
        "id": "edb6db3124477c84"
      },
...
"possibly_losing_data": true,

And the cluster enters this state:

{
  "can_clean_bounce": false,
  "reason": "cluster hasn't fully recovered yet"
}
{
  "description": "(Re)initializing automatic data distribution",
  "name": "initializing"
}