3 DCs, 2 Regions -- Simulating Primary Datacenter Failure

Hello,
My cluster deployment is 3 DCs across 2 regions. The FDB database version is 6.3.23.

region.json

{
  "regions": [{
    "datacenters": [{
      "id": "dc1",
      "priority": 1
    }, {
      "id": "dc2",
      "priority": 0,
      "satellite": 1,
      "satellite_logs": 2
    }],
    "satellite_redundancy_mode": "one_satellite_double"
  }, {
    "datacenters": [{
      "id": "dc3",
      "priority": 0
    }]
  }]
}
DC1 is the primary DC, DC2 is the satellite DC, and DC3 is the DC in the other region.
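
(For reference, a region configuration like this is applied with fdbcli's fileconfigure command, with usable_regions=2 so that DC3 holds a full replica. A minimal sketch, where the file name is just an example:

  # Apply the region layout from the JSON file above, then make both regions usable.
  fdbcli --exec 'fileconfigure regions.json'
  fdbcli --exec 'configure usable_regions=2'
)
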
fdbcli --exec status


I simulated a DC1 failure by stopping all processes deployed in the primary datacenter DC1, and then the status of the cluster changed.


It seems the primary datacenter has changed to DC3, but fault tolerance drops to -1, and replication health stays at "(Re)initializing automatic data distribution" for a long time. However, the cluster still accepts read and write requests normally.
fdbcli --exec 'status json' shows DC3 hasn't fully recovered yet.
And a stateless process deployed in DC3 shows the recovery_state is still accepting_commits.

Question:

  1. Are there any problems with my cluster configuration? The cluster status saying it "hasn't fully recovered yet" after failing over to the remote datacenter is not the behavior I expected. Should the recovery_state of the cluster become fully_recovered? What can I do to handle this problem?
  2. What should I do to recover the cluster after the failure of the primary datacenter, or after the failure of the satellite?
  3. I plan to simulate total data loss in DC1. The next step would be to change regions.json so that DC1's priority is -1 and then configure usable_regions=1 (see the sketch after this list). Is that a safe step in these situations?
    Thanks in advance!
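
For question 3, this is roughly the sequence I have in mind. The file name and contents are only an illustration of the plan, and I have not verified that it is safe, which is why I am asking:

  # regions_dc1_down.json: same layout as before, but DC1 gets a negative
  # priority so it can no longer be chosen as primary (illustrative only).
  #
  # {
  #   "regions": [
  #     { "datacenters": [
  #         { "id": "dc1", "priority": -1 },
  #         { "id": "dc2", "priority": 0, "satellite": 1, "satellite_logs": 2 } ],
  #       "satellite_redundancy_mode": "one_satellite_double" },
  #     { "datacenters": [ { "id": "dc3", "priority": 0 } ] }
  #   ]
  # }

  fdbcli --exec 'fileconfigure regions_dc1_down.json'   # demote the dead DC1
  fdbcli --exec 'configure usable_regions=1'            # then drop to a single region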

I have the same problem. I tried versions 6.3, 7.1, 7.3 and the latest 7.4 and got the same behavior: when I disconnect the main DC, the cluster switches to the second region and hangs in the initialization status.

{
  "regions":[
    {
        "datacenters":[
          {"id":"dc1","priority":2},
          {"id":"dc3","priority":1,"satellite":1}
        ],
        "satellite_redundancy_mode": "one_satellite_double",
        "satellite_logs":2
    },
    {
        "datacenters":[
          {"id":"dc2","priority":1},
          {"id":"dc4","priority":1,"satellite":1}
        ],
        "satellite_redundancy_mode": "one_satellite_double",
        "satellite_logs":2
    }
  ]
}
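
A quick way to confirm the cluster actually picked up this layout is to read the region configuration back from status json; a sketch, assuming the regions array is exposed under cluster.configuration (please verify the path on your version):

  fdbcli --exec 'status json' | jq '.cluster.configuration.regions'
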

fdb> status details

Configuration:
  Redundancy mode        - single
  Storage engine         - ssd-2
  Coordinators           - 3
  Usable Regions         - 2
  Regions:
    Remote -
        Datacenter                    - dc1
        Satellite datacenters         - dc3
        Satellite Redundancy Mode     - one_satellite_double
        Satellite Logs                - 2
    Primary -
        Datacenter                    - dc2
        Satellite datacenters         - dc4
        Satellite Redundancy Mode     - one_satellite_double
        Satellite Logs                - 2

Cluster:
  FoundationDB processes - 6
  Zones                  - 6
  Machines               - 6
  Memory availability    - 8.0 GB per process on machine with least available
  Fault Tolerance        - -1 machines

  Warning: the database may have data loss and availability loss. Please restart following tlog interfaces, otherwise storage servers may never be able to catch up.
  Old log epoch: 27 begin: 70577542348 end: 71344145316, missing log interfaces(id,address): 7b91b08e089a7e4e,

  Server time            - 11/20/24 15:19:03

Data:
  Replication health     - (Re)initializing automatic data distribution
  Moving data            - unknown (initializing)
  Sum of key-value sizes - unknown
  Disk space used        - 420 MB

Operating space:
  Storage server         - 187.0 GB free on most full server
  Log server             - 185.1 GB free on most full server

Workload:
  Read rate              - 162 Hz
  Write rate             - 9 Hz
  Transactions started   - 60 Hz
  Transactions committed - 9 Hz
  Conflict rate          - 0 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Process performance details:
  10.121.35.122:4500      (  0% cpu;  0% machine; 0.000 Gbps;  0% disk IO; 0.1 GB / 8.0 GB RAM  )
  10.121.35.200:4500      (  1% cpu;  1% machine; 0.000 Gbps;  6% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.121.35.210:4500      (  1% cpu;  1% machine; 0.000 Gbps;  6% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.121.35.230:4500      (  0% cpu;  0% machine; 0.000 Gbps;  0% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.121.35.240:4500      (  0% cpu;  0% machine; 0.000 Gbps;  0% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.121.35.245:4500      (  3% cpu;  3% machine; 0.001 Gbps;  6% disk IO; 0.3 GB / 8.0 GB RAM  )

This state means the database is available, i.e., read or write transactions can be performed.

the database may have data loss and availability loss. Please restart following tlog interfaces, otherwise storage servers may never be able to catch up.
Old log epoch: 27 begin: 70577542348 end: 71344145316, missing log interfaces(id,address): 7b91b08e089a7e4e,

This is saying that because of the DC failure, the new primary region can’t fetch the tail of the mutation logs from the failed tlogs. Until the failed tlogs are brought back up, some storage servers can’t catch up (because they have to apply mutation logs in order and are missing some of them).
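
If it helps, the tlog IDs in that warning also show up in status json, so you can list which interfaces the cluster is still waiting on. Roughly (the jq paths follow the status-json layout quoted below; double-check them on your version):

  # List each log epoch with the tlog interfaces status json marks unhealthy.
  fdbcli --exec 'status json' | jq '
    .cluster.logs[]
    | { epoch, possibly_losing_data,
        missing: [ .log_interfaces[] | select(.healthy == false) | .id ] }'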

The fault tolerance of your configuration is 1 DC + 1 satellite tlog. Did you tear down the whole primary region?

In the first region I turned off only the primary dc1; the satellite dc3 kept working. The new primary dc2 in the second region should be able to pick up the logs from the satellite dc3 of the first region, but the cluster state stays in initialization the whole time.

Why can’t the new primary region use the logs from the satellite? Why does it need logs from a failed datacenter?
If I turn off DC1, I see in status json:

"log_interfaces": [
      {
        "healthy": false,
        "id": "edb6db3124477c84"
      },
...
"possibly_losing_data": true,

And the cluster enters the mode

{
  "can_clean_bounce": false,
  "reason": "cluster hasn't fully recovered yet"
}
{
  "description": "(Re)initializing automatic data distribution",
  "name": "initializing"
}
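
For monitoring, the data-distribution state quoted above (and the overall recovery state) can be pulled directly from status json; roughly:

  # Show the recovery state and the data-distribution state.
  fdbcli --exec 'status json' | jq '{
    recovery_state: .cluster.recovery_state.name,
    recovery_description: .cluster.recovery_state.description,
    data_state: .cluster.data.state.name
  }'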

Status fix: “Fixed database status in multi-region mode when primary DC fails” by MarkSh1 · Pull Request #12071 · apple/foundationdb · GitHub

Beyond the status display issue, my main concern is data movement: it seems that while this is happening the data mover is not working, so if we start to see issues in the secondary region (e.g. a pod/node failing), the cluster will not try to rebalance to other storage servers, which seems riskier.

I see that there was a fix, Pull Request #12071 on apple/foundationdb (“Fixed database status in multi-region mode when primary DC fails” by MarkSh1), which fixed the database status. But the code still contains the warning about potential data loss. What is the point of the PR? It seems the original problem was not solved.

I am also seeing the exact same problem in 7.4 – kill the primary DC (which has a satellite), it fails over to the secondary region, then gets stuck:

 Warning: the database may have data loss and availability loss. Please restart following tlog interfaces, otherwise storage servers may never be able to catch up.
  Old log epoch: 12 begin: 827334449 end: 1093552744, missing log interfaces(id,address): 61f46c21d2e4658d, 6c502d17604de6a5, 7e3b34298e52abf4,

  Server time            - 08/15/25 13:25:03

Data:
  Replication health     - (Re)initializing automatic data distribution
  Moving data            - unknown (initializing)

I’ve done some tests; the results are in the table.

It turns out that data loss is only possible in the A3 test.
In summary: if remote_log_fault_tolerance is -1, then we report possible data loss.

I fixed it in the PR.

MarkSh1, thank you for providing an update. What is not clear is what your PR is actually doing. From my understanding, it doesn’t fix the situation: the warning is still there and there is no logic change. So, what is the point of the PR?

Regarding your table of tests: I see a 2-region configuration, but it’s not clear how the DCs are laid out or which tlogs belong to which DC. The original post uses at least 3 DCs: one for the primary, one for the satellite, and one for the secondary. There is also the option of a 4th DC for a secondary satellite, but let’s omit that for now.

So in your tests you still see the warning, but the main question is: can any real data loss actually happen? If yes, under what conditions? If no, then why do we have this warning at all?

If data loss is possible, then what is the point of having a primary and satellites at all? The main motivation for such a deployment is that the satellite replicates synchronously, which should allow the secondary to catch up during a DC failover. If that is not the case, then what is the point of having it?

Hi @jzhou

>>the database may have data loss and availability loss. Please restart following tlog interfaces, otherwise storage servers may never be able to catch up. Old log epoch: 27 begin: 70577542348 end: 71344145316, missing log interfaces(id,address): 7b91b08e089a7e4e,

>This is saying because of the DC failure, the new primary region can’t fetch the tail of mutation logs from the failed tlogs. Until the failed tlog is brought up, some storage servers can’t catch up (because they have to apply mutation logs in order and miss some of the logs).

We were seeing similar messages with a cluster that has a primary with a replication factor of 2 and 4 tlogs, plus a satellite with 2 tlogs in one_satellite_double mode.
What happened is that we shut down the main DC, so FDB no longer had a quorum of tlogs there, but we still had some copies in the satellite.

What if we can’t recover the original DC at all? Isn’t there a risk that the cluster can never catch up because we can’t apply the mutation logs, and if so, doesn’t that defeat the purpose of the satellite?

If the main DC is down, the database will automatically fail over to the remote region, and the remote region can pull mutation logs from the original satellite, so there is no data loss. What I meant by “Until the failed tlog is brought up, some storage servers can’t catch up” refers to the storage servers on the original primary side. After failover, the new primary side won’t have data loss.

After failover, the new primary side won’t have data loss

So do you mean that after switching from primary to secondary there won’t be any data loss? In that case, what is the point of having the warning and of disabling data distribution? Is it a bug?

For this diff, it looks like it fixes the warning only: now it may not be shown in fdbcli status. But the real problem is that FDB reports dataLoss == -1 and availLoss == -1, which has implications for the data mover (at least; maybe there are other consequences). So the complaint here is not just that there is a misleading warning, as discussed above, but that it has consequences for the whole FDB system: the system enters a “degraded” mode where at least data movement is disabled. There may be other consequences that are hidden and not shown in fdbcli.
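
For anyone who wants to check this on their own cluster, the -1 values appear to be the fault-tolerance counters in status json; roughly (the field names are my reading of the 6.3+ schema, so please verify):

  # Report how many zone failures the cluster believes it can survive
  # without losing data / availability; -1 is the "degraded" state above.
  fdbcli --exec 'status json' | jq '{
    without_losing_data: .cluster.fault_tolerance.max_zone_failures_without_losing_data,
    without_losing_availability: .cluster.fault_tolerance.max_zone_failures_without_losing_availability
  }'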

So the biggest issue is that this “degraded” mode is not expected for a failover from primary to secondary. The expectation is that the secondary should continue operating under normal conditions.