Hello,
My cluster deployment has 3 DCs across 2 regions. The FDB version is 6.3.23.
region.json
{
  "regions": [{
    "datacenters": [{
      "id": "dc1",
      "priority": 1
    }, {
      "id": "dc2",
      "priority": 0,
      "satellite": 1,
      "satellite_logs": 2
    }],
    "satellite_redundancy_mode": "one_satellite_double"
  }, {
    "datacenters": [{
      "id": "dc3",
      "priority": 0
    }]
  }]
}

DC1 is the primary DC, DC2 is the satellite DC, and DC3 is the DC in the other region.
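(For reference, a regions file like this is normally loaded through fdbcli's fileconfigure command, with two regions enabled afterwards. The steps and file name below are only a sketch:)

fdbcli --exec 'fileconfigure regions.json'
fdbcli --exec 'configure usable_regions=2'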
Running fdbcli --exec status, it seems the primary datacenter has changed to DC3, but fault tolerance has dropped to -1 and replication health stays at "(Re)initializing automatic data distribution" for a long time. However, read and write requests can still be committed by the cluster normally. fdbcli --exec 'status json' shows DC3 hasn't fully recovered yet.
And a stateless process deployed in DC3 shows the recovery_state is still accepting_commits.
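(I am checking this with roughly the following; the field paths come from my reading of the status json schema, so they may need adjusting for your version:)

fdbcli --exec 'status json' | jq '.cluster.recovery_state.name'
fdbcli --exec 'status json' | jq '.cluster.data.state.name'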
Question:
Are there any problems with my cluster configuration? The remote datacenter staying in a not-fully-recovered state is not the behavior I expect. Should the recovery_state of the cluster become fully_recovered? What can I do to handle this problem?
What should I do to recover the remote DC after a failure of the primary datacenter, or after a failure of the satellite?
I plan to simulate the loss of all data in DC1. The next step would be to change regions.json to set the priority of DC1 to -1 and then configure usable_regions=1. Is that a safe step in this situation?
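(My understanding of the documented way to drop a dead region is roughly the following; this is only a sketch based on my reading of the docs:)

# 1. edit regions.json so the surviving region has the highest priority (e.g. dc1 priority -1)
fdbcli --exec 'fileconfigure regions.json'
# 2. then drop the replica in the dead region
fdbcli --exec 'configure usable_regions=1'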
Thanks in advance!
I have the same problem. I tried versions 6.3, 7.1, 7.3, and the latest 7.4 and got the same behavior. When I disconnect the main DC, the cluster switches to the second region and hangs in the initializing status.
This state means the database is available, i.e., read or write transactions can be performed.
the database may have data loss and availability loss. Please restart following tlog interfaces, otherwise storage servers may never be able to catch up.
Old log epoch: 27 begin: 70577542348 end: 71344145316, missing log interfaces(id,address): 7b91b08e089a7e4e,
This is saying that, because of the DC failure, the new primary region can't fetch the tail of mutation logs from the failed tlogs. Until the failed tlogs are brought back up, some storage servers can't catch up (because they have to apply mutation logs in order and are missing some of the logs).
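(If it helps, the same information as in that warning should be visible in status json under cluster.logs; the field names here are from the status schema as I read it, so treat this as a sketch:)

fdbcli --exec 'status json' | jq '.cluster.logs[] | select(.current == false) | {epoch, begin_version, end_version, log_interfaces}'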
The fault tolerance of your configuration is 1 DC + 1 satellite tlog. Did you tear down the whole primary region?
In the first region I turned off only the primary dc1; the satellite dc3 continued to work. The main dc2 in the second region should pick up the logs from the satellite dc3 in the first region, but the cluster state remains in initialization the whole time.
Beyond the status display issue, my main concern is with data movement: it seems that while this is happening the data mover is not working, so if we start to see issues in the secondary region (e.g. a pod/node failing), the cluster will not try to rebalance to other storage servers, which seems riskier.
I am also seeing the exact same problem in 7.4 – kill the primary DC (which has a satellite), it fails over to the secondary region, then it gets stuck:
Warning: the database may have data loss and availability loss. Please restart following tlog interfaces, otherwise storage servers may never be able to catch up.
Old log epoch: 12 begin: 827334449 end: 1093552744, missing log interfaces(id,address): 61f46c21d2e4658d, 6c502d17604de6a5, 7e3b34298e52abf4,
Server time - 08/15/25 13:25:03
Data:
Replication health - (Re)initializing automatic data distribution
Moving data - unknown (initializing)
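(The same fields can be polled from status json to see whether data distribution ever makes progress; the paths below are my best reading of the status schema, so they may differ slightly on your version:)

fdbcli --exec 'status json' | jq '.cluster.data | {state: .state.name, moving_data: .moving_data}'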
MarkSh1, thank you for providing an update. What is not clear is what your PR is actually doing. From my understanding it doesn't fix the situation: the warning is still there and there is no logic change. So what is the point of the PR?
Regarding your table of tests: I see a 2-region configuration, but it's not clear how many DCs there are and how the tlogs are associated with the different DCs. The original post involves at least 3 DCs: one for the primary, one for the satellite, and one for the secondary. There is also the option of a 4th DC for a secondary satellite, but let's omit that for now.
So in your tests you still see some warning, but the main question is: can any real data loss actually happen? If yes, under what conditions? If no, then why do we have this warning at all?
If data loss is possible, then what's the point of having a primary and satellites at all? The main motivation for such a deployment is to use a satellite with synchronous replication, which allows the secondary to catch up during a DC failover. If that's not the case, then what's the point of having it?
>>the database may have data loss and availability loss. Please restart following tlog interfaces, otherwise storage servers may never be able to catch up. Old log epoch: 27 begin: 70577542348 end: 71344145316, missing log interfaces(id,address): 7b91b08e089a7e4e,
>This is saying that, because of the DC failure, the new primary region can't fetch the tail of mutation logs from the failed tlogs. Until the failed tlogs are brought back up, some storage servers can't catch up (because they have to apply mutation logs in order and are missing some of the logs).
We were seeing similar messages with a cluster that has a primary with a replication factor of 2 and 4 tlogs, plus a satellite with 2 tlogs in one_satellite_double mode.
What happened is that we shut down the main DC, so FDB no longer has a quorum of tlogs, but we still have some copies in the satellite.
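(For context, the non-region part of that setup corresponds to roughly the following configure command; the satellite settings, i.e. satellite_logs=2 and one_satellite_double, come from a regions JSON like the one earlier in this thread. This is a sketch, not our exact commands:)

fdbcli --exec 'configure double logs=4'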
What if we can't recover the original DC at all? Isn't there a risk that the cluster can never catch up because it can't apply the mutation logs, and if so, doesn't that defeat the purpose of the satellite?
If the main DC is down, the database will automatically fail over to the remote region, and the remote region can pull mutation logs from the original satellite, so there is no data loss. What I meant by "Until the failed tlog is brought up, some storage servers can't catch up" refers to the storage servers on the original primary side. After failover, the new primary side won't have data loss.
After failover, the new primary side won’t have data loss
So do you mean that after switching from primary to secondary there won’t be any data loss? In such a case, what is the point of having the warning and disabling the data distribution? Is it a bug?
For this diff, it looks like it only fixes the warning: it may no longer be shown in fdbcli status. But the real problem is that FDB reports dataLoss == -1 and availLoss == -1, which has implications for the data mover (at least; maybe there are other consequences). So the complaint here is not just about a warning that is misleading according to this discussion, but that it has consequences for the entire FDB system: the system enters a "degraded" mode in which at least data movement is disabled. Maybe there are other consequences that are hidden and not shown in fdbcli.
So the biggest issue is that this "degraded" mode is not expected for a failover from primary to secondary. The expectation is that the secondary should continue operating under normal conditions.
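(For concreteness, the dataLoss/availLoss values being discussed are, as far as I can tell, the ones surfaced in status json under cluster.fault_tolerance, e.g.:)

fdbcli --exec 'status json' | jq '.cluster.fault_tolerance'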