Hello,
My cluster deployment has 3 DCs across 2 regions. The FDB database version is 6.3.23.
region.json
{
  "regions": [{
    "datacenters": [{
      "id": "dc1",
      "priority": 1
    }, {
      "id": "dc2",
      "priority": 0,
      "satellite": 1,
      "satellite_logs": 2
    }],
    "satellite_redundancy_mode": "one_satellite_double"
  }, {
    "datacenters": [{
      "id": "dc3",
      "priority": 0
    }]
  }]
}
DC1 is the primary DC, DC2 is the satellite DC, and DC3 is the one in a different region. I checked with `fdbcli --exec status`.
It seems the primary datacenter has changed to DC3, but the fault tolerance has dropped to -1, and replication health stays at "initializing automatic data distribution" for a long time. However, the cluster can still accept read and write requests normally. `fdbcli --exec 'status json'` shows DC3 hasn't fully recovered yet, and a stateless process deployed in DC3 shows the recovery_state is still accepting_commits.
Question:
Are there any problems with my cluster configuration? The remote datacenter staying in the "hasn't fully recovered yet" state is not the behavior I expect. Should the recovery_state of the cluster become fully_recovered? What can I do to handle this problem?
What should I do to recover the remote DC after a failure of the primary datacenter, or a failure of the satellite?
I plan to simulate total data loss in DC1. The next step is to change regions.json to set the priority of DC1 to -1 and to configure usable_regions=1. Is this a safe step in these situations?
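The plan described above might look like the following `fdbcli` sequence. This is only a sketch, not a verified procedure: the file name `regions.json` is the poster's, the ordering reflects my understanding of the upstream region-configuration steps, and it assumes enough coordinators outside DC1 are still reachable.

```shell
# Hypothetical failover sketch after losing DC1 (not a verified procedure).
# 1. Edit regions.json: set dc1's "priority" to -1 so it can never be chosen
#    as the primary again.
# 2. Load the updated region configuration into the cluster:
fdbcli --exec 'fileconfigure regions.json'
# 3. Drop to a single usable region so the cluster stops waiting for the
#    dead region's data:
fdbcli --exec 'configure usable_regions=1'
```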
Thanks in advance!
I have the same problem. I tried versions 6.3, 7.1, 7.3, and the latest version 7.4, and got the same behavior. When I disconnect the primary DC, the cluster switches to the second region and hangs in the initializing status.
This state means the database is available, i.e., read or write transactions can be performed.
the database may have data loss and availability loss. Please restart following tlog interfaces, otherwise storage servers may never be able to catch up.
Old log epoch: 27 begin: 70577542348 end: 71344145316, missing log interfaces(id,address): 7b91b08e089a7e4e,
This is saying that, because of the DC failure, the new primary region can't fetch the tail of the mutation logs from the failed tlogs. Until the failed tlogs are brought back up, some storage servers can't catch up (because they have to apply mutation logs in order and are missing some of the logs).
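To watch whether the cluster gets past this point, the machine-readable status can be polled. Below is a minimal sketch that parses a saved `status json` dump; the sample document here is made up so the snippet runs stand-alone, but the field path `cluster.recovery_state.name` is the one `status json` actually reports.

```shell
# On a live cluster you would save the machine-readable status first:
#   fdbcli --exec 'status json' > status.json
# Made-up stand-in sample so this snippet is self-contained:
cat > status.json <<'EOF'
{"cluster": {"recovery_state": {"name": "accepting_commits",
             "description": "Accepting commits."}}}
EOF
# Extract the recovery state name; on a healthy cluster this should
# eventually read fully_recovered rather than accepting_commits.
python3 -c "import json; print(json.load(open('status.json'))['cluster']['recovery_state']['name'])"
```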
The fault tolerance of your configuration is 1 DC + 1 satellite tlog. Did you tear down the whole primary region?
In the first region I turned off only the primary dc1; the satellite dc3 continued to work. The main dc2 in the second region should pick up the logs from the satellite dc3 of the first region, but the cluster state remains in initializing the whole time.