We have the setup below for cross-region replication, to verify FDB's capability to handle IDC failures.
{"regions":[{
"datacenters":[{
"id":"idc1",
"priority":1
},{
"id":"idc2",
"priority":1,
"satellite":1,
"satellite_logs":2
},{
"id":"idc3",
"priority":0,
"satellite":1,
"satellite_logs":2
}],
"satellite_redundancy_mode":"one_satellite_double"
},{
"datacenters":[{
"id":"idc3",
"priority":0
}]
}]}
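For reference, we applied the configuration above through fdbcli roughly like this (regions.json is just our local file holding the JSON above; per the documentation, usable_regions=2 activates the second region):

$ fdbcli
fdb> fileconfigure regions.json
fdb> configure usable_regions=2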
We also have read-write clients connected to the cluster.
When we shut down the fdb processes in idc1 with kill -9 processIds, we observed the read-write traffic fail over to idc3, which was expected. We then restarted the processes in idc1 to simulate the recovery of idc1.
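For completeness, the outage and recovery were driven roughly like this on each idc1 host (a sketch; it assumes fdbserver runs under fdbmonitor as the standard foundationdb service, so fdbmonitor has to be stopped too or it will simply respawn the killed fdbserver processes):

# simulate the idc1 outage
sudo systemctl stop foundationdb    # or kill -9 the fdbmonitor + fdbserver pids
# ... later, simulate idc1 recovery
sudo systemctl start foundationdb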
After the restart, we see text like the following in the status details output:
10.218.74.129:5018 ( 75% cpu; 90% machine; 0.059 Gbps; 93% disk IO; 2.1 GB / 8.0 GB RAM ) Storage server lagging by 126 seconds.
It seems idc1 is trying to catch up on the data written to idc3 during the outage.
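Beyond status details, one way we watched the catch-up progress is status json; as we read the output, the relevant fields are cluster.datacenter_lag and the per-storage-role data_lag (field names taken from our cluster's output, so treat them as assumptions):

fdb> status json
# look at cluster.datacenter_lag.seconds, and at the
# roles[].data_lag.seconds of the storage processes in idc1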
But this time, the read-write traffic was still hitting idc3; the automatic switch-back described in the configuration documentation did not happen: https://apple.github.io/foundationdb/configuration.html#asymmetric-configurations
Does anyone have a hint about what is missing here?
BTW: if we configure idc3 with priority -1, the client traffic did switch back to idc1:
{"regions":[{
"datacenters":[{
"id":"idc1",
"priority":1
},{
"id":"idc2",
"priority":1,
"satellite":1,
"satellite_logs":2
},{
"id":"idc3",
"priority":0,
"satellite":1,
"satellite_logs":2
}],
"satellite_redundancy_mode":"one_satellite_double"
},{
"datacenters":[{
"id":"idc3",
"priority":-1
}]
}]}
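The priority change was applied the same way as the original configuration (a sketch, using the same hypothetical regions.json file as above):

$ fdbcli
fdb> fileconfigure regions.json    # regions.json now carries idc3 at priority -1

As we understand the documentation, a datacenter with negative priority can never become the primary, which would force traffic back to idc1 once it is healthy again; with priority 0, idc3 remains a valid primary, so this may be related to why the automatic switch-back did not occur.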