We have three datacenters that are physically separated from each other (denoted by L, S, and R). The round-trip network latency between two datacenters, such as L -> S or R -> S, is about 10 ms.
We set up two FDB deployments (with FDB 6.2.15), each following the "3 DCs - 2 regions" deployment scheme defined in the FoundationDB documentation (link: https://apple.github.io/foundationdb/configuration.html#asymmetric-configurations). In this scheme, DC1 is the primary DC, DC3 is the standby DC, and DC2 holds the transaction logs from DC1.
- Deployment 1: Region 1 contains DC1 and DC2. DC1 is in datacenter L and DC2 is in datacenter S. Region 2 contains DC3, hosted in datacenter R. In this setup, DC1 and DC2 are in different datacenters that are geographically separated.
- Deployment 2: Region 1 contains DC1 in datacenter L and DC2 also in datacenter L, but in a different availability zone. The latency between two availability zones in the same datacenter is around 0.5 ms. Region 2 contains DC3, hosted in datacenter S. In this setup, DC1 and DC2 are located very close to each other within the same datacenter, so the latency between them (about 0.5 ms) is much smaller than the cross-datacenter latency of 10 ms.
With each deployment, we set up two clients, one at DC1 and one at DC3, each executing transactions against the deployed FDB cluster.
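Roughly, the kind of read-only transaction each client issues and times might look like the following minimal sketch, using the standard FoundationDB Python bindings (the key and the timing harness are placeholders for illustration, not our actual benchmark code):

```python
# Illustrative sketch of a timed read-only transaction (not our actual benchmark).
import time
import fdb

fdb.api_version(620)   # matches FDB 6.2.x
db = fdb.open()        # uses the default cluster file

def timed_read(key=b"some-key"):       # placeholder key
    start = time.time()
    tr = db.create_transaction()
    value = tr.get(key).wait()         # the first read also acquires the read version (GRV)
    return value, (time.time() - start) * 1000.0

value, latency_ms = timed_read()
print("read-only transaction latency: %.1f ms" % latency_ms)
```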
The latencies (in milliseconds) reported by the clients at both DCs for read-only transactions were as follows:
Deployment 1:
| Configuration | DC1 50th | DC1 95th | DC1 99th | DC3 50th | DC3 95th | DC3 99th |
|---|---|---|---|---|---|---|
| Configuration 1: primary DC is DC1 (the original configuration) | 19 | 25 | 25 | 4 | 5 | 9 |
| Configuration 2: primary DC is DC3 (after performing a DC switch so that DC3 becomes the primary DC) | 4 | 8 | 10 | 5 | 10 | 10 |
Deployment 2:
| Configuration | DC1 50th | DC1 95th | DC1 99th | DC3 50th | DC3 95th | DC3 99th |
|---|---|---|---|---|---|---|
| Configuration 1: primary DC is DC1 (the original configuration) | 5 | 10 | 10 | 6 | 10 | 10 |
| Configuration 2: primary DC is DC3 (after performing a DC switch so that DC3 becomes the primary DC) | 4 | 7 | 14 | 8 | 10 | 30 |
We observed that the latency experienced by the client at the primary DC in {Configuration 1, Deployment 1} is much higher than the latency measured by the client at the primary DC in {Configuration 1, Deployment 2}: 19/25/25 ms vs. 5/10/10 ms at the 50th, 95th, and 99th percentiles.
Is this latency difference of about 15 ms caused by DC2 not being in the same datacenter as DC1 in Deployment 1? Our understanding is that DC2 only stores transaction logs and therefore should only impact write latency, but what we observed is that it also impacts read latency. Is this normal behavior?
We do have a read-only optimization implemented, as reported at the FoundationDB Summit last year (link: https://static.sched.com/hosted_files/foundationdbsummit2019/52/NuGraph.Built.Upon.JanusGraph.FoundationDB.Version11.pptx). As a result, the latency at the standby DC is always low in all four configurations above, because we cache the Global Read Versions (GRVs) in the FDB client library and use them to initialize transactions, so the latency reported at the standby DC is what we expected. However, we do not cache the transaction versions at the primary DC at this time.
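For illustration, here is a minimal sketch of the GRV-caching idea described above, assuming the standard FoundationDB Python bindings (the cache interval `CACHE_MS` and the helper name `begin_read_only_transaction` are made up for this example and are not our actual client-library change):

```python
# Minimal sketch of GRV caching for read-only transactions (illustration only).
import time
import fdb

fdb.api_version(620)
db = fdb.open()

CACHE_MS = 50                      # hypothetical staleness bound for the cached version
_cached = {"version": None, "ts": 0.0}

def begin_read_only_transaction():
    """Create a transaction, reusing a recently fetched read version when possible."""
    tr = db.create_transaction()
    now = time.time()
    if _cached["version"] is not None and (now - _cached["ts"]) * 1000 < CACHE_MS:
        # Skip the GRV round trip by reusing the cached read version.
        tr.set_read_version(_cached["version"])
    else:
        # Pay the GRV cost once, then cache the result for subsequent transactions.
        _cached["version"] = tr.get_read_version().wait()
        _cached["ts"] = now
    return tr

tr = begin_read_only_transaction()
print(tr.get(b"some-key").wait())  # placeholder key
```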
For further details, this is the configuration when DC1 is the primary DC:
{
  "regions": [
    {
      "datacenters": [
        {
          "id": "dc1",
          "priority": 2
        },
        {
          "id": "dc2",
          "priority": 0,
          "satellite": 1
        }
      ],
      "satellite_redundancy_mode": "one_satellite_double",
      "satellite_logs": 10
    },
    {
      "datacenters": [
        {
          "id": "dc3",
          "priority": 1
        }
      ]
    }
  ]
}
and when DC3 is the primary DC:
{
  "regions": [
    {
      "datacenters": [
        {
          "id": "dc1",
          "priority": 1
        },
        {
          "id": "dc2",
          "priority": 0,
          "satellite": 1
        }
      ],
      "satellite_redundancy_mode": "one_satellite_double",
      "satellite_logs": 10
    },
    {
      "datacenters": [
        {
          "id": "dc3",
          "priority": 2
        }
      ]
    }
  ]
}
In summary, the questions we would like to ask are:
- Does DC2, which hosts the transaction log servers, play a role on the read path?
- Are there any configurations in FDB 6.2 that we need to be aware of that could lead to the high latency we observed?
- In general, what is a good way to troubleshoot latency issues in FDB and determine where a transaction spends most of its time? It would be great if you could point us to some reference documentation.