High latency of read-only transactions at the primary DC?

We have three data centers that are physically separated from each other (denoted L, S, and R). The round-trip network latency between these datacenters, such as L -> S or R -> S, is about 10 ms.

We set up two FDB deployments (with FDB 6.2.15), each following the "3 DCs, 2 regions" deployment scheme defined in the FoundationDB configuration documentation (link: https://apple.github.io/foundationdb/configuration.html#asymmetric-configurations). In this scheme, DC1 is the primary DC, DC3 is the standby DC, and DC2 holds satellite copies of the transaction logs from DC1.

  • Deployment 1: Region 1 contains DC1 and DC2, with DC1 in datacenter L and DC2 in datacenter S. Region 2 contains DC3, hosted in datacenter R. In this setup, DC1 and DC2 are in different, geographically separated datacenters.
  • Deployment 2: Region 1 contains DC1 and DC2, both in datacenter L but in different availability zones; the latency between two availability zones in the same datacenter is around 0.5 ms. Region 2 contains DC3, hosted in datacenter S. In this setup, DC1 and DC2 are located very close together within the same datacenter, so the latency between them (about 0.5 ms) is much smaller than the cross-datacenter latency of 10 ms.

With each deployment, we set up two clients at DC1 and DC3, each client executing transactions to the deployed FDB cluster.

The clients (at both DCs) reported the following read-only transaction latencies (in milliseconds):

Deployment 1:

  • Configuration 1 (primary DC is DC1, the original configuration): DC1 client 19 / 25 / 25; DC3 client 4 / 5 / 9 (50th / 95th / 99th percentiles, ms)
  • Configuration 2 (primary DC is DC3, after DC switching to make DC3 the primary): DC1 client 4 / 8 / 10; DC3 client 5 / 10 / 10

Deployment 2:

  • Configuration 1 (primary DC is DC1, the original configuration): DC1 client 5 / 10 / 10; DC3 client 6 / 10 / 10 (50th / 95th / 99th percentiles, ms)
  • Configuration 2 (primary DC is DC3, after DC switching to make DC3 the primary): DC1 client 4 / 7 / 14; DC3 client 8 / 10 / 30

We observed that the latency experienced by the client located at the primary DC in {Configuration 1, Deployment 1} is much higher than the latency measured by the client at the primary DC in {Configuration 1, Deployment 2}: 19 / 25 / 25 vs. 5 / 10 / 10 for the 50th, 95th, and 99th percentiles.

Is this latency difference of roughly 14-15 ms caused by DC2 not being in the same datacenter as DC1 in Deployment 1? Our understanding was that DC2 only stores transaction logs and thus should only impact write latency, but what we observed is that it also impacts read latency. Is this normal behavior?

We do have a read-only optimization implemented, as we reported at the FoundationDB Summit last year (link: https://static.sched.com/hosted_files/foundationdbsummit2019/52/NuGraph.Built.Upon.JanusGraph.FoundationDB.Version11.pptx). As a result, at the standby DC the latency in our experiments is always low in all four configurations above, because we cache the Global Read Versions (GRVs) in the FDB client library and use them to initialize transactions (so the latency reported at the standby DC is as we expected). However, we do not cache these versions at the primary DC at this time.
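
For reference, the idea behind our GRV caching, shown as a minimal sketch against the Python bindings (this is not our actual NuGraph/JanusGraph code; the refresh interval, staleness budget, and key are illustrative):

```python
import time

import fdb

fdb.api_version(620)
db = fdb.open()

# Illustrative staleness budget; keep it well under FDB's ~5 second MVCC
# window, or reads at the cached version will fail with transaction_too_old.
MAX_STALENESS_SECONDS = 2.0
_cached = {"version": None, "fetched_at": 0.0}

def refresh_cached_read_version():
    # Pay the GetReadVersion round trip once; later transactions reuse it.
    tr = db.create_transaction()
    _cached["version"] = tr.get_read_version().wait()
    _cached["fetched_at"] = time.time()

def cached_read(key):
    if (_cached["version"] is None
            or time.time() - _cached["fetched_at"] > MAX_STALENESS_SECONDS):
        refresh_cached_read_version()
    tr = db.create_transaction()
    # Skip GetReadVersion for this transaction entirely: the read may be
    # slightly stale, but it avoids the GetReadVersion round trip.
    tr.set_read_version(_cached["version"])
    return tr[key]  # the Python binding resolves this future lazily

value = cached_read(b"some-key")
```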

For further details, this is the configuration when DC1 is the primary DC:

{
  "regions": [
    {
      "datacenters": [
        {
          "id": "dc1",
          "priority": 2
        },
        {
          "id": "dc2",
          "priority": 0,
          "satellite": 1
        }
      ],
      "satellite_redundancy_mode": "one_satellite_double",
      "satellite_logs": 10
    },
    {
      "datacenters": [
        {
          "id": "dc3",
          "priority": 1
        }
      ]
    }
  ]
}

and when DC3 is the primary DC:

{
  "regions": [
    {
      "datacenters": [
        {
          "id": "dc1",
          "priority": 1
        },
        {
          "id": "dc2",
          "priority": 0,
          "satellite": 1
        }
      ],
      "satellite_redundancy_mode": "one_satellite_double",
      "satellite_logs": 10
    },
    {
      "datacenters": [
        {
          "id": "dc3",
          "priority": 2
        }
      ]
    }
  ]
}

In summary, the questions we would like to ask:

  1. Does DC2, which hosts the transaction log servers, play a role in the read path?
  2. Are there any configurations in FDB 6.2 that we should be aware of that could lead to the high latency we observed?
  3. In general, what is a good way to troubleshoot latency issues in FDB, i.e., to find out where a transaction spends most of its time? It would be great if you could point us to some reference documentation.

Yes. GetReadVersion must verify that satellite transaction logs have not been locked, so they’re still on the path for starting a (read-only) transaction. (This is to guard against the database having already recovered into the secondary, which would have required locking transaction logs in the satellite.)

The further you put your satellite logs from your primary in a multi-region config, the higher the latency of a GetReadVersion call will be. Depending on how much you trust your clocks or how much (rare) stale data would affect your application, you could evaluate setting CAUSAL_READ_RISKY to drop the “have my transaction logs been locked” check.
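
For example, with the Python bindings this is just a per-transaction option (a minimal sketch; the api_version/open boilerplate and key are illustrative):

```python
import fdb

fdb.api_version(620)
db = fdb.open()

tr = db.create_transaction()
# Drop the "have my transaction logs been locked" confirmation from
# GetReadVersion. Reads still see a committed version, but after a failover
# combined with a misbehaving clock they could briefly be stale.
tr.options.set_causal_read_risky()
value = tr[b"some-key"]  # read-only work as usual
```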

There are debug transactions, which cause CommitDebug trace events to be logged in the various server components when that transaction is being handled. I wrote a hacky chrome://tracing based visualization for those logs as well. There’s a transaction every 10s that already has this set, so for most casual analysis, you probably don’t need to change your client code.
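
If you want to tag specific transactions of your own rather than rely on the sampled ones, the client bindings expose debug options for this; a sketch with the Python bindings (as I understand it, this feeds the same tracing mechanism; the trace directory and identifier are illustrative, and client trace logging must be enabled for the client-side events to land anywhere):

```python
import fdb

fdb.api_version(620)
# Enable client trace logging before the network starts (fdb.open() starts it).
fdb.options.set_trace_enable("/var/log/fdb-client-traces")  # illustrative path
db = fdb.open()

tr = db.create_transaction()
# Tag the transaction so its trace events can be grepped for by identifier.
tr.options.set_debug_transaction_identifier("primary-dc-read-latency")
tr.options.set_log_transaction()
value = tr[b"some-key"]
tr.commit().wait()
```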


Thank you very much @alexmiller for the clarification.
We should then cache the global read version at DC1 as well to reduce read latency, favoring performance over consistency.

I will also try out the tool to understand more about FDB transactions. Thanks for the tool!

If you’re already okay with stale data even in the primary, I’d strongly suggest trying CAUSAL_READ_RISKY first. You can even default it to on for all transactions on the server via --always_causal_read_risky=1.
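
If you would rather keep it client-side while you evaluate, a small wrapper that sets the option on every transaction works too; a sketch with the Python bindings (the function name and key are made up for illustration):

```python
import fdb

fdb.api_version(620)
db = fdb.open()

@fdb.transactional
def risky_read(tr, key):
    # Re-applied automatically on each retry, since the function body re-runs.
    tr.options.set_causal_read_risky()
    value = tr[key]
    if not value.present():
        return None
    return value  # already resolved inside the retry loop

value = risky_read(db, b"some-key")
```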

Reviving an older thread: after reading through Documentation: How FDB read and write path works in FDB 6.2 by xumengpanda · Pull Request #4099 · apple/foundationdb · GitHub, which says:

Step 3 (Confirm the proxy’s liveness): To prevent proxies that are no longer part of the system (such as due to a network partition) from serving requests, each proxy contacts the queuing system for each timestamp request to confirm it is still a valid proxy. This is based on the FDB property that at most one active queuing system is available at any given time.

  • Why do we need this step? This is to achieve consensus (i.e., external consistency). Compared to serializable isolation, strict serializable isolation (SSI) additionally requires external consistency, meaning the timestamps received by clients cannot decrease. Without this step, if a network partition happens, a set of old proxies that are disconnected from the rest of the system can still serve timestamp requests to clients. Those timestamps can be smaller than the ones issued by the new generation of proxies, which breaks the external consistency of SSI.
  • O(n * m) communication cost: To confirm a proxy’s liveness, the proxy has to contact all members of the queuing system to ensure the queuing system is still active. This causes m network communications, where m is the number of processes in the queuing system. A system with n proxies therefore incurs O(n * m) network communications at step 3. In our deployment, n is typically equal to m.
  • Do FDB production clusters have this overhead? No. Our production clusters disable external consistency by configuring the knob ALWAYS_CAUSAL_READ_RISKY.

Specifically, production clusters are recommended to have ALWAYS_CAUSAL_READ_RISKY. We already set CAUSAL_READ_RISKY ourselves directly (on every transaction that we use); does it make sense to set this knob instead (i.e., does the cluster itself perform better when all internal FDB requests have this on)?

Saving O(n * m) calls seems significant enough that perhaps CAUSAL_READ_RISKY should be on by default anyway?
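
For a rough sense of the scale involved, a back-of-the-envelope sketch (all numbers are purely illustrative; in practice GRV requests are batched per proxy, so this is a per-batch cost rather than a per-client-request cost):

```python
# Illustrative tally of the liveness-check fan-out from step 3 above.
n_proxies = 10                        # n: number of proxies (illustrative)
m_tlogs = 10                          # m: processes in the queuing system (illustrative)
grv_batches_per_sec_per_proxy = 100   # illustrative GRV batching rate

liveness_round_trips_per_sec = n_proxies * m_tlogs * grv_batches_per_sec_per_proxy
print(liveness_round_trips_per_sec)   # 10000 extra round trips/sec in this example
```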

Good question.

DD (data distribution) moves data using transactions that do not set causal read risky, but it manually sets the read version. See https://github.com/apple/foundationdb/blob/cea7336baed0c55fca6e2d93909fc7b4e413e25b/fdbserver/storageserver.actor.cpp#L1844-L1857

Changing the cluster’s knob may help reduce the overhead for other system-initiated transactions, but I don’t think we have ever evaluated how much benefit it would provide.

Changing the knob’s default value would mean that, by default, we don’t support external consistency unless operators change the knob value. If an app relies on that property, we may accidentally break it when its cluster upgrades.