Setting the primary data center does not work as expected

I have two FDB cluster setups: SingleDC and MultiDC (2 regions, with 1 primary DC, 1 standby DC, and 1 satellite DC).

My client issues read-only requests to each FDB cluster. With MultiDC, the client is located in the primary DC. However, the latency of the MultiDC setup is higher than that of SingleDC. In my client, I have already set the data center ID to the primary DC's ID.
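Roughly, my client setup looks like this (a sketch only; the cluster file path and the "dc-primary" ID are placeholders for my actual values):

    import com.apple.foundationdb.Database;
    import com.apple.foundationdb.FDB;

    public class ClientSetup {
        public static void main(String[] args) {
            FDB fdb = FDB.selectAPIVersion(600);                      // FDB 6.0.x
            Database db = fdb.open("/etc/foundationdb/fdb.cluster");  // placeholder path
            // Matches the datacenter_id passed to the fdbserver processes
            // running in the primary DC ("dc-primary" is a placeholder).
            db.options().setDatacenterId("dc-primary");
            // ... read-only transactions are issued against db here ...
        }
    }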

If I understand correctly, the latency of MultiDC should be comparable to SingleDC in my setup, since the workload is read-only and, with MultiDC, the client should always go to the primary DC to get the read version and perform the reads. Am I missing anything? My FDB version is 6.0.15.

Updated:

  • If I put my client at the primary DC, the latency is ~2 times higher (15 ms vs 7.5 ms).
  • If I put my client at the standby DC, the latency is ~5 times higher (~35 ms, i.e. about 20 ms more).
    (Note that these are read transactions that may read multiple key-value pairs, not single-key reads.)

If you ssh to a host in your primary DC, what’s the round-trip time to your satellite DC, and what’s the round-trip time to the remote DC? (i.e. ping them; what’s the latency?)

@alexmiller:
Round-trip time from one node in the primary DC to another node in the remote DC is ~7-8 ms.
Round-trip time from one node in the primary DC to another node in the satellite DC is ~7-8 ms.

If the transaction is read-only (and executed at the primary DC), how frequently does it go to the other DCs, or how many round trips does it usually make to them? As I understand it, it should not need to go to either the satellite DC or the remote DC, should it?

When I turned on the client trace, the event of type “TransactionTrace_GetVersion” always takes ~10 ms, even though my client is in the same DC as the primary DC and I set db.options().setDatacenterId() to the primary DC’s ID.
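For reference, this is roughly how I enable the trace (a sketch; the trace directory and the "grv-latency-probe" tag are placeholders, and I’m assuming the transaction_logging_enable option available in 6.0):

    import com.apple.foundationdb.Database;
    import com.apple.foundationdb.FDB;
    import com.apple.foundationdb.Transaction;

    public class TraceProbe {
        public static void main(String[] args) {
            FDB fdb = FDB.selectAPIVersion(600);
            // Must be set before the network starts (i.e. before open()).
            fdb.options().setTraceEnable("/var/log/fdb-client");
            Database db = fdb.open();

            try (Transaction tr = db.createTransaction()) {
                // Tag the transaction so its TransactionTrace_* events are written.
                tr.options().setTransactionLoggingEnable("grv-latency-probe");
                tr.getReadVersion().join(); // appears as TransactionTrace_GetVersion
            }
        }
    }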

When my client is at the remote DC, the latency of GetVersion is 20-30 ms.

In the experiment with a single DC, this number is around 1 ms.

For the other events, the latency looks normal to me. Does that mean my configuration is wrong in some way that causes my GetVersion latency to be so high?

Both commit and get read version requests require a round trip to the transaction logs from the proxies. If you have a setup with logs in a remote DC (like the satellites), then both of those requests will be affected accordingly.

If I’m interpreting your setup correctly that you have logs in a satellite with a round-trip time of 8ms, then the numbers you report sound reasonable to me.

Read-only transactions will need to get a read version for the first read, so they will also experience a round trip latency to the satellite DC in that case. There are various strategies that could be used to limit that effect, such as operating a satellite that is closer to your primary, reducing causal consistency guarantees by using the CAUSAL_READ_RISKY transaction option (doing so avoids the round-trip to the logs, but introduces a rare possibility that you might get a read version older than a committed version), or caching read versions and reusing them (which means you could be reading slightly stale data).
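For example, here is a minimal sketch of the CAUSAL_READ_RISKY approach with the Java bindings (the key is just a placeholder):

    import com.apple.foundationdb.Database;
    import com.apple.foundationdb.FDB;
    import java.nio.charset.StandardCharsets;

    public class CausalReadRiskyExample {
        public static void main(String[] args) {
            Database db = FDB.selectAPIVersion(600).open();

            byte[] value = db.run(tr -> {
                // Skip the proxy -> transaction log round trip when getting the
                // read version, accepting a rare chance of a slightly old version.
                tr.options().setCausalReadRisky();
                return tr.get("some-key".getBytes(StandardCharsets.UTF_8)).join();
            });

            System.out.println(value == null ? "<missing>" : new String(value, StandardCharsets.UTF_8));
        }
    }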

I was thinking about this, and thought that the GRV quorum work I did some time ago meant that this shouldn’t be required. If we get enough GRV responses back from the local DC to indicate that local TLogs could still commit, I’d… think that should be good enough. I’ve thrown my claims into correctness, and we’ll see if that turns into a PR.

Interesting, I wasn’t aware of that. Based on the reported numbers, it seems likely that it’s having to do a round-trip outside of the primary. Could there be something about their config that would make it not benefit from your change?

I retract my claim, because tolerating the loss of 1DC means we need to be able to lock 2 out of 3 DCs during recovery, which also means our GRV quorum needs to be 2 out of 3 DCs. So, GRVs do indeed need to go to satellites, and minimal quorums only work within one DC.

I tried bringing the satellite DC close to the primary DC, and the latency drops to be close to the single-DC setup (actually, they are deployed in the same data center) :slight_smile: . Unfortunately, with this change, my setup may not tolerate a data center failure.

I don’t quite understand what you are discussing, probably because I lack knowledge of how FDB works internally. For example, other than get-read-version and commit, are there other transaction operations that go across DCs? Does a read or get-range always fetch data from its own DC, or does it sometimes have to go to another DC? Does a write always go synchronously to the transaction log processes of the primary DC and the satellite DC?

Is there a document describing how a transaction is executed in a 3-DC, 2-region setup? That would be really helpful for me to get some insight into my deployment.

If I put my client at the remote DC, the latency is still higher than at the primary DC (15 ms vs 8 ms). I tried both methods, enabling causal read risky and using a background thread to pre-fetch the read version, and neither helps. Could it be because tx.commit().get() or tx.cancel() still goes to the primary DC and satellite DC? If my transaction is read-only, could it skip tx.commit().get() and tx.cancel()?

During recovery, there’s a Lock RPC that we make to TLogs to tell them that there’s now a newer generation of TLogs and that they should stop accepting new commits. Getting a read version checks to make sure that the TLogs aren’t locked, and therefore we can decrease latency a bit by using a minimal quorum for this. As recoveries are far less frequent than getting a read version, we force recovery to lock a large number of TLogs and GRV to check that a minimal number aren’t locked. I was thinking that this quorum check for GRVs should have been fine with a quorum of responses from one DC, but then later reflected, and realized that it does need to wait for responses from two DCs. So no cheap easy optimization for me. :man_shrugging:

Correct. Clients in the secondary DC are expected to have higher latency, as starting a transaction and committing a transaction still require going to the primary datacenter. The other operations that a transaction would do are reads, and those should be able to be served from the local DC, as long as there are non-failed storage servers available locally.

If your transaction is read-only, tx.commit() is already a no-op, won’t actually send any packets, and should return a future that’s immediately ready.
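For example, a read-only sketch in the Java bindings that never calls commit() (the key is a placeholder):

    import com.apple.foundationdb.Database;
    import com.apple.foundationdb.FDB;
    import com.apple.foundationdb.Transaction;

    public class ReadOnlyExample {
        public static void main(String[] args) {
            Database db = FDB.selectAPIVersion(600).open();

            try (Transaction tr = db.createTransaction()) {
                byte[] v = tr.get("some-key".getBytes()).join();
                // No commit() here: for a read-only transaction, commit() would be
                // a no-op returning an already-completed future.
                System.out.println(v == null ? "<missing>" : "found " + v.length + " bytes");
            }
        }
    }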

Based on your answer, with a read-only transaction from the remote DC, committing a transaction does not go to the primary DC anymore.

How about starting a transaction? Currently, I initialize a transaction with Transaction tx = db.createTransaction();, so I guess FDB does not know up front whether this is a read-only or read-write transaction and therefore still spends one round trip to the primary DC. Is that correct?

If I know the transaction is read-only, is there a way to hint this at the beginning so that starting the transaction only accesses the local DC?

If I use the read transaction from tx.snapshot(), will it solve the issue?

A transaction needs some read version to perform reads. Your options are to let the client get it automatically (incurring the round trip to the primary DC), or to use some flavor of re-using an older read version, as outlined below, which may read potentially stale data.
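As a rough illustration of re-using a cached read version (a sketch only; the refresh interval and key are placeholders, and a version older than the roughly five-second MVCC window will cause reads to fail):

    import com.apple.foundationdb.Database;
    import com.apple.foundationdb.FDB;
    import com.apple.foundationdb.Transaction;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicLong;

    public class CachedReadVersionExample {
        private static final AtomicLong cachedVersion = new AtomicLong(-1);

        public static void main(String[] args) throws Exception {
            Database db = FDB.selectAPIVersion(600).open();

            // Background thread refreshes the cached read version (placeholder interval).
            ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
            scheduler.scheduleAtFixedRate(() -> {
                try (Transaction tr = db.createTransaction()) {
                    cachedVersion.set(tr.getReadVersion().join());
                }
            }, 0, 1, TimeUnit.SECONDS);

            Thread.sleep(200); // let the first refresh land (sketch only)

            try (Transaction tr = db.createTransaction()) {
                // Reuse the cached version: no GRV round trip to the primary DC,
                // at the cost of potentially reading slightly stale data.
                tr.setReadVersion(cachedVersion.get());
                byte[] v = tr.get("some-key".getBytes()).join();
                System.out.println(v == null ? "<missing>" : "found");
            }

            scheduler.shutdown();
        }
    }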

snapshot is a different concern. Snapshot does not record read conflict ranges for the reads you perform. You still need some read version to perform those reads at.

There will eventually be improvements to doing this, as well. Being able to get a stale read version from a non-primary DC locally is #1006. Better read version caching is #1310.