Incorrect cluster file coordinators when using three_data_hall?

I am using a three_data_hall cluster composed of 3 halls, 1 per AZ:

apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  name: my-cluster-$az
  namespace: my-ns
spec:
  version: 7.3.27
  skip: false
  sidecarContainer:
    enableLivenessProbe: true
  replaceInstancesWhenResourcesChange: true
  routing:
    publicIPSource: service
  automationOptions:
    cacheDatabaseStatusForReconciliation: true
    deletionMode: Zone
    podUpdateStrategy: ReplaceTransactionSystem
    removalMode: Zone
  coordinatorSelection:
  - priority: 100
    processClass: coordinator
  dataHall: $az
  databaseConfiguration:
    logs: 3
    proxies: 3
    redundancy_mode: three_data_hall
    storage: 3
    storage_engine: ssd-2
  ignoreUpgradabilityChecks: false
  maxZonesWithUnavailablePods: 1
  processCounts:
    coordinator: 3
    log: 3
    stateless: 3
  processGroupIDPrefix: $az

I followed the operator setup guide with all its steps, and the cluster seems to be working fine; however, today I discovered that the cluster file of each of the 3 halls contains coordinators from the other zones.

This was surprising; I also see plenty of cross-AZ traffic when using these cluster files as a client.

Is this working as intended? I would expect that when using the cluster file of an individual hall, I would get coordinators from that specific hall, and thus proxies/storage nodes only from that single AZ.

Thanks for any pointer!

In three_data_hall mode, the transaction system’s coordinators and stateless roles are distributed across the 3 data halls, so traffic between the transaction system processes will cross the 3 AZs.

Maybe you should use three_datacenter mode so that the transaction system roles stay within an AZ, avoiding communication across AZs.

However, writes still need to go through the primary AZ, which writes log replicas to two AZs and stores data replicas in three AZs. As far as I know, clients can specify preferred AZs by configuring client options, but write traffic is still sent to the primary AZ.
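
I haven’t verified this myself, but I believe the relevant option is the datacenter_id database option in the client bindings; a rough sketch with the Go bindings (the ID "1b" and the API version are just examples and must match your server-side locality):

package main

import (
	"log"

	"github.com/apple/foundationdb/bindings/go/src/fdb"
)

func main() {
	// API version must match the client library (7.3 series here).
	fdb.MustAPIVersion(730)

	// Open the database via the default (or hall-local) cluster file.
	db := fdb.MustOpenDefault()

	// Tell the client which datacenter/AZ it runs in, so location-aware
	// load balancing prefers replicas in the same AZ for reads.
	if err := db.Options().SetDatacenterId("1b"); err != nil {
		log.Fatal(err)
	}

	// ... use db as usual; writes still go through the primary.
}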


Thanks for your reply!

Do you mean that even in three_datacenter mode I would need to specify preferred AZs, or that it’s possible to specify preferred AZs in three_data_hall mode? I am only interested in optimizing for fast reads, so I consider it acceptable that writes cross AZs or hit the primary AZ unevenly.

In FDB Read and Write Path — FoundationDB 7.1, it says:

Load balance: Each data exists on k storage servers, where k is the replication factor. To balance the load across the k replicas, client has a load balancing algorithm to balance the number of requests to each replica.

In foundationdb/documentation/sphinx/source/api-common.rst.inc at main · apple/foundationdb · GitHub, it says:

Load balancing uses this option for location-awareness, attempting to send database operations first to servers on a specified machine, then a specified datacenter, then returning to its default algorithm.

I’m not sure if this option is only available in 3az mode, or if it just requires locality to be configured on the cluster.

And for fast reads, enabling USE_GRV_CACHE is also useful for read latency optimization.
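
I haven’t benchmarked it, but assuming your client version exposes the use_grv_cache transaction option in the Go bindings, a minimal sketch (readWithGrvCache is just an example name) would look roughly like this:

package main

import (
	"log"

	"github.com/apple/foundationdb/bindings/go/src/fdb"
)

// readWithGrvCache reads a key while allowing the transaction to reuse a
// cached read version instead of fetching a fresh GRV from the proxies.
func readWithGrvCache(db fdb.Database, key fdb.Key) ([]byte, error) {
	raw, err := db.Transact(func(tr fdb.Transaction) (interface{}, error) {
		// Opt this transaction into the client-side GRV cache; this trades
		// a slightly stale read version for lower read latency.
		if err := tr.Options().SetUseGrvCache(); err != nil {
			return nil, err
		}
		return tr.Get(key).Get()
	})
	if err != nil || raw == nil {
		return nil, err
	}
	return raw.([]byte), nil
}

func main() {
	fdb.MustAPIVersion(730)
	db := fdb.MustOpenDefault()

	value, err := readWithGrvCache(db, fdb.Key("some-key"))
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("value: %q", value)
}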


Sorry, I don’t follow; by 3az do you mean three_datacenter mode? Right now I am experimenting with a single region with 1 main DC and 2 satellites to cover the 3 AZs I have available, but specifying the DC ID on the client doesn’t seem to work: I see clients from all 3 AZs talking directly to the storage processes of the main DC in region A.

@johscheuer three_datacenter is not currently supported by the operator, right? I saw that this configuration page mentions:

The only reason to use three_datacenter replication is if low latency reads from all three locations is required.

I am actually trying to achieve exactly that (low read latency from all 3 AZs), but there must be something I am missing.
This is an example of the configuration I am using (the same for all 3 clusters, one per AZ, differing only in the name/prefix):

apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  name: skunkworks-1a
spec:
  dataCenter: 1a
  databaseConfiguration:
    redundancy_mode: triple
    regions:
    - datacenters:
      - id: 1a
        priority: 2
      - id: 1b
        priority: 1
        satellite: 1
      - id: 1c
        priority: 1
        satellite: 1
    storage: 3
    usable_regions: 1
  faultDomain:
    key: foundationdb.org/none
  lockOptions:
    disableLocks: true
  processGroupIDPrefix: 1a
  version: 7.3.27

@johscheuer three_datacenter is not currently supported by the operator, right? I saw that this configuration page mentions:

Correct, the operator doesn’t support that yet. We could add support for that if there is a specific need for this configuration setup.

I am actually trying to achieve exactly that (low read latency from all 3 AZs), but there must be something I am missing.
This is an example of the configuration I am using (the same for all 3 clusters, one per AZ, differing only in the name/prefix):

The setup that you posted will require cross-AZ traffic, as the commit must be done in both the primary and the satellite. I’m actually not sure what happens when two DCs have the same priority. Also, all reads will go to the primary (1a in your case), as that will be the only AZ with storage servers. For both reads and writes the client needs a GRV, and that will also be fetched from the primary (1a), as this will be the only AZ running stateless processes. The satellites will only run log processes.


Thanks for your reply!

If it’s the only way to achieve low latency reads from all 3 locations, then it would be welcome!

I am okay with slower writes and commits always going to the primary, and also okay with the GRV being fetched from the primary; but ideally, when the client specifies 1b as the DC ID, I’d like it to read from the replica copies that exist on storage nodes in that same AZ 1b. Is this achievable with current operator features, or with three_datacenter, or not achievable at all? (if you know) :thinking:

An example of cross-zone traffic from a client (sitting in zone C) to a storage node (sitting in zone A):

I am going to attempt using a configured region for each AZ and report back my findings.

Correct, the operator doesn’t support that yet. We could add support for that if there is a specific need for this configuration setup.

If it’s the only way to achieve low latency reads from all 3 locations, then it would be welcome!

In theory it should be straightforward if we implement it similarly to the three_data_hall setup, but before adding support it would be great if you could gather some data for it, so we don’t have to maintain another configuration in the operator that isn’t used.

I am okay with slower writes and commits always going to the primary, and also okay with the GRV being fetched from the primary; but ideally, when the client specifies 1b as the DC ID, I’d like it to read from the replica copies that exist on storage nodes in that same AZ 1b. Is this achievable with current operator features, or with three_datacenter, or not achievable at all? (if you know) :thinking:

Are you setting the datacenter ID through the fdb Go package (github.com/apple/foundationdb/bindings/go/src/fdb) in your client (just to make sure you are setting the same datacenter IDs)? In theory that should work, with a few caveats:

  1. The AZ (or DC) with the satellite will not host any storage processes.
  2. There can be cases where your application gets the GRV from the primary and then attempts to read from the remote side; if the storage server on the remote side doesn’t have the data yet, the call will block until either the version is available or a timeout is hit.


Sure, please wait for me to test this manually (without the operator), because if AZ-local reads can be achieved with a less sophisticated setup I will not need it; thanks!

Yes, I am setting the correct DC ID on the client (although it’s not hexadecimal as the documentation suggests it should be; as far as I could tell from the FoundationDB source code, it doesn’t have to be). But since there are no storage processes allocated in the satellite DCs/AZs (the pods are there but basically idle), it seems correct that clients are reading data from the storage nodes in the only DC/AZ that has storage processes.

Understood; so far I have only tried 1 primary DC with 2 satellite DCs to cover the 3 AZs; what I will try next is:

  1. 1 DC for each AZ, for 3 peer DCs in total (no satellites); I had some issues bringing up the third DC, but from the documentation it seems supported with v7+
  2. three_datacenter mode

So far I could not think of other ways to take advantage of the client and the replica copies being hosted in the same AZ. Theoretically, if I could make the FoundationDB cluster map its “zones” to AZs, it should be possible to get AZ-local reads just by relying on the algorithm that monitors latencies and automatically picks the closest replica, even with a simpler setup without multi-cluster/multi-DC/multi-region; but I have suspended experimenting with that for now and might try again later.

I haven’t tried this and maybe it doesn’t work, but you could try setting dataCenter: 1a and dataHall: 1a (and the same for 1b and 1c) with three_data_hall. This will bring up the cluster with three_data_hall replication mode, so each DC should hold one replica. The transaction system would still behave as Rjerk has written, but assuming that you mostly care about the “local” reads, this might work. You still have to set the data center ID in your client application.
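
Just to make the idea concrete, a minimal (untested) sketch of the per-AZ spec I have in mind, reusing the fields from your earlier specs; the name and IDs are only examples:

apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  name: my-cluster-1a
spec:
  version: 7.3.27
  processGroupIDPrefix: 1a
  dataCenter: 1a   # locality that the client’s datacenter ID has to match
  dataHall: 1a     # locality used by three_data_hall replication
  databaseConfiguration:
    redundancy_mode: three_data_hall
    logs: 3
    proxies: 3
    storage: 3
    storage_engine: ssd-2

(and the same with 1b / 1c for the other two clusters)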


The three_data_hall setup with specified data center IDs does actually work, thanks for the help! I can see that all of the client’s traffic is limited to its own zone, without even a round trip to zone A for the GRV.
I can try contributing a documentation PR for this, if you think it’s worth it.

Other observations: all FoundationDB cluster processes from the 3 data halls are communicating with the cluster controller (CC) in zone A, and while zones A and C show similar performance (~2 ms), I observe slightly higher latency in zone B, for unknown reasons.

Do you have any tips on how to chase down the extra latency there? I’d like to see how far I can push down the minimum latency.

I’ll open a new thread because I have a question specifically about GRV proxies in three_data_hall.