Three_data_hall coordinators

Hi @johscheuer, how did you manage to get 9 coordinators for three_data_hall (Initial support for three data hall replication by johscheuer · Pull Request #1651 · FoundationDB/fdb-kubernetes-operator · GitHub)? Every time I test, I get 1 :frowning:

Could you share some more information about your setup? Without knowing which operator version you use and what your FoundationDBCluster resources look like, it's hard to help you.

I want to achieve three_data_hall across 3 AZs (cloud). My nodes are labeled with topology.kubernetes.io/zone=<respective_AZ>, and I am following https://github.com/FoundationDB/fdb-kubernetes-operator/tree/main/config/tests/three_data_hall for the deployment. The locality is also set:

localities:
  - key: "data_hall"
    value: $az

With the initial triple cluster I get the default 5 coordinators; when it changes to three_data_hall, it goes to 1 instead of 9. What am I missing?
Thank you

Are you able to share the actual FoundationDBCluster resource? How many nodes do you have per AZ? You need at least 3 nodes per AZ, otherwise the operator is not able to select the right number of coordinators.
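
One way to check the "at least 3 nodes per AZ" requirement is to count nodes per zone label. The following is only a minimal sketch: it assumes the node list was saved with `kubectl get nodes -o json > nodes.json`, and the helper names (`load_nodes`, `nodes_per_zone`, `zones_below_minimum`) are made up for illustration and are not part of the operator.

```python
import json
from collections import Counter

# Zone label mentioned in this thread.
ZONE_LABEL = "topology.kubernetes.io/zone"


def load_nodes(path):
    """Load a node list saved from `kubectl get nodes -o json`."""
    with open(path) as handle:
        return json.load(handle)


def nodes_per_zone(node_list):
    """Count nodes per zone label; nodes without the label are grouped together."""
    counts = Counter()
    for node in node_list.get("items", []):
        labels = node.get("metadata", {}).get("labels", {})
        counts[labels.get(ZONE_LABEL, "<unlabeled>")] += 1
    return counts


def zones_below_minimum(counts, minimum=3):
    """Return the zones that have fewer than `minimum` nodes, sorted by name."""
    return sorted(zone for zone, count in counts.items() if count < minimum)
```

Calling `zones_below_minimum(nodes_per_zone(load_nodes("nodes.json")))` should return an empty list when every zone has at least 3 nodes.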

localities:
  - key: "data_hall"
    value: $az

Shouldn’t the $az be replaced with the actual value (not sure where this information is from)? The docs have some additional information about the setup: https://github.com/FoundationDB/fdb-kubernetes-operator/blob/main/docs/manual/fault_domains.md#three-data-hall-replication

Please ignore the localities section; it does not work with the new API apps.foundationdb.org/v1beta2 (it comes from https://github.com/FoundationDB/fdb-kubernetes-operator/blob/main/docs/design/three_datahall.md#coordinator-selection).
Back to the initial problem: I have 50 nodes available, and the pods are scheduled correctly on the respective nodes. The FoundationDBCluster definition I am using at the moment is https://github.com/FoundationDB/fdb-kubernetes-operator/blob/main/config/tests/three_data_hall/final.yaml

Env vars:

AZ1=${AZ1:-"eastus2-1"}
AZ2=${AZ2:-"eastus2-2"}
AZ3=${AZ3:-"eastus2-3"}

Observations:

  • in triple mode, I have 5 logs, 5 storage and 5 coordinators
  • when it switches to three_data_hall it shrinks: I have 3 clusters with 3 logs, 1 storage each and 1 coordinator
  • nothing obvious in the operator pod logs

What operator version is deployed in your case? And have you made sure you use the correct CRD, deployed from the corresponding release branch (or newer)? I just tested the scripts/test setup that you referenced and everything works fine:

$ fdbcli --exec 'status details'

Using cluster file `/var/dynamic-conf/fdb.cluster'.

Configuration:
  Redundancy mode        - three_data_hall
  Storage engine         - ssd-2
  Coordinators           - 9
  Desired Commit Proxies - 2
  Desired GRV Proxies    - 1
  Desired Resolvers      - 1
  Desired Logs           - 4
  Desired Remote Logs    - -1
  Desired Log Routers    - -1
  Usable Regions         - 1

...

Coordination servers:
  192.168.0.3:4501  (reachable)
  192.168.0.4:4501  (reachable)
  192.168.0.5:4501  (reachable)
  192.168.0.6:4501  (reachable)
  192.168.0.23:4501  (reachable)
  192.168.0.9:4501  (reachable)
  192.168.0.11:4501  (reachable)
  192.168.0.101:4501  (reachable)
  192.168.0.102:4501  (reachable)
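
If you want to check the coordinator count from a script instead of by eye, the "Coordination servers:" section of the output above can be parsed. This is just a sketch that assumes the text layout shown here; `parse_coordinators` is a hypothetical helper, not something shipped with fdbcli or the operator.

```python
import re

# Matches lines such as "  192.168.0.3:4501  (reachable)".
COORDINATOR_LINE = re.compile(r"^\s*(\S+:\d+)\s+\((\w+)\)\s*$")


def parse_coordinators(status_text):
    """Return (address, state) tuples listed under "Coordination servers:"."""
    coordinators = []
    in_section = False
    for line in status_text.splitlines():
        if line.strip() == "Coordination servers:":
            in_section = True
            continue
        if not in_section:
            continue
        match = COORDINATOR_LINE.match(line)
        if match:
            coordinators.append((match.group(1), match.group(2)))
        elif line.strip():
            # A non-empty line that is not a coordinator entry ends the section.
            break
    return coordinators
```

For the output above, `len(parse_coordinators(text))` would be 9, which is what is expected for three_data_hall in this setup.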

Is there anything interesting in the operator logs? Are you able to share them?

I updated the CRDs but forgot to update the operator image version. Thank you so much for your help!
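
For anyone hitting the same mismatch, a quick way to see which image the operator is actually running is to read it from the Deployment. This is a rough sketch under assumptions: the Deployment JSON comes from `kubectl get deployment <operator-deployment> -o json` (the deployment name depends on how the operator was installed), and both helper names are made up for illustration.

```python
def container_images(deployment):
    """Map container name to image for a Deployment parsed from JSON."""
    containers = deployment["spec"]["template"]["spec"]["containers"]
    return {container["name"]: container["image"] for container in containers}


def image_tag(image):
    """Return the tag of an image reference, or "" if no tag is given."""
    # Drop a digest suffix such as "@sha256:...".
    image = image.split("@", 1)[0]
    # Only the last path segment can carry a tag; earlier colons are registry ports.
    last_segment = image.rsplit("/", 1)[-1]
    _name, separator, tag = last_segment.partition(":")
    return tag if separator else ""
```

Comparing the tag returned here against the release the CRDs were installed from would have shown the outdated operator image.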