useDNSInClusterFile - Connection string invalid

I’m looking for some guidance on using the DNS option for the cluster file connection string. I’m able to deploy the cluster fine when using pod or service IPs. I have tried this setup with multiple image versions of the operator/fdb/sidecar, from 7.1.25 to 7.2.0 (which was released yesterday), all with the same results. Between attempts I am deleting the operator deployment and the FDB cluster.

When I enable useDNSInClusterFile, the operator will create the cluster, pick coordinators, and initialize the cluster file with what I expect for the connection string. However, the operator errors after that with FoundationDB error code 2104 (Connection string invalid).

...
{"level":"info","ts":1669932412.0034533,"logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"foundationdb","cluster":"foundationdb-cluster","subReconciler":"controllers.updateLabels"}
{"level":"info","ts":1669932412.0045478,"logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"foundationdb","cluster":"foundationdb-cluster","subReconciler":"controllers.updateDatabaseConfiguration"}
{"level":"info","ts":1669932412.0046866,"logger":"fdbclient","msg":"Fetch values from FDB","namespace":"foundationdb","cluster":"foundationdb-cluster","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":1669932412.0048652,"logger":"fdbclient","msg":"Done fetching values from FDB","namespace":"foundationdb","cluster":"foundationdb-cluster","key":"\ufffd\ufffd/status/json"}
{"level":"error","ts":1669932412.0050175,"logger":"controller","msg":"Error in reconciliation","namespace":"foundationdb","cluster":"foundationdb-cluster","subReconciler":"controllers.updateDatabaseConfiguration","requeueAfter":0,"error":"FoundationDB error code 2104 (Connection string invalid)","stacktrace":"github.com/FoundationDB/fdb-kubernetes-operator/controllers.(*FoundationDBClusterReconciler).Reconcile\n\t/workspace/controllers/cluster_controller.go:180\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:234"}
{"level":"error","ts":1669932412.005133,"msg":"Reconciler error","controller":"foundationdbcluster","controllerGroup":"apps.foundationdb.org","controllerKind":"FoundationDBCluster","foundationDBCluster":{"name":"foundationdb-cluster","namespace":"foundationdb"},"namespace":"foundationdb","name":"foundationdb-cluster","reconcileID":"31fb36e8-85f9-4d36-b3ac-a93c1c0d3a88","error":"FoundationDB error code 2104 (Connection string invalid)","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:234"}

The cluster-config generated:

cluster-file: >-
    foundationdb_cluster:LNg3SuB6Knl88bYxNJ5scY2gkDXr1fsp@foundationdb-cluster-storage-1.foundationdb-cluster.foundationdb.svc.cluster.local:4501,foundationdb-cluster-storage-2.foundationdb-cluster.foundationdb.svc.cluster.local:4501,foundationdb-cluster-storage-3.foundationdb-cluster.foundationdb.svc.cluster.local:4501
  fdbmonitor-conf-cluster_controller: |-
    [general]
    kill_on_configuration_change = false
    restart_delay = 60
    [fdbserver.1]
    command = $BINARY_DIR/fdbserver
    cluster_file = /var/fdb/data/fdb.cluster
    seed_cluster_file = /var/dynamic-conf/fdb.cluster
    public_address = $FDB_PUBLIC_IP:4501
    class = cluster_controller
    logdir = /var/log/fdb-trace-logs
    loggroup = foundationdb-cluster
    datadir = /var/fdb/data
    locality_instance_id = $FDB_INSTANCE_ID
    locality_machineid = $FDB_MACHINE_ID
    locality_zoneid = $FDB_ZONE_ID
    listen_address = $FDB_POD_IP:4501
    knob_disable_posix_kernel_aio = 1
    locality_dns_name = $FDB_DNS_NAME
  fdbmonitor-conf-log: |-
    [general]
    kill_on_configuration_change = false
    restart_delay = 60
    [fdbserver.1]
    command = $BINARY_DIR/fdbserver
    cluster_file = /var/fdb/data/fdb.cluster
    seed_cluster_file = /var/dynamic-conf/fdb.cluster
    public_address = $FDB_PUBLIC_IP:4501
    class = log
    logdir = /var/log/fdb-trace-logs
    loggroup = foundationdb-cluster
    datadir = /var/fdb/data
    locality_instance_id = $FDB_INSTANCE_ID
    locality_machineid = $FDB_MACHINE_ID
    locality_zoneid = $FDB_ZONE_ID
    listen_address = $FDB_POD_IP:4501
    knob_disable_posix_kernel_aio = 1
    locality_dns_name = $FDB_DNS_NAME
  fdbmonitor-conf-storage: |-
    [general]
    kill_on_configuration_change = false
    restart_delay = 60
    [fdbserver.1]
    command = $BINARY_DIR/fdbserver
    cluster_file = /var/fdb/data/fdb.cluster
    seed_cluster_file = /var/dynamic-conf/fdb.cluster
    public_address = $FDB_PUBLIC_IP:4501
    class = storage
    logdir = /var/log/fdb-trace-logs
    loggroup = foundationdb-cluster
    datadir = /var/fdb/data
    locality_instance_id = $FDB_INSTANCE_ID
    locality_machineid = $FDB_MACHINE_ID
    locality_zoneid = $FDB_ZONE_ID
    listen_address = $FDB_POD_IP:4501
    knob_disable_posix_kernel_aio = 1
    locality_dns_name = $FDB_DNS_NAME
  running-version: 7.2.0

If I exec into one of the pods and check the status, I get the following:

Using cluster file `/var/dynamic-conf/fdb.cluster'.

Unable to communicate with the cluster controller at 172.20.179.0:4501 to get
status.

Configuration:
  Redundancy mode        - unknown
  Storage engine         - unknown
  Encryption at-rest     - disabled
  Coordinators           - unknown
  Usable Regions         - unknown

Cluster:
  FoundationDB processes - unknown
  Zones                  - unknown
  Machines               - unknown

Data:
  Replication health     - unknown
  Moving data            - unknown
  Sum of key-value sizes - unknown
  Disk space used        - unknown

Operating space:
  Unable to retrieve operating space status

Workload:
  Read rate              - unknown
  Write rate             - unknown
  Transactions started   - unknown
  Transactions committed - unknown
  Conflict rate          - unknown

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Execing into the coordinator gives similar results:

WARNING: Long delay (Ctrl-C to interrupt)
Using cluster file `/var/dynamic-conf/fdb.cluster'.

The coordinator(s) have no record of this database. Either the coordinator
addresses are incorrect, the coordination state on those machines is missing, or
no database has been created.

  foundationdb-cluster-storage-1.foundationdb-cluster.foundationdb.svc.cluster.local:4501  (reachable)
  foundationdb-cluster-storage-2.foundationdb-cluster.foundationdb.svc.cluster.local:4501  (reachable)
  foundationdb-cluster-storage-3.foundationdb-cluster.foundationdb.svc.cluster.local:4501  (reachable)

Unable to locate the data distributor worker.

Unable to locate the ratekeeper worker.

Unable to locate the consistencyScan worker.

I am running the FDB cluster on EKS via the operator with the following configuration:

apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  labels:
    argocd.argoproj.io/instance: foundationdb.foundationdb.in-cluster
  name: foundationdb-cluster
  namespace: foundationdb
spec:
  automationOptions:
    killProcesses: true
    replacements:
      enabled: true
      maxConcurrentReplacements: 2
  faultDomain:
    key: foundationdb.org/none
  labels:
    filterOnOwnerReference: false
    matchLabels:
      foundationdb.org/fdb-cluster-name: foundationdb-cluster
    processClassLabels:
      - foundationdb.org/fdb-process-class
    processGroupIDLabels:
      - foundationdb.org/fdb-process-group-id
  mainContainer:
    imageConfigs:
      - baseImage: foundationdb/foundationdb
      - tag: 7.2.0
  minimumUptimeSecondsForBounce: 60
  processCounts:
    cluster_controller: 1
    stateless: -1
  processes:
    general:
      customParameters:
        - knob_disable_posix_kernel_aio=1
      podTemplate:
        spec:
          containers:
            - name: foundationdb
              resources:
                requests:
                  cpu: 400m
                  memory: 955Mi
              securityContext:
                runAsUser: 0
            - livenessProbe:
                failureThreshold: 5
                initialDelaySeconds: 15
                periodSeconds: 30
                successThreshold: 1
                tcpSocket:
                  port: 8080
                timeoutSeconds: 1
              name: foundationdb-kubernetes-sidecar
              resources:
                limits:
                  cpu: 100m
                  memory: 128Mi
                requests:
                  cpu: 100m
                  memory: 128Mi
              securityContext:
                runAsUser: 0
          initContainers:
            - name: foundationdb-kubernetes-init
              resources:
                limits:
                  cpu: 100m
                  memory: 128Mi
                requests:
                  cpu: 100m
                  memory: 128Mi
              securityContext:
                runAsUser: 0
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 16G
  replaceInstancesWhenResourcesChange: false
  routing:
    publicIPSource: service
    useDNSInClusterFile: true
  sidecarContainer:
    enableLivenessProbe: false
    enableReadinessProbe: false
  useExplicitListenAddress: true
  version: 7.2.0

Notably, I have the routing block as

  routing:
    publicIPSource: service
    useDNSInClusterFile: true

Documentation is very light on this feature, so I don’t know if I’m missing a configuration option or what. I found this test case which helped a little. I checked the validation in FoundationDB itself for the hostname and the regex passes, so hostnames themselves should be valid. I’ve read just about every forum post and issue that mentions the DNS option.

My ultimate use case is to connect to the FDB cluster from another peered cluster via Cilium cluster mesh. I can do that with the pod IPs, but I need the predictable service names for this solution to be viable.

Does anyone have any suggestions? I’ve been trying to figure this out for a while now and I’m out of ideas. Thanks.

Documentation is very light on this feature

That’s a very kind way of saying we don’t have documentation for that feature yet :slight_smile: Hopefully we’ll be able to write the documentation by the end of this year (I’m happy if you want to contribute it).

Do you mind sharing your operator deployment? When using the multi-version client setup there is currently an issue that needs some additional configuration. The TL;DR is that the primary (default) library of the FDB operator is 6.2.29, since the operator supports 6.2.20 and newer versions (this is built here: https://github.com/FoundationDB/fdb-kubernetes-operator/blob/main/Dockerfile#L7). That version of the FDB bindings doesn’t understand the DNS entries in the cluster file, and therefore you get the error. There are two ways to work around this issue:

1.) Build your own operator image with 7.1 as the default library (we will do that in the operator repository once we drop support for 6.2 and 6.3) and use your own image.
2.) Copy the library from a 7.1 image to a “special” place and set LD_LIBRARY_PATH, something like this:

        - name: foundationdb-kubernetes-init-7-1-primary
          image: foundationdb/foundationdb-kubernetes-sidecar:7.1.25
          imagePullPolicy: Always
          args:
            # Note that we are only copying a library, rather than copying any binaries. 
            - "--copy-library"
            - "7.1"
            - "--output-dir"
            - "/var/output-files/primary" # Note that we use `primary` as the subdirectory rather than specifying the FoundationDB version like we did in the other examples.
            - "--init-mode"
          volumeMounts:
            - name: fdb-binaries
              mountPath: /var/output-files

and in the operator simply add this environment variable:

          - name: LD_LIBRARY_PATH
            value: /usr/bin/fdb/primary/lib

Depending on your setup either option 1 or option 2 is the better fit.

Like I said if you’re currently trying out this feature and you want to write the documentation for it, I’m more than happy to help.

@johscheuer Thanks for the help.

I went with option #1 and built an image with the updated version

ARG BASE_IMAGE=docker.io/debian:bullseye

# Build the manager binary
FROM docker.io/library/golang:1.19.3 as builder

# Install FDB; this version is only required to compile the fdb operator
ARG FDB_VERSION=7.1.5
ARG FDB_WEBSITE=https://github.com/apple/foundationdb/releases/download
ARG TAG="latest"

RUN set -eux && \
	curl --fail -L ${FDB_WEBSITE}/${FDB_VERSION}/foundationdb-clients_${FDB_VERSION}-1_amd64.deb -o fdb.deb && \
	dpkg -i fdb.deb && \
    rm fdb.deb

RUN apt-get -y update && apt-get install -y unzip
RUN curl -sL https://github.com/FoundationDB/fdb-kubernetes-operator/archive/refs/tags/v1.10.0.zip > /tmp/fdb-kubernetes-operator.zip
RUN unzip /tmp/fdb-kubernetes-operator.zip -d /

WORKDIR /workspace
# Copy the Go Modules manifests
RUN cp /fdb-kubernetes-operator-1.10.0/go.mod go.mod
RUN cp /fdb-kubernetes-operator-1.10.0/go.sum go.sum
# cache deps before building and copying source so that we don't need to re-download as much
# and so that source changes don't invalidate our downloaded layer
RUN go mod download -x

# Copy the go source
RUN cp /fdb-kubernetes-operator-1.10.0/main.go main.go
RUN cp /fdb-kubernetes-operator-1.10.0/Makefile Makefile
RUN cp -r /fdb-kubernetes-operator-1.10.0/api/ api/
RUN cp -r /fdb-kubernetes-operator-1.10.0/controllers/ controllers/
RUN cp -r /fdb-kubernetes-operator-1.10.0/setup/ setup/
RUN cp -r /fdb-kubernetes-operator-1.10.0/fdbclient/ fdbclient/
RUN cp -r /fdb-kubernetes-operator-1.10.0/internal/ internal/
RUN cp -r /fdb-kubernetes-operator-1.10.0/pkg/ pkg/
RUN cp -r /fdb-kubernetes-operator-1.10.0/mock-kubernetes-client/ mock-kubernetes-client/

# Build
RUN CGO_ENABLED=1 GOOS=linux GOARCH=amd64 GO111MODULE=on make manager

# Create user and group here since we don't have the tools
# in distroless
RUN groupadd --gid 4059 fdb && \
	useradd --gid 4059 --uid 4059 --create-home --shell /bin/bash fdb && \
	mkdir -p /var/log/fdb && \
	touch /var/log/fdb/.keep

FROM $BASE_IMAGE

VOLUME /usr/lib/fdb

WORKDIR /

COPY --from=builder /etc/passwd /etc/passwd
COPY --from=builder /etc/group /etc/group
COPY --chown=fdb:fdb --from=builder /workspace/bin/manager .
COPY --from=builder /usr/lib/libfdb_c.so /usr/lib/
COPY --chown=fdb:fdb --from=builder /var/log/fdb/.keep /var/log/fdb/.keep

# Set to the numeric UID of the fdb user to satisfy PodSecurityPolicies which enforce runAsNonRoot
USER 4059

ENV FDB_NETWORK_OPTION_TRACE_LOG_GROUP=fdb-kubernetes-operator
ENV FDB_NETWORK_OPTION_TRACE_ENABLE=/var/log/fdb
ENV FDB_BINARY_DIR=/usr/bin/fdb
ENV FDB_NETWORK_OPTION_EXTERNAL_CLIENT_DIRECTORY=/usr/bin/fdb

ENTRYPOINT ["/manager"]

With that new image, I am able to start up the operator and get past the connection string invalid error. Here is the operator deployment I’m now using:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/instance: foundationdb-operator
    app.kubernetes.io/name: foundationdb-operator
    app.kubernetes.io/version: 1.10.0
  name: foundationdb-operator
  namespace: foundationdb-operator
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/instance: foundationdb-operator
      app.kubernetes.io/name: foundationdb-operator
  template:
    metadata:
      labels:
        app.kubernetes.io/instance: foundationdb-operator
        app.kubernetes.io/name: foundationdb-operator
    spec:
      containers:
        - command:
            - /manager
            - '-cli-timeout=30'
          image: <custom-image-repo>/foundationdb-operator:1.10.0
          imagePullPolicy: IfNotPresent
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /metrics
              port: metrics
              scheme: HTTP
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          name: manager
          ports:
            - containerPort: 8080
              name: metrics
              protocol: TCP
          resources:
            limits:
              cpu: 500m
              memory: 256Mi
            requests:
              cpu: 500m
              memory: 256Mi
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - all
            privileged: false
            readOnlyRootFilesystem: true
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /tmp
              name: tmp
            - mountPath: /var/log/fdb
              name: logs
            - mountPath: /usr/bin/fdb
              name: fdb-binaries
      dnsPolicy: ClusterFirst
      initContainers:
        - args:
            - '--copy-library'
            - '7.1'
            - '--copy-binary'
            - fdbcli
            - '--copy-binary'
            - fdbbackup
            - '--copy-binary'
            - fdbrestore
            - '--output-dir'
            - /var/output-files/7.1.5
            - '--init-mode'
          image: 'foundationdb/foundationdb-kubernetes-sidecar:7.1.5-1'
          imagePullPolicy: IfNotPresent
          name: foundationdb-kubernetes-init-7-1
          resources:
            limits:
              cpu: 10m
              memory: 50Mi
            requests:
              cpu: 10m
              memory: 50Mi
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - all
            privileged: false
            readOnlyRootFilesystem: true
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /var/output-files
              name: fdb-binaries
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 4059
        runAsGroup: 4059
        runAsUser: 4059
      serviceAccount: foundationdb-operator
      serviceAccountName: foundationdb-operator
      terminationGracePeriodSeconds: 10
      volumes:
        - emptyDir: {}
          name: tmp
        - emptyDir: {}
          name: logs
        - emptyDir: {}
          name: fdb-binaries

I ran into another problem, which I’ve since managed to solve. The operator couldn’t reconcile the cluster, producing these logs:

{"level":"info","ts":1670524142.8491852,"logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"foundationdb","cluster":"foundationdb-cluster","subReconciler":"controllers.updateDatabaseConfiguration"}
{"level":"info","ts":1670524142.8493474,"logger":"fdbclient","msg":"Fetch values from FDB","namespace":"foundationdb","cluster":"foundationdb-cluster","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":1670524150.8583922,"logger":"fdbclient","msg":"Done fetching values from FDB","namespace":"foundationdb","cluster":"foundationdb-cluster","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":1670524150.8593502,"logger":"controller","msg":"Configuring database","namespace":"foundationdb","cluster":"foundationdb-cluster","reconciler":"updateDatabaseConfiguration","current configuration":{"redundancy_mode":"double","storage_engine":"ssd-2","usable_regions":1,"log_routers":-1,"remote_logs":-1},"desired configuration":{"redundancy_mode":"double","storage_engine":"ssd-2","usable_regions":1,"logs":3,"proxies":3,"resolvers":1,"log_routers":-1,"remote_logs":-1}}
{"level":"info","ts":1670524150.8594463,"logger":"fdbclient","msg":"Running command","namespace":"foundationdb","cluster":"foundationdb-cluster","path":"/usr/bin/fdb/7.1/fdbcli","args":["/usr/bin/fdb/7.1/fdbcli","--exec","configure new double ssd-2 usable_regions=1 logs=3 resolvers=1 log_routers=-1 remote_logs=-1 proxies=3 regions=[]","-C","/tmp/b9f9c3d4-513e-4efb-a141-8d698722e40f","--log","--log","--trace_format","xml","--log-dir","/var/log/fdb","--timeout","30"]}
{"level":"error","ts":1670524180.8982832,"logger":"fdbclient","msg":"Error from FDB command","namespace":"foundationdb","cluster":"foundationdb-cluster","code":1,"stdout":"\nWARNING: Long delay (Ctrl-C to interrupt)\nSpecified timeout reached -- exiting...\nWarning: Proxy role is being split into GRV Proxy and Commit Proxy, now prefer configuring 'grv_proxies' and 'commit_proxies' separately. Generally we should follow that 'commit_proxies' is three times of 'grv_proxies' count and 'grv_proxies' should be not more than 4.\n3 proxies are automatically converted into 1 GRV proxies and 2 Commit proxies.\n\nThe database is unavailable; type `status' for more information.\n\n","stderr":"","error":"exit status 1","stacktrace":"github.com/FoundationDB/fdb-kubernetes-operator/fdbclient.(*cliAdminClient).ConfigureDatabase\n\t/workspace/fdbclient/admin_client.go:364\ngithub.com/FoundationDB/fdb-kubernetes-operator/controllers.updateDatabaseConfiguration.reconcile\n\t/workspace/controllers/update_database_configuration.go:110\ngithub.com/FoundationDB/fdb-kubernetes-operator/controllers.(*FoundationDBClusterReconciler).Reconcile\n\t/workspace/controllers/cluster_controller.go:166\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:234"}
{"level":"error","ts":1670524180.898432,"logger":"controller","msg":"Error in reconciliation","namespace":"foundationdb","cluster":"foundationdb-cluster","subReconciler":"controllers.updateDatabaseConfiguration","requeueAfter":0,"error":"fdb timeout: exit status 1","stacktrace":"github.com/FoundationDB/fdb-kubernetes-operator/controllers.(*FoundationDBClusterReconciler).Reconcile\n\t/workspace/controllers/cluster_controller.go:180\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:234"}

In the end I found a status message in the status JSON that indicated the cluster was trying (and evidently failing) to recruit a transaction process. I modified my cluster by explicitly adding 2 transaction processes, the cluster was finally able to reconcile, and I can now connect to the cluster from my test microservice. I’m going to have to look at my process setup more closely.
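
For reference, the change boiled down to the processCounts block (the full updated spec is later in this thread):

processCounts:
  cluster_controller: 1
  stateless: 1
  log: 3
  storage: 3
  transaction: 2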

A new thing I’m running into is trying to run the cluster without a headless service. I have the config set to headlessService: false, but a headless service is still created in the cluster, and that’s reflected in the connection string. I believe that’s one of the last hurdles before my multi-cluster connection works. Is there a hidden dependency that prevents the headless service from being disabled?

routing:
  useDNSInClusterFile: true
  headlessService: false
  publicIPSource: service

Edit:
This is interesting. I left the cluster running overnight. Sometime in the last 12h (the logs don’t go back that far) the operator tried to rotate coordinators or something, created new pods, and hit the EKS cluster’s resource/node limit (not unexpected), but now the connection string is foundationdb_cluster:9Iscf7sSnSCx3uJQH3JGf96GvfSQ5wPQ@10.0.3.74:4501(fromHostname),10.0.11.67:4501(fromHostname),10.0.13.169:4501(fromHostname). Aside from the strange addresses, the generationID also regenerated instead of using the fixed value I provide to the operator.

I went with option #1 and built an image with the updated version

You don’t have to change the Dockerfile; you can simply pass the build-arg to the docker build command, e.g.: docker build --build-arg FDB_VERSION=7.1.25 --platform='linux/amd64' -t foundationdb/fdb-kubernetes-operator .

In the end I found a status message in the status JSON that indicated the cluster was trying (and evidently failing) to recruit a transaction process. I modified my cluster by explicitly adding 2 transaction processes, the cluster was finally able to reconcile, and I can now connect to the cluster from my test microservice. I’m going to have to look at my process setup more closely.

Do you mind sharing your FoundationDBCluster? Just to understand what the issue could be.

A new thing I’m running into is trying to run the cluster without a headless service. I have the config set to headlessService: false, but a headless service is still created in the cluster, and that’s reflected in the connection string. I believe that’s one of the last hurdles before my multi-cluster connection works. Is there a hidden dependency that prevents the headless service from being disabled?

In the current implementation the operator will always create a headless service if DNS is enabled. We made those changes to use the Pod DNS entries: DNS for Services and Pods | Kubernetes. Is there a reason that you cannot use another headless service, or what exactly is missing on the automatically created headless service?
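
For context, the automatically created service is just a plain headless Service; roughly something like this (a sketch, using the cluster name and match labels from this thread):

apiVersion: v1
kind: Service
metadata:
  name: foundationdb-cluster            # same name as the FoundationDBCluster
  namespace: foundationdb
spec:
  clusterIP: None                        # headless: per-pod DNS records instead of a single virtual IP
  selector:
    foundationdb.org/fdb-cluster-name: foundationdb-cluster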

This is interesting. I left the cluster running overnight. Sometime in the last 12h (the logs don’t go back that far) the operator tried to rotate coordinators or something, created new pods, and hit the EKS cluster’s resource/node limit (not unexpected), but now the connection string is foundationdb_cluster:9Iscf7sSnSCx3uJQH3JGf96GvfSQ5wPQ@10.0.3.74:4501(fromHostname),10.0.11.67:4501(fromHostname),10.0.13.169:4501(fromHostname). Aside from the strange addresses, the generationID also regenerated instead of using the fixed value I provide to the operator.

If by generationID you mean this string 9Iscf7sSnSCx3uJQH3JGf96GvfSQ5wPQ, it’s expected that the generationID changes when new coordinators are selected. The generationID is used to distinguish different coordinator “generations”.

The connection string value that you showed there, is that the connection string from /var/dynamic-conf/fdb.cluster or the content from /var/fdb/data/fdb.cluster? The latter (/var/fdb/data/...) is the connection string file managed by the fdbserver process itself, so it’s expected that you see IP addresses there, since the fdbserver process was resolving those.

Building a new Dockerfile fits better into our CI/CD, so that’s why I opted for that option. Thanks for the suggestion though.

This was the FoundationDBCluster when it was failing. I don’t have the log files anymore with the actual error.

apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  name: foundationdb-cluster
  namespace: foundationdb
spec:
  automationOptions:
    killProcesses: true
    replacements:
      enabled: true
      maxConcurrentReplacements: 2
  faultDomain:
    key: foundationdb.org/none
  labels:
    filterOnOwnerReference: false
    matchLabels:
      foundationdb.org/fdb-cluster-name: foundationdb-cluster
    processClassLabels:
    - foundationdb.org/fdb-process-class
    processGroupIDLabels:
    - foundationdb.org/fdb-process-group-id
    - app.kubernetes.io/name
  minimumUptimeSecondsForBounce: 60
  processCounts:
    cluster_controller: 1
    stateless: 2
    log: 3
    storage: 3
  partialConnectionString: 
    databaseName: foundationdb_cluster
    generationID: cKicfHRW3OHgnNpYdVb2j1NHl89UKvJT
  processes:
    general:
      customParameters:
      - knob_disable_posix_kernel_aio=1
      podTemplate:
        spec:
          containers:
          - name: foundationdb
            resources:
              requests:
                cpu: 500m
                memory: 2.5Gi
            securityContext:
              runAsUser: 0
          - name: foundationdb-kubernetes-sidecar
            resources:
                limits:
                  cpu: 200m
                  memory: 328Mi
                requests:
                  cpu: 200m
                  memory: 328Mi
            securityContext:
              runAsUser: 0
            livenessProbe:
              failureThreshold: 5
              periodSeconds: 30
              successThreshold: 1
              tcpSocket:
                port: 8080
              timeoutSeconds: 1
              initialDelaySeconds: 15
          initContainers:
          - name: foundationdb-kubernetes-init
            resources:
                limits:
                  cpu: 200m
                  memory: 228Mi
                requests:
                  cpu: 200m
                  memory: 228Mi
            securityContext:
              runAsUser: 0
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 16G
  replaceInstancesWhenResourcesChange: true
  routing:
    headlessService: true
    useDNSInClusterFile: true
    publicIPSource: service
  sidecarContainer:
    enableLivenessProbe: false
    enableReadinessProbe: false
  useExplicitListenAddress: false
  mainContainer:
    imageConfigs:
      - baseImage: foundationdb/foundationdb
      - tag: 7.1.5
  version: 7.1.5

Updated to this config and the cluster was able to come up fully:

apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  name: foundationdb-cluster
  namespace: foundationdb
spec:
  automationOptions:
    killProcesses: true
    replacements:
      enabled: true
      maxConcurrentReplacements: 2
  faultDomain:
    key: foundationdb.org/none
  labels:
    filterOnOwnerReference: false
    matchLabels:
      foundationdb.org/fdb-cluster-name: foundationdb-cluster
    processClassLabels:
    - foundationdb.org/fdb-process-class
    processGroupIDLabels:
    - foundationdb.org/fdb-process-group-id
    - app.kubernetes.io/name
  minimumUptimeSecondsForBounce: 60
  processCounts:
    cluster_controller: 1
    stateless: 1
    log: 3
    storage: 3
    transaction: 2
  partialConnectionString: 
    databaseName: foundationdb_cluster
    generationID: cKicfHRW3OHgnNpYdVb2j1NHl89UKvJT
  processes:
    general:
      customParameters:
      - knob_disable_posix_kernel_aio=1
      - knob_worker_failure_time=10
      podTemplate:
        spec:
          containers:
          - name: foundationdb
            resources:
              requests:
                cpu: 500m
                memory: 2.5Gi
            securityContext:
              runAsUser: 0
          - name: foundationdb-kubernetes-sidecar
            resources:
                limits:
                  cpu: 200m
                  memory: 328Mi
                requests:
                  cpu: 200m
                  memory: 328Mi
            securityContext:
              runAsUser: 0
            livenessProbe:
              failureThreshold: 5
              periodSeconds: 30
              successThreshold: 1
              tcpSocket:
                port: 8080
              timeoutSeconds: 1
              initialDelaySeconds: 15
          initContainers:
          - name: foundationdb-kubernetes-init
            resources:
                limits:
                  cpu: 200m
                  memory: 228Mi
                requests:
                  cpu: 200m
                  memory: 228Mi
            securityContext:
              runAsUser: 0
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 16G
  replaceInstancesWhenResourcesChange: true
  routing:
    headlessService: true
    useDNSInClusterFile: true
    publicIPSource: service
  sidecarContainer:
    enableLivenessProbe: false
    enableReadinessProbe: false
  useExplicitListenAddress: false
  mainContainer:
    imageConfigs:
      - baseImage: foundationdb/foundationdb
      - tag: 7.1.5
  version: 7.1.5

In the current implementation the operator will always create a headless service if DNS is enabled. We made those changes to use the Pod DNS entries: DNS for Services and Pods | Kubernetes. Is there a reason that you cannot use another headless service, or what exactly is missing on the automatically created headless service?

Makes sense. As I’m attempting to get this to work with a Cilium multi-cluster service mesh, services need a ClusterIP to be part of the mesh. I can hit the FDB services from the other cluster if I modify the remote app’s connection string to include just the service DNS (e.g., foundationdb-cluster-storage-1.foundationdb.svc.cluster.local:4501), which, as expected, is rejected since the cluster wants the headless DNS names.

I’m currently running down an idea that involves mirroring the FDB headless service’s EndpointSlice into the remote cluster. So far that is allowing the remote app to reach the FDB cluster using the expected connection string, but the request is erroring with a timeout. I can’t say yet whether the error is in my approach or in the cluster, as the cluster status was reporting unhealthy. At the same time, I had a local app connecting to FDB successfully. I saved that status log, but it’s too large to post here.
Process counts at this point were set to

  cluster_controller: 1
  stateless: 1
  log: 3
  storage: 3
  transaction: 1

and resources were

  requests:
    cpu: 400m
    memory: 3.73Gi
  limits:
    cpu: 1000m
    memory: 5.73Gi

FDB status json Pastebin link.
Trace logs from the cluster didn’t tell me much, but here’s a snippet. 10.2.13.74:47444 is the remote app; the 10.0 addresses are local pods, and the 172 addresses are service IPs.

<Event Severity="10" Time="1671038819.605231" DateTime="2022-12-14T17:26:59Z" Type="GetClientInfoFromLeaderGotClientInfo" ID="8a5ede519d76bd5f" CommitProxy0="172.20.189.179:4501" GrvProxy0="172.20.189.179:4501" ClientID="7585bc134b4d70ec" ThreadID="3996680950543339390" Machine="172.20.194.130:4501" LogGroup="foundationdb-cluster" Roles="CD" />
<Event Severity="10" Time="1671038820.108217" DateTime="2022-12-14T17:27:00Z" Type="ConnectionFrom" ID="e71881de0c78e7c7" SuppressedEventCount="0" FromAddress="10.2.13.74:47444" ListenAddress="10.0.11.247:4501" ThreadID="3996680950543339390" Machine="172.20.194.130:4501" LogGroup="foundationdb-cluster" Roles="CD" />
<Event Severity="10" Time="1671038820.121625" DateTime="2022-12-14T17:27:00Z" Type="ConnectionEstablished" ID="e71881de0c78e7c7" SuppressedEventCount="6" Peer="10.2.13.74:47444" ConnectionId="0" ThreadID="3996680950543339390" Machine="172.20.194.130:4501" LogGroup="foundationdb-cluster" Roles="CD" />
<Event Severity="10" Time="1671038820.121625" DateTime="2022-12-14T17:27:00Z" Type="NotifyAddressHealthy" ID="0000000000000000" SuppressedEventCount="0" Address="10.2.13.74:47444" ThreadID="3996680950543339390" Machine="172.20.194.130:4501" LogGroup="foundationdb-cluster" Roles="CD" />
<Event Severity="10" Time="1671038820.121625" DateTime="2022-12-14T17:27:00Z" Type="IncomingConnection" ID="e71881de0c78e7c7" SuppressedEventCount="0" FromAddr="10.2.13.74:47444" CanonicalAddr="10.2.13.74:47444" IsPublic="0" ThreadID="3996680950543339390" Machine="172.20.194.130:4501" LogGroup="foundationdb-cluster" Roles="CD" />
<Event Severity="10" Time="1671038820.734171" DateTime="2022-12-14T17:27:00Z" Type="PingLatency" ID="0000000000000000" Elapsed="36.0005" PeerAddr="172.20.186.138:4501" MinLatency="0.000660181" MaxLatency="0.00109434" MeanLatency="0.000771463" MedianLatency="0.000750065" P90Latency="0.000909328" Count="36" BytesReceived="82512" BytesSent="23820" TimeoutCount="0" ConnectOutgoingCount="0" ConnectIncomingCount="95" ConnectFailedCount="0" ConnectMinLatency="0" ConnectMaxLatency="0" ConnectMeanLatency="0" ConnectMedianLatency="0" ConnectP90Latency="0" ThreadID="3996680950543339390" Machine="172.20.194.130:4501" LogGroup="foundationdb-cluster" Roles="CD" />
<Event Severity="10" Time="1671038820.735880" DateTime="2022-12-14T17:27:00Z" Type="ProcessTimeOffset" ID="0000000000000000" ProcessTime="1671038820.735882" SystemTime="1671038820.735882" OffsetFromSystemTime="0.000000" ThreadID="3996680950543339390" Machine="172.20.194.130:4501" LogGroup="foundationdb-cluster" Roles="CD" />
<Event Severity="10" Time="1671038821.402629" DateTime="2022-12-14T17:27:01Z" Type="NominatingLeader" ID="0000000000000000" NextNominee="0000000000000000" CurrentNominee="637647361392c4bb" Key="foundationdb_cluster:cKicfHRW3OHgnNpYdVb2j1NHl89UKvJT" ThreadID="3996680950543339390" Machine="172.20.194.130:4501" LogGroup="foundationdb-cluster" Roles="CD" />
<Event Severity="10" Time="1671038821.402629" DateTime="2022-12-14T17:27:01Z" Type="GetLeaderReply" ID="0000000000000000" SuppressedEventCount="0" Coordinator="foundationdb-cluster-storage-2.foundationdb-cluster.foundationdb.svc.cluster.local:4501" Nominee="0000000000000000" ClusterKey="foundationdb_cluster:cKicfHRW3OHgnNpYdVb2j1NHl89UKvJT" ThreadID="3996680950543339390" Machine="172.20.194.130:4501" LogGroup="foundationdb-cluster" Roles="CD" />
<Event Severity="10" Time="1671038821.402629" DateTime="2022-12-14T17:27:01Z" Type="MonitorLeaderAndGetClientInfoLeaderChange" ID="0000000000000000" NewLeader="61e6cf218d37bcc7" Key="foundationdb_cluster:cKicfHRW3OHgnNpYdVb2j1NHl89UKvJT" ThreadID="3996680950543339390" Machine="172.20.194.130:4501" LogGroup="foundationdb-cluster" Roles="CD" />
<Event Severity="10" Time="1671038821.402629" DateTime="2022-12-14T17:27:01Z" Type="LeaderChanged" ID="69fac5258d55a17a" ToID="61e6cf218d37bcc7" ThreadID="3996680950543339390" Machine="172.20.194.130:4501" LogGroup="foundationdb-cluster" Roles="CD" />
<Event Severity="10" Time="1671038822.402673" DateTime="2022-12-14T17:27:02Z" Type="NominatingLeader" ID="0000000000000000" NextNominee="637647361392c4bb" CurrentNominee="0000000000000000" Key="foundationdb_cluster:cKicfHRW3OHgnNpYdVb2j1NHl89UKvJT" ThreadID="3996680950543339390" Machine="172.20.194.130:4501" LogGroup="foundationdb-cluster" Roles="CD" />
<Event Severity="10" Time="1671038822.402673" DateTime="2022-12-14T17:27:02Z" Type="GetLeaderReply" ID="0000000000000000" SuppressedEventCount="0" Coordinator="foundationdb-cluster-storage-2.foundationdb-cluster.foundationdb.svc.cluster.local:4501" Nominee="637647361392c4bb" ClusterKey="foundationdb_cluster:cKicfHRW3OHgnNpYdVb2j1NHl89UKvJT" ThreadID="3996680950543339390" Machine="172.20.194.130:4501" LogGroup="foundationdb-cluster" Roles="CD" />
<Event Severity="10" Time="1671038822.402673" DateTime="2022-12-14T17:27:02Z" Type="MonitorLeaderAndGetClientInfoLeaderChange" ID="0000000000000000" NewLeader="61e6cf218d37bcc7" Key="foundationdb_cluster:cKicfHRW3OHgnNpYdVb2j1NHl89UKvJT" ThreadID="3996680950543339390" Machine="172.20.194.130:4501" LogGroup="foundationdb-cluster" Roles="CD" />
<Event Severity="10" Time="1671038822.402673" DateTime="2022-12-14T17:27:02Z" Type="LeaderChanged" ID="69fac5258d55a17a" ToID="61e6cf218d37bcc7" ThreadID="3996680950543339390" Machine="172.20.194.130:4501" LogGroup="foundationdb-cluster" Roles="CD" />
<Event Severity="20" Time="1671038822.988973" DateTime="2022-12-14T17:27:02Z" Type="N2_ReadError" ID="e71881de0c78e7c7" SuppressedEventCount="0" ErrorCode="2" Message="End of file" ThreadID="3996680950543339390" Machine="172.20.194.130:4501" LogGroup="foundationdb-cluster" Roles="CD" />
<Event Severity="10" Time="1671038822.988973" DateTime="2022-12-14T17:27:02Z" Type="NotifyAddressFailed" ID="0000000000000000" SuppressedEventCount="0" Address="10.2.13.74:47444" ThreadID="3996680950543339390" Machine="172.20.194.130:4501" LogGroup="foundationdb-cluster" Roles="CD" />
<Event Severity="10" Time="1671038822.988973" DateTime="2022-12-14T17:27:02Z" Type="ConnectionClosed" ID="e71881de0c78e7c7" Error="connection_failed" ErrorDescription="Network connection failed" ErrorCode="1026" SuppressedEventCount="0" PeerAddr="10.2.13.74:47444" ThreadID="3996680950543339390" Machine="172.20.194.130:4501" LogGroup="foundationdb-cluster" Roles="CD" />
<Event Severity="10" Time="1671038822.988973" DateTime="2022-12-14T17:27:02Z" Type="PeerDestroy" ID="0000000000000000" Error="connection_failed" ErrorDescription="Network connection failed" ErrorCode="1026" SuppressedEventCount="0" PeerAddr="10.2.13.74:47444" ThreadID="3996680950543339390" Machine="172.20.194.130:4501" LogGroup="foundationdb-cluster" Roles="CD" />
<Event Severity="10" Time="1671038823.734205" DateTime="2022-12-14T17:27:03Z" Type="PingLatency" ID="0000000000000000" Elapsed="36.0005" PeerAddr="172.20.189.179:4501" MinLatency="0.000243664" MaxLatency="0.00130987" MeanLatency="0.000386443" MedianLatency="0.000338078" P90Latency="0.000474691" Count="36" BytesReceived="53016" BytesSent="29168" TimeoutCount="0" ConnectOutgoingCount="0" ConnectIncomingCount="0" ConnectFailedCount="0" ConnectMinLatency="0" ConnectMaxLatency="0" ConnectMeanLatency="0" ConnectMedianLatency="0" ConnectP90Latency="0" ThreadID="3996680950543339390" Machine="172.20.194.130:4501" LogGroup="foundationdb-cluster" Roles="CD" />
<Event Severity="10" Time="1671038824.229319" DateTime="2022-12-14T17:27:04Z" Type="LocalConfigurationMetrics" ID="49a14732fce62693" Elapsed="5.00003" Snapshots="0 -1 0" ChangeRequestsFetched="0 -1 0" Mutations="0 -1 0" ThreadID="3996680950543339390" Machine="172.20.194.130:4501" LogGroup="foundationdb-cluster" Roles="CD" TrackLatestType="Original" />
<Event Severity="10" Time="1671038824.229319" DateTime="2022-12-14T17:27:04Z" Type="MachineLoadDetail" ID="0000000000000000" User="6262055" Nice="13835" System="1619707" Idle="89714417" IOWait="82706" IRQ="0" SoftIRQ="163208" Steal="2029" Guest="0" ThreadID="3996680950543339390" Machine="172.20.194.130:4501" LogGroup="foundationdb-cluster" Roles="CD" />
<Event Severity="10" Time="1671038824.229319" DateTime="2022-12-14T17:27:04Z" Type="ProcessMetrics" ID="0000000000000000" Elapsed="5.00002" CPUSeconds="0.024636" MainThreadCPUSeconds="0.023005" UptimeSeconds="413884" Memory="235266048" ResidentMemory="51638272" UnusedAllocatedMemory="131072" MbpsSent="0.0527166" MbpsReceived="0.0494526" DiskTotalBytes="15677202432" DiskFreeBytes="15660347392" DiskQueueDepth="0" DiskIdleSeconds="5.00002" DiskReads="0" DiskReadSeconds="5.00002" DiskWrites="0" DiskWriteSeconds="5.00002" DiskReadsCount="140" DiskWritesCount="479348" DiskWriteSectors="0" DiskReadSectors="0" FileWrites="0" FileReads="0" CacheReadBytes="0" CacheFinds="0" CacheWritesBlocked="0" CacheReadsBlocked="0" CachePageReadsMerged="0" CacheWrites="0" CacheReads="0" CacheHits="0" CacheMisses="0" CacheEvictions="0" DCID="[not set]" ZoneID="foundationdb-cluster-storage-2" MachineID="foundationdb-cluster-storage-2" AIOSubmitCount="0" AIOCollectCount="0" AIOSubmitLag="0" AIODiskStall="0" CurrentConnections="12" ConnectionsEstablished="0.199999" ConnectionsClosed="0.199999" ConnectionErrors="0" TLSPolicyFailures="0" ThreadID="3996680950543339390" Machine="172.20.194.130:4501" LogGroup="foundationdb-cluster" Roles="CD" TrackLatestType="Original" />
<Event Severity="10" Time="1671038824.229319" DateTime="2022-12-14T17:27:04Z" Type="MemoryMetrics" ID="0000000000000000" TotalMemory16="0" ApproximateUnusedMemory16="0" ActiveThreads16="0" TotalMemory32="131072" ApproximateUnusedMemory32="0" ActiveThreads32="1" TotalMemory64="786432" ApproximateUnusedMemory64="131072" ActiveThreads64="6" TotalMemory96="131040" ApproximateUnusedMemory96="0" ActiveThreads96="1" TotalMemory128="131072" ApproximateUnusedMemory128="0" ActiveThreads128="1" TotalMemory256="524288" ApproximateUnusedMemory256="0" ActiveThreads256="2" TotalMemory512="0" ApproximateUnusedMemory512="0" ActiveThreads512="0" TotalMemory1024="0" ApproximateUnusedMemory1024="0" ActiveThreads1024="0" TotalMemory2048="0" ApproximateUnusedMemory2048="0" ActiveThreads2048="0" TotalMemory4096="0" ApproximateUnusedMemory4096="0" ActiveThreads4096="0" TotalMemory8192="0" ApproximateUnusedMemory8192="0" ActiveThreads8192="0" TotalMemory16384="0" ApproximateUnusedMemory16384="0" ActiveThreads16384="0" HugeArenaMemory="92912" DCID="[not set]" ZoneID="foundationdb-cluster-storage-2" MachineID="foundationdb-cluster-storage-2" ThreadID="3996680950543339390" Machine="172.20.194.130:4501" LogGroup="foundationdb-cluster" Roles="CD" />
<Event Severity="10" Time="1671038824.229319" DateTime="2022-12-14T17:27:04Z" Type="FastAllocMemoryUsage" ID="0000000000000000" TotalMemory="1703904" UnusedMemory="131072" Utilization="92.307548%" ThreadID="3996680950543339390" Machine="172.20.194.130:4501" LogGroup="foundationdb-cluster" Roles="CD" />
<Event Severity="10" Time="1671038824.229319" DateTime="2022-12-14T17:27:04Z" Type="NetworkMetrics" ID="0000000000000000" Elapsed="5.00002" CantSleep="0" WontSleep="0" Yields="1" YieldCalls="141" YieldCallsTrue="0" RunLoopProfilingSignals="0" YieldBigStack="0" RunLoopIterations="393" TimersExecuted="572" TasksExecuted="562" ASIOEventsProcessed="428" ReadCalls="257" WriteCalls="123" ReadProbes="124" WriteProbes="0" PacketsRead="126" PacketsGenerated="126" WouldBlock="124" LaunchTime="0" ReactTime="0.000844479" DCID="[not set]" ZoneID="foundationdb-cluster-storage-2" MachineID="foundationdb-cluster-storage-2" SlowTask2M="1" PriorityBusy0="4.9861" PriorityStarvedBelow1="0.013937" PriorityMaxStarvedBelow1="0.00108194" PriorityStarvedBelow3500="0.013937" PriorityMaxStarvedBelow3500="0.00108194" PriorityStarvedBelow7000="0.013937" PriorityMaxStarvedBelow7000="0.00108194" PriorityStarvedBelow7500="0.0123634" PriorityMaxStarvedBelow7500="0.00108194" PriorityStarvedBelow8500="0.0119841" PriorityMaxStarvedBelow8500="0.00108194" PriorityStarvedBelow8900="0.0118778" PriorityMaxStarvedBelow8900="0.00108194" PriorityStarvedBelow10500="0.00340915" PriorityMaxStarvedBelow10500="0.00108194" ThreadID="3996680950543339390" Machine="172.20.194.130:4501" LogGroup="foundationdb-cluster" Roles="CD" TrackLatestType="Original" />
<Event Severity="10" Time="1671038824.229319" DateTime="2022-12-14T17:27:04Z" Type="MachineMetrics" ID="0000000000000000" Elapsed="5.00002" MbpsSent="0" MbpsReceived="0" OutSegs="0" RetransSegs="0" CPUSeconds="0.662288" TotalMemory="8080842752" CommittedMemory="1159446528" AvailableMemory="6921396224" DCID="[not set]" ZoneID="foundationdb-cluster-storage-2" MachineID="foundationdb-cluster-storage-2" ThreadID="3996680950543339390" Machine="172.20.194.130:4501" LogGroup="foundationdb-cluster" Roles="CD" TrackLatestType="Original" />

Yesterday I upgraded the EKS clusters to a larger instance type that can handle giving the FDB processes the recommended 8 GB and 1 CPU, in case that was contributing. I haven’t finished reconnecting the cluster mesh since the migration, so it’s still unknown whether that was contributing to the timeouts in the remote cluster. So far the cluster status is healthy.

If by generationID you mean this string 9Iscf7sSnSCx3uJQH3JGf96GvfSQ5wPQ, it’s expected that the generationID changes when new coordinators are selected. The generationID is used to distinguish different coordinator “generations”.

I see that the docs say seedConnectionString isn’t used after the initial reconciliation, but partialConnectionString doesn’t say anything about that. You’re saying that partialConnectionString also only applies to the initial reconciliation?

partialConnectionString: 
    databaseName: foundationdb_cluster
    generationID: cKicfHRW3OHgnNpYdVb2j1NHl89UKvJT

The connection string value that you showed there, is that the connection string from /var/dynamic-conf/fdb.cluster or the content from /var/fdb/data/fdb.cluster? The latter (/var/fdb/data/...) is the connection string file managed by the fdbserver process itself, so it’s expected that you see IP addresses there, since the fdbserver process was resolving those.

I saw the connection string with the <ip>(fromHostname) entries in the ConfigMap, which would be the dynamic-conf mount. I haven’t encountered that since.

Thanks for all the help.

Wanted to post an update, as I have finally managed to get FoundationDB working across clusters. In the end it was pretty simple; it was just a journey of understanding all the pieces to get there.

I have two clusters deployed on EKS joined via VPC-peering. Cilium is running on both clusters and both are part of the cluster mesh. This solution would likely work with any cluster network setup that grants direct pod IP access.

The FoundationDBCluster routing options were set to:

  useDNSInClusterFile: true
  headlessService: true
  publicIPSource: pod

The biggest change was switching publicIPSource from ‘service’ to ‘pod’. I discovered that when I ran an fdbcli status from my remote cluster, it returned an error that it couldn’t reach the cluster coordinator, and that it was trying to resolve it using the service IP (the cluster_controller’s public IP). Since only the pod IPs are visible across the clusters, my remote service couldn’t resolve the controller. Once I learned the public IP was used in this way, I changed publicIPSource to ‘pod’.

Since the headless service is still used, using the pod public IP source doesn’t change the connection string so we still get the benefit of the DNS names.

The other half of the puzzle is taking the EndpointSlice from the FDB headless service and mirroring it over to the remote cluster, where another headless service lives. For testing purposes, that mirroring is done manually. To automate it, I’m looking into something like an operator/script/watcher and utilizing the Cilium service mesh. We need something to mirror the FDB cluster file anyway, so it’s not as inconvenient as I first thought it would be.
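
For anyone curious, the manual mirror in the remote cluster looks roughly like this (a sketch with example names and pod IPs; the real slice just copies the addresses and hostnames from the FDB cluster’s headless service endpoints):

# Headless Service in the remote cluster with the same name/namespace as on the FDB side,
# so the DNS names in the cluster file resolve in the remote cluster too.
apiVersion: v1
kind: Service
metadata:
  name: foundationdb-cluster
  namespace: foundationdb
spec:
  clusterIP: None
  ports:
    - name: fdb
      port: 4501
---
# Manually managed EndpointSlice pointing at the FDB pod IPs in the other cluster (example values).
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: foundationdb-cluster-mirror
  namespace: foundationdb
  labels:
    kubernetes.io/service-name: foundationdb-cluster   # ties the slice to the Service above
addressType: IPv4
ports:
  - name: fdb
    port: 4501
    protocol: TCP
endpoints:
  - addresses: ["10.0.3.74"]                            # example pod IP from the other cluster
    hostname: foundationdb-cluster-storage-1            # gives <hostname>.<service>.<ns>.svc DNS records
    conditions: { ready: true }
  - addresses: ["10.0.11.67"]
    hostname: foundationdb-cluster-storage-2
    conditions: { ready: true }
  - addresses: ["10.0.13.169"]
    hostname: foundationdb-cluster-storage-3
    conditions: { ready: true }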

Great to hear that you managed to set everything up. If you want, you can write a setup document in the fdb-kubernetes-operator repo, or if you’d rather write a blog post about this, I’m happy to reference it from the operator repo.

Hi @KRKroening,

Could you share some information about your multi-cluster setup?
Do your clusters use a multi-DC configuration, or did you set up each cluster separately by hand?

I plan to use a multi-cluster setup but I don’t have a final solution yet, so your help would be much appreciated.
Thanks

Just as reference if others have the same question: Run FoundationDB cluster on multi Kuberbetes clusters - #3 by johscheuer

I have FDB running on a single cluster and my services running on the second cluster. Sounds like the link johscheuer provided aligns more with what you are looking for.

Hi @KRKroening

Can you help me a little bit in this post?
https://forums.foundationdb.org/t/run-foundationdb-cluster-on-multi-kuberbetes-clusters/3741/4

Hi @KRKroening,
Would you share your cluster information (instance type, eksctl cluster configuration, networking, …) for a successful FDB operator deployment on EKS?

I created an EKS cluster and deployed FDB, but it only brings up the fdb-kubernetes-operator and the cluster-controller. The storage, log, transaction, … pods cannot be scheduled onto any node in the cluster, so they don’t have an IP and I cannot connect to them. Those pods are stuck in the Pending state and never start.

As I commented in the other thread, that’s an issue with your EKS cluster configuration that prevents EBS volumes from being created. You probably want to check this guide: Storage classes - Amazon EKS, and debug the issue from there.
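
For reference, a working StorageClass for EKS with the EBS CSI driver looks something like this (a sketch; it assumes the aws-ebs-csi-driver add-on is installed, as described in the guide above):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"   # make it the default for PVCs without a storageClassName
provisioner: ebs.csi.aws.com                               # requires the EBS CSI driver add-on
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer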