When running on OCP using the cluster_local_tls.yaml settings, editing that YAML to add proxy: 1 and resolver: 1 caused one storage pod to be terminated and restarted, and the whole cluster became unreachable. The new proxy and resolver pods only have the processID file created and no cluster file. The storage pod that was recreated also only has the processID file.
Deleting the cluster and re-running with the same deployment file created all the pods and the database came up fine. So this only occurs when adding the proxy and resolver to an existing, running cluster.
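For reference, the change was essentially this addition to the processCounts section of the running cluster's spec (the surrounding values are copied from the full YAML below):

processCounts:
  stateless: -1
  cluster_controller: 1
  storage: 4
  # added while the cluster was running:
  resolver: 1
  proxy: 1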
Can you share the modified YAML file that presented this problem?
Here it is.
# This file provides an example of a cluster you can run in a local testing
# environment.
apiVersion: apps.foundationdb.org/v1beta1
kind: FoundationDBCluster
metadata:
  name: sample-cluster
spec:
  version: 6.2.29
  faultDomain:
    key: foundationdb.org/none
  services:
    headless: true
  processCounts:
    stateless: -1
    cluster_controller: 1
    resolver: 1
    proxy: 1
    storage: 4
  processes:
    general:
      customParameters:
        - "knob_disable_posix_kernel_aio=1"
        - "locality_test=1"
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: "10G"
      podTemplate:
        spec:
          volumes:
            - name: fdb-certs
              secret:
                secretName: fdb-kubernetes-operator-secrets
          containers:
            - name: foundationdb
              env:
                - name: FDB_TLS_CERTIFICATE_FILE
                  value: /tmp/fdb-certs/tls.crt
                - name: FDB_TLS_CA_FILE
                  value: /tmp/fdb-certs/tls.crt
                - name: FDB_TLS_KEY_FILE
                  value: /tmp/fdb-certs/tls.key
              volumeMounts:
                - name: fdb-certs
                  mountPath: /tmp/fdb-certs
              resources:
                requests:
                  cpu: 250m
                  memory: 128Mi
            - name: foundationdb-kubernetes-sidecar
              env:
                - name: FDB_TLS_CERTIFICATE_FILE
                  value: /tmp/fdb-certs/tls.crt
                - name: FDB_TLS_CA_FILE
                  value: /tmp/fdb-certs/tls.crt
                - name: FDB_TLS_KEY_FILE
                  value: /tmp/fdb-certs/tls.key
              volumeMounts:
                - name: fdb-certs
                  mountPath: /tmp/fdb-certs
              resources:
                requests:
                  cpu: 100m
                  memory: 128Mi
                limits:
                  cpu: 100m
                  memory: 128Mi
          initContainers:
            - name: foundationdb-kubernetes-init
              resources:
                requests:
                  cpu: 100m
                  memory: 128Mi
                limits:
                  cpu: 100m
                  memory: 128Mi
          securityContext:
            serviceAccountName: foundationdbsa
            allowPrivilegeEscalation: false
            privileged: false
            readOnlyRootFilesystem: false
  mainContainer:
    enableTls: true
  sidecarContainer:
    enableTls: true
Moderator ninja edit: fixed formatting
Looks like the forum messed up the formatting, so I can’t apply it directly, but looking at that I don’t see anything that I would expect to cause a database to go down. That being said, I’m not sure what your intent is in running a single process with those process classes, and I wouldn’t recommend that kind of configuration in general.
Could you provide some more information, e.g. which operator version are you using? I just tested this with the latest operator release and the cluster stays available. The only issue is that the cluster never reconciles, because of the following errors:
Time="1622886189.954714" Severity="10" LogGroup="test-cluster" Process="fdbserver.1": Launching /usr/bin/fdbserver (51) for fdbserver.1
Time="1622886189.957407" Severity="40" LogGroup="test-cluster" Process="fdbserver.1": ERROR: Unknown machine class `resolver'
Time="1622886189.957432" Severity="40" LogGroup="test-cluster" Process="fdbserver.1": Try `/usr/bin/fdbserver --help' for more information.
Time="1622886189.957832" Severity="40" LogGroup="test-cluster" Process="fdbserver.1": Process 51 exited 1, restarting in 56 seconds
I raised an issue on GitHub to look into that. Edit: this is the issue: Use resolver in process counts leads to an error · Issue #745 · FoundationDB/fdb-kubernetes-operator · GitHub
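As a possible workaround while that issue is open (this is an assumption about the root cause of the error above, not something I have verified on your cluster): fdbserver's machine class for this role is spelled resolution rather than resolver, so using that key in processCounts might avoid the unknown-class error, assuming your operator version accepts it:

processCounts:
  stateless: -1
  cluster_controller: 1
  resolution: 1  # assumed spelling; `resolver` is the key that produced the error above
  proxy: 1
  storage: 4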
I am testing with operator version 0.35.1.
And yes, sometimes I have a perfectly healthy cluster through expand and shrink, and other times I run into this. It's not very consistent. Basically, my intent is just to test shrinking and expanding the cluster in an OpenShift environment to see whether things are OK. I think we can park this one for now.
And the cluster is completely dead and doesn't recover anymore? The next time you notice that, it would be great to have the logs from the operator to see what the issue might be.
Just to clarify: are you expanding the cluster's stateful and stateless processes? Stateless processes don't have any effect on the coordinators, so that would be suspicious.
ok.
To clarify, the issue only occurred when adding 1 proxy and 1 resolver to an existing, running cluster that had no proxy or resolver.
Expanding and shrinking stateless processes does not have this kind of issue.
The operator will automatically configure proxies and resolvers, and if you want to add more, I would recommend using the database configuration rather than the process counts. We have some more discussion on this in the user manual.
Setting the process counts for proxy to 1 will create a single process with the parameter --class=proxy, which means it is preferred for the proxy role over other stateless processes. It will not increase the number of processes that the database recruits with the proxy role. If you increase the proxies in the database configuration, it will both increase the number of recruited proxies and provision additional processes with --class=stateless to handle that work.
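As a sketch of that recommended approach (the field names follow the FoundationDBCluster spec; the counts here are only illustrative, not a recommendation for your cluster):

spec:
  databaseConfiguration:
    proxies: 3
    resolvers: 1
    logs: 3

With this, the operator should both apply the corresponding database configuration change and provision the extra stateless pods it needs, rather than just changing the --class of a single process.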