When running on OCP using the cluster_local_tls.yaml settings, editing that YAML to add proxy: 1 and resolver: 1 caused one storage pod to be terminated and restarted, and the whole cluster became unreachable. The new proxy and resolver pods only have the processID file created and no cluster file. The storage pod that was recreated also only has the processID file.
Deleting the cluster and re-running with the same deployment file created all the pods and the database came up fine. So this only occurs when adding the proxy and resolver to an existing, running cluster.
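For reference, the change was essentially this addition to the processCounts section of the running cluster's spec (the surrounding values are copied from the full YAML below):

processCounts:
  stateless: -1
  cluster_controller: 1
  storage: 4
  # added while the cluster was running:
  resolver: 1
  proxy: 1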
Can you share the modified YAML file that presented this problem?
Here it is.
# This file provides an example of a cluster you can run in a local testing
# environment.
apiVersion: apps.foundationdb.org/v1beta1
kind: FoundationDBCluster
metadata:
  name: sample-cluster
spec:
  version: 6.2.29
  faultDomain:
    key: foundationdb.org/none
  services:
    headless: true
  processCounts:
    stateless: -1
    cluster_controller: 1
    resolver: 1
    proxy: 1
    storage: 4
  processes:
    general:
      customParameters:
        - "knob_disable_posix_kernel_aio=1"
        - "locality_test=1"
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: "10G"
      podTemplate:
        spec:
          volumes:
            - name: fdb-certs
              secret:
                secretName: fdb-kubernetes-operator-secrets
          containers:
            - name: foundationdb
              env:
                - name: FDB_TLS_CERTIFICATE_FILE
                  value: /tmp/fdb-certs/tls.crt
                - name: FDB_TLS_CA_FILE
                  value: /tmp/fdb-certs/tls.crt
                - name: FDB_TLS_KEY_FILE
                  value: /tmp/fdb-certs/tls.key
              volumeMounts:
                - name: fdb-certs
                  mountPath: /tmp/fdb-certs
              resources:
                requests:
                  cpu: 250m
                  memory: 128Mi
            - name: foundationdb-kubernetes-sidecar
              env:
                - name: FDB_TLS_CERTIFICATE_FILE
                  value: /tmp/fdb-certs/tls.crt
                - name: FDB_TLS_CA_FILE
                  value: /tmp/fdb-certs/tls.crt
                - name: FDB_TLS_KEY_FILE
                  value: /tmp/fdb-certs/tls.key
              volumeMounts:
                - name: fdb-certs
                  mountPath: /tmp/fdb-certs
              resources:
                requests:
                  cpu: 100m
                  memory: 128Mi
                limits:
                  cpu: 100m
                  memory: 128Mi
          initContainers:
            - name: foundationdb-kubernetes-init
              resources:
                requests:
                  cpu: 100m
                  memory: 128Mi
                limits:
                  cpu: 100m
                  memory: 128Mi
          securityContext:
            serviceAccountName: foundationdbsa
            allowPrivilegeEscalation: false
            privileged: false
            readOnlyRootFilesystem: false
  mainContainer:
    enableTls: true
  sidecarContainer:
    enableTls: true
Moderator ninja edit: fixed formatting
Looks like the forum messed up the formatting, so I can’t apply it directly, but looking at that I don’t see anything that I would expect to cause a database to go down. That being said, I’m not sure what your intent is in running a single process with those process classes, and I wouldn’t recommend that kind of configuration in general.
Could you provide some more information, e.g. which operator version are you using? I just tested this with the latest operator release and the cluster stays available. The only issue is that the cluster never reconciles, because of the following errors:
Time="1622886189.954714" Severity="10" LogGroup="test-cluster" Process="fdbserver.1": Launching /usr/bin/fdbserver (51) for fdbserver.1
Time="1622886189.957407" Severity="40" LogGroup="test-cluster" Process="fdbserver.1": ERROR: Unknown machine class `resolver'
Time="1622886189.957432" Severity="40" LogGroup="test-cluster" Process="fdbserver.1": Try `/usr/bin/fdbserver --help' for more information.
Time="1622886189.957832" Severity="40" LogGroup="test-cluster" Process="fdbserver.1": Process 51 exited 1, restarting in 56 seconds
I raised an issue on GitHub to look into that. Edit: this is the issue: Use resolver in process counts leads to an error · Issue #745 · FoundationDB/fdb-kubernetes-operator · GitHub
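As a possible workaround while that issue is open (this is an assumption about the root cause of the error above, not something I have verified on your cluster): fdbserver's machine class for this role is spelled resolution rather than resolver, so using that key in processCounts might avoid the unknown-class error, assuming your operator version accepts it:

processCounts:
  stateless: -1
  cluster_controller: 1
  resolution: 1  # assumed spelling; `resolver` is the key that produced the error above
  proxy: 1
  storage: 4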
I am testing with operator version 0.35.1.
And yes, sometimes I have a perfectly healthy cluster through expand and shrink, and other times I run into this. It's not very consistent. Basically, my intent is just to test shrinking and expanding the cluster in an OpenShift environment to see whether things are OK. I think we can park this one for now.
And the cluster is completely dead and doesn't recover anymore? The next time you notice that, it would be great to have the logs from the operator to see what the issue might be.
Just to clarify: are you expanding the cluster's stateful and stateless processes? Stateless processes don't have any effect on the coordinators, so that would be suspicious.
ok.
To clarify, the issue only occurred when adding 1 proxy and 1 resolver to an existing, running cluster that had no proxy or resolver.
Expanding and shrinking stateless processes does not have this kind of issue.
The operator will automatically configure proxies and resolvers, and if you want to add more, I would recommend using the database configuration rather than the process counts. We have some more discussion on this in the user manual.
Setting the process counts for proxy to 1 will create a single process with the parameter --class=proxy, which means it is preferred for the proxy role over other stateless processes. It will not increase the number of processes that the database recruits with the proxy role. If you increase the proxies in the database configuration, it will both increase the number of recruited proxies and provision additional processes with --class=stateless to handle that work.
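As a sketch of that recommended approach (the field names follow the FoundationDBCluster spec; the counts here are only illustrative, not a recommendation for your cluster):

spec:
  databaseConfiguration:
    proxies: 3
    resolvers: 1
    logs: 3

With this, the operator should both apply the corresponding database configuration change and provision the extra stateless pods it needs, rather than just changing the --class of a single process.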