How to upgrade apple operator 0.48 to 1.4.1 without deleting the old crd?

@johscheuer after the upgrade the cluster fails to connect, and the connection string still contains the old addresses. But if I delete the FoundationDBCluster and recreate a new one in the same environment with the same YAML file, the newly created FoundationDBCluster can be connected. So I think this is a FoundationDB cluster upgrade issue, not an operator upgrade issue. The following is the FoundationDBCluster YAML output from the fresh install:

...
apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"foundationdb.opencontent.ibm.com/v1","kind":"FdbCluster","metadata":{"annotations":{},"name":"mdm-foundationdb-ibm","namespace":"testoperator1"},"spec":{"backup_agents_config":{"deploymentSa":"fdb-controller-manager","pvcCapacity":"10G","pvcStorageClass":""},"backup_agents_template":{"metadata":{"labels":{"backup-custom-label":"backup-custom-value"}},"spec":{"containers":[{"env":[{"name":"FDB_TLS_CERTIFICATE_FILE","value":"/var/fdb-certs/tls.crt"},{"name":"FDB_TLS_KEY_FILE","value":"/var/fdb-certs/tls.key"},{"name":"FDB_TLS_CA_FILE","value":"/var/fdb-certs/ca.crt"}],"imagePullPolicy":"Always","name":"backup-agent","resources":{"limits":{"cpu":"500m","ephemeral-storage":"150Mi","memory":"1024Mi"},"requests":{"cpu":"500m","ephemeral-storage":"100Mi","memory":"1024Mi"}},"volumeMounts":[{"mountPath":"/var/fdb-certs","name":"fdb-certs"}]}],"serviceAccountName":"ibm-fdb-controller-manager","volumes":[{"name":"fdb-certs","secret":{"secretName":"internal-tls"}}]}},"foundationdb_cluster_spec":{"databaseConfiguration":null,"mainContainer":{"enableTls":true},"processCounts":{"proxy":2,"stateless":1},"processes":{"general":{"allowTagOverride":true,"podTemplate":{"metadata":{"labels":{"customlabel":"customvalue"}},"spec":{"containers":[{"env":[{"name":"FDB_TLS_CERTIFICATE_FILE","value":"/var/fdb-certs/tls.crt"},{"name":"FDB_TLS_KEY_FILE","value":"/var/fdb-certs/tls.key"},{"name":"FDB_TLS_CA_FILE","value":"/var/fdb-certs/ca.crt"}],"imagePullPolicy":"Always","livenessProbe":{"exec":{"command":["/bin/sh","-c","ps -ef | grep fdbmonitor"]},"initialDelaySeconds":60,"periodSeconds":30},"name":"foundationdb","resources":{"requests":{"cpu":"150m","memory":"256Mi"}},"volumeMounts":[{"mountPath":"/var/fdb-certs","name":"fdb-certs"}]},{"env":[{"name":"FDB_TLS_CERTIFICATE_FILE","value":"/var/fdb-certs/tls.crt"},{"name":"FDB_TLS_KEY_FILE","value":"/var/fdb-certs/tls.key"},{"name":"FDB_TLS_CA_FILE","value":"/var/fdb-certs/ca.crt"}],"imagePullPolicy":"Always","name":"foundationdb-kubernetes-sidecar","resources":{"requests":{"cpu":"150m","memory":"256Mi"}},"volumeMounts":[{"mountPath":"/var/fdb-certs","name":"fdb-certs"}]}],"initContainers":[{"env":[{"name":"FDB_TLS_CERTIFICATE_FILE","value":"/var/fdb-certs/tls.crt"},{"name":"FDB_TLS_KEY_FILE","value":"/var/fdb-certs/tls.key"},{"name":"FDB_TLS_CA_FILE","value":"/var/fdb-certs/ca.crt"}],"imagePullPolicy":"Always","name":"foundationdb-kubernetes-init","resources":{"requests":{"cpu":"150m","memory":"256Mi"}},"volumeMounts":[{"mountPath":"/var/fdb-certs","name":"fdb-certs"}]}],"securityContext":{"allowPrivilegeEscalation":false,"privileged":false,"readOnlyRootFilesystem":true},"volumes":[{"name":"fdb-certs","secret":{"secretName":"internal-tls"}}]}},"volumeClaimTemplate":{"spec":{"resources":{"requests":{"storage":"2Gi"}},"storageClassName":""}}}},"sidecarContainer":{"enableTls":true},"skip":false,"version":"6.3.24"},"ignoreForMaintenance":false,"restore_job_config":{"sa":"restore-job-sa"},"restore_job_template":{"metadata":{"labels":{"restore-custom-label":"restore-custom-value"}},"spec":{"containers":[{"env":[{"name":"FDB_TLS_CERTIFICATE_FILE","value":"/var/fdb-certs/tls.crt"},{"name":"FDB_TLS_KEY_FILE","value":"/var/fdb-certs/tls.key"},{"name":"FDB_TLS_CA_FILE","value":"/var/fdb-certs/ca.crt"}],"name":"restore","volumeMounts":[{"mountPath":"/var/fdb-certs","name":"fdb-certs"}]}],"serviceAccountName":"ibm-fdb-controller-manager","volumes":[{"name":"fdb-certs","secret":{"secretName":"internal-tls"}}]}},"shutdown":"false","size":"small"}}
  creationTimestamp: "2022-09-09T14:01:25Z"
  generation: 1
  labels:
    fdb-cluster: mdm-foundationdb-ibm
  name: mdm-foundationdb-ibm
  namespace: testoperator1
  ownerReferences:
  - apiVersion: foundationdb.opencontent.ibm.com/v1
    controller: true
    kind: FdbCluster
    name: mdm-foundationdb-ibm
    uid: 5b5d4879-ecd5-44c8-97a2-8efcdde261c2
  resourceVersion: "42692956"
  uid: 55eed2b7-57e1-44bb-9fe6-02e128a3bca3
spec:
  automationOptions:
    deletionMode: Zone
    podUpdateStrategy: ReplaceTransactionSystem
    removalMode: Zone
    replacements:
      maxConcurrentReplacements: 1
  buggify: {}
  databaseConfiguration:
    storage_engine: ssd-2
  faultDomain: {}
  labels: {}
  lockOptions: {}
  mainContainer:
    enableTls: true
    imageConfigs:
    - baseImage: cp.stg.icr.io/cp/cpd/fdb
      tag: 6.3.24
  minimumUptimeSecondsForBounce: 600
  partialConnectionString: {}
  processCounts:
    proxy: 2
    stateless: 1
  processes:
    general:
      podTemplate:
        metadata:
          labels:
            customlabel: customvalue
        spec:
          automountServiceAccountToken: false
          containers:
          - env:
            - name: FDB_TLS_CERTIFICATE_FILE
              value: /var/fdb-certs/tls.crt
            - name: FDB_TLS_KEY_FILE
              value: /var/fdb-certs/tls.key
            - name: FDB_TLS_CA_FILE
              value: /var/fdb-certs/ca.crt
            imagePullPolicy: Always
            livenessProbe:
              exec:
                command:
                - /bin/sh
                - -c
                - ps -ef | grep fdbmonitor
              initialDelaySeconds: 60
              periodSeconds: 30
            name: foundationdb
            resources:
              requests:
                cpu: 150m
                memory: 256Mi
            volumeMounts:
            - mountPath: /var/fdb-certs
              name: fdb-certs
          - env:
            - name: FDB_TLS_CERTIFICATE_FILE
              value: /var/fdb-certs/tls.crt
            - name: FDB_TLS_KEY_FILE
              value: /var/fdb-certs/tls.key
            - name: FDB_TLS_CA_FILE
              value: /var/fdb-certs/ca.crt
            imagePullPolicy: Always
            name: foundationdb-kubernetes-sidecar
            resources:
              requests:
                cpu: 150m
                memory: 256Mi
            volumeMounts:
            - mountPath: /var/fdb-certs
              name: fdb-certs
          initContainers:
          - env:
            - name: FDB_TLS_CERTIFICATE_FILE
              value: /var/fdb-certs/tls.crt
            - name: FDB_TLS_KEY_FILE
              value: /var/fdb-certs/tls.key
            - name: FDB_TLS_CA_FILE
              value: /var/fdb-certs/ca.crt
            imagePullPolicy: Always
            name: foundationdb-kubernetes-init
            resources:
              requests:
                cpu: 150m
                memory: 256Mi
            volumeMounts:
            - mountPath: /var/fdb-certs
              name: fdb-certs
          securityContext: {}
          volumes:
          - name: fdb-certs
            secret:
              secretName: internal-tls
      volumeClaimTemplate:
        metadata: {}
        spec:
          resources:
            requests:
              storage: 2Gi
          storageClassName: managed-nfs-storage
        status: {}
  replaceInstancesWhenResourcesChange: false
  routing: {}
  sidecarContainer:
    enableTls: true
    imageConfigs:
    - baseImage: cp.stg.icr.io/cp/cpd/fdb-sidecar
      tag: 6.3.24-1
  skip: false
  version: 6.3.24
status:
  configured: true
  connectionString: mdm_foundationdb_ibm:yTnSIZXwGH6YgVT8sZHy4vLAcX3zkscn@10.254.16.114:4500:tls,10.254.13.255:4500:tls,10.254.22.27:4500:tls
  databaseConfiguration:
    log_routers: -1
    logs: 3
    proxies: 3
    redundancy_mode: double
    remote_logs: -1
    resolvers: 1
    storage_engine: ssd-2
    usable_regions: 1
  generations:
    reconciled: 1
  hasListenIPsForAllPods: true
  health:
    available: true
    fullReplication: true
    healthy: true
  imageTypes:
  - split
  locks: {}
  processGroups:
  - addresses:
    - 10.254.16.115
    processClass: log
    processGroupID: log-1
  - addresses:
    - 10.254.22.28
    processClass: log
    processGroupID: log-2
  - addresses:
    - 10.254.14.0
    processClass: log
    processGroupID: log-3
  - addresses:
    - 10.254.16.116
    processClass: log
    processGroupID: log-4
  - addresses:
    - 10.254.13.253
    processClass: proxy
    processGroupID: proxy-1
  - addresses:
    - 10.254.22.26
    processClass: proxy
    processGroupID: proxy-2
  - addresses:
    - 10.254.13.254
    processClass: stateless
    processGroupID: stateless-1
  - addresses:
    - 10.254.16.114
    processClass: storage
    processGroupID: storage-1
  - addresses:
    - 10.254.13.255
    processClass: storage
    processGroupID: storage-2
  - addresses:
    - 10.254.22.27
    processClass: storage
    processGroupID: storage-3
  requiredAddresses:
    tls: true
  runningVersion: 6.3.24
  storageServersPerDisk:
  - 1
...

So is there any suggestion for upgrading a FoundationDB cluster in a Kubernetes environment? Thanks!

I also found some error messages in the FDB trace file:

...
<Event Severity="10" Time="1662798652.099807" DateTime="2022-09-10T08:30:52Z" Type="ConnectionClosed" ID="0000000000000000" Error="connection_failed" ErrorDescription="Network connection failed" ErrorCode="1026" SuppressedEventCount="0" PeerAddr="10.254.14.37:4500:tls" ThreadID="9140593948499107393" Machine="10.254.16.126:4500" LogGroup="mdm-foundationdb-ibm" Roles="SS" />
<Event Severity="10" Time="1662798652.099807" DateTime="2022-09-10T08:30:52Z" Type="PeerDestroy" ID="0000000000000000" Error="connection_failed" ErrorDescription="Network connection failed" ErrorCode="1026" SuppressedEventCount="0" PeerAddr="10.254.14.37:4500:tls" ThreadID="9140593948499107393" Machine="10.254.16.126:4500" LogGroup="mdm-foundationdb-ibm" Roles="SS" />
<Event Severity="20" Time="1662798652.099807" DateTime="2022-09-10T08:30:52Z" Type="N2_ConnectError" ID="83b86bdfc84b449d" SuppressedEventCount="0" ErrorCode="125" Message="Operation canceled" ThreadID="9140593948499107393" Machine="10.254.16.126:4500" LogGroup="mdm-foundationdb-ibm" Roles="SS" />
<Event Severity="20" Time="1662798652.497951" DateTime="2022-09-10T08:30:52Z" Type="Net2RunLoopTrace" ID="0000000000000000" TraceTime="1662798652.467373" Trace="addr2line -e fdbserver.debug -p -C -f -i 0x7f07b52a5ce0 0x7f07b52a2a25 0x25e3d8e 0x25e5798 0x7ce35d 0x7f07b4f08cf3 0x818172" ThreadID="9140593948499107393" Machine="10.254.16.126:4500" LogGroup="mdm-foundationdb-ibm" Roles="SS" />
<Event Severity="20" Time="1662798652.748072" DateTime="2022-09-10T08:30:52Z" Type="Net2RunLoopTrace" ID="0000000000000000" TraceTime="1662798652.717581" Trace="addr2line -e fdbserver.debug -p -C -f -i 0x7f07b52a5ce0 0x7f07b52a2a25 0x25e3d8e 0x25e5798 0x7ce35d 0x7f07b4f08cf3 0x818172" ThreadID="9140593948499107393" Machine="10.254.16.126:4500" LogGroup="mdm-foundationdb-ibm" Roles="SS" />
...

@johscheuer I just wonder why the fdb processes cannot join the cluster after the upgrade. Are there any specific steps for an FDB cluster upgrade in a Kubernetes environment? I did the following steps:

1. Set skip to true in the FoundationDBCluster spec.
2. Remove all Pods.
3. Replace the image from 6.2 to 6.3 in the cluster YAML file.
4. Set skip to false.
5. Reapply the new cluster YAML file with the 6.3 image.
6. All the Pods start up and are running.
7. But the fdb processes cannot join the cluster, and all coordinator IPs in the cluster file are old.

Are the above steps correct for an FDB upgrade in a Kubernetes environment? Or is there a formal upgrade procedure or guide for it? Thanks!

What is the reason that you set the cluster into skip mode and deleted all Pods manually? The upgrade process is described in our manual: fdb-kubernetes-operator/operations.md at main · FoundationDB/fdb-kubernetes-operator · GitHub, and changing the version should be enough.
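For the cluster above, that boils down to bumping only spec.version and letting the operator drive the rollout, e.g. (a minimal sketch; editing the field in your YAML and re-applying it works just as well):

# bump only the version; the operator restarts the fdbserver processes and
# then updates the Pods according to the configured update strategy
kubectl -n testoperator1 patch foundationdbcluster mdm-foundationdb-ibm \
  --type merge -p '{"spec":{"version":"6.3.24"}}'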

If you use a custom imageConfig, you obviously need to add the new image config before you update the version field, or in the same step. Our manual covers examples of how to set a version-specific imageConfig: fdb-kubernetes-operator/customization.md at main · FoundationDB/fdb-kubernetes-operator · GitHub.
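For the custom images in the spec above, a version-specific imageConfig could look roughly like this fragment of the FoundationDBCluster spec (the 6.2 entries and their tags are placeholders; use whatever tags actually exist in your registry):

spec:
  version: 6.3.24
  mainContainer:
    imageConfigs:
    # keep an entry for the version you are upgrading from, so Pods that are
    # still on the old version keep resolving the right image
    - baseImage: cp.stg.icr.io/cp/cpd/fdb
      version: 6.2.29        # placeholder for the old version/tag
      tag: 6.2.29
    - baseImage: cp.stg.icr.io/cp/cpd/fdb
      version: 6.3.24
      tag: 6.3.24
  sidecarContainer:
    imageConfigs:
    - baseImage: cp.stg.icr.io/cp/cpd/fdb-sidecar
      version: 6.2.29        # placeholder
      tag: 6.2.29-1
    - baseImage: cp.stg.icr.io/cp/cpd/fdb-sidecar
      version: 6.3.24
      tag: 6.3.24-1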

edit: If you delete all Pods manually and don’t use a Service in front of each Pod, it’s expected that the cluster file contains the old IPs: fdb-kubernetes-operator/warnings.md at main · FoundationDB/fdb-kubernetes-operator · GitHub

Let me know if you still have questions after reviewing and validating the steps in our manual. If any step needs more documentation feel free to create a pull request in GitHub with the additional documentation.

@johscheuer, I had previously tried applying the new image directly, but after the upgrade the Pod count was higher than before the upgrade, and when I checked the images in the Pods, some were old and some were new. So I had to set skip to true to remove all the Pods, but that failed.

Today I followed your guide and tried again, and the extra Pods and mixed image versions issue appeared again. But I didn’t do anything more today and just waited… After about 20 minutes everything went well: the Pod count became the same as before, and the FDB cluster could be connected this time!

Thank you so much for your help!

@johscheuer, I had previously tried applying the new image directly, but after the upgrade the Pod count was higher than before the upgrade, and when I checked the images in the Pods, some were old and some were new. So I had to set skip to true to remove all the Pods, but that failed.

With the default settings it’s expected that, for a short time after the upgrade, some Pods have the new image and some have the old image, and that there are some additional Pods. The reason for this is covered here (but not well explained):

Once all of the processes are running at the new version, we will recreate all of the pods so that the foundationdb container uses the new version for its own image. This will use the strategies described in Pod Update Strategy.

The TL;DR is that storage Pods are recreated with a rolling strategy, taking one fault domain at a time with a wait time of 1 minute between recreations. All non-storage Pods (e.g. stateless and log Pods) will be replaced: this means we create a new Pod for every old Pod, and once the majority of the new Pods are up and running we exclude the old Pods and delete them. The idea here is to reduce the number of recoveries.
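The knob behind this behaviour is spec.automationOptions.podUpdateStrategy, which the cluster in this thread already sets (see the spec above):

spec:
  automationOptions:
    # storage Pods are recreated in place (rolling, one fault domain at a time);
    # transaction-system Pods (log/stateless/proxy) are replaced by new process groups
    podUpdateStrategy: ReplaceTransactionSystem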

@johscheuer, it is very strange: I retried the upgrade today, but it failed again.
After the upgrade the extra-Pods issue never went away (I have tried 3 times; the last time I even waited for 1 hour).
You can see below that the FDB Pods are all doubled, and the FDB cluster cannot be connected today. I am not sure why the upgrade failed today; I used the same steps as yesterday. Is the upgrade unstable?

[root@fdbsre1 operator]# oc get po -w
NAME                                                          READY   STATUS    RESTARTS   AGE
fdb-kubernetes-operator-controller-manager-7dc6d58568-6v8ps   1/1     Running   0          46m
ibm-fdb-controller-manager-64d5998cbb-cth6c                   1/1     Running   0          39m
mdm-foundationdb-ibm-fdb-backup-agents-db5849db5-gbbm9        1/1     Running   0          39m
mdm-foundationdb-ibm-log-1                                    2/2     Running   0          49m
mdm-foundationdb-ibm-log-2                                    2/2     Running   0          49m
mdm-foundationdb-ibm-log-3                                    2/2     Running   0          49m
mdm-foundationdb-ibm-log-4                                    2/2     Running   0          49m
mdm-foundationdb-ibm-log-5                                    2/2     Running   0          38m
mdm-foundationdb-ibm-log-6                                    2/2     Running   0          38m
mdm-foundationdb-ibm-log-7                                    2/2     Running   0          38m
mdm-foundationdb-ibm-log-8                                    2/2     Running   0          38m
mdm-foundationdb-ibm-proxy-1                                  2/2     Running   0          49m
mdm-foundationdb-ibm-proxy-2                                  2/2     Running   0          49m
mdm-foundationdb-ibm-proxy-3                                  2/2     Running   0          38m
mdm-foundationdb-ibm-proxy-4                                  2/2     Running   0          38m
mdm-foundationdb-ibm-stateless-1                              2/2     Running   0          49m
mdm-foundationdb-ibm-stateless-2                              2/2     Running   0          38m
mdm-foundationdb-ibm-storage-1                                2/2     Running   0          49m
mdm-foundationdb-ibm-storage-2                                2/2     Running   0          49m
mdm-foundationdb-ibm-storage-3                                2/2     Running   0          49m

Before the upgrade it looked something like this:

[root@fdbsre1 operator]# oc get po
NAME                                                      READY   STATUS    RESTARTS   AGE
apple-fdb-controller-manager-655787f948-2lmjf             1/1     Running   0          8m24s
ibm-fdb-controller-manager-64d5998cbb-txx2s               1/1     Running   0          3m1s
mdm-foundationdb-ibm-fdb-backup-agents-7b6b79b7bf-74dpq   1/1     Running   0          95s
mdm-foundationdb-ibm-log-1                                2/2     Running   0          2m46s
mdm-foundationdb-ibm-log-2                                2/2     Running   0          2m46s
mdm-foundationdb-ibm-log-3                                2/2     Running   0          2m46s
mdm-foundationdb-ibm-log-4                                2/2     Running   0          2m46s
mdm-foundationdb-ibm-proxy-1                              2/2     Running   0          2m46s
mdm-foundationdb-ibm-proxy-2                              2/2     Running   0          2m46s
mdm-foundationdb-ibm-stateless-1                          2/2     Running   0          2m46s
mdm-foundationdb-ibm-storage-1                            2/2     Running   0          2m46s
mdm-foundationdb-ibm-storage-2                            2/2     Running   0          2m46s
mdm-foundationdb-ibm-storage-3                            2/2     Running   0          2m46s

Can you share the operator logs? The first step to debug this is to exec into all stateful Pods (storage and log) to see if they are actually running the new version. The operator uses fdbcli --exec 'kill' to restart all fdbserver processes but sometimes this request is not reliable and some processes are not restarted.
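A minimal sketch of that check, using Pod names from your listing above (repeat for the log Pods as well):

# check which fdbserver binary each stateful Pod is actually running; a process
# that was not restarted by the kill will still show the old binary/version
for pod in mdm-foundationdb-ibm-storage-1 mdm-foundationdb-ibm-storage-2 mdm-foundationdb-ibm-storage-3; do
  echo "== ${pod}"
  kubectl -n testoperator1 exec "${pod}" -c foundationdb -- ps -ef | grep fdbserver
done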

I found some error messages in the operator log:

...
ster":"mdm-foundationdb-ibm","subReconciler":"controllers.updateLabels"}
{"level":"info","ts":1662978169.7281156,"logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"testoperator1","cluster":"mdm-foundationdb-ibm","subReconciler":"controllers.updateDatabaseConfiguration"}
{"level":"info","ts":1662978169.7282093,"logger":"fdbclient","msg":"Fetch status from FDB","namespace":"testoperator1","cluster":"mdm-foundationdb-ibm"}
{"level":"error","ts":1662978179.730642,"logger":"controller","msg":"Error in reconciliation","namespace":"testoperator1","cluster":"mdm-foundationdb-ibm","subReconciler":"controllers.updateDatabaseConfiguration","requeueAfter":0,"error":"FoundationDB error code 1031 (Operation aborted because the transaction timed out)","stacktrace":"github.com/FoundationDB/fdb-kubernetes-operator/controllers.(*FoundationDBClusterReconciler).Reconcile\n\t/workspace/controllers/cluster_controller.go:183\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.6/pkg/internal/controller/controller.go:298\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.6/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.6/pkg/internal/controller/controller.go:214"}
{"level":"error","ts":1662978179.7307975,"logger":"controller-runtime.manager.controller.foundationdbcluster","msg":"Reconciler error","reconciler group":"apps.foundationdb.org","reconciler kind":"FoundationDBCluster","name":"mdm-foundationdb-ibm","namespace":"testoperator1","error":"FoundationDB error code 1031 (Operation aborted because the transaction timed out)","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.6/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.6/pkg/internal/controller/controller.go:214"}
...

And the first 4 log Pods (log-1, log-2, log-3, log-4) are still on the 6.2 image, while the last 4 log Pods (log-5, log-6, log-7, log-8) are on the 6.3 image.

I’m not talking about the image. I’m talking about the actual processes running inside the Pods/container.

I checked the Pod info.
For the log-1 Pod:

[root@fdbsre1 operator]# oc exec -it mdm-foundationdb-ibm-log-1 bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
Defaulted container "foundationdb" out of: foundationdb, foundationdb-kubernetes-sidecar, foundationdb-kubernetes-init (init)
bash-4.4$ fdbcli -v
FoundationDB CLI 6.2 (v6.2.29)
source version f3aef311ccfbbd66ae3fff6afe88a43de1d39707
protocol fdb00b062010001
bash-4.4$ ps -ef|grep fdb
1000660+       1       0  0 08:52 ?        00:00:00 sh -c fdbmonitor --conffile /var/dynamic-conf/fdbmonitor.conf --lockfile /var/dynamic-conf/fdbmonitor.lockfile --loggroup mdm-foundationdb-ibm >> /var/log/fdb-trace-logs/fdbmonitor-$(date '+%Y-%m-%d').log 2>&1
1000660+       7       1  0 08:52 ?        00:00:00 fdbmonitor --conffile /var/dynamic-conf/fdbmonitor.conf --lockfile /var/dynamic-conf/fdbmonitor.lockfile --loggroup mdm-foundationdb-ibm
1000660+      10       7  1 08:53 ?        00:01:40 /usr/bin/fdbserver --class log --cluster_file /var/fdb/data/fdb.cluster --datadir /var/fdb/data --locality_instance_id log-1 --locality_machineid worker1.fdbtest2.cp.fyre.ibm.com --locality_zoneid worker1.fdbtest2.cp.fyre.ibm.com --logdir /var/log/fdb-trace-logs --loggroup mdm-foundationdb-ibm --public_address 10.254.16.167:4500:tls --seed_cluster_file /var/dynamic-conf/fdb.cluster
1000660+    1824    1815  0 10:43 pts/0    00:00:00 grep fdb

fdbmonitor-2022-09-12.log:
Time="1662972803.712759" Severity="10" LogGroup="mdm-foundationdb-ibm" Process="fdbmonitor": Could not remove inotify conf file watch, continuing...
Time="1662972803.712830" Severity="10" LogGroup="mdm-foundationdb-ibm" Process="fdbmonitor": Watching conf file /var/dynamic-conf/fdbmonitor.conf
Time="1662972803.712840" Severity="10" LogGroup="mdm-foundationdb-ibm" Process="fdbmonitor": Watching conf dir /var/dynamic-conf/ (10)
Time="1662972803.712844" Severity="10" LogGroup="mdm-foundationdb-ibm" Process="fdbmonitor": Loading configuration /var/dynamic-conf/fdbmonitor.conf
Time="1662972803.713266" Severity="10" LogGroup="mdm-foundationdb-ibm" Process="fdbmonitor": Starting fdbserver.1
Time="1662972803.713649" Severity="10" LogGroup="mdm-foundationdb-ibm" Process="fdbserver.1": Launching /usr/bin/fdbserver (10) for fdbserver.1
Time="1662972804.824353" Severity="10" LogGroup="mdm-foundationdb-ibm" Process="fdbserver.1": FDBD joined cluster.

For the log-5 Pod:

[root@fdbsre1 operator]# oc exec -it mdm-foundationdb-ibm-log-5 bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
Defaulted container "foundationdb" out of: foundationdb, foundationdb-kubernetes-sidecar, foundationdb-kubernetes-init (init)
bash-4.4$ fdbcli -v
FoundationDB CLI 6.3 (v6.3.24)
source version 7d7762ba77d3ef6507101905f51ed74f15458807
protocol fdb00b063010001
bash-4.4$ ps -ef|grep fdb
1000660+       1       0  0 09:03 ?        00:00:00 sh -c fdbmonitor --conffile /var/dynamic-conf/fdbmonitor.conf --lockfile /var/dynamic-conf/fdbmonitor.lockfile --loggroup mdm-foundationdb-ibm >> /var/log/fdb-trace-logs/fdbmonitor-$(date '+%Y-%m-%d').log 2>&1
1000660+       7       1  0 09:03 ?        00:00:00 fdbmonitor --conffile /var/dynamic-conf/fdbmonitor.conf --lockfile /var/dynamic-conf/fdbmonitor.lockfile --loggroup mdm-foundationdb-ibm
1000660+      10       7  0 09:03 ?        00:00:12 /usr/bin/fdbserver --class log --cluster_file /var/fdb/data/fdb.cluster --datadir /var/fdb/data --locality_instance_id log-5 --locality_machineid worker1.fdbtest2.cp.fyre.ibm.com --locality_zoneid worker1.fdbtest2.cp.fyre.ibm.com --logdir /var/log/fdb-trace-logs --loggroup mdm-foundationdb-ibm --public_address 10.254.16.171:4500:tls --seed_cluster_file /var/dynamic-conf/fdb.cluster
1000660+    1755    1739  0 10:45 pts/0    00:00:00 grep fdb
bash-4.4$
fdbmonitor-2022-09-12.log
Time="1662973419.922433" Severity="10" LogGroup="mdm-foundationdb-ibm" Process="fdbmonitor": Started FoundationDB Process Monitor 6.3 (v6.3.24)
Time="1662973419.922653" Severity="10" LogGroup="mdm-foundationdb-ibm" Process="fdbmonitor": Watching conf file /var/dynamic-conf/fdbmonitor.conf
Time="1662973419.922660" Severity="10" LogGroup="mdm-foundationdb-ibm" Process="fdbmonitor": Watching conf dir /var/dynamic-conf/ (2)
Time="1662973419.922684" Severity="10" LogGroup="mdm-foundationdb-ibm" Process="fdbmonitor": Loading configuration /var/dynamic-conf/fdbmonitor.conf
Time="1662973419.922981" Severity="10" LogGroup="mdm-foundationdb-ibm" Process="fdbmonitor": Starting fdbserver.1
Time="1662973419.987424" Severity="10" LogGroup="mdm-foundationdb-ibm" Process="fdbserver.1": Launching /usr/bin/fdbserver (10) for fdbserver.1
Time="1662973421.399672" Severity="10" LogGroup="mdm-foundationdb-ibm" Process="fdbmonitor": Could not remove inotify conf file watch, continuing...
Time="1662973421.399744" Severity="10" LogGroup="mdm-foundationdb-ibm" Process="fdbmonitor": Watching conf file /var/dynamic-conf/fdbmonitor.conf
Time="1662973421.399752" Severity="10" LogGroup="mdm-foundationdb-ibm" Process="fdbmonitor": Watching conf dir /var/dynamic-conf/ (4)
Time="1662973421.399755" Severity="10" LogGroup="mdm-foundationdb-ibm" Process="fdbmonitor": Loading configuration /var/dynamic-conf/fdbmonitor.conf
Time="1662973421.399834" Severity="10" LogGroup="mdm-foundationdb-ibm" Process="fdbmonitor": Updated configuration for fdbserver.1
Time="1662973425.171829" Severity="40" LogGroup="mdm-foundationdb-ibm" Process="fdbserver.1": Warning: FDBD has not joined the cluster after 5 seconds.
Time="1662973425.171876" Severity="40" LogGroup="mdm-foundationdb-ibm" Process="fdbserver.1":   Check configuration and availability using the 'status' command with the fdbcli

trace file:
<Event Severity="10" Time="1662973420.169720" DateTime="2022-09-12T09:03:40Z" Type="RecoveriesComplete" ID="a9e4828047e6ce35" ThreadID="440839419804011824" Machine="10.254.16.171:4500" LogGroup="mdm-foundationdb-ibm" />
<Event Severity="20" Time="1662973420.169720" DateTime="2022-09-12T09:03:40Z" Type="GetDiskStatisticsDeviceNotFound" ID="0000000000000000" Directory="/var/fdb/data" ThreadID="440839419804011824" Machine="10.254.16.171:4500" LogGroup="mdm-foundationdb-ibm" />
<Event Severity="10" Time="1662973420.169720" DateTime="2022-09-12T09:03:40Z" Type="MachineLoadDetail" ID="0000000000000000" User="3398539" Nice="17629" System="2682246" Idle="537335295" IOWait="2877" IRQ="0" SoftIRQ="298154" Steal="42233" Guest="0" ThreadID="440839419804011824" Machine="10.254.16.171:4500" LogGroup="mdm-foundationdb-ibm" />

And all 3 storage Pods are still on the 6.2 image.

@johscheuer I found that when the FDB cluster is in an unavailable state and cannot be connected to, everything hangs. So I wonder what the root cause of the database being unavailable is?
The connection string is old, and fdbmonitor.log keeps showing the message “FDBD has not joined the cluster”.

It seems that after doing the upgrade with some sleep time (several minutes) between each step, the upgrade can finally pass.

You have to check the fdb processes running inside the storage Pods. It’s expected that during the upgrade (after all processes are restarted to run the new version) the FDB cluster will have log/stateless Pods on both 6.2 and 6.3; I described that here: How to upgrade apple operator 0.48 to 1.4.1 without deleting the old crd? - #27 by johscheuer.

What do you mean by the “connection string is old”? Have you done any manual operation on this cluster?

As stated here: How to upgrade apple operator 0.48 to 1.4.1 without deleting the old crd? - #29 by johscheuer, the first step is to use kubectl exec to get into the Pods still running with the 6.2 image and check whether they are still running the old fdbserver binary (that would be /usr/bin/fdbserver) or are already using the new binary from /var/.. If they still use the old fdbserver binary, you have to restart the process with a kill, e.g. pkill -f fdbserver.
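A sketch of that restart for one of the Pods above; fdbmonitor stays running and relaunches fdbserver from the current fdbmonitor.conf:

# only needed for Pods whose fdbserver process is still the old binary
kubectl -n testoperator1 exec mdm-foundationdb-ibm-storage-1 -c foundationdb -- pkill -f fdbserver
# verify the relaunched process afterwards
kubectl -n testoperator1 exec mdm-foundationdb-ibm-storage-1 -c foundationdb -- ps -ef | grep fdbserver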

It seems that after doing the upgrade with some sleep time (several minutes) between each step, the upgrade can finally pass.

The operator should already wait at least 1 minute before removing a Pod, and we have some additional safeguards in our operator code. If you can reproduce this issue with the latest operator version, please create an issue with examples of how to reproduce it. It would also be helpful to share the operator logs; otherwise it’s hard/impossible to understand what has happened.
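Something like this captures the full operator log to attach to such an issue (the deployment name is taken from the Pod listing above; adjust it if yours differs):

kubectl -n testoperator1 logs deployment/fdb-kubernetes-operator-controller-manager > operator.log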

@johscheuer, so far I can upgrade the operator and FDB successfully. If I encounter more problems later I will contact you again. Thanks a lot!

You’re welcome. I would always recommend trying to use the latest operator version to have all bug fixes included. We try our best to ensure that all upgrades (except for major upgrades) are seamless.

@johscheuer When I tried to upgrade the FDB cluster from 6.2 to 6.3, I found there is a period during which FDB is in an unavailable state (around 20 minutes). I think during that time the FDB Pods are copying binary files and redistributing data, right? But our customers won’t want to stop their workload during an upgrade. Is it possible to have no downtime during an FDB upgrade (from 6.2 to 6.3)? Thanks!