How to upgrade the apple operator from 0.48 to 1.4.1 without deleting the old CRD?

Hi, Experts

We are currently doing some upgrade testing from apple operator 0.48 to 1.4.1. Since the CRD will change, we would have to recreate the CRD, so the CR would also be recreated, and the PVCs of that CR would then be deleted, losing all the data on them. Is there any solution for upgrading the CRD from 0.48 to 1.4.1 without deleting the old CRD, i.e. just patching it, so that our CR is kept and the data on the PVCs is kept as well? Thanks!

We documented the process here: https://github.com/FoundationDB/fdb-kubernetes-operator/blob/main/docs/compatibility.md. You don’t have to delete/recreate the FoundationDB custom resources; you just have to update the FoundationDB CRD. After updating the CRD, the FoundationDB custom resource will be served in both v1beta1 and v1beta2.
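For example, a minimal sketch of the CRD update (the raw-file path and tag are assumptions; take the CRD yaml from the release you are upgrading to):

# applying the new CRD in place only updates the schema; existing
# FoundationDBCluster resources and their PVCs are left untouched
kubectl apply -f https://raw.githubusercontent.com/FoundationDB/fdb-kubernetes-operator/v1.4.1/config/crd/bases/apps.foundationdb.org_foundationdbclusters.yaml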

Thanks a lot! Let me try.

I have tried the solution and the CR is now served as v1beta2, thanks! I am now also trying to upgrade the fdb cluster from 6.2.29 to 6.3.24. I used the 6.3.24 client to build the 1.4.1 apple operator docker image and also upgraded the fdb cluster to 6.3.24, but in the apple operator log I still see the following error message:

...
{"level":"error","ts":1661767440.0894866,"logger":"controller","msg":"Error getting connection string from cluster","namespace":"testoperator1","cluster":"xxxxx","reconciler":"updateStatus","version":"6.2.29","connectionString":"xxxxx:Vsy7bpupiwkzdBkQzmGyZvyed4T16W1l@10.xxx.xxx.xxx:tls,10.xxx.xxx.xxx:4500:tls,10.xxx.xxx.xxx:4500:tls","error":"unable to fetch connection string: The database is unavailable; type `status' for more information.\n","stacktrace":"github.com/FoundationDB/fdb-kubernetes-operator/controllers.updateStatus.reconcile\n\t/workspace/controllers/update_status.go:69\ngithub.com/FoundationDB/fdb-kubernetes-operator/controllers.(*FoundationDBClusterReconciler).Reconcile\n\t/workspace/controllers/cluster_controller.go:169\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.6/pkg/internal/controller/controller.go:298\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.6/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.6/pkg/internal/controller/controller.go:214"}
{"level":"info","ts":1661767440.0895357,"logger":"controller","msg":"Attempting to get connection string from cluster","namespace":"testoperator1","cluster":"xxxxx","reconciler":"updateStatus","version":"6.3.24","connectionString":"xxxxx:Vsy7bpupiwkzdBkQzmGyZvyed4T16W1l@10.xxx.xxx.xxx:4500:tls,10.xxx.xxx.xxx:4500:tls,10.xxx.xxx.xxx:4500:tls"}
...

It seems the version information is inconsistent, but I am not sure where to check it.

And when I run “oc get foundationdbcluster xxxx -o yaml”, the output contains:

...
version: 6.3.24
status:
...
  runningVersion: 6.2.29
...

During the fdb cluster upgrade I just set skip to true, deleted the pods, updated the FoundationDBCluster image to 6.3.24, and then set skip to false, after which the pods came back (roughly the commands sketched below). I am not sure whether there is still some 6.2.29 info kept in the cluster; how can I change it to 6.3.24?
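Roughly, these were the commands (a sketch of what I did; the label selector is an assumption, I actually deleted the pods by name):

oc patch foundationdbcluster xxxxx --type merge -p '{"spec":{"skip":true}}'
oc delete pod -l foundationdb.org/fdb-cluster-name=xxxxx
oc patch foundationdbcluster xxxxx --type merge -p '{"spec":{"version":"6.3.24"}}'
oc patch foundationdbcluster xxxxx --type merge -p '{"spec":{"skip":false}}'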

Is the upgrade somehow related to the initial question? You have to provide the client libraries for both 6.2 and 6.3, otherwise the operator is not able to communicate with the cluster in those versions.

In what version is the cluster currently running?

Currently the fdb cluster is running version 6.3.24.

@johscheuer After force-setting the running version to the spec version, the version issue is gone, but I encounter the following error in the apple operator log:

...
"level":"error","ts":1662093651.1403134,"logger":"controller","msg":"Error in reconciliation","namespace":"testoperator1","cluster":"mdm-foundationdb-ibm","subReconciler":"controllers.updateDatabaseConfiguration","requeueAfter":0,"error":"FoundationDB error code 1031 (Operation aborted because the transaction timed out)","stacktrace":"github.com/FoundationDB/fdb-kubernetes-operator/controllers.(*FoundationDBClusterReconciler).Reconcile\n\t/workspace/controllers/cluster_controller.go:183\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.6/pkg/internal/controller/controller.go:298\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.6/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.6/pkg/internal/controller/controller.go:214"}
...

Any suggestions? Is there any formal guide for the upgrade from fdb 6.2 to fdb 6.3 on cloud? Thanks!

And it seems there are only 2 items left in the ConfigMap; it had 6 items before the upgrade.

@johscheuer If I have installed a cluster (apple operator 0.48 + fdb 6.2.29) with some data in the db, and I want to upgrade it to (apple operator 1.4.1 + fdb 6.3.24), do you have any suggestions for the upgrade steps? Is there a detailed guide? Thanks!

Is there a reason you have to upgrade the operator and the FDB cluster at the same time? Or are those actually independent steps that happen one after the other?

After force-setting the running version to the spec version, the version issue is gone, but I encounter the following error in the apple operator log:

What do you mean by force-setting the running version? The provided error only suggests that the database is not available. The first step would be to check whether the database is really unavailable by exec’ing into a Pod; it could be that the operator’s connection string is outdated.
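For example (a sketch; substitute one of your fdb pods for the placeholder):

# check availability from inside the cluster and compare the cluster file
# the processes use with the one the operator reports
oc exec -it <some-fdb-pod> -- fdbcli --exec 'status minimal'
oc exec -it <some-fdb-pod> -- cat /var/fdb/data/fdb.cluster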

If I have installed a cluster (apple operator 0.48 + fdb 6.2.29) with some data in the db, and I want to upgrade it to (apple operator 1.4.1 + fdb 6.3.24), do you have any suggestions for the upgrade steps? Is there a detailed guide? Thanks!

Why do you want to do those steps concurrently? I would suggest doing the operator upgrade first and then the FDB cluster upgrade. The steps for the operator upgrade are documented here: https://github.com/FoundationDB/fdb-kubernetes-operator/blob/main/docs/compatibility.md#installing-a-new-major-release (not sure if you encountered any issues here). The FDB cluster upgrade should simply be a matter of changing the version field in the FoundationDBCluster spec to the desired version. You only have to ensure that the operator has the client libraries for both 6.2 and 6.3 available; the recommended way is to provide them with init containers: https://github.com/FoundationDB/fdb-kubernetes-operator/blob/main/config/samples/deployment.yaml#L176-L227
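Condensed, the init-container pattern from the linked sample looks roughly like this (the image tags here are assumptions; check deployment.yaml for the exact ones):

initContainers:
  # copies the 6.2 client library into a shared volume for the operator
  - name: foundationdb-kubernetes-init-6-2
    image: foundationdb/foundationdb-kubernetes-sidecar:6.2.29-1
    args: ["--copy-library", "6.2", "--output-dir", "/var/output-files", "--init-mode"]
    volumeMounts:
      - name: fdb-binaries
        mountPath: /var/output-files
  # same for 6.3, so the operator can talk to the cluster before and after the upgrade
  - name: foundationdb-kubernetes-init-6-3
    image: foundationdb/foundationdb-kubernetes-sidecar:6.3.24-1
    args: ["--copy-library", "6.3", "--output-dir", "/var/output-files", "--init-mode"]
    volumeMounts:
      - name: fdb-binaries
        mountPath: /var/output-files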

edit: If you provide some more details about what you did and what issues you see, it’s easier to help you.

@johscheuer Thanks a lot for your suggestion! Let me try.

@johscheuer After upgrading the operator from 0.48 to 1.4.1, with the CRD moving from v1beta1 to v1beta2, I found that some pods are stuck in “Init:0/1” status. I checked the events of those pods and found the following error message:

...
  Warning  FailedMount       28s (x2 over 4m59s)  kubelet            Unable to attach or mount volumes: unmounted volumes=[config-map], unattached volumes=[dynamic-conf data fdb-trace-logs fdb-certs config-map]: timed out waiting for the condition
...

I found that the ConfigMap has only 2 items, while before the upgrade it had 6. These are the 2 items left:

...
apiVersion: v1
data:
  cluster-file: mdm_foundationdb_ibm:86t4v6nXxsHydUBSDGIWMlGOj4SJ3hBV@xxxxxxx
  running-version: 6.2.29
kind: ConfigMap
...
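For comparison, before the upgrade the 6 items were roughly these keys (listed from memory, so treat the exact names as an assumption):

data:
  cluster-file: ...
  fdbmonitor-conf-log: ...
  fdbmonitor-conf-stateless: ...
  fdbmonitor-conf-storage: ...
  running-version: ...
  sidecar-conf: ...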

After I applied the new cluster file with the new fdb image, it still has the same problem. So I think we need to resolve this problem first before we can go ahead.

@johscheuer After checking the apple operator code, I found that there is no imageTypes in the FoundationDBCluster status after the upgrade, while in a pure install imageTypes is present in the status. How can I add imageTypes to the FoundationDBCluster status after the upgrade? Should it be added automatically by the apple operator during the upgrade? It seems it has not been added here.
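For reference, in the pure install the status carries an entry like this (excerpt; the split value is an assumption based on the default image type):

status:
  imageTypes:
    - split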

Just for my understanding, you performed these steps:

1.) Updated the CRD to be served on v1beta1 and v1beta2 (e.g. by applying the CRD yaml from the fdb-kubernetes-operator repository in a version newer than or equal to 1.0)?
2.) After that, upgraded the operator from 0.48 to 1.4.1 (any reason to pick this specific version?)

Could you share the operator logs from during the upgrade? There might be some important information in them. I’ll try to reproduce this issue locally, but I won’t have time for it until the end of the week. Could you also ensure that you’re not using any deprecated fields in the FoundationDBCluster resource before upgrading? The kubectl fdb plugin should have a subcommand to help with that.
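If I remember correctly the plugin subcommand is deprecation, something like this (treat the exact invocation as an assumption and check kubectl fdb --help):

kubectl fdb deprecation -n testoperator1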

@johscheuer, thank you so much for your help!
I have tried updating the FoundationDBCluster status to add imageTypes, and now the ConfigMap issue is gone and all pods are running. But the fdb cluster is still unavailable, and it seems the coordinator IPs in the following places are all old ones:

  1. /var/fdb/data/fdb.cluster
  2. /var/dynamic-conf/fdb.cluster
  3. fdbcli → status details → coordination servers (all unreachable)
  4. foundationdbcluster status → connectionString

The connectionString is the same in all 4 places above, but the IPs in it are old ones; they no longer exist when you check with “oc get po -o wide”, since all the pod IPs have changed (see the commands below). So we need to figure out how to make the IPs in these 4 places consistent with the IPs of the currently running pods.
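To compare them side by side I run something like this (the label selector is an assumption; filtering by pod name works as well):

# connection string the operator has in the status vs. the current pod IPs
oc get foundationdbcluster xxxxx -o jsonpath='{.status.connectionString}'
oc get po -l foundationdb.org/fdb-cluster-name=xxxxx -o wide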

I found some error messages in the fdb pods, in fdbmonitor.log:

...
Time="1662460970.955128" Severity="10" LogGroup="mdm-foundationdb-ibm" Process="fdbmonitor": Updated configuration for fdbserver.1
Time="1662460974.246982" Severity="40" LogGroup="mdm-foundationdb-ibm" Process="fdbserver.1": Warning: FDBD has not joined the cluster after 5 seconds.
Time="1662460974.247006" Severity="40" LogGroup="mdm-foundationdb-ibm" Process="fdbserver.1":   Check configuration and availability using the 'status' command with the fdbcli
...

And I also found some error messages in the operator log:

...
{"level":"error","ts":1662461344.6233137,"logger":"fdbclient","error":"FoundationDB error code 1031 (Operation aborted because the transaction timed out)","stacktrace":"github.com/FoundationDB/fdb-kubernetes-operator/fdbclient.(*cliAdminClient).GetStatus\n\t/workspace/fdbclient/admin_client.go:237\ngithub.com/FoundationDB/fdb-kubernetes-operator/controllers.updateStatus.reconcile\n\t/workspace/controllers/update_status.go:82\ngithub.com/FoundationDB/fdb-kubernetes-operator/controllers.(*FoundationDBClusterReconciler).Reconcile\n\t/workspace/controllers/cluster_controller.go:169\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.6/pkg/internal/controller/controller.go:298\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.6/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.6/pkg/internal/controller/controller.go:214"}
...

And I have found that in a pure-install environment, there is the following field in the FoundationDBCluster status:

...
hasListenIPsForAllPods: true
...

But in the upgraded environment, the FoundationDBCluster status has no such field.

And I found some error messages in the fdb trace logs:

...
Event Severity="10" Time="1662550288.085386" DateTime="2022-09-07T11:31:28Z" Type="ConnectionClosed" ID="0000000000000000" Error="connection_failed" ErrorDescription="Network connection failed" ErrorCode="1026" SuppressedEventCount="2" PeerAddr="10.254.14.147:4500:tls" ThreadID="16466064147198516778" Machine="10.254.17.224:4500" LogGroup="mdm-foundationdb-ibm" Roles="SS" 
...

Here 10.254.17.224 is the real IP of the running pod, and 10.254.14.147 is an old IP that no longer exists but is still present in fdb.cluster.

I believe the operator logs from during the upgrade would be more interesting, to see what errors are logged. The trace files probably don’t have much information that is interesting in this case, since the operator (or something else) seems to remove those entries from the ConfigMap.

@johscheuer Have you tried the upgrade in your environment (operator 0.48 to 1.4.1 and db from 6.2.29 to 6.3.24)? Or is there any formal upgrade guide for such an upgrade? I want to confirm the correct steps. Thanks!

I haven’t had time to do it yet. Could you share your FoundationDBCluster spec so I can make sure I test the same/similar setup?