How can I shut down an FDB cluster?

We need to upgrade the host OS of the Kubernetes nodes on which we deploy our fdb clusters. The OS upgrade will reboot the underlying node and restart the fdb pods, which triggers a data rebalance across the fdb pods. This could make the OS upgrade on many nodes time-consuming, and we are also worried about possible issues caused by the rebalance.

We are thinking of shutting down the fdb clusters first (at least the non-production ones), doing the host OS upgrade, and then bringing the clusters back up. This would avoid the rebalance.

But we have never shut down an fdb cluster, and I can't find a shutdown command for fdb.

Is there a way to shut down an fdb cluster and bring it back up later?

Thank you.

If it’s feasible for you, you can just shut down all processes at once (or close to it). There shouldn’t be any significant harm in it being more staggered, but that will result in the movement you are trying to avoid.
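For example, if fdbmonitor manages the processes as a service on each host (a sketch only; in a Kubernetes setup you may instead stop the pods or their containers directly):

    # stop the fdb processes on every host/pod at roughly the same time
    sudo systemctl stop foundationdb    # or: sudo service foundationdb stop

    # ... do the OS upgrade ...

    # bring the processes back up
    sudo systemctl start foundationdb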

Another option to avoid movement, if you are working on one fault domain at a time, is to use the maintenance command in fdbcli (available since 6.1). This marks a fault domain as being under maintenance so that, if it goes down, data distribution won’t start re-replicating the data that was stored on it. I don’t see any documentation on our website for it, but you can run help maintenance in fdbcli.
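A rough sketch of the usage (the zone ID and duration below are placeholders; check help maintenance in your version for the exact syntax):

    # mark a zone as under maintenance for, e.g., 3600 seconds
    fdbcli --exec "maintenance on <your_zone_id> 3600"

    # show which zones are currently marked for maintenance
    fdbcli --exec "maintenance"

    # clear maintenance mode
    fdbcli --exec "maintenance off"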

The maintenance command is available in our installation, and we’d like to try it. I’d like to confirm one minor detail.

In the “help maintenance” output, it says:
“Mark a zone for maintenance. Calling this command with `on’ prevents data distribution from moving data away from the processes with the specified ZONEID.”

We set “locality_zoneid” to the rack ID within the DC. So in our /etc/foundationdb/foundationdb.conf, we have lines like these:

locality_zoneid = lvs_lvs03_01-0200_13_10
datacenter_id = dc1

Please confirm that the maintenance command’s ZONEID is the same as locality_zoneid in foundationdb.conf.

Thanks, AJ.

I tested using the “maintenance” command as follows (a command-level sketch follows the list):

  • Issue the maintenance command.
  • Shut down the fdb processes on the pod we were testing by commenting out the process specs in /etc/foundationdb/foundationdb.conf.
  • Wait for 15 minutes.
  • Bring the fdb processes back up by restoring the conf file.
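Roughly, the sequence was (the duration here is a placeholder for what we actually used):

    # mark the pod's zone for maintenance, e.g. for 30 minutes
    fdbcli --exec "maintenance on lvs_lvs03_01-0200_13_10 1800"

    # on the pod: comment out the [fdbserver.<port>] sections in
    # /etc/foundationdb/foundationdb.conf so fdbmonitor stops those processes

    # ... wait ~15 minutes ...

    # restore the conf file so fdbmonitor starts the processes again,
    # then optionally clear maintenance mode
    fdbcli --exec "maintenance off"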

Then we saw data movement. Maybe I waited too long.

Is there an internal parameter that determines when a rebalance should be triggered? Can we increase the threshold, whether it’s a time duration or a version difference, so that we can avoid the rebalance?

Thanks.

That should be the right zone ID to use.
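If you want to double-check which zone ID each process is actually registering, one way (a sketch) is to look at the locality fields in the machine-readable status:

    # each process entry in the JSON status carries a "locality" object with its zoneid
    fdbcli --exec "status json" | grep '"zoneid"'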

It sounds like you saw data movement after adding the processes back, which may suggest that the movement you are observing is different from the movement that maintenance mode would be disabling. In particular, with maintenance mode you are preventing the movement of data that now has too few replicas as a result of your missing zone. It’s quite possible that, due to mutations that occurred while your processes were gone, some rebalancing work is needed to get the data distribution back to an ideal state. Hopefully you would still be avoiding the evacuation and refilling of your processes.

Are you able to characterize the type and volume of the movement you saw? Status reports some details on this under ‘Replication health’ and ‘Moving data’.
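For example (a sketch; the sample figures are purely illustrative):

    # human-readable summary: look at the Data section, e.g.
    #   Replication health - Healthy
    #   Moving data        - 1.203 GB
    fdbcli --exec "status"

    # machine-readable details about in-flight and queued movement
    fdbcli --exec "status json" | grep -A 5 '"moving_data"'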

One other thing worth mentioning, given your concern about waiting too long, is that maintenance mode is enabled with an expiration. I imagine you are aware of this since the duration is a required argument, but if you have a process down when maintenance mode expires, data movement will begin.
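If there is a risk of the window expiring mid-upgrade, you can check what is left and re-arm it (a sketch; issuing maintenance on again for the same zone should reset the timer):

    # show which zones are currently under maintenance
    fdbcli --exec "maintenance"

    # extend the window for the same zone, e.g. for another hour
    fdbcli --exec "maintenance on lvs_lvs03_01-0200_13_10 3600"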