I was thinking about ways to simplify the manual deployment of multi-node clusters, using the existing setup and packages, without having to fiddle with copying fdb.cluster files between hosts (not always practical in some virtualization environments with poor support for CTRL+C / CTRL+V) or having to delete all the files in the data folder and restart the service.
One idea would be to add a new command to fdbcli, something like join IP:PORT, where IP:PORT would be at least one of the coordinators of the target cluster, or maybe the path to a file somewhere on the disk.
This command would only be supported by “single host” clusters (i.e. ones where all the processes are on the same machine). It would fail with an error message if run on a multi-host cluster (to prevent easy mistakes like joining the cluster to the new empty single node instead of the reverse! like swapping the arguments of
If run from the fdbcli prompt, it could maybe display the details of the target cluster (id, description, list of coordinators?) and maybe a confirmation prompt: “All local data on $CURRENT_HOST will be deleted! Proceed? Y/N”
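To make the "display the details of the target cluster" part concrete, here is a minimal sketch that splits a cluster file line of the form description:ID@IP:PORT,IP:PORT into the three pieces the prompt would show. The names ClusterFile and parse_cluster_file are hypothetical, the addresses in the usage note are placeholders, and skipping blank and # comment lines is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClusterFile:
    """The three pieces of an fdb.cluster connection string."""
    description: str
    cluster_id: str
    coordinators: List[str]


def parse_cluster_file(text: str) -> ClusterFile:
    """Parse a 'description:ID@addr,addr,...' connection string."""
    # Assumption: blank lines and '#' comment lines can be ignored.
    lines = [ln.strip() for ln in text.splitlines()]
    lines = [ln for ln in lines if ln and not ln.startswith("#")]
    if not lines:
        raise ValueError("empty cluster file")
    prefix, at, addresses = lines[0].partition("@")
    description, colon, cluster_id = prefix.partition(":")
    if not at or not colon or not cluster_id or not addresses:
        raise ValueError("malformed cluster file line: %r" % lines[0])
    return ClusterFile(description, cluster_id, addresses.split(","))


def describe(target: ClusterFile) -> str:
    """Render the summary that would be shown before asking for confirmation."""
    header = "You are about to join cluster %s:%s with the following list of coordinators:" % (
        target.description, target.cluster_id)
    return "\n".join([header] + target.coordinators)
```

For example, describe(parse_cluster_file("xyz:123456789@10.0.0.1:4500,10.0.0.2:4501")) gives a header line followed by one coordinator per line.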
Logically, this command would run these steps:
- check with the current cluster if there is one and only one distinct machine_id
- connect to the specified IP:PORT, check if this is a compatible fdb version
- check that the target cluster is not already the one this host belongs to
- ask for the complete fdb.cluster file of this cluster
- display the results in the prompt, and ask for confirmation y/n (optional)
- stop all fdbserver processes on the host
- remove all previous data files (or maybe rename them?)
- overwrite the local fdb.cluster file (the one used by fdbcli?)
- restart all
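
The checks at the top of this list could be sketched roughly as follows. This is only an illustration, not how fdbcli works today: preflight, JoinRefused and DESTRUCTIVE_STEPS are hypothetical names, the status argument is assumed to be the parsed output of status json with a machine_id field under cluster.processes, and the version compatibility check is left out because I don't know how it would be done. Letting force override only the multi-machine check is also an assumption.

```python
class JoinRefused(Exception):
    """Raised when a pre-flight check says the join must not proceed."""


# The destructive steps, in the order they would run once all checks pass.
DESTRUCTIVE_STEPS = (
    "stop all fdbserver processes on the host",
    "remove (or rename) previous data files",
    "overwrite the local fdb.cluster file",
    "restart all processes",
)


def distinct_machine_ids(status):
    """Collect the machine_id values found in a parsed status document."""
    processes = status.get("cluster", {}).get("processes", {})
    return {p["machine_id"] for p in processes.values() if "machine_id" in p}


def preflight(status, local_cluster, target_cluster, force=False):
    """Run the non-destructive checks; return the steps to execute next."""
    machines = distinct_machine_ids(status)
    # Exactly one distinct machine_id is required, unless the user forces it.
    if len(machines) != 1 and not force:
        raise JoinRefused(
            "the local cluster is running on %d machines" % len(machines))
    # Joining the cluster we already belong to would only destroy data.
    if local_cluster == target_cluster:
        raise JoinRefused("this host already belongs to the target cluster")
    return list(DESTRUCTIVE_STEPS)
```

Keeping the checks separate from the destructive steps means nothing is touched on disk until every check has passed.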
It would also be nice to have a way to pass an additional argument (force?) to skip the prompt and not ask for user confirmation, so that the command can be used by a custom install script.
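A minimal sketch of how the prompt and the force argument could interact, assuming "no" is the default answer as in the [y/N] convention. The function name confirm_join and its ask parameter are hypothetical; ask exists only so that a script or a test can replace the interactive input.

```python
def confirm_join(host, force=False, ask=input):
    """Return True if the join may proceed on this host."""
    if force:
        # Scripted installs pass force and never see the prompt.
        return True
    answer = ask("All local data on %s will be deleted! Proceed? [y/N] " % host)
    # Anything other than an explicit yes counts as "no".
    return answer.strip().lower() in ("y", "yes")
```

With force the question is never asked, and an empty answer (just pressing ENTER) aborts the join.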
Any opinion on this? There are for sure a ton of corner cases!
Typical example on a single host cluster (right after running the setup/installing the packages)
    fdb> join 188.8.131.52:4500
    You are about to join cluster xyz:123456789 with the following list of coordinators:
    184.108.40.206:4500
    220.127.116.11:4501
    ....
    All local data on this host will be LOST. Proceed? [y/N] # 'y' + [ENTER]
    Stopping all local process...
    Removing previous data...
    Joining new cluster...
    Restart all process...
    Done!
    fdb>
Example if running the command on a cluster with more than 1 machine (most probably a mistake!)
    fdb> join 18.104.22.168:4500
    WARNING: the local cluster is already running on more than one machine!
    Are you sure you are running the command on the correct server?
    To force join, re-issue with command using the 'force' argument!
    fdb>
We could also ask whether a single-machine but multi-process cluster should be allowed to join another. An example would be someone pre-configuring the foundationdb.conf file with the list of all processes/classes BEFORE joining the target cluster (and not after).
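
To illustrate what such a pre-configured host looks like, here is a small sketch that lists the processes declared in a foundationdb.conf and their classes, which the join command could use to tell "single machine, several processes" apart from a true multi-host cluster. The sample file content, the port numbers and the configured_processes helper are all made up for illustration, and reading the file with Python's configparser is an assumption that holds only for simple INI-style content like this.

```python
import configparser

# Illustrative content only; a real foundationdb.conf has more sections.
SAMPLE_CONF = """
[fdbserver]
command = /usr/sbin/fdbserver
datadir = /var/lib/foundationdb/data/$ID

[fdbserver.4500]
class = storage

[fdbserver.4501]
class = transaction

[fdbserver.4502]
class = stateless
"""


def configured_processes(conf_text):
    """Map each [fdbserver.<port>] section to its configured class."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    result = {}
    for section in parser.sections():
        name, dot, port = section.partition(".")
        # Only per-process sections count; the bare [fdbserver] section is shared defaults.
        if name == "fdbserver" and dot and port.isdigit():
            result[int(port)] = parser[section].get("class", fallback="unset")
    return result
```

With the sample above, configured_processes(SAMPLE_CONF) maps ports 4500, 4501 and 4502 to storage, transaction and stateless.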