I’m not 100% sure whether this is a bug, but I figured I’d leave a note. I’ve got a test cluster that was accidentally reconfigured to run a single fdbserver per node due to some Chef mishaps. While debugging what had happened, I discovered that one of the nodes still had the entire set of processes from the pre-single-fdbserver config in the cluster.
Out of curiosity, I touched /etc/foundationdb/foundationdb.conf to see whether Chef had done something odd that prevented fdbmonitor from noticing the new config file. After I touched foundationdb.conf, fdbmonitor logged that it was deconfiguring each of the expected processes to match the new config, yet none of the excess fdbserver processes were stopped, and they are still part of the cluster.
Also to be clear, we’re being particularly brutal when we apply new test configurations. These are generally accompanied by rm -rf $data_dir type operations as we reset a test cluster for benchmarking. It just surprised me enough to make a note.
This turns out to be more of a confusing documentation issue. Our configs set kill_on_configuration_change = false which apparently means “don’t kill fdbserver processes”. Setting it to true causes fdbmonitor to actually kill child processes.
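For illustration, the setting in question looks roughly like the fragment below. This is only a sketch: the key name is spelled the way the docs quoted in this thread spell it, and placing it under [general] is an assumption on my part, so check where (and with what spelling) your own foundationdb.conf sets it.

```ini
## /etc/foundationdb/foundationdb.conf (fragment, section placement assumed)
[general]
# false: fdbmonitor notices the config change but leaves already-running
#        child processes alone (the behaviour described above)
# true or unset: fdbmonitor kills and restarts the affected child processes
kill_on_configuration_change = false
```

With false in place, touching or rewriting the file was not enough to get rid of the excess fdbserver processes in our case.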
The current documentation reads:
If kill_on_configuration_change parameter is unset or set to true in foundationdb.conf then fdbmonitor will restart on changes automatically. If this parameter is set to false it will not restart on changes.
I took this to mean that fdbmonitor itself would restart and rely on a daemon supervision system to restart the whole process tree. Instead, it appears to actually mean:
If kill_on_configuration_change parameter is unset or set to true in foundationdb.conf then fdbmonitor will restart fdbserver processes on configuration changes automatically. If this parameter is set to false it will not restart any fdbserver processes on configuration changes.
Ah yes, your observation is correct (although the managed processes need not be fdbserver specifically), and I can see why the current documentation is confusing. If you’re interested in filing a PR clarifying the docs, that would be great. Or if you’d prefer, I can update it.