I’m relaying an installation issue from someone onsite, so I may not have all the details available. Note: the issue has also been worked around since, so unfortunately I can’t get more details.
After installing 5.2.5 in production on 3 hosts running RHEL 7.4, hosts #1 and #3 worked perfectly, but all the processes on host #2 stopped working: they could not create new log files, could not create or update the data files, and were showing the incorrect_cluster_file_contents error. Running fdbcli on that host worked fine (it could see all the other nodes of the cluster). Multiple reboots did not change the behavior. Each host has 4 processes (+ 1 backup agent).
Edit: the admin uninstalled FoundationDB and reinstalled it from scratch (with the proper rights on the mounted partitions), and now everything works as intended.
I asked the person onsite to list the processes and something weird is happening: foundationdb.conf is set up to spin up 4 processes + 1 backup agent, but we see that fdbmonitor is spawning 5 additional instances of itself (with the same arguments) as root, instead of spawning the fdbserver / backup agent processes.
Sorry for the picture quality, it’s all I’ve got at the moment:
All I have for additional information is that the data folder is mounted on a different partition, and the admin forgot to chown the folder to the correct owner (the foundationdb user).
Any idea what could lead fdbmonitor to spin up the correct number of identical copies of itself, instead of fdbserver instances?
On Linux, the usual (in fact, nearly the only) way to start a (sub)process is to fork() and then, in the child, exec() the binary you want to run (in this case, fdbserver). So the process tree shown probably reflects fdbmonitor having several forks which have neither successfully started an fdbserver nor terminated; i.e. they “are” somewhere between lines 510 and 576 of fdbmonitor.cpp. The obvious place is line 550, where, after a first attempt to restart a child fails, fdbmonitor sleeps for a configurable amount of time before attempting another restart (to prevent extremely rapid relaunching of a failing process from wasting resources, filling up the disk with log files, or otherwise making it harder to diagnose and recover from a failure).
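Roughly, the pattern being described looks like the sketch below. This isn’t the actual fdbmonitor code, just a minimal illustration (with a made-up binary path and delay) of why a child whose exec() keeps failing shows up in the process tree as a copy of its parent:

    #include <unistd.h>
    #include <cstdio>

    // Child side: keep trying to exec the target binary, sleeping between
    // attempts. Until execv() succeeds, this process is still an identical
    // copy of the parent image, which is what shows up in the process tree.
    void runChild(char* const argv[]) {
        for (;;) {
            execv(argv[0], argv);   // only returns if the exec failed
            perror("execv failed");
            sleep(60);              // back off before retrying (the delay is configurable in fdbmonitor)
        }
    }

    int main() {
        char* args[] = { (char*)"/usr/sbin/fdbserver", nullptr }; // illustrative path/args
        pid_t pid = fork();
        if (pid == 0)
            runChild(args); // child: never returns
        // Parent (the monitor) goes back to watching its other children.
        return 0;
    }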
In other words, I think this situation is probably quite normal if fdbserver can’t be started successfully, and I think you have already diagnosed that. If you are looking at this type of situation in the future, you want to look at the system log, where fdbmonitor will explain what it is doing, and potentially the fdbserver logs, where fdbserver may explain why it is failing.
I see. In our case, the fdbserver processes were unable to write their log files, which made it more difficult to see what was going on. And in the system logs, they were complaining about the contents of fdb.cluster being incorrect (even though it was visually identical to the other servers).
Why is it implemented as:

    fork()
        while (!success)
            try exec()
            delay()
            retry exec()
            delay(…)
            retry exec()
            …

Instead of:

    while (!success)
        fork()
            try exec()
            exit(1)
        delay()
        fork()
            try exec()
            exit(1)
        delay(…)
        …
With the current approach, the admin only sees child processes that duplicate the original fdbmonitor command-line arguments, and cannot easily identify which fdbserver is having issues (via its custom command-line arguments).
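For reference, here is a rough sketch of that alternative (illustrative only, with made-up paths and delays): fork a fresh child for every attempt, exec or exit immediately, and let the parent handle the waiting and the retry delay.

    #include <unistd.h>
    #include <sys/wait.h>
    #include <cstdio>

    int main() {
        char* args[] = { (char*)"/usr/sbin/fdbserver", nullptr }; // illustrative
        for (;;) {
            pid_t pid = fork();
            if (pid == 0) {
                execv(args[0], args); // on success this process *becomes* fdbserver,
                                      // so its own command line is visible in ps
                perror("execv failed");
                _exit(1);             // on failure, exit instead of lingering as a clone
            }
            int status;
            waitpid(pid, &status, 0); // parent waits for the child to exit (or die)
            sleep(60);                // retry delay now lives in the (single-threaded) parent
        }
    }

One wrinkle the sketch glosses over is that the parent cannot easily tell “exec failed immediately” apart from “fdbserver started and later died”, so detecting success would need some extra signalling between child and parent.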
Anyway, for the uninitiated, this was a bit troubling.
If I remember correctly, it should be possible to determine which process is failing by looking through the fdbmonitor logs (usually in syslog). I believe fdbmonitor will log the command it’s trying to run, the child’s pid, and often any errors that cause the process to fail. I’m not currently in a position to verify that, though.
I think the reason for delaying in the forked process is to avoid tying up the single thread of the main fdbmonitor process.
EDIT: Having thought about it a bit more, I’m actually not sure whether it logs the command that gets run and/or the ID assigned to the process (which could be used along with foundationdb.conf to determine the arguments). If not, then perhaps it should. In practice, though, I’ve only infrequently had to deal with problems where just some of the processes on a host are repeatedly dying, and there are other ways to identify the process that make it easy enough in the absence of this information from fdbmonitor (e.g. noticing missing hosts in the trace logs or missing fdbserver processes on the host). If the information is missing, that’s presumably why there’s been no prior effort to add it.
EDIT2: I checked the logs, and it does look like we log the ID assigned to the process in the Process field (e.g. for ID 4500 in fdbserver, you would get Process="fdbserver.4500"), so you should be able to trace that back to its parameters in foundationdb.conf. It doesn’t log the parameters used to start the process in syslog, though.
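For example (paths here are just illustrative defaults, not necessarily what the reporter has), a syslog entry with Process="fdbserver.4500" maps to the [fdbserver.4500] section of foundationdb.conf, plus whatever it inherits from the general [fdbserver] section:

    [fdbserver]
    command = /usr/sbin/fdbserver
    datadir = /var/lib/foundationdb/data/$ID
    logdir = /var/log/foundationdb

    [fdbserver.4500]
    # per-process overrides for the process logged as Process="fdbserver.4500"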