Any way to gracefully stop fdbserver process temporarily without affecting traffic?

Sorry if I’m just missing something from the docs, but I’ve noticed that running sudo service foundationdb stop to temporarily shut down one of the fdbserver processes in my cluster can cause roughly 5 seconds of downtime, depending on which roles that fdbserver was performing within the cluster.

I know that the exclude command helps when you want to permanently remove a machine from the cluster, but my understanding is that it moves all the data off the machine. If you just want to take the machine offline for a minute or two to upgrade the OS, it seems expensive to move all the data rather than running briefly without the server and letting it catch up when it comes back online.

Am I understanding things correctly? Is exclude the recommended tool for this sort of situation or is there any other way to avoid blips of unavailability when taking a server offline temporarily?

There was a maintenance command added to fdbcli that makes data distribution ignore failures in a zone for a span of time. You could probably use this before you do the OS upgrades?

fdb> help maintenance

maintenance [on|off] [ZONEID] [SECONDS]

Mark a zone for maintenance.

Calling this command with `on' prevents data distribution from moving data away
from the processes with the specified ZONEID. Data distribution will
automatically be turned back on for ZONEID after the specified SECONDS have
elapsed, or after a storage server with a different ZONEID fails. Only one
ZONEID can be marked for maintenance. Calling this command with no arguments
will display any ongoing maintenance. Calling this command with `off' will
disable maintenance.
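Based on the help text above, usage might look like the following; the zone ID and duration here are placeholders (your process's actual zone ID depends on its locality configuration and can be found in status details):

```shell
# Ignore failures in the given zone for 600 seconds while the host is down.
# "my-zone-id" and "600" are placeholder values, not recommendations.
fdbcli --exec 'maintenance on my-zone-id 600'

# With no arguments, display any ongoing maintenance.
fdbcli --exec 'maintenance'

# Turn maintenance off early once the host is healthy again.
fdbcli --exec 'maintenance off'
```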

I’m not sure whether maintenance mode itself moves off the various roles that contribute to the downtime you’re trying to avoid. If it doesn’t, maybe combining maintenance mode with exclusions would work.
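If someone wants to try that combination, a rough, untested sketch might look like this. The address and zone ID are placeholders, and whether the exclude still triggers data movement while maintenance mode is active is exactly the open question:

```shell
# Placeholders -- substitute your process's zone ID and address.
ZONE=my-zone-id
ADDR=10.0.4.1:4500

# 1. Tell data distribution to tolerate failures in this zone for 10 minutes.
fdbcli --exec "maintenance on $ZONE 600"

# 2. Exclude the process so any roles it holds (logs/proxies/etc.) are moved
#    off via a voluntary recovery before the process is stopped.
fdbcli --exec "exclude $ADDR"

# 3. Stop the process, do the OS upgrade, and restart.
sudo service foundationdb stop
# ... perform upgrade ...
sudo service foundationdb start

# 4. Re-include the process and clear maintenance mode.
fdbcli --exec "include $ADDR"
fdbcli --exec 'maintenance off'
```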

Yeah, it sounds like maintenance mode would help avoid data movement, but not short periods of unavailability when the roles need to be reconfigured. I can try out maintenance mode plus exclusions, but I’m pretty surprised to hear this isn’t something that there’s a clear answer for.

How long of a blip are you seeing?

I had been reading this as you looking to avoid a recovery, which is sort of unavoidable. If there’s a transaction log/proxy/resolver on the host that you’re removing, then a necessary function is being performed on that host and has to be moved for the cluster to keep working. However, the cluster voluntarily moving those roles elsewhere should result in a pretty quick unavailability window (<1s).

I’m guessing that what you’re actually seeing is a 3-5 second recovery because you first have to wait out the failure detector, and then do a recovery. So what you want is something like an exclude+maintenance command, which causes FDB to move all roles except storage servers off of the host in a quick voluntary recovery, and then ignore the storage servers on that host failing for the next 10 minutes?

Re-reading AJ’s reply, I think he read this the correct way, and overall it seems to make sense that maintenance should act like an exclude in terms of changing recruitment’s opinion of where the various pieces of the transaction authority should be placed.

For what it’s worth, I’m not 100% sure whether maintenance mode has an effect on recruited processes, so that’s probably worth looking into or experimenting with to determine the answer.

I’ve been seeing a blip of roughly 5 seconds when taking processes down by running service foundationdb stop, which should just be sending a SIGTERM. I’d have hoped that the SIGTERM handler would be able to initiate a role transfer.

Yeah, that’s the dream.

I’ll experiment with this, not just blindly trust that it works :slight_smile: Thanks.