SS / tLog kick-out and rejoin

gaurav · May 27, 2019, 3:44pm

Hi, I have a few questions about how the cluster handles processes that become unreachable:

1. What is the protocol/semantics for handling a storage server (SS) or a tLog that is unreachable for an extended duration (anywhere from minutes to hours or days) and eventually rejoins the cluster?
2. How soon after an SS becomes unreachable will the Master/DataDistributor/CC consider it “dead” and start extra replication of its data?
3. If an unreachable SS process is not explicitly excluded via fdbcli, does the rest of the cluster pay any cost (for example, tLogs not popping mutations while waiting for the unreachable SS to come back)?
4. Once the SS comes back alive and tries to rejoin the cluster on its own, will the Master/CC recruit it as a fresh SS process, clear its on-disk data, and then give it shards to replicate from other SS nodes?

It would be great to get a better idea of how FoundationDB handles this.

ajbeamon · May 28, 2019, 4:11pm

On how soon an unreachable storage server is considered dead: I don’t know the exact numbers, but it’s something on the order of seconds.

On the cost of not excluding it: if the process is dead, I don’t think there is much difference between excluding it or not. If the process comes back before its data has been fully re-replicated, then the behavior differs: an excluded process stays excluded and continues to have its data moved away, while a non-excluded process rejoins as described below.
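
To check which storage processes the cluster currently treats as excluded, one option is to inspect the machine-readable status. The sketch below is a hypothetical helper, not part of any FoundationDB client: it assumes the status document has a `cluster.processes` map whose entries carry `address`, `excluded`, and a `roles` list with a `role` field, which should be verified against the status output of your version. The sample data is made up.

```python
def storage_processes(status):
    """Return sorted (address, excluded) pairs for processes with a storage role.

    Assumes the layout cluster.processes.<id>.{address, excluded, roles[].role};
    these field names are an assumption to verify against your cluster's output.
    """
    processes = status.get("cluster", {}).get("processes", {})
    result = []
    for proc in processes.values():
        roles = {r.get("role") for r in proc.get("roles", [])}
        if "storage" in roles:
            result.append((proc.get("address", ""), bool(proc.get("excluded", False))))
    return sorted(result)


# Made-up sample shaped like the assumed layout.
sample_status = {
    "cluster": {
        "processes": {
            "p1": {"address": "10.0.0.1:4500", "excluded": False,
                   "roles": [{"role": "storage"}]},
            "p2": {"address": "10.0.0.2:4500", "excluded": True,
                   "roles": [{"role": "storage"}]},
            "p3": {"address": "10.0.0.3:4500", "excluded": False,
                   "roles": [{"role": "log"}]},
        }
    }
}

for address, excluded in storage_processes(sample_status):
    print(address, "excluded" if excluded else "not excluded")
```

In practice the dictionary would come from parsing the output of `status json` in fdbcli rather than from a hand-written sample.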

On what happens when the storage server comes back: if all of its data has been re-replicated, then the data files will be deleted and it will start fresh. Otherwise, the storage server will retain its data files and remain responsible for any data that hadn’t yet been re-replicated. It’s possible that this is only a small portion of the data originally assigned to that process, in which case much of the space in the files will be unused and will have to be reclaimed for reuse.

Additionally, the cluster will attempt to move new data onto this process to replace data that had been moved away in order to keep the logical amount of data stored on each process balanced.
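
The rejoin behavior described above can be summarized as a toy model. This is not FoundationDB code: `rejoin_plan` and its arguments are hypothetical names, and shards are represented as plain strings purely for illustration.

```python
def rejoin_plan(original_shards, re_replicated_shards):
    """Toy model of what a returning storage server keeps.

    original_shards: shards the server held before it became unreachable.
    re_replicated_shards: shards the cluster has since copied elsewhere.
    """
    remaining = set(original_shards) - set(re_replicated_shards)
    if not remaining:
        # Everything was re-replicated: files are deleted, the server starts fresh.
        return {"keep_files": False, "still_responsible_for": set()}
    # Otherwise the files are retained and the server keeps the leftover shards.
    return {"keep_files": True, "still_responsible_for": remaining}


# Server came back after only shard "a" had been re-replicated.
print(rejoin_plan(["a", "b", "c"], ["a"]))
# Server came back after everything had been re-replicated.
print(rejoin_plan(["a", "b", "c"], ["a", "b", "c"]))
```

The model leaves out the rebalancing step, where the cluster later moves new data onto the returning process.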

On the general protocol: I think this was mostly answered above for the storage server. When this happens to a log or any stateless process, the cluster will perform a recovery and recruit a new transaction subsystem to replace the old one. If a log rejoins before the data in its files is made unnecessary, then it may also participate in the recovery, though I’m not certain of all the details. If it rejoins after the recovery from the old logs is completed, though, then it too would be able to delete its files.
