What Happens When a Server Crashes and It Is the Last Source for a Shard?

Hi all~

I’m curious about what happens in FoundationDB when a server crashes and it is the last source for some shards. The document design/tlog-spilling.md.html states:

“However, in the presence of failures, the transaction log can be required to buffer significantly more data. Most notably, when a storage server fails, its tag isn’t popped until data distribution is able to re-replicate all of the shards that storage server was responsible for to other storage servers. Before that happens, mutations will accumulate on the TLog destined for the failed storage server, in case it comes back and is able to rejoin the cluster.”

Additionally, the comment for the removeKeysFromFailedServer function in MoveKeys.actor.cpp says:

“// If (failed) serverID is the last source server for a shard, the shard will be erased, and then be assigned to teamForDroppedRange.
// Assign the shard to the new team as an empty range. Note, there could be data loss.”

My question is: if a shard has already been assigned as an empty range, and the failed server comes back online, will the system still try to recover that shard from the failed server? Moreover, would the system try to replay the TLog from the beginning to restore the lost shard?

I noticed there was a previous discussion on this question: Behavior when all replicas of some shards are missing?. In that discussion, it was mentioned:

Writes to missing shards will be possible, but will buffer indefinitely on TLogs, potentially eventually filling up their disk space. Reads from missing shards will hang indefinitely.

However, I believe this is not how the system works nowadays.

The documentation is mostly correct, but it doesn’t cover every scenario. After a storage server (SS) fails, and before the Data Distributor (DD) removes the failed SS, the mutations destined for that SS are still sent to the SS’s buddy TLog, but they are never peeked and thus never popped from the TLog. As a result, the TLog can buffer these mutations and consume more and more disk space.
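To make the buffering concrete, here is a minimal, self-contained sketch (the types and names are invented for illustration; this is not FoundationDB code) of how mutations tagged for each storage server accumulate on a TLog until that server peeks and pops its tag:

```cpp
#include <cstdint>
#include <deque>
#include <iostream>
#include <map>
#include <string>
#include <utility>

// Hypothetical, simplified model: each storage server has a "tag", and the
// TLog keeps a per-tag queue of (version, mutation) pairs. A storage server
// peeks its tag, makes the mutations durable, and then pops up to the
// applied version. A failed server never peeks, so its queue only grows.
struct SimpleTLog {
    std::map<int, std::deque<std::pair<int64_t, std::string>>> queues; // tag -> buffered mutations

    void push(int tag, int64_t version, const std::string& mutation) {
        queues[tag].emplace_back(version, mutation);
    }

    // Called on behalf of a storage server after it has made data durable.
    void pop(int tag, int64_t upToVersion) {
        auto& q = queues[tag];
        while (!q.empty() && q.front().first <= upToVersion)
            q.pop_front();
    }

    size_t buffered(int tag) const {
        auto it = queues.find(tag);
        return it == queues.end() ? 0 : it->second.size();
    }
};

int main() {
    SimpleTLog tlog;
    const int healthySS = 1, failedSS = 2;

    // Commits keep arriving for both servers' tags.
    for (int64_t v = 1; v <= 1000; ++v) {
        tlog.push(healthySS, v, "set k=v");
        tlog.push(failedSS, v, "set k=v");
        // Only the healthy server peeks and pops; the failed one is silent.
        if (v % 10 == 0)
            tlog.pop(healthySS, v);
    }

    std::cout << "healthy SS backlog: " << tlog.buffered(healthySS) << "\n"; // bounded
    std::cout << "failed  SS backlog: " << tlog.buffered(failedSS) << "\n";  // keeps growing
}
```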

However, the DD continuously monitors all SSes in the system and will notice that the SS has failed (after a certain interval). The DD then issues data movement to copy the shards (key ranges) on the failed SS to a different team. After the data movement finishes, all key ranges on the failed SS have been reassigned to new teams, so no data remains assigned to it. Only then does the DD actually remove the SS from the cluster. A side effect of this removal is that the buffered mutations on its buddy TLog are cleared, freeing up disk space.
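The ordering is the important part: shards are re-replicated first, and only after no key range is assigned to the failed SS is it removed and its TLog backlog released. A rough sketch of that sequence (again with invented types, not the actual DD code):

```cpp
#include <cstdint>
#include <iostream>
#include <map>
#include <set>
#include <string>

// Invented, simplified model of the Data Distributor's cleanup sequence for a
// failed storage server. Real FoundationDB does this with actors and system
// metadata; here it is straight-line code just to show the ordering.
struct Cluster {
    // shard (key range) -> set of storage servers holding it
    std::map<std::string, std::set<int>> shardMap;
    // storage server -> number of mutations buffered on its buddy TLog
    std::map<int, int64_t> tlogBacklog;

    bool hasShards(int ss) const {
        for (auto& [shard, team] : shardMap)
            if (team.count(ss)) return true;
        return false;
    }

    // Step 1: data movement copies every shard on the failed SS to another team.
    void relocateShardsFrom(int failedSS, int targetSS) {
        for (auto& [shard, team] : shardMap) {
            if (team.erase(failedSS)) {
                team.insert(targetSS);
                std::cout << "moved " << shard << " to SS" << targetSS << "\n";
            }
        }
    }

    // Step 2: once no shard is assigned to it, the SS can be removed, which
    // also lets the TLog discard everything buffered for its tag.
    void removeServer(int ss) {
        if (hasShards(ss)) {
            std::cout << "cannot remove SS" << ss << ": shards still assigned\n";
            return;
        }
        tlogBacklog[ss] = 0; // buffered mutations for this tag are cleared
        std::cout << "removed SS" << ss << ", TLog space reclaimed\n";
    }
};

int main() {
    Cluster c;
    c.shardMap["[a,b)"] = {1, 2};
    c.shardMap["[b,c)"] = {2, 3};
    c.tlogBacklog[2] = 1'000'000; // SS2 failed; mutations piled up for its tag

    c.removeServer(2);          // refused: shards still assigned
    c.relocateShardsFrom(2, 4); // DD re-replicates its key ranges elsewhere
    c.removeServer(2);          // now succeeds and frees the TLog backlog
}
```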

In the case where the failed SS holds the last replica of a shard, data movement is blocked, so the DD can’t remove the SS: there are still key ranges assigned to it. As a result, available TLog disk space will keep shrinking. Operational alerts should be set up to warn about low available disk space on TLogs so that operators are notified. The operator should then try to bring the failed SS back (after identifying the underlying issue that caused it to fail), which will resolve the problem. If the underlying issue can’t be fixed, the operator has to accept data loss. In that scenario, the operator can use exclude failed to mark the SS as permanently failed, which will re-assign the lost shards to a team as empty ranges. Even if the SS comes back later, the cluster won’t accept it, because exclude failed has removed the SS’s information (its interfaces) from the system metadata. As a result, the data this SS possesses is simply ignored. I don’t recall whether we delete the data on this SS after it tries to rejoin the cluster.
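For the last-replica case, the quoted removeKeysFromFailedServer comment boils down to the following logic (a hypothetical sketch under my reading of that comment, not the actual MoveKeys code): when the failed server was the only remaining source for a key range, the range is handed to a new team as an empty range, so its previous contents are lost.

```cpp
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical sketch of what the quoted comment describes: if the failed
// server is the last source for a shard, the shard is erased and re-assigned
// to teamForDroppedRange as an empty range, which implies data loss for that
// key range.
int main() {
    // shard (key range) -> source servers currently holding it
    std::map<std::string, std::set<int>> shardMap = {
        { "[a,b)", {1, 2} },  // still has a healthy source after SS2 is dropped
        { "[m,n)", {2} },     // SS2 is the LAST source for this shard
    };
    const int failedServer = 2;
    const std::vector<int> teamForDroppedRange = {7, 8, 9}; // new owners of lost ranges

    for (auto& [range, sources] : shardMap) {
        sources.erase(failedServer);
        if (sources.empty()) {
            // No remaining source: assign the range to the new team as an
            // empty range. The previous contents are gone (data loss).
            sources.insert(teamForDroppedRange.begin(), teamForDroppedRange.end());
            std::cout << range << " re-assigned as EMPTY range (data lost)\n";
        } else {
            std::cout << range << " still has " << sources.size() << " source(s)\n";
        }
    }
}
```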