Race condition for sequencer election?

Let’s say the sequencer of fdb ends up in a partial network partition, and the cluster then elects a new sequencer.
However, before the info about the new sequencer reaches all the nodes, some of the nodes still message the old sequencer (maybe they are on the side of the partition that can still talk to it).

So now we have a situation where a new sequencer has been elected by the coordinator nodes, but before every node knows about it, some nodes can still talk to the old sequencer.

Doesn’t this cause a race condition?

Yes, but it doesn’t matter. The problem you describe exists for commit proxies, grv proxies, resolvers, sequencer… Basically for anything except coordinators and tlogs (talking to them requires a quorum).

In order to understand why this isn’t a problem you need to understand how transactions are started. When we go through a recovery (which is how we elect a new sequencer) we logically replace the whole transaction subsystem (everything except storage servers). We call this a new generation.

First in-flight transactions: these are kind of easy: they might still be able to read (since they already have a valid version number) and that is ok. They might abort earlier than 5 seconds after they started since a recovery will advance the life version and storages will learn about this eventually, but successful reads will return the correct data.

If such an in-flight transaction attempts a write, it will fail. This happens either because the commit proxy changed (and the client knows it needs to abort before it even attempts to commit), because it can’t talk to the commit proxy (it might have died already), or because the commit proxy can’t write to the tlog system. The commit proxy needs to write to all tlogs, so if a new recovery started, at least some of them will be locked. This last case is what would have happened in the scenario you described.

For new transactions it’s not a problem either. If they talk to a grv proxy of the new generation the transaction will start normally (which is correct – so this case isn’t interesting). The interesting case is what happens if it talks to a grv proxy of the old generation (which can happen due to timing and/or a network partition). This is how it gets a consistent read version (simplified):

  1. Ask the grv proxy for a read version
  2. The grv proxy asks the sequencer for the most recent version and in parallel sends a request to all tlogs.
  3. Only if the sequencer and a majority of tlogs reply will the grv proxy reply to the client

(we have an optimization path here which we use in production where we ask the tlog only every 80ms – this works but is a bit complicated to explain in a short forum post).

So if you look at the three steps above, even if the sequencer and the grv proxy are still alive and they don’t know yet that the cluster went through a recovery, the grv proxy won’t be able to complete the request. So it will hang forever (or rather until the client learns about the new generation in which case it will retry on the new generation of grv proxies).
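
To make the three steps a bit more concrete, here is a toy Python sketch. It is not FoundationDB code: `TLog`, `Sequencer`, `confirm_live` and `get_read_version` are names I made up for illustration, and returning `None` stands in for "the request hangs". The point it shows is that an old grv proxy and an old sequencer can both be alive and still be unable to hand out a read version once a recovery has locked enough tlogs.

```python
from dataclasses import dataclass


@dataclass
class TLog:
    """Toy tlog: once a recovery locks it, it no longer answers the old generation."""
    locked: bool = False

    def confirm_live(self) -> bool:
        return not self.locked


@dataclass
class Sequencer:
    """Toy sequencer that only remembers the most recently committed version."""
    committed_version: int


def get_read_version(sequencer, tlogs):
    """Simplified GRV path of a grv proxy (hypothetical helper).

    Step 2: ask the sequencer for its version and ask all tlogs.
    Step 3: reply only if a majority of tlogs answered.
    """
    version = sequencer.committed_version
    replies = sum(1 for tlog in tlogs if tlog.confirm_live())
    if 2 * replies > len(tlogs):
        return version
    return None  # a real proxy would keep waiting instead of returning


tlogs = [TLog() for _ in range(5)]
old_sequencer = Sequencer(committed_version=1000)
print(get_read_version(old_sequencer, tlogs))  # 1000

# A recovery locks 3 of the 5 tlogs; the old generation does not know yet.
for tlog in tlogs[:3]:
    tlog.locked = True
print(get_read_version(old_sequencer, tlogs))  # None
```

In the second call the old sequencer still happily reports version 1000, but the proxy never gets enough tlog replies, so the client never sees a stale read version.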


Thank you for the excellent, detailed answer.
I have another one if that’s ok:

How does the new sequencer know how to resume from where the previous one left off? Does the previous sequencer checkpoint itself periodically? If you relied on just the wall clock as the starting point for each sequencer, that could cause trouble since wall clocks can go backward.

The sequencer is stateless, so no, there’s no checkpointing happening there.

To understand how this part works we first have to establish the exact job of the sequencer. It needs to keep track of two versions. One is the most recently committed version (the grv proxies get this version from the sequencer in order to reply to GetReadVersion requests from clients). The other is the latest version it has given to a commit proxy. When a commit proxy commits, it will batch multiple commit requests, get a new version from the sequencer, and attempt to write at that version. This can fail early due to conflicts: the commit proxy sends the read and write set of each transaction, together with its read version and the new write version, to the resolvers, and they reply with ack/fail for each commit request to let the commit proxy know whether it succeeded.

Assuming a commit is ready to be written the commit proxy will execute these steps (and it has to execute them in this exact order – otherwise we could violate strict serializability):

  1. First it will send the mutations in the transactions to the tlog system and wait for an acknowledgment from all tlogs. The tlogs will only answer after they have written the data to disk and executed an fsync. So at this point we know that the commit is durable.
  2. Then the commit proxy will send a message to the sequencer letting it know that it successfully committed. At this point the sequencer will increase the most recently committed version and all following GRV requests will at least get this version from the grv proxies.
  3. Finally the commit proxy will send an acknowledgment to the client.
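
The ordering above can be sketched roughly like this. Again this is a toy model with made-up names (`TLog.push`, `Sequencer.report_committed`, `commit`), resolvers are left out, and versions simply count up by one, which is not how real versions advance. It only illustrates that the committed version moves after the data is on all tlogs, and that a single locked tlog is enough to stop a commit from the old generation.

```python
class TLogLocked(Exception):
    """Raised by a toy tlog that was locked by a recovery."""


class TLog:
    def __init__(self):
        self.locked = False
        self.log = []

    def push(self, version, mutations):
        if self.locked:
            raise TLogLocked()
        # Stands in for "write to disk and fsync" before acknowledging.
        self.log.append((version, mutations))


class Sequencer:
    def __init__(self, start_version):
        self.last_given = start_version  # latest version handed to a commit proxy
        self.committed = start_version   # most recently committed version

    def next_commit_version(self):
        self.last_given += 1
        return self.last_given

    def report_committed(self, version):
        self.committed = max(self.committed, version)


def commit(sequencer, tlogs, mutations):
    """Commit-proxy side of a commit that already passed conflict checks."""
    version = sequencer.next_commit_version()
    for tlog in tlogs:                    # step 1: durable on ALL tlogs
        tlog.push(version, mutations)
    sequencer.report_committed(version)   # step 2: visible to new read versions
    return version                        # step 3: acknowledge to the client
```

If any `push` raises, step 2 never runs, so no later read version can observe a commit that might still be rolled back.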

Btw: this is how you might get commit_unknown_result errors on the client: if the transaction subsystem fails before step 3 completes the client will eventually learn about the new grv and commit proxies but it won’t know whether step 1 completed (it will never receive an answer from the old commit proxy and it knows that because there’s now a new generation of the transaction subsystem). Only if step 1 fully completed (and there were no conflicts) is it guaranteed that the transaction committed. But there’s no way for the client to know. If step 1 didn’t even start, the commit obviously didn’t make it. If it started but didn’t complete, the transaction might have been committed. So it’s not an easy error to handle.
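
One common way to cope with this is to make the transaction idempotent, so retrying after an unknown result is harmless. Below is a minimal sketch of that idea against a toy in-memory store. `ToyStore`, `CommitUnknownResult`, `add_credit` and the `done/` marker key are all hypothetical names for illustration and not the API of any FoundationDB binding.

```python
class CommitUnknownResult(Exception):
    """Toy stand-in for a commit whose outcome the client never learned."""


class ToyStore:
    """In-memory store that can apply a commit and then lose the acknowledgment."""

    def __init__(self, lose_acks=0):
        self.data = {}
        self.lose_acks = lose_acks

    def commit(self, writes):
        self.data.update(writes)  # the commit became durable (step 1 completed)
        if self.lose_acks > 0:    # ...but the ack never reached the client
            self.lose_acks -= 1
            raise CommitUnknownResult()


def add_credit(store, account, amount, request_id):
    """Add `amount` exactly once, even if a commit result is unknown."""
    marker = f"done/{request_id}"
    while True:
        if marker in store.data:
            return  # an earlier attempt did commit, nothing left to do
        balance = store.data.get(account, 0)
        try:
            # The marker is written together with the change itself.
            store.commit({account: balance + amount, marker: True})
            return
        except CommitUnknownResult:
            continue  # retry; the marker check makes the retry safe


store = ToyStore(lose_acks=1)
add_credit(store, "alice", 10, request_id="req-1")
print(store.data["alice"])  # 10
```

Without the marker, the retry after the lost acknowledgment would have added the credit twice.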

Anyways: so now how does a new sequencer know what the most recently committed version is? In the recovery step we’ll execute the following steps:

  1. Find the tlogs and send them a lock request. This request will instruct them to not accept any new writes. This has to succeed on some number of tlogs (the number depends on your configuration: if you have triple replication it will be the total number of active tlogs in the last generation minus two). The tlogs will reply with the newest version they know about.
  2. Of all these versions we will take the oldest one. Since commits go to all tlogs, we know that everything that isn’t present on all tlogs can be safely rolled back. Note that this means that the version number we choose depends on how many tlogs report back. This version we call the epoch end.
  3. Now we recruit a new set of tlogs (they can run on the same machines as the old ones) and we’ll calculate the recovery version, which is the epoch end version plus at least 5 million (5 million because that’s how many versions we generate in 5 seconds; we might decide to move forward by more than that in order to synchronize better with time). We let the old tlogs know that they should “seal” at the epoch end version (this is basically the rollback) and we let the new tlogs know what the recovery version is (this is the version of the first commit and they won’t accept any writes until they’ve seen this version).
  4. Finally we can send the recovery version to the sequencer. At this point the most recently committed version and the last given out write version will both be the same (the recovery version). The sequencer will never give out the recovery version though (in the recovery process we’ll also create a recovery transaction which will commit at this version).
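
The arithmetic in steps 1 to 3 is small enough to write down. This is only a sketch with names I picked (`required_lock_replies`, `epoch_end`, `recovery_version`, `extra_versions`). The rule "number of tlogs minus (replication factor minus one)" is my generalization of the triple-replication case above and should be read as an assumption.

```python
VERSIONS_PER_SECOND = 1_000_000          # from "5 million versions in 5 seconds"
RECOVERY_JUMP = 5 * VERSIONS_PER_SECOND  # minimum jump past the epoch end


def required_lock_replies(num_tlogs, replication_factor=3):
    """How many old tlogs must answer the lock request.

    With triple replication this is num_tlogs - 2, as described above.
    Other replication factors are an assumed generalization.
    """
    return num_tlogs - (replication_factor - 1)


def epoch_end(reported_versions):
    """Step 2: the oldest of the versions reported by the locked tlogs."""
    return min(reported_versions)


def recovery_version(reported_versions, extra_versions=0):
    """Step 3: epoch end plus at least 5 million versions."""
    return epoch_end(reported_versions) + RECOVERY_JUMP + extra_versions


reported = [1_200, 1_180, 1_195]      # replies from 3 of 5 old tlogs
print(required_lock_replies(5))       # 3
print(epoch_end(reported))            # 1180
print(recovery_version(reported))     # 5001180
```

Note that no wall clock is involved in picking the starting point: the new sequencer starts from a version derived from what the old tlogs had, so it can never go backward.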

At this point the recovery isn’t done, but the new sequencer will be ready. The whole process is quite complex. You can even have situations where several of these recoveries happen at the same time. But only one recovery will succeed. The way we enforce this is by making sure that at the very last stage, right before we let the clients know where to find the new grv and commit proxies, we write to the coordinators. The coordinators implement a paxos-like protocol to guarantee consistency. Without that we would run into trouble with network partitions.
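
A very rough way to picture why only one recovery wins: treat the coordinators as a single register that only accepts a write if nobody else wrote since you last read it. The sketch below collapses the whole paxos-like quorum protocol into one in-memory object, so it is an analogy and not how the coordinators are implemented. `CoordinatedState` and `write_if_unchanged` are hypothetical names.

```python
class CoordinatedState:
    """Toy stand-in for the state kept by the coordinators."""

    def __init__(self):
        self.generation = 0
        self.owner = None

    def read(self):
        return self.generation

    def write_if_unchanged(self, seen_generation, owner):
        """Accept the write only if no other recovery finished in between."""
        if self.generation != seen_generation:
            return False
        self.generation += 1
        self.owner = owner
        return True


state = CoordinatedState()

# Two recoveries start concurrently and both read the same generation.
seen_by_a = state.read()
seen_by_b = state.read()

print(state.write_if_unchanged(seen_by_a, "recovery-A"))  # True
print(state.write_if_unchanged(seen_by_b, "recovery-B"))  # False
print(state.owner)                                        # recovery-A
```

The losing recovery has to start over, and clients are only ever pointed at the proxies of the recovery whose final write succeeded.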
