Question about SIGMOD'21 paper

Hi there,
I have a couple of questions about the paper and am looking for an expert’s response.

During log recovery, tLogs need to recover [PEV+1, RV). Since those log entries have not been ACKed to the client as committed, why do tLogs need to do this recovery? Can the new generation of tLogs just start from PEV?

Why does the proxy need to broadcast Header messages to all tLogs? The answer may be version (timestamp) synchronization across all tLogs. But without that synchronization, what trouble would the tLogs run into?

For the first question, the range contains versions that may have been acknowledged to the client, so we need to copy them and can’t simply discard them. For any version larger than RV, we know it can’t have been acknowledged to the client.

For the second question about LogServer broadcast, there are two main reasons:

  • To ensure there are no missing versions on LogServers, which makes the recovery algorithm simpler: it knows that all LogServers have the same continuous sequence of versions.
  • The broadcast also serves as a way to propagate commit versions to storage servers. Without such empty messages, a storage server that peeks from a LogServer may not know the most recent commit version, and thus can’t serve reads.
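
To make the two points concrete, here is a toy sketch of what each LogServer sees with and without the broadcast. The names (`push_commit`, `simulate`, `log0`...) are made up for illustration, and the model ignores everything except which versions reach which log; it is not how FDB implements this.

```python
def push_commit(log_versions, version, tagged_logs, broadcast):
    """Record `version` on the logs that hold its mutations. With `broadcast`,
    every other log also receives an empty message carrying the same version."""
    for log_id in log_versions:
        if log_id in tagged_logs or broadcast:
            log_versions[log_id].append(version)


def simulate(broadcast):
    # Hypothetical cluster with three LogServers and three commits.
    logs = {"log0": [], "log1": [], "log2": []}
    push_commit(logs, 10, {"log0"}, broadcast)
    push_commit(logs, 20, {"log0", "log1"}, broadcast)
    push_commit(logs, 30, {"log0"}, broadcast)
    return logs


with_broadcast = simulate(broadcast=True)
without_broadcast = simulate(broadcast=False)

# With the broadcast, every log has the same continuous versions.
print(with_broadcast)
# Without it, log1 has a gap and log2 has seen nothing, so a storage server
# peeking from log2 cannot tell that version 30 was committed.
print(without_broadcast)
```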

That said, we are working on a proposal, called version vector, that removes this broadcast. We’ll soon benchmark the performance of this feature. The idea is that the proxy only writes to a subset of LogServers. To solve the problem of storage servers knowing commit versions, the sequencer role tracks the latest commit version (LCV) for each storage server. The LCV vector for all storage servers, i.e., the version vector, is sent back to the client in the GRV reply. The client read can then piggyback the LCV to a storage server, which uses it to determine whether the read can be served immediately or needs to wait for data from the LogServer.
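
As a rough sketch of the read-side check only: the class and method names (`StorageServer`, `can_serve_read`) and the dictionary shape of the version vector are assumptions for illustration, since the proposal is still being worked out.

```python
from dataclasses import dataclass


@dataclass
class StorageServer:
    """Hypothetical model: tracks the highest commit version applied locally."""
    applied_version: int

    def can_serve_read(self, lcv: int) -> bool:
        # `lcv` is the latest commit version relevant to this storage server,
        # piggybacked by the client from the GRV reply. If we have applied at
        # least that version, nothing newer is pending for us on the LogServer.
        return self.applied_version >= lcv


# Assumed shape of the version vector: storage server id -> LCV.
version_vector = {"ss1": 120, "ss2": 95}

ss1 = StorageServer(applied_version=120)
ss2 = StorageServer(applied_version=90)

print(ss1.can_serve_read(version_vector["ss1"]))  # True: serve immediately
print(ss2.can_serve_read(version_vector["ss2"]))  # False: wait for the LogServer
```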

Thank you, Jingyu, for your expert explanation. I’m now very clear about the second question. I just need a little more clarification on the first one: do you mean that a committed version a client has already received a reply for may be larger than the maximum of all KCVs? If yes, is there a specific case for this scenario?

Yes, the client may get commit versions larger than max(KCV) on LogServers. The reason is that KCV is sent from the proxy to LogServers. So a proxy may advance its KCV to a version V and tell a client about this version. Then there is a recovery. V was not propagated to any LogServers, and V is larger than any KCVs on all LogServers.

Thank you very much. Your explanation is so helpful for us to understand FDB internals.

PEV is the maximum version that a client could have heard an acknowledgement for, and everything PEV+1 and above can be discarded. If we know that there are more than RF TLogs which did not durably commit PEV+1, then it’s impossible that we would have informed a client that the version is durable. Setting RV=PEV would thus be a valid decision. The reason FDB chooses the most aggressively high recovery version possible is antiquorums.

Antiquorums is a “feature” in FDB that lets you specify the number of TLogs that are allowed to not respond to a commit while the ack that the commit is durable is still sent to the client. The idea behind it was that dropping the slowest TLog from each commit would significantly improve tail latency. I called it a “feature” because it’s essentially broken, and no one uses it on their clusters (outside of simulation tests), because it can lead to terrible behavior. One TLog is permitted to become unboundedly far behind, because it can always be the TLog dropped from the quorum. Storage servers can still be assigned this unboundedly behind TLog as their preferred TLog, and thus become unboundedly far behind themselves. Ratekeeper will eventually throttle the cluster to zero, assuming that storage servers aren’t catching up because of an overload, whereas it’s actually because their TLog isn’t feeding them any newer commits, because the TLog isn’t receiving the commits itself.
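
A toy illustration of the "unboundedly far behind" problem. This is a deliberately crude model with made-up names (`run_commits`, `slow_log`), where the slow TLog never responds at all rather than merely responding late.

```python
def run_commits(num_logs, antiquorum, slow_log, versions):
    """Push each version to every log except `slow_log`, which never responds.
    A commit is acked once num_logs - antiquorum logs have made it durable."""
    durable = [0] * num_logs  # highest durable version per log
    acked = []                # versions acknowledged to the client
    for v in versions:
        responders = [i for i in range(num_logs) if i != slow_log]
        for i in responders:
            durable[i] = v
        if len(responders) >= num_logs - antiquorum:
            acked.append(v)
    return durable, acked


durable, acked = run_commits(num_logs=3, antiquorum=1, slow_log=2,
                             versions=range(100, 600, 100))

# Every commit is acked, yet log 2 never advances, so its lag grows with
# each commit and so does the lag of any storage server that prefers it.
print(acked)                      # [100, 200, 300, 400, 500]
print(max(durable) - durable[2])  # 500
```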

All that said, when you factor antiquorums into recovery logic, you must always choose the most aggressively high recovery version that you can, because if you’ve heard m - k replies, the k replies you didn’t hear might be at an even higher version than the maximum KCV you’ve seen, and the lowest k replies that you heard were TLogs that were anti-quorum’d out of those commits.

So the recovery logic was written this way to support antiquorums, but no one uses antiquorums, and we don’t recommend that anyone use antiquorums. Thus, we didn’t discuss antiquorums in the paper, but we still described the actual recovery logic of FDB in the paper, which was written to support antiquorums, and that’s how we ended up here.

(As humorous nostalgia to me, this was also the first question that @markus.pilman asked when I first met him. :p)

Thank you Alex for showing us the hidden story of antiquorums in fdb.
I just want to confirm: is the following a possible scenario?
"So a proxy may advance its KCV to a version V and tell a client about this version. Then there is a recovery. V was not propagated to any LogServers, and V is larger than any KCVs on all LogServers."

Should it be m-k+1 here? If we lose k tLogs that happen to be a replication team, I think the recovery cannot proceed.

That should be “V wasn’t propagated to all LogServers”. That’s the important difference here, as in any normal FDB cluster that’s not using antiquorums, any version that a client knows about is by definition durable on all TLogs.

In short, if all antiquorum logic was reverted in FDB, then it’d be valid to choose the Recovery Version as min(KCV) instead of max(KCV).

Ah, yes, the text said “more than m - k replies” and I just copied the m - k part.

This scenario is possible. The proxy that replied V to the client would normally send a higher version to the LogServers, which notifies them that KCV=V, but a recovery can make the proxy die before it sends that higher version. As a result, V is durable on 3 LogServers and the client got the reply, yet no LogServer knows KCV=V.
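
The timeline can be sketched as follows. This is a simplified model with hypothetical names (`LogServer`, `push`), assuming each push carries the KCV the proxy knew before that push; it only shows why max(KCV) on the logs can lag behind what the client was told.

```python
from dataclasses import dataclass


@dataclass
class LogServer:
    end_version: int = 0  # highest version made durable on this log
    kcv: int = 0          # known committed version, as last told by the proxy


def push(logs, version, proxy_kcv):
    """Proxy pushes `version` to every log, carrying the KCV it knew beforehand."""
    for log in logs:
        log.end_version = version
        log.kcv = proxy_kcv


logs = [LogServer() for _ in range(3)]
proxy_kcv = 0

push(logs, 100, proxy_kcv)  # version 100 becomes durable on all 3 logs
proxy_kcv = 100             # proxy advances its own KCV...
client_acked = 100          # ...and replies V=100 to the client

# The proxy dies here, before any later push could carry KCV=100 to the logs.
max_kcv = max(log.kcv for log in logs)
min_end = min(log.end_version for log in logs)
print(max_kcv, client_acked, min_end)  # 0 100 100
```

So a recovery that stopped at max(KCV) would discard a version the client has already been told is committed, while the end versions still cover it.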

I had to go back and stare at TagPartitionedLogSystem::{getDurableVersion,epochEnd}. I’m apparently slightly wrong in a number of places where my memory is off (and this code has changed in the meantime).

The calculation of the recovery version is slightly different than I remember. It sorts the list of TLogLockReplys by end version, and basically drops the lowest $antiquorum replies (balancing out that there could be $antiquorum higher TLogs that it didn’t hear back from yet). This ends up calculating a lower recovery version than what I recall the previous code doing, but it’s still safe either way. The TLogLockReplys are then reduced down to the max of all KCVs and the min of all end versions across logsets, and min(end) is used as the recoverAt version.
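
Roughly, as a sketch of the description above rather than a transcription of the real code: `LockReply`, `durable_version`, and `recover_at` are illustrative names, and the clamp that keeps at least one reply is my own guard.

```python
from typing import NamedTuple


class LockReply(NamedTuple):
    end: int  # end version reported by a TLog
    kcv: int  # known committed version reported by that TLog


def durable_version(replies, antiquorum):
    """Per log set: (max KCV, lowest end version left after dropping the
    `antiquorum` lowest end versions)."""
    if not replies:
        raise ValueError("need at least one reply")
    by_end = sorted(replies, key=lambda r: r.end)
    idx = min(antiquorum, len(by_end) - 1)  # guard: always keep one reply
    return max(r.kcv for r in replies), by_end[idx].end


def recover_at(log_sets, antiquorum):
    """Across log sets: max of the KCVs and min of the end versions."""
    per_set = [durable_version(replies, antiquorum) for replies in log_sets]
    return max(k for k, _ in per_set), min(e for _, e in per_set)


log_sets = [
    # The third TLog here was anti-quorum'd out of recent commits.
    [LockReply(105, 98), LockReply(103, 100), LockReply(90, 80)],
    [LockReply(104, 99), LockReply(104, 100)],
]
print(recover_at(log_sets, antiquorum=1))  # (100, 103)
```

With antiquorum=1 the lagging end version 90 is dropped, so the sketch returns 103 as the recoverAt version, above the max KCV of 100.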

Anyway, it appears I did conflate KCV and end version calculation, but the code is still much more conservative in picking a recovery version than what I recall in the past :thinking: