Proxy all-to-all communication

Why does a proxy have to communicate with all the other proxies when getting a read version for a client?

This post from @alexmiller leads me to believe that the proxy all-to-all communication is a performance optimization: it lets the client get a read version that is only the highest successfully committed version, rather than the highest commit version the master has ever handed out.

This means new transactions do not need to wait for all the in-flight mutations to be available on all the storage servers the new transactions read from.

Is that a good summary for why this is the case? Are there other benefits to not just using the version from the master itself? Scalability doesn’t seem like a likely explanation, given that the proxy all-to-all communication, while batched, is not more scalable than batching requests to the master.

The short answer is: to guarantee external consistency.

For the long answer, let’s look at the alternatives:

Get Read-Version from Master

You could get a read-version from the master: ask the master for the last version it gave out and use that as a read version.

As you correctly pointed out, this has the drawback that now the client would need to wait for all in-flight transaction-batches (a write version is used for a batch of transactions) to commit. This can take a long time and I guess this is the main reason for it not being done that way. However, this would be a correct way of getting read versions.

Another drawback would be a bigger chance of seeing future_version errors, which would increase load on the system. Last but not least, the master role can’t be scaled, so putting more work onto the master is generally something you want to avoid (even though giving out read-versions isn’t super expensive, it would still require the master to get a transaction budget from Ratekeeper, batch requests, and maintain potentially thousands of network connections from clients and server processes).

Get Read-Version from one Proxy

You can actually do this by setting the flag CAUSAL_READ_RISKY (I am not a fan of the name, as I would argue that in almost no case is doing this actually risky). In that case the client asks a random proxy for its last committed version.

This is probably the cheapest way of doing it (and I would argue in most deployments this should be the default).
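
For reference, CAUSAL_READ_RISKY is exposed as a per-transaction option in the client bindings. A minimal sketch with the Python binding might look like this (the key and API version are placeholders; adjust them to your setup):

```python
import fdb

fdb.api_version(620)   # placeholder; pick the API version of your installed client
db = fdb.open()

@fdb.transactional
def read_value(tr, key):
    # CAUSAL_READ_RISKY: the read version handed out is guaranteed to be
    # committed, but after a fault it might not be the very latest one.
    tr.options.set_causal_read_risky()
    value = tr[key]                       # Value future; resolves on use
    return value if value.present() else None

value = read_value(db, b'some-key')
```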

However, this strategy comes with drawbacks:

  1. You could have a stale proxy. In that case the read-version would be old. This would increase chances of transaction conflicts (and conflicts are generally bad as these transactions use resources without doing work).
  2. It will violate external consistency. This means that if you start two transactions A and then B (in that order), A might get a higher read-version than B - so B will logically run before A. This basically means that your transactions will be reordered. Your transactions will still be serializable (a database is allowed to reorder transactions as long as the resulting execution is serializable), and this is why I would argue it is in general not a big deal.
  3. You might not read your own writes. This is a consequence of losing external consistency. You could write something in a transaction, successfully commit, and then start another transaction that reads the same keys but sees them in the state they were in before you committed. I guess this is really the worst drawback, as it is not very intuitive to the user. But keep in mind: you still wouldn’t violate serializability: any attempt to write after reading this old state will result in a conflict.

I would like (and probably will implement) some changes to make the above problems less problematic. We can do something for each point:

  1. The proxies could communicate with all other proxies at least once per 10ms (I made up this number on the spot - might be larger or smaller). This way, a proxy could limit how old the read version is.
  2. I think not having external consistency is fine for most applications. If you need external consistency, you need to disable CAUSAL_READ_RISKY.
  3. Whenever the client commits a transaction, it could update an internally maintained max-committed version. After getting a read-version it could compare this read version with the highest committed version it knows of and use the larger one. This would be free for the FDB cluster (and very cheap on the client), and it would guarantee that you would always see your own writes. From the view of one client you would have almost external consistency (see the sketch right after this list).
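
A minimal, purely illustrative sketch of point 3 on the client side; grv_from_random_proxy is a hypothetical stand-in for the “ask one proxy” path, not real fdbclient code:

```python
# Illustrative sketch only: not taken from the actual FDB client implementation.

class ClientVersionTracker:
    def __init__(self):
        # Highest commit version this client has seen for its own transactions.
        self.max_committed_version = 0

    def get_read_version(self, grv_from_random_proxy):
        # Ask a single proxy for its last committed version, then never go
        # below a version this client itself has already committed.
        proxy_version = grv_from_random_proxy()
        return max(proxy_version, self.max_committed_version)

    def on_commit_success(self, committed_version):
        # Called after every successful commit: preserves read-your-writes
        # for this client at no cost to the cluster.
        self.max_committed_version = max(self.max_committed_version,
                                         committed_version)
```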

With these two simple changes (1 and 3; and 1 might have been done already, but I need to check the code), clients could only observe the lack of external consistency if they communicate with each other through channels other than FDB (for example through direct network communication or another database). If that is the case in your application, you might not want to use CAUSAL_READ_RISKY.

Don’t get Read-Versions

This is actually a valid strategy. Whenever you commit, you get a version back that you can use as a read-version for your next transaction. Obviously, if your client doesn’t generate any load, you will get a stale read version and all your transactions will abort (either by conflicting or by getting past_version errors).

However, you could measure throughput at your client and decide to only fetch a read-version from a proxy if your last commit is old. Clients could even share their versions with each other, although I am not convinced you can do this in a more efficient way than just using the proxies.
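
As a sketch of this strategy with the Python binding: set_read_version() and get_committed_version() are real binding calls, while the 100 ms freshness window and the lack of retry handling are arbitrary simplifications for illustration.

```python
import time
import fdb

fdb.api_version(620)          # placeholder; match your installed client
db = fdb.open()

# Client-side cache of the last commit this process performed.
last_commit_version = None
last_commit_time = 0.0
MAX_AGE_SECONDS = 0.1         # arbitrary freshness window chosen for this sketch

def write_and_remember(key, value):
    """Write a key and remember the commit version for later reads."""
    global last_commit_version, last_commit_time
    tr = db.create_transaction()
    tr[key] = value
    tr.commit().wait()
    last_commit_version = tr.get_committed_version()
    last_commit_time = time.time()

def read_at_cached_version(key):
    """Skip the GRV round trip if our own last commit is recent enough."""
    tr = db.create_transaction()
    if last_commit_version is not None and time.time() - last_commit_time < MAX_AGE_SECONDS:
        # Reuse our last commit version as the read version. If it is too old,
        # reads fail with past_version / transaction_too_old and the caller
        # should retry with a normal read version instead.
        tr.set_read_version(last_commit_version)
    value = tr[key]                       # Value future; resolves on use
    return value if value.present() else None
```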

I hope this explanation helps. There might be other ways of optimizing this part and I would be happy to hear about them.

Thanks, that’s very helpful. I was hoping to enumerate the different trade-offs between architectures where clients talk directly to the master, clients talk to a proxy without all-to-all messages, and clients talk to the proxy with all-to-all messages. I think that covers it.

Is it also used as a way for a proxy to detect that it is partitioned away and should stop? I could imagine a scenario in which a client, a proxy, and an old master (in my model where the proxy doesn’t use all-to-all messages) are partitioned away and could participate in some read-only transaction. I think the TLogs all get locked to tell them not to commit transactions from older generations of the transaction subsystem anymore, but I don’t know about the proxies and their behavior here for read-only transactions.

I thought CAUSAL_READ_RISKY does still require the all-proxy communication, but it makes it so you don’t have to check in with the transaction logs (based on https://github.com/apple/foundationdb/blob/d19be2348a90f41b1dc715905188250b0475b418/fdbserver/MasterProxyServer.actor.cpp#L1029).

Oh, you’re right, guess my mental model was wrong.

So this means there’s actually another possible architecture for my list above, and this is what CAUSAL_READ_RISKY currently does. Let me quickly try to finish the list:

Get Read-Versions from all Proxies but ignore TLogs

So this is what CAUSAL_READ_RISKY actually does (and not what I wrote above): you could get a read version from a proxy that was already declared dead. This will give you external consistency as long as there are no recoveries. The main benefit is that it reduces load on the log system.

Usually it is the failure monitor’s job to detect partitions. If the proxies can’t communicate with each other (but all proxies can reach the cluster controller), they will simply wait forever without handing out read-versions. This is actually an interesting question: I don’t know how the system would detect this kind of partition. I guess in most cases the partition would go away eventually, but that is obviously not a good argument. It might be that the master will eventually kill itself if it can’t run system transactions? But I am not sure.

Another big benefit of this is that there exist some deployment configurations with logs in distant locations (for example, to get durability to more than one datacenter), and get read version requests can avoid making remote round trips if they don’t talk to these logs.

Hi, this is a really interesting discussion. Some comments/questions.

Just to understand: by “start A and then B”, do you mean “start A, commit A and then start B”, or do you mean that A and B start one after the other, with B starting before A commits?

Instead of keeping this max-commit version private within the client, you could piggyback it on every message the client sends to a proxy. Then the proxy can update its own value of “last known committed version”. The proxy can then enforce that clients see monotonically increasing versions, and data freshness is improved (at the cost of having to transmit this info).
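
A rough, purely hypothetical illustration of that piggybacking idea (FDB’s real proxy is C++ and is not structured like this):

```python
# Hypothetical sketch of the piggybacking idea; not FDB's actual proxy code.

class ProxySketch:
    def __init__(self):
        self.last_known_committed = 0

    def handle_get_read_version(self, client_max_committed):
        # Fold in whatever the client has already seen committed...
        self.last_known_committed = max(self.last_known_committed,
                                        client_max_committed)
        # ...and never hand out a version below it, so a single client
        # observes monotonically increasing read versions even when its
        # requests land on different proxies.
        return self.last_known_committed

    def handle_commit_reply(self, committed_version):
        # Normal path: commits through this proxy also advance its view.
        self.last_known_committed = max(self.last_known_committed,
                                        committed_version)
```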

What is the role of the tlogs when determining the read-versions for a transaction?
If CAUSAL_READ_RISKY is disabled, are there any consistency issues in a failure-free scenario, or can consistency be compromised only if failures/network partitions happen?

Thank you

Yes. If you start B before you commit A, FDB will guarantee that you don’t see anything that A has written, no matter what, and in that case you wouldn’t violate external consistency. If you start A, commit A, and then start B, and you do not read from all proxies and/or don’t check the tlogs (I kind of left that part out in my first post), it could happen that B is not able to read data that was written by A (and if B reads that stale state, it won’t be able to write afterwards - hence my argument that it is generally not a big deal).

Yes, that is a good point. Though it would only improve it, not solve it, as clients only talk to one proxy for each get-read-version request (so one proxy could go without seeing a request for a long time and therefore still be stale). Also, clients don’t stick with one particular proxy; instead they load balance their requests across all proxies. It would also not solve the tlog problem.

The Cluster Controller within FDB broadcasts the list of all proxies (and some other data - you can find that in the code in the struct ClientDBInfo) to all clients.

We don’t guarantee that this object is up to date - only that it will be updated eventually. So if the cluster goes through a recovery, you might talk to an old proxy and that proxy might not know that it should be dead. So you might get a stale read version. The proxies will get informed (locked) by the new master process early in the recovery phase.

It can also happen if new machines come up. In that case the Cluster Controller or master could decide that it can run a better configuration and will therefore force a recovery. But without any recoveries, you won’t run into any issues with CAUSAL_READ_RISKY (I think).

Ok, my confusion was caused by the fact that I did not realize that A and B are executed by the same client (your explanation works for the general case anyway).

Sure, that would only be an improvement to be implemented on top of the periodic all-to-all broadcast plus whatever needs to be done with the tlogs.
Actually, instead of an all-to-all broadcast, you could use some kind of hierarchical gossiping across the proxies, as done in systems like GentleRain [1]. The per-round message complexity goes from O(N^2) for all-to-all down to O(N), with O(log N) propagation depth, where N is the number of proxies. Probably it’s not worth the effort with a low proxy count, though.
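
For illustration, a toy version of such a tree exchange among N proxies; it uses 2(N-1) point-to-point messages per round instead of the N(N-1) of an all-to-all exchange:

```python
# Illustrative only: proxies arranged in a binary tree aggregate their max
# committed versions up to the root and fan the result back down.

def tree_aggregate_max(versions):
    """versions[i] is proxy i's last committed version; returns the global max
    and the number of point-to-point messages a tree exchange would use."""
    n = len(versions)
    messages = 0
    agg = list(versions)
    # Upward pass: child i reports its aggregate to parent (i - 1) // 2.
    for i in range(n - 1, 0, -1):
        parent = (i - 1) // 2
        agg[parent] = max(agg[parent], agg[i])
        messages += 1
    # Downward pass: the root's result is pushed back to every other proxy.
    messages += n - 1
    return agg[0], messages

global_max, msgs = tree_aggregate_max([105, 99, 112, 101])
# global_max == 112; msgs == 6, i.e. 2*(N-1) messages versus N*(N-1) all-to-all
```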

Thanks for the details on the tlogs!


[1] GentleRain: Cheap and Scalable Causal Consistency with Physical Clocks, https://infoscience.epfl.ch/record/202079/files/paper.pdf

If CAUSAL_READ_RISKY being disabled means that you aren’t using the option, then we guarantee strict serializability regardless of faults in that case. If you are using the option, then I believe the risk is that you could suffer reduced consistency guarantees during recoveries, as Markus stated.

Yes, thank you for pointing that out: of course it is enabling a flag called “risky” that implies trouble, not disabling it :)