Read-after-write transaction flow


I’m just learning about FoundationDB. I have a question about the flow when a key is written and then immediately read.

From what I understand so far:

  1. A transaction is written to the transaction log synchronously
  2. It is written to the storage server asynchronously
  3. Reads are serviced directly from the storage server

If the above is correct, then how is a read-after-write serviced if it hasn’t made it to the storage server yet?

Transaction 2 gets a read version which is guaranteed to be at least the commit version of any transaction that was committed before transaction 2 started, so at least the commit version of Transaction 1. That read version identifies a consistent snapshot of the database, and all reads in that transaction are served at precisely that read version. The storage servers keep track of which keys and at which versions they can accurately respond to read requests for. When a storage server gets a read request, it waits until it knows it can accurately respond to the read request before replying (or in some failure scenarios it might give up on waiting and reply with a retryable error). The reply may indicate that the client has the wrong storage server for that request, since the client’s key location cache might be stale. In that case the client needs to update its key location cache and try the correct storage server.

That became kind of long-winded, but I think that answers your question. Let me know if any part was unclear.

When a storage server gets a read request, it waits until it knows it can accurately respond to the read request before replying

Does this mean that if the Storage Server only has Read Version 1, and there is a request for Read Version 2, the Storage Server will block until it has caught up to Read Version 2?

That’s right. It doesn’t block indefinitely though - it responds with a retryable error if it hasn’t caught up after a timeout.

1 Like

Could you explain this a bit more? When responding to a request, ss would probably have to wait to sync with tlogs till the requested readversion, even if there are no recent changes to the requested key; is that correct?

If above is yes, then is the assumption here that ss will never fall further than single digit milliseconds behind tlogs for majority of the time, if it has to provide less than 10ms latency per key read? This is something I haven’t been able to clearly understand when I was using fdb few years ago.

This is correct. In normal operation this doesn’t add extra read latency because the storage servers don’t need to wait for a version to be committed to read the mutations for that version from the tlog. Reading uncommitted data would violate strict serializability, but the GRV proxies only send back read versions which are committed.

During a recovery, this does mean that storage servers can have mutations from an uncommitted version in memory. Part of the recovery process is to “roll back” those mutations. This is currently implemented by just re-initializing all the in-memory mutations from the new epoch.

1 Like

I wrote a workload once to measure the latency. Add a ReadAfterWrite workload, to measure TLog->SS propagation delay by sfc-gh-almiller · Pull Request #3442 · apple/foundationdb · GitHub

The numbers I got showed tlog->ss propagation delay as unnoticeable to a client. GRV path was optimized in 7.0, so maybe(?) it’d be slightly more noticeable now, and different clusters can behave differently.

It calls getRandomKey(), so it’s not safe to use on production clusters, but a small change to fix to a specific key if provided, or to allow a conflict-range only commit, could make it safe to run on a production cluster.

1 Like

Thanks Alex! That’s great, and quite conclusive :slight_smile: