Questions regarding FDB transaction conflict on two concurrent transactions

I have a question regarding concurrent read and write transactions in FoundationDB in terms of conflict checking.

In this example, I have two transactions: a read transaction, T1, performs a scan on a key range, and another write transaction, T2, which changes the keys within this key range, interleaves while the read transaction T1 is pending (not committed yet).

I expect FDB to provide linearizability, meaning that after the write transaction T2 commits, the read transaction T1 should be determined to have conflict and thus gets aborted by the Resolver, because T1 reads the key range that covers the keys that have been inserted/updated by T2. However, when I ran the experiment to perform these two concurrent transactions, I found that the read transaction can read all old keys (meaning the keys with the version before the write transaction T2 starts) and then commit without triggering the FDB to raise error/exception.

Am I missing something? My understanding of the transaction conflict is for transactions T1 and T2, if T1 reads some key ranges with a commit version V1, and T2 has committed and updated some overlapped key ranges with a higher commit version V2, then T1 should be aborted.

The timeline to show T1 and T2 start and end transactions, and the involved key ranges, are shown in the diagram below:

T1						T2
start				
scan some first keys 
of range [0x00, 0xff]

						start
						add new keys within range [0x00, 0xff]
						commit			

scan the remaining keys
commit

Below is the full source code of my sample program. FDB version is 6.2.27, and FDB java client version is 6.22.

import com.apple.foundationdb.Database;
import com.apple.foundationdb.FDB;
import com.apple.foundationdb.KeyValue;
import com.apple.foundationdb.StreamingMode;
import com.apple.foundationdb.Transaction;

import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.concurrent.ExecutionException;

public class SampleConcurrentFDBTx {
   private static final int NUM_KEYS_TO_INSERT = 1000;
   private static final int NUM_KEYS_TO_READ_BEFORE_WRITE = 10;

   public static void main(String[] args) throws ExecutionException, InterruptedException {
       FDB fdb = FDB.selectAPIVersion(620);
       Database db = fdb.open("/usr/local/etc/foundationdb/fdb.cluster");

       // clear the database
       clear(db);

       // To prepare the tests by inserting some keys
       Transaction tx = db.createTransaction();
       for (int i = 0; i < NUM_KEYS_TO_INSERT; i++) {
           byte[] key = ("key" + i).getBytes(StandardCharsets.UTF_8);
           byte[] val = ("val" + i).getBytes(StandardCharsets.UTF_8);
           tx.set(key, val);
       }
       tx.commit().get();

       // Start the read transaction T1 and read some keys first 
       // in the range of [0x00, 0xFF]
       Transaction readTx = db.createTransaction();
       System.out.println("Read tx start at " + readTx.getReadVersion().get() +
               " with read version " + readTx.getReadVersion().get());

       KeySelector begin = new KeySelector(new byte[] { (byte) 0x00 }, true, 0);
       KeySelector end = new KeySelector(new byte[] { (byte) 0xff }, true, 0);
       Iterator<KeyValue> iter = readTx.getRange(begin, end, 0, false, StreamingMode.SMALL).iterator();
       for (int i = 0; i < NUM_KEYS_TO_READ_BEFORE_WRITE; ++i) {
           KeyValue keyValue = iter.next();
           System.out.println(String.format("Key: %s, value: %s", new String(keyValue.getKey()),
                   new String(keyValue.getValue())));
       }
       System.out.println("Read tx completed reading some keys.");

       // Now start the write transaction to add new key-value pairs and then commit
       Transaction writeTx = db.createTransaction();
       System.out.println("Write tx starts at " + System.currentTimeMillis() +
               " with read version " + tx.getReadVersion().get());
       for (int i = 0; i < NUM_KEYS_TO_INSERT; i++) {
           byte[] key = ("key" + i + "abc").getBytes(StandardCharsets.UTF_8);
           byte[] val = ("val" + i + "abc").getBytes(StandardCharsets.UTF_8);
           writeTx.set(key, val);
       }
       writeTx.commit().get();
       System.out.println("Write tx committed at " + System.currentTimeMillis());

       // Back to the read transaction to read remain key-value pairs and commit
       System.out.println("Read tx resumes read keys at " + System.currentTimeMillis());
       while (iter.hasNext()) {
           KeyValue keyValue = iter.next();
           System.out.println(String.format("Key: %s, value: %s", new String(keyValue.getKey()),
                   new String(keyValue.getValue())));
       }
       readTx.commit().get();
       System.out.println("Read tx committed at " + System.currentTimeMillis());

       // The code below could be uncommented to re-scan to verify that the write transaction did commit successfully

//        System.out.println("Re-scan the database");
//        readTx = db.createTransaction();
//        iter = readTx.getRange(begin, end, 0, false, StreamingMode.SMALL).iterator();
//        while (iter.hasNext()) {
//            KeyValue keyValue = iter.next();
//            System.out.println(String.format("Key: %s, value: %s", new String(keyValue.getKey()),
//                    new String(keyValue.getValue())));
//        }
//        readTx.commit().get();
   }

   private static void clear(Database db) throws ExecutionException, InterruptedException {
       byte[] begin = new byte[] { (byte) 0x00 };
       byte[] end = new byte[] { (byte) 0xFF };
       Transaction tx = db.createTransaction();
       tx.clear(begin, end);
       tx.commit().get();
   }
}

Read-only transactions have a no-op commit. This is mentioned in the Developer’s Guide:

In particular, the reads in a transaction take place from an instantaneous snapshot of the database. From the perspective of the transaction this snapshot is not modified by the writes of other, concurrent transactions. When the read-write transaction is ready to be committed (read-only transactions don’t get committed and therefore never conflict), the FoundationDB cluster checks that it does not conflict with any previously committed transaction (i.e. that no value read by a transaction has been modified by another transaction since the read occurred) and, if it does conflict, rejects it. Rejected conflicting transactions are usually retried by the client. Accepted transactions are written to disk on multiple cluster nodes and then reported accepted to the client.

Linearizability is about “operations” following a real-time ordering, and when applied to a transactional database, the “operation” is a transaction. To be linearizable, there must be a single point at which the transaction atomically committed between its start and end. The schedule that you’ve shown does indeed satisfy this requirement: T1 starts before T2, and thus T1 can be ordered before T2 without breaking linearizability.

(If you really need a read-only transaction to have its reads re-validated upon commit, my personal cheap hack is to add a write conflict range over some key your clients will never touch.)

As you described, a read-only transaction becomes a snapshot read (Snapshot isolation - Wikipedia) since it reads a snapshot of the database, and conflict checking does not happen. Is it true?

I don’t explicitly specify my transaction to be read-only. Does the client library detect that the transaction contains no write and then consider it as read-only transaction?

I tried doing artificial writes in the transaction T1 (in the example) as you suggested and when committing T1, it can detect the conflict. To confirm, is it because the conflict checking happens this time and detects that the range T1 reads ([0x00, 0xff]) conflicts with new keys inserted by T2, as the new keys fall into this range?

All reads in FoundationDB are of a snapshot of the database.

The FDB client considers a transaction to be read only if it has no writes and it has no write conflict ranges. Some client bindings have different classes for Transaction vs ReadOnlyTransaction. This distinction doesn’t exist in C API, it’s just for programmer convenience in some bindings to have a transaction type that can’t be used for writes.

And you are correct. Once you add the write to T1, it causes commit() to send a commit RPC to FDB, and FDB correctly identifies that a previously accepted write (from T2) causes a read-write conflict, and thus T1 aborts.

What I really care about is the default transaction that provides strict serializability. The snap-shot read-based transaction is not the real issue that we tried to raise with you.

In the FDB architecture document (link: Developer Guide — FoundationDB 6.3), session “Conflict ranges”, it says that “By default, FoundationDB transactions guarantee strict serializability”. If a transaction is not specified further, in order to provide strict serializability, I believe FDB should check for conflict even if the transaction only has read operations. This generic transaction is different from a snapshot read as documented in the link.

Actually, a statement is specified in the document that:

…if a concurrent transaction happens to insert a new key anywhere in the range, our transaction will conflict with it and fail (resulting in a retry) because seeing the other transaction’s write would change the result of the range read

This is the behavior I expected to happen in my example that involves the default transaction that should provide the strict serializability guarantee, but I did not see it and that is the reason that I posted the question to the forum. The read transaction in my example is very similar to the transaction written in Python (copied from the link above):


@fdb.transactional
def remove_one(tr, range):
    all_kv = tr[range]
    key, value = random.choice(list(all_kv))
    del tr[key]
    return value

except that it does not have the delete operation of “del tr [key]”. I believe that following the FDB’s strict serializability guarantee, having this delete operation or not should not change the statement. That is, this remove_one() will have conflict and abort, due to the other write transaction that touches any key in the “all_kv” range.

In summary, based on the fact that a default transaction should provide strict serializability checking, I would expect my test code with Java bindings, which does not involve a snap-shot read transaction, to raise transaction conflict and thus abort, without having to artificially add dummy writes or write conflit range to trigger conflict checkings. Could you help provide further clarification?

He was not referring to the snapshot read option, but rather the fact that every transaction, including with the default behavior, performs reads on a snapshot of the database. In other words, it does all of its reads at a single consistent version that is obtained when the transaction is started. No writes that happen after that version (such as from concurrent transactions) will change the result of any of your reads. You will only see data that has been committed prior to starting your transaction.

At commit time, we need to make sure that the result is consistent with some strict ordering of the simultaneous transactions. In this case, the results of your operations are exactly the same as if the read-only T1 was ordered completely before the read-write T2 (T2 does not affect any of T1’s reads), so there is no conflict. Because T1 made no writes, we can in essence assign it a commit version equal to its read version without invalidating any reads that come later.

If T1 wrote a key and T2 modified T1’s read set, then this is no longer guaranteed to be true. It can’t be assumed to have a commit version equal to the read version, and it would be ordered after T2, assuming T2 committed first. The modifications to the read set of T1 mean that it cannot be ordered after, so it is a conflict.

Just to emphasize again: what you’re seeing is indeed strict serializability. FDB does not promise that a transaction is serialized as of a version that existed between the start and end of commit(), as you’re expected. FDB promises that a transaction is serialized as of a version that existed between the start of getReadVersion() and the end of commit(). This means that the current behavior you’re seeing is still linearizable.

If you wish for read-only transactions to always fetch a new version to commit at upon commit() being called, then I’d suggest adding a meaningless write conflict range to all of your transactions, and understand that those read-only transactions will now be significantly more expensive to FDB.

Thanks for the further clarification. Based on what you explained, does the following description correctly capture what we have discussed so far:

  • In the current implementation of FoundationDB, for a transaction T that is declared as read-write default transaction, if T includes only read operations but no write operations, then T will be always treated as a read-only transaction that reads the version obtained via GetReadVersion() at the beginning of the transaction. T will not be checked by the Resolver as specified by Algorithm 1 described on Page 5 of the FDB paper https://www.foundationdb.org/files/fdb-paper.pdf. As a result, T will always be committed.

  • Furthermore, if the application really wants to ensure that this read-write default transaction with read-only operations, to be checked by the Resolver, to make sure that more recent committed writes do not happen to any of the read-involved key ranges in T, then we have to introduce “a meaningless write conflict range” to T.

That sounds right to me. To make it a little more precise, for case 1 I would say T would not fail to commit, as it’s common for read-only transactions that you would not actually commit it.

Another way you could frame these two options would be that they are giving you different choices for the serialization order of your read-only transactions. In case 1, your transaction is being sequenced as if it happened at the read version, in case 2 it is being sequenced as if it happened at the commit version. If it’s important that your read occur at the commit version (i.e. after anything that has already committed at your transaction’s commit time), then you would need to use option 2.

2 Likes