Using FDB clusters as 2PC participants

Let’s say I have 3 FDB clusters, clusters A, B, and C. User data is divided between B and C depending on their physical location and data protection requirements. Cluster A holds only the metadata for determining which users are in which cluster, and isn’t particularly important to the rest of this example, other than to point out there is an external mapping determining which data is where.

User data is entirely self-contained within one cluster or the other, but occasionally users interact across clusters. You can imagine a payment application where users can send payments to each other. Most users will interact with users who are near them and therefore in the same cluster, removing the need for any transaction outside of FDB. Sometimes a user will make a transfer to a user in the other cluster, and a sane choice for doing this atomically is two-phase commit (2PC).

I’m not asking how to implement 2PC in general, but rather how to best handle a few parts of the protocol in FDB effectively.

2PC notably only ensures atomicity, so isolation must be ensured jointly by the layer and FDB. This comes in (at least) two parts: hiding uncommitted data from concurrent transactions in either “database”, and ensuring that a participant which has responded successfully to a prepare can always commit locally if asked to.

If you define each “database” as a single user’s data in one cluster, you could add a lock key per user to say “there is a 2PC in progress for this user” and have every transaction for that user not participating in a 2PC add a read conflict range on that key. The 2PC transaction writes to that key during the prepare phase, before acknowledging. This ensures that concurrent transactions for that user’s data block or fail until the 2PC is over, while concurrent transactions for other users are unaffected. You can assume all data for this user is contained within this database, including secondary indexes etc. (so no need to worry about dirty/stale reads or blocking other txns there). Any concurrent, read-only transaction that ignores the lock does so at its own risk, and writing anything at all would violate the rules unless a more fine-grained locking scheme were used.
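To make the lock protocol concrete, here is a minimal in-memory sketch. The `store` dict, the key layout, and the `LockHeld` exception are illustrative assumptions, not real FDB API; in a real layer the read of the lock key would happen inside an FDB transaction (which also places it in the read conflict set), and the write would be part of the prepare transaction.

```python
store = {}  # stands in for one FDB cluster's keyspace


class LockHeld(Exception):
    """Raised when a user's sub-database is locked by an in-flight 2PC."""


def lock_key(user_id):
    # Hypothetical key layout for the per-user lock.
    return f"user/{user_id}/2pc_lock"


def run_local_txn(user_id, mutations):
    """An ordinary (non-2PC) transaction for one user.

    Actually reading the lock key (not just declaring a conflict range)
    is what lets us refuse to proceed while a 2PC is in flight.
    """
    if store.get(lock_key(user_id)) is not None:
        raise LockHeld(user_id)
    store.update(mutations)


def prepare_2pc(user_id, txn_id):
    """Prepare phase: take the lock before acknowledging prepare."""
    if store.get(lock_key(user_id)) is not None:
        raise LockHeld(user_id)  # another 2PC already owns this user
    store[lock_key(user_id)] = txn_id
```

Note that the lock only blocks transactions for the locked user; a transaction touching a different user never reads that lock key, so it proceeds unimpeded.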

If all transactions obey this lock, it should be possible to implement the guarantees required of 2PC participants, correct? From a performance perspective, the per-user lock is not ideal, but it could be improved. The 2PC should ideally not be driven by an interactive transaction, to minimize the time the lock is held.

The other bits are not as interesting or complicated. In the prepare phase, you log all the operations you’d like to do (set this key, clear this range, etc.) along with the ID of the active transaction. When the transaction that logs these operations commits, you acknowledge the prepare. When the transaction manager asks you to commit, you replay the log of operations and apply the mutations to the local FDB cluster, and because you hold the lock, you can commit knowing no other transaction could possibly conflict with you.
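The log-then-replay flow might be sketched like this, again with a plain dict standing in for an FDB cluster. The key layout (`2pc/<txn_id>/log/<seq>`) and the `(kind, key, value)` tuple encoding of operations are assumptions for illustration; in a real layer the prepare write and the lock write would go in one FDB transaction, as would the replay and the unlock.

```python
store = {}  # stands in for one FDB cluster's keyspace


def prepare(txn_id, ops):
    """Prepare: durably record the intended operations under the txn ID.

    ops is a list of (kind, key, value) tuples, e.g. ("set", k, v) or
    ("clear", k, None). Zero-padded sequence numbers preserve order.
    """
    for i, op in enumerate(ops):
        store[f"2pc/{txn_id}/log/{i:04d}"] = op


def commit(txn_id):
    """Commit: replay the logged operations, then delete the log.

    Because the per-user lock is held, no other transaction can
    conflict with this replay.
    """
    prefix = f"2pc/{txn_id}/log/"
    for log_key in sorted(k for k in store if k.startswith(prefix)):
        kind, key, value = store[log_key]
        if kind == "set":
            store[key] = value
        elif kind == "clear":
            store.pop(key, None)
        del store[log_key]
```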

The failure scenarios are also not made any more difficult by FDB (only easier to handle) because the roles of the transaction manager and “databases” in the protocol are not single physical machines, but rather logical entities driven by any stateless layer process.

Questions:

  1. Is my description of locking and logging safe (as far as you can tell from my probably under-specified words) and did I miss any bits that could be optimized to completely skip work (like not locking at all, which I don’t think is possible)?
  2. Is my assumption true that FDB only makes the failure scenarios easier to handle? I understand it will be slower than if each participant were a physical machine that could operate purely by itself, where a transaction could commit in, say, <1ms vs. the 3-4ms minimum in FDB, but I don’t think this makes it any more difficult to implement 2PC correctly.

Sorry for the wall of text! Hopefully this discussion can spark some interest among people who are interested in using FDB as the building block for a global distributed database.


It sounds broadly safe to me. One thing that comes to mind: in your description of the locking procedure, you stated that every transaction needs to add a read conflict range to the lock key. I think that’s actually insufficient: you need each transaction to actually read the key (otherwise, you don’t know whether the user’s database is locked or not).

Just to be clear, I believe your proposal to be something like:

  1. Commit “prepare” transactions on clusters B and C, locking one user on cluster B and one user on cluster C. These lock the databases and write logs of what to perform.
  2. Commit “commit” transactions on clusters B and C replaying the logs and then unlocking the databases.
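The two steps above can be sketched as a coordinator driving two participants. The `Participant` class and `transfer` function are illustrative assumptions (a simulation, not an FDB layer); in reality each `prepare`/`commit` call would be one FDB transaction on cluster B or C.

```python
class Participant:
    """One user's sub-database on one cluster (simulated in memory)."""

    def __init__(self, balance):
        self.balance = balance
        self.locked = False
        self.log = []

    def prepare(self, delta):
        if self.locked:
            raise RuntimeError("user already locked by another 2PC")
        self.locked = True  # lock the database...
        self.log = [delta]  # ...and record what to perform

    def commit(self):
        for delta in self.log:  # replay the log
            self.balance += delta
        self.log = []
        self.locked = False     # then unlock


def transfer(sender, receiver, amount):
    # 1. Prepare on both participants; a failure here aborts the 2PC
    #    before anything has been committed anywhere.
    sender.prepare(-amount)
    receiver.prepare(+amount)
    # 2. Both prepares succeeded, so both commits must (eventually) happen.
    sender.commit()
    receiver.commit()
```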

If you wanted a “proof” that this is safe, you first need some definition of safety; I think the goal would be to retain ACID semantics. To show this is atomic, you could go through the failure scenarios, and I believe they all result in either both operations being committed or neither. (There is some subtlety in what to do if the worker performing the 2PC dies in the middle. You need something to eventually unlock your databases, and it needs to know whether the other operations in the 2PC protocol succeeded. In particular, it needs to be able to distinguish between the prepare transaction on the other cluster failing and the commit transaction on that cluster having already occurred.) Isolation should be guaranteed by the fact that no reads or writes are allowed to the databases while the 2PC is going on, and I think one can convince oneself that this creates a strictly serializable transaction ordering if one analogizes to traditional two-phase locking and then thinks about commit versions long enough.
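One standard way to resolve that recovery ambiguity is to durably record the coordinator’s commit/abort decision (e.g. in a keyspace on cluster A) before telling any participant to commit, so a recovery worker never has to guess whether a missing prepare means “never prepared” or “already committed and cleaned up”. A sketch, where the `decisions` dict and key names are illustrative assumptions:

```python
decisions = {}  # stands in for a decision keyspace on metadata cluster A


def record_commit_decision(txn_id):
    # Written atomically by the coordinator BEFORE any commit is sent.
    decisions[f"2pc/{txn_id}/decision"] = "commit"


def recover(txn_id):
    """What a recovery worker does for an in-doubt transaction."""
    if decisions.get(f"2pc/{txn_id}/decision") == "commit":
        # Some participant may already have committed: roll forward by
        # replaying logs and unlocking on every participant.
        return "roll forward"
    # No recorded commit decision means no participant can have been told
    # to commit, so it is safe to record "abort" and roll back.
    decisions[f"2pc/{txn_id}/decision"] = "abort"
    return "roll back"
```

Recording the decision first makes `recover` idempotent: once an outcome exists, every recovery worker that runs will drive the same outcome.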

Yeah, that sounds about right to me. FDB provides a mutable byte-array to byte-array map, which is essentially isomorphic to most common single-machine storage models. And it gives you the ability to make atomic updates across keys (unlike a disk, say), so you can do things like unlock and apply operations from the log transactionally, but that’s just an optimization on top of 2PC.


> It sounds broadly safe to me. One thing that comes to mind: in your description of the locking procedure, you stated that every transaction needs to add a read conflict range to the lock key. I think that’s actually insufficient: you need each transaction to actually read the key (otherwise, you don’t know whether the user’s database is locked or not).

Yes, my description there is wrong. I think I was going somewhere with that that wasn’t wrong, but I wrote it right before bed, so who knows? :sweat_smile: