Locking with FoundationDB

I am building a layer which provides the ability for clients to take locks on logical keys. A correct distributed lock needs to fulfill both safety and liveness (good introduction: https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html).

Skip to the end for my specific questions if you’re not interested in reading the detailed locking procedure.

Safety is easy with FoundationDB: use the committed version of the database for fencing and rely on the transactional nature of the database for the rest. I understand this version increases at roughly 1 million per second, which I have also verified locally on my machine.
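
As a minimal sketch of what I mean (in Python with the fdb bindings, just for illustration; the key, value, and API version are placeholders), the layer can read back the committed version after a commit and hand it to the client as a fencing token:

```python
import fdb

fdb.api_version(710)
db = fdb.open()


def write_with_fencing_token(db, key, value):
    # Use an explicit transaction so the committed version can be read
    # back after the commit completes.
    tr = db.create_transaction()
    while True:
        try:
            tr[key] = value
            tr.commit().wait()
            # Committed versions are monotonically increasing across the
            # cluster, so this integer can serve as a fencing token.
            return tr.get_committed_version()
        except fdb.FDBError as e:
            tr.on_error(e).wait()
```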

Liveness is where I see a larger challenge.

The layer will be implemented as a process co-located on the FoundationDB nodes (or on machines in the same datacenter) and will speak the Redis protocol, because essentially every programming language has at least one Redis client library. Clients will not talk to FoundationDB directly, so locking needs to be implemented at the layer level.

Here is my idea for liveness:

  1. Clients connect to the layer process on any machine it is deployed to and request a “lease token” which uniquely identifies their session within the system. All locks will live under this lease. The lease specifies a timeout in milliseconds after which it should be considered expired if it has not been refreshed. The lease key is written to the database with the read version at which it was created as the value.
  2. The client periodically sends a refresh command to the layer. The layer records the refresh time using the system’s monotonic clock API, and additionally updates the lease key in the database with the read version of the transaction doing the update.
  3. The client attempts to acquire a lock with a lock command specifying the lease ID as well as the logical key it would like to lock. If the acquisition is successful, the lock’s key is set to the lease ID now holding the lock, and the request returns the committed version of the database at which the lock was acquired. If the acquisition fails, the layer returns -1 to indicate failure. Other information could be included with the failure, such as which lease currently holds the lock and when it would expire without any further renewals.
  4. The client makes requests to other parts of the system using that lock and can use the committed version of the database as a fencing token for writing to external systems.
  5. Any layer process attempting to acquire a lock examines the version stored in the lease that currently holds the lock, if it is held, and expires that lease within the same transaction if it has timed out. (A rough sketch of steps 1-3 and 5 follows this list.)
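
To make the procedure concrete, here is a rough sketch of steps 1-3 and 5 in Python with the fdb bindings. The key layout, the value packing, and the version-based expiry arithmetic (1,000 versions per millisecond) are placeholders, and that expiry check is exactly the part my question at the end is about:

```python
import os
import fdb

fdb.api_version(710)
db = fdb.open()

# Placeholder key layout; a real layer would likely use the directory/tuple layers.
def lease_key(lease_id):
    return fdb.tuple.pack(('lease', lease_id))

def lock_key(name):
    return fdb.tuple.pack(('lock', name))

VERSIONS_PER_MS = 1000  # the rough "1 million versions per second" assumption


@fdb.transactional
def create_lease(tr, timeout_ms):
    # Step 1: mint a lease ID and store the read version it was created at.
    lease_id = os.urandom(16)
    rv = tr.get_read_version().wait()
    tr[lease_key(lease_id)] = fdb.tuple.pack((rv, timeout_ms))
    return lease_id


@fdb.transactional
def refresh_lease(tr, lease_id):
    # Step 2: advance the version stored under the lease key.
    packed = tr[lease_key(lease_id)]
    if not packed.present():
        raise RuntimeError("lease has expired or never existed")
    _, timeout_ms = fdb.tuple.unpack(packed)
    tr[lease_key(lease_id)] = fdb.tuple.pack((tr.get_read_version().wait(), timeout_ms))


def lease_is_expired(tr, lease_id, current_version):
    # Step 5 helper: a lease is expired if its last refresh is older than its
    # timeout, measured in versions at the assumed rate.
    packed = tr[lease_key(lease_id)]
    if not packed.present():
        return True
    last_version, timeout_ms = fdb.tuple.unpack(packed)
    return current_version - last_version > timeout_ms * VERSIONS_PER_MS


def acquire_lock(db, lease_id, name):
    # Steps 3 and 5: take the lock, expiring a dead holder's lease within the
    # same transaction. Returns the committed version (the fencing token for
    # step 4) on success, or -1 if the lock is held by a live lease.
    tr = db.create_transaction()
    while True:
        try:
            rv = tr.get_read_version().wait()
            holder = tr[lock_key(name)]
            if holder.present():
                holder_id = fdb.tuple.unpack(holder)[0]
                if holder_id != lease_id:
                    if not lease_is_expired(tr, holder_id, rv):
                        return -1
                    del tr[lease_key(holder_id)]
            tr[lock_key(name)] = fdb.tuple.pack((lease_id,))
            tr.commit().wait()
            return tr.get_committed_version()
        except fdb.FDBError as e:
            tr.on_error(e).wait()
```

A client would call create_lease(db, 5000), periodically call refresh_lease(db, lease_id), and use the value returned by acquire_lock(db, lease_id, key) as its fencing token.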

That’s the happy path. Here are a few (non-exhaustive) failure cases:

  1. The layer process crashes, resulting in the connections to all client processes being closed without an opportunity to cleanly release all the locks.
  2. The client process crashes, resulting in the client’s connection being closed without cleanly releasing any locks held by the client’s lease.
  3. The layer process loses the connection to FoundationDB, but still has connectivity to the client.
  4. The layer process and client process crash simultaneously.

Here is how I would handle each one, respectively:

  1. The client retains the lease ID and can send it to a different layer process after it reconnects. The layer process can examine the lease ID and the corresponding version in the database to determine whether the lease should be considered expired (see the sketch after this list). My direct question about this is at the end of the post.
  2. The client durably stores the lease ID if desired (and proceeds as in fix 1), or it can ignore any prior lease and create a new one. In that case the client essentially starts over, and the new lease receives no priority over any other leases attempting to acquire or hold the lock.
  3. Both the data the lock is protecting and the locks themselves live in FoundationDB, so losing the connection to FoundationDB prevents the client from making progress or refreshing its lease. The client should therefore close its connection and attempt to reconnect to a different layer process (proceeding as in fixes 1 and 2).
  4. The client attempts to reconnect to a new layer process when it restarts (proceeding as in fixes 1 and 2).
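
For fix 1 (and the reconnect paths in fixes 2 and 4), the check I have in mind looks roughly like the following, reusing lease_key() and VERSIONS_PER_MS from the sketch above; again, the version-based expiry arithmetic is the part I am asking about below:

```python
@fdb.transactional
def adopt_lease(tr, lease_id):
    # A client reconnects to a (possibly different) layer process with a
    # retained lease ID. If the lease is still live, refresh it and carry on;
    # otherwise the client must start over with a new lease and re-acquire
    # its locks.
    rv = tr.get_read_version().wait()
    packed = tr[lease_key(lease_id)]
    if not packed.present():
        return False  # already expired and cleaned up by a competing acquirer
    last_version, timeout_ms = fdb.tuple.unpack(packed)
    if rv - last_version > timeout_ms * VERSIONS_PER_MS:
        del tr[lease_key(lease_id)]  # expired; clean it up ourselves
        return False
    # Still live: treat adoption as a refresh.
    tr[lease_key(lease_id)] = fdb.tuple.pack((rv, timeout_ms))
    return True
```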

Do you see any holes in this procedure?

Specific question portion:

  1. I have read elsewhere on the forum that the version of the database can advance much faster or slower than 1 million per second for multiple reasons. Because of that, I would like to use the layer’s local monotonic clock during non-failure scenarios to override the database version if the layer detects a large deviation from what would be expected at 1 million versions per second. I would like this to be purely a performance optimization and not a correctness violation, obviously. If each layer process can detect this without any coordination other than reading the database version (which is monotonic and a rough timer itself), is it still safe?

  2. If you read the entire locking procedure, do you spot any logical issues with safety or liveness, or any obvious performance problems? There would potentially be tens of thousands of leases covering hundreds of thousands of locks, with lease timeouts in the neighborhood of 5-10 seconds depending on the tolerance for downtime during a failure. Even at the high end of what I’ve imagined, I don’t think the system should need more than a couple thousand transactions per second to manage the locks.

I would also appreciate any experience from existing FoundationDB users on implementing locks at the layer level if that has been done.

Combining leases with writes to external systems always makes me feel uncomfortable, though it’s often an unavoidable problem. You’ll want to make sure that your external system writes are idempotent or somehow otherwise transactional, as processes can lose their lease in between checking and doing the write to an external system.
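
To illustrate the fencing idea from that blog post: the external system itself has to enforce the token, roughly like this toy in-memory stand-in (the committed version returned at lock acquisition would be the token; a real system would need to do the check atomically, e.g. as a conditional update):

```python
class FencedStore:
    # Toy stand-in for an external system that enforces fencing tokens: it
    # rejects any write carrying a token lower than or equal to the highest
    # token it has already accepted for that key.

    def __init__(self):
        self._data = {}        # key -> value
        self._max_token = {}   # key -> highest fencing token accepted

    def write(self, key, value, fencing_token):
        if fencing_token <= self._max_token.get(key, -1):
            raise RuntimeError("stale fencing token; the lock was lost")
        self._max_token[key] = fencing_token
        self._data[key] = value


# e.g. token = acquire_lock(db, lease_id, b"resource")
#      store.write("resource", payload, token)
```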

Versions being roughly correlated with time is an implementation detail, and not part of the public contract of FDB. We only promise that commit timestamps are monotonically increasing. Even with our current implementation, a recovery would likely cause all locks to be dropped as it fast-forwards through a large number of versions. If you’re looking to use time as the notion of how to tell when a lock should be released, then I would recommend you actually use time.

In your fix to (3), you’re going to want to be a bit careful: if the database is fully unavailable, your proposed fix would have all clients randomly disconnecting and reconnecting to different layer processes until it comes back, which could have strange effects.

Yes, this definitely assumes the operations of clients using the locks are either idempotent themselves or can be fenced with the database version.

It would’ve been quite elegant if I could’ve magically used the database version as a clock too, but potentially dropping all locks at once during transitions sounds like a bad idea, given that everything would try to reconnect right away.

Back to the drawing board. Thanks!

I think it is important to point out that our backup and DR implementations use versions for timeouts, and we just accept that all running tasks will time out on every master recovery.

Recoveries can be considered a pretty rare event, so as long as you know what you are getting, it is perfectly reasonable to design around versions advancing at 1 million versions per second.

Well, we appear to have strongly conflicting opinions about this, which we should probably discuss.

Is there a reason not to pursue this strategy other than the rapidly advancing versions during cluster healing?

I think I can work around that with a hybrid system: each layer process uses its monotonic clock to timestamp the database version in its local memory roughly once a second. Each layer process will then have a roughly correct mapping between wall-clock time and database version, enough to handle lease timeouts consistently across individual layer processes. The database version will always keep the system correct for fencing, and the monotonic clocks will keep competing layer processes from prematurely expiring leases during cluster healing.

The leases don’t need to be accurate at a millisecond level. Just accurate enough not to cause too much thrashing of lock holding under those leases.
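
Concretely, the sampling I have in mind is something like this (the class, the once-a-second interval, and the sample window are all made up for illustration; db is an fdb.Database as in the earlier sketches):

```python
import time


class VersionClock:
    # Each layer process periodically records (read version, monotonic time)
    # pairs. Lease expiry is then judged by the monotonic time elapsed since
    # the oldest sample at or after the lease's last-refresh version, which is
    # a conservative lower bound on how long ago the refresh happened, so a
    # recovery that fast-forwards versions cannot expire every lease at once.

    def __init__(self, db):
        self.db = db
        self.samples = []  # list of (version, monotonic seconds), oldest first

    def sample(self):
        # Call roughly once a second from a background thread.
        tr = self.db.create_transaction()
        version = tr.get_read_version().wait()
        self.samples.append((version, time.monotonic()))
        self.samples = self.samples[-600:]  # keep a bounded window

    def elapsed_ms_since(self, version):
        # Lower bound on wall-clock milliseconds since `version` was current.
        for v, t in self.samples:
            if v >= version:
                return (time.monotonic() - t) * 1000.0
        return 0.0  # no sample newer than that version yet; treat as fresh


# A lease refreshed at last_version with timeout_ms would then be expired when
#   clock.elapsed_ms_since(last_version) > timeout_ms
# while the database version is still used, unchanged, for fencing.
```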

With the current system, the answer is no. Recoveries advancing versions by O(seconds) is probably the only thing that’d cause you trouble. There are a couple of internal tools we have that rely on 1 million versions per second to achieve about the same result as what you’re looking to do.

I’m personally uncomfortable with promising that versions are correlated with time, as future log system redesigns could potentially require that restriction to not exist. But that’s also an argument that I should go have with Evan at some point.

There’s also the question of whether, if we did decide to make guarantees about versions and time, we would also guarantee that the rate is going to remain 1 million per second. There’s precedent for it being changed.