Feature Request: Client API for (dis)affirming whether a key is being watched

If I’m reading this right, it looks like one could determine that a key is under watch by checking for a non-zero count on that key in the underlying AsyncMap of the StorageServer.

If that’s true, it would be cool if that information could be exposed via some client API (and even cooler if the number of watches on that key could be reported, but it doesn’t look like the architecture has any representation of that at this point…).

My use case: I want a single primary agent tailing/processing a queue, and I need a mechanism for one or more secondaries to monitor the “liveness” of the primary (which may have crashed or terminated) and “apply” for promotion if the position is vacant.

(Thinking about it now, I guess some sort of protocol where the primary has to publish a heartbeat to a key being “observed” by secondaries may be sufficient/better?)

Thoughts?

Hm, I’m not sure which is the better design starting from a tabula rasa, but I believe the heartbeat idea should work, and it doesn’t require an FDB API change.

In the Record Layer, there’s a SynchronizedSession class that is supposed to handle some of those details: https://github.com/FoundationDB/fdb-record-layer/blob/master/fdb-extensions/src/main/java/com/apple/foundationdb/synchronizedsession/SynchronizedSession.java

The way it works is that there are two keys: (1) one containing a UUID that identifies the current holder of the lock and (2) another containing the lock’s expiration timestamp. To do work, the lock holder reads the UUID key, verifies it still contains its own UUID, and refreshes the heartbeat by writing its current timestamp plus some lease period. Observers read both the UUID key and the timestamp key; if their own time is still earlier than the stored timestamp, they assume the lock holder is alive. If their own time is later than the stored timestamp, they try to claim the lock by writing their own UUID into the UUID key and updating the timestamp.
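To make the shape of that protocol concrete, here is a minimal sketch against the FDB Python bindings (not the Record Layer code itself); the key names, lease length, and helper functions are all invented for illustration:

```python
import struct
import time
import uuid

import fdb

fdb.api_version(630)
db = fdb.open()

# Hypothetical keys and lease length, purely for illustration; the Record
# Layer manages its own subspace layout and configuration.
LOCK_UUID_KEY = b'lock/uuid'
LOCK_EXPIRY_KEY = b'lock/expiry'
LEASE_SECONDS = 10.0


@fdb.transactional
def heartbeat(tr, my_uuid):
    """Lock holder: confirm we still hold the lock, then extend the lease."""
    holder = tr[LOCK_UUID_KEY].wait()  # bytes, or None if the key is unset
    if holder != my_uuid:
        return False  # someone else owns the lock; stop doing privileged work
    tr[LOCK_EXPIRY_KEY] = struct.pack('<d', time.time() + LEASE_SECONDS)
    return True


@fdb.transactional
def try_claim(tr, my_uuid):
    """Observer: claim the lock only if the stored lease has expired."""
    expiry = tr[LOCK_EXPIRY_KEY].wait()
    if expiry is not None and time.time() < struct.unpack('<d', expiry)[0]:
        return False  # the current holder still appears to be alive
    tr[LOCK_UUID_KEY] = my_uuid
    tr[LOCK_EXPIRY_KEY] = struct.pack('<d', time.time() + LEASE_SECONDS)
    return True


if __name__ == '__main__':
    my_uuid = uuid.uuid4().bytes
    if try_claim(db, my_uuid):
        print('acquired the lock; keep calling heartbeat(db, my_uuid)')
```

Because the claim and the heartbeat each read and write the lock keys inside a single serializable transaction, FDB’s conflict detection ensures that at most one of two racing claimants can succeed.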

Note that in this design, it is the value of the UUID key that truly determines the lock holder/leader, not the timestamp value. This means that even in the presence of clock drift, a single leader is still chosen and there is consensus on who holds the lock. (I don’t think we do this in our implementation, but you could even relax this protocol by reading the UUID key at SNAPSHOT isolation level if you are okay with there temporarily being two leaders.) Using the timestamp is just a performance optimization.

That being said, there are problems that can arise with clock drift. For example, if one process’s clock drifts way ahead and then writes a timestamp way in the future, no one will be able to claim the lock until that time. This can be ameliorated by adding “escape hatches” that allow an operator to manually remove a lock, but it’s not awesome.

I believe our backup tooling does a similar thing, but it uses database versions instead of wall clock time. Maybe @SteavedHams can correct me, but I believe that that’s what this line is doing: https://github.com/apple/foundationdb/blob/a4f12a19a3a24bd68676bef8e629d6e855c4702c/fdbclient/TaskBucket.actor.cpp#L223

That ameliorates some of the problems with clocks, but it depends on knowing how versions correlate with time, and that relationship may change in a future release. (In other words, the current behavior should not be considered part of the API.) Or at least that’s the advice we’ve given in the past.
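Purely as an illustration, a lease expressed in version space rather than wall-clock time could look something like the sketch below; the key name and the versions-per-second constant are assumptions, and, as noted above, that constant is not a guaranteed part of the API:

```python
import struct

import fdb

fdb.api_version(630)
db = fdb.open()

# Hypothetical key; the constant reflects the current convention of roughly
# one million versions per second, which is not a contractual part of the API.
TIMEOUT_VERSION_KEY = b'task/timeout_version'
LEASE_SECONDS = 10
VERSIONS_PER_SECOND = 1_000_000


@fdb.transactional
def renew_lease(tr):
    """Holder: record "now + lease", measured in database versions."""
    expiry = tr.get_read_version().wait() + LEASE_SECONDS * VERSIONS_PER_SECOND
    tr[TIMEOUT_VERSION_KEY] = struct.pack('<q', expiry)


@fdb.transactional
def lease_expired(tr):
    """Observer: the lease has expired once the read version passes the stored one."""
    stored = tr[TIMEOUT_VERSION_KEY].wait()
    if stored is None:
        return True
    return tr.get_read_version().wait() >= struct.unpack('<q', stored)[0]
```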

You are reading that right, except that if a storage server has too many watches, it will “cancel” some of them, which causes those clients to revert to polling. Thus, this would only work up to a certain (large) number of watches.

I agree that the better way to do this is by doing leader election on top of/via FDB, as Alec described.

Thanks, I think I’ll go with the “lease” strategy after all. The only thing I’d do differently from what was suggested is to leave the expiry-timing measurement to the secondaries: have the primary update the non-UUID key on a regular basis (atomic increments or maybe versionstamps), and require that a secondary has observed a static value there for the full lease duration (according to its own local wall-time when sampling the “heartbeat”) before rushing for a promotion.
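A rough sketch of that variant (key names, timings, and helpers are placeholders): the primary bumps a counter, and a secondary applies for promotion only after seeing the counter unchanged for a full lease period on its own clock:

```python
import struct
import time

import fdb

fdb.api_version(630)
db = fdb.open()

# Placeholder keys and timings for illustration only.
HEARTBEAT_KEY = b'queue/heartbeat'
PRIMARY_KEY = b'queue/primary'
LEASE_SECONDS = 15.0


@fdb.transactional
def bump_heartbeat(tr):
    """Primary: atomically increment the heartbeat counter on a regular cadence."""
    tr.add(HEARTBEAT_KEY, struct.pack('<q', 1))


@fdb.transactional
def read_heartbeat(tr):
    return tr[HEARTBEAT_KEY].wait()  # raw counter bytes, or None


@fdb.transactional
def try_promote(tr, my_id, observed_heartbeat):
    """Secondary: re-check the heartbeat inside the transaction so we don't
    race a primary that woke up just after our last sample."""
    if tr[HEARTBEAT_KEY].wait() != observed_heartbeat:
        return False
    tr[PRIMARY_KEY] = my_id
    tr.add(HEARTBEAT_KEY, struct.pack('<q', 1))  # restart everyone else's timers
    return True


def secondary_loop(db, my_id):
    """Secondary: apply for promotion only after the heartbeat has been static
    for a full lease period, measured on this process's own wall clock."""
    last_value, last_change = read_heartbeat(db), time.time()
    while True:
        time.sleep(1)
        current = read_heartbeat(db)
        if current != last_value:
            last_value, last_change = current, time.time()
        elif time.time() - last_change >= LEASE_SECONDS:
            if try_promote(db, my_id, last_value):
                return  # this process is now the primary
            last_value, last_change = read_heartbeat(db), time.time()
```

Re-checking the heartbeat inside the promotion transaction means a primary that was merely slow, rather than dead, causes the promotion attempt to conflict or fail the equality check instead of being silently usurped.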

Correct. The TaskBucket execution framework used by Backup and DR uses version timestamps for both task timeouts and for scheduling future execution of tasks. Also, even though there is currently a fixed “versions per second” rate that the cluster attempts to adhere to, there are events, even without a behavior change, that can cause the commit version to advance suddenly and dramatically, such as a master recovery or a DR primary/secondary cluster switchover. TaskBucket is not particularly concerned with the passage of real time, even for scheduled tasks, so those can be approximate at best and occasionally wildly inaccurate without serious consequences.