I have a long-running service that connects to a FoundationDB cluster, and I was wondering what would be the best strategy to emulate a global on_connection_state_changed(previous_state, new_state) event, in order to support the Circuit Breaker pattern at the application level.
After some experience with FDB, I have added default timeouts/retry limits to protect against a failure of the cluster while the application is running, but there is still an issue when the application starts while the cluster is not available (network issue, outdated fdb.cluster file, less than X% disk space available globally, …), especially for all the init code that needs to run before everything else and that uses the Directory Layer to open all the subspaces required by all the layers.
When using traditional SQL servers, the code usually does not need to go through that initialization step (the schema is handled by the server), and only has to deal with the question “what is the probability that the database is available right now?”.
But with FDB and the Directory Layer, the question becomes “what is the probability that the database is available now, and that the key prefixes for all my subspaces are still valid?”. (They could change if the cluster was down due to a full data restore and just came back online, which has happened to me in the past and caused havoc everywhere!).
With an event that triggers on state changes, I could re-open all the subspaces in use and be sure that I don’t pollute the database with stale key prefixes.
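For context, here is roughly what that init/re-open code looks like in my case (Python binding; the directory paths and the `open_all_subspaces` / `on_connection_state_changed` names are just placeholders for illustration). The point is that the same function could simply be re-run from the hoped-for state-change event:

```python
import fdb

fdb.api_version(630)
db = fdb.open()          # uses the default fdb.cluster file

subspaces = {}           # prefix cache shared by all the layers

def open_all_subspaces(db):
    # The Directory Layer resolves the *current* prefix for each path, so
    # re-running this after a restore would pick up the new prefixes.
    subspaces['users'] = fdb.directory.create_or_open(db, ('app', 'users'))
    subspaces['orders'] = fdb.directory.create_or_open(db, ('app', 'orders'))

def on_connection_state_changed(previous_state, new_state):
    # The event I would like to have: when the cluster comes back online,
    # drop the cached prefixes and resolve them again before resuming writes.
    if new_state == 'available':
        open_all_subspaces(db)
```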
My first thought would be to have a thread that continuously tries to read a key from the database, to detect failures and simulate an internal Open/Closed state for the circuit breaker (a rough sketch follows the list below), but this has a few drawbacks:
- This is polling, which consumes a lot of resources for nothing (especially with a farm of multiple clients all polling the same key)
- When the cluster goes down, I would have to wait for the read timeout to expire before officially declaring the cluster unreachable, which requires yet another magic number for the timeout value (1 sec? 5 sec? …)
- When the cluster goes back online, I will also need to wait, on average, half the polling interval before resuming operations.
- That’s another thread to spin up, monitor, and wait on when stopping the application.
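To make the drawbacks concrete, this is roughly what that monitoring thread would look like with the Python binding (the probe key, the 1 sec timeout and the 5 sec interval are exactly the kind of magic numbers I would rather not have to pick):

```python
import threading

import fdb

fdb.api_version(630)
db = fdb.open()

cluster_available = threading.Event()   # set = circuit closed, cleared = circuit open

PROBE_KEY = b'probe'        # placeholder: any cheap-to-read key would do
PROBE_TIMEOUT_MS = 1000     # magic number #1: how long before declaring failure
POLL_INTERVAL_SEC = 5.0     # magic number #2: how often to probe

def probe_once():
    # Single attempt, no retries: either the read completes quickly,
    # or we treat the cluster as unreachable.
    tr = db.create_transaction()
    tr.options.set_timeout(PROBE_TIMEOUT_MS)
    tr.options.set_retry_limit(0)
    tr.get(PROBE_KEY).wait()

def monitor_loop(stop_event):
    while not stop_event.is_set():
        try:
            probe_once()
            cluster_available.set()
        except fdb.FDBError:
            cluster_available.clear()
        stop_event.wait(POLL_INTERVAL_SEC)

stop = threading.Event()
threading.Thread(target=monitor_loop, args=(stop,), daemon=True).start()
```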
I was thinking that the client library already does this sort of monitoring of the status of the cluster, because it needs to maintain connections to the proxies, storage nodes, etc., and probably already has an “available/unavailable” state internally. Would there be a way to tap into this and get notified somehow when the client changes from one state to another?
It could be something that looks like a Watch on a client-only virtual key in the \xFF system keyspace: the key could store an enum value, and the client could watch for changes to that value.
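From the application side, consuming such a key would look pretty much like any other watch. A minimal sketch with the Python binding, assuming a purely hypothetical key name (nothing like it exists today, this is only what the proposal might look like):

```python
import fdb

fdb.api_version(630)
db = fdb.open()

# Hypothetical client-only virtual key that the client library would maintain.
STATE_KEY = b'\xff/client/connection_state'

@fdb.transactional
def get_state_and_watch(tr):
    tr.options.set_read_system_keys()    # \xff keys normally require this option
    state = tr.get(STATE_KEY).wait()     # current enum value (or None)
    return state, tr.watch(STATE_KEY)

def state_change_loop(db, on_connection_state_changed):
    previous = None
    while True:
        current, watch = get_state_and_watch(db)
        if current != previous:
            on_connection_state_changed(previous, current)
            previous = current
        watch.wait()    # wakes up when the client flips the value
```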
The desired properties of this are:
- reduce CPU and network overhead by not having to repeatedly poll a key in the general keyspace
- reduce the delay in detecting failures and recovery
- provide a global Circuit Breaker that the application can use to avoid paying the timeout cost when possible.
- provide a generic API that the application can plug into for “data refresh” events (a rough sketch of what I have in mind follows).
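Regardless of how the notification arrives (a polling thread today, a client-side watch ideally), the application-facing part could be as simple as this sketch (all names are placeholders, not an existing API):

```python
import threading

class FdbCircuitBreaker:
    """Application-level circuit breaker; something (a polling thread, or a
    future client-side notification) calls transition() when the state flips."""

    def __init__(self):
        self._lock = threading.Lock()
        self._available = True
        self._listeners = []    # callbacks: (previous_state, new_state) -> None

    def subscribe(self, callback):
        # Layers register their "data refresh" handlers here, e.g.
        #   breaker.subscribe(lambda prev, new: open_all_subspaces(db) if new else None)
        self._listeners.append(callback)

    def transition(self, available):
        with self._lock:
            previous, self._available = self._available, available
        if previous != available:
            for callback in self._listeners:
                callback(previous, available)

    def check(self):
        # Fail fast instead of paying the full transaction timeout when we
        # already know the cluster is unreachable.
        if not self._available:
            raise RuntimeError("FoundationDB cluster is currently unreachable")
```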