Detecting cluster availability transitions for long running clients

(Christophe Chevalier) #1

I have a long running service that connects to a FoundationDB cluster, and I was wondering what would be the best strategy to emulate a global on_connection_state_changed(previous_state, new_state) event, in order to support the Circuit Breaker Pattern at the application level.

After some experience with FDB, I have added default timeouts/retry limits to protect against failure of the cluster while the application is running, but there is still an issue when the application starts while the cluster is not available (network issue, fdb.cluster outdated, less than X% disk space available globally, …), especially for all the init code that needs run before everything else, and will use the Directory Layer to open all the subspaces required by all the layers.

When using traditional SQL servers, the code usually does not need to go through that initialization step (schema is handled by the server), and only has to deal with the question “what is the probability that the the database is available now?”.

But with FDB and the Directory Layer, the question becomes “what is the probability that the database is available now, and that the key prefixes for all my subspaces are still valid?”. (They could change if the cluster was down due to a full data restore and just came back online, which has happened to me in the past and caused havoc everywhere!).

Having an event that triggers on state change, I could re-open all the subspaces in use, and be assured that I don’t pollute the database with old key prefixes.

My first though would be to have a thread that continuously tries to read a key from the database, to detect failure and simulate an internal Open/Close state for the circuit breaker, but this has a few drawbacks:

  • This is polling which consumes a lot of resources for nothing (especially on a farm of multiple clients all polling the same key)
  • When the cluster falls, I would have to wait for the read timeout to expire to officially declare the cluster unreachable, which will require another Magic Number for the timeout value (1 sec? 5 sec?..)
  • When the cluster goes back online, I will also need to wait on average half the polling interval to resume operations.
  • That’s another thread to spin up, monitor, and wait for it to abort when stopping the application.

I was thinking that the client library already does this sort of monitoring of the status of the cluster, because it needs to maintain connection to the proxies, storage nodes etc… and probably already have a “available/unavailable” state internally. Would there be a way to tap into this and get notified somehow when the client change from one state to another?

It could be something that looks like a Watch on a client-only virtual key in the \xFF system keyspace? The key could store an enum value, and the client could watch the value change.

The desired properties of this are:

  • reduce the cpu and network overhead by not having to repeatedly poll a key in the general keyspace
  • reduce the delay to detect failure/resolution
  • provide a global Circuit Breaker that can be used by the application to not pay the timeout cost when possible.
  • provide a generic API that the application can plug into for “data refresh” events.

(David Scherer) #2

As to ACI (but not D) restores specifically breaking your invariants, maybe there is, or should be, a restore lock version key that you could check? Or something specific to the directory layer? I don’t actually think it is impossible to do a restore without clients losing connection to the database. The DB being down really is supposed to be indistinguishable from any other source of latency.

(This relates also to what I think might be the simplest feature for making metadata read scaling easier in layers, which is a way to add read conflict ranges at a version other than the snapshot version of a transaction. This feature could be used to validate a directory or restore epoch key without doing a read in every transaction)

(Christophe Chevalier) #3

After talking with a colleague about this today, I don’t think there is currently a way to guarantee 100% that all directory subspace prefix will be correct without adding at least one read to each transaction (even if to check a version key on the subspace itself). So in practice there would be no such thing as a write-only transaction, and if I need to do at least one read, then atomic operations and versionstamps are less useful, which is a shame.

This is too much of a cost to pay, in my opinion, for an event that only happens once in a blue moon, as in a “soft” backup/restore at the application level that happens when updating the schema or indexes of a document store, ie: clearrange '' \xFF followed by bulk inserting everything back in, while some application nodes are still running somewhere (adminsys forgot to kill some processes). This has happened to us multiple times (node not dying immediately because of some thread stuck somewhere, and continuing accepting incoming HTTP traffic).

If I lower my expectations just a tiny bit, from “it will never corrupt anything but will be slow” down to “at least if it happens, the application will figure it out quickly” then it could be tolerable. I always use Directory Partitions to contain an application’s data, so I could at least register the main partition with a value that is always different, so I’m guaranteed that the new restored content is in a different key subspace that the previous generation, and that way surviving nodes that haven’t rebooted yet would at most add data to the old location, and not break the new one, combined with some polling that detects things in a manner of a few seconds…

But yes, if there was a way in the future to say "conflict my transactions if this key (or range of keys) in the Directory Layer has been changed by anyone in the last 5 seconds`, combined with a thread that would check things every 5 seconds, then it would make this a non-issue.

I would still need a way to detect availability events to speed things up (right after the client is back online, do a check immediately, instead of waiting the remainder of the 5 sec quanta).

Btw: this is highlighting another issue with the current design of Directory Subspace currently: once they are created with the current prefix for this subspace, and handed to the Layer, then there is no real mean for the Directory Layer to silently update the key prefix to the new value, without some events or subscription by the Layer code… This is starting to get real messy and ugly, just for the sake of helping prevent major catastrophe in the event of a sysadmin not following the update checklist properly. I need to think about how the API could be changed somehow to allow such scenario, maybe by adding a level of indirection at the subspace level, with “constant prefix” for manual prefixes, and “dynamic prefix” for subspaces created by a Directory Layer. It would then keep references to such prefixes and have a way to change them from under the Layer’s nose (so to speak). But what if the layer is in a middle of a transaction? :slight_smile:

(A.J. Beamon) #4

Just so I understand, with respect to the directory layer are you referring to the issue of directories being moved or deleted while being used by other clients? It’s certainly true that the current directory layer defers managing the concurrency between directory access and directory modifications to the user and provides no mechanism for the directory objects on the client to synchronize with changes in the database.

(Christophe Chevalier) #5

Yes, there a two things that I’d like to deal with to make my application tolerant against changes in the cluster, either in connectivity (network issues, crashes, …) as well a schema changes that could lead to read or writing data to the wrong place:

  1. Detecting changes in availability as soon as possible, to get rid of the potential timeout if I let transactions discover it themselves. I can use the Circuit Breaker pattern so that requests happening when the cluster is unavailable (for any reason) fail immediately (or use a fallback method), and then have longer timeouts in general (relying on the breaker to react quickly).

  2. Protect against changes of prefix in the Directory Layer during some infrequent operations (restore or schema upgrade that have to delete / recreate the whole directory hierarchy) that could happen while some nodes are still running (for all kinds of reasons, mistakes, bugs, …)

Having 1. could be used as a signal to detect major events and pre-emptively recheck the Directory Layer for changes that are caused by a major events.

But to 100% protect against 2., I would still need to detect changes that did not impact connectivity, such as a script doing a clear_range on all keys, and recreating all the folders (with new prefixes). Then I guess the only way would be to have a to create a conflict range on the last N seconds, combined with a thread that checks for changes in the DL every N seconds as well. I would need a way to identify from the error code that it is the DL who is conflicting and needs refreshing, and not the rest of the transaction.

Another way of protecting against 2. would be to have some sort of global lock that an upgrade tool could take to “force” all nodes into an offline state, and then they would wait for the lock to be released to go back “online” and recheck all subspaces.

But where would I store this key? Since I use partitions and subspaces for everything, the lock key itself would be in a subspace whose prefix could have changed after the migration/restore script did its job.

I would need a “safe” location outside the sphere of influence of the Directory Layer, with a fixed prefix. Maybe the DL could always reserve a fixed prefix for some global locking subspace, used to protect against chances in the DL itself? (Could be convention based, like 254 is for the DL itself, and for ex, 253 could be pre-reserved for a “system” subspace with a constant name, that cannot be deleted (or at least will always reuse the same prefix when created?).

note: to clarify, the application will be deployed and maintained by people who are not database experts or even sysadmins, and on some cases don’t really know what they are doing and don’t always follow the checklists. I really need to make it almost fool proof, because mistakes will happen! And in this case it could lead to complete data loss.

(A.J. Beamon) #6

So the idea is to use database availability changes as a signal that somebody may have made substantial changes? That’s an interesting idea, but it sounds like it won’t be sufficient to catch all such major events you’re worried about. And if you do use another mechanism to react to these events, you may no longer need this signal.

As to being able to fail early, I think that may also be a bit tricky. As Dave said, the database being down is supposed to appear to the client as a (potentially long) source of latency. Sometimes, the database may be “down” for less than 1 second, in which case you may actually want to hang around and treat it like any other sub-second source of latency. So that raises the question how long should the cluster be down before we trigger this unavailability notification, and the current way to deal with that is with our timeouts. It sounds like the argument you’re making here may involve being able to have distinct timeouts for the database being down vs. other concerns (the database being slow, the transaction conflicting a lot, etc.).

You could potentially use a key inside each directory to track when it was created, with each of your clients storing the value of that key when opening the directory. For example, you could store a key with a versionstamped value inside each directory. You may need to update the value when moving the directory, depending on what your clients are doing with their DirectorySubspaces (and you would also have to recursively change the child directories). I don’t believe directory operations (such as creating new directories and removing directories) will work on DirectorySubspaces that refer to a directory that has been moved, but you should still be able to pack keys with them.

Clients would then be responsible for reading this key in each transaction that they use the directory. If the value matches what it was when you opened it, then you can proceed to use the directory. If it has changed, then you may need to reopen it. If the key is now missing, the directory has presumably been deleted. If there are concurrent modifications to the directory structure that change this key, then they will cause your transaction to fail.

The above scheme costs a read for each directory that you access (though you could pull the special keys up to a higher level if you wanted, reducing the number of reads if you are accessing a lot of directories), but it seems like it would protect you from the various events you described.

(Christophe Chevalier) #7

I have users that are actively standing up in front of an appliance, waiting for the little spinner to go away, so I cannot wait 10 or 15 seconds for a “traditional” timeout. I would need shorter timeouts, but then it will be too trigger happy, and now I’m faced with a magic number that is always too long and too short at the same time.

The current version or our product uses a SQL database with a Circuit Breaker on top that monitors failures and unusual latency, combined with polling every minute for connectivity to the database. This works well because when a failure happens, we still need 5-10 sec to react, but then the breaker opens and all following requests will be served from a cache, or at least we can push a message “sorry this is out of order please wait a moment” to the user. Since it is very unlikely that the database would go down at the exact moment a user would start interacting with the appliance, it means that they either show up to something that works, or is already displaying a “out of order” message with some helpful text. No waiting. And if the issue happens while no-one is watching, we did not waste any time timeouting or polling.

I need to keep doing that, but if I want to scale to larger server counts, I’m can’t really use frequent polling (client side) to pre-emptively react to db connectivity loses. Also, with more servers and more users, the probability that someone actually uses the appliance while the error condition is starting goes up.

My idea was that the client library already has to do this sort of things, and the polling here is probably happening at the socket layer with keep alive messages that are send to some coordinator or proxy node. This is probably very efficient. If I would do the same thing at the application level, I would need to probably read a key (or maybe get a read version) which will take more cpu time from the processes (and require more network roundtrips).

IF the client already does have this internal connected/disconnected/reconnecting state, and if it could be exposed to the application then at least I will not add any overhead to what there currently is.

The circuit breaker that sits on top could query this state (either on each transaction start or periodically), as well as the regular monitoring of errors / latency.

(Christophe Chevalier) #8

That’s not really possible for me, because this kills write-only transactions, and EVERY layer has to perform this check, everywhere, every time it does anything to the db. And since this is a read, this is async which also has an impact (at least in C# where the method need to be async as well).

Like explained above, I don’t need 100% certainty, just to ensure that the probability of if happening it so small that it is acceptable.

I think that with the current API, the only way would be to create such a key, and then use a watch to detect if it changes, and try to react as soon as possible. The window of opportunity for data corruption would be the time it takes for the watch to trigger, and the callback to pull the fuse (by poisoning the Subspace instance so that it throws whenever someones attempts to use it). Since database instances can also be setup to only see a partition, they would ALSO be poisoned.

At least, I could “nuke” any node that would have survived, and with proper logging/notification, wait for them to be restarted. (basically, the application could hook such event, and do something like a FailFast(…) to self destruct and get the attention of the sysadmin.

Ideally, with some new API to create conflict range in the past, then this would be automatically added to each transaction that comes from a database instance tied to a partition, and reduce this window of opportunity even more (maybe down to 0).

(A.J. Beamon) #9

To elaborate on what I said earlier, what constitutes the database being down? It can be down while all of the processes remain up and responsive, and it can be up but with the client unable to connect to some or all of it. And all of these states can be of arbitrarily short or long duration. We’ve designed the client so that these all appear to the client as an extra source of latency. During a transaction sub-system recovery, for example, the database is “down”, but this condition is usually fairly short-lived and probably shouldn’t be regarded as down from the client perspective. This is why it’s useful to have a timeout here, and if it exists it probably should be a duration specified by the client rather than us.

Right, that’s what I was getting at with suggesting having distinct timeouts for the two cases, though I’m not fully convinced yet that this is a useful distinction. Perhaps what may be useful is the ability to specify separate timeouts for transactions and operations. Transaction timeouts last the entire duration of the transaction, including retries, so they do need to be made long enough to account for the possibility of retries. However, it may be the case that you want to fail early if the database doesn’t respond to any given operation in some smaller duration. I guess you could argue that the database is more likely to eventually respond to a slow running request within your normal timeout if it’s “up” by some measure, but that seems hard to really quantify. Many cases where the database is “down” are probably short-lived too.