FoundationDB

Defaults for transaction timeouts and retries?


(Michael Zamani) #1

Just curious what the defaults are for the client timeout values and retry values, so that I can know whether I need to change those from their defaults for my use case. Also, a quick explanation of how those are used by the client could be helpful (i.e., does the client have a timeout set on each attempt, and then an overarching timeout value for all attempts in total? Or is the timeout just set for the overall set of attempts, and the client retries until either the max attempts are reached OR the timeout value is reached?)


(A.J. Beamon) #2

The default behavior is to not time out or limit retries at all. If there is a blocking operation that’s not being fulfilled or if the transaction is being retried repeatedly, it will continue to do so indefinitely.

To use timeouts and/or retry limits, you would set the appropriate transaction option during each attempt. The timeout is based on the start time of the first try. If either limit is reached, then your operations will start to fail with an error. In the case of timeouts, you’ll get a timeout error, while if you hit the retry limit it will return to you the last error that caused the transaction to fail. If you don’t set the option on a particular transaction attempt, then the limit won’t apply for that attempt, even if you set it during an earlier attempt.

The retry limit behavior is tracked in the onError function. All of our high level retry loops are implemented using onError, but if you are writing your own, you’ll probably want to use onError to determine if you should retry or not.
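As a rough illustration of the points above, here is a minimal sketch of a hand-written retry loop in the style of the Python binding. It re-applies the timeout and retry limit at the start of every attempt and leaves the retry decision to on_error. The names run_with_limits, body, and error_type are made up for this example, the default values are arbitrary, and the code is duck-typed so it does not import fdb itself; in real code you would likely pass fdb.FDBError as error_type.

```python
def run_with_limits(db, body, timeout_ms=10_000, retry_limit=5,
                    error_type=Exception):
    """Run body(tr) in a hand-written retry loop (illustrative sketch).

    db is assumed to behave like an fdb Database object and body is any
    callable taking a transaction. error_type is a hypothetical parameter:
    with the real binding you would pass fdb.FDBError.
    """
    tr = db.create_transaction()
    while True:
        # Limits only apply to attempts on which they were set,
        # so set them again at the start of each attempt.
        tr.options.set_timeout(timeout_ms)
        tr.options.set_retry_limit(retry_limit)
        try:
            result = body(tr)
            tr.commit().wait()
            return result
        except error_type as e:
            # on_error either returns after a backoff delay (retry) or
            # raises (not retryable, or a limit has been reached).
            tr.on_error(e).wait()
```

The loop itself never counts attempts; it relies on on_error to raise once the retry limit is hit, which is where the limit is tracked.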


(Alec Grieser) #3

Just to continue on with what @ajbeamon was saying, here’s the link for the (python) transaction options: https://apple.github.io/foundationdb/api-python.html#transaction-options And here they are for Java: https://apple.github.io/foundationdb/javadoc/com/apple/foundationdb/TransactionOptions.html

The relevant ones for timeouts and retry loops would be (and it’s possible I’ve missed one):

  • set_retry_limit (set to -1 for ∞, 0 to only run the loop once, and k to run it up to k + 1 times, i.e., with k retries)
  • set_max_retry_delay (to set how long you wait during exponential backoff)
  • set_timeout (maximum number of milliseconds to wait before automatically cancelling a transaction)

Unfortunately, these have to be set on a per-transaction basis, so there isn’t a way to say, “All transactions created with this database object should time out after 10 seconds.” If you had your own retry loops, you could add that information in when you created the transaction, though there’s an argument that we should handle that better, too.

Oh, but also, if you are re-implementing retry loops, you might also want to look at the “isRetryable” method on errors (see the Javadocs). This will return “true” if the error is something that could be temporary (like, say, network failures or transaction subsystem reboots) and “false” if the error is something that retrying won’t help with (like, say, key too large). You should only retry if that function returns true. The onError function handles that for you, in that it will error immediately if the error is not retryable and inject a delay before returning without error if it is, but you can also call it directly if you have some other system you are using to handle retry loops.


(Christophe Chevalier) #4

Just to be the one who disagrees :slight_smile:

I always thought that expecting (junior) developers to never forget to set these options on each and every transaction (including after reset) was too optimistic, so I decided to have the notion of default timeout and retry limit baked into the binding itself, which will automatically set these options when creating a new transaction internally before exposing it to the application (and also when the transaction resets).

If your binding does not do that, I suggest that you create a helper function that wraps the existing retry loops to automatically set these options, and then call the original lambda function or handler.

After a lot of auditing “real life” code, this was a good choice, because nobody ever bothered setting a value. At least they had a default timeout (after 60 sec) instead of the equivalent of a while(true) { burn_some_cpu(); }.
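A helper along those lines might look roughly like the sketch below, written against the Python binding's option names. The decorator name with_default_limits and its parameters are invented for this example and are not part of any binding. The 60-second default mirrors the value mentioned above. The code is duck-typed and does not import fdb.

```python
import functools


def with_default_limits(timeout_ms=60_000, retry_limit=None,
                        max_retry_delay_ms=None):
    """Hypothetical decorator: apply limits before each run of the handler."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(tr, *args, **kwargs):
            # Runs at the start of every attempt, because the retry loop
            # calls the wrapped handler again on each retry.
            tr.options.set_timeout(timeout_ms)
            if retry_limit is not None:
                tr.options.set_retry_limit(retry_limit)
            if max_retry_delay_ms is not None:
                tr.options.set_max_retry_delay(max_retry_delay_ms)
            return func(tr, *args, **kwargs)
        return wrapper
    return decorate
```

With the Python binding, the intent would be to stack this underneath the existing retry loop (for example, @fdb.transactional above @with_default_limits(...)), so that the built-in loop re-invokes the wrapper, and therefore re-applies the options, on every attempt.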


(A.J. Beamon) #5

I think there is a strong case to be made for having the ability to specify these options (and others) more globally. The idea we’ve tossed around for this, which Alec alluded to, is to support setting transaction options at the database level that apply to every transaction created from that database object. However, this feature does not currently exist.

That said, I’m not sure I agree that the best general choice is for the bindings (or C client) to prescribe a particular non-infinite default timeout. Timeouts and retry limits are the kind of thing that’s likely not to be exercised in testing unless specifically thought about, and having operations unexpectedly fail in certain real world conditions doesn’t seem desirable. If you haven’t written your code to expect possible timeouts (which is a reasonable choice for someone who doesn’t need them), then you’re likely to be surprised by this.

On the other hand, it doesn’t seem onerous to me for someone that needs to deal with timeouts to also make a single call to set a global value. Then, they can set a value that makes sense for their use-case rather than a default that’s likely not actually what they want anyway. I do agree, though, that shifting the burden to setting this option on every transaction is a pretty high expectation, and it would be good to fix this.


(Clement Pang) #6

In Java, it’s not that hard to extend the class and automatically inject logging, setting of transaction options, etc. to all calls. We happen to also require all transactions to have a name so that we can track tps, time taken, etc. as metrics. It’s also helpful to automatically print retries when it happens.
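The post above is about Java, but the same idea can be sketched in Python as a purely client-side shim. This is not the code described above, just an illustration: the TransactionMetrics class and its named decorator are hypothetical, nothing here is passed to the native client, and it only measures the time spent inside the handler (not the commit).

```python
import time
from collections import defaultdict


class TransactionMetrics:
    """Hypothetical client-side registry of per-name transaction metrics."""

    def __init__(self):
        self.attempts = defaultdict(int)     # name -> number of attempts
        self.seconds = defaultdict(float)    # name -> total handler time

    def named(self, name):
        """Decorator that records one attempt per call of the handler."""
        def decorate(func):
            def wrapper(tr, *args, **kwargs):
                self.attempts[name] += 1
                if self.attempts[name] > 1:
                    # Handlers re-run by a retry loop show up here too, so
                    # this also fires on later, unrelated calls.
                    print(f"{name}: attempt {self.attempts[name]}")
                start = time.monotonic()
                try:
                    return func(tr, *args, **kwargs)
                finally:
                    self.seconds[name] += time.monotonic() - start
            return wrapper
        return decorate
```

Because a retry loop calls the handler once per attempt, the attempts counter includes retries; separating first attempts from retries would need extra bookkeeping that is left out here.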


(Christophe Chevalier) #7

How do you collect these? I don’t think the native client supports naming transactions?

This looks interesting, and I should try to investigate how to support this also. I’m currently only tagging transactions with a client-side ID, but an app-provided tag could be very useful in some cases…


(A.J. Beamon) #8

The native client doesn’t currently offer this, though support for it is something that we’re thinking about. I’ll try to get an issue created on GitHub to track it.


(Clement Pang) #9

No, it’s not passed to the native client. It’s just something we shim around the Java layer, and we use Dropwizard metrics for them. We also reference count objects with dispose() and call them proactively (but that’s a different thing =p).


(Balachandar Namasivayam) #10

GitHub issue link: https://github.com/apple/foundationdb/issues/465


(Dave Koston) #11

I agree with both ajbeamon and Christophe. Each developer shouldn’t have to set a timeout: junior developers may not know these options exist, so they will never be set.

It should be easy in the bindings to set a per transaction override but it seems counterintuitive that the client is where you’d define the global timeout.

My strong preference would be to be able to configure a cluster-wide timeout which is set as a best practice by ops based on data rather than a “best guess” from development.

This really is a server feature not a client feature if you think about it in the true sense. The server says: “you should expect a response back in X time, or consider it a failure” rather than each client saying “I’m gonna assume things failed if I don’t hear back from you in X time”. In a well configured system, you will likely have both because you want more guarantees but in most cases, the server advertising a timeout would be sufficient as long as the client respects it.