Defaults for transaction timeouts and retries?

Just curious what the defaults are for the client timeout values and retry values, so that I can know whether I need to change those from their defaults for my use case. Also, a quick explanation of how those are used by the client could be helpful (i.e., does the client have a timeout set on each attempt, and then an overarching timeout value for all attempts in total? Or is the timeout just set for the overall set of attempts, and the client retries until either the max attempts are reached OR the timeout value is reached?)


The default behavior is to not time out or limit retries at all. If there is a blocking operation that’s not being fulfilled, or if the transaction is being retried repeatedly, it will continue to do so indefinitely.

To use timeouts and/or retry limits, you would set the appropriate transaction option during each attempt. The timeout is based on the start time of the first try. If either limit is reached, then your operations will start to fail with an error. In the case of timeouts, you’ll get a timeout error, while if you hit the retry limit it will return to you the last error that caused the transaction to fail. If you don’t set the option on a particular transaction attempt, then the limit won’t apply for that attempt, even if you set it during an earlier attempt.

The retry limit behavior is tracked in the onError function. All of our high level retry loops are implemented using onError, but if you are writing your own, you’ll probably want to use onError to determine if you should retry or not.
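
To make the semantics above concrete, here is a minimal sketch of a retry loop with both limits. It is a stand-in model, not the client's actual implementation: `RetryableError`, `TimedOut`, `run_with_limits`, and the injectable `clock` are all hypothetical names, and backoff between attempts is left out.

```python
import time

class RetryableError(Exception):
    """Stand-in for a transient failure such as a commit conflict."""

class TimedOut(Exception):
    """Stand-in for the client's timeout error."""

def run_with_limits(body, retry_limit=-1, timeout_s=None, clock=time.monotonic):
    """Run body() until it succeeds or a limit is hit.

    retry_limit: -1 means unlimited retries; k allows k retries.
    timeout_s: measured from the start of the first try; None means no timeout.
    """
    start = clock()  # the deadline is anchored to the first try
    retries = 0
    while True:
        if timeout_s is not None and clock() - start >= timeout_s:
            raise TimedOut()
        try:
            return body()
        except RetryableError:
            if retry_limit != -1 and retries >= retry_limit:
                raise  # surface the last error that caused the failure
            retries += 1
```

With the defaults (`retry_limit=-1`, `timeout_s=None`) this loop never gives up, which mirrors the default behavior described earlier in the thread.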


Just to continue on with what @ajbeamon was saying, here’s the link for the (python) transaction options: https://apple.github.io/foundationdb/api-python.html#transaction-options And here they are for Java: https://apple.github.io/foundationdb/javadoc/com/apple/foundationdb/TransactionOptions.html

The relevant ones for timeouts and retry loops would be (and it’s possible I’ve missed one):

  • set_retry_limit (set to -1 for ∞, 0 to only run the loop once, and k to run it up to k + 1 times, i.e., with k retries)
  • set_max_retry_delay (to set how long you wait during exponential backoff)
  • set_timeout (maximum amount of milliseconds to wait before automatically cancelling a transaction)
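
As a rough illustration of how the first two options interact, the sketch below translates a retry limit into a total attempt count and computes a capped exponential backoff schedule. The function names, the initial delay, and the growth factor are assumptions chosen for the example; they are not the client's real constants.

```python
def max_attempts(retry_limit):
    """Translate a retry limit into a total number of attempts (None = unbounded)."""
    if retry_limit == -1:
        return None
    return retry_limit + 1  # k retries means up to k + 1 runs of the loop

def backoff_delays(retries, initial_delay_s=0.01, growth=2.0, max_retry_delay_s=1.0):
    """Delay before each retry: exponential growth, capped at max_retry_delay_s.

    initial_delay_s and growth are illustrative values, not FoundationDB's.
    """
    delays = []
    delay = initial_delay_s
    for _ in range(retries):
        delays.append(min(delay, max_retry_delay_s))
        delay *= growth
    return delays
```

For example, `max_attempts(0)` is 1 (run the loop once), and a lower `max_retry_delay_s` flattens the tail of the schedule sooner.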

Unfortunately, these have to be set on a per-transaction basis, so there isn’t a way to say, “All transactions created with this database object should time out after 10 seconds.” If you had your own retry loops, you could add that information in when you created the transaction, though there’s an argument that we should handle that better, too.

Oh, but also, if you are re-implementing retry loops, you might also want to look at the “isRetryable” method on errors (see the Javadocs). This will return “true” if the error is something that could be temporary (like, say, network failures or transaction subsystem reboots) and “false” if the error is something that retrying won’t help with (like, say, key too large). You should only retry if that function returns true. The onError function handles that for you: it raises immediately if the error is not retryable, and injects a delay before returning without error if it is. You can also call it directly if you have some other system you are using to handle retry loops.
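
The contract described here (fail fast on permanent errors, wait and return on transient ones) can be sketched roughly as follows. `StoreError`, its `retryable` attribute, and the fixed delay are hypothetical stand-ins for the binding's error type, `isRetryable`, and the real backoff; only the shape of the loop is the point.

```python
import time

class StoreError(Exception):
    """Hypothetical error type; `retryable` plays the role of isRetryable()."""
    def __init__(self, message, retryable):
        super().__init__(message)
        self.retryable = retryable

def on_error(err, delay_s=0.01, sleep=time.sleep):
    """Re-raise permanent errors; otherwise wait, then return so the caller retries."""
    if not err.retryable:
        raise err
    sleep(delay_s)

def run(body, sleep=time.sleep):
    """Hand-written retry loop that delegates the retry decision to on_error."""
    while True:
        try:
            return body()
        except StoreError as err:
            on_error(err, sleep=sleep)
```

Keeping the retry decision inside `on_error` means the loop itself never needs to know which errors are transient.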


Just to be the one who disagrees :)

I always thought that expecting (junior) developers to never forget to set these options on each and every transaction (including after a reset) was too optimistic, so I decided to bake the notion of a default timeout and retry limit into the binding itself. It automatically sets these options when creating a new transaction internally, before exposing it to the application (and also when the transaction resets).

If your binding does not do that, I suggest that you create a helper function that wraps the existing retry loops to automatically set these options and then calls the original lambda function or handler.
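
A minimal sketch of such a helper, assuming the retry loop invokes the handler once per attempt with a transaction object that exposes `options.set_timeout` and `options.set_retry_limit` (the option names listed earlier in the thread). The name `with_default_limits` and the default values are made up for illustration.

```python
import functools

def with_default_limits(handler, timeout_ms=60_000, retry_limit=10):
    """Wrap a transaction handler so the limits are applied on every attempt."""
    @functools.wraps(handler)
    def wrapped(tr, *args, **kwargs):
        # Set the options at the start of each attempt, since they are not
        # carried over from one attempt to the next.
        tr.options.set_timeout(timeout_ms)
        tr.options.set_retry_limit(retry_limit)
        return handler(tr, *args, **kwargs)
    return wrapped
```

You would then pass the wrapped handler to whatever retry loop your binding provides, so application code never has to remember the options itself.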

After auditing a lot of “real life” code, this turned out to be a good choice, because nobody ever bothered setting a value. At least they had a default timeout (after 60 sec) instead of the equivalent of a while(true) { burn_some_cpu(); }.


I think there is a strong case to be made for having the ability to specify these options (and others) more globally. The idea we’ve tossed around for this, which Alec alluded to, is to support setting transaction options at the database level that apply to every transaction created from that database object. However, this feature does not currently exist.

That said, I’m not sure I agree that the best general choice is for the bindings (or C client) to prescribe a particular non-infinite default timeout. Timeouts and retry limits are the kind of thing that’s likely not to be exercised in testing unless specifically thought about, and having operations unexpectedly fail in certain real world conditions doesn’t seem desirable. If you haven’t written your code to expect possible timeouts (which is a reasonable choice for someone who doesn’t need them), then you’re likely to be surprised by this.

On the other hand, it doesn’t seem onerous to me for someone that needs to deal with timeouts to also make a single call to set a global value. Then, they can set a value that makes sense for their use-case rather than a default that’s likely not actually what they want anyway. I do agree, though, that shifting the burden to setting this option on every transaction is a pretty high expectation, and it would be good to fix this.

In Java, it’s not that hard to extend the class and automatically inject logging, setting of transaction options, etc. into all calls. We happen to also require all transactions to have a name so that we can track tps, time taken, etc. as metrics. It’s also helpful to automatically print retries when they happen.


How do you collect these? I don’t think the native client supports naming transactions?

This looks interesting, and I should try to investigate how to support this also. I’m currently only tagging transactions with a client-side ID, but an app-provided tag could be very useful in some cases…

The native client doesn’t currently offer this, though support for it is something that we’re thinking about. I’ll try to get an issue created on GitHub to track it.


No, it’s not passed to the native client, it’s just something we shim around the Java layer, and we use Dropwizard metrics for them. We also reference count objects with dispose() and call them proactively (but that’s a different thing =p).

GitHub issue link: https://github.com/apple/foundationdb/issues/465

I agree with both ajbeamon and Christophe. Each developer shouldn’t have to set a timeout as junior developers may not know these exist and they will never be set.

It should be easy in the bindings to set a per transaction override but it seems counterintuitive that the client is where you’d define the global timeout.

My strong preference would be to be able to configure a cluster-wide timeout, which is set as a best practice by ops based on data rather than a “best guess” from development.

This really is a server feature not a client feature if you think about it in the true sense. The server says: “you should expect a response back in X time, or consider it a failure” rather than each client saying “I’m gonna assume things failed if I don’t hear back from you in X time”. In a well configured system, you will likely have both because you want more guarantees but in most cases, the server advertising a timeout would be sufficient as long as the client respects it.

Hi,

I’m using FDB in my Java Maven project and have a problem running a batch query. It fails with a ‘past_version’ error. I set the timeout option to 60000 and the retry limit to 100 every time before running a transaction. The first time I run a transaction after starting my application, it is interrupted after about 50 seconds, but on later runs it is interrupted after 5 seconds, and after retrying the transaction 10 times it throws a ‘past_version’ error.

Could you please help me to solve this problem?

My system is macOS Mojave 10.14.3
java version - 11
FoundationDB version 3.0.9

That is unfortunately unrelated to a transaction timeout. FDB has a 5s transaction limit. See the known limitations. This means that you’ll need to break your scan of data down into multiple transactions, with each next transaction starting the read at the last key that your previous transaction read. This means you’ll no longer have transactional reads though. If you need transactional reads, then the easiest way is to suffix your keys with a versionstamp, and filter out newer versions manually.
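
The resume-from-last-key pattern can be sketched with an in-memory stand-in for the database. Here `read_chunk` plays the role of one short transaction; both function names and the sorted-list storage are assumptions for illustration, not the FDB API.

```python
import bisect

def read_chunk(sorted_items, begin_key, limit):
    """Stand-in for one short transaction: read up to `limit` pairs with key > begin_key."""
    keys = [k for k, _ in sorted_items]
    start = 0 if begin_key is None else bisect.bisect_right(keys, begin_key)
    return sorted_items[start:start + limit]

def scan_in_chunks(sorted_items, chunk_size):
    """Scan everything using many small reads, resuming after the last key seen."""
    last_key = None
    results = []
    while True:
        chunk = read_chunk(sorted_items, last_key, chunk_size)
        if not chunk:
            return results
        results.extend(chunk)
        last_key = chunk[-1][0]  # the next "transaction" starts just past this key
```

Because each chunk is read separately, writes that land between chunks can be visible to later chunks but not earlier ones, which is the loss of transactional reads mentioned above.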

Thank you for your quick response.
I read about the limitation, but the fact that the first transaction after starting my application runs for about 50 seconds made me think that it might be possible to customize the transaction limit. Do you have any idea why the limit does not apply the first time?
Thanks in advance.

There is no reason within FoundationDB that a transaction could last 50s. The only ordinary way this could appear is if it reliably takes you 45s to be able to connect to the database. However, it seems awfully suspicious to me that you set a transaction to retry 10 times, and you’re seeing a 10 × 5 second delay…

I’m using a batch transaction to create multiple records. The first time, all records are created within 50 seconds without any retry; only subsequent transactions are retried 10 times, at 5 seconds each. Anyway, thank you, I will try to solve the problem based on your suggestions.

It seems that different database drivers have different default timeout configurations. For example, go-redis has DialTimeout=5s and ReadTimeout = WriteTimeout = 3s,

whereas the MongoDB Go driver has ConnectTimeout=30s and SocketTimeout=0 (can block indefinitely).

As far as I’m concerned, defaulting the read/write timeout to 0, meaning the operation may block indefinitely, is a reasonable configuration.