I’m trying to get the list of error codes a Watch can throw that mean “you need to retry”, as opposed to “this is a real error”.
See below for the pseudo-code.
Questions:
- What are the error codes that are safe to retry for a Watch?
- Also, I think I should probably wait a random amount of time before retrying (with exponential backoff?) to avoid hammering the database…
- Can I use fdb_transaction_on_error() for this case too? It seems tailored to general errors that happen inside a transaction or during commit. But here the failure happens outside of the transaction, so maybe we need a different list of codes?
- Is there an easy way to make watches fail, so I can test that my code handles all cases properly?
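For the backoff part, this is roughly what I have in mind: a minimal sketch of exponential backoff with full jitter. WatchBackoff and NextDelay are names I made up (they are not part of any FoundationDB binding), and the base and cap values are up to the caller.

```csharp
using System;

// Hypothetical helper, not part of any FoundationDB binding.
public static class WatchBackoff
{
    // Returns a random delay in [0, min(maxDelay, baseDelay * 2^attempt)),
    // i.e. exponential backoff with "full jitter".
    public static TimeSpan NextDelay(int attempt, Random rng, TimeSpan baseDelay, TimeSpan maxDelay)
    {
        if (attempt < 0) throw new ArgumentOutOfRangeException(nameof(attempt));

        // Clamp the exponent so the growth factor stays finite and reasonable.
        double factor = Math.Pow(2, Math.Min(attempt, 30));

        // Upper bound for this attempt, capped at maxDelay.
        double capMs = Math.Min(maxDelay.TotalMilliseconds, baseDelay.TotalMilliseconds * factor);

        // Pick a uniformly random point below the cap.
        return TimeSpan.FromMilliseconds(rng.NextDouble() * capMs);
    }
}
```

In the catch block I would then do something like `await Task.Delay(WatchBackoff.NextDelay(attempt++, rng, TimeSpan.FromMilliseconds(10), TimeSpan.FromSeconds(1)), ct);` and reset `attempt` to 0 once messages arrive. The 10 ms base and 1 s cap are guesses on my part, not recommended values.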
EDIT: looking at the code (watchValue(..) in NativeAPI.actor.cpp), it looks like the implementation already handles some error codes and waits between retries:
- both error_code_wrong_shard_server and error_code_all_alternatives_failed wait 10 ms before retrying
- error_code_timed_out is also retried after 10 ms
- error_code_watch_cancelled is retried after 1 second
Any other code waits 10 ms (same delay as timed_out) and is then thrown back to the caller.
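To make my reading of it concrete, here is the same behaviour written as a lookup. This only restates the list above and is not checked against the actual source; WatchRetryNotes and Describe are names I invented, and the error names are plain strings without the error_code_ prefix.

```csharp
using System;

// Hypothetical summary of my reading of watchValue(..); not an FDB API.
public static class WatchRetryNotes
{
    // Returns (retriedInternally, delay): whether watchValue(..) appears to retry
    // the error by itself, and how long it waits before retrying or rethrowing.
    public static (bool RetriedInternally, TimeSpan Delay) Describe(string errorName) => errorName switch
    {
        "wrong_shard_server" or "all_alternatives_failed" => (true, TimeSpan.FromMilliseconds(10)),
        "timed_out" => (true, TimeSpan.FromMilliseconds(10)),
        "watch_cancelled" => (true, TimeSpan.FromSeconds(1)),
        // Anything else: wait 10 ms, then the error reaches the caller.
        _ => (false, TimeSpan.FromMilliseconds(10)),
    };
}
```

If this reading is right, only the last case ever reaches my catch block, and too_many_watches would fall into it.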
Looking at this, I’m not sure I need to bother catching specific codes, because they seem to be handled already.
I don’t know what I should do if I see too_many_watches though; I don’t really want another code path that falls back to polling.
My algorithm looks like this:
- loop until we get messages (or are cancelled by the caller)
  - start a transaction to pop the next messages or, if there are none, set up a watch
  - if we got messages, return them
  - await the watch for the next messages
  - try again
async Task<Message[]> GetNextMessages(string queue, CancellationToken ct)
{
    while (!ct.IsCancellationRequested)
    {
        // get next messages, or set up a watch if none
        (var items, var watch) = await db.ReadWriteAsync(async tr =>
        {
            // named "next" so it does not shadow the outer "items"
            var next = await PopNextMessagesFromQueue(tr, queue, ....);
            if (next == null)
            { // nothing, set up the watch
                return (null, tr.Watch(....));
            }
            // we have some items!
            return (next, null);
        }, ct);
        if (items != null) return items;
        // wait for next commands
        try
        {
            await watch;
        }
        catch (FDBException x)
        { // the watch failed!
            // RetryableWatchErrors is a placeholder for the LIST OF RETRYABLE WATCH ERRORS I'm asking about
            if (!RetryableWatchErrors.Contains(x.Code))
            { // this is a non-recoverable error, abort!
                throw;
            }
            // this is a transient error, try again
            //TODO: should we wait a bit here?
        }
    }
    // cancelled by the caller before any message arrived
    throw new OperationCanceledException(ct);
}