Performance characteristics of using Watches for Distributed Task Scheduling

I’m trying to work out which of the error codes that a Watch can throw mean “you need to retry” and which mean “this is a real error”.

See below for the pseudo-code.

Questions:

  1. What are the error codes that are safe to retry for a Watch?

  2. Should I wait a random amount of time before retrying (with exponential backoff?) to avoid hammering the database?

  3. Can I use fdb_transaction_on_error() for this case as well? It seems to be tailored for general errors that happen inside a transaction or during commit. But here the error happens outside of the transaction, so maybe we need a different list of codes?

  4. Is there an easy way to make watches fail, so that I can test that my code handles all cases properly?
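
For question 2, here is a minimal sketch of the kind of randomized exponential backoff I have in mind. `RetryBackoff` and `NextDelay` are hypothetical names of my own, not part of any FoundationDB binding, and the base and maximum delays are arbitrary values chosen for illustration.

```csharp
using System;

public static class RetryBackoff
{
    // Hypothetical helper: returns the delay to wait before retry number
    // `attempt` (0-based). The upper bound doubles with each attempt, is capped
    // at `maxDelay`, and the actual delay is drawn uniformly below that bound.
    public static TimeSpan NextDelay(int attempt, TimeSpan baseDelay, TimeSpan maxDelay, Random rng)
    {
        if (attempt < 0) throw new ArgumentOutOfRangeException(nameof(attempt));

        // Clamp the exponent so the multiplication stays well within range.
        double factor = Math.Pow(2, Math.Min(attempt, 30));
        double capMs = Math.Min(maxDelay.TotalMilliseconds, baseDelay.TotalMilliseconds * factor);

        // Randomize so that many clients do not retry in lockstep.
        return TimeSpan.FromMilliseconds(rng.NextDouble() * capMs);
    }
}
```

It could be used as `await Task.Delay(RetryBackoff.NextDelay(attempt++, TimeSpan.FromMilliseconds(10), TimeSpan.FromSeconds(1), rng), ct);` before going around the loop again.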

EDIT: looking at the code (watchValue(..) in NativeAPI.actor.cpp), it looks like the implementation already handles some error codes and does wait between errors:

  • both error_code_wrong_shard_server and error_code_all_alternatives_failed wait 10 ms before retrying
  • error_code_timed_out is also retried after 10 ms
  • error_code_watch_cancelled is retried after 1 second

For any other error code, it waits 10 ms (the same delay as for timed_out) and then throws the error back to the caller.
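
To make sure I read this correctly, here is my understanding written down as a small lookup. This is only a summary of my reading of watchValue(..), not code taken from the client: `WatchRetryPolicy` and `Describe` are hypothetical names, and the error names are passed as plain strings for illustration.

```csharp
using System;

public static class WatchRetryPolicy
{
    // Hypothetical summary of the behavior described above (my reading of the
    // source, not verified): returns the delay applied after the error, and
    // whether the client retries on its own (true) or rethrows to the caller (false).
    public static (TimeSpan Delay, bool RetriedInternally) Describe(string errorName)
    {
        switch (errorName)
        {
            case "wrong_shard_server":
            case "all_alternatives_failed":
            case "timed_out":
                return (TimeSpan.FromMilliseconds(10), true);
            case "watch_cancelled":
                return (TimeSpan.FromSeconds(1), true);
            default:
                // any other code: short wait, then the error reaches my code
                return (TimeSpan.FromMilliseconds(10), false);
        }
    }
}
```

If this reading is right, the only errors that my own catch block ever sees are the ones that fall into the default branch.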

Looking at this, I’m not sure that I need to bother catching these codes myself, because they already seem to be handled.

I don’t know what I should do if I see too_many_watches, though. I don’t really want to have another code path that falls back to polling.


My algorithm looks like this:

  • loop until we get messages (or are cancelled by caller)
    • start a transaction to pop the next messages, or if there are none, set up a watch
    • if got messages, return
    • await the watch for next messages
    • try again

async Task<Message[]> GetNextMessages(string queue, CancellationToken ct)
{
  while(!ct.IsCancellationRequested)
  {

    // get the next messages, or set up a watch if there are none
    (var items, var watch) = await db.ReadWriteAsync(async tr =>
    {
      // named 'popped' because a lambda local cannot reuse the name of the outer 'items'
      var popped = await PopNextMessagesFromQueue(tr, queue, ....);

      if (popped == null)
      { // nothing yet, set up the watch
        return (null, tr.Watch(....));
      }
      // we have some items!
      return (popped, null);
    }, ct);

    if (items != null) return items;

    // wait for next commands
    try
    { 
      await watch;
    }
    catch(FDBException x)
    { // the watch failed!

      if (x.Code not in [/* LIST OF RETRYABLE WATCH ERRORS */])
      { // this is a non recoverable error, abort!
         throw;
      }
      // this is a transient error, try again
      //TODO: should we wait a bit here?
    }
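  }
  // only reached once the caller has cancelled; without this, not all code paths return a value
  throw new OperationCanceledException(ct);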
}
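
For question 4, lacking a way to make real watches fail, one option is to test the retry logic against a stand-in. The sketch below is an assumption-heavy illustration: `WatchRetry`, `AwaitWithRetryAsync` and `FlakyWatch` are hypothetical names, the "watch" is just a `Task` factory, and in the real loop above a retry would go back through the transaction instead of calling a factory.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class WatchRetry
{
    // Awaits the task returned by `createWatch`. When it fails with an error
    // accepted by `isRetryable`, waits `delay` and asks for a new watch.
    // Any other error propagates. Returns the number of failed attempts.
    public static async Task<int> AwaitWithRetryAsync(
        Func<Task> createWatch,
        Func<Exception, bool> isRetryable,
        TimeSpan delay,
        CancellationToken ct)
    {
        int failures = 0;
        while (true)
        {
            ct.ThrowIfCancellationRequested();
            try
            {
                await createWatch();
                return failures; // the watch fired
            }
            catch (Exception e) when (isRetryable(e))
            {
                failures++;
                await Task.Delay(delay, ct); // pause before trying again
            }
        }
    }

    // Test double: a factory whose first `faults` watches fail with the
    // exception built by `makeError`, and whose later watches complete at once.
    public static Func<Task> FlakyWatch(int faults, Func<Exception> makeError)
    {
        int remaining = faults;
        return () => remaining-- > 0
            ? Task.FromException(makeError())
            : Task.CompletedTask;
    }
}
```

A test can then check both paths: `FlakyWatch(2, () => new TimeoutException())` with a predicate that accepts `TimeoutException` should succeed after two failures, while an exception type that the predicate rejects should surface immediately.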