How to understand fdbbackup error : Task execution stopped

(Mengran Wang) #1

Hi Folks,
Running into some problems with fdbbackup and looking for some guidance on how to understand this specific error message.

Every 2.0s: fdbbackup status -C /etc/foundationdb/fdb.cluster  
The previous backup on tag `default' at file:///XXXX/backup-2019-05-21-02-53-02.740366 completed at version 123840072995008.
Older Errors
1.88 minute(s) ago : 'Task execution stopped due to timeout, abort, or completion by another worker' on 'file_backup_write_range_5.2'

Both fdbserver and fdbclient are on version 5.2.26.
We’re simply writing the backup into local disk files, so no blobstore/s3 is involved.

Trying to understand:

  1. Can we know for sure that the backup completed and succeeded?
  2. What does the error message 'Task execution stopped due to timeout, abort, or completion by another worker' mean? I checked the source code, and it is only thrown in one place, TaskBucket.h. Is there any additional information that would help track down the actual cause?
  3. And what is file_backup_write_range_5.2? A process/worker name or a file name?

Thank you!

(Alex Miller) #2

This is probably a question for @SteavedHams :slight_smile:

The highest 5.2 went is 5.2.8? Did you do a series of internal additions to 5.2 to get to 5.2.26? Did any of those involve backup code?

(Steve Atherton) #3

Your backup completed; that’s what the first sentence of the output is saying.

The errors, categorized by age and type, are meant to be used to figure out why a backup isn’t making progress as expected. Errors are logged to the database and the latest error of each error type is printed in status along with the age of the error.

The specific error you are seeing is common and in most cases is analogous to a “transaction too old” error one might get when using the FDB API, but on a longer time frame (about 1 minute). In backup, work is done in small units called tasks. Tasks use many transactions and can run for a long time, but while a task is executing, the executor must renew its lease on the task to prevent another executor from declaring the task failed and re-queueing it. Sometimes, for various reasons, a lease expires before it is renewed. The lease mechanism is part of backup execution’s distributed / fault tolerant design. If an executor dies in the middle of doing work, the task(s) it owned and was executing at the time will be declared failed soon after lease expiry and be requeued by some other executor so that backup progress can continue.
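To make the lease idea concrete, here is a toy Python model of it. This is only a sketch of the mechanism described above, not FDB's TaskBucket code: the names `ToyTaskBucket`, `claim`, `renew`, `TaskStopped` and the `LEASE_VERSIONS` value are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical lease length, measured in versions rather than seconds.
LEASE_VERSIONS = 60_000_000


class TaskStopped(Exception):
    """Toy stand-in for the 'Task execution stopped ...' error."""


@dataclass
class Task:
    name: str
    owner: Optional[str] = None
    lease_expires_at: int = 0


class ToyTaskBucket:
    """Toy model: one lease per task, expiry expressed as a version."""

    def __init__(self) -> None:
        self.tasks: Dict[str, Task] = {}

    def add(self, name: str) -> None:
        self.tasks[name] = Task(name)

    def claim(self, name: str, executor: str, version: int) -> bool:
        """Take the task if it is unowned or its lease has expired."""
        task = self.tasks[name]
        if task.owner is not None and version < task.lease_expires_at:
            return False  # someone else still holds a valid lease
        task.owner = executor
        task.lease_expires_at = version + LEASE_VERSIONS
        return True

    def renew(self, name: str, executor: str, version: int) -> None:
        """Extend the lease, or fail if it was lost or has expired."""
        task = self.tasks[name]
        if task.owner != executor or version >= task.lease_expires_at:
            raise TaskStopped(f"{executor} no longer holds the lease on {name}")
        task.lease_expires_at = version + LEASE_VERSIONS


bucket = ToyTaskBucket()
bucket.add("write_range")

a_claimed = bucket.claim("write_range", "agent_a", version=0)
bucket.renew("write_range", "agent_a", version=30_000_000)  # renewed in time
b_early = bucket.claim("write_range", "agent_b", version=50_000_000)
# agent_a stalls and misses its renewal, so agent_b takes the task over.
b_late = bucket.claim("write_range", "agent_b", version=100_000_000)

try:
    bucket.renew("write_range", "agent_a", version=101_000_000)
    a_error = None
except TaskStopped as exc:
    a_error = str(exc)  # agent_a learns its work was handed to someone else
```

In this model `agent_a` is the one that ends up logging the error, even though the task itself is picked up and finished by `agent_b`, which is why the message can appear next to a backup that completed.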

Task leases and timeouts are defined by version, not actual time, so an event like a master recovery (which artificially advances the version counter) will cause all outstanding tasks for backup (and DR, and restore, and anything else using Taskbucket) to be timed out and requeued.
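A small Python sketch of why a version jump expires every lease at once. All of the numbers here are illustrative assumptions (the version rate, the lease length, and the size of the recovery jump), and `expired_leases` is a made-up helper, not anything in FDB.

```python
VERSIONS_PER_SECOND = 1_000_000          # assumed nominal version rate
LEASE_VERSIONS = 60 * VERSIONS_PER_SECOND  # assumed lease length (~1 minute)
RECOVERY_JUMP = 100_000_000              # hypothetical jump caused by a recovery


def expired_leases(leases, current_version):
    """Return the names of tasks whose lease expiry version has been reached."""
    return sorted(name for name, expiry in leases.items()
                  if current_version >= expiry)


start = 1_000_000_000
leases = {
    "task_a": start + LEASE_VERSIONS,                             # leased at start
    "task_b": start + 20 * VERSIONS_PER_SECOND + LEASE_VERSIONS,  # leased 20s later
}

# 25 seconds of normal operation: both leases are still valid.
normal_version = start + 25 * VERSIONS_PER_SECOND
before = expired_leases(leases, normal_version)

# A recovery bumps the version counter; no wall-clock time needs to pass.
after = expired_leases(leases, normal_version + RECOVERY_JUMP)

print(before)  # []
print(after)   # ['task_a', 'task_b']
```

Because expiry is compared against the version and not a clock, both executors see their leases as expired immediately after the jump, even though they were renewing on schedule.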

Also, if a task throws any other exception (such as an I/O error or file permission error) instead of returning success, this same error will be logged at the Taskbucket level. The backup_agent logs will likely have more detail about the error.