FDB Encrypted Backup bug caused by addition of AsyncFileKAIO Latency Histograms?

In February I was looking at implementing encrypted FDB backups and hit an issue which I couldn’t progress past, which I posted in the forum here.

I’ve since had some time to look at it, the actual error is being caused by

It appears an error occurs while trying to load the encryption key from disk. My understanding of the code is that this is executed within fdbbackup, rather than via RPC in fdbserver.

Specifically, the fault is caused by access to: SERVER_KNOBS->DISK_METRIC_LOGGING_INTERVAL.

The trace shows ClientKnobCollection::getServerKnobs() which suggests that AsyncFileKAIO is being used in the Client (fdbbackup)? But the knob seems to only be present on the ServerKnobCollection?

Try starting your backup_agent processes with --knob_disable_posix_kernel_aio 1 to see if that avoids the issue. This disables use of KAIO which should avoid initializing its metrics.

Conceptually, this knob should be a FLOW_KNOB, not a SERVER_KNOB, but I’m actually surprised that the server knobs object is not initialized, as its definition is also in the fdbclient code. Assuming the server knobs init is the issue, moving the knob to flow/include/flow/Knobs.h and flow/Knobs.cpp should fix it.
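To make the suspected failure mode concrete, here is a minimal standalone sketch. It is not the FDB source: the struct names, the exception type, and the 5.0 value are simplified, hypothetical stand-ins. It assumes, as the trace suggests, that the client-side knob collection raises an internal error when asked for server knobs, and it shows why reading the interval from the flow knobs would work in both kinds of process.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical, simplified stand-ins for the real knob classes.
struct FlowKnobs {
    double DISK_METRIC_LOGGING_INTERVAL = 5.0; // illustrative value only
};
struct ServerKnobs {
    double DISK_METRIC_LOGGING_INTERVAL = 5.0; // illustrative value only
};

struct KnobCollection {
    virtual ~KnobCollection() = default;
    virtual const FlowKnobs& getFlowKnobs() const = 0;
    virtual const ServerKnobs& getServerKnobs() const = 0;
};

// Client processes (fdbbackup in this thread): flow knobs exist, server knobs do not.
struct ClientKnobCollection : KnobCollection {
    FlowKnobs flowKnobs;
    const FlowKnobs& getFlowKnobs() const override { return flowKnobs; }
    const ServerKnobs& getServerKnobs() const override {
        // Assumption: stands in for the "Internal Error" seen in the trace.
        throw std::logic_error("internal error: no server knobs in a client process");
    }
};

// Server processes (fdbserver): both knob sets are available.
struct ServerKnobCollection : KnobCollection {
    FlowKnobs flowKnobs;
    ServerKnobs serverKnobs;
    const FlowKnobs& getFlowKnobs() const override { return flowKnobs; }
    const ServerKnobs& getServerKnobs() const override { return serverKnobs; }
};

// Current behaviour as described above: shared file code reads a server knob.
double intervalFromServerKnobs(const KnobCollection& knobs) {
    return knobs.getServerKnobs().DISK_METRIC_LOGGING_INTERVAL;
}

// Proposed behaviour: the same value lives in the flow knobs instead.
double intervalFromFlowKnobs(const KnobCollection& knobs) {
    return knobs.getFlowKnobs().DISK_METRIC_LOGGING_INTERVAL;
}

// Returns true if reading the interval through the server knobs throws.
bool serverKnobReadFails(const KnobCollection& knobs) {
    try {
        (void)intervalFromServerKnobs(knobs);
        return false;
    } catch (const std::logic_error&) {
        return true;
    }
}
```

In this sketch, serverKnobReadFails returns true for a ClientKnobCollection and false for a ServerKnobCollection, while intervalFromFlowKnobs succeeds for both, which is the argument for treating the knob as a flow knob.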

I wouldn’t expect this issue to be specific to encrypted backup files, but rather to just writing to a file:// destination with backup. I don’t think that is done often outside of simulation tests, which run from fdbserver, so you’re the first person to find/report it.

Hello!

Thanks for the feedback, sorry for the delay - we’ve had a public holiday in the UK.

TL;DR: Adding --knob_disable_posix_kernel_aio 1 allows the fdbbackup command to execute.

We took the opportunity to upgrade to 7.1.31. First we replicated the issue:

[root@ip-10-1-245-241 ~]# fdbbackup start -d blobstore://REDACTED@s3.eu-west-1.amazonaws.com:443/REDACTED?bucket=REDACTED -s 604800 -z --encryption-key-file /mnt/fdb/backup-encryption-key.dat --log --logdir .
bash: fdbbackup: command not found
[root@ip-10-1-245-241 ~]# /opt/foundationdb/current/bin/fdbbackup start -d blobstore://REDACTED@s3.eu-west-1.amazonaws.com:443/REDACTED?bucket=REDACTED -s 604800 -z --encryption-key-file /mnt/fdb/backup-encryption-key.dat --log --logdir .
Internal Error @ /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbclient/ClientKnobCollection.h 45:
  addr2line -e fdbbackup.debug -p -C -f -i 0xe5f47d 0xd8c585 0xe1b6b4 0xe1befb 0xe0539a 0xe48af7 0xe3fdee 0xe056f4 0x62a78e 0x610684 0x64f7d1 0x600680 0x5666ad 0x550cf1 0x7fa05be6c13a
ERROR: Could not create backup container: An internal error occurred
ERROR: An error was encountered during submission
Fatal Error: Backup error

Then we tested with the knob to disable AIO:

[root@ip-10-1-245-241 ~]# /opt/foundationdb/current/bin/fdbbackup start -d blobstore://REDACTED@s3.eu-west-1.amazonaws.com:443/REDACTED?bucket=REDACTED -s 604800 -z --encryption-key-file /mnt/fdb/backup-encryption-key.dat --log --logdir . --knob_disable_posix_kernel_aio 1
The backup on tag `default' was successfully submitted.

Seems like the server knob may be the issue? I assume the command line argument only affects fdbbackup, and that backup_agent and fdbserver will keep using AIO for actually performing the backup?

Yes, the knob is just changing the file access mode for the fdbbackup command and any file operations it does on backup data. In the case of start it is just creating the output folder.