What causes a point-in-time backup to run slowly?

We’re running FDB on Linux VMs, not with the K8S operator. We initiate a nightly backup using fdbbackup to Amazon S3, and the runtime of this backup is unpredictable.

Sometimes the backup takes less than 5 minutes, and other times it takes nearly an hour. In both cases, the database size is similar (+/- 1 GB) and the same backup parameters are used. Can anyone explain what might cause a backup to take so much longer?

We’re using the default --initial-snapshot-interval of 0, but sometimes it takes 30-40 minutes to complete the initial snapshot (which makes the log phase take longer). What could cause this?

Also, sometimes the backup just hangs in the (just started) phase; what causes this?

I’ve tried setting --knob_http_verbose_level=4 to troubleshoot whether there are any issues uploading to S3, but I’m not sure where these “verbose” HTTP logs end up.

The backup snapshot works by recursively dividing the keyspace into many short snapshot tasks, each of which produces a key range snapshot at some read version. These tasks are executed by backup_agent processes. The things that most affect backup speed are:

  • the number of backup_agent processes you are running (see the configuration sketch after this list)
  • the stability of the backup_agent processes
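
To run more agents on a host with the stock .deb layout, each numbered [backup_agent.N] section in foundationdb.conf tells fdbmonitor to launch one more backup_agent process. This is a sketch of what that typically looks like; verify the command path and section names against your own /etc/foundationdb/foundationdb.conf:

    # /etc/foundationdb/foundationdb.conf (excerpt)
    # Each numbered section below makes fdbmonitor launch one backup_agent.
    [backup_agent]
    command = /usr/lib/foundationdb/backup_agent/backup_agent
    logdir = /var/log/foundationdb

    [backup_agent.1]
    [backup_agent.2]
    [backup_agent.3]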

If your backups sometimes take 5 minutes and sometimes 30 minutes without a change in database size, I would guess that you have some kind of backup agent instability. One possibility is that you have one or more backup_agent instances that can’t write to S3 but can still commit to FDB. The effect is that every time such an agent claims a backup subtask in FDB, it tries and fails to complete it, with many retries. Eventually it will give up and the subtask will go back into the scheduling pool to be claimed by another (or possibly the same) agent.
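
If you want to check for that case, a quick sanity test from each agent host is to confirm it can reach the S3 endpoint at all. The hostname below is just a placeholder; use the endpoint and region from your actual blobstore:// backup URL:

    # Prints an HTTP status code (even a 403 for an unsigned request proves
    # reachability); a timeout or connection error means this host cannot
    # talk to S3 and will keep failing its backup subtasks.
    curl -sS -o /dev/null -w '%{http_code}\n' --max-time 10 https://s3.us-east-1.amazonaws.com/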

Another source of instability could be agents crashing with OutOfMemory errors because the tasks they take on require too much memory at once. This causes random delays because tasks that are in progress during a crash have to time out and be reclaimed by other agents before progress continues.
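
One way to confirm or rule that out on a given agent host is to look for kernel OOM-killer activity around the times the backup slowed down, for example:

    # Look for OOM kills of backup_agent in the kernel log.
    dmesg -T | grep -i -E 'out of memory|killed process'
    # Or on a systemd host, searching the last couple of days:
    journalctl -k --since "2 days ago" | grep -i -E 'out of memory|killed process'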

I suggest that you check your backup_agent trace logs for warnings and errors, especially events with a Type starting with “FileBackup” or “TaskBucket” (TaskBucket is the internal task scheduling/distribution framework that Backup is built on).
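
Assuming the trace logs land in the default .deb location (/var/log/foundationdb) and the default XML trace format, a minimal scan for those events looks something like this:

    # Severity 30 = warning, 40 = error in FDB trace events.
    grep -h -E 'Type="(FileBackup|TaskBucket)' /var/log/foundationdb/trace.*.xml \
      | grep -E 'Severity="(30|40)"'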

If a backup never gets past the (just started) state, then you likely do not have any functional backup_agent processes.

The logging from --knob_http_verbose_level is written to stdout by the backup_agent processes.
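
If you want to actually see that output, one option is to stop the fdbmonitor-managed agents and run a single backup_agent in a foreground terminal. The paths below are the usual .deb defaults and may differ on your hosts:

    # Run one backup_agent in the foreground so the verbose HTTP logging
    # from the knob appears directly on the terminal's stdout.
    /usr/lib/foundationdb/backup_agent/backup_agent \
        -C /etc/foundationdb/fdb.cluster \
        --knob_http_verbose_level=4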

Thanks for this detailed answer, Steve.

The answer was in the trace logs as you suggested. We encountered these concerning errors:

<Event
  Severity="30"
  Time="1728146920.821939"
  DateTime="2024-10-05T16:48:40Z"
  Type="UnableToWriteStatus"
  ID="0000000000000000"
  Error="platform_error"
  ErrorDescription="Platform error"
  ErrorCode="1500"
  ThreadID="3803461471724407639"
  Machine="REDACTED"
  LogGroup="default"
  ClientDescription="primary-7.2.0-3803461471724407639"
/>
<Event
  Severity="40"
  ErrorKind="Unset"
  Time="1728147140.541484"
  DateTime="2024-10-05T16:52:20Z"
  Type="GetMemoryUsage"
  ID="0000000000000000"
  UnixErrorCode="18"
  UnixError="Too many open files"
  ThreadID="3803461471724407639"
  Backtrace="addr2line -e backup_agent.debug -p -C -f -i 0x11fb51c 0x11fa160 0x11fa54e 0x11bedf9 0x5e6585 0x5e74de 0xd4b738 0x7fff68 0xb328b0 0xb330f3 0xb2d4c0 0xb35228 0xb3c09e 0xb34318 0xb34418 0xae9648 0x6706e8 0x1004577 0x1004c3b 0x68aa60 0x118d752 0xaa4d22 0x59dfd8 0x7f4cbc885083"
  Machine="REDACTED"
  LogGroup="default"
  ClientDescription="primary-7.2.0-3803461471724407639"
/>

It turns out that the running backup_agent processes had a soft nofile limit of 1024 open files, and they were hitting that limit between regular file descriptors and network sockets. Your suggestion about the stability of the backup_agent processes was right!
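
For anyone else who needs to check for this, the effective limits and current descriptor count of a running agent can be read straight from /proc (run as root or the process owner; pgrep -o just picks the oldest matching agent on the host):

    # Show the soft/hard open-file limits of a running backup_agent.
    cat /proc/$(pgrep -o backup_agent)/limits | grep 'open files'
    # Count how many descriptors it currently has open.
    ls /proc/$(pgrep -o backup_agent)/fd | wc -l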

We encountered this because we’re running FoundationDB on Ubuntu 20.04, installed through the .deb package. Ubuntu runs systemd, but the .deb package installs FoundationDB as a service through a SysV init.d startup script at /etc/init.d/foundationdb.

systemd handles this by converting the script to a systemd service with systemd-sysv-generator(8), which does not include a LimitNOFILE option in the generated service config. Therefore, on an unmodified Ubuntu server install, other users might hit this issue. Figured it was useful to share here.

We fixed this by raising the global soft nofile limit. Our server provisioning tools had code to do this, but it was outdated and we didn’t notice until now since most of our service definitions have a LimitNOFILE option set in the service file.
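
For reference, the fix can also be scoped to just the FoundationDB service with a systemd drop-in instead of raising the global default. This is a sketch (the limit value is illustrative, and drop-ins generally apply to generator-created units too, but verify on your systemd version):

    # /etc/systemd/system/foundationdb.service.d/override.conf
    [Service]
    LimitNOFILE=65536

    # Then reload and restart:
    #   systemctl daemon-reload && systemctl restart foundationdb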

One question: when running these backup_agents under fdbmonitor as normal (and as described in the docs), where is that stdout redirected to? The stdout of fdbmonitor? /dev/null? I wasn’t able to find that output anywhere obvious while I had the knob enabled.

I’m not sure; I haven’t used the verbose output when launching from fdbmonitor. I think fdbmonitor captures the stdout and stderr of its child processes, but I’m not sure what it does with them. Since fdbmonitor does write events to the syslog, I suggest checking there.
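
For example, on an Ubuntu host something like the following should surface fdbmonitor’s syslog messages, and if it relays the children’s stdout/stderr, that output would show up there too:

    # Messages logged with the fdbmonitor syslog identifier.
    journalctl -t fdbmonitor --since "1 hour ago"
    # Or with a traditional syslog file:
    grep fdbmonitor /var/log/syslog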
