How is the snapshot backup supposed to work, and how do I debug why it is failing?

danm · June 12, 2025, 7:05pm

I’ve got the simplest possible ‘snapshot’ script. It doesn’t actually take a snapshot yet; it only does some logging, as a base, so I can see how FDB calls it.

So my script exists on every host, and just contains the following:

#!/bin/sh

LOG="/opt/snapshot/logs/unknown.log"

echo "Unparsed args: $@" >> "${LOG}"

exit 0
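A slightly more verbose variant of the script above can help tell "never called" apart from "called, but could not write the log". This is only a sketch: the `snap-debug.log` file name, the fallback to a temp directory, and the `log_invocation` helper are all hypothetical additions, not anything FDB requires.

```sh
#!/bin/sh
# Hypothetical debugging variant of snap.sh: records who called it and each
# argument on its own line, so quoting and splitting are visible.

LOG_DIR="/opt/snapshot/logs"
# Assumption: if the calling user cannot write to LOG_DIR, fall back to a temp
# directory so that an invocation still leaves a trace somewhere.
[ -d "$LOG_DIR" ] && [ -w "$LOG_DIR" ] || LOG_DIR="${TMPDIR:-/tmp}"
LOG="$LOG_DIR/snap-debug.log"

log_invocation() {
    {
        # One header line per call: UTC time, process ids, user, cwd, arg count.
        printf '%s pid=%s ppid=%s user=%s cwd=%s argc=%s\n' \
            "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$$" "$PPID" "$(id -un)" "$(pwd)" "$#"
        n=0
        for arg in "$@"; do
            n=$((n + 1))
            printf '  arg[%s]=%s\n' "$n" "$arg"
        done
    } >> "$LOG"
}

log_invocation "$@"
```

Unlike the original, this version has no explicit `exit 0`, so its exit status is that of the log write; add `exit 0` back if the script must always report success.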

For testing, the file is executable by any user, and the log dir is writable by any user:

$ ls -lah /opt/snapshot/
total 4.0K
drwxr-xr-x.  3 root root   33 Jun 12 15:54 .
drwxr-xr-x. 10 root root  140 Jun 12 11:17 ..
drwxrwxrwx.  2 root root    6 Jun 12 18:44 logs
-rwxr-xr-x.  1 root root 1.7K Jun 12 18:48 snap.sh

It’s referenced in my foundationdb.conf under whitelist-binpath:

[general]
restart_delay = 60

[fdbmonitor]
user = foundationdb
group = foundationdb

[fdbserver.4500]
class = storage
cluster-file = /mnt/fdb/4500/fdb.cluster
command = /opt/foundationdb/current/bin/fdbserver
datadir = /mnt/fdb/4500/data
listen-address = public
...
logdir = /var/log/foundationdb
loggroup = main-dev
public-address = auto:4600:tls
...
trace-format = json
whitelist-binpath = /opt/snapshot/snap.sh
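To rule out a simple mismatch between the config and the script path, something like the following could be run on each host. It is a sketch under assumptions: the default `CONF` location is a guess (the post does not say where foundationdb.conf lives), the helper names are made up, and it only compares the literal string; how fdbserver itself matches whitelisted paths is not confirmed here.

```sh
#!/bin/sh
# Sketch: list whitelist-binpath values in a foundationdb.conf-style file and
# check that a given script path is among them and is executable.

# Assumptions: both defaults below are illustrative; override via environment.
CONF="${CONF:-/etc/foundationdb/foundationdb.conf}"
SCRIPT="${SCRIPT:-/opt/snapshot/snap.sh}"

# Print the value of every whitelist-binpath line in file $1, one per line,
# with trailing whitespace removed.
whitelist_values() {
    sed -n 's/^[[:space:]]*whitelist-binpath[[:space:]]*=[[:space:]]*//p' "$1" |
        sed 's/[[:space:]]*$//'
}

# Succeed if path $2 appears verbatim as a whole value in file $1.
is_listed() {
    whitelist_values "$1" | grep -Fxq -- "$2"
}

if [ -r "$CONF" ]; then
    if is_listed "$CONF" "$SCRIPT"; then
        echo "listed: $SCRIPT"
    else
        echo "NOT listed: $SCRIPT"
    fi
    # This only tests the current user, not the user fdbserver runs as.
    if [ -x "$SCRIPT" ]; then
        echo "executable by $(id -un)"
    else
        echo "not executable by $(id -un)"
    fi
fi
```

Since fdbmonitor is configured with `user = foundationdb`, running the script as that user (for example `sudo -u foundationdb /opt/snapshot/snap.sh --foo bar`) is a closer test than running it from an interactive shell.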

I can run the script manually on the CLI of any given host, and get the file I expected written to the logs dir with the content I expected.

But when I try to run it through fdbcli, the CLI first seems to hang, and then I get an error back saying the cluster is not fully recovered, so the snapshot can’t complete:

$ fdbcli --exec 'snapshot /opt/snapshot/snap.sh --foo bar'

WARNING: Long delay (Ctrl-C to interrupt)
Snapshot command failed 2506 (Unsupported when the cluster is not fully recovered). Please cleanup any instance level snapshots created with UID 6f08a1e4ab969e6b5f33ef2c80656d4e.

My log file isn’t written, so it looks like my script isn’t being called at all. Also, if I run fdbcli --exec 'status details' immediately after the snapshot call returns the error, I can see that some of the processes are erroring (they’re not reporting metrics). In this example it’s a stateless class node, so it shouldn’t even be trying to run the script, and when I check the process it seems fdbserver crashed and was restarted by fdbmonitor. I’ve also seen storage nodes briefly report lag, but they didn’t crash or restart.

$ fdbcli --exec 'status details'
Using cluster file `/mnt/fdb/4500/fdb.cluster'.

Unable to retrieve all status information.

Configuration:
  Redundancy mode        - three_data_hall
  Storage engine         - ssd-redwood-1
  Log engine             - ssd-2
  Encryption at-rest     - disabled
  Coordinators           - 9
  Desired Commit Proxies - 3
  Desired GRV Proxies    - 1
  Desired Resolvers      - 1
  Desired Logs           - 3
  Usable Regions         - 1
...
Data:
  Replication health     - (Re)initializing automatic data distribution
  Moving data            - unknown (initializing)
  Sum of key-value sizes - unknown
  Disk space used        - 1.515 TB

Operating space:
  Storage server         - 214.5 GB free on most full server
  Log server             - 94.6 GB free on most full server

Workload:
  Read rate              - 108 Hz
  Write rate             - 0 Hz
  Transactions started   - 0 Hz
  Transactions committed - 0 Hz
  Conflict rate          - 0 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Process performance details:
...
  10.1.245.254:4600:tls  (  1% cpu;  2% machine; 0.000 Gbps;  0% disk IO; 0.1 GB / 6.5 GB RAM  )
  10.1.246.11:4600:tls   (no metrics available)
  10.1.246.11:4601:tls   (  4% cpu; 43% machine; 0.004 Gbps;  0% disk IO; 0.1 GB  )
...

Coordination servers:
  (all reachable)

The error in the trace log of the process that stopped reporting metrics (10.1.246.11:4600) is:

{  "Severity": "40", "ErrorKind": "BugDetected", "Time": "1749752867.903873", "DateTime": "2025-06-12T18:27:47Z", "Type": "Crash", "ID": "0000000000000000", "Signal": "11", "Name": "Segmentation fault", "Trace": "addr2line -e fdbserver.debug -p -C -f -i 0x7f1e2523ebf0 0x270d7cc 0x1d0a318 0x1d0a23d 0x27122b8 0x27121b9 0x1d66888 0x1d65ee4 0x1c0cd98 0x1c0ccb7 0x534a1cd 0x5349aa3 0x54fe808 0x329ff99 0x7f1e252295d0", "ThreadID": "6332857270817026668", "Backtrace": "addr2line -e fdbserver.debug -p -C -f -i 0x55785ed 0x55788b3 0x5572ab4 0x55402eb 0x7f1e2523ebf0 0x270d7cc 0x1d0a318 0x1d0a23d 0x27122b8 0x27121b9 0x1d66888 0x1d65ee4 0x1c0cd98 0x1c0ccb7 0x534a1cd 0x5349aa3 0x54fe808 0x329ff99 0x7f1e252295d0", "Machine": "10.1.246.11:4600", "LogGroup": "main-dev", "Roles": "CS,DD,MS,RK" }
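The crash event above already embeds an `addr2line` command in its "Backtrace" field. A small filter like the one below can pull that command out of JSON trace files so it can be run against a matching debug binary. This is a sketch: the helper names are made up, the trace file glob in the comment is an assumption, and the pattern relies on the field layout shown in the event above.

```sh
#!/bin/sh
# Sketch: find crash events in JSON trace files and print the addr2line
# command stored in their "Backtrace" field.

# Print lines whose Type is "Crash" from the files given as arguments.
crash_events() {
    grep -h '"Type":[[:space:]]*"Crash"' "$@"
}

# Read trace lines on stdin; print the content of each Backtrace field.
backtrace_commands() {
    sed -n 's/.*"Backtrace":[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Example (the file name pattern is an assumption about the logdir contents):
# crash_events /var/log/foundationdb/trace.*.json | backtrace_commands
```

The printed command only resolves to function names if `fdbserver.debug` comes from exactly the same build as the crashed fdbserver; where to get that file is not covered in this post.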
