I’ve got a very simple ‘snapshot’ script. It doesn’t actually take a snapshot yet; I’m just trying to get a base in place that does some logging, so I can see how FDB calls it.
So my script exists on every host, and just contains the following:
#!/bin/sh
LOG="/opt/snapshot/logs/unknown.log"
echo "Unparsed args: $@" >> "${LOG}"
exit 0
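Since the goal is to see how FDB invokes the script, a slightly more verbose stub may help. The following is only a sketch: the `SNAP_LOG` override and the `log_invocation` function name are hypothetical additions for illustration, not anything FoundationDB defines. It records the calling user, working directory, parent PID, and each argument on its own line, so quoting and argument splitting are visible.

```shell
#!/bin/sh
# Sketch of a more verbose logging stub.
# SNAP_LOG is a hypothetical override; the default is the path from the post.
LOG="${SNAP_LOG:-/opt/snapshot/logs/unknown.log}"

log_invocation() {
    {
        # When and by which process the script was started
        printf 'time=%s pid=%s ppid=%s\n' \
            "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" "$$" "$PPID"
        # Which user ran it, and from which directory
        printf 'user=%s cwd=%s\n' "$(id -un)" "$(pwd)"
        # One line per argument, so splitting and quoting are unambiguous
        printf 'argc=%s\n' "$#"
        i=0
        for arg in "$@"; do
            i=$((i + 1))
            printf 'arg[%s]=%s\n' "$i" "$arg"
        done
    } >> "$LOG" 2>&1
}

# Like the original 'exit 0': never fail the caller because logging failed
log_invocation "$@" || true
```

If the log stays empty even with this version, that would suggest the script is never executed at all, rather than being executed with unexpected arguments.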
For testing, the file is executable by any user, and the log dir is writable by any user:
$ ls -lah /opt/snapshot/
total 4.0K
drwxr-xr-x. 3 root root 33 Jun 12 15:54 .
drwxr-xr-x. 10 root root 140 Jun 12 11:17 ..
drwxrwxrwx. 2 root root 6 Jun 12 18:44 logs
-rwxr-xr-x. 1 root root 1.7K Jun 12 18:48 snap.sh
It’s referenced in my foundationdb.conf under whitelist-binpath:
[general]
restart_delay = 60
[fdbmonitor]
user = foundationdb
group = foundationdb
[fdbserver.4500]
class = storage
cluster-file = /mnt/fdb/4500/fdb.cluster
command = /opt/foundationdb/current/bin/fdbserver
datadir = /mnt/fdb/4500/data
listen-address = public
...
logdir = /var/log/foundationdb
loggroup = main-dev
public-address = auto:4600:tls
...
trace-format = json
whitelist-binpath = /opt/snapshot/snap.sh
I can run the script manually on the CLI of any given host, and the expected file is written to the logs dir with the expected content.
But when I try to run it through fdbcli, first of all the CLI seems to hang, and secondly I get an error back saying the cluster isn’t healthy so the snapshot can’t complete:
$ fdbcli --exec 'snapshot /opt/snapshot/snap.sh --foo bar'
WARNING: Long delay (Ctrl-C to interrupt)
Snapshot command failed 2506 (Unsupported when the cluster is not fully recovered). Please cleanup any instance level snapshots created with UID 6f08a1e4ab969e6b5f33ef2c80656d4e.
My log file isn’t written, so it’s like my script isn’t being called at all. Also, if I run fdbcli --exec 'status details' immediately after the snapshot call returns an error, I can see that some of the processes are erroring (they’re not reporting metrics). In this example it’s a stateless class node, so it shouldn’t even be trying to run the script, and when I check the process it seems fdbserver crashed and was restarted by fdbmonitor. I’ve also seen storage nodes briefly report lag, but they didn’t crash/restart.
$ fdbcli --exec 'status details'
Using cluster file `/mnt/fdb/4500/fdb.cluster'.
Unable to retrieve all status information.
Configuration:
Redundancy mode - three_data_hall
Storage engine - ssd-redwood-1
Log engine - ssd-2
Encryption at-rest - disabled
Coordinators - 9
Desired Commit Proxies - 3
Desired GRV Proxies - 1
Desired Resolvers - 1
Desired Logs - 3
Usable Regions - 1
...
Data:
Replication health - (Re)initializing automatic data distribution
Moving data - unknown (initializing)
Sum of key-value sizes - unknown
Disk space used - 1.515 TB
Operating space:
Storage server - 214.5 GB free on most full server
Log server - 94.6 GB free on most full server
Workload:
Read rate - 108 Hz
Write rate - 0 Hz
Transactions started - 0 Hz
Transactions committed - 0 Hz
Conflict rate - 0 Hz
Backup and DR:
Running backups - 0
Running DRs - 0
Process performance details:
...
10.1.245.254:4600:tls ( 1% cpu; 2% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 6.5 GB RAM )
10.1.246.11:4600:tls (no metrics available)
10.1.246.11:4601:tls ( 4% cpu; 43% machine; 0.004 Gbps; 0% disk IO; 0.1 GB )
...
Coordination servers:
(all reachable)
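To spot the affected processes quickly in output like the above, a small filter could be used. This is a sketch that assumes the process lines keep the layout shown here (address first, then the parenthesised metrics); `processes_without_metrics` is a hypothetical helper name.

```shell
# Sketch: print the address of every process that reports no metrics.
# Assumes the "status details" line layout shown in the output above.
processes_without_metrics() {
    # The address is the first whitespace-separated field on each process line
    awk '/\(no metrics available\)/ { print $1 }'
}
```

It could be run as `fdbcli --exec 'status details' | processes_without_metrics` right after the snapshot command returns.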
The error in the trace log is:
{ "Severity": "40", "ErrorKind": "BugDetected", "Time": "1749752867.903873", "DateTime": "2025-06-12T18:27:47Z", "Type": "Crash", "ID": "0000000000000000", "Signal": "11", "Name": "Segmentation fault", "Trace": "addr2line -e fdbserver.debug -p -C -f -i 0x7f1e2523ebf0 0x270d7cc 0x1d0a318 0x1d0a23d 0x27122b8 0x27121b9 0x1d66888 0x1d65ee4 0x1c0cd98 0x1c0ccb7 0x534a1cd 0x5349aa3 0x54fe808 0x329ff99 0x7f1e252295d0", "ThreadID": "6332857270817026668", "Backtrace": "addr2line -e fdbserver.debug -p -C -f -i 0x55785ed 0x55788b3 0x5572ab4 0x55402eb 0x7f1e2523ebf0 0x270d7cc 0x1d0a318 0x1d0a23d 0x27122b8 0x27121b9 0x1d66888 0x1d65ee4 0x1c0cd98 0x1c0ccb7 0x534a1cd 0x5349aa3 0x54fe808 0x329ff99 0x7f1e252295d0", "Machine": "10.1.246.11:4600", "LogGroup": "main-dev", "Roles": "CS,DD,MS,RK" }
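The crash event already embeds a ready-made addr2line command in its "Backtrace" field, so it can be pulled out of the JSON trace files for symbolization. The following is a rough sketch using only grep and sed; `extract_backtrace` is a hypothetical helper name, and the patterns assume the field spacing seen in the event above.

```shell
# Sketch: read JSON trace lines on stdin and print the "Backtrace" command
# of every Crash event, one per line.
extract_backtrace() {
    # Keep only crash events, then capture the quoted value after "Backtrace":
    # (the leading quote keeps this from matching the shorter "Trace" field)
    grep '"Type": *"Crash"' |
        sed -n 's/.*"Backtrace": *"\([^"]*\)".*/\1/p'
}
```

For example, `cat /var/log/foundationdb/*.json | extract_backtrace` (the file glob is an assumption about how the trace files are named in the logdir). Actually running the printed command needs an fdbserver.debug symbol file matching the running build, which I’m assuming is available for this version.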