I was wondering if it would be possible to create and restore backups using snapshots of the data disks. I checked out the backup tool first, but a backup to a local file seems to take a while (when I tested it, it took 2.5 hours for 250 GiB), while snapshots are created quite fast. However, it doesn't seem easy to restore the database from these snapshots. I can mount the disks on new instances, but then the following error is given:
The coordinator(s) have no record of this database. Either the coordinator
addresses are incorrect, the coordination state on those machines is missing, or
no database has been created.
Creating a new database, e.g. using 'configure new double ssd', results in all the data being deleted. Omitting 'new' doesn't work either, as the database is unavailable.
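Roughly what I tried, shown via fdbcli --exec for clarity (the cluster file path is just the default on my instances):

```sh
# Recreating the database wipes everything on the mounted data disks
fdbcli -C /etc/foundationdb/fdb.cluster --exec 'configure new double ssd'

# Without 'new' the command fails because the database is unavailable
fdbcli -C /etc/foundationdb/fdb.cluster --exec 'configure double ssd'
```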
Is it somehow possible to restore the coordination state on the machines? Do I need to take a backup of non-data folders as well?
While disk-snapshot-based backup and restore are certainly going to be faster, you can get more FDB backup bandwidth by running more backup_agent processes. Backup speed is mainly limited by the cluster's read bandwidth. Restore is much slower, however, as it is limited by the cluster's write bandwidth.
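For example, on a stock Linux install you can add more [backup_agent.&lt;N&gt;] sections to foundationdb.conf so that fdbmonitor runs several agents per machine (the binary path below is the package default; adjust for your install):

```ini
## foundationdb.conf -- run four backup agents on this machine instead of one
[backup_agent]
command = /usr/lib/foundationdb/backup_agent/backup_agent

[backup_agent.1]
[backup_agent.2]
[backup_agent.3]
[backup_agent.4]
```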
@bjorndv It would be good to know your use case: is it for one-off dev & test, or do you plan to use it for production purposes? Is it for cloud or on-prem?
As for the backup itself, you need to back up the disk images associated with the TLog, storage, and coordinator processes in a consistent manner. You also need foundationdb.conf and fdb.cluster backed up and restored. The video explains how we get the consistent disk image(s) across the cluster.
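As a sketch (paths are the Linux package defaults; the data volumes themselves would be captured with whatever snapshot tooling your platform provides):

```sh
# On each machine, keep the non-data files together with the disk snapshots
cp /etc/foundationdb/foundationdb.conf /backup/foundationdb.conf   # process layout
cp /etc/foundationdb/fdb.cluster       /backup/fdb.cluster         # coordinator addresses
```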
Please let me know if you have specific questions, more than happy to help you out.
I'm looking for a backup method to use in production. We're using Google Cloud Platform. It would be great if we could use snapshots, but I think for now it will suffice to use fdbbackup with multiple backup agents as @SteavedHams suggested. Thanks for all the help!
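For reference, a minimal sketch of the kind of setup I have in mind (the destination URL and tag are placeholders; the running backup_agent processes pick up the actual work):

```sh
# Start a backup to a shared directory; the backup_agent processes do the copying
fdbbackup start -C /etc/foundationdb/fdb.cluster -d file:///mnt/backups/fdb -t nightly

# Check progress of that backup tag
fdbbackup status -C /etc/foundationdb/fdb.cluster -t nightly
```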
Because restoring the backup takes a long time, I tried snapshots again. Unfortunately, with GCP Compute Engine snapshots the restored cluster turns out to be unhealthy, since the snapshots aren't taken at exactly the same time.
I was wondering whether it's possible to somehow force-mark the cluster as healthy when it's in an unhealthy state. In the event of a backup restore it wouldn't really matter to us that the restored version isn't entirely consistent, as long as it's restored fast enough.
The status command on the restored version shows:
Data:
Replication health - UNHEALTHY: No replicas remain of some data
Moving data - unknown
Sum of key-value sizes - unknown
Disk space used - 432.704 GB
We're still on 5.2.5; we postponed upgrading because we wanted to see if we could get the backups working first. Restoring a backup via fdbrestore would probably work, but it takes too long for our use case (by the time it was restored, we would probably already have regenerated the data).
@rjenkins
The restore speed you observed seems much lower than it should be.
There are two possible bottlenecks in restore:
a) The destination FDB cluster is limited by its write bandwidth;
b) The restore workers (backup_agent processes) did not write the backup data to the cluster quickly enough.
To understand where the bottleneck is, do you happen to have answers to the following questions?
When you restored the cluster, did your destination FDB cluster reach its maximum write bandwidth? (One quick way to sample the write rate is sketched below.)
Another question:
How large are your snapshot and log (mutation) files, respectively?
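A rough way to sample the cluster-wide write rate during the restore (the field path comes from the status json schema and may differ slightly between versions; jq is just used for extraction):

```sh
# Sample cluster-wide write operations per second while the restore is running
fdbcli -C /etc/foundationdb/fdb.cluster --exec 'status json' \
  | jq '.cluster.workload.operations.writes.hz'
```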
On the CPU utilization graph, pretty much all cores are maxed out on the storage nodes. The dips are during restore, when it pauses pulling blocks while the ApplyVersionLag burns down. The drives are doing ~500K writes/min (assuming I have my CloudWatch metrics resolution right). The i3.xlarge instances support 70K write IOPS.
On the EC2 Network In/Out graph, the larger area series are the transaction (log) nodes doing about 2.5 GB/min; the storage nodes are doing ~1 GB/min, and they have 10 Gb NICs.
The Mbps sent/sec, Mbps received/sec, CPU Seconds, Retransmit Segments, UpTime Seconds, and Processes Current Connections graphs come from scraping the trace.xml files. I've actually written a toolkit for that with a lot more graphs. Very helpful, thanks for the trace logs.
Re: snapshot and log file sizes, a quick sampling shows:
Thank you very much for the detailed information!
It seems to me that the destination FDB cluster is the bottleneck in the restore: specifically, the storage servers are CPU-bound.
If you’d like to speed up your restore, you can increase your cluster size (e.g., increase the number of storage servers). You can keep doing this until the cluster is no longer the bottleneck.
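One sketch of that on the configuration side (the general [fdbserver] section with the command line is assumed to already exist, and the ports are just illustrative): add extra fdbserver processes with the storage class to each machine's foundationdb.conf, for example:

```ini
## foundationdb.conf -- extra fdbserver processes dedicated to the storage role
## (assumes the general [fdbserver] section with the command line already exists)
[fdbserver.4500]
class = storage

[fdbserver.4501]
class = storage

[fdbserver.4502]
class = storage
```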
I also tried restoring a backup of about 250 GB of key-values onto a new cluster of 3 machines with 16 cores each. I restored it from a separate instance (so not from blobstore/Amazon S3) by disabling the backup agents on the cluster and running the agents on the separate instance. While the write rate was around 700-1000 kHz, it still took about 2.5 hours to restore the backup.
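Roughly what that looked like (paths and the backup URL are placeholders; flag names vary between fdbrestore versions, so check fdbrestore --help for yours):

```sh
# On the separate instance: run a backup_agent process pointed at the destination cluster
/usr/lib/foundationdb/backup_agent/backup_agent -C /etc/foundationdb/fdb.cluster &

# Kick off the restore from the backup files stored on this instance
fdbrestore start -C /etc/foundationdb/fdb.cluster -r file:///mnt/backups/fdb
```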