Increased system keyspace size after backups finish

We are running weekly backups of our FDB clusters, and in our largest cluster the system keyspace [\xFF, \xFF\xFF) seems to be growing week by week. It grows during the weekly backup, and when the backup completes it shrinks again, but not back to the level it was at before the backup started.

Looking at the distribution of data in the system keyspace, we see that most of it is under the \xff\x02/backup-agent/uid->config/ prefix, and most of the keys have the form \xff\x02/backup-agent/uid->config/[uid]snapshotRangeFileMap..., where [uid] appears to be a 16-byte UID.
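The key layout described above can be sketched in Python. This is only an illustration of the observed structure: the prefix is taken from the post, while the sample UID bytes and the `extract_uid` helper are invented for the example.

```python
# Sketch of the observed key layout under the backup-agent config subspace.
# The prefix matches what we see in the cluster; the sample UID is made up.
PREFIX = b"\xff\x02/backup-agent/uid->config/"

def extract_uid(key: bytes) -> bytes:
    """Return the 16-byte backup UID embedded in a uid->config key."""
    assert key.startswith(PREFIX)
    return key[len(PREFIX):len(PREFIX) + 16]

sample_uid = bytes.fromhex("00112233445566778899aabbccddeeff")
sample_key = PREFIX + sample_uid + b"snapshotRangeFileMap..."
print(extract_uid(sample_key).hex())
```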

Is this expected, and is there any way to clean it up?

Since writing this post, I have discovered that the uid is indeed the UID of a backup, and that the corresponding tag can be found by looking up the entries under the \xff\x02/backup-agent/tag->uid/ prefix.
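The reverse lookup can be sketched as follows. This is a hypothetical illustration only: it simulates the tag->uid entries with an in-memory dict and assumes the value stored under each tag key is the raw UID bytes, which may not match the actual on-disk encoding.

```python
# Hypothetical sketch: find the tag for a given backup UID by scanning the
# tag->uid entries (simulated here as a dict of key bytes -> value bytes).
TAG_PREFIX = b"\xff\x02/backup-agent/tag->uid/"

def tag_for_uid(tag_to_uid: dict, uid: bytes):
    """Return the tag whose tag->uid entry maps to `uid`, or None."""
    for key, value in tag_to_uid.items():
        if key.startswith(TAG_PREFIX) and value == uid:
            return key[len(TAG_PREFIX):]
    return None
```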

The files for many of the tags have already been deleted, and even if they hadn't been, I don't see why we need to keep several GB of snapshotRangeFileMap entries. The backups must be restorable without this data, since we need to be able to restore from the backup files alone.

For a given backup tag, is it safe to delete the entries under \xff\x02/backup-agent/tag->uid/ and \xff\x02/backup-agent/uid->config/, or is there some other invariant that needs to be maintained?

If you run fdbbackup status, does it show anything? You could also run fdbcli status details and see whether it reports any running backup or DR tags.

The fdbbackup tool has a cleanup command that can show and help remove some stale backup data. I would try that first rather than deleting the keys manually. There can be some trickiness involved in cleaning up the backup state directly, though it is probably easier if you have no backups or DRs running at all.

Typical output from status json (from .cluster.layers.backup.tags.<tag>):

        "current_container": "<blob store address>",
        "current_status": "has been completed",
        "last_restorable_seconds_behind": 4656009.350732,
        "last_restorable_version": 4199961082790,
        "mutation_log_bytes_written": 10418747,
        "mutation_stream_id": "<id>",
        "range_bytes_written": 28418394700,
        "running_backup": false,
        "running_backup_is_restorable": false

From status details:

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

From fdbbackup status I get:
No previous backups found.

fdbbackup cleanup does not seem to do anything; it produces no output.

Ok, I see. That doesn’t sound like the expected behavior, but I don’t know the inner workings of backup well enough to be able to offer an explanation. @SteavedHams do you have any insights?

Yes, once a backup is complete (completed or aborted) it is safe to delete these things. Also,

This subspace, specifically, is used for tracking progress of the current active snapshot being written during a backup. It is not used by restore.

The backup configuration for a UID was initially left behind on purpose because it was small and might be useful later if there is no external record of backup actions. Unfortunately, as you've discovered, it is no longer small, largely due to the snapshot range file map, which was added some time later during backup's development. At the very least, this section of the config should be cleared after a snapshot is completely flushed to the backup destination.
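As a sketch of what such a cleanup would clear, the begin/end keys covering one backup's snapshotRangeFileMap entries can be computed with FDB's usual prefix-range convention. This is pure Python for illustration: the key layout follows the posts above, and the `strinc` helper mirrors the behavior of the one in the FDB bindings.

```python
# Sketch: compute the clear range covering the snapshotRangeFileMap entries
# for one backup UID, using FDB's prefix-range ("strinc") convention.
CONFIG_PREFIX = b"\xff\x02/backup-agent/uid->config/"

def strinc(key: bytes) -> bytes:
    """First key ordered after every key with prefix `key` (as in the fdb bindings)."""
    key = key.rstrip(b"\xff")
    if not key:
        raise ValueError("no key orders after a pure \\xff prefix")
    return key[:-1] + bytes([key[-1] + 1])

def snapshot_map_range(uid: bytes):
    """Return the (begin, end) keys covering uid's snapshotRangeFileMap entries."""
    prefix = CONFIG_PREFIX + uid + b"snapshotRangeFileMap"
    return prefix, strinc(prefix)
```

A clear of [begin, end) over this range would drop the bulky map while leaving the rest of the (small) per-UID configuration intact.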

Thanks for finding this, I’ll try to get this into FDB 7.2. It’s a very small fix.


Thank you for your detailed reply!

I will go ahead with manually deleting these ranges for now, and look forward to a fix when we upgrade to 7.2 (we are currently on 6.3).
