My team is running a FoundationDB cluster, version 5.1.7. A DR continuous backup was kicked off a while ago, and it appears to have been aborted without using the --cleanup flag. The destination cluster no longer exists, and I don’t have the address for it. Now we have the \xff\x02/blog/ prefix filling up with backup logs that are not being consumed. We need to clear them out and stop them from accumulating, since they are causing our cluster to grow much faster than we scaled it for.
Is there a way to determine what the destination cluster is? I’ve looked under \xff\x02/backupstatus/dr_backup/json but don’t know how or if I can figure out the destination from the value there.
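For reference, this is roughly how I’ve been reading those keys from fdbcli (the ACCESS_SYSTEM_KEYS option may not be strictly required for reads, but it doesn’t hurt):

fdb> option on ACCESS_SYSTEM_KEYS
fdb> getrange \xff\x02/backupstatus/ \xff\x02/backupstatus0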
How can I abort the DR backup? My understanding is that the fdbdr commands only work when there’s a dr_agent running. When I try to run them without passing in a valid, running destination cluster, the fdbdr commands hang. When I tried recreating the destination cluster with a fresh, empty database, the fdbdr commands returned, but with an error:
ERROR: A DR was not running on tag `default'
Fatal Error: Backup unneeded request
Our Plan B is to dump all the data from our cluster into a file, destroy and recreate the database, and load the data back in. But I’m wondering if there’s a better, supported way to clean up the \xff\x02/blog/ data.
Anticipating the obvious question – we plan to upgrade to version 6, but we’d like to get a handle on our data bloat problem first.
The configuration for which mutations are copied to \xff\x02/blog/ is in \xff/logRanges/. So to get out of this situation, first abort any other DR tags you have active (I’m guessing none), then abort (or let finish) any backup tags you have active, and then clear two system key ranges from fdbcli.
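Something along these lines should do it (a sketch; double-check the key names against your version before clearing anything, and note that fdbcli needs writemode on and ACCESS_SYSTEM_KEYS to touch \xff keys):

fdb> writemode on
fdb> option on ACCESS_SYSTEM_KEYS
fdb> clearrange \xff/logRanges/ \xff/logRanges0
fdb> clearrange \xff\x02/blog/ \xff\x02/blog0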
The order here is important, as the first clear will stop writes to the blog prefix and the second will clear the data that has accumulated.
As for how you got into this situation, I’m not sure. What was the original deployment version of this cluster, and what version was it running when you started the DR operation?
Actually, disregard my questions; this is the expected behavior when you do a DR abort without --cleanup and then lose the secondary forever. The problem is that the UID of the mutation stream for the DR (or backup) tag you want to abort is stored in the secondary database, which is gone, so without it there is no way for the cleanup step to know which stream to cancel and clear on the primary.
The easiest way out is the instructions I posted above.
Thank you for the thorough explanation and the fdbcli commands. They worked as advertised. I tested writing a new key, and nothing appeared in \xff\x02/blog.
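For anyone else cleaning this up, my check was essentially the following, with a throwaway key (the exact key and value don’t matter):

fdb> writemode on
fdb> set temp_cleanup_check some_value
fdb> option on ACCESS_SYSTEM_KEYS
fdb> getrange \xff\x02/blog/ \xff\x02/blog0

and the getrange came back empty.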
Would you also recommend clearing out other metadata such as \xff\x02/status/json/dr_backup and \xff\x02/backupstatus/dr_backup/json (and any others I’m not aware of)?
You can clear \xff\x02/backupstatus/dr_backup/json if you want (it’s not large), but as long as you are running at least one dr_agent, regardless of whether or not DR is active, the contents of that prefix are continuously updated and trimmed to contain recent status reports for only the active agent processes.
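If you do want to clear it, the same pattern from earlier applies; I’m assuming here that the whole \xff\x02/backupstatus/ prefix is the range in question:

fdb> writemode on
fdb> option on ACCESS_SYSTEM_KEYS
fdb> clearrange \xff\x02/backupstatus/ \xff\x02/backupstatus0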
I’m facing a similar issue on FoundationDB version 6.2.20. The system keyspace started to grow after a DR, with a lot of \xff\x02/blog keys. status json says there is no dr_backup running, and \xff\x02/backupstatus/dr_backup/json doesn’t exist. But I found a key, \xff\x02/backupstatus/dr_backup/json/agent-XXXXXXX, that says otherwise. I checked for dr_agent processes running on the cluster and found nothing.
So I guess clearing the two ranges would be the solution. Do you know the risks of clearing these two ranges? I think \xff\x02/blog is also used for differential backup. Do we need to start a full backup after clearing?
This is the space that DR agents use to become aware of each other and to write status info about themselves to the database. The DR agents also clean up keys written here by dead agents, but once all agents are shut down there will necessarily be a few keys left. The existence of keys here does not mean any agents are currently running, nor does it indicate that any DR jobs exist; it just means that DR agent(s) used to exist.
Yes, this is the solution if you have lost the secondary cluster forever. The only risk is that it will make any active backups non-functional, as they will no longer have a log stream. They also won’t realize this; it will just appear to the backup(s) that there aren’t any mutations on the cluster to write out. Therefore, you should first abort any active backups before clearing these ranges, and then start them again afterwards.
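As a rough sequence (the tag name and backup URL below are placeholders for whatever your backups actually use):

$ fdbbackup abort -t default
(clear \xff/logRanges/ and \xff\x02/blog/ from fdbcli as described earlier in the thread)
$ fdbbackup start -d file:///path/to/backup/dir -t default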