FoundationDB

Aborting a DR Backup when destination is unreachable


#1

My team is running a FoundationDB cluster, version 5.1.7. A continuous DR backup was kicked off a while ago, and it appears to have been aborted without using the --cleanup flag. The destination cluster no longer exists, and I don’t have its address. Now the \xff\x02/blog/ prefix is filling up with backup logs that are not being consumed. We need to clear them out and stop them from accumulating, since they are causing our cluster to grow much faster than we scaled it for.

Is there a way to determine what the destination cluster is? I’ve looked under \xff\x02/backupstatus/dr_backup/json, but I don’t know whether the destination can be figured out from the value there.
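
For reference, one way to inspect what’s under that prefix is a ranged read from fdbcli (the limit of 10 here is arbitrary):

getrange \xff\x02/backupstatus/dr_backup/json \xff\x02/backupstatus/dr_backup/json\xff 10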

How can I abort the DR backup? My understanding is that fdbdr commands only work when there’s a dr_agent running. When I run them without passing in a valid, running destination cluster, they hang. When I tried recreating the destination cluster with a fresh, empty database, the fdbdr commands returned but gave me an error:

ERROR: A DR was not running on tag `default'
Fatal Error: Backup unneeded request

Our Plan B is to dump all the data from our cluster into a file, destroy and recreate the database, and load the data back in. But I’m wondering if there’s a better, supported way to clean up the \xff\x02/blog/ data.

Anticipating the obvious question – we plan to upgrade to version 6, but we’d like to get a handle on our data bloat problem first.


(Steve Atherton) #2

The configuration for which mutations are copied to \xff\x02/blog/ is in \xff/logRanges/. So to get out of this situation, first abort any other DR tags you have active (I’m guessing none), then abort (or let finish) any backup tags you have active, and then run the following from fdbcli:

writemode on
clearrange \xff/logRanges/ \xff/logRanges/\xff
clearrange \xff\x02/blog/ \xff\x02/blog/\xff

The order here is important, as the first clear will stop writes to the blog prefix and the second will clear the data that has accumulated.
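
If you want to sanity-check, both ranges can be read back from fdbcli before and after the clears; with no DR or backup tags active, they should come back empty afterwards (the limit of 10 is arbitrary):

getrange \xff/logRanges/ \xff/logRanges/\xff 10
getrange \xff\x02/blog/ \xff\x02/blog/\xff 10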

As for how you got into this situation, I’m not sure. What was the original deployment version of this cluster, and what version was it running when you started the DR operation?


(Steve Atherton) #3

What is the output of fdbdr status at this point, and what is in your status json output under layers.dr_backup.tags?


(Steve Atherton) #4

Actually, disregard my questions; this is the expected behavior when you do a DR abort without --cleanup and then lose the secondary forever. The problem is that the UID of the mutation stream for the DR (or backup) tag you want to abort lives in the secondary database, which is gone, so the cleanup step has no way to know which stream to cancel and clear on the primary.

The easiest way out is the instructions I posted above.


#5

Thank you for the thorough explanation and the fdbcli commands. They worked as advertised. I tested by writing a new key, and nothing appeared under \xff\x02/blog/.
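
In case it helps anyone else, the check was roughly this from fdbcli (temp_verify is just a throwaway key name):

writemode on
set temp_verify some_value
getrange \xff\x02/blog/ \xff\x02/blog/\xff 10
clear temp_verify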

Would you also recommend clearing out other metadata such as \xff\x02/status/json/dr_backup and \xff\x02/backupstatus/dr_backup/json (and any others I’m not aware of)?


(Steve Atherton) #6

You can clear \xff\x02/backupstatus/dr_backup/json if you want (it’s not large), but as long as you are running at least one dr_agent, regardless of whether or not DR is active, the contents of that prefix are continuously updated and trimmed to contain recent status reports for only the active agent processes.
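
If you do want to clear it, assuming the same prefix boundaries as the reads above, something like this from fdbcli should be harmless, since any running agents will simply repopulate it:

writemode on
clearrange \xff\x02/backupstatus/dr_backup/json \xff\x02/backupstatus/dr_backup/json\xff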


(Alex Miller) #7

I filed #1111 to suggest that we create an equivalent fdbcli (or other) command, so that users don’t have to issue manual clearrange operations.