We’re working on an issue related to memory storage engine recovery optimization, where you can selectively abort/skip the recovery process if the cluster is healthy and then let this worker join the cluster as new.
My questions are:
How do we reboot and delete data? I found the code that handles rebootRequest inside worker.cpp, but please_reboot_delete() is only thrown when g_network->isSimulated() == true.
Where is the logic that deals with the please_reboot/please_reboot_delete exception?
The please_reboot/please_reboot_delete exception gets thrown out of workerServer, falls through fdbd, and ends up getting caught within simulatedFDBDRebooter in fdbserver/SimulatedCluster.actor.cpp. Look at the code surrounding the SimulatedFDBDRebootAndDelete trace event.
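For anyone else tracing this, here's a tiny standalone C++ model of that control flow. This is not FDB code: PleaseReboot/PleaseRebootDelete, runWorker, and the supervisor loop are illustrative stand-ins for the real error codes, workerServer/fdbd, and simulatedFDBDRebooter:

```cpp
#include <cstdio>
#include <stdexcept>

// Stand-ins for FDB's please_reboot / please_reboot_delete error codes.
struct PleaseReboot : std::exception {};
struct PleaseRebootDelete : std::exception {};

// Stand-in for workerServer/fdbd: runs until it decides it must restart.
void runWorker(bool clusterHealthy, bool firstBoot) {
    if (firstBoot && clusterHealthy) {
        // Skip recovery: ask the supervisor to wipe our state and restart us.
        throw PleaseRebootDelete{};
    }
    std::printf("worker running normally\n");
}

// Stand-in for simulatedFDBDRebooter: catches the exceptions that fall
// through fdbd and decides whether to keep or delete on-disk state.
int main() {
    bool firstBoot = true;
    for (int attempt = 0; attempt < 3; ++attempt) {
        try {
            runWorker(/*clusterHealthy=*/true, firstBoot);
            break; // clean run, stop rebooting
        } catch (const PleaseRebootDelete&) {
            std::printf("SimulatedFDBDRebootAndDelete: wiping data, rebooting as new\n");
            firstBoot = false; // rejoin with empty disk stores
        } catch (const PleaseReboot&) {
            std::printf("rebooting, keeping on-disk data\n");
        }
    }
    return 0;
}
```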
I’ll confess that figuring this out also took me much longer than I expected…
EDIT: Sorry, I forgot to actually answer your first question.
As for how to reboot and delete, I don’t think you’d actually need to reboot. I think you’d need to change worker.actor.cpp so that when it’s scanning the DiskStores that exist on disk and re-creating the corresponding storage and log instances, if it sees that the cluster is healthy, it creates the storage server and then immediately calls dispose() on it. dispose() then deletes the associated files. You’ll then rejoin the cluster as a worker with nothing recruited, and will likely be immediately recruited by data distribution as a new storage server.
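Here's a minimal sketch of that shape. The close() vs. dispose() distinction on IKeyValueStore (keep files vs. delete files) mirrors the real interface, but the scan loop, isClusterHealthy(), and openStore() below are simplified stand-ins I made up for the actual worker.actor.cpp logic:

```cpp
#include <cstdio>
#include <memory>
#include <string>
#include <vector>

// Trimmed-down stand-in for FDB's IKeyValueStore.
struct IKeyValueStore {
    virtual ~IKeyValueStore() = default;
    virtual void close() = 0;   // shut down, keep the files on disk
    virtual void dispose() = 0; // shut down AND delete the files on disk
};

struct MemoryStore : IKeyValueStore {
    std::string filename;
    explicit MemoryStore(std::string f) : filename(std::move(f)) {}
    void close() override {}
    void dispose() override { std::printf("deleting %s\n", filename.c_str()); }
};

// Stand-ins for the pieces worker.actor.cpp would supply.
bool isClusterHealthy() { return true; } // would come from cluster state
std::unique_ptr<IKeyValueStore> openStore(const std::string& f) {
    return std::make_unique<MemoryStore>(f);
}

int main() {
    // worker.actor.cpp scans the DiskStores it finds on disk at startup...
    std::vector<std::string> diskStores = {"storage-abc.fdq"};
    for (const auto& f : diskStores) {
        auto kvs = openStore(f);
        if (isClusterHealthy()) {
            // Skip recovery entirely: drop the old state and rejoin as a
            // blank worker; data distribution can then recruit us fresh.
            kvs->dispose();
        } else {
            // Otherwise proceed with the normal recovery path (elided).
            kvs->close();
        }
    }
    return 0;
}
```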
This will probably mean moving the recovery code out of the KeyValueStoreMemory constructor and into the init() method that IKeyValueStore offers but that nothing except Redwood actually uses…
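Roughly, the move would look like this (another simplified sketch: the real init() hook on IKeyValueStore returns a Future\<Void\> rather than void, and I've elided all of flow):

```cpp
#include <cstdio>

struct IKeyValueStore {
    virtual ~IKeyValueStore() = default;
    // Default no-op, mirroring the interface hook only Redwood uses today.
    virtual void init() {}
};

struct KeyValueStoreMemory : IKeyValueStore {
    // Before: the constructor kicked off log replay unconditionally, so by
    // the time a caller could decide to dispose(), recovery had already begun.
    KeyValueStoreMemory() = default; // no recovery here anymore
    // After: recovery is deferred to init(), so worker.actor.cpp can open
    // the store, check cluster health, and dispose() without ever recovering.
    void init() override {
        std::printf("replaying the on-disk log to rebuild in-memory state\n");
    }
};

int main() {
    KeyValueStoreMemory kvs;
    // Only pay the recovery cost if we actually intend to use the store.
    kvs.init();
}
```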