It seems like void rollback( StorageServer* data, Version rollbackVersion, Version nextVersion ) in storageserver.actor.cpp calls please_reboot(), which is pretty disruptive to memory storage servers (depending on the storage_memory size of the node, it could take minutes to come back). The question is: what triggers it?
It seems like it is called in applyPrivateData() in storageserver.actor.cpp.
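For context, the body of rollback() itself is tiny. Paraphrasing from memory (this is not the exact source, which varies between releases), it essentially just does this:

```cpp
// Sketch of rollback() in storageserver.actor.cpp (paraphrased, not the exact source).
// Rather than unwinding the in-memory MVCC history, the storage server restarts itself:
// it reloads its durable state from disk and then replays forward from the tlogs.
// That restart is what makes rollbacks so expensive for the memory engine.
void rollback( StorageServer* data, Version rollbackVersion, Version nextVersion ) {
	throw please_reboot();
}
```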
Our storage servers pull data from the log system before we are sure all the transaction logs have gotten (and fsync'ed) that data. We do this to help prevent future_version errors, by giving the storage servers as much time as possible to process new versions before clients can read those versions.
If a storage server pulls data from the logs, and then a failure causes us to choose a recovery version lower than a version the storage server has already processed, we need to discard those commits. The data we know is durable on disk is generally around 5 seconds old, so we roll back to that version and re-fetch everything up to the recovery version.
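As a made-up example of the arithmetic involved (purely illustrative, not FoundationDB code):

```cpp
#include <cstdint>
#include <iostream>

using Version = int64_t;

int main() {
	// Illustrative numbers only (versions advance at roughly 1,000,000 per second).
	Version durableVersion  = 100000000;  // last version the storage server has made durable (~5s behind)
	Version inMemoryVersion = 105000000;  // latest version it has pulled from the tlogs into memory
	Version recoveryVersion = 103000000;  // version chosen by the recovery

	if (recoveryVersion < inMemoryVersion) {
		// Everything newer than durableVersion is thrown away when the role restarts;
		// (durableVersion, recoveryVersion] then has to be re-fetched from the new log
		// system.  For the memory engine the restart itself is the expensive part, since
		// the entire durable state is re-read from the log/snapshot files.
		std::cout << "in-memory versions discarded: " << (inMemoryVersion - durableVersion) << "\n"
		          << "versions to re-fetch:         " << (recoveryVersion - durableVersion) << "\n";
	}
	return 0;
}
```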
In general, the situations which cause rollbacks are very rare. All of the scenarios require different logs to end a generation at different versions, and then one of the logs with a lower version than the others has to be alive during recovery. In practice this does not happen often.
While a rollback is not a big problem for the ssd storage engine, it is definitely expensive for the memory storage engine. It would be difficult to completely eliminate the need for rollbacks, but one simple thing would be to choose as large a version as possible during recovery. Currently the recovery logic chooses the smallest version possible as the recovery version, because that minimizes the chance that we have to restart recovery.
To change it so that it picks the biggest version, change the line that selects the recovery end version to:
Version end = results[ std::min(safe_range_end, (int)(results.size())) -1 ];
This is completely untested, so obviously more thought and testing would be needed, but having a knob which changes the behavior here seems reasonable.
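To make the trade-off concrete, here is a standalone toy example (none of these names or numbers are FoundationDB code, and the real logic works over the safe range implied by the replication policy and anti-quorum rather than a bare min/max):

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

using Version = int64_t;

int main() {
	// End versions reported by the tlogs of the old generation (invented numbers).
	std::vector<Version> tlogEnds = { 103000000, 103000400, 103000900 };

	// A storage server may already have pulled and applied versions past the smallest end.
	Version storageServerVersion = 103000700;

	Version smallest = *std::min_element(tlogEnds.begin(), tlogEnds.end());
	Version largest  = *std::max_element(tlogEnds.begin(), tlogEnds.end());

	// Recovering at the smallest end version depends on the fewest logs surviving, but
	// forces this storage server to roll back; recovering at the largest end version
	// avoids the rollback, but only works while the logs that actually hold the data up
	// to that version stay available (otherwise recovery has to be restarted).
	std::cout << std::boolalpha
	          << "rollback if we recover at " << smallest << "? " << (storageServerVersion > smallest) << "\n"
	          << "rollback if we recover at " << largest  << "? " << (storageServerVersion > largest)  << "\n";
	return 0;
}
```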
Thanks @Evan, we are currently still on the 3.x code, so I am mainly looking at a particular failure scenario where adding servers (or role changes) triggers a cluster-wide rollback, and all of the KeyValueMemoryStorage data is discarded and has to be re-read from disk. I found your commit in the 5.x branch.
However, the line if ( rollbackVersion < fromVersion && rollbackVersion > restoredVersion) { seems to only control whether a trace event is printed, not whether the rollback actually happens.
Now, this might not really help our situation, but we find that rollbacks are not that rare: just adding a node to the cluster can cause one, and we then experience 20+ minutes of recovery time (based on our storage_memory configuration, of course). I am currently in the process of patching the 3.x codebase to see if the change would help, but I realized that the change in 5.x wouldn't help either (again, it only skips the TraceEvent, not the rollback itself).
It seems to me that the commit was only intended to prevent a rollback when the original restored version is larger than the rollback version (which seems to be a rare case to me). In any case, I did a PR (https://github.com/apple/foundationdb/pull/385/commits) to address the fact that the original commit only changed the trace event and not the actual rollback.
I vaguely remember attempting to make the change you suggested and running into some issue. I will need to run a bunch of simulation tests on the change to see if they find anything.
My advice should still work on the 3.X codebase to reduce the number of rollbacks. You should be able to find a very similar line of code somewhere in the tag partitioned log system. You should give it a try if you have a reliable test that reproduces the scenario.
A two-phase log stopping process could also be employed; it would add a little latency to recovery, but it would make it much more likely that all the tlogs end a generation at the same version.
I agree the original change seems fishy. The result seems to be that in some cases we do a rollback without logging it to the trace file, which doesn't sound like what you'd want. I'm not actually sure why, in the original version, those two if blocks weren't unified, with the one line between them moved before them. That doesn't appear to me to change the behavior, and it would prevent edits like this one that decouple the trace event from the actual rollback.
I’m not familiar with what issue Evan had run into applying the new condition to the second if block, but maybe the reason for placing it around the first block was to guard the ASSERT? If that’s the case, though, we should probably just move the condition into the ASSERT argument. Evan’s out for a couple weeks now, though, so we may want to wait until he returns so we can better understand what the intent was.
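For what it's worth, the unified shape being described might look roughly like this (a sketch from memory, not the actual source; the detail names and the exact ASSERT may differ):

```cpp
// Sketch only: the version-range check guards the trace event, the ASSERT, and the
// rollback together, so they can no longer drift apart in future edits.
if ( rollbackVersion < fromVersion && rollbackVersion > restoredVersion ) {
	TraceEvent(SevWarn, "Rollback", data->thisServerID)
	    .detail("FromVersion", fromVersion)
	    .detail("ToVersion", rollbackVersion);
	ASSERT( rollbackVersion >= data->storageVersion() );
	rollback( data, rollbackVersion, currentVersion );
}
```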
Sorry to resurrect an old thread, but is there some way we can work around rollbacks, especially with the memory engine? The code implies that it's possible to do the rollback without rebooting the storage server, but that the logic is cumbersome. Is there an old version of the code that has that missing piece that we could test with?