When consistencycheck is not running and I delete a data directory, the directory is rebuilt automatically. When consistencycheck is running and I delete the directory, it just reports that the checked storage is Storage_Unavailable.
When consistencycheck is running, I use vim to edit the *.sqlite file and delete most of its contents, and the file is not repaired.
So, what can consistencycheck actually do? Does it just report a failure when it detects an error? Will it take any action to repair the data?
If we run multiple consistencycheck instances, does it support parallelism by default? We see in the source code that `distributed` defaults to true. Can it be set manually as a parameter?
Can anyone help me? Thank you!
I’ll try to explain what consistency check does:
- Consistency check works by iterating through the whole database, shard by shard, and comparing all replicas for a shard’s data. If there is a mismatch, it reports an error in the log.
- Consistency check doesn’t try to repair the data, because it doesn’t know which copy of the inconsistent data is the correct one.
- Consistency check for now only supports a single instance. You can’t use multiple instances to speed up the checks. There was some discussion about making it parallelized, but no concrete plan yet.
- You can run consistency check as an external process, e.g. `fdbserver -r consistencycheck -C cluster_file`, which is documented here
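To make the check-but-don't-repair behavior above concrete, here is a toy Python sketch of the idea (not the actual fdbserver implementation): iterate shard by shard, compare every replica's copy of the shard's data, and only report mismatches.

```python
# Toy illustration of the consistency-check idea, NOT FDB internals:
# each shard is represented as a list of replica dicts (key -> value).
def check_consistency(shards):
    """Return (shard_id, replica_index) pairs where a replica mismatches."""
    errors = []
    for shard_id, replicas in enumerate(shards):
        reference = replicas[0]
        for i, replica in enumerate(replicas[1:], start=1):
            if replica != reference:
                # Report only; nothing is repaired, because we cannot
                # tell which replica holds the correct data.
                errors.append((shard_id, i))
    return errors

# Two shards with triple replication; shard 1 has a divergent replica.
shards = [
    [{"a": 1}, {"a": 1}, {"a": 1}],
    [{"b": 2}, {"b": 2}, {"b": 99}],
]
print(check_consistency(shards))  # → [(1, 2)]
```

The real check compares actual replica data range by range, but the essential point is the same: detection is possible without knowing which copy is correct, repair is not.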
Hi, @jzhou !
If consistency checks detect errors, how should we fix them? What are the potential risks to the cluster if the consistency check process is not started?
Thanks!
If consistency check detects an error, meaning replicas have different values, then the best we can do is restore from a backup that is "known" to be good. However, it's hard to know when the corruption happened and when the values diverged, so it's difficult to tell which version is safe to restore to. We do have a tool, fdbdecode,
to analyze backup mutation logs, so if you find recent mutations that modified the data, they may help you determine the time of corruption. However, if the corruption happened a while back and there are no mutations to the corrupted key, then we don't know when it occurred.
For test clusters, the simple "fix" is to overwrite the corrupted key with a new value, so that all replicas are the same again.
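The reason overwriting works can be shown with a toy sketch (plain dicts standing in for storage replicas, not FDB code): a fresh write goes through the normal commit path and is applied to every replica, so the replicas converge on the new value.

```python
# Toy sketch: dicts stand in for storage replicas. One replica holds
# a corrupted value for key "k".
replicas = [{"k": b"good"}, {"k": b"good"}, {"k": b"corrupt"}]

def commit_write(replicas, key, value):
    # A normal write is applied to all replicas by the commit path.
    for r in replicas:
        r[key] = value

assert len({r["k"] for r in replicas}) > 1       # replicas disagree
commit_write(replicas, "k", b"new-known-good")   # overwrite the key
assert all(r["k"] == b"new-known-good" for r in replicas)  # consistent again
```

Of course this only restores consistency, not the lost value; that is why restoring from a known-good backup is the real answer when the correct value matters.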
If you don’t run consistency check, then you won’t know whether there are latent corruptions. The good news is that we have never seen consistency check errors on production clusters, i.e., no corruption so far!
Thank you. Next, I plan to study FDB backup.