Is consistency check only for check? What will it do if it detects an error?

libo-sober · February 24, 2023, 2:57am

When consistencycheck is not running, I delete a data directory and it is automatically rebuilt. When it is running and I delete the directory, it only reports that the storage has been checked (Storage_Unavailable).
While consistencycheck was running, I also used vim to edit a *.sqlite file and deleted most of its contents, and the file was not repaired.
So what can consistencycheck actually do? Does it just return false when it detects an error, or does it take any action to repair the data?
If we run multiple consistencycheck processes, do they work in parallel by default? We see in the source code that the value of distributed is true by default. Can it be set manually as a parameter?
Can anyone help me? Thank you!

jzhou · March 10, 2023, 6:42pm

I’ll try to explain what consistency check does:

Consistency check works by iterating through the whole database, shard by shard, and comparing all replicas for a shard’s data. If there is a mismatch, it reports an error in the log.
Consistency check doesn’t try to repair the data, because it doesn’t know which copy of the inconsistent data is the correct one.
For now, consistency check only supports a single instance. You can’t use multiple instances to speed up the checks. There was some discussion about parallelizing it, but there is no concrete plan yet.
You can run consistency check as an external process, like fdbserver -r consistencycheck -C cluster_file, which is documented here
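
To make the first two points concrete, here is a toy sketch of the "compare all replicas, shard by shard, and report mismatches" idea. It is not FoundationDB code: the in-memory `shards` layout and the `find_mismatches` helper are hypothetical names used only for illustration. Note that it can only say *where* replicas disagree, not which copy is right, which is why no repair is attempted.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory model (an assumption, not FDB internals):
# shard name -> list of replicas, each replica a dict of key -> value.
Replica = Dict[bytes, bytes]


def find_mismatches(shards: Dict[str, List[Replica]]) -> List[Tuple[str, bytes]]:
    """Return (shard, key) pairs where the replicas do not all agree."""
    mismatches = []
    for shard_name, replicas in shards.items():
        # Collect every key seen on any replica of this shard.
        all_keys = set()
        for replica in replicas:
            all_keys.update(replica)
        for key in sorted(all_keys):
            # A key missing on one replica shows up as None and counts as a mismatch.
            values = {replica.get(key) for replica in replicas}
            if len(values) > 1:
                mismatches.append((shard_name, key))
    return mismatches


shards = {
    "shard-0": [{b"a": b"1", b"b": b"2"}, {b"a": b"1", b"b": b"2"}],
    "shard-1": [{b"c": b"3"}, {b"c": b"X"}],  # replicas disagree on key c
}

# Report only; there is no way to tell from the data alone which value is correct.
for shard_name, key in find_mismatches(shards):
    print(f"mismatch in {shard_name} at key {key!r}")
```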

libo-sober · November 28, 2023, 2:22am

Hi, @jzhou !
If consistency checks detect errors, how should we fix them? What are the potential risks to the cluster if the consistency check process is not started?
Thanks!

jzhou · November 28, 2023, 7:47pm

If consistency check detects an error, which means replicas have different values, then the best we can do is to restore from backups that are “known” to be good. However, it’s hard to know when the corruption happened and when the values diverged, so it’s difficult to tell which version is safe to restore to. We do have a tool, fdbdecode, to analyze backup mutation logs, so if you find some recent mutations that modify the data, it might help you determine the time of corruption. However, if the corruption happened a while back and there has been no mutation to the corrupted key since, then there is no way to tell when it happened.

For test clusters, the simple “fix” is to overwrite the corrupted key with a value, then all replicas are the same again.
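
As a toy illustration of why that works (again not FoundationDB code; `ToyShard` and its methods are hypothetical), a normal write is applied to every replica, so the diverged copies converge on the newly written value:

```python
from typing import Dict, List


class ToyShard:
    """Toy model of a replicated shard; an assumption for illustration only."""

    def __init__(self, replicas: List[Dict[bytes, bytes]]):
        self.replicas = replicas

    def write(self, key: bytes, value: bytes) -> None:
        # A regular write reaches every replica of the shard.
        for replica in self.replicas:
            replica[key] = value

    def is_consistent(self, key: bytes) -> bool:
        # Consistent when all replicas hold the same value for the key.
        return len({replica.get(key) for replica in self.replicas}) == 1


shard = ToyShard([{b"k": b"good"}, {b"k": b"corrupt"}])
before = shard.is_consistent(b"k")   # replicas disagree
shard.write(b"k", b"known-value")    # overwrite the corrupted key
after = shard.is_consistent(b"k")    # replicas agree again
print(before, after)
```

This only restores agreement between replicas; whether the written value is the right one is a separate question, which is why it is a "fix" for test clusters only.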

If you don’t run consistency check, then you won’t know whether there are latent corruptions. The good news is that we have never seen consistency check errors on production clusters, i.e., no corruption so far!

libo-sober · November 29, 2023, 7:58am

Thank you. Next, I plan to read up on FDB backup.
