Assuming double, triple, or the upcoming satellite redundancy modes, can FoundationDB recover from any form of data corruption while the cluster remains healthy? FoundationDB is not compatible with zfs or btrfs, so some higher-level means of repairing object-level failures would be required.
Since I see no mention of erasure coding in the codebase, I assume it would have to be built at the layer level. If so, would public-facing access to the internal and/or external (e.g., satellite) tlogs make sense, so that a layer could rebuild corrupted data according to its own requirements, either instead of or in combination with erasure coding/fingerprinting?
Regardless of replication mode, the existence of other full replicas of the same data allows FDB to resolve these issues without intervention:
Repairs, i.e., restoring replication after a host or drive has failed, are handled automatically by FoundationDB, specifically by its data distribution subcomponent. Key ranges affected by the loss are assigned to a new team of servers, and until the data is re-replicated, status will report decreased failure tolerance.
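As a concrete illustration, you can watch that reduced failure tolerance through the status JSON special key. This is a minimal sketch assuming the Python bindings and a reachable cluster; the exact fault-tolerance field names vary across FDB versions, so treat them as assumptions:

```python
import json
import fdb

fdb.api_version(620)  # assumption: adjust to your installed client version
db = fdb.open()       # uses the default cluster file

# \xff\xff/status/json returns the same document that `status json` prints in fdbcli
status = json.loads(bytes(db[b'\xff\xff/status/json']))

# Field names below are an assumption based on recent versions; older releases
# used e.g. max_machine_failures_without_losing_data instead.
ft = status['cluster'].get('fault_tolerance', {})
print('failures tolerated without losing data:',
      ft.get('max_zone_failures_without_losing_data'))
```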
Errors (IO requests failing or transiently returning incorrect data) and bit rot (data being silently corrupted on disk over time without being accessed) are handled similarly. Errors cause an io_error to be thrown, and bit rot is detected when the data is accessed and treated as an IO error as well. The storage server will die, effectively discarding all of its data as potentially corrupt, and the data will be re-replicated from the non-corrupt nodes.
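To make the detection path concrete, here is a minimal sketch (not FDB's actual storage engine code; the page layout and names are purely illustrative) of how per-page checksums turn silent corruption into an explicit IO error at read time:

```python
import zlib

PAGE_SIZE = 4096  # illustrative; FDB's engines use their own page formats

class CorruptPageError(IOError):
    """Raised when a page's stored checksum doesn't match its contents."""

def write_page(f, page_index, payload):
    assert len(payload) == PAGE_SIZE
    checksum = zlib.crc32(payload)
    f.seek(page_index * (PAGE_SIZE + 4))
    f.write(checksum.to_bytes(4, 'little') + payload)

def read_page(f, page_index):
    f.seek(page_index * (PAGE_SIZE + 4))
    stored = int.from_bytes(f.read(4), 'little')
    payload = f.read(PAGE_SIZE)
    # Bit rot is only noticed here, at read time: a page that is never
    # read keeps its corruption hidden, which is what makes cold data risky.
    if zlib.crc32(payload) != stored:
        raise CorruptPageError(f"page {page_index} failed checksum")
    return payload
```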
I’m interested to hear your specific reasons for wanting erasure coding. It would impose either additional storage overhead per node, to provide resilience for data already held elsewhere, or require reads to hit more than one replica, if you’re doing erasure codes across replicas. Neither tradeoff seems appealing to me, but I’ve never actually operated an erasure-coded system to see the other side of this.
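For a sense of the numbers, a back-of-the-envelope sketch (the parameters are chosen only for illustration): triple replication stores 3 bytes per logical byte, while a Reed-Solomon code with k data and m parity fragments stores (k+m)/k, at the cost of degraded reads and repairs touching k fragments:

```python
def replication_overhead(replicas):
    return replicas  # bytes stored per logical byte

def erasure_overhead(k, m):
    # k data fragments plus m parity fragments, any k of which suffice
    return (k + m) / k

print(replication_overhead(3))   # 3.0x for triple replication
print(erasure_overhead(10, 4))   # 1.4x for RS(10, 4), tolerating 4 losses
# The catch: reconstructing a lost fragment requires reading k others,
# and a degraded read touches multiple nodes instead of one replica.
```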
One thing that we don’t (to my knowledge) have out of the box in this area is anything that reads extremely cold data periodically to ensure that bit rot is detected. If you have an FDB database with tons of data that is simply never read or written, and you never have any failures that require it to be moved around, then over a long period of time the likelihood that multiple disks have failed silently under all the replicas of some piece of data might slowly climb. I’m not sure that the backup process reliably solves this, because, depending on your setup, at a low write rate you might be doing just incremental backups for a long time, and in any case backup only needs to read data from one replica. Running a consistency check every once in a while would do it, but I’m not sure that’s even documented.
One other caveat with regard to bad disk behavior is that disks that lie about durability (e.g., acknowledge writes they never commit) can conceivably result in durability failures: checksums can’t detect a write that was simply never persisted, because the stale data still on disk remains internally consistent.
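A small sketch of why checksums don’t help here (illustrative, not FDB code): if the disk drops an acknowledged write, the old page, with its old but valid checksum, is what comes back on the next read, so verification passes even though durability was violated:

```python
import zlib

def page_with_checksum(payload: bytes) -> bytes:
    return zlib.crc32(payload).to_bytes(4, 'little') + payload

def verify(page: bytes) -> bytes:
    stored = int.from_bytes(page[:4], 'little')
    assert zlib.crc32(page[4:]) == stored, "corruption detected"
    return page[4:]

disk = page_with_checksum(b'old value')  # the durable state on the platter
# The disk acknowledges a write of b'new value' but never persists it.
disk_after_lost_write = disk

# The reader sees stale data, but the page is internally consistent, so no
# corruption is detected; only a cross-replica comparison could catch this.
print(verify(disk_after_lost_write))  # b'old value', silently wrong
```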
An erasure-coding-based storage engine plugin for big, cold data that is expensive to store fully replicated wouldn’t be totally crazy. You could assign specific key spaces to it, and it would make FDB even more of an all-in-one storage solution.
Alternatively, if someone ever adds a log-structured storage engine with decent write performance on spinning disks, doing replication with one fast SSD (which serves all reads) and one or two slow spinning disks (which accept writes and are used to reconstruct the SSD replica in case of failure) would be an intermediate point on the cost/performance curve.
In case you’re curious, this starts the consistency check in a mode where it runs indefinitely at a modest rate controlled by a knob. If you start one of these processes up on a cluster, it should periodically check all of the data. I don’t recall what guarantees are made here, though, particularly in the face of data movement.
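For reference, my best recollection of the invocation is below; the role name, cluster file path, and the existence of a rate-limit knob are assumptions here, so check `fdbserver --help` for your version:

```
fdbserver -r consistencycheck -C /etc/foundationdb/fdb.cluster
```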
Hi Alex, what exactly happens to a storage server when it detects an IO/corruption error? Does it simply quit? If so, won’t fdbmonitor bring it back up again after some delay? Or does the corrupt storage server explicitly discard all of its shards (files on disk) before quitting?