This is almost certainly caused by your disk breaking the disk durability contract.
Blocks on disk have checksums to detect corruption, but checksums cannot protect against “lost writes”, where a page update that was fsync()'d is later read back from disk still in its previous state. Checksums do not help here because the old state of the page also had a valid checksum; the page was simply never updated to the new content.
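To make that concrete, here is a minimal sketch in Python (a hypothetical 4k page, with CRC32 standing in for the real block checksum) of why a per-page checksum cannot catch a lost write:

```python
import zlib

PAGE_SIZE = 4096

def checksum(page: bytes) -> int:
    # Stand-in for the real on-disk block checksum (CRC32 here for illustration).
    return zlib.crc32(page)

# Version 1 of a page is written and checksummed.
old_page = b"A" * PAGE_SIZE
old_sum = checksum(old_page)

# The database writes and fsync()s version 2, but a misbehaving disk drops the write.
new_page = b"B" * PAGE_SIZE
new_sum = checksum(new_page)  # the checksum the writer expects to see later

# On a later read, the disk returns the stale version 1 page. Its checksum still
# verifies, because the old content was itself valid. The lost write is invisible
# unless something remembers which checksum *should* be there (new_sum).
read_back = old_page
assert checksum(read_back) == old_sum   # passes: the stale page looks "valid"
assert checksum(read_back) != new_sum   # the only way to notice the lost write
```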
A lost write can cause BTree invariant violations, which are the type of errors you are seeing. A lost write can also cause stale data to be returned from the SQLite tree without it detecting an invariant violation internally. In the fragment error you saw, SQLite itself did not detect an issue and so returned a bad result, but that result broke an invariant in the FDB KV modeling layer on top of the SQLite tree, which fragments records to avoid a high slack space issue with SQLite’s structure. If the record had not been fragmented, the KV modeling layer would not have detected the error and the storage engine would have returned a bad result.
The SQLite storage engine has an incredible amount of simulated and real-world testing and at-scale usage behind it, and neither its code in FDB nor the underlying SQLite code has changed meaningfully in many years. A lost write can cause all of the issues you have seen, and is the most likely cause.
The good news is that we’ve encountered badly behaving disks before, so there is an FDB feature you can turn on which will prove whether this is what is happening.
If you set this option in your FDB conf file, in the fdbserver section:
knob_page_write_checksum_history = 50000000
then FDB will track written block checksums, within a process’s lifetime, for the first 50,000,000 4k blocks of space in your data files, and will use them to verify those blocks when they are read back from disk later after cache eviction. If a mismatch is found, it will log an AsyncFileLostWriteDetected event.
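For example, in foundationdb.conf this would look roughly like the excerpt below. Only the knob line comes from above; the rest of your existing [fdbserver] section stays as it is, and placing the knob there applies it to every fdbserver process on the machine.

```ini
# foundationdb.conf (excerpt)
[fdbserver]
knob_page_write_checksum_history = 50000000
```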
I suggest enabling this on your cluster and looking for these events. The memory overhead of this feature is approximately the history size setting * 8 bytes, so around 400MB of memory to track 50 million blocks, which would cover all blocks of a 200GB data file.
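If your data files are a different size, here is a small sizing sketch (plain Python, using only the figures from above: one history entry per 4k block and roughly 8 bytes of memory per entry) for estimating the cost of a given history size:

```python
# Rough sizing sketch for knob_page_write_checksum_history, using the figures
# from above: one entry per 4k block tracked, ~8 bytes of memory per entry.
BLOCK_SIZE = 4 * 1024
BYTES_PER_ENTRY = 8

def memory_overhead_mb(history_size: int) -> float:
    return history_size * BYTES_PER_ENTRY / 1e6

def coverage_gb(history_size: int) -> float:
    return history_size * BLOCK_SIZE / 1e9

history = 50_000_000  # the value from the conf example above
print(f"memory overhead:   ~{memory_overhead_mb(history):.0f} MB")  # ~400 MB
print(f"data file covered: ~{coverage_gb(history):.0f} GB")         # ~205 GB
```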