Hi guys, are there more detailed design documents? I'd like to know the detailed design of transactions, the storage engine, the storage file format, metadata management, and MVCC, as well as the advantages and disadvantages compared to Spanner, CockroachDB, and TiDB. Thanks.
Unfortunately, we do not have any additional design documents. I posted some additional details of the architecture in a different thread, but nothing that covers your specific questions.
How do I find details on how FoundationDB behaves relative to PostgreSQL in its interaction with the underlying hardware? The following LWN article reports on fsync():
PostgreSQL assumes that a successful call to fsync() indicates that all data written since the last successful call made it safely to persistent storage. But that is not what the kernel actually does. When a buffered I/O write fails due to a hardware-level error, filesystems will respond differently
- LWN article retrieved 30 Apr 18: PostgreSQL’s fsync() surprise
Specifically for that: I had already seen the LWN article, and double-checked that we correctly and intentionally crash on an fsync() failure. Code-wise, you’d want to go crawling through the AsyncFile* files: AsyncFileEIO is for macOS, AsyncFileKAIO is for Linux, and AsyncFileWinASIO is for Windows.
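To illustrate the "crash on fsync() failure" policy in general terms, here is a minimal Python sketch. It is not FoundationDB's code (that lives in the C++ AsyncFile* files); the function name `sync_or_die` and its injectable `fsync` and `abort` parameters are hypothetical, added only so the behavior can be exercised without a failing disk.

```python
import os
import sys
import tempfile


def sync_or_die(fd, fsync=os.fsync, abort=os.abort):
    """Flush fd to stable storage and treat any failure as fatal.

    Hypothetical helper: `fsync` and `abort` are injectable only so the
    policy can be tested without real hardware errors.
    """
    try:
        fsync(fd)
    except OSError as exc:
        # Do not retry: as the LWN article describes, a later fsync() may
        # report success even though the earlier writes never reached disk.
        sys.stderr.write(f"fsync failed ({exc}); aborting instead of retrying\n")
        abort()


if __name__ == "__main__":
    # Happy path: write a record to a temporary file and sync it.
    with tempfile.TemporaryFile() as f:
        f.write(b"record")
        f.flush()  # move data from Python's buffer into the kernel
        sync_or_die(f.fileno())
```

The point of aborting rather than retrying is that the process restarts and recovers from whatever is known to be durable, instead of continuing with an in-memory state that may no longer match the disk.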
FDB uses O_DIRECT, so as I understand it the specific problem mentioned in the article doesn’t apply.
But IMO the broader problem - that Linux’s disk subsystem doesn’t have a coherent way of propagating low level errors - is still a concern. I think that this (hardware failure dependent) scenario may occur:
(1) FDB does some (direct) writes to a file
(2) The writes are delivered to the SSD’s volatile write buffer, and the OS returns success from the writes to FDB
(3) Some other operation, maybe not even initiated by FDB, causes the SSD to “lock up” or take unreasonably long to respond to commands
(4) The Linux block driver gets tired of waiting and issues a hard link reset to the SSD
(5) The SSD responds to the link reset by throwing away its write cache
(6) The Linux block driver doesn’t report these events to the filesystem in any way
(7) FDB does an fsync
(8) Linux sends a FLUSH BUFFERS to the SSD, which flushes its (now empty) write buffer and returns success
(9) fsync returns success, even though the previous writes are not in fact durable
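The sequence above can be walked through with a toy model. This is purely illustrative: `ToySSD`, `toy_fsync`, and `run_scenario` are hypothetical names, and the model assumes (as the scenario does) that a link reset silently discards the drive's volatile cache and that nothing is reported upward.

```python
class ToySSD:
    """Hypothetical drive with a volatile write cache and persistent flash."""

    def __init__(self):
        self.cache = {}  # volatile: block number -> data
        self.flash = {}  # persistent: block number -> data

    def write(self, block, data):
        # Step 2: the write is acknowledged once it is in the volatile cache.
        self.cache[block] = data
        return True

    def link_reset(self):
        # Steps 4-6: the reset discards the cache; no error is recorded.
        self.cache.clear()

    def flush(self):
        # Step 8: flush whatever is cached (possibly nothing) and succeed.
        self.flash.update(self.cache)
        self.cache.clear()
        return True


def toy_fsync(ssd):
    # Steps 7 and 9: fsync just forwards the drive's flush result.
    return ssd.flush()


def run_scenario():
    """Return (fsync_reported_success, data_is_actually_durable)."""
    ssd = ToySSD()
    ssd.write(0, b"commit-1")  # steps 1-2
    ssd.link_reset()           # steps 3-6
    ok = toy_fsync(ssd)        # steps 7-9
    return ok, 0 in ssd.flash


if __name__ == "__main__":
    print(run_scenario())  # (True, False)
```

In this model the application has no signal to act on: every call it made returned success, yet block 0 never reached flash.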