SS lagging behind

The fact that IO operations aren’t completing within 20 seconds does suggest that your disk may not be healthy.

I believe when you get io_error, the process will terminate whatever role got the error (e.g. the storage server) but keep the process itself running. For a storage server, this prevents re-recruitment on that process until it gets restarted.

In the case of io_timeout, I believe the process will terminate itself, and assuming you are using something like fdbmonitor to watch the process, it will then be restarted. It should be noted that io_timeout was mainly designed to work around a particular issue (outside of FDB) where a disk would claim to have written something that it did not actually write. The timeout here allowed us to avoid the circumstances leading to that issue.

The timeout in its current state was not designed as a general-purpose mechanism for failing a process whose disk is unresponsive. Various operations are not monitored by the timeout, and when a process’s disk is hung there may be a better response than restarting immediately.

The file isn’t necessarily corrupted, though it’s possible that it could be. If the process restarts, it will try to use the file as normal. If any pages are corrupted such that their checksums are no longer valid, then an error would be thrown at the time that page is read, which could be in the distant future. I don’t believe the file would be deleted in this case.

There are classes of errors that wouldn’t be detected by checksums, such as the example I gave above where we saw a disk acknowledging writes that it hadn’t actually written. If you think there is a risk of such a problem on your disks, or if you want to be extra careful, you could remove the process yourself.
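
To make the difference between those two failure modes concrete, here is a minimal Python sketch. It is only an illustration and makes assumptions throughout: the page layout, the use of CRC32, and names like `make_page`, `read_page`, and `is_detected` are all hypothetical and are not FDB's actual storage format or code. It shows that a page whose bytes were damaged fails its checksum (but only once it is read), while a write that was acknowledged but never applied leaves an old page that still passes.

```python
import zlib

PAGE_SIZE = 16  # tiny pages, for illustration only


def make_page(payload: bytes) -> bytes:
    """Pad the payload to PAGE_SIZE and append a 4-byte CRC32 of the body."""
    body = payload.ljust(PAGE_SIZE, b"\0")
    return body + zlib.crc32(body).to_bytes(4, "big")


def read_page(page: bytes) -> bytes:
    """Verify the stored checksum and return the page body, or raise."""
    body, stored = page[:-4], int.from_bytes(page[-4:], "big")
    if zlib.crc32(body) != stored:
        raise IOError("checksum mismatch")
    return body


def is_detected(disk: dict) -> bool:
    """Return True if reading page 0 raises a checksum error."""
    try:
        read_page(disk[0])
        return False
    except IOError:
        return True


# Hypothetical on-disk state: page 0 holds version 1 of some value.
original = {0: make_page(b"value=v1")}

# Case 1: corruption. A byte changes on disk. Nothing notices until
# the page is actually read, at which point the checksum fails.
damaged = bytearray(original[0])
damaged[0] ^= 0xFF
disk_corrupted = {0: bytes(damaged)}

# Case 2: lost write. The disk acknowledged a write of
# make_page(b"value=v2") but never applied it, so the old page remains.
# The old page is internally consistent and its checksum still matches.
disk_lost_write = dict(original)

print("corruption detected:", is_detected(disk_corrupted))
print("lost write detected:", is_detected(disk_lost_write))
```

In the second case the read succeeds and returns the stale `value=v1`, which is why a checksum alone can't rule out this kind of problem.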

If you remove the process, then all the data on that process would have to be re-replicated elsewhere.

Not that I recall.