SS lagging behind

ajbeamon · May 29, 2019, 7:31pm

The fact that IO operations aren’t completing within 20 seconds does suggest that your disk may not be healthy.
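If you want a rough check from outside FDB, you can time a small synced write on the same volume. The sketch below is only an illustration: `probe_fsync_latency` and the `io_probe_` file prefix are made-up names for this example, and it does not reproduce how FDB itself measures IO latency.

```python
import os
import tempfile
import time


def probe_fsync_latency(directory, size=4096):
    """Write `size` bytes to a temp file in `directory`, fsync it,
    and return the elapsed time in seconds.

    Hypothetical helper for illustration; not part of FoundationDB.
    """
    fd, path = tempfile.mkstemp(dir=directory, prefix="io_probe_")
    try:
        start = time.monotonic()
        os.write(fd, b"\0" * size)
        os.fsync(fd)  # push the data to the device, not just the page cache
        return time.monotonic() - start
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    # Assumption: point this at the data directory's volume in practice.
    elapsed = probe_fsync_latency(tempfile.gettempdir())
    print(f"fsync took {elapsed:.3f}s")
```

A healthy disk should return in milliseconds. If the disk is truly hung, the call itself will block, so it is worth running a probe like this under an external timeout.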

I believe when you get io_error, the process will terminate whatever role got the error (e.g. the storage server) but keep the process itself running. For a storage server, this prevents re-recruitment on that process until it gets restarted.

In the case of io_timeout, I believe the process will terminate itself, and assuming you are using something like fdbmonitor to watch the process, it will then be restarted. It should be noted that io_timeout was mainly designed to work around a particular issue (outside of FDB) where a disk would claim to have written something that it did not actually write. The timeout here allowed us to avoid the circumstances leading to that issue.

The timeout in its current state was not designed as a general-purpose mechanism for failing a process when its disk is unresponsive. Various operations are not monitored by the timeout, and there may also be a better response to a hung disk than restarting the process immediately.

The file isn’t necessarily corrupted, though it’s possible that it could be. If the process restarts, it will try to use the file as normal. If any pages are corrupted such that their checksums are no longer valid, then an error would be thrown at the time that page is read, which could be in the distant future. I don’t believe the file would be deleted in this case.
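To illustrate why the error only shows up when the damaged page is read, here is a minimal sketch of per-page checksums. It assumes a CRC32 appended to each page; the layout and the names `write_page`, `read_page`, and `ChecksumError` are invented for this example and are not FDB's actual on-disk format.

```python
import zlib


class ChecksumError(Exception):
    """Raised when a page's stored checksum does not match its contents."""


def write_page(payload: bytes) -> bytes:
    # Append a 4-byte CRC32 of the payload to form the stored page.
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return payload + crc.to_bytes(4, "big")


def read_page(page: bytes) -> bytes:
    # Verification happens here, at read time, not when the file is opened.
    payload, stored = page[:-4], int.from_bytes(page[-4:], "big")
    if zlib.crc32(payload) & 0xFFFFFFFF != stored:
        raise ChecksumError("page checksum mismatch")
    return payload


# Three pages on "disk"; flip one byte of page 1 to simulate corruption.
pages = [write_page(bytes([i]) * 16) for i in range(3)]
damaged = bytearray(pages[1])
damaged[0] ^= 0xFF
pages[1] = bytes(damaged)
```

Reading pages 0 and 2 still succeeds, and nothing complains until something calls `read_page(pages[1])`. A page that was never written at all would still carry its old, valid checksum, so this kind of check would not notice it.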

There are classes of errors that wouldn’t be detected by checksums, such as the example I gave above where we saw a disk acknowledging writes that it hadn’t actually written. If you think there is a risk of such a problem on your disks, or if you want to be extra careful, you could remove the process yourself.

If you remove the process, then all the data on that process would have to be re-replicated elsewhere.
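One way to remove a process, assuming the rest of the cluster is healthy enough to take its data, is `fdbcli`'s `exclude` command. The sketch below only builds the command line; the address `10.0.0.5:4500` and the helper name `build_exclude_command` are placeholders for this example.

```python
import shlex


def build_exclude_command(address, cluster_file=None):
    """Return the argv list for excluding `address` via fdbcli.

    Hypothetical helper; `address` is the ip:port of the process to remove.
    """
    argv = ["fdbcli"]
    if cluster_file:
        argv += ["-C", cluster_file]  # use a non-default cluster file
    argv += ["--exec", f"exclude {address}"]
    return argv


if __name__ == "__main__":
    # Placeholder address; substitute the process you want to remove.
    argv = build_exclude_command("10.0.0.5:4500")
    print(shlex.join(argv))
```

Passing the resulting list to `subprocess.run(argv, check=True)` would actually run it. The exclusion has to move the data elsewhere before it finishes, so expect it to take a while on a large storage server.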

Not that I recall.
