How to detect disk stalls?

Can someone please tell me how io_trust_seconds works?
Or, how can we detect disk stalls in general?

I created a small cluster (1 host, 3 processes) with io_trust_seconds set to 5 seconds, then stalled the disk on the storage server via a FUSE filesystem. I was expecting to see some kind of timeout error, but instead the entire process stalled until the disk came back.
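For reference, the way I set the knob: fdbserver knobs are normally passed on the command line with a `--knob_` prefix (syntax from memory; verify against `fdbserver --help` for your version):

```shell
# Assumed syntax: knobs are set with --knob_<name>=<value>.
# Append to your usual fdbserver invocation.
fdbserver --knob_io_trust_seconds=5 ...
```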
The following are the trace logs from the storage server when I stalled the disk after 1573579506.935685.
(Interestingly, the second line only showed up after the disk came back, yet its timestamp is just 3 seconds after the previous line.)

{"Severity":"10","Time":"1573579506.935685","Type":"MachineMetrics","ID":"0000000000000000","Elapsed":"5.00012","MbpsSent":"70.7843","MbpsReceived":"70.7843","OutSegs":"228751","RetransSegs":"0","CPUSeconds":"4.7144","TotalMemory":"7993544704","CommittedMemory":"1441497088","AvailableMemory":"6552047616","ZoneID":"d168b16bcd45a2081e434b59a2e60cbd","MachineID":"d168b16bcd45a2081e434b59a2e60cbd","Machine":"127.0.0.1:4501","LogGroup":"default","Roles":"SS","TrackLatestType":"Original"}
{"Severity":"30","Time":"1573579509.597792","Type":"SlowTask","ID":"0000000000000000","TaskID":"9000","MClocks":"90003","Duration":"30.0007","SampleRate":"1","NumYields":"0","Machine":"127.0.0.1:4501","LogGroup":"default","Roles":"SS"}
{"Severity":"20","Time":"1573579509.597792","Type":"Net2SlowTaskTrace","ID":"0000000000000000","TraceTime":"1573579519.598367","Trace":"addr2line -e fdbserver.debug -p -C -f -i 0x7f28d39fd5f0 0x7f28d3209467 0x7f28d32094db 0x1e9a68a 0xc12674 0x10f0b50 0x1132a64 0x79cf01 0x1d65a5e 0x1d65c95 0x1e81628 0x73da6f 0x7f28d313c505 0x776cf5","Machine":"127.0.0.1:4501","LogGroup":"default","Roles":"SS"}
{"Severity":"20","Time":"1573579509.597792","Type":"Net2SlowTaskTrace","ID":"0000000000000000","TraceTime":"1573579539.598582","Trace":"addr2line -e fdbserver.debug -p -C -f -i 0x7f28d39fd5f0 0x7f28d3209275 0x7f28d32094f7 0x1e9a68a 0xc12674 0x10f0b50 0x1132a64 0x79cf01 0x1d65a5e 0x1d65c95 0x1e81628 0x73da6f 0x7f28d313c505 0x776cf5","Machine":"127.0.0.1:4501","LogGroup":"default","Roles":"SS"}
{"Severity":"10","Time":"1573579509.597792","Type":"SomewhatSlowRunLoopBottom","ID":"0000000000000000","Elapsed":"30.001","Machine":"127.0.0.1:4501","LogGroup":"default","Roles":"SS"}
{"Severity":"10","Time":"1573579509.597792","Type":"ConnectionFrom","ID":"c200d62665d48c6a","SuppressedEventCount":"7","FromAddress":"127.0.0.1:60330","ListenAddress":"127.0.0.1:4501","Machine":"127.0.0.1:4501","LogGroup":"default","Roles":"SS"}
{"Severity":"20","Time":"1573579509.597792","Type":"N2_ReadError","ID":"c200d62665d48c6a","SuppressedEventCount":"0","Message":"2","Machine":"127.0.0.1:4501","LogGroup":"default","Roles":"SS"}
{"Severity":"10","Time":"1573579509.597792","Type":"IncomingConnectionError","ID":"c200d62665d48c6a","Error":"connection_failed","ErrorDescription":"Network connection failed","ErrorCode":"1026","SuppressedEventCount":"0","FromAddress":"127.0.0.1:60330","Machine":"127.0.0.1:4501","LogGroup":"default","Roles":"SS"}
{"Severity":"20","Time":"1573579539.598975","Type":"GetDiskStatisticsDeviceNotFound","ID":"0000000000000000","Directory":"/mnt/charybdbde","Machine":"127.0.0.1:4501","LogGroup":"default","Roles":"SS"}
{"Severity":"10","Time":"1573579539.598975","Type":"MachineLoadDetail","ID":"0000000000000000","User":"182443","Nice":"270","System":"32507","Idle":"465553","IOWait":"10231","IRQ":"0","SoftIRQ":"1245","Steal":"70","Guest":"0","Machine":"127.0.0.1:4501","LogGroup":"default","Roles":"SS"}
{"Severity":"10","Time":"1573579539.598975","Type":"ProcessMetrics","ID":"0000000000000000","Elapsed":"32.6633","CPUSeconds":"2.18722","MainThreadCPUSeconds":"2.1848","UptimeSeconds":"367.688","Memory":"725852160","ResidentMemory":"518291456","UnusedAllocatedMemory":"358433984","MbpsSent":"1.81155","MbpsReceived":"2.26044","DiskTotalBytes":"98294394880","DiskFreeBytes":"90431365120","DiskQueueDepth":"0","DiskIdleSeconds":"32.6633","DiskReads":"0","DiskWrites":"0","DiskReadsCount":"0","DiskWritesCount":"0","DiskWriteSectors":"0","DiskReadSectors":"0","FileWrites":"5","FileReads":"0","CacheReadBytes":"20480","CacheFinds":"11","CacheWritesBlocked":"0","CacheReadsBlocked":"0","CachePageReadsMerged":"0","CacheWrites":"7","CacheReads":"172121","CacheHits":"172126","CacheMisses":"0","CacheEvictions":"0","ZoneID":"d168b16bcd45a2081e434b59a2e60cbd","MachineID":"d168b16bcd45a2081e434b59a2e60cbd","AIOSubmitCount":"2","AIOCollectCount":"2","AIOSubmitLag":"2.57184e-10","AIODiskStall":"6.63153e-10","CurrentConnections":"10","ConnectionsEstablished":"0","ConnectionsClosed":"0","ConnectionErrors":"0","Machine":"127.0.0.1:4501","LogGroup":"default","Roles":"SS","TrackLatestType":"Original"}

The only evidence of the disk stall was in ProcessMetrics, but it was reported after the disk came back to normal:
"DiskIdleSeconds":"32.6633"

Is this expected behavior?
How can we detect stalled disks earlier?
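One workaround until the process itself can detect this: watch the disk from outside the process. Below is a minimal sketch of the idea, not anything FDB ships; the device name and snapshot interval are assumptions, and the parsing follows the classic 14-field /proc/diskstats layout:

```python
# External-watchdog sketch: take two /proc/diskstats snapshots a few
# seconds apart; if requests are in flight but no reads or writes
# completed in between, the device is likely stalled.

def parse_diskstats(text, device):
    """Return (reads_completed, writes_completed, in_flight) for device."""
    for line in text.splitlines():
        f = line.split()
        if len(f) >= 12 and f[2] == device:
            return int(f[3]), int(f[7]), int(f[11])
    raise KeyError(device)

def looks_stalled(before, after):
    """Stalled: I/O is outstanding, yet nothing completed in the interval."""
    r0, w0, _ = before
    r1, w1, in_flight = after
    return in_flight > 0 and r1 == r0 and w1 == w0

# Two synthetic snapshots, as if taken ~5 seconds apart.
SNAP0 = "   8       0 sda 1000 5 8000 300 2000 10 16000 900 3 100 1200"
SNAP1 = "   8       0 sda 1000 5 8000 300 2000 10 16000 900 3 105 1205"

print(looks_stalled(parse_diskstats(SNAP0, "sda"),
                    parse_diskstats(SNAP1, "sda")))  # -> True
```

In real use you would read `/proc/diskstats` twice with a sleep in between and alert (or kill the fdbserver process) when the predicate holds for several consecutive intervals.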

Can you symbolicate one of those two Net2SlowTaskTrace events? It’s kind of hard for FDB to do anything to guard against a bad disk if it doesn’t return to the runloop for 30 seconds.

Oh, interesting. It was trying to retrieve file metadata.

First one:

$ addr2line -e bin/fdbserver -p -C -f -i 0x7fe45929c5f0 0x7fe458aa8467 0x7fe458aa84db 0x1e9a68a 0xc12674 0x10f0b50 0x1132a64 0x79cf01 0x1d65a5e 0x1d65c95 0x1e81628 0x73da6f 0x7fe4589db505 0x776cf5
?? ??:0
?? ??:0
?? ??:0
getDiskBytes(std::string const&, long&, long&) at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/basic_string.h:3355
std::string::_M_rep() const at /mnt/jenkins/home/jenkins/workspace/FoundationDB-OSFDB-Build/jenkins/foundationdb/flow/FastAlloc.h:200
 (inlined by) std::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string() at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/basic_string.h:3640
 (inlined by) KeyValueStoreSQLite::getStorageBytes() at /mnt/jenkins/home/jenkins/workspace/FoundationDB-OSFDB-Build/jenkins/foundationdb/fdbserver/KeyValueStoreSQLite.actor.cpp:1959
getQueuingMetrics(StorageServer*, StorageQueuingMetricsRequest const&) at /opt/rh/devtoolset-8/root/usr/include/c++/8/ext/new_allocator.h:102
a_body1loopBody1when8 at /opt/rh/devtoolset-8/root/usr/include/c++/8/ext/new_allocator.h:102
 (inlined by) a_callback_fire at /mnt/jenkins/home/jenkins/workspace/FoundationDB-OSFDB-Build/jenkins/build/fdbserver/storageserver.actor.g.cpp:14024
 (inlined by) fire at /mnt/jenkins/home/jenkins/workspace/FoundationDB-OSFDB-Build/jenkins/foundationdb/flow/flow.h:1009
void NotifiedQueue<StorageQueuingMetricsRequest>::send<StorageQueuingMetricsRequest>(StorageQueuingMetricsRequest&&) at /mnt/jenkins/home/jenkins/workspace/FoundationDB-OSFDB-Build/jenkins/foundationdb/flow/flow.h:598
 (inlined by) void NotifiedQueue<StorageQueuingMetricsRequest>::send<StorageQueuingMetricsRequest>(StorageQueuingMetricsRequest&&) at /mnt/jenkins/home/jenkins/workspace/FoundationDB-OSFDB-Build/jenkins/foundationdb/flow/flow.h:594
 (inlined by) NetNotifiedQueue<StorageQueuingMetricsRequest>::receive(ArenaObjectReader&) at /mnt/jenkins/home/jenkins/workspace/FoundationDB-OSFDB-Build/jenkins/foundationdb/fdbrpc/fdbrpc.h:253
Reference<ArenaBlock>::~Reference() at /mnt/jenkins/home/jenkins/workspace/FoundationDB-OSFDB-Build/jenkins/foundationdb/flow/Deque.h:162
 (inlined by) Arena::~Arena() at /mnt/jenkins/home/jenkins/workspace/FoundationDB-OSFDB-Build/jenkins/foundationdb/flow/Arena.h:92
 (inlined by) ArenaObjectReader::~ArenaObjectReader() at /mnt/jenkins/home/jenkins/workspace/FoundationDB-OSFDB-Build/jenkins/foundationdb/flow/ObjectSerializer.h:117
 (inlined by) a_body1cont1 at /mnt/jenkins/home/jenkins/workspace/FoundationDB-OSFDB-Build/jenkins/foundationdb/fdbrpc/FlowTransport.actor.cpp:640
ActorCallback<(anonymous namespace)::DeliverActor, 0, Void>::fire(Void const&) at /mnt/jenkins/home/jenkins/workspace/FoundationDB-OSFDB-Build/jenkins/foundationdb/flow/Deque.h:162
void SAV<Void>::send<Void>(Void&&) at /mnt/jenkins/home/jenkins/workspace/FoundationDB-OSFDB-Build/jenkins/foundationdb/flow/TDMetric.actor.h:1218
 (inlined by) void Promise<Void>::send<Void>(Void&&) const at /mnt/jenkins/home/jenkins/workspace/FoundationDB-OSFDB-Build/jenkins/foundationdb/flow/flow.h:779
 (inlined by) N2::PromiseTask::operator()() at /mnt/jenkins/home/jenkins/workspace/FoundationDB-OSFDB-Build/jenkins/foundationdb/flow/Net2.actor.cpp:490
 (inlined by) N2::Net2::run() at /mnt/jenkins/home/jenkins/workspace/FoundationDB-OSFDB-Build/jenkins/foundationdb/flow/Net2.actor.cpp:666
main at ??:?
?? ??:0
_start at ??:?

Second one:

$ addr2line -e bin/fdbserver -p -C -f -i 0x7fe45929c5f0 0x7fe458aa8275 0x7fe458aa84f7 0x1e9a68a 0xc12674 0x10f0b50 0x1132a64 0x79cf01 0x1d65a5e 0x1d65c95 0x1e81628 0x73da6f 0x7fe4589db505 0x776cf5
?? ??:0
?? ??:0
?? ??:0
getDiskBytes(std::string const&, long&, long&) at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/basic_string.h:3355
std::string::_M_rep() const at /mnt/jenkins/home/jenkins/workspace/FoundationDB-OSFDB-Build/jenkins/foundationdb/flow/FastAlloc.h:200
 (inlined by) std::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string() at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/basic_string.h:3640
 (inlined by) KeyValueStoreSQLite::getStorageBytes() at /mnt/jenkins/home/jenkins/workspace/FoundationDB-OSFDB-Build/jenkins/foundationdb/fdbserver/KeyValueStoreSQLite.actor.cpp:1959
getQueuingMetrics(StorageServer*, StorageQueuingMetricsRequest const&) at /opt/rh/devtoolset-8/root/usr/include/c++/8/ext/new_allocator.h:102
a_body1loopBody1when8 at /opt/rh/devtoolset-8/root/usr/include/c++/8/ext/new_allocator.h:102
 (inlined by) a_callback_fire at /mnt/jenkins/home/jenkins/workspace/FoundationDB-OSFDB-Build/jenkins/build/fdbserver/storageserver.actor.g.cpp:14024
 (inlined by) fire at /mnt/jenkins/home/jenkins/workspace/FoundationDB-OSFDB-Build/jenkins/foundationdb/flow/flow.h:1009
void NotifiedQueue<StorageQueuingMetricsRequest>::send<StorageQueuingMetricsRequest>(StorageQueuingMetricsRequest&&) at /mnt/jenkins/home/jenkins/workspace/FoundationDB-OSFDB-Build/jenkins/foundationdb/flow/flow.h:598
 (inlined by) void NotifiedQueue<StorageQueuingMetricsRequest>::send<StorageQueuingMetricsRequest>(StorageQueuingMetricsRequest&&) at /mnt/jenkins/home/jenkins/workspace/FoundationDB-OSFDB-Build/jenkins/foundationdb/flow/flow.h:594
 (inlined by) NetNotifiedQueue<StorageQueuingMetricsRequest>::receive(ArenaObjectReader&) at /mnt/jenkins/home/jenkins/workspace/FoundationDB-OSFDB-Build/jenkins/foundationdb/fdbrpc/fdbrpc.h:253
Reference<ArenaBlock>::~Reference() at /mnt/jenkins/home/jenkins/workspace/FoundationDB-OSFDB-Build/jenkins/foundationdb/flow/Deque.h:162
 (inlined by) Arena::~Arena() at /mnt/jenkins/home/jenkins/workspace/FoundationDB-OSFDB-Build/jenkins/foundationdb/flow/Arena.h:92
 (inlined by) ArenaObjectReader::~ArenaObjectReader() at /mnt/jenkins/home/jenkins/workspace/FoundationDB-OSFDB-Build/jenkins/foundationdb/flow/ObjectSerializer.h:117
 (inlined by) a_body1cont1 at /mnt/jenkins/home/jenkins/workspace/FoundationDB-OSFDB-Build/jenkins/foundationdb/fdbrpc/FlowTransport.actor.cpp:640
ActorCallback<(anonymous namespace)::DeliverActor, 0, Void>::fire(Void const&) at /mnt/jenkins/home/jenkins/workspace/FoundationDB-OSFDB-Build/jenkins/foundationdb/flow/Deque.h:162
void SAV<Void>::send<Void>(Void&&) at /mnt/jenkins/home/jenkins/workspace/FoundationDB-OSFDB-Build/jenkins/foundationdb/flow/TDMetric.actor.h:1218
 (inlined by) void Promise<Void>::send<Void>(Void&&) const at /mnt/jenkins/home/jenkins/workspace/FoundationDB-OSFDB-Build/jenkins/foundationdb/flow/flow.h:779
 (inlined by) N2::PromiseTask::operator()() at /mnt/jenkins/home/jenkins/workspace/FoundationDB-OSFDB-Build/jenkins/foundationdb/flow/Net2.actor.cpp:490
 (inlined by) N2::Net2::run() at /mnt/jenkins/home/jenkins/workspace/FoundationDB-OSFDB-Build/jenkins/foundationdb/flow/Net2.actor.cpp:666
main at ??:?
?? ??:0
_start at ??:?

Oh I see - we still block for some disk operations (statvfs in this case). I am not sure how to feel about this…
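For what it's worth, one common way around blocking calls like this is to push the metadata call onto a worker thread and bound the wait. A Python sketch of the idea only (this is not how FDB implements getStorageBytes; `os.statvfs` is a thin wrapper over statvfs(2), and the timeout value is an assumption):

```python
# Sketch: issue the blocking statvfs from a worker thread so the main
# loop can time out instead of stalling with the disk.
import os
from concurrent.futures import ThreadPoolExecutor, TimeoutError

_pool = ThreadPoolExecutor(max_workers=1)

def get_disk_bytes(path, timeout=5.0):
    """Return (free_bytes, total_bytes), or None if the call looks stuck."""
    future = _pool.submit(os.statvfs, path)
    try:
        st = future.result(timeout=timeout)
    except TimeoutError:
        return None  # statvfs is stuck; treat the disk as unresponsive
    return st.f_bavail * st.f_frsize, st.f_blocks * st.f_frsize

free, total = get_disk_bytes("/")
print(free <= total)  # -> True
```

The trade-off is that the worker thread itself stays stuck in the kernel until the disk recovers, so repeated stalls need a pool (or a fresh thread) per call.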

For what we saw in production, this particular behavior is actually not bad. The process stalls, and therefore the cluster controller (CC) would remove it from the cluster. So for tlogs and storage servers, this would behave fine.

On the other hand, this makes testing quite painful… Can you somehow stall only data operations but not metadata operations? Or could you stall just read, write, and fdatasync? Those are a bit more interesting to test.
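In a FUSE layer the split would be: inject the stall in the read/write/fsync handlers and leave getattr/statfs alone. The same idea in plain Python, as a wrapper that delays data operations only (all names here are hypothetical, just to illustrate the split):

```python
import os, time, tempfile

class StallingFile:
    """Wraps a file descriptor; data ops sleep, metadata ops do not."""
    def __init__(self, path, stall_seconds):
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        self.stall = stall_seconds

    def write(self, data):   # data op: stalled
        time.sleep(self.stall)
        return os.write(self.fd, data)

    def fsync(self):         # data op: stalled
        time.sleep(self.stall)
        os.fsync(self.fd)

    def stat(self):          # metadata op: NOT stalled
        return os.fstat(self.fd)

with tempfile.TemporaryDirectory() as d:
    f = StallingFile(os.path.join(d, "data"), stall_seconds=0.2)
    t0 = time.monotonic(); f.write(b"hello"); slow = time.monotonic() - t0
    t0 = time.monotonic(); f.stat();          fast = time.monotonic() - t0
    os.close(f.fd)
    print(slow >= 0.2 > fast)  # -> True
```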

io_trust_seconds is a mechanism that was designed to address a particular issue where we couldn’t trust the success of an IO operation if it had been outstanding long enough (in particular, if the kernel attempted a task abort). To work around this case, we aborted any IO that was running longer than some duration.

While it can be used to help time out a process when its disk is unresponsive, it wasn’t really designed for that use case, and as you discovered, there are certainly cases where you can stall on an unresponsive disk that aren’t addressed by this feature.

If we stall only read, write, and fdatasync, the runloop doesn’t get blocked.

There was one stalled task:

{"Severity":"30","Time":"1573594590.811891","Type":"SlowTask","ID":"0000000000000000","TaskID":"9000","MClocks":"29718.6","Duration":"9.90613","SampleRate":"1","NumYields":"0","Machine":"127.0.0.1:4502","LogGroup":"default","Roles":"SS"}
{"Severity":"20","Time":"1573594590.811891","Type":"Net2SlowTaskTrace","ID":"0000000000000000","TraceTime":"1573594600.717443","Trace":"addr2line -e fdbserver.debug -p -C -f -i 0x7ff3d324c5f0 0x7ff3d2a58315 0x7ff3d29ac5a9 0x1e9d71f 0x1e9ddc0 0xc12663 0x10f0b50 0x1132a64 0x79cf01 0x1d65a5e 0x1d65c95 0x1e81628 0x73da6f 0x7ff3d298b505 0x776cf5","Machine":"127.0.0.1:4502","LogGroup":"default","Roles":"SS"}

Which is this one… it still looks like a metadata read to me, but I could be wrong.

?? ??:0
?? ??:0
?? ??:0
abspath(std::string const&, bool, bool) at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/basic_string.h:3355
parentDirectory(std::string const&, bool, bool) at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/basic_string.h:3355
KeyValueStoreSQLite::getStorageBytes() at /mnt/jenkins/home/jenkins/workspace/FoundationDB-OSFDB-Build/jenkins/foundationdb/flow/FastAlloc.h:200
getQueuingMetrics(StorageServer*, StorageQueuingMetricsRequest const&) at /opt/rh/devtoolset-8/root/usr/include/c++/8/ext/new_allocator.h:102
a_body1loopBody1when8 at /opt/rh/devtoolset-8/root/usr/include/c++/8/ext/new_allocator.h:102
 (inlined by) a_callback_fire at /mnt/jenkins/home/jenkins/workspace/FoundationDB-OSFDB-Build/jenkins/build/fdbserver/storageserver.actor.g.cpp:14024
 (inlined by) fire at /mnt/jenkins/home/jenkins/workspace/FoundationDB-OSFDB-Build/jenkins/foundationdb/flow/flow.h:1009
void NotifiedQueue<StorageQueuingMetricsRequest>::send<StorageQueuingMetricsRequest>(StorageQueuingMetricsRequest&&) at /mnt/jenkins/home/jenkins/workspace/FoundationDB-OSFDB-Build/jenkins/foundationdb/flow/flow.h:598
 (inlined by) void NotifiedQueue<StorageQueuingMetricsRequest>::send<StorageQueuingMetricsRequest>(StorageQueuingMetricsRequest&&) at /mnt/jenkins/home/jenkins/workspace/FoundationDB-OSFDB-Build/jenkins/foundationdb/flow/flow.h:594
 (inlined by) NetNotifiedQueue<StorageQueuingMetricsRequest>::receive(ArenaObjectReader&) at /mnt/jenkins/home/jenkins/workspace/FoundationDB-OSFDB-Build/jenkins/foundationdb/fdbrpc/fdbrpc.h:253
Reference<ArenaBlock>::~Reference() at /mnt/jenkins/home/jenkins/workspace/FoundationDB-OSFDB-Build/jenkins/foundationdb/flow/Deque.h:162
 (inlined by) Arena::~Arena() at /mnt/jenkins/home/jenkins/workspace/FoundationDB-OSFDB-Build/jenkins/foundationdb/flow/Arena.h:92
 (inlined by) ArenaObjectReader::~ArenaObjectReader() at /mnt/jenkins/home/jenkins/workspace/FoundationDB-OSFDB-Build/jenkins/foundationdb/flow/ObjectSerializer.h:117
 (inlined by) a_body1cont1 at /mnt/jenkins/home/jenkins/workspace/FoundationDB-OSFDB-Build/jenkins/foundationdb/fdbrpc/FlowTransport.actor.cpp:640
ActorCallback<(anonymous namespace)::DeliverActor, 0, Void>::fire(Void const&) at /mnt/jenkins/home/jenkins/workspace/FoundationDB-OSFDB-Build/jenkins/foundationdb/flow/Deque.h:162
void SAV<Void>::send<Void>(Void&&) at /mnt/jenkins/home/jenkins/workspace/FoundationDB-OSFDB-Build/jenkins/foundationdb/flow/TDMetric.actor.h:1218
 (inlined by) void Promise<Void>::send<Void>(Void&&) const at /mnt/jenkins/home/jenkins/workspace/FoundationDB-OSFDB-Build/jenkins/foundationdb/flow/flow.h:779
 (inlined by) N2::PromiseTask::operator()() at /mnt/jenkins/home/jenkins/workspace/FoundationDB-OSFDB-Build/jenkins/foundationdb/flow/Net2.actor.cpp:490
 (inlined by) N2::Net2::run() at /mnt/jenkins/home/jenkins/workspace/FoundationDB-OSFDB-Build/jenkins/foundationdb/flow/Net2.actor.cpp:666
main at ??:?
?? ??:0
_start at ??:?

Eventually (more than 10 seconds after the disk was stalled), it detected some network communication errors, but these may be unrelated.

{"Severity":"20","Time":"1573594600.719515","Type":"N2_ReadError","ID":"6d0e0d23c63d3258","SuppressedEventCount":"0","Message":"2","Machine":"127.0.0.1:4502","LogGroup":"default","Roles":"SS"}
{"Severity":"10","Time":"1573594600.719515","Type":"ConnectionClosed","ID":"6d0e0d23c63d3258","Error":"connection_failed","ErrorDescription":"Network connection failed","ErrorCode":"1026","SuppressedEventCount":"0","PeerAddr":"127.0.0.1:4500","Machine":"127.0.0.1:4502","LogGroup":"default","Roles":"SS"}
{"Severity":"10","Time":"1573594600.719515","Type":"PeerDestroy","ID":"0000000000000000","Error":"connection_failed","ErrorDescription":"Network connection failed","ErrorCode":"1026","SuppressedEventCount":"0","PeerAddr":"127.0.0.1:32916","Machine":"127.0.0.1:4502","LogGroup":"default","Roles":"SS"}
{"Severity":"30","Time":"1573594600.719515","Type":"FailureMonitorClientSlow","ID":"0000000000000000","Elapsed":"9.9566","Expected":"0.1","Machine":"127.0.0.1:4502","LogGroup":"default","Roles":"SS"}
{"Severity":"10","Time":"1573594600.719515","Type":"ConnectionFrom","ID":"9babc127863a64b8","SuppressedEventCount":"0","FromAddress":"127.0.0.1:32928","ListenAddress":"127.0.0.1:4502","Machine":"127.0.0.1:4502","LogGroup":"default","Roles":"SS"}
{"Severity":"10","Time":"1573594600.719923","Type":"IncomingConnectionError","ID":"9babc127863a64b8","Error":"connection_failed","ErrorDescription":"Network connection failed","ErrorCode":"1026","SuppressedEventCount":"0","FromAddress":"127.0.0.1:32928","Machine":"127.0.0.1:4502","LogGroup":"default","Roles":"SS"}

Thanks for the clarification.
So io_trust_seconds is obviously not the mechanism we want to rely on in this case.
Do we have any other mechanism to react to unresponsive disks?

At the moment, I think the best we have is io_trust_seconds and a new-ish mechanism on the transaction logs that causes them to be marked degraded and deprioritized for selection by the cluster controller when they can’t commit. There’s additional work in this area milestoned for 7.0 in this issue:

I don’t think these are unrelated - we get networking errors because the main loop stalled (which in turn was caused by the disk stall).

Again: in general I like this behavior. The question is whether we can somehow construct a test where the disk stalls but the main thread doesn’t… This is what happened in production, and this is where things get weird.

I think it’s more that we should expand what io_trust_seconds applies to, as you appear to have (rather easily) found holes in it. Setting some time aside to get #1447 merged for 7.0 might be a pretty good idea…


io_trust_seconds is a feature that specifies how long we trust the results of a read or write IO, so it may be helpful to think of it as somewhat distinct from the topic of detecting stalls. It’s possible we’d also want a global IO timeout, though a lot of discussion in the past has centered around how we’d want these kinds of decisions to be made centrally rather than locally (as in the issue I posted above).

Or maybe we will decide that we want local timeouts and that they could subsume the io_trust_seconds feature (or that it’s no longer necessary for its original purpose), in which case we could try to rename and expand it to a broader timeout mechanism. If we decide to go that route and continue supporting the original use case, we’ll want to make sure that the way errors are handled meets the needs of other use cases.

Ah, OK, now I understand what io_trust_seconds is designed for.
So it’s more of a verification mechanism for very-long-latency I/Os; it doesn’t really act as a timeout.

Here, I set it to 5 seconds and stalled the disk for 20+ seconds.
The SS only complains that an I/O took too long after the disk comes back.

         OPS         8974      1123.01         0.00
         OPS         6685       830.68         0.00
         OPS         9408      1175.01         1.00
         OPS         7967       997.01         0.00
         OPS           43         6.00         0.00
         OPS            0         0.00         0.00
         OPS            0         0.00         0.00
         OPS            0         0.00         0.00
         OPS            0         0.00         0.00
         OPS            0         0.00         0.00
         OPS            0         0.00         0.00
         OPS            0         0.00         0.00
         OPS            0         0.00         0.00
         OPS            0         0.00         0.00
         OPS            0         0.00         0.00
         OPS            0         0.00         0.00
         OPS            0         0.00         0.00
         OPS            0         0.00         0.00
         OPS            0         0.00         0.00
         OPS            0         0.00         0.00
         OPS            0         0.00         0.00
         OPS            0         0.00         0.00
         OPS            0         0.00         0.00
         OPS            0         0.00         0.00
         OPS            0         0.00         0.00
         OPS            0         0.00         0.00
         OPS            0         0.00         0.00
         OPS            0         0.00         0.00
ERROR: Error commit_transaction (1007) occured at Transaction is too old to perform reads or be committed
Time="1573672012.970110" Severity="40" LogGroup="default" Process="fdbserver.4501": Fatal Error: A disk IO operation failed to complete in a timely manner

By the way, FYI for the other members of this forum: I used the mako benchmark to generate the load.