SS lagging behind

(gaurav) #1

Hi, I have an 8-node cluster with two fdbserver processes running on each node (v6.0.15). The topology of the cluster can be seen in the screenshot below. I frequently see that one of the storage servers (SS) develops both a high data lag (data_lag->seconds) and a high durability lag (durability_lag->seconds), and I am unable to find the reason for this. The cluster seems to be only moderately loaded.

Any pointers on how to debug this further would be very helpful:

Status JSON

On the lagging process (10.196.78.162:4500), I see the following kinds of logs above Severity 10:

<Event Severity="20" Time="1559047903.276518" Type="StorageServerUpdateLag" ID="c010af68cb32eb52" Version="1520451207836" DurableVersion="1517261749412" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<Event Severity="20" Time="1559047904.277512" Type="StorageServerUpdateLag" ID="c010af68cb32eb52" Version="1520451207836" DurableVersion="1517261749412" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<Event Severity="20" Time="1559047905.278558" Type="StorageServerUpdateLag" ID="c010af68cb32eb52" Version="1520451207836" DurableVersion="1517261749412" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<Event Severity="20" Time="1559047905.488639" Type="StorageServerUpdateLag" ID="c010af68cb32eb52" Version="1520454143590" DurableVersion="1517265129285" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<Event Severity="20" Time="1559047905.508182" Type="StorageServerUpdateLag" ID="c010af68cb32eb52" Version="1520455598618" DurableVersion="1517272800465" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<Event Severity="20" Time="1559047905.554610" Type="StorageServerUpdateLag" ID="c010af68cb32eb52" Version="1520458034806" DurableVersion="1517275101043" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<Event Severity="20" Time="1559047905.630482" Type="StorageServerUpdateLag" ID="c010af68cb32eb52" Version="1520461014779" DurableVersion="1517280540285" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<Event Severity="20" Time="1559047905.676272" Type="StorageServerUpdateLag" ID="c010af68cb32eb52" Version="1520462329085" DurableVersion="1517286780238" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<Event Severity="20" Time="1559047906.678751" Type="StorageServerUpdateLag" ID="c010af68cb32eb52" Version="1520462329085" DurableVersion="1517287075963" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<Event Severity="20" Time="1559047907.632185" Type="SlowSSLoopx100" ID="c010af68cb32eb52" Elapsed="0.0600758" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<Event Severity="20" Time="1559047907.679207" Type="StorageServerUpdateLag" ID="c010af68cb32eb52" Version="1520462329085" DurableVersion="1517287075963" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<Event Severity="20" Time="1559047908.600749" Type="FetchPast" ID="c010af68cb32eb52" TotalAttempts="112640" FKID="93a98b2b798e44c1" V="1520462329085" N="1520462329085" E="1520462329085" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<Event Severity="20" Time="1559047908.679482" Type="StorageServerUpdateLag" ID="c010af68cb32eb52" Version="1520462329085" DurableVersion="1517287075963" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<Event Severity="20" Time="1559047909.681028" Type="StorageServerUpdateLag" ID="c010af68cb32eb52" Version="1520462329085" DurableVersion="1517287075963" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<Event Severity="20" Time="1559047910.681947" Type="StorageServerUpdateLag" ID="c010af68cb32eb52" Version="1520462329085" DurableVersion="1517287075963" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<Event Severity="20" Time="1559047911.018044" Type="StorageServerUpdateLag" ID="c010af68cb32eb52" Version="1520464161755" DurableVersion="1517288247848" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<Event Severity="20" Time="1559047911.036774" Type="StorageServerUpdateLag" ID="c010af68cb32eb52" Version="1520466735558" DurableVersion="1517289913248" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<Event Severity="20" Time="1559047911.054719" Type="StorageServerUpdateLag" ID="c010af68cb32eb52" Version="1520468120733" DurableVersion="1517291493154" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<Event Severity="20" Time="1559047911.078749" Type="StorageServerUpdateLag" ID="c010af68cb32eb52" Version="1520470167786" DurableVersion="1517296071337" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<Event Severity="20" Time="1559047911.097467" Type="StorageServerUpdateLag" ID="c010af68cb32eb52" Version="1520473575353" DurableVersion="1517299817340" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<Event Severity="20" Time="1559047911.125359" Type="StorageServerUpdateLag" ID="c010af68cb32eb52" Version="1520475647147" DurableVersion="1517304361907" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<Event Severity="20" Time="1559047911.144527" Type="StorageServerUpdateLag" ID="c010af68cb32eb52" Version="1520477655758" DurableVersion="1517305764326" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<Event Severity="20" Time="1559047911.182338" Type="StorageServerUpdateLag" ID="c010af68cb32eb52" Version="1520479263127" DurableVersion="1517306679220" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<Event Severity="20" Time="1559047912.186457" Type="StorageServerUpdateLag" ID="c010af68cb32eb52" Version="1520479263127" DurableVersion="1517306679220" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<Event Severity="20" Time="1559047913.188647" Type="StorageServerUpdateLag" ID="c010af68cb32eb52" Version="1520479263127" DurableVersion="1517306679220" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<Event Severity="20" Time="1559047913.473041" Type="FetchPast" ID="c010af68cb32eb52" TotalAttempts="115712" FKID="78842ee4f8ae4547" V="1520479263127" N="1520479263127" E="1520479263127" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
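
For what it's worth, FDB commit versions advance at roughly one million per second, so the gap between Version and DurableVersion in these events can be converted into an approximate durability lag. A small Python sketch using the numbers from the first event above (the 1,000,000 versions/second rate is the usual rule of thumb, not something measured on this cluster):

```python
# Rough lag estimate from a StorageServerUpdateLag trace event.
# FDB commit versions advance at roughly 1,000,000 per second, so
# (Version - DurableVersion) / 1e6 approximates durability lag in seconds.
VERSIONS_PER_SECOND = 1_000_000

def approx_lag_seconds(version: int, durable_version: int) -> float:
    return (version - durable_version) / VERSIONS_PER_SECOND

# Values from the first event in the excerpt above:
lag = approx_lag_seconds(1520451207836, 1517261749412)
print(f"durability lag ~ {lag:.0f} seconds")  # roughly 3189 seconds (~53 minutes)
```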

(A.J. Beamon) #2

Is it the same process that lags every time? Often when I’ve seen something like this, there’s something wrong with the host that’s preventing it from keeping up. I don’t see anything else in your stats that looks like a good lead.

If you can’t find anything, you could also try excluding the process and seeing whether the cluster behaves well without it. Then, if you wanted, you could re-include it and see whether that process runs into problems again. Or you could run some other stress test against the process so as not to pollute your cluster.

(gaurav) #3

Thanks, @ajbeamon.
Yes, it is the same server that has a tendency to get stuck. I pulled some more logs above Severity 20 from this machine, and they point to disk errors, but I cannot fully interpret the information they convey. Could you please take a look? Thanks so much!

Specifically, do these error lines point to a permanent failure of the SS worker? If so, it appears the process is left in some “zombie” state instead of being killed. Is that expected behavior?
I am using the option io_trust_seconds = 20 for fdbserver (maybe related to the timeout errors?).
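
For context, this knob goes in the [fdbserver] section of foundationdb.conf; an illustrative fragment (not my full config) looks something like:

```ini
[fdbserver]
# ... other fdbserver options ...
io_trust_seconds = 20
```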

<Event Severity="40" Time="1559081026.661524" Type="StorageServerFailed" ID="c010af68cb32eb52" Error="io_timeout" ErrorDescription="A disk IO operation failed to complete in a timely manner" ErrorCode="1521" Reason="Error" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x1414706 0x1412e77 0xaae7db 0xaaf0fc 0xaaf274 0xa9d233 0xa9d537 0x45cd09 0xaaa8cc 0xaaab58 0x4f03b8 0x4f06f7 0x45cd09 0xaddf20 0xade394 0x45cd09 0xaa370b 0xaa3aa0 0x4f03b8 0x4f06f7 0x45cd09 0x5fda60 0x52d540 0x1438902 0x43a4a7 0x7ffff70fb830" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<Event Severity="40" Time="1559081026.661524" Type="WorkerFailed" ID="c81c5d7300968eb7" Error="io_timeout" ErrorDescription="A disk IO operation failed to complete in a timely manner" ErrorCode="1521" Reason="WorkerError" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x1414706 0x1412e77 0xaae7db 0xaaf6cb 0xaafa6b 0x45cd09 0xaaf175 0xaaf274 0xa9d233 0xa9d537 0x45cd09 0xaaa8cc 0xaaab58 0x4f03b8 0x4f06f7 0x45cd09 0xaddf20 0xade394 0x45cd09 0xaa370b 0xaa3aa0 0x4f03b8 0x4f06f7 0x45cd09 0x5fda60 0x52d540 0x1438902 0x43a4a7 0x7ffff70fb830" Machine="10.196.78.162:4500" LogGroup="default" />
...
<Event Severity="40" Time="1559081027.827925" Type="WorkerServer" ID="0000000000000000" Error="io_timeout" ErrorDescription="A disk IO operation failed to complete in a timely manner" ErrorCode="1521" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x1414706 0x1412e77 0x5fd766 0x5fd987 0x45cd09 0xaa8952 0x52d540 0x1438902 0x43a4a7 0x7ffff70fb830" Machine="10.196.78.162:4500" LogGroup="default" />
<Event Severity="40" Time="1559081027.827925" Type="StopAfterError" ID="0000000000000000" Error="io_timeout" ErrorDescription="A disk IO operation failed to complete in a timely manner" ErrorCode="1521" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x1414706 0x1412e77 0x94cde7 0x94d0f7 0x45cd09 0x52db2d 0x45cd09 0xa9ebd2 0xa9efd7 0x45cd09 0x52db2d 0x45cd09 0x5fd783 0x5fd987 0x45cd09 0xaa8952 0x52d540 0x1438902 0x43a4a7 0x7ffff70fb830" Machine="10.196.78.162:4500" LogGroup="default" />

Full logs:

<Event Severity="30" Time="1558426937.945936" Type="SlowTask" ID="0000000000000000" TaskID="10500" MClocks="3265.48" Duration="1.48431" SampleRate="1" NumYields="0" Machine="10.196.78.162:10138" LogGroup="default" />
<Event Severity="30" Time="1558426939.724215" Type="SlowTask" ID="0000000000000000" TaskID="9000" MClocks="541.682" Duration="0.246221" SampleRate="1" NumYields="0" Machine="10.196.78.162:10138" LogGroup="default" />
<Event Severity="30" Time="1558426940.169407" Type="FailureMonitorClientSlow" ID="0000000000000000" Elapsed="3.45602" Expected="0.1" Machine="10.196.78.162:10138" LogGroup="default" />
<Event Severity="30" Time="1559071549.305820" Type="SlowTask" ID="0000000000000000" TaskID="20000" MClocks="684.447" Duration="0.311114" SampleRate="1" NumYields="0" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<Event Severity="30" Time="1559075913.947347" Type="SlowTask" ID="0000000000000000" TaskID="20000" MClocks="778.822" Duration="0.354012" SampleRate="1" NumYields="0" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<Event Severity="30" Time="1559081026.293712" Type="AsyncFileKAIOTimeout" ID="0000000000000000" Fd="56" Op="0" Nbytes="4096" Offset="55751970816" Ptr="140735483731968" Filename="/var/lib/foundationdb/data/4500/storage-c010af68cb32eb528508a8dec5e8ce6c.sqlite" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<Event Severity="30" Time="1559081026.293712" Type="AsyncFileKAIOTimeout" ID="0000000000000000" Fd="56" Op="0" Nbytes="4096" Offset="51079520256" Ptr="140736127582208" Filename="/var/lib/foundationdb/data/4500/storage-c010af68cb32eb528508a8dec5e8ce6c.sqlite" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<Event Severity="30" Time="1559081026.293712" Type="AsyncFileKAIOTimeout" ID="0000000000000000" Fd="56" Op="0" Nbytes="4096" Offset="30361763840" Ptr="140736458969088" Filename="/var/lib/foundationdb/data/4500/storage-c010af68cb32eb528508a8dec5e8ce6c.sqlite" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<Event Severity="30" Time="1559081026.293712" Type="AsyncFileKAIOTimeout" ID="0000000000000000" Fd="56" Op="0" Nbytes="4096" Offset="30361812992" Ptr="140736166064128" Filename="/var/lib/foundationdb/data/4500/storage-c010af68cb32eb528508a8dec5e8ce6c.sqlite" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<Event Severity="30" Time="1559081026.293712" Type="AsyncFileKAIOTimeout" ID="0000000000000000" Fd="56" Op="0" Nbytes="4096" Offset="30361866240" Ptr="140735396265984" Filename="/var/lib/foundationdb/data/4500/storage-c010af68cb32eb528508a8dec5e8ce6c.sqlite" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<Event Severity="30" Time="1559081026.293712" Type="AsyncFileKAIOTimeout" ID="0000000000000000" Fd="56" Op="0" Nbytes="4096" Offset="18616791040" Ptr="140736894042112" Filename="/var/lib/foundationdb/data/4500/storage-c010af68cb32eb528508a8dec5e8ce6c.sqlite" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<Event Severity="30" Time="1559081026.293712" Type="AsyncFileKAIOTimeout" ID="0000000000000000" Fd="56" Op="0" Nbytes="4096" Offset="21165961216" Ptr="140736624898048" Filename="/var/lib/foundationdb/data/4500/storage-c010af68cb32eb528508a8dec5e8ce6c.sqlite" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<Event Severity="30" Time="1559081026.293712" Type="AsyncFileKAIOTimeout" ID="0000000000000000" Fd="56" Op="0" Nbytes="4096" Offset="9695883264" Ptr="140736155959296" Filename="/var/lib/foundationdb/data/4500/storage-c010af68cb32eb528508a8dec5e8ce6c.sqlite" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<Event Severity="30" Time="1559081026.293712" Type="AsyncFileKAIOTimeout" ID="0000000000000000" Fd="56" Op="0" Nbytes="4096" Offset="35209949184" Ptr="140734980591616" Filename="/var/lib/foundationdb/data/4500/storage-c010af68cb32eb528508a8dec5e8ce6c.sqlite" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<Event Severity="30" Time="1559081026.293712" Type="AsyncFileKAIOTimeout" ID="0000000000000000" Fd="56" Op="0" Nbytes="4096" Offset="17618251776" Ptr="140735106314240" Filename="/var/lib/foundationdb/data/4500/storage-c010af68cb32eb528508a8dec5e8ce6c.sqlite" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<Event Severity="30" Time="1559081026.293712" Type="AsyncFileKAIOTimeout" ID="0000000000000000" Fd="56" Op="0" Nbytes="4096" Offset="11842953216" Ptr="140736152465408" Filename="/var/lib/foundationdb/data/4500/storage-c010af68cb32eb528508a8dec5e8ce6c.sqlite" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<Event Severity="30" Time="1559081026.293712" Type="AsyncFileKAIOTimeout" ID="0000000000000000" Fd="56" Op="0" Nbytes="4096" Offset="3780812800" Ptr="140735511056384" Filename="/var/lib/foundationdb/data/4500/storage-c010af68cb32eb528508a8dec5e8ce6c.sqlite" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<Event Severity="30" Time="1559081026.293712" Type="AsyncFileKAIOTimeout" ID="0000000000000000" Fd="56" Op="0" Nbytes="4096" Offset="45969174528" Ptr="140735090565120" Filename="/var/lib/foundationdb/data/4500/storage-c010af68cb32eb528508a8dec5e8ce6c.sqlite" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<Event Severity="30" Time="1559081026.293712" Type="AsyncFileKAIOTimeout" ID="0000000000000000" Fd="56" Op="0" Nbytes="4096" Offset="9308602368" Ptr="140736556589056" Filename="/var/lib/foundationdb/data/4500/storage-c010af68cb32eb528508a8dec5e8ce6c.sqlite" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<Event Severity="30" Time="1559081026.293712" Type="AsyncFileKAIOTimeout" ID="0000000000000000" Fd="56" Op="0" Nbytes="4096" Offset="35944521728" Ptr="140736532815872" Filename="/var/lib/foundationdb/data/4500/storage-c010af68cb32eb528508a8dec5e8ce6c.sqlite" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<Event Severity="30" Time="1559081026.293712" Type="AsyncFileKAIOTimeout" ID="0000000000000000" Fd="56" Op="0" Nbytes="4096" Offset="17962188800" Ptr="140735370956800" Filename="/var/lib/foundationdb/data/4500/storage-c010af68cb32eb528508a8dec5e8ce6c.sqlite" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<Event Severity="30" Time="1559081026.293712" Type="AsyncFileKAIOTimeout" ID="0000000000000000" Fd="56" Op="0" Nbytes="4096" Offset="10939691008" Ptr="140736224985088" Filename="/var/lib/foundationdb/data/4500/storage-c010af68cb32eb528508a8dec5e8ce6c.sqlite" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<Event Severity="30" Time="1559081026.293712" Type="AsyncFileKAIOTimeout" ID="0000000000000000" Fd="56" Op="0" Nbytes="4096" Offset="111964160" Ptr="140734942621696" Filename="/var/lib/foundationdb/data/4500/storage-c010af68cb32eb528508a8dec5e8ce6c.sqlite" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<Event Severity="30" Time="1559081026.293712" Type="SlowTask" ID="0000000000000000" TaskID="9150" MClocks="639.82" Duration="0.290829" SampleRate="1" NumYields="0" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<Event Severity="30" Time="1559081026.584695" Type="AsyncFileKAIOTimeout" ID="0000000000000000" Fd="56" Op="0" Nbytes="4096" Offset="679956480" Ptr="140735823495168" Filename="/var/lib/foundationdb/data/4500/storage-c010af68cb32eb528508a8dec5e8ce6c.sqlite" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<Event Severity="40" Time="1559081026.612489" Type="DiskError" ID="0000000000000000" Error="io_error" ErrorDescription="Disk i/o operation failed" ErrorCode="1510" In="BtreeMovetoUnpacked" File="/var/lib/foundationdb/data/4500/storage-c010af68cb32eb528508a8dec5e8ce6c.sqlite" SQLiteError="disk I/O error" SQLiteErrorCode="266" UnixErrorCode="0" UnixError="Success" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x1414706 0x1412e77 0x6eb3d8 0x6eccca 0x6ef6c5 0x6f5505 0x5f9989 0x5f7029 0x130a26f 0x7ffff71255d0" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<Event Severity="40" Time="1559081026.661524" Type="StorageServerFailed" ID="c010af68cb32eb52" Error="io_timeout" ErrorDescription="A disk IO operation failed to complete in a timely manner" ErrorCode="1521" Reason="Error" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x1414706 0x1412e77 0xaae7db 0xaaf0fc 0xaaf274 0xa9d233 0xa9d537 0x45cd09 0xaaa8cc 0xaaab58 0x4f03b8 0x4f06f7 0x45cd09 0xaddf20 0xade394 0x45cd09 0xaa370b 0xaa3aa0 0x4f03b8 0x4f06f7 0x45cd09 0x5fda60 0x52d540 0x1438902 0x43a4a7 0x7ffff70fb830" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<Event Severity="40" Time="1559081026.661524" Type="WorkerFailed" ID="c81c5d7300968eb7" Error="io_timeout" ErrorDescription="A disk IO operation failed to complete in a timely manner" ErrorCode="1521" Reason="WorkerError" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x1414706 0x1412e77 0xaae7db 0xaaf6cb 0xaafa6b 0x45cd09 0xaaf175 0xaaf274 0xa9d233 0xa9d537 0x45cd09 0xaaa8cc 0xaaab58 0x4f03b8 0x4f06f7 0x45cd09 0xaddf20 0xade394 0x45cd09 0xaa370b 0xaa3aa0 0x4f03b8 0x4f06f7 0x45cd09 0x5fda60 0x52d540 0x1438902 0x43a4a7 0x7ffff70fb830" Machine="10.196.78.162:4500" LogGroup="default" />
<Event Severity="30" Time="1559081026.661524" Type="SlowTask" ID="0000000000000000" TaskID="7010" MClocks="804.704" Duration="0.365777" SampleRate="1" NumYields="0" Machine="10.196.78.162:4500" LogGroup="default" />
<Event Severity="40" Time="1559081027.029538" Type="DiskError" ID="0000000000000000" Error="io_error" ErrorDescription="Disk i/o operation failed" ErrorCode="1510" In="BtreeMovetoUnpacked" File="/var/lib/foundationdb/data/4500/storage-c010af68cb32eb528508a8dec5e8ce6c.sqlite" SQLiteError="disk I/O error" SQLiteErrorCode="266" UnixErrorCode="0" UnixError="Success" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x1414706 0x1412e77 0x6eb3d8 0x6eccca 0x6eeb53 0x6f4986 0x5f9989 0x5f7029 0x130a26f 0x7ffff71255d0" Machine="10.196.78.162:4500" LogGroup="default" />
<Event Severity="40" Time="1559081027.030633" Type="DiskError" ID="0000000000000000" Error="io_error" ErrorDescription="Disk i/o operation failed" ErrorCode="1510" In="BtreeMovetoUnpacked" File="/var/lib/foundationdb/data/4500/storage-c010af68cb32eb528508a8dec5e8ce6c.sqlite" SQLiteError="disk I/O error" SQLiteErrorCode="266" UnixErrorCode="0" UnixError="Success" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x1414706 0x1412e77 0x6eb3d8 0x6eccca 0x6ef2ce 0x6f5505 0x5f9989 0x5f7029 0x130a26f 0x7ffff71255d0" Machine="10.196.78.162:4500" LogGroup="default" />
<-- many DiskError events here -->
<Event Severity="40" Time="1559081027.159487" Type="DiskError" ID="0000000000000000" Error="io_error" ErrorDescription="Disk i/o operation failed" ErrorCode="1510" In="checkpoint" File="/var/lib/foundationdb/data/4500/storage-c010af68cb32eb528508a8dec5e8ce6c.sqlite" SQLiteError="disk I/O error" SQLiteErrorCode="1034" UnixErrorCode="0" UnixError="Success" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x1414706 0x1412e77 0x6eb3d8 0x6ed4bc 0x6f12b0 0x6f5ab7 0x5f9989 0x5f7029 0x130a26f 0x7ffff71255d0" Machine="10.196.78.162:4500" LogGroup="default" />
<Event Severity="30" Time="1559081027.159487" Type="SlowTask" ID="0000000000000000" TaskID="3010" MClocks="1460.33" Duration="0.663789" SampleRate="1" NumYields="0" Machine="10.196.78.162:4500" LogGroup="default" />
<Event Severity="40" Time="1559081027.827925" Type="WorkerServer" ID="0000000000000000" Error="io_timeout" ErrorDescription="A disk IO operation failed to complete in a timely manner" ErrorCode="1521" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x1414706 0x1412e77 0x5fd766 0x5fd987 0x45cd09 0xaa8952 0x52d540 0x1438902 0x43a4a7 0x7ffff70fb830" Machine="10.196.78.162:4500" LogGroup="default" />
<Event Severity="40" Time="1559081027.827925" Type="StopAfterError" ID="0000000000000000" Error="io_timeout" ErrorDescription="A disk IO operation failed to complete in a timely manner" ErrorCode="1521" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x1414706 0x1412e77 0x94cde7 0x94d0f7 0x45cd09 0x52db2d 0x45cd09 0xa9ebd2 0xa9efd7 0x45cd09 0x52db2d 0x45cd09 0x5fd783 0x5fd987 0x45cd09 0xaa8952 0x52d540 0x1438902 0x43a4a7 0x7ffff70fb830" Machine="10.196.78.162:4500" LogGroup="default" />
<Event Severity="30" Time="1559088187.263414" Type="SlowTask" ID="0000000000000000" TaskID="20000" MClocks="797.555" Duration="0.362527" SampleRate="1" NumYields="0" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<Event Severity="30" Time="1559090577.839791" Type="SlowTask" ID="0000000000000000" TaskID="3010" MClocks="538.897" Duration="0.244955" SampleRate="1" NumYields="0" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<Event Severity="30" Time="1559092517.167086" Type="SlowTask" ID="0000000000000000" TaskID="20000" MClocks="521.881" Duration="0.23722" SampleRate="1" NumYields="0" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS" />
<-- many SlowTask events here -->
<Event Severity="30" Time="1559124940.123616" Type="SlowTask" ID="0000000000000000" TaskID="20000" MClocks="643.005" Duration="0.292277" SampleRate="1" NumYields="0" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS,TL" />
<Event Severity="30" Time="1559127774.096001" Type="SlowTask" ID="0000000000000000" TaskID="20000" MClocks="516.917" Duration="0.234963" SampleRate="1" NumYields="1" Machine="10.196.78.162:4500" LogGroup="default" Roles="SS,TL" />
<Event Severity="30" Time="1558953235.430224" Type="MasterRecoveryDuration" ID="360e6d8d852b246f" RecoveryDuration="24.0081" Machine="10.196.78.162:4501" LogGroup="default" Roles="MS" TrackLatestType="Original" />
<Event Severity="30" Time="1559015525.363110" Type="MasterRecoveryDuration" ID="360c1d0737798bf5" RecoveryDuration="4.70905" Machine="10.196.78.162:4501" LogGroup="default" Roles="MS" TrackLatestType="Original" />
<Event Severity="30" Time="1559028259.936205" OriginalTime="1559015525.363110" Type="MasterRecoveryDuration" ID="360c1d0737798bf5" RecoveryDuration="4.70905" Machine="10.196.78.162:4501" LogGroup="default" Roles="MS" TrackLatestType="Rolled" />
<Event Severity="30" Time="1559048420.028981" OriginalTime="1559015525.363110" Type="MasterRecoveryDuration" ID="360c1d0737798bf5" RecoveryDuration="4.70905" Machine="10.196.78.162:4501" LogGroup="default" Roles="MS" TrackLatestType="Rolled" />
<Event Severity="30" Time="1559068530.291029" OriginalTime="1559015525.363110" Type="MasterRecoveryDuration" ID="360c1d0737798bf5" RecoveryDuration="4.70905" Machine="10.196.78.162:4501" LogGroup="default" Roles="MS" TrackLatestType="Rolled" />
<Event Severity="30" Time="1559069109.705699" Type="SlowTask" ID="0000000000000000" TaskID="10000" MClocks="573.565" Duration="0.260713" SampleRate="1" NumYields="0" Machine="10.196.78.162:4501" LogGroup="default" />
<Event Severity="30" Time="1559088650.493057" OriginalTime="1559015525.363110" Type="MasterRecoveryDuration" ID="360c1d0737798bf5" RecoveryDuration="4.70905" Machine="10.196.78.162:4501" LogGroup="default" Roles="MS" TrackLatestType="Rolled" />
<Event Severity="30" Time="1559108780.460537" OriginalTime="1559015525.363110" Type="MasterRecoveryDuration" ID="360c1d0737798bf5" RecoveryDuration="4.70905" Machine="10.196.78.162:4501" LogGroup="default" Roles="MS" TrackLatestType="Rolled" />
<Event Severity="30" Time="1559120177.562344" Type="Net2_LargePacket" ID="0000000000000000" SuppressedEventCount="0" FromPeer="10.196.78.151:50890" Length="2239879" Token="3521fa4bb2d68a23" Machine="10.196.78.162:4501" LogGroup="default" Roles="MP" />
<Event Severity="30" Time="1559120179.339329" Type="Net2_LargePacket" ID="0000000000000000" SuppressedEventCount="0" FromPeer="10.196.78.151:50890" Length="2239827" Token="3521fa4bb2d68a23" Machine="10.196.78.162:4501" LogGroup="default" Roles="MP" />
<Event Severity="30" Time="1559120477.936338" Type="Net2_LargePacket" ID="0000000000000000" SuppressedEventCount="0" FromPeer="10.196.78.151:50890" Length="2239847" Token="3521fa4bb2d68a23" Machine="10.196.78.162:4501" LogGroup="default" Roles="MP" />
...
(gaurav) #4

I would also like to understand the behavior of the SS when it encounters a disk error or IO timeout. The description of io_trust_seconds says that the SS will fail with an error after getting an IO error on a file:

io_trust_seconds: Time in seconds that a read or write operation is allowed to take before timing out with an error. If an operation times out, all future operations on that file will fail with an error as well. Only has an effect when using AsyncFileKAIO in Linux. If unset, defaults to 0 which means timeout is disabled.

But then what is the recovery mechanism from this state (both for the SS process and for the file that received the IO error)?

Is the file now completely corrupted, so that the SS has to throw it away? I have seen SS sqlite files upwards of 50 GB in size, so it seems like a lot of work to recover from an isolated disk error. Could these files be split into smaller ones so that less data is discarded (assuming a file becomes useless once it receives an IO error)?

How is the storage server process itself recovered? Does it kill itself on getting an IO error and rejoin the cluster, or is there a manual step to “recover” such an SS (and if so, what is that step)?

Are there any changes in version 6.1.8 that would make these issues handled differently?

(A.J. Beamon) #5

The fact that IO operations aren’t completing within 20 seconds does suggest that your disk may not be healthy.

I believe when you get io_error, the process will terminate whatever role got the error (e.g. the storage server) but keep the process itself running. For a storage server, this prevents re-recruitment on that process until it gets restarted.

In the case of io_timeout, I believe the process will terminate itself, and assuming you are using something like fdbmonitor to watch the process, it will then be restarted. It should be noted that io_timeout was mainly designed to work around a particular issue (outside of FDB) where a disk would claim to have written something that it did not actually write. The timeout here allowed us to avoid the circumstances leading to that issue.

The timeout in its current state was not designed as a general purpose mechanism to cause a process to fail when the disk is being non-responsive. There are various operations which are not monitored with the timeout, and there may also be a better choice for what to do when a process’s disk is hung rather than restarting immediately.

The file isn’t necessarily corrupted, though it’s possible that it could be. If the process restarts, it will try to use the file as normal. If any pages are corrupted such that their checksums are no longer valid, then an error would be thrown at the time that page is read, which could be in the distant future. I don’t believe the file would be deleted in this case.

There are classes of errors that wouldn’t be detected by checksums, such as the example I gave above where we saw a disk acknowledging writes that it hadn’t actually written. If you think there is a risk of such a problem on your disks, or if you want to be extra careful, you could remove the process yourself.

If you remove the process, then all the data on that process would have to be re-replicated elsewhere.

Not that I recall.

(gaurav) #6

Thanks for helping out with the queries; I have a few more follow-up questions:

Is there a way to detect this process state by looking at the status JSON (will this server still be reported as a recruited SS in status even after the role is terminated)? Or is there a log line that points to this state (a zombie process)? We want to be able to detect such situations automatically via scripts and raise alerts or take some other action.

What happens if a page is found to be corrupted by the SS? Is that page repaired in place using a copy from another replica, or will the entire file be discarded? Or is there another strategy for handling this situation?

(A.J. Beamon) #7

For a process that has entered this state and hasn’t been restarted, it will report this information through its messages field (cluster.processes.<ID>.messages). It will also show up next to the process entry in status details run from fdbcli, and even regular status will report that there exist processes with errors.
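
A minimal Python sketch of the kind of automated check described here, walking cluster.processes.&lt;ID&gt;.messages in the status JSON. The sample document below is synthetic (hand-written for illustration, not real fdbcli output), and the exact message names and fields may differ by version:

```python
import json

def processes_with_errors(status_json: str) -> dict:
    """Map process ID -> list of message names for processes reporting messages.

    Follows the documented layout cluster.processes.<ID>.messages; any
    process with a non-empty messages list is flagged."""
    status = json.loads(status_json)
    flagged = {}
    for proc_id, proc in status.get("cluster", {}).get("processes", {}).items():
        messages = proc.get("messages", [])
        if messages:
            flagged[proc_id] = [m.get("name", "unknown") for m in messages]
    return flagged

# Synthetic example: one healthy process, one reporting io_timeout.
sample = json.dumps({
    "cluster": {"processes": {
        "c010af68cb32eb52": {"messages": [
            {"name": "io_timeout",
             "description": "A disk IO operation failed to complete in a timely manner"}]},
        "aaaabbbbccccdddd": {"messages": []},
    }}
})
print(processes_with_errors(sample))  # {'c010af68cb32eb52': ['io_timeout']}
```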

You can also look for Severity 40 events on the process. I believe most errors besides io_timeout and please_reboot that bubble out will result in the process being left running but the role terminating.
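
A corresponding sketch for scanning trace files for Severity 40 events. FDB trace files contain one &lt;Event .../&gt; element per line, as in the excerpts in this thread (the sample lines below are abbreviated versions of the real events above):

```python
import xml.etree.ElementTree as ET

def severity40_events(lines):
    """Yield (Time, Type, Error) for each Severity-40 trace event.

    Each trace line is a self-contained XML <Event .../> element;
    lines that are not events or fail to parse are skipped."""
    for line in lines:
        line = line.strip()
        if not line.startswith("<Event"):
            continue
        try:
            event = ET.fromstring(line)
        except ET.ParseError:
            continue
        if event.get("Severity") == "40":
            yield event.get("Time"), event.get("Type"), event.get("Error")

# Abbreviated sample lines based on the excerpts in this thread:
sample = [
    '<Event Severity="30" Time="1559081026.5" Type="SlowTask" />',
    '<Event Severity="40" Time="1559081026.66" Type="StorageServerFailed" Error="io_timeout" />',
]
print(list(severity40_events(sample)))
# [('1559081026.66', 'StorageServerFailed', 'io_timeout')]
```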

I think in this case, the process just throws an error (maybe io_error?) and behaves like it got any other error. No attempt is made to fix the file or destroy it.

(gaurav) #8

But then what is the process for addressing this situation? Is explicitly excluding the process and clearing its files the only way to fix a corrupt page on disk?

(A.J. Beamon) #9

Typically, you’d probably just remove the process/host from the cluster. Then you can do whatever you want with the host to check it for problems before reusing it or throwing out the disk, etc.

Your safest bet is probably just to shut the process off and let the cluster heal from the other replicas. If you instead exclude the process and it restarts (thus clearing the error state), it could serve as a source for re-replication; if there were corruption in the file that didn’t result in bad checksums, you could potentially spread it.
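
Concretely, the safer sequence might look something like this (a sketch only; the address is the one from this thread, and how you stop a single fdbserver depends on whether you use fdbmonitor, systemd, etc.):

```
# Stop the process first so it cannot serve as a re-replication source,
# then remove it from the cluster and wait for data to heal elsewhere.
$ sudo service foundationdb stop      # or stop just this fdbserver via your fdbmonitor config
$ fdbcli
fdb> exclude 10.196.78.162:4500
fdb> status details                   # confirm the cluster has fully re-replicated
```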

If it turns out you do need the data that was in the file, you could potentially try to recover it, though there aren’t any tools that exist to help with that (probably due to lack of need up to this point).

(gaurav) #10

Thanks for being so patient with the questions. This has been very helpful!