Foundationdb seems keep reinitialising itself after client hang

(ppggff) #1

We use foundationdb 6.0.15 as storage server for our go program.

It’s a very very basic config:

  • two machines in this cluster (192.168.111.55, 192.168.111.40)
  • one fdb process per machine, and there is no role config for it
  • two coordinator, storage config is: double ssd
  • data located in a 40GB disk (when problem happens, 37GB is used)

Today our go program hang with following stack:

(gdb) #0  runtime.cgocall (arg=0xc00005a770, fn=0xabb7e0 <_cgo_ce7ede6d4d1e_Cfunc_fdb_run_network>, ~r2=<optimized out>) at /usr/lib/golang/src/runtime/cgocall.go:130
#1  0x0000000000a992c9 in github.com/apple/foundationdb/bindings/go/src/fdb._Cfunc_fdb_run_network (r1=0) at _cgo_gotypes.go:415
#2  0x0000000000aa01e6 in github.com/apple/foundationdb/bindings/go/src/fdb.startNetwork.func1 () at /code/server/src/github.com/apple/foundationdb/bindings/go/src/fdb/fdb.go:214
#3  0x00000000004601a1 in runtime.goexit () at /usr/lib/golang/src/runtime/asm_amd64.s:1333
#4  0x0000000000000000 in ?? ()

(gdb) #0  runtime.gopark (lock=0x14f7660 <runtime.semtable+7552>, reason=17 '\021', traceEv=25 '\031', traceskip=4, unlockf=<optimized out>) at /usr/lib/golang/src/runtime/proc.go:303
#1  0x0000000000433bd3 in runtime.goparkunlock (lock=<optimized out>, reason=<optimized out>, traceEv=<optimized out>, traceskip=<optimized out>) at /usr/lib/golang/src/runtime/proc.go:308
#2  0x00000000004449fc in runtime.semacquire1 (addr=0xc0007415b4, lifo=false, profile=3) at /usr/lib/golang/src/runtime/sema.go:144
#3  0x000000000044472d in sync.runtime_SemacquireMutex (addr=<optimized out>, lifo=<optimized out>) at /usr/lib/golang/src/runtime/sema.go:71
#4  0x000000000047559f in sync.(*Mutex).Lock (m=0xc0007415b0) at /usr/lib/golang/src/sync/mutex.go:134
#5  0x0000000000a9c859 in github.com/apple/foundationdb/bindings/go/src/fdb.fdb_future_block_until_ready (f=0x7fddc401e8e0) at /code/server/src/github.com/apple/foundationdb/bindings/go/src/fdb/futures.go:100
#6  0x0000000000a9c89b in github.com/apple/foundationdb/bindings/go/src/fdb.future.BlockUntilReady (f=...) at /code/server/src/github.com/apple/foundationdb/bindings/go/src/fdb/futures.go:104
#7  0x0000000000a9cbb2 in github.com/apple/foundationdb/bindings/go/src/fdb.futureNil.Get (f=..., ~r0=...) at /code/server/src/github.com/apple/foundationdb/bindings/go/src/fdb/futures.go:244
#8  0x0000000000aa1efc in github.com/apple/foundationdb/bindings/go/src/fdb.(*futureNil).Get (~r0=...) at <autogenerated>:1
#9  0x0000000000aa006c in github.com/apple/foundationdb/bindings/go/src/fdb.Database.Transact.func1 (e=..., ret=...) at /code/server/src/github.com/apple/foundationdb/bindings/go/src/fdb/database.go:139
#10 0x0000000000a9adeb in github.com/apple/foundationdb/bindings/go/src/fdb.retryable (e=..., onError=
    {void (github.com/apple/foundationdb/bindings/go/src/fdb.Error, github.com/apple/foundationdb/bindings/go/src/fdb.FutureNil *)} 0xc01f8b36e0, ret=..., wrapped={void (interface {} *, error *)} 0xc01f8b36d8)
    at /code/server/src/github.com/apple/foundationdb/bindings/go/src/fdb/database.go:87
#11 0x0000000000a9af72 in github.com/apple/foundationdb/bindings/go/src/fdb.Database.Transact (d=..., f={void (github.com/apple/foundationdb/bindings/go/src/fdb.Transaction, interface {} *, error *)} 0xc01f8b3738, ~r1=..., ~r2=...)
    at /code/server/src/github.com/apple/foundationdb/bindings/go/src/fdb/database.go:145
...

(gdb) #0  runtime.gopark (lock=0x14f7ea0 <runtime.semtable+9664>, reason=17 '\021', traceEv=25 '\031', traceskip=4, unlockf=<optimized out>) at /usr/lib/golang/src/runtime/proc.go:303
#1  0x0000000000433bd3 in runtime.goparkunlock (lock=<optimized out>, reason=<optimized out>, traceEv=<optimized out>, traceskip=<optimized out>) at /usr/lib/golang/src/runtime/proc.go:308
#2  0x00000000004449fc in runtime.semacquire1 (addr=0xc04a854834, lifo=false, profile=3) at /usr/lib/golang/src/runtime/sema.go:144
#3  0x000000000044472d in sync.runtime_SemacquireMutex (addr=<optimized out>, lifo=<optimized out>) at /usr/lib/golang/src/runtime/sema.go:71
#4  0x000000000047559f in sync.(*Mutex).Lock (m=0xc04a854830) at /usr/lib/golang/src/sync/mutex.go:134
#5  0x0000000000a9c859 in github.com/apple/foundationdb/bindings/go/src/fdb.fdb_future_block_until_ready (f=0x7fdde80f17a0) at /code/server/src/github.com/apple/foundationdb/bindings/go/src/fdb/futures.go:100
#6  0x0000000000a9c89b in github.com/apple/foundationdb/bindings/go/src/fdb.future.BlockUntilReady (f=...) at /code/server/src/github.com/apple/foundationdb/bindings/go/src/fdb/futures.go:104
#7  0x0000000000aa09fe in github.com/apple/foundationdb/bindings/go/src/fdb.(*futureByteSlice).Get.func1 () at /code/server/src/github.com/apple/foundationdb/bindings/go/src/fdb/futures.go:147
#8  0x0000000000475903 in sync.(*Once).Do (f={void ()} 0xc07ad7b1d0, o=0xc01d481f70) at /usr/lib/golang/src/sync/once.go:44
#9  0x0000000000a9c998 in github.com/apple/foundationdb/bindings/go/src/fdb.(*futureByteSlice).Get (f=<optimized out>, ~r0=..., ~r1=...) at /code/server/src/github.com/apple/foundationdb/bindings/go/src/fdb/futures.go:142
#10 0x0000000000a9ca0b in github.com/apple/foundationdb/bindings/go/src/fdb.(*futureByteSlice).MustGet (f=<optimized out>, ~r0=...) at /code/server/src/github.com/apple/foundationdb/bindings/go/src/fdb/futures.go:164
#11 0x0000000000aad9d5 in server/storage/fdb.(*FDB).Get.func1 (err=..., ret=..., tr=...) at /code/server/src/server/storage/fdb/fdb.go:67
#12 0x0000000000aa0129 in github.com/apple/foundationdb/bindings/go/src/fdb.Database.ReadTransact.func1 (e=..., ret=...) at /code/server/src/github.com/apple/foundationdb/bindings/go/src/fdb/database.go:176
#13 0x0000000000a9adeb in github.com/apple/foundationdb/bindings/go/src/fdb.retryable (e=..., onError=
    {void (github.com/apple/foundationdb/bindings/go/src/fdb.Error, github.com/apple/foundationdb/bindings/go/src/fdb.FutureNil *)} 0xc07ad7b328, ret=..., wrapped={void (interface {} *, error *)} 0xc07ad7b320)
    at /code/server/src/github.com/apple/foundationdb/bindings/go/src/fdb/database.go:87
#14 0x0000000000a9b0b2 in github.com/apple/foundationdb/bindings/go/src/fdb.Database.ReadTransact (d=..., f={void (github.com/apple/foundationdb/bindings/go/src/fdb.ReadTransaction, interface {} *, error *)} 0xc07ad7b380, ~r1=...,
    ~r2=...) at /code/server/src/github.com/apple/foundationdb/bindings/go/src/fdb/database.go:185
...

And then foundationdb seems keep reinitialising itself.

  1. fdbcli cannot show status, it says: Initializing new transaction servers and recovering transaction logs.

  2. there are several trace logs, they all begin with role recruitment; and it keep create new trace log

This is one trace log file: 55a.tar.gz

And there are some problem about it:

  • Why the fdb try to reinit?
  • Why the fdb process still the same one?
  • There are connection failed, SlowTask, SlowSSLoopx100 in the log. Is this OK?
(ppggff) #2

Try to restart foundationdb, find following error in log file:

...
<Event Severity="20" Time="1560217744.565075" Type="GetTLogTeamFailed" ID="0000000000000000" Policy="zoneid^2 x 1" Processes="1" Workers="1" FitnessGroups="1" TLogZones="bd1c6112828f22c3b3907350b48935b6" TLogDataHalls="[unset]" MissingZones="[no items]" MissingDataHalls="[no items]" Required="2" DesiredLogs="3" RatingTests="200" CheckStable="0" PolicyGenerations="100" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x1414706 0x557bfb 0x55d228 0x4f175e 0x4f1955 0x4ff3e2 0x584fb9 0x1352bf8 0x1352e49 0x52d540 0x1438902 0x43a4a7 0x7ff42b1513d5" Machine="192.168.111.55:4500" LogGroup="default" Roles="CC,MS,SS,TL" />
<Event Severity="20" Time="1560217744.565075" Type="RecruitFromConfigurationNotAvailable" ID="3336c81737335915" Error="no_more_servers" ErrorDescription="Not enough physical servers available" ErrorCode="1008" Machine="192.168.111.55:4500" LogGroup="default" Roles="CC,MS,SS,TL" />
<Event Severity="20" Time="1560217744.970812" Type="GWFTADTooFew" ID="3336c81737335915" Fitness="2" Processes="1" Required="2" TLogPolicy="zoneid^2 x 1" DesiredLogs="3" Machine="192.168.111.55:4500" LogGroup="default" Roles="CC,MS,SS,TL" />
<Event Severity="20" Time="1560217744.970812" Type="GetTLogTeamFailed" ID="0000000000000000" Policy="zoneid^2 x 1" Processes="1" Workers="1" FitnessGroups="1" TLogZones="bd1c6112828f22c3b3907350b48935b6" TLogDataHalls="[unset]" MissingZones="[no items]" MissingDataHalls="[no items]" Required="2" DesiredLogs="3" RatingTests="200" CheckStable="0" PolicyGenerations="100" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x1414706 0x557bfb 0x55d228 0x4f14ec 0x4f3aca 0x4f3e90 0x52d540 0x1438902 0x43a4a7 0x7ff42b1513d5" Machine="192.168.111.55:4500" LogGroup="default" Roles="CC,MS,SS,TL" />
<Event Severity="20" Time="1560217744.970812" Type="RecruitTLogMatchingSetNotAvailable" ID="3336c81737335915" Error="no_more_servers" ErrorDescription="Not enough physical servers available" ErrorCode="1008" Machine="192.168.111.55:4500" LogGroup="default" Roles="CC,MS,SS,TL" />
<Event Severity="10" Time="1560217745.454973" Type="CCWDB" ID="3336c81737335915" Watching="44052dd1c047b6b1" Machine="192.168.111.55:4500" LogGroup="default" Roles="CC,MS,SS,TL" />
...
<Event Severity="10" Time="1560217747.040182" Type="AsyncFileKAIOAllocateError" ID="0000000000000000" Fd="33" Filename="/data0/foundationdb/data/4500/logqueue-b947657af02f2443ed9d45153a854be7-0.fdq" UnixErrorCode="1c" UnixError="No space left on device" Machine="192.168.111.55:4500" LogGroup="default" Roles="CC,MS,SS,TL" />
...
<Event Severity="10" Time="1560217747.040182" Type="TLogError" ID="b947657af02f2443" Error="io_error" ErrorDescription="Disk i/o operation failed" ErrorCode="1510" Machine="192.168.111.55:4500" LogGroup="default" Roles="CC,MS,SS,TL" />
...
<Event Severity="10" Time="1560217747.040182" Type="Role" ID="a722f0d38df1b75f" Transition="End" As="TLog" Reason="Error" Machine="192.168.111.55:4500" LogGroup="default" Roles="CC,MS,SS,TL" TrackLatestType="Original" />
<Event Severity="10" Time="1560217747.040182" Type="Role" ID="7b871750f758771b" Transition="End" As="TLog" Reason="Error" Machine="192.168.111.55:4500" LogGroup="default" Roles="CC,MS,SS,TL" TrackLatestType="Original" />
<Event Severity="10" Time="1560217747.040182" Type="Role" ID="3ed63878f6a57695" Transition="End" As="TLog" Reason="Error" Machine="192.168.111.55:4500" LogGroup="default" Roles="CC,MS,SS,TL" TrackLatestType="Original" />
<Event Severity="10" Time="1560217747.040182" Type="Role" ID="093a49bb570812fe" Transition="End" As="TLog" Reason="Error" Machine="192.168.111.55:4500" LogGroup="default" Roles="CC,MS,SS,TL" TrackLatestType="Original" />
<Event Severity="10" Time="1560217747.040182" Type="Role" ID="b947657af02f2443" Error="io_error" ErrorDescription="Disk i/o operation failed" ErrorCode="1510" Transition="End" As="SharedTLog" Reason="Error" Machine="192.168.111.55:4500" LogGroup="default" Roles="CC,MS,SS,TL" TrackLatestType="Original" />
<Event Severity="40" Time="1560217747.040182" Type="SharedTLogFailed" ID="b947657af02f2443" Error="io_error" ErrorDescription="Disk i/o operation failed" ErrorCode="1510" Reason="Error" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x1414706 0x1412e77 0xaae7db 0xaaf0fc 0xaaf274 0xa9d233 0xa9d537 0x45cd09 0xaa3c10 0xaa3ded 0xaa3ec0 0x4f03b8 0x4f06f7 0x45cd09 0xaa3c10 0xaa3ded 0xaa3ec0 0x4f03b8 0x4f06f7 0x45cd09 0x8dae32 0x8daff8 0x45cd09 0x1430275 0x1430aad 0x1430f38 0x8d71f8 0x8d8075 0x52d540 0x6de2dc 0x8e1697 0x8e1976 0x8e1e1a 0x8e8323 0x4654f0 0x72814b 0x7289d6 0x4654f0 0x4654f0 0x724348 0x724808 0x724a45 0x4654f0 0x722cc1 0x723970 0x4654f0 0x462c58 0x4654f0 0x7269ba" Machine="192.168.111.55:4500" LogGroup="default" Roles="CC,MS,SS,TL" />
<Event Severity="10" Time="1560217747.040182" Type="Role" ID="fa97b60cf7a2cbae" Transition="End" As="TLog" Reason="Error" Machine="192.168.111.55:4500" LogGroup="default" Roles="CC,MS,SS,TL" TrackLatestType="Original" />
<Event Severity="10" Time="1560217747.040182" Type="SlowTask" ID="0000000000000000" TaskID="9000" MClocks="324.273" Duration="0.147399" SampleRate="1" NumYields="0" Machine="192.168.111.55:4500" LogGroup="default" Roles="CC,MS,SS,TL" />
<Event Severity="10" Time="1560217747.188729" Type="UpdateChangedDatacenter" ID="3336c81737335915" CCFirst="0" Machine="192.168.111.55:4500" LogGroup="default" Roles="CC,MS,SS,TL" />
<Event Severity="10" Time="1560217747.188729" Type="MasterTerminated" ID="44052dd1c047b6b1" Error="master_recovery_failed" ErrorDescription="Master recovery failed" ErrorCode="1203" Machine="192.168.111.55:4500" LogGroup="default" Roles="CC,MS,SS,TL" />
<Event Severity="10" Time="1560217747.188729" Type="Role" ID="44052dd1c047b6b1" Error="success" ErrorDescription="Success" ErrorCode="0" Transition="End" As="MasterServer" Reason="Error" Machine="192.168.111.55:4500" LogGroup="default" Roles="CC,MS,SS,TL" TrackLatestType="Original" />
...
<Event Severity="20" Time="1560217747.191931" Type="DetectedFailedMaster" ID="3336c81737335915" OldMaster="44052dd1c047b6b1" Machine="192.168.111.55:4500" LogGroup="default" Roles="CC,SS,TL" />
...
<Event Severity="10" Time="1560217747.193482" Type="CommitDummyTransactionError" ID="0000000000000000" Error="commit_unknown_result" ErrorDescription="Transaction may or may not have committed" ErrorCode="1021" Key="\xff/SC/^\xafU\x95\x97\xae\x917\xa8P\x93\x1dj\x8eMU" Retries="0" Machine="192.168.111.55:4500" LogGroup="default" Roles="CC,SS,TL" />
...
<Event Severity="10" Time="1560217747.206125" Type="Role" ID="162ba39ee2d2469c" Transition="End" As="TLog" Reason="Error" Machine="192.168.111.55:4500" LogGroup="default" Roles="CC,SS,TL" TrackLatestType="Original" />
<Event Severity="10" Time="1560217747.206125" Type="TLogBytesWhenRemoved" ID="162ba39ee2d2469c" SharedBytesInput="147" SharedBytesDurable="147" LocalBytesInput="147" LocalBytesDurable="0" Machine="192.168.111.55:4500" LogGroup="default" Roles="CC,SS" />
<Event Severity="10" Time="1560217747.206125" Type="TLogError" ID="ea36c6c9de0ae15f" Error="worker_removed" ErrorDescription="Normal worker shut down" ErrorCode="1202" Machine="192.168.111.55:4500" LogGroup="default" Roles="CC,SS" />
<Event Severity="10" Time="1560217747.206125" Type="Role" ID="ea36c6c9de0ae15f" Transition="End" As="SharedTLog" Reason="Error" Machine="192.168.111.55:4500" LogGroup="default" Roles="CC,SS" TrackLatestType="Original" />
<Event Severity="10" Time="1560217747.206125" Type="KVClose" ID="ea36c6c9de0ae15f" Del="1" Machine="192.168.111.55:4500" LogGroup="default" Roles="CC,SS" />
<Event Severity="10" Time="1560217747.206125" Type="WorkPool_Stop" ID="0000000000000000" Workers="1" Idle="0" Work="7" Machine="192.168.111.55:4500" LogGroup="default" Roles="CC,SS" />
<Event Severity="10" Time="1560217747.206125" Type="WorkPool_Stop" ID="0000000000000000" Workers="64" Idle="64" Work="0" Machine="192.168.111.55:4500" LogGroup="default" Roles="CC,SS" />
<Event Severity="10" Time="1560217747.206125" Type="DQDestroy" ID="ea36c6c9de0ae15f" LastPoppedSeq="0" PoppedSeq="0" NextPageSeq="4096" File0Name="/data0/foundationdb/data/4500/logqueue-ea36c6c9de0ae15f92397ffc8a371be3-1.fdq" Machine="192.168.111.55:4500" LogGroup="default" Roles="CC,SS" />
<Event Severity="10" Time="1560217747.206125" Type="DiskQueueShutdownDeleting" ID="ea36c6c9de0ae15f" File0="/data0/foundationdb/data/4500/logqueue-ea36c6c9de0ae15f92397ffc8a371be3-0.fdq" File1="/data0/foundationdb/data/4500/logqueue-ea36c6c9de0ae15f92397ffc8a371be3-1.fdq" Machine="192.168.111.55:4500" LogGroup="default" Roles="CC,SS" />
<Event Severity="10" Time="1560217747.206125" Type="TLogTerminated" ID="ea36c6c9de0ae15f" Error="worker_removed" ErrorDescription="Normal worker shut down" ErrorCode="1202" Machine="192.168.111.55:4500" LogGroup="default" Roles="CC,SS" />
<Event Severity="10" Time="1560217747.206125" Type="Role" ID="ea36c6c9de0ae15f" Error="success" ErrorDescription="Success" ErrorCode="0" Transition="End" As="SharedTLog" Reason="Error" Machine="192.168.111.55:4500" LogGroup="default" Roles="CC,SS" TrackLatestType="Original" />
...
<Event Severity="10" Time="1560217747.297575" Type="AsyncFileKAIOOpenFailed" ID="0000000000000000" Error="file_not_found" ErrorDescription="File not found" ErrorCode="1511" Filename="/data0/foundationdb/data/4500/logqueue-2312d0181508998cebc551c46c7e6bfd-0.fdq" Flags="70002" OSFlags="4002" Mode="02000" UnixErrorCode="2" UnixError="No such file or directory" Machine="192.168.111.55:4500" LogGroup="default" Roles="CC,SS" />
<Event Severity="10" Time="1560217747.297575" Type="AsyncFileKAIOOpenFailed" ID="0000000000000000" Error="file_not_found" ErrorDescription="File not found" ErrorCode="1511" Filename="/data0/foundationdb/data/4500/logqueue-2312d0181508998cebc551c46c7e6bfd-1.fdq" Flags="70002" OSFlags="4002" Mode="02000" UnixErrorCode="2" UnixError="No such file or directory" Machine="192.168.111.55:4500" LogGroup="default" Roles="CC,SS" />
...
<Event Severity="10" Time="1560217747.300605" Type="AsyncFileKAIOOpenFailed" ID="0000000000000000" Error="file_not_found" ErrorDescription="File not found" ErrorCode="1511" Filename="/data0/foundationdb/data/4500/log-2312d0181508998cebc551c46c7e6bfd.sqlite" Flags="70002" OSFlags="4002" Mode="02000" UnixErrorCode="2" UnixError="No such file or directory" Machine="192.168.111.55:4500" LogGroup="default" Roles="CC,MP,SS" />
<Event Severity="10" Time="1560217747.300605" Type="AFCUnderlyingOpenBegin" ID="0000000000000000" Filename="/data0/foundationdb/data/4500/log-2312d0181508998cebc551c46c7e6bfd.sqlite-wal" Machine="192.168.111.55:4500" LogGroup="default" Roles="CC,MP,SS" />
<Event Severity="10" Time="1560217747.300605" Type="AsyncFileKAIOOpenFailed" ID="0000000000000000" Error="file_not_found" ErrorDescription="File not found" ErrorCode="1511" Filename="/data0/foundationdb/data/4500/log-2312d0181508998cebc551c46c7e6bfd.sqlite-wal" Flags="70002" OSFlags="4002" Mode="02000" UnixErrorCode="2" UnixError="No such file or directory" Machine="192.168.111.55:4500" LogGroup="default" Roles="CC,MP,SS" />
...
<Event Severity="10" Time="1560217747.354972" Type="AllAlternativesFailed" ID="0000000000000000" Interval="1" Alternatives="6aeca351f4d3f883,1b955d1d60c2f8e0" Delay="0.05" Machine="192.168.111.55:4500" LogGroup="default" Roles="CC,MP,SS,TL" />
...
<Event Severity="10" Time="1560217748.361637" Type="FKBlockFail" ID="dbcb31c184ff8409" Error="future_version" ErrorDescription="Request for future version" ErrorCode="1009" SuppressedEventCount="0" FKID="d77d6b8ae9d58a17" Machine="192.168.111.55:4500" LogGroup="default" Roles="CC,MP,SS,TL" />
...
<Event Severity="10" Time="1560217751.431876" Type="AsyncFileKAIOAllocateError" ID="0000000000000000" Fd="26" Filename="/data0/foundationdb/data/4500/storage-1b955d1d60c2f8e004286e163f577afe.sqlite" UnixErrorCode="1c" UnixError="No space left on device" Machine="192.168.111.55:4500" LogGroup="default" Roles="CC,MP,SS,TL" />
<Event Severity="10" Time="1560217751.431876" Type="AsyncFileKAIOAllocateError" ID="0000000000000000" Fd="26" Filename="/data0/foundationdb/data/4500/storage-1b955d1d60c2f8e004286e163f577afe.sqlite" UnixErrorCode="1c" UnixError="No space left on device" Machine="192.168.111.55:4500" LogGroup="default" Roles="CC,MP,SS,TL" />
...

It seems this problem is caused by ‘No space left on device’. I will try to extend the disk and try again.

(ppggff) #3

The db recovered after extending the disk.

(Alex Miller) #4

I’ve filed an issue for what you’ve seen, as you should have had FDB throttle your clients down to zero transactions, but still technically be available.