Dedicated master (stateless) constantly getting terminated

This is 5.2.5 code and we have a dedicated process (4550) that’s running the master, it seems like the cluster is never able to complete reinit, it seems like we are getting termination due to failure client monitor

<Event Severity="10" Time="1530669355.484019" OriginalTime="1530669157.207156" Type="Role" Machine="10.42.0.32:4550" ID="87923f606a83079d" As="MasterServer" Transition="Begin" Origination="Recruited" OnWorker="a44e1ee2af8c7ecd" logGroup="default" TrackLatestType="Rolled"/>
<Event Severity="10" Time="1530669355.484019" OriginalTime="1530669260.481117" Type="DDTrackerStarting" Machine="10.42.0.32:4550" ID="87923f606a83079d" State="Inactive" logGroup="default" TrackLatestType="Rolled"/>
<Event Severity="10" Time="1530669355.484019" OriginalTime="1530669345.485189" Type="MovingData" Machine="10.42.0.32:4550" ID="87923f606a83079d" InFlight="1061" InQueue="273408" AverageShardSize="-1" LowPriorityRelocations="274469" HighPriorityRelocations="0" HighestPriority="110" BytesWritten="0" logGroup="default" TrackLatestType="Rolled"/>
<Event Severity="10" Time="1530669355.484019" OriginalTime="1530669347.881669" Type="RkUpdate" Machine="10.42.0.32:4550" ID="0000000000000000" TPSLimit="1.69711e+09" Reason="4" ReasonServerID="21c9407514677a66" ReleasedTPS="4437.95" TPSBasis="4437.95" StorageServers="0" Proxies="10" TLogs="10" WorstFreeSpaceStorageServer="9223372036854775807" WorstFreeSpaceTLog="1952452153539" WorstStorageServerQueue="0" LimitingStorageServerQueue="0" WorstTLogQueue="17562597" TotalDiskUsageBytes="16929191216" WorstStorageServerVersionLag="0" LimitingStorageServerVersionLag="0" logGroup="default" TrackLatestType="Rolled"/>
<Event Severity="10" Time="1530669355.484019" OriginalTime="1530668999.653329" Type="StorageServerRecruitment" Machine="10.42.0.32:4550" ID="14a87d1b85115b11" State="Idle" logGroup="default" TrackLatestType="Rolled"/>
<Event Severity="10" Time="1530669355.484019" OriginalTime="1530669349.696709" Type="StorageServerRecruitment" Machine="10.42.0.32:4550" ID="87923f606a83079d" State="Idle" logGroup="default" TrackLatestType="Rolled"/>
<Event Severity="10" Time="1530669355.484019" OriginalTime="1530669349.696709" Type="TotalDataInFlight" Machine="10.42.0.32:4550" ID="87923f606a83079d" TotalBytes="0" UnhealthyServers="73" logGroup="default" TrackLatestType="Rolled"/>
<Event Severity="10" Time="1530669355.484019" OriginalTime="1530669260.481117" Type="DDInitGotInitialDD" Machine="10.42.0.32:4550" ID="87923f606a83079d" b="\xff/keyServers/" e="\xff\xff" src="b34eb37a3a8446e63301089a7774275a,ffd409739f4c223cc53812fd2d219a0f" dest="[no items]" logGroup="default" TrackLatestType="Rolled"/>
<Event Severity="10" Time="1530669355.484019" OriginalTime="1530669355.462430" Type="MachineMetrics" Machine="10.42.0.32:4550" ID="0000000000000000" Elapsed="9.32917" MbpsSent="53.0264" MbpsReceived="22.4124" OutSegs="308490" RetransSegs="44" CPUSeconds="0.389609" TotalMemory="515695927296" CommittedMemory="207401893888" AvailableMemory="308294033408" ZoneID="2ad413f2872649bc74f98f7b90d04cc8" MachineID="2ad413f2872649bc74f98f7b90d04cc8" logGroup="default" TrackLatestType="Rolled"/>
<Event Severity="30" Time="1530669355.484019" OriginalTime="1530669257.544068" Type="MasterRecoveryDuration" Machine="10.42.0.32:4550" ID="87923f606a83079d" recoveryDuration="100.337" logGroup="default" TrackLatestType="Rolled"/>
<Event Severity="10" Time="1530669355.484019" OriginalTime="1530669257.544068" Type="MasterRecoveryState" Machine="10.42.0.32:4550" ID="87923f606a83079d" StatusCode="11" Status="fully_recovered" storeType="2" recoveryDuration="100.337" logGroup="default" TrackLatestType="Rolled"/>
<Event Severity="10" Time="1530669355.484019" OriginalTime="1530669355.462430" Type="ProcessMetrics" Machine="10.42.0.32:4550" ID="0000000000000000" Elapsed="9.32917" CPUSeconds="9.332" MainThreadCPUSeconds="9.328" UptimeSeconds="1707.35" Memory="5032357888" ResidentMemory="4923228160" MbpsSent="7.81985" MbpsReceived="18.2806" DiskTotalBytes="12422639616" DiskFreeBytes="6803271680" DiskQueueDepth="0" DiskIdleSeconds="9.32917" DiskReads="0" DiskWrites="28" DiskReadsCount="855674" DiskWritesCount="45843322" DiskWriteSectors="488" DiskReadSectors="0" FileWrites="0" FileReads="0" CacheReadBytes="0" CacheFinds="0" CacheWritesBlocked="0" CacheReadsBlocked="0" CachePageReadsMerged="0" CacheWrites="0" CacheReads="0" ZoneID="2ad413f2872649bc74f98f7b90d04cc8" MachineID="2ad413f2872649bc74f98f7b90d04cc8" AIO_SubmitCount="0" AIO_CollectCount="0" AIO_SubmitLagMS="0" AIO_DiskStallMS="0" CurrentConnections="187" ConnectionsEstablished="4.18044" ConnectionsClosed="4.18044" ConnectionErrors="0" logGroup="default" TrackLatestType="Rolled"/>
<Event Severity="10" Time="1530669355.484019" OriginalTime="1530667648.107355" Type="ProgramStart" Machine="10.42.0.32:4550" ID="0000000000000000" RandomSeed="304576861" SourceVersion="994142624e33d7a1810502e5ac8e006daefccb59" Version="5.2.6-PRERELEASE" PackageName="5.2" FileSystem="" DataFolder="/mnt/fdb/4550" WorkingDirectory="/" ClusterFile="/etc/foundationdb/fdb.cluster" ConnectionString="telemetry:VpxIgeVh1ENTGPh9BdY2KIj5tRm3xRNI@10.42.0.9:4500,10.42.0.14:4500,10.42.0.22:4500" ActualTime="1530667648" CommandLine="/usr/sbin/fdbserver --class=stateless --cluster_file=/etc/foundationdb/fdb.cluster --datadir=/mnt/fdb/4550 --listen_address=public --logdir=/mnt/logs/fdb --memory=16GiB --public_address=auto:4550" BuggifyEnabled="0" MemoryLimit="17179869184" logGroup="default" TrackLatestType="Rolled"/>
<Event Severity="10" Time="1530669355.484019" OriginalTime="1530669156.340126" Type="ProxyMetrics" Machine="10.42.0.32:4550" ID="0ff5092d42139205" Elapsed="5.00008" txnStartIn="56.399 167.718 23523" txnStartOut="56.399 167.756 23523" txnStartBatch="9.59984 28.5478 3174" txnSystemPriorityStartIn="0 0 12" txnSystemPriorityStartOut="0 0 12" txnBatchPriorityStartIn="0.199997 1.98395 4" txnBatchPriorityStartOut="0.199997 1.98481 4" txnDefaultPriorityStartIn="56.1991 122.856 23507" txnDefaultPriorityStartOut="56.1991 122.853 23507" txnCommitIn="0 0 5" txnCommitVersionAssigned="0 0 5" txnCommitResolving="0 0 5" txnCommitResolved="0 0 5" txnCommitOut="0 0 5" txnCommitOutSuccess="0 0 5" txnConflicts="0 0 0" commitBatchIn="0.199997 0.00755149 73" commitBatchOut="0 0 72" mutationBytes="0 0 52367" mutations="0 0 6" conflictRanges="0 0 16" lastAssignedCommitVersion="10827149834255" version="10827149834255" committedVersion="10827149834255" logGroup="default" TrackLatestType="Rolled"/>
<Event Severity="10" Time="1530669355.484019" OriginalTime="1530669254.868627" Type="MasterRecoveredConfig" Machine="10.42.0.32:4550" ID="87923f606a83079d" conf="logs=10;proxies=10;redundancy_mode=double;resolvers=10;storage_engine=ssd-2" logGroup="default" TrackLatestType="Rolled"/>
<Event Severity="10" Time="1530669355.484019" OriginalTime="1530667648.111248" Type="Role" Machine="10.42.0.32:4550" ID="a44e1ee2af8c7ecd" As="Worker" Transition="Begin" Origination="Recruited" OnWorker="a44e1ee2af8c7ecd" DataFolder="/mnt/fdb/4550" Locality="zoneid=2ad413f2872649bc74f98f7b90d04cc8 processid=ea7c14ccf4dac38a4ee07cfa982f13db machineid=2ad413f2872649bc74f98f7b90d04cc8" StoresPresent="0" logGroup="default" TrackLatestType="Rolled"/>
<Event Severity="20" Time="1530669355.486430" Type="N2_WriteError" Machine="10.42.0.32:4550" ID="41ab14e8cd37b99a" Message="32" SuppressedEventCount="137" logGroup="default"/>
<Event Severity="10" Time="1530669355.486430" Type="ConnectionClosed" Machine="10.42.0.32:4550" ID="41ab14e8cd37b99a" PeerAddr="10.42.0.7:4514" Error="connection_failed" ErrorDescription="Network connection failed" ErrorCode="1026" SuppressedEventCount="38" logGroup="default"/>
<Event Severity="10" Time="1530669355.494121" Type="ConnectionFrom" Machine="10.42.0.32:4550" ID="a8112136db6cdbe2" FromAddress="10.42.0.43:42390" SuppressedEventCount="128" logGroup="default"/>
<Event Severity="10" Time="1530669355.536756" Type="ConnectionEstablished" Machine="10.42.0.32:4550" ID="48ddd7b25de2258d" Peer="10.42.0.18:36074" ConnectionId="1" SuppressedEventCount="128" logGroup="default"/>
<Event Severity="10" Time="1530669355.536756" Type="RedundantConnection" Machine="10.42.0.32:4550" ID="48ddd7b25de2258d" FromAddr="10.42.0.18:36074" CanonicalAddr="10.42.0.18:4500" logGroup="default"/>
<Event Severity="10" Time="1530669355.552089" Type="RedundantConnection" Machine="10.42.0.32:4550" ID="705f4c3edfdc69f8" FromAddr="10.42.0.7:53786" CanonicalAddr="10.42.0.7:4509" logGroup="default"/>
<Event Severity="10" Time="1530669355.554433" Type="RedundantConnection" Machine="10.42.0.32:4550" ID="4dda0808088f0e08" FromAddr="10.42.0.14:40862" CanonicalAddr="10.42.0.14:4500" logGroup="default"/>
<Event Severity="10" Time="1530669355.555397" Type="IncomingConnection" Machine="10.42.0.32:4550" ID="3d5ba314c7bc9497" FromAddr="10.42.0.55:33540" CanonicalAddr="10.42.0.55:4510" IsPublic="1" logGroup="default"/>
<Event Severity="10" Time="1530669355.560400" Type="IncomingConnection" Machine="10.42.0.32:4550" ID="3d63e2038e8fcd57" FromAddr="10.42.0.43:42404" CanonicalAddr="10.42.0.43:4513" IsPublic="1" logGroup="default"/>
<Event Severity="10" Time="1530669355.567210" Type="IncomingConnection" Machine="10.42.0.32:4550" ID="71c0e5347b4c415c" FromAddr="10.42.0.55:33542" CanonicalAddr="10.42.0.55:4550" IsPublic="1" logGroup="default"/>
<Event Severity="10" Time="1530669355.571898" Type="RedundantConnection" Machine="10.42.0.32:4550" ID="042fd365f7dd9100" FromAddr="10.42.0.7:53788" CanonicalAddr="10.42.0.7:4510" logGroup="default"/>
<Event Severity="10" Time="1530669355.572860" Type="RedundantConnection" Machine="10.42.0.32:4550" ID="3e4a41da608fd20d" FromAddr="10.42.0.22:60790" CanonicalAddr="10.42.0.22:4502" logGroup="default"/>
<Event Severity="10" Time="1530669355.576047" Type="RedundantConnection" Machine="10.42.0.32:4550" ID="9f2b4e8957eae00e" FromAddr="10.42.0.16:46500" CanonicalAddr="10.42.0.16:4510" logGroup="default"/>
<Event Severity="10" Time="1530669355.579260" Type="IncomingConnection" Machine="10.42.0.32:4550" ID="96a91a3600fc7bcc" FromAddr="10.42.0.55:33544" CanonicalAddr="10.42.0.55:4504" IsPublic="1" logGroup="default"/>
<Event Severity="10" Time="1530669355.587887" Type="RedundantConnection" Machine="10.42.0.32:4550" ID="c3d4e73f704bbdb4" FromAddr="10.42.0.16:46502" CanonicalAddr="10.42.0.16:4504" logGroup="default"/>
<Event Severity="10" Time="1530669355.591986" Type="RedundantConnection" Machine="10.42.0.32:4550" ID="d0b6d8bf5eee9a86" FromAddr="10.42.0.9:33506" CanonicalAddr="10.42.0.9:4512" logGroup="default"/>
<Event Severity="10" Time="1530669355.598727" Type="RedundantConnection" Machine="10.42.0.32:4550" ID="3d24cb0616bfbbe5" FromAddr="10.42.0.14:40864" CanonicalAddr="10.42.0.14:4504" logGroup="default"/>
<Event Severity="10" Time="1530669355.604163" Type="RedundantConnection" Machine="10.42.0.32:4550" ID="6e5a28873cef14f9" FromAddr="10.42.0.16:46504" CanonicalAddr="10.42.0.16:4503" logGroup="default"/>
<Event Severity="10" Time="1530669355.604163" Type="IncomingConnection" Machine="10.42.0.32:4550" ID="180cb69cf41c36f2" FromAddr="10.42.0.43:42408" CanonicalAddr="10.42.0.43:4506" IsPublic="1" logGroup="default"/>
<Event Severity="10" Time="1530669355.607915" Type="RedundantConnection" Machine="10.42.0.32:4550" ID="ba8d996bd27cfd77" FromAddr="10.42.0.9:33508" CanonicalAddr="10.42.0.9:4508" logGroup="default"/>
<Event Severity="10" Time="1530669355.612928" Type="RedundantConnection" Machine="10.42.0.32:4550" ID="046520619a54199c" FromAddr="10.42.0.22:60792" CanonicalAddr="10.42.0.22:4550" logGroup="default"/>
<Event Severity="10" Time="1530669355.613883" Type="RedundantConnection" Machine="10.42.0.32:4550" ID="cc59abcf6f37dae1" FromAddr="10.42.0.22:60842" CanonicalAddr="10.42.0.22:4550" logGroup="default"/>
<Event Severity="10" Time="1530669355.616159" Type="RedundantConnection" Machine="10.42.0.32:4550" ID="e4a0644a7a7ecfc5" FromAddr="10.42.0.9:33510" CanonicalAddr="10.42.0.9:4515" logGroup="default"/>
<Event Severity="10" Time="1530669355.620787" Type="RedundantConnection" Machine="10.42.0.32:4550" ID="b3d0ae6aba9d2fb4" FromAddr="10.42.0.18:36076" CanonicalAddr="10.42.0.18:4508" logGroup="default"/>
<Event Severity="10" Time="1530669355.632136" Type="IncomingConnection" Machine="10.42.0.32:4550" ID="d4da1922b842c82e" FromAddr="10.42.0.48:48116" CanonicalAddr="10.42.0.48:4502" IsPublic="1" logGroup="default"/>
<Event Severity="10" Time="1530669355.632692" Type="RedundantConnection" Machine="10.42.0.32:4550" ID="08afd705731f3c5a" FromAddr="10.42.0.22:60794" CanonicalAddr="10.42.0.22:4500" logGroup="default"/>
<Event Severity="10" Time="1530669355.632692" Type="RedundantConnection" Machine="10.42.0.32:4550" ID="889129616e309aaa" FromAddr="10.42.0.22:60796" CanonicalAddr="10.42.0.22:4510" logGroup="default"/>
<Event Severity="10" Time="1530669355.635092" Type="IncomingConnection" Machine="10.42.0.32:4550" ID="aa65f0ff8fe9d344" FromAddr="10.42.0.55:33546" CanonicalAddr="10.42.0.55:4508" IsPublic="1" logGroup="default"/>
<Event Severity="10" Time="1530669355.638788" Type="RedundantConnection" Machine="10.42.0.32:4550" ID="bac183f36127a62b" FromAddr="10.42.0.14:40866" CanonicalAddr="10.42.0.14:4550" logGroup="default"/>
<Event Severity="10" Time="1530669355.648172" Type="RedundantConnection" Machine="10.42.0.32:4550" ID="67d06edf060e1934" FromAddr="10.42.0.7:53790" CanonicalAddr="10.42.0.7:4514" logGroup="default"/>
<Event Severity="10" Time="1530669355.660383" Type="IncomingConnection" Machine="10.42.0.32:4550" ID="2f8e7b6f934b306e" FromAddr="10.42.0.43:42412" CanonicalAddr="10.42.0.43:4511" IsPublic="1" logGroup="default"/>
<Event Severity="10" Time="1530669355.660383" Type="RedundantConnection" Machine="10.42.0.32:4550" ID="c7a024e533005702" FromAddr="10.42.0.22:60798" CanonicalAddr="10.42.0.22:4505" logGroup="default"/>
<Event Severity="10" Time="1530669355.668394" Type="IncomingConnection" Machine="10.42.0.32:4550" ID="bd6dee709d95cdd0" FromAddr="10.42.0.43:42414" CanonicalAddr="10.42.0.43:4502" IsPublic="1" logGroup="default"/>
<Event Severity="10" Time="1530669355.670741" Type="RedundantConnection" Machine="10.42.0.32:4550" ID="9a8a92b6a4d37598" FromAddr="10.42.0.14:40868" CanonicalAddr="10.42.0.14:4515" logGroup="default"/>
<Event Severity="10" Time="1530669355.672150" Type="IncomingConnection" Machine="10.42.0.32:4550" ID="1ef7ece699956f8c" FromAddr="10.42.0.43:42416" CanonicalAddr="10.42.0.43:4512" IsPublic="1" logGroup="default"/>
<Event Severity="10" Time="1530669355.698702" Type="IncomingConnection" Machine="10.42.0.32:4550" ID="a831d3d23d1a26a4" FromAddr="10.42.0.43:42456" CanonicalAddr="10.42.0.43:4550" IsPublic="1" logGroup="default"/>
<Event Severity="10" Time="1530669355.771359" Type="IncomingConnection" Machine="10.42.0.32:4550" ID="ae4ec717b4218276" FromAddr="10.42.0.55:33550" CanonicalAddr="10.42.0.55:4513" IsPublic="1" logGroup="default"/>
<Event Severity="10" Time="1530669355.781433" Type="IncomingConnection" Machine="10.42.0.32:4550" ID="69ac7a45dc55077b" FromAddr="10.42.0.55:33580" CanonicalAddr="10.42.0.55:4551" IsPublic="1" logGroup="default"/>
<Event Severity="10" Time="1530669355.800183" Type="IncomingConnection" Machine="10.42.0.32:4550" ID="cea7128f54b81fc5" FromAddr="10.42.0.48:48124" CanonicalAddr="10.42.0.48:4511" IsPublic="1" logGroup="default"/>
<Event Severity="10" Time="1530669355.812043" Type="RedundantConnection" Machine="10.42.0.32:4550" ID="c28c7cafcb612f30" FromAddr="10.42.0.16:46508" CanonicalAddr="10.42.0.16:4509" logGroup="default"/>
<Event Severity="10" Time="1530669355.820137" Type="RedundantConnection" Machine="10.42.0.32:4550" ID="2725f60e0b320f5b" FromAddr="10.42.0.7:53798" CanonicalAddr="10.42.0.7:4501" logGroup="default"/>
<Event Severity="10" Time="1530669355.859888" Type="RedundantConnection" Machine="10.42.0.32:4550" ID="4e229deca206bc9f" FromAddr="10.42.0.16:46512" CanonicalAddr="10.42.0.16:4508" logGroup="default"/>
<Event Severity="10" Time="1530669355.884142" Type="RedundantConnection" Machine="10.42.0.32:4550" ID="f863452cf4866bbd" FromAddr="10.42.0.7:53802" CanonicalAddr="10.42.0.7:4507" logGroup="default"/>
<Event Severity="10" Time="1530669355.886024" Type="RedundantConnection" Machine="10.42.0.32:4550" ID="228f580e1ad6f0b8" FromAddr="10.42.0.32:51774" CanonicalAddr="10.42.0.32:4512" logGroup="default"/>
<Event Severity="10" Time="1530669355.894712" Type="RedundantConnection" Machine="10.42.0.32:4550" ID="d6bc99ee459c07d7" FromAddr="10.42.0.14:40880" CanonicalAddr="10.42.0.14:4501" logGroup="default"/>
<Event Severity="10" Time="1530669355.896634" Type="RedundantConnection" Machine="10.42.0.32:4550" ID="aaa25e74d8686a52" FromAddr="10.42.0.22:60808" CanonicalAddr="10.42.0.22:4509" logGroup="default"/>
<Event Severity="10" Time="1530669355.924069" Type="RedundantConnection" Machine="10.42.0.32:4550" ID="16a0f5dd560739cb" FromAddr="10.42.0.16:46516" CanonicalAddr="10.42.0.16:4505" logGroup="default"/>
<Event Severity="10" Time="1530669355.936704" Type="RedundantConnection" Machine="10.42.0.32:4550" ID="5be08f0f642a23e6" FromAddr="10.42.0.18:36082" CanonicalAddr="10.42.0.18:4503" logGroup="default"/>
<Event Severity="10" Time="1530669355.979931" Type="IncomingConnection" Machine="10.42.0.32:4550" ID="f1041212b3085a9c" FromAddr="10.42.0.48:48130" CanonicalAddr="10.42.0.48:4551" IsPublic="1" logGroup="default"/>
<Event Severity="10" Time="1530669356.040174" Type="IncomingConnection" Machine="10.42.0.32:4550" ID="6e439baeefaa7cfb" FromAddr="10.42.0.48:48132" CanonicalAddr="10.42.0.48:4514" IsPublic="1" logGroup="default"/>
<Event Severity="10" Time="1530669356.042554" Type="RedundantConnection" Machine="10.42.0.32:4550" ID="578c85528d5b4d7e" FromAddr="10.42.0.14:40882" CanonicalAddr="10.42.0.14:4508" logGroup="default"/>
<Event Severity="10" Time="1530669356.072176" Type="RedundantConnection" Machine="10.42.0.32:4550" ID="54a630574c1c66ff" FromAddr="10.42.0.16:46518" CanonicalAddr="10.42.0.16:4512" logGroup="default"/>
<Event Severity="10" Time="1530669356.084208" Type="RedundantConnection" Machine="10.42.0.32:4550" ID="1d5d0f8ef842fcd7" FromAddr="10.42.0.7:53806" CanonicalAddr="10.42.0.7:4512" logGroup="default"/>
<Event Severity="10" Time="1530669356.092844" Type="RedundantConnection" Machine="10.42.0.32:4550" ID="2a6861d8c612912e" FromAddr="10.42.0.18:36090" CanonicalAddr="10.42.0.18:4502" logGroup="default"/>
<Event Severity="10" Time="1530669356.104938" Type="RedundantConnection" Machine="10.42.0.32:4550" ID="af7c23a638070eae" FromAddr="10.42.0.22:60812" CanonicalAddr="10.42.0.22:4501" logGroup="default"/>
<Event Severity="10" Time="1530669356.107245" Type="IncomingConnection" Machine="10.42.0.32:4550" ID="d682c87bf4dd2f58" FromAddr="10.42.0.55:33566" CanonicalAddr="10.42.0.55:4501" IsPublic="1" logGroup="default"/>
<Event Severity="10" Time="1530669356.112393" Type="RedundantConnection" Machine="10.42.0.32:4550" ID="c866bb0ff3c59489" FromAddr="10.42.0.7:53808" CanonicalAddr="10.42.0.7:4511" logGroup="default"/>
<Event Severity="10" Time="1530669356.124019" Type="RedundantConnection" Machine="10.42.0.32:4550" ID="2d61f02990cf9ca3" FromAddr="10.42.0.7:53810" CanonicalAddr="10.42.0.7:4550" logGroup="default"/>
<Event Severity="10" Time="1530669356.240101" Type="IncomingConnection" Machine="10.42.0.32:4550" ID="66e654564bc451e0" FromAddr="10.42.0.48:48136" CanonicalAddr="10.42.0.48:4506" IsPublic="1" logGroup="default"/>
<Event Severity="10" Time="1530669356.308044" Type="IncomingConnection" Machine="10.42.0.32:4550" ID="3282726333fbe156" FromAddr="10.42.0.48:48150" CanonicalAddr="10.42.0.48:4550" IsPublic="1" logGroup="default"/>
<Event Severity="10" Time="1530669356.408143" Type="RedundantConnection" Machine="10.42.0.32:4550" ID="e1c21a5bc5422c17" FromAddr="10.42.0.7:53834" CanonicalAddr="10.42.0.7:4551" logGroup="default"/>
<Event Severity="10" Time="1530669356.497915" Type="ConnectionFrom" Machine="10.42.0.32:4550" ID="8c7196310f68b95e" FromAddress="10.42.0.9:33492" SuppressedEventCount="148" logGroup="default"/>
<Event Severity="20" Time="1530669356.957797" Type="N2_WriteError" Machine="10.42.0.32:4550" ID="a850d7a607161e9c" Message="32" SuppressedEventCount="144" logGroup="default"/>
<Event Severity="10" Time="1530669356.957797" Type="ConnectionClosed" Machine="10.42.0.32:4550" ID="a850d7a607161e9c" PeerAddr="10.42.0.32:4513" Error="connection_failed" ErrorDescription="Network connection failed" ErrorCode="1026" SuppressedEventCount="163" logGroup="default"/>
<Event Severity="20" Time="1530669357.360884" Type="N2_ReadError" Machine="10.42.0.32:4550" ID="45dd5a80b5450dcc" Message="2" SuppressedEventCount="0" logGroup="default"/>
<Event Severity="10" Time="1530669357.380562" Type="ConnectionEstablished" Machine="10.42.0.32:4550" ID="8dbcec794d5f6091" Peer="10.42.0.7:53774" ConnectionId="1" SuppressedEventCount="56" logGroup="default"/>
<Event Severity="10" Time="1530669357.380562" Type="RedundantConnection" Machine="10.42.0.32:4550" ID="8dbcec794d5f6091" FromAddr="10.42.0.7:53774" CanonicalAddr="10.42.0.7:4503" logGroup="default"/>

Shortly after

<Event Severity="30" Time="1530669357.664314" Type="FailureMonitorClientSlow" Machine="10.42.0.32:4550" ID="0000000000000000" Elapsed="8.04514" Expected="0.1" logGroup="default"/>
<Event Severity="10" Time="1530669357.664314" Type="MovingData" Machine="10.42.0.32:4550" ID="87923f606a83079d" InFlight="1066" InQueue="483900" AverageShardSize="-1" LowPriorityRelocations="484930" HighPriorityRelocations="36" HighestPriority="800" BytesWritten="0" logGroup="default" TrackLatestType="Original"/>
<Event Severity="10" Time="1530669357.665261" Type="RkTracking" Machine="10.42.0.32:4550" ID="53bd751875d53a00" logGroup="default"/>
<Event Severity="10" Time="1530669357.665261" Type="TotalDataInFlight" Machine="10.42.0.32:4550" ID="87923f606a83079d" TotalBytes="0" UnhealthyServers="73" logGroup="default" TrackLatestType="Original"/>
<Event Severity="10" Time="1530669357.665261" Type="HealthyTeamCheck" Machine="10.42.0.32:4550" ID="87923f606a83079d" ValidatedCount="789" ProvidedCount="789" logGroup="default"/>
<Event Severity="10" Time="1530669357.665843" Type="ConnectingTo" Machine="10.42.0.32:4550" ID="0000000000000000" PeerAddr="10.42.0.7:4514" SuppressedEventCount="6" logGroup="default"/>
<Event Severity="10" Time="1530669357.666332" Type="ConnectionExchangingConnectPacket" Machine="10.42.0.32:4550" ID="f3f0dc8bf85996ea" PeerAddr="10.42.0.7:4514" SuppressedEventCount="6" logGroup="default"/>
<Event Severity="10" Time="1530669357.685696" Type="ConnectedOutgoing" Machine="10.42.0.32:4550" ID="0000000000000000" PeerAddr="10.42.0.32:4503" SuppressedEventCount="0" logGroup="default"/>
<Event Severity="10" Time="1530669357.713732" Type="MasterTerminated" Machine="10.42.0.32:4550" ID="87923f606a83079d" Reason="Conflict" logGroup="default"/>

The cluster just repeats this forever, a master is chosen and then it’s terminated.

Seems like the task that took too long was:

<Event Severity="30" Time="1530671660.847135" Type="SlowTask" Machine="10.42.0.22:4551" ID="0000000000000000" TaskID="3500" MClocks="10159.8" Duration="4.41711" SampleRate="1" NumYields="0" logGroup="default"/>

That’s TaskDataDistribution, but then shortly after that (6 seconds later), there’re these lines:

<Event Severity="30" Time="1530671666.838983" Type="FailureMonitorClientSlow" Machine="10.42.0.22:4551" ID="0000000000000000" Elapsed="6.073" Expected="0.1" logGroup="default"/>
<Event Severity="20" Time="1530671666.983854" Type="ZeroTeamsHealthySignalling" Machine="10.42.0.22:4551" ID="d53968429ac8056b" SignallingTeam="Size 2; 10.42.0.7:4511 f7063eaaa6b742ab, 10.42.0.48:4508 fa7cea97529849d2" logGroup="default"/>
<Event Severity="30" Time="1530671666.983854" Type="SlowTask" Machine="10.42.0.22:4551" ID="0000000000000000" TaskID="9000" MClocks="12758.9" Duration="5.54712" SampleRate="1" NumYields="0" logGroup="default"/>

That’s TaskReadSocket at which point the cluster controller detects the failed master:

<Event Severity="20" Time="1530671666.381078" Type="DetectedFailedMaster" Machine="10.42.0.22:4550" ID="354105bbe72c9271" OldMaster="d53968429ac8056b" logGroup="default"/>

DataDistribution being a slow task upon master startup is a known problem, but shouldn’t be causing any issues. (Duration’s unit is nanoseconds, so this isn’t anywhere near close to triggering a health monitor failure on its own.)

But the fact that it’s dieing due to Reason="Conflict" makes this seem… odd. That means that some other process is trying to become the master, and thus the master is killing itself. That… shouldn’t happen in the extended case.

Your second post is from a 10.40.0.22:4551 process, is that the cluster controller…? It looks like a different master process?

I’m looking through the code to chase down some ideas, but if you can look through logs to confirm either of the following, that’d be great:

  1. The <TraceEvent Type="FailureMonitorClientSlow" Elapsed="6.073" /> concerns me, because that would seem to suggest we haven’t checked in with the cluster controller for >6s, if I understand “elapsed” to actually be in seconds correctly. That would explain why we’re being killed.

    If you can match up the timestamps on the cluster controller and the dedicated master to show that 5 seconds after the master is marked as unhealthy, the master reports that it has failed to send health status for 6 seconds, then this would be confirmed.

  2. If you can see if any other process temporarily elects itself as master, then that’d be good to know. I suspect that if it does, BetterMasterExists would then kill the cluster, as the master would be a better fit on the process dedicated for it anyway.

  3. Or maybe, having differing cluster files on some machines with partially overlapping coordinators could possibly cause this?

Also, is this a production issue, and you need emergency cluster recovery instructions, or was this just hit in some QA environment that can be broken over the holiday?

EDIT: I’ll assume yes, so have a nice holiday, thanks for the report, and I’ll look into it on Thursday.

Thanks for the prompt reply. It is indeed in a QA cluster that we just upgraded from 3.x but taking significant load. It also happened during the migration from ssd-1 to ssd-2 engine (kinda half-way).

Have a nice 4th.

I bit more detail on this cluster. After it started happening, I reduced the cluster so that there’s only two stateless processes on the same box and so it’s always the case that one is a CC and one is a master. They just basically trade places FWICT, the CC becomes the master and the master becomes the CC (at least in the cases I was able to catch it happening). Partially overlapping coordinators huh? I can check.

Yep, I can see that the CC detected that the master is out (below is the CC when it elects itself to be the master):

<Event Severity="10" Time="1530687360.621799" Type="FailureDetectionStatus" Machine="10.42.0.22:4550" ID="834d7ce929057f23" System="10.42.0.22:4551" Status="Failed" Why="Timeout" LastRequestAge="5.19607" PivotDelay="0.01025" logGroup="default"/>
<Event Severity="10" Time="1530687360.662372" Type="AllAlternativesFailed" Machine="10.42.0.22:4550" ID="0000000000000000" Interval="1" ServersValidTime="1e+99" Alternatives="2 4df6f1bc22768e11" Delay="1e+99" logGroup="default"/>
<Event Severity="20" Time="1530687361.060788" Type="DetectedFailedMaster" Machine="10.42.0.22:4550" ID="834d7ce929057f23" OldMaster="767a75b0ad594a1b" logGroup="default"/>
<Event Severity="10" Time="1530687361.060788" Type="CCWDB" Machine="10.42.0.22:4550" ID="834d7ce929057f23" logGroup="default"/>
<Event Severity="10" Time="1530687361.060788" Type="CCWDB" Machine="10.42.0.22:4550" ID="834d7ce929057f23" Recruiting="Master" logGroup="default"/>
<Event Severity="10" Time="1530687361.062092" Type="Role" Machine="10.42.0.22:4550" ID="40754154839138f5" As="MasterServer" Transition="Begin" Origination="Recruited" OnWorker="d4524493c9464c1b" logGroup="default" TrackLatestType="Original"/>
<Event Severity="10" Time="1530687361.062092" Type="DumpToken" Machine="10.42.0.22:4550" ID="40754154839138f5" Name="recruited.waitFailure" Token="a9eed800d2a06541" logGroup="default"/>
<Event Severity="10" Time="1530687361.062092" Type="DumpToken" Machine="10.42.0.22:4550" ID="40754154839138f5" Name="recruited.getRateInfo" Token="b4b335ddd504d469" logGroup="default"/>
<Event Severity="10" Time="1530687361.062092" Type="DumpToken" Machine="10.42.0.22:4550" ID="40754154839138f5" Name="recruited.tlogRejoin" Token="c756dc31f5027375" logGroup="default"/>
<Event Severity="10" Time="1530687361.062092" Type="DumpToken" Machine="10.42.0.22:4550" ID="40754154839138f5" Name="recruited.changeCoordinators" Token="40754154839138f5" logGroup="default"/>
<Event Severity="10" Time="1530687361.062092" Type="DumpToken" Machine="10.42.0.22:4550" ID="40754154839138f5" Name="recruited.getCommitVersion" Token="5477deece74c4279" logGroup="default"/>
<Event Severity="10" Time="1530687361.062092" Type="MasterLifetime" Machine="10.42.0.22:4550" ID="40754154839138f5" LifetimeToken="834d7ce929057f23#2" logGroup="default"/>
<Event Severity="10" Time="1530687361.062092" Type="MasterRecovery" Machine="10.42.0.22:4550" BeginPair="0f5c3c06b4030e88" ID="40754154839138f5" logGroup="default"/>
<Event Severity="10" Time="1530687361.062092" Type="MasterRecoveryState" Machine="10.42.0.22:4550" ID="40754154839138f5" StatusCode="0" Status="reading_coordinated_state" logGroup="default" TrackLatestType="Original"/>
<Event Severity="10" Time="1530687361.062549" Type="CCWDB" Machine="10.42.0.22:4550" ID="834d7ce929057f23" Recruited="40754154839138f5" logGroup="default"/>
<Event Severity="10" Time="1530687361.062549" Type="RecruitedMasterWorker" Machine="10.42.0.22:4550" ID="834d7ce929057f23" Address="10.42.0.22:4550" logGroup="default" TrackLatestType="Original"/>
<Event Severity="10" Time="1530687361.062549" Type="CCWDB" Machine="10.42.0.22:4550" ID="834d7ce929057f23" Lifetime="834d7ce929057f23#3" ChangeID="c6649761e79bac73" logGroup="default"/>
<Event Severity="10" Time="1530687361.074368" Type="GotServerDBInfoChange" Machine="10.42.0.22:4550" ID="0000000000000000" ChangeID="c6649761e79bac73" MasterID="40754154839138f5" logGroup="default"/>
<Event Severity="10" Time="1530687361.089900" Type="CoreStateSerialize" Machine="10.42.0.22:4550" ID="0000000000000000" AntiQuorum="0" logSystemType="2" recoveryCount="3245" tLogReplicationFactor="2" tLogPolicy="zoneid^2 x 1" logs="a41b9a0014533447c58204caa4580daf,56f4e50209c3b0c4161060469a157a70,4cc1beeb4ff235e785c0f11a744b476a,26066b4c8857845aeb518dcefaae0664,2b5e20736fdcd33baef641320b9cda88,34aa8532672320bd2c4710229eeacca3,5d42e2243a9ba62f778e59c7fd997468,433ac2beed888f7319d2724ac957c5df,0781c5fbc920ed3305b4ba4bd30d756e,4970d16206f8999e0b7c8117f6bd7ce1" procotol="1142507640513888257" oldTLogData="154" deserializing="1" logGroup="default"/>
<Event Severity="10" Time="1530687361.089900" Type="MasterRecoveryState" Machine="10.42.0.22:4550" ID="40754154839138f5" StatusCode="1" Status="locking_coordinated_state" TLogs="10" MyRecoveryCount="3247" StateSize="273750" logGroup="default" TrackLatestType="Original"/>

If they’re just swapping, then I’d suspect that this is just an issue where something about your setup is causing the overwhelming stress of trying to load information about every shard all at once to be more overwhelming on the master than it usually otherwise is. It’s relatively “normal” for the master to be pegged at 100% CPU for up to a minute or so loading information about every shard. I’ve seen @Evan do changes in this area a couple times before, but we have a “many shards” test that I would have expected to hit this issue before you did.

I’ll go poke around to see if there’s any specific knob that you can tweak just for this in the meantime, but I’d expect that setting --knob_worker_failure_time=10 on the cluster controller would give you cluster stability again (at the cost of it taking 10seconds for FDB to notice that any process has died).

I believe duration is measured in seconds in the trace event, which is in line with the event taking about 10 billion cycles. That may also explain the FailureMonitorClientSlow event if it coincided with a long slow task.

Yup, I read the struct and missed that we have a *1e9 before storing duration in, and what we log is the seconds version. I suppose a brief thought should have also revealed that assuming 4ns is a “slow task” is a bit optimistic… sigh

So yeah, this is just DataDistribution causing a freakishly long slow task.

Just restarted with

knob_worker_failure_time=10

on this 10-node cluster.

The memory tier came right back (but I believe we started with a fresh db). SSD is up but moving some 82TB.

Will observe over the holiday.

Thanks @mrz. Obviously we need to figure out what that “freakishly” long slow task was.

There should be built-in profiling of slow tasks that will emit “Net2SlowTaskTrace” TraceEvents. Could you grep for and post them by any chance?

@mrz / @panghy: grep quickly, and a fix might make it into 6.0.0 for you :slight_smile:

I am on PTO still but from what I can tell it’s the teamTracker call in DataDistribution.actor.cpp. Basically when we are converting from ssd-1 to ssd-2, there seems to be a lot more stuff to track in those loops and it will cause the master to become unresponsive beyond 5s.

Having talked to Ben the other day, we’ll redo the test but not upgrade from ssd-1 to ssd-2 and see how it fares. I am out until the end of the month though :slight_smile:

Seems like ssd-1 to ssd-2 for large clusters is not a good idea for the master. We did not run into issues so far if we don’t try the conversion.