In a 5 Node cluster with double redundancy, after an outrage, on bringing all the nodes back we are observing that fdbserver is not coming up in one of the nodes with the error
LogGroup="default" Process="fdbserver.4500": Internal Error @ fdbserver/storageserver.actor.cpp 2177:
LogGroup="default" Process="fdbserver.4500": addr2line -e fdbserver.debug -p -C -f -i 0x1539b59 0xb2134b 0xb7002e 0xb788e2 0xb46c5e 0xb4724b 0xb47cb2 0xb483a3 0xb486f2 0x4a2a20 0xb27273 0xb27452 0xb68b50 0xb25ef5 0xb25fe7 0x4a
Output of status details
Using cluster file `/etc/foundationdb/fdb.cluster'.
The database is available, but has issues (type 'status' for more information).
Welcome to the fdbcli. For help, type `help'.
fdb> status details
Using cluster file `/etc/foundationdb/fdb.cluster'.
Configuration:
Redundancy mode - double
Storage engine - ssd-2
Coordinators - 3
Desired Proxies - 1
Desired Logs - 2
Cluster:
FoundationDB processes - 10 (less 0 excluded; 1 with errors)
Machines - 5
Memory availability - 22.1 GB per process on machine with least available
Retransmissions rate - 0 Hz
Fault Tolerance - 0 machines
Server time - 09/20/20 16:37:32
Data:
Replication health - UNHEALTHY: No replicas remain of some data
Moving data - unknown
Sum of key-value sizes - unknown
Disk space used - 25.563 GB
Operating space:
Storage server - 788.9 GB free on most full server
Log server - 792.6 GB free on most full server
Workload:
Read rate - 30 Hz
Write rate - 0 Hz
Transactions started - 96 Hz
Transactions committed - 0 Hz
Conflict rate - 0 Hz
Backup and DR:
Running backups - 0
Running DRs - 0
Process performance details:
10.87.200.35:4500 ( 2% cpu; 3% machine; 0.001 Gbps; 0% disk IO; 0.4 GB / 22.1 GB RAM )
10.87.200.35:4501 ( 0% cpu; 3% machine; 0.001 Gbps; 0% disk IO; 0.1 GB / 22.1 GB RAM )
10.87.200.36:4500 ( 2% cpu; 5% machine; 0.001 Gbps; 0% disk IO; 0.6 GB / 22.4 GB RAM )
10.87.200.36:4501 ( 1% cpu; 5% machine; 0.001 Gbps; 0% disk IO; 0.1 GB / 22.4 GB RAM )
10.87.200.37:4500 ( 2% cpu; 3% machine; 0.004 Gbps; 0% disk IO; 0.6 GB / 23.2 GB RAM )
10.87.200.37:4501 ( 6% cpu; 3% machine; 0.004 Gbps; 1% disk IO; 0.1 GB / 23.2 GB RAM )
10.87.200.38:4500 ( 2% cpu; 2% machine; 0.004 Gbps; 0% disk IO; 0.5 GB / 22.9 GB RAM )
Last logged error: StorageServerFailed: internal_error at Sun Sep 20 14:22:05 2020
10.87.200.38:4501 ( 4% cpu; 2% machine; 0.004 Gbps; 0% disk IO; 0.3 GB / 22.9 GB RAM )
10.87.200.39:4500 ( 1% cpu; 3% machine; 0.002 Gbps; 0% disk IO; 0.6 GB / 22.2 GB RAM )
10.87.200.39:4501 ( 4% cpu; 3% machine; 0.002 Gbps; 0% disk IO; 0.1 GB / 22.2 GB RAM )
Coordination servers:
10.87.200.36:4500 (reachable)
10.87.200.37:4500 (reachable)
10.87.200.39:4500 (reachable)
Client time: 09/20/20 16:37:32
WARNING: A single process is both a transaction log and a storage server.
For best performance use dedicated disks for the transaction logs by setting process classes.
On the problematic node I am seeing Severity "40"
like below
trace.10.87.200.38.4500.1600574883.ZRnVjr.1.xml:<Event Severity="40" Time="1600574884.903619" Type="DeliveredToNotAssigned" ID="0000000000000000" Version="6609581472260" Mutation="code: SetValue param1: \x15\x03\x15\x01\x15+3\x00\x00\x06\x02\xe9\xc1z\x04\x00\x00\x00\x00\x14 param2: \x9c\x0atCiMxMC44Ni4wLjIwIC0+IDEwLjc4Lj\x01\x08\xf0<EyIFtwb3J0OjUzXRIdMTAwMDA6NTE1OjkyMDE3NjA5NzQxNTIyODUwOTkaHwo\x1d,xQxOjcyMDYwODc1MDk5NzYxODQ0MDciHB,\x00\xf0ZExMzUwNTY4Nzg2MDk4Njc1NTg4AVJQCCASCjEwLjg2LjAuMjAaDzI1NS4yNTUuMjU1LjI1NSIKMTAuODYuMC4yMCoNM2\x10\x00\xa08zMjCUgNhSOJSA2FJCCEVORFBPSU5USAFaUwggEgs!(\x0c3OC4\x01\x08\x14xMhoPM\x15d\x004\x15t\x00I>(\x00(ioOMTAuNzgu\x01\x08\xf0@MTIvMzI..." Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x1579226 0x157795f 0xb212d7 0xb7002e 0xb788e2 0xb46c5e 0xb4724b 0xb47cb2 0xb483a3 0xb486f2 0x4a2a20 0xb27273 0xb27452 0xb68b50 0xb25ef5 0xb25fe7 0x4a2a20 0x9e5c68 0x719970 0xb605e0 0x75f86f 0x56ed90 0x159c97a 0x449c2f 0x7f4df820b830" Machine="10.87.200.38:4500" LogGroup="default" Roles="SS,TL" />
trace.10.87.200.38.4500.1600574883.ZRnVjr.1.xml:<Event Severity="40" Time="1600574884.903619" Type="InternalError" ID="0000000000000000" Error="internal_error" ErrorDescription="An internal error occurred" ErrorCode="4100" File="fdbserver/storageserver.actor.cpp" Line="2177" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x1579226 0x157795f 0x1539c7c 0xb2134b 0xb7002e 0xb788e2 0xb46c5e 0xb4724b 0xb47cb2 0xb483a3 0xb486f2 0x4a2a20 0xb27273 0xb27452 0xb68b50 0xb25ef5 0xb25fe7 0x4a2a20 0x9e5c68 0x719970 0xb605e0 0x75f86f 0x56ed90 0x159c97a 0x449c2f 0x7f4df820b830" Machine="10.87.200.38:4500" LogGroup="default" Roles="SS,TL" />
trace.10.87.200.38.4500.1600574883.ZRnVjr.1.xml:<Event Severity="40" Time="1600574884.903619" Type="SystemError" ID="0000000000000000" Error="internal_error" ErrorDescription="An internal error occurred" ErrorCode="4100" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x1579226 0x157795f 0x1539a7a 0x1539cee 0xb2134b 0xb7002e 0xb788e2 0xb46c5e 0xb4724b 0xb47cb2 0xb483a3 0xb486f2 0x4a2a20 0xb27273 0xb27452 0xb68b50 0xb25ef5 0xb25fe7 0x4a2a20 0x9e5c68 0x719970 0xb605e0 0x75f86f 0x56ed90 0x159c97a 0x449c2f 0x7f4df820b830" Machine="10.87.200.38:4500" LogGroup="default" Roles="SS,TL" />
trace.10.87.200.38.4500.1600574883.ZRnVjr.1.xml:<Event Severity="40" Time="1600574884.903619" Type="SSUpdateError" ID="b260a712e27c7bd1" Error="internal_error" ErrorDescription="An internal error occurred" ErrorCode="4100" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x1579226 0x157795f 0xb0b127 0xb47d0f 0xb483a3 0xb486f2 0x4a2a20 0xb27273 0xb27452 0xb68b50 0xb25ef5 0xb25fe7 0x4a2a20 0x9e5c68 0x719970 0xb605e0 0x75f86f 0x56ed90 0x159c97a 0x449c2f 0x7f4df820b830" Machine="10.87.200.38:4500" LogGroup="default" Roles="SS,TL" />
trace.10.87.200.38.4500.1600574883.ZRnVjr.1.xml:<Event Severity="40" Time="1600574884.921231" Type="StorageServerFailed" ID="b260a712e27c7bd1" Error="internal_error" ErrorDescription="An internal error occurred" ErrorCode="4100" Reason="Error" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x1579226 0x157795f 0xbd2666 0xbd2f0f 0xbd3084 0xbbcff3 0xbbd2f7 0x49a649 0xbca854 0xbcc278 0x5268d8 0x526c17 0x49a649 0xbc4260 0xbc42e7 0x56ed90 0x77247f 0x772897 0x77300a 0x4a2a20 0x4a2a20 0x1594987 0x15956b8 0x1593fbb 0x1593a18 0x63e108 0x56ed90 0x159c97a 0x449c2f 0x7f4df820b830" Machine="10.87.200.38:4500" LogGroup="default" Roles="SS,TL" />
trace.10.87.200.38.4500.1600611724.p9XtOH.1.xml:<Event Severity="40" Time="1600611725.708995" Type="DeliveredToNotAssigned" ID="0000000000000000" Version="6609581472260" Mutation="code: SetValue param1: \x15\x03\x15\x01\x15+3\x00\x00\x06\x02\xe9\xc1z\x04\x00\x00\x00\x00\x14 param2: \x9c\x0atCiMxMC44Ni4wLjIwIC0+IDEwLjc4Lj\x01\x08\xf0<EyIFtwb3J0OjUzXRIdMTAwMDA6NTE1OjkyMDE3NjA5NzQxNTIyODUwOTkaHwo\x1d,xQxOjcyMDYwODc1MDk5NzYxODQ0MDciHB,\x00\xf0ZExMzUwNTY4Nzg2MDk4Njc1NTg4AVJQCCASCjEwLjg2LjAuMjAaDzI1NS4yNTUuMjU1LjI1NSIKMTAuODYuMC4yMCoNM2\x10\x00\xa08zMjCUgNhSOJSA2FJCCEVORFBPSU5USAFaUwggEgs!(\x0c3OC4\x01\x08\x14xMhoPM\x15d\x004\x15t\x00I>(\x00(ioOMTAuNzgu\x01\x08\xf0@MTIvMzI..." Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x1579226 0x157795f 0xb212d7 0xb7002e 0xb788e2 0xb46c5e 0xb4724b 0xb47cb2 0xb483a3 0xb486f2 0x4a2a20 0xb27273 0xb27452 0xb68b50 0xb25ef5 0xb25fe7 0x4a2a20 0x9e5c68 0x719970 0xb605e0 0x75f86f 0x56ed90 0x159c97a 0x449c2f 0x7f3f629da830" Machine="10.87.200.38:4500" LogGroup="default" Roles="SS,TL" />
trace.10.87.200.38.4500.1600611724.p9XtOH.1.xml:<Event Severity="40" Time="1600611725.708995" Type="InternalError" ID="0000000000000000" Error="internal_error" ErrorDescription="An internal error occurred" ErrorCode="4100" File="fdbserver/storageserver.actor.cpp" Line="2177" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x1579226 0x157795f 0x1539c7c 0xb2134b 0xb7002e 0xb788e2 0xb46c5e 0xb4724b 0xb47cb2 0xb483a3 0xb486f2 0x4a2a20 0xb27273 0xb27452 0xb68b50 0xb25ef5 0xb25fe7 0x4a2a20 0x9e5c68 0x719970 0xb605e0 0x75f86f 0x56ed90 0x159c97a 0x449c2f 0x7f3f629da830" Machine="10.87.200.38:4500" LogGroup="default" Roles="SS,TL" />
trace.10.87.200.38.4500.1600611724.p9XtOH.1.xml:<Event Severity="40" Time="1600611725.708995" Type="SystemError" ID="0000000000000000" Error="internal_error" ErrorDescription="An internal error occurred" ErrorCode="4100" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x1579226 0x157795f 0x1539a7a 0x1539cee 0xb2134b 0xb7002e 0xb788e2 0xb46c5e 0xb4724b 0xb47cb2 0xb483a3 0xb486f2 0x4a2a20 0xb27273 0xb27452 0xb68b50 0xb25ef5 0xb25fe7 0x4a2a20 0x9e5c68 0x719970 0xb605e0 0x75f86f 0x56ed90 0x159c97a 0x449c2f 0x7f3f629da830" Machine="10.87.200.38:4500" LogGroup="default" Roles="SS,TL" />
trace.10.87.200.38.4500.1600611724.p9XtOH.1.xml:<Event Severity="40" Time="1600611725.708995" Type="SSUpdateError" ID="b260a712e27c7bd1" Error="internal_error" ErrorDescription="An internal error occurred" ErrorCode="4100" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x1579226 0x157795f 0xb0b127 0xb47d0f 0xb483a3 0xb486f2 0x4a2a20 0xb27273 0xb27452 0xb68b50 0xb25ef5 0xb25fe7 0x4a2a20 0x9e5c68 0x719970 0xb605e0 0x75f86f 0x56ed90 0x159c97a 0x449c2f 0x7f3f629da830" Machine="10.87.200.38:4500" LogGroup="default" Roles="SS,TL" />
trace.10.87.200.38.4500.1600611724.p9XtOH.1.xml:<Event Severity="40" Time="1600611725.729125" Type="StorageServerFailed" ID="b260a712e27c7bd1" Error="internal_error" ErrorDescription="An internal error occurred" ErrorCode="4100" Reason="Error" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x1579226 0x157795f 0xbd2666 0xbd2f0f 0xbd3084 0xbbcff3 0xbbd2f7 0x49a649 0xbca854 0xbcc278 0x5268d8 0x526c17 0x49a649 0xbc4260 0xbc42e7 0x56ed90 0x77247f 0x772897 0x77300a 0x4a2a20 0x4a2a20 0x1594987 0x15956b8 0x1593fbb 0x1593a18 0x63e108 0x56ed90 0x159c97a 0x449c2f 0x7f3f629da830" Machine="10.87.200.38:4500" LogGroup="default" Roles="SS,TL" />
foundationdb.conf is like below
[fdbmonitor]
user = ubuntu
group = ubuntu
[general]
restart_delay = 60
cluster_file = /etc/foundationdb/fdb.cluster
[fdbserver]
command = /usr/bin/fdbserver
public_address = auto:$ID
listen_address = public
datadir = /var/lib/foundationdb/data/$ID
logdir = /var/log/foundationdb
[fdbserver.4500]
[fdbserver.4501]
class = stateless
[backup_agent]
command = /usr/lib/foundationdb/backup_agent/backup_agent
logdir = /var/log/foundationdb
[backup_agent.1]
I read this discussion and restarted foundationdb processes in all the nodes. But it did not help.
Can someone let me know if it is possible to recover the cluster ? I am ok with removing the keyspaces for which both the replicas are missing.
FDB Version - 6.1.12.