Fdbserver error in a cluster with double redundancy

In a 5 Node cluster with double redundancy, after an outrage, on bringing all the nodes back we are observing that fdbserver is not coming up in one of the nodes with the error

LogGroup="default" Process="fdbserver.4500": Internal Error @ fdbserver/storageserver.actor.cpp 2177:
LogGroup="default" Process="fdbserver.4500":   addr2line -e fdbserver.debug -p -C -f -i 0x1539b59 0xb2134b 0xb7002e 0xb788e2 0xb46c5e 0xb4724b 0xb47cb2 0xb483a3 0xb486f2 0x4a2a20 0xb27273 0xb27452 0xb68b50 0xb25ef5 0xb25fe7 0x4a

Output of status details

Using cluster file `/etc/foundationdb/fdb.cluster'.

The database is available, but has issues (type 'status' for more information).

Welcome to the fdbcli. For help, type `help'.
fdb> status details

Using cluster file `/etc/foundationdb/fdb.cluster'.

Configuration:
  Redundancy mode        - double
  Storage engine         - ssd-2
  Coordinators           - 3
  Desired Proxies        - 1
  Desired Logs           - 2

Cluster:
  FoundationDB processes - 10 (less 0 excluded; 1 with errors)
  Machines               - 5
  Memory availability    - 22.1 GB per process on machine with least available
  Retransmissions rate   - 0 Hz
  Fault Tolerance        - 0 machines
  Server time            - 09/20/20 16:37:32

Data:
  Replication health     - UNHEALTHY: No replicas remain of some data
  Moving data            - unknown
  Sum of key-value sizes - unknown
  Disk space used        - 25.563 GB

Operating space:
  Storage server         - 788.9 GB free on most full server
  Log server             - 792.6 GB free on most full server

Workload:
  Read rate              - 30 Hz
  Write rate             - 0 Hz
  Transactions started   - 96 Hz
  Transactions committed - 0 Hz
  Conflict rate          - 0 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Process performance details:
  10.87.200.35:4500      (  2% cpu;  3% machine; 0.001 Gbps;  0% disk IO; 0.4 GB / 22.1 GB RAM  )
  10.87.200.35:4501      (  0% cpu;  3% machine; 0.001 Gbps;  0% disk IO; 0.1 GB / 22.1 GB RAM  )
  10.87.200.36:4500      (  2% cpu;  5% machine; 0.001 Gbps;  0% disk IO; 0.6 GB / 22.4 GB RAM  )
  10.87.200.36:4501      (  1% cpu;  5% machine; 0.001 Gbps;  0% disk IO; 0.1 GB / 22.4 GB RAM  )
  10.87.200.37:4500      (  2% cpu;  3% machine; 0.004 Gbps;  0% disk IO; 0.6 GB / 23.2 GB RAM  )
  10.87.200.37:4501      (  6% cpu;  3% machine; 0.004 Gbps;  1% disk IO; 0.1 GB / 23.2 GB RAM  )
  10.87.200.38:4500      (  2% cpu;  2% machine; 0.004 Gbps;  0% disk IO; 0.5 GB / 22.9 GB RAM  )
    Last logged error: StorageServerFailed: internal_error at Sun Sep 20 14:22:05 2020
  10.87.200.38:4501      (  4% cpu;  2% machine; 0.004 Gbps;  0% disk IO; 0.3 GB / 22.9 GB RAM  )
  10.87.200.39:4500      (  1% cpu;  3% machine; 0.002 Gbps;  0% disk IO; 0.6 GB / 22.2 GB RAM  )
  10.87.200.39:4501      (  4% cpu;  3% machine; 0.002 Gbps;  0% disk IO; 0.1 GB / 22.2 GB RAM  )

Coordination servers:
  10.87.200.36:4500  (reachable)
  10.87.200.37:4500  (reachable)
  10.87.200.39:4500  (reachable)

Client time: 09/20/20 16:37:32

WARNING: A single process is both a transaction log and a storage server.
  For best performance use dedicated disks for the transaction logs by setting process classes.

On the problematic node I am seeing Severity "40" like below

trace.10.87.200.38.4500.1600574883.ZRnVjr.1.xml:<Event Severity="40" Time="1600574884.903619" Type="DeliveredToNotAssigned" ID="0000000000000000" Version="6609581472260" Mutation="code: SetValue param1: \x15\x03\x15\x01\x15+3\x00\x00\x06\x02\xe9\xc1z\x04\x00\x00\x00\x00\x14 param2: \x9c\x0atCiMxMC44Ni4wLjIwIC0+IDEwLjc4Lj\x01\x08\xf0&lt;EyIFtwb3J0OjUzXRIdMTAwMDA6NTE1OjkyMDE3NjA5NzQxNTIyODUwOTkaHwo\x1d,xQxOjcyMDYwODc1MDk5NzYxODQ0MDciHB,\x00\xf0ZExMzUwNTY4Nzg2MDk4Njc1NTg4AVJQCCASCjEwLjg2LjAuMjAaDzI1NS4yNTUuMjU1LjI1NSIKMTAuODYuMC4yMCoNM2\x10\x00\xa08zMjCUgNhSOJSA2FJCCEVORFBPSU5USAFaUwggEgs!(\x0c3OC4\x01\x08\x14xMhoPM\x15d\x004\x15t\x00I&gt;(\x00(ioOMTAuNzgu\x01\x08\xf0@MTIvMzI..." Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x1579226 0x157795f 0xb212d7 0xb7002e 0xb788e2 0xb46c5e 0xb4724b 0xb47cb2 0xb483a3 0xb486f2 0x4a2a20 0xb27273 0xb27452 0xb68b50 0xb25ef5 0xb25fe7 0x4a2a20 0x9e5c68 0x719970 0xb605e0 0x75f86f 0x56ed90 0x159c97a 0x449c2f 0x7f4df820b830" Machine="10.87.200.38:4500" LogGroup="default" Roles="SS,TL" />
trace.10.87.200.38.4500.1600574883.ZRnVjr.1.xml:<Event Severity="40" Time="1600574884.903619" Type="InternalError" ID="0000000000000000" Error="internal_error" ErrorDescription="An internal error occurred" ErrorCode="4100" File="fdbserver/storageserver.actor.cpp" Line="2177" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x1579226 0x157795f 0x1539c7c 0xb2134b 0xb7002e 0xb788e2 0xb46c5e 0xb4724b 0xb47cb2 0xb483a3 0xb486f2 0x4a2a20 0xb27273 0xb27452 0xb68b50 0xb25ef5 0xb25fe7 0x4a2a20 0x9e5c68 0x719970 0xb605e0 0x75f86f 0x56ed90 0x159c97a 0x449c2f 0x7f4df820b830" Machine="10.87.200.38:4500" LogGroup="default" Roles="SS,TL" />
trace.10.87.200.38.4500.1600574883.ZRnVjr.1.xml:<Event Severity="40" Time="1600574884.903619" Type="SystemError" ID="0000000000000000" Error="internal_error" ErrorDescription="An internal error occurred" ErrorCode="4100" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x1579226 0x157795f 0x1539a7a 0x1539cee 0xb2134b 0xb7002e 0xb788e2 0xb46c5e 0xb4724b 0xb47cb2 0xb483a3 0xb486f2 0x4a2a20 0xb27273 0xb27452 0xb68b50 0xb25ef5 0xb25fe7 0x4a2a20 0x9e5c68 0x719970 0xb605e0 0x75f86f 0x56ed90 0x159c97a 0x449c2f 0x7f4df820b830" Machine="10.87.200.38:4500" LogGroup="default" Roles="SS,TL" />
trace.10.87.200.38.4500.1600574883.ZRnVjr.1.xml:<Event Severity="40" Time="1600574884.903619" Type="SSUpdateError" ID="b260a712e27c7bd1" Error="internal_error" ErrorDescription="An internal error occurred" ErrorCode="4100" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x1579226 0x157795f 0xb0b127 0xb47d0f 0xb483a3 0xb486f2 0x4a2a20 0xb27273 0xb27452 0xb68b50 0xb25ef5 0xb25fe7 0x4a2a20 0x9e5c68 0x719970 0xb605e0 0x75f86f 0x56ed90 0x159c97a 0x449c2f 0x7f4df820b830" Machine="10.87.200.38:4500" LogGroup="default" Roles="SS,TL" />
trace.10.87.200.38.4500.1600574883.ZRnVjr.1.xml:<Event Severity="40" Time="1600574884.921231" Type="StorageServerFailed" ID="b260a712e27c7bd1" Error="internal_error" ErrorDescription="An internal error occurred" ErrorCode="4100" Reason="Error" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x1579226 0x157795f 0xbd2666 0xbd2f0f 0xbd3084 0xbbcff3 0xbbd2f7 0x49a649 0xbca854 0xbcc278 0x5268d8 0x526c17 0x49a649 0xbc4260 0xbc42e7 0x56ed90 0x77247f 0x772897 0x77300a 0x4a2a20 0x4a2a20 0x1594987 0x15956b8 0x1593fbb 0x1593a18 0x63e108 0x56ed90 0x159c97a 0x449c2f 0x7f4df820b830" Machine="10.87.200.38:4500" LogGroup="default" Roles="SS,TL" />
trace.10.87.200.38.4500.1600611724.p9XtOH.1.xml:<Event Severity="40" Time="1600611725.708995" Type="DeliveredToNotAssigned" ID="0000000000000000" Version="6609581472260" Mutation="code: SetValue param1: \x15\x03\x15\x01\x15+3\x00\x00\x06\x02\xe9\xc1z\x04\x00\x00\x00\x00\x14 param2: \x9c\x0atCiMxMC44Ni4wLjIwIC0+IDEwLjc4Lj\x01\x08\xf0&lt;EyIFtwb3J0OjUzXRIdMTAwMDA6NTE1OjkyMDE3NjA5NzQxNTIyODUwOTkaHwo\x1d,xQxOjcyMDYwODc1MDk5NzYxODQ0MDciHB,\x00\xf0ZExMzUwNTY4Nzg2MDk4Njc1NTg4AVJQCCASCjEwLjg2LjAuMjAaDzI1NS4yNTUuMjU1LjI1NSIKMTAuODYuMC4yMCoNM2\x10\x00\xa08zMjCUgNhSOJSA2FJCCEVORFBPSU5USAFaUwggEgs!(\x0c3OC4\x01\x08\x14xMhoPM\x15d\x004\x15t\x00I&gt;(\x00(ioOMTAuNzgu\x01\x08\xf0@MTIvMzI..." Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x1579226 0x157795f 0xb212d7 0xb7002e 0xb788e2 0xb46c5e 0xb4724b 0xb47cb2 0xb483a3 0xb486f2 0x4a2a20 0xb27273 0xb27452 0xb68b50 0xb25ef5 0xb25fe7 0x4a2a20 0x9e5c68 0x719970 0xb605e0 0x75f86f 0x56ed90 0x159c97a 0x449c2f 0x7f3f629da830" Machine="10.87.200.38:4500" LogGroup="default" Roles="SS,TL" />
trace.10.87.200.38.4500.1600611724.p9XtOH.1.xml:<Event Severity="40" Time="1600611725.708995" Type="InternalError" ID="0000000000000000" Error="internal_error" ErrorDescription="An internal error occurred" ErrorCode="4100" File="fdbserver/storageserver.actor.cpp" Line="2177" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x1579226 0x157795f 0x1539c7c 0xb2134b 0xb7002e 0xb788e2 0xb46c5e 0xb4724b 0xb47cb2 0xb483a3 0xb486f2 0x4a2a20 0xb27273 0xb27452 0xb68b50 0xb25ef5 0xb25fe7 0x4a2a20 0x9e5c68 0x719970 0xb605e0 0x75f86f 0x56ed90 0x159c97a 0x449c2f 0x7f3f629da830" Machine="10.87.200.38:4500" LogGroup="default" Roles="SS,TL" />
trace.10.87.200.38.4500.1600611724.p9XtOH.1.xml:<Event Severity="40" Time="1600611725.708995" Type="SystemError" ID="0000000000000000" Error="internal_error" ErrorDescription="An internal error occurred" ErrorCode="4100" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x1579226 0x157795f 0x1539a7a 0x1539cee 0xb2134b 0xb7002e 0xb788e2 0xb46c5e 0xb4724b 0xb47cb2 0xb483a3 0xb486f2 0x4a2a20 0xb27273 0xb27452 0xb68b50 0xb25ef5 0xb25fe7 0x4a2a20 0x9e5c68 0x719970 0xb605e0 0x75f86f 0x56ed90 0x159c97a 0x449c2f 0x7f3f629da830" Machine="10.87.200.38:4500" LogGroup="default" Roles="SS,TL" />
trace.10.87.200.38.4500.1600611724.p9XtOH.1.xml:<Event Severity="40" Time="1600611725.708995" Type="SSUpdateError" ID="b260a712e27c7bd1" Error="internal_error" ErrorDescription="An internal error occurred" ErrorCode="4100" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x1579226 0x157795f 0xb0b127 0xb47d0f 0xb483a3 0xb486f2 0x4a2a20 0xb27273 0xb27452 0xb68b50 0xb25ef5 0xb25fe7 0x4a2a20 0x9e5c68 0x719970 0xb605e0 0x75f86f 0x56ed90 0x159c97a 0x449c2f 0x7f3f629da830" Machine="10.87.200.38:4500" LogGroup="default" Roles="SS,TL" />
trace.10.87.200.38.4500.1600611724.p9XtOH.1.xml:<Event Severity="40" Time="1600611725.729125" Type="StorageServerFailed" ID="b260a712e27c7bd1" Error="internal_error" ErrorDescription="An internal error occurred" ErrorCode="4100" Reason="Error" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x1579226 0x157795f 0xbd2666 0xbd2f0f 0xbd3084 0xbbcff3 0xbbd2f7 0x49a649 0xbca854 0xbcc278 0x5268d8 0x526c17 0x49a649 0xbc4260 0xbc42e7 0x56ed90 0x77247f 0x772897 0x77300a 0x4a2a20 0x4a2a20 0x1594987 0x15956b8 0x1593fbb 0x1593a18 0x63e108 0x56ed90 0x159c97a 0x449c2f 0x7f3f629da830" Machine="10.87.200.38:4500" LogGroup="default" Roles="SS,TL" />

foundationdb.conf is like below

[fdbmonitor]
user = ubuntu
group = ubuntu


[general]
restart_delay = 60
cluster_file = /etc/foundationdb/fdb.cluster


[fdbserver]
command = /usr/bin/fdbserver
public_address = auto:$ID
listen_address = public
datadir = /var/lib/foundationdb/data/$ID
logdir = /var/log/foundationdb


[fdbserver.4500]


[fdbserver.4501]
class = stateless


[backup_agent]
command = /usr/lib/foundationdb/backup_agent/backup_agent
logdir = /var/log/foundationdb

[backup_agent.1]

I read this discussion and restarted foundationdb processes in all the nodes. But it did not help.

Can someone let me know if it is possible to recover the cluster ? I am ok with removing the keyspaces for which both the replicas are missing.

FDB Version - 6.1.12.

DeliveredToNotAssigned

This usually means that there’s something wrong with the internal metadata the cluster uses to process transactions. If restoring a backup to a fresh cluster is an option here I’d recommend that. If some of the keyspace is still readable, then maybe you could copy it to a fresh cluster instead.

Since you noticed this after an outage it seems likely that the in-memory copy of the internal metadata was correct, but the on-disk copy restored during recovery was corrupt.

Is it possible that you were writing to a key starting with \xff (e.g. \xff/metadataVersion) and also doing something that could cause the database to become locked (e.g. fdbdr switch)? We recently found a long-standing bug where these two things used together can cause something like this: https://github.com/apple/foundationdb/pull/3674.

Otherwise this sort of thing could happen if you somehow were writing to \xff system keys and deleted some of the ones that encode the “how to process transactions” metadata, although I would expect this to show up immediately instead of after the next recovery.

2 Likes

Thanks for replying.

We are not using fdbdr.

For now we have restored to a fresh cluster.