Troubleshooting a "StorageServerFailed: internal_error..."


I have a test installation of FoundationDB (a simple 3-node cluster) that reports the following error (at least I think it is an error) after some light usage (10 million inserts, followed by a 10-minute read/write test).

fdbcli status reports:

  FoundationDB processes - 24 (less 0 excluded; 4 with errors)
  Zones                  - 3
  Machines               - 3

status json reveals:

 "messages" : [
                        "description" : "StorageServerFailed: internal_error at Tue Feb  2 00:26:19 2021",
                        "name" : "internal_error",
                        "raw_log_message" : "\"Severity\"=\"40\", \"Time\"=\"1612225579.304060\", \"Type\"=\"StorageServerFailed\", \"ID\"=\"513a5257b74ce9f6\", \"Error\"=\"internal_error\", \"ErrorDescription\"=\"An internal error occurred\", \"ErrorCode\"=\"4100\", \"Reason\"=\"Error\", \"Backtrace\"=\"addr2line -e fdbserver.debug -p -C -f -i 0x19d19fc 0x19d11f8 0x19d12c1 0x106373a 0x106410d 0x1064253 0x1047a4e 0x1047cce 0x6bd809 0x105f777 0x10600be 0xb35a08 0xb35c2e 0x6bd809 0x108c17e 0x108c4c0 0x6bd809 0x1057f43 0x1057fbf 0x105810c 0xb881c8 0xb886a7 0xb88d91 0x6be078 0x1a066e0 0x1a03347 0x1a03890 0xa53660 0x7fbad0 0x1a0f3d0 0x6746e2 0x7f369ef04bf7\", \"Machine\"=\"\", \"LogGroup\"=\"default\", \"Roles\"=\"SS,TL\"",
                        "time" : 1612230000,
                        "type" : "StorageServerFailed"

How do I troubleshoot this further?

$ fdbcli --version
FoundationDB CLI 6.2 (v6.2.27)
source version 87cdf2b331bfe91d0a4d0e0ac09f1adbe1f2e012
protocol fdb00b062010001

To symbolicate the backtrace, grab the *-debug.tar.gz that matches your server version from the FoundationDB downloads page; there's a version dropdown to select whichever version you need. Then copy/paste the addr2line command from the Backtrace field and run it in the directory that contains the extracted fdbserver.debug.
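If copying the command out of the status json by hand is error-prone (the terminal tends to wrap it mid-address), a small script can pull it out instead. This is only a sketch: the helper name `backtrace_command` is made up for illustration, and it assumes the Backtrace field looks like the one in the message above, either with or without the JSON backslash escapes.

```python
import re
import shlex


def backtrace_command(raw_log_message):
    """Return the addr2line argv found in a status json raw_log_message.

    Hypothetical helper: it only assumes a field of the form
    "Backtrace"="addr2line ..." as seen in the message quoted in this thread.
    """
    # Accept text pasted straight from status json, where quotes are \" escaped.
    text = raw_log_message.replace('\\"', '"')
    # Undo terminal hard-wraps that may have split an address across lines.
    text = text.replace("\r", "").replace("\n", "")
    match = re.search(r'"Backtrace"="([^"]*)"', text)
    if match is None:
        raise ValueError("no Backtrace field found in the message")
    # Split into an argument list suitable for subprocess.run(...).
    return shlex.split(match.group(1))


# Example use (not executed here): run it from the directory that holds
# the extracted fdbserver.debug so that "-e fdbserver.debug" resolves.
#
#   import subprocess
#   subprocess.run(backtrace_command(raw), check=True)
```

The output of addr2line is only meaningful if the debug file comes from exactly the same build as the fdbserver that crashed.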

Having done that, grep the failed storage server's logs for Severity="40". It's highly likely that this was caused by an IO error, and there should be another high-severity event near the internal_error that tells you which IO operation failed and how/why.

Is there a particular file you mean, or do you mean OS logs?

There should be some trace.*.xml files in whatever directory you pointed --log_dir to.
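As a rough sketch of that search, something like the following walks the trace files and picks out the high-severity events. It assumes each event sits on its own line as an `<Event ... />` element with `Severity`, `Time` and `Type` attributes; the function name `severe_events` and the example directory are placeholders, not anything FoundationDB ships.

```python
import glob
import os
import re

# Matches name="value" attribute pairs inside an <Event ... /> line.
ATTR = re.compile(r'(\w+)="([^"]*)"')


def severe_events(log_dir, min_severity=40):
    """Yield (path, attrs) for each trace event at or above min_severity.

    Lines are matched with a regex instead of an XML parser because the
    trace file currently being written may not be a complete document yet.
    """
    for path in sorted(glob.glob(os.path.join(log_dir, "trace.*.xml"))):
        with open(path, errors="replace") as handle:
            for line in handle:
                if "<Event " not in line:
                    continue
                attrs = dict(ATTR.findall(line))
                severity = attrs.get("Severity", "")
                if severity.isdigit() and int(severity) >= min_severity:
                    yield path, attrs


# Example use (not executed here); the directory is an assumption, use
# whatever your log directory actually is:
#
#   for path, ev in severe_events("/var/log/foundationdb"):
#       print(path, ev.get("Time"), ev.get("Type"), ev.get("Error"))
```

Comparing the `Time` values against the one in the status message (1612225579.304060 above) should narrow things down to the events logged just before the storage server failed.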