FoundationDB internal recovery - Troubleshooting

We have been running a 45-node FoundationDB cluster for the past three years and encountered this issue for the first time. Suddenly, both read and write operations dropped significantly: read throughput fell from 140k to 70k, and write operations dropped from 10k to 0. After two minutes, FoundationDB triggered an internal recovery, and the system returned to a normal state.

Are there any pointers or recommendations for identifying the root cause of this issue? During the drop in traffic, we observed the error "tlog_failed" in the logs (see the event below). According to the log, the recovery was triggered at 06:34, but it did not complete until 06:37. Normally, recovery on this FoundationDB cluster completes within a few seconds.

"2025-07-02T06:34:18Z",,1205,,,6e381e44f9c43dbf,default,"w.x.y.z:4505",,CC,20,,12598897781578542113,"1751438058.110105",ClusterRecoveryRetrying,"aep_core_identity~1973~71DE388E-1E82-4C63-A082-CEEE4AE7C79C","1973:1435119795",none,1751438064,"{  ""Severity"": ""20"", ""Time"": ""1751438058.110105"", ""DateTime"": ""2025-07-02T06:34:18Z"", ""Type"": ""ClusterRecoveryRetrying"", ""ID"": ""6e381e44f9c43dbf"", ""Error"": ""tlog_failed"", ""ErrorDescription"": ""Cluster recovery terminating because a TLog failed"", ""ErrorCode"": ""1205"", ""ThreadID"": ""12598897781578542113"", ""Machine"": ""w.x.y.z:4505"", ""LogGroup"": ""default"", ""Roles"": ""CC"" }",68,"