FoundationDB internal recovery - Troubleshooting

gvddk · August 5, 2025, 1:48am

We have been running a 45-node FoundationDB cluster for the past three years and encountered this issue for the first time. Suddenly, both read and write operations dropped significantly: read throughput fell from 140k to 70k, and write operations dropped from 10k to 0. After two minutes, FoundationDB triggered an internal recovery, and the system returned to a normal state.

Are there any pointers or recommendations for identifying the root cause of this issue? During the drop in traffic, we observed the error "tlog_failed" in the logs (see the event below). According to the log message, the recovery was triggered at 06:34, but the recovery did not actually happen until 06:37. Normally a FoundationDB cluster recovery takes only a few seconds.

"2025-07-02T06:34:18Z",,1205,,,6e381e44f9c43dbf,default,"w.x.y.z:4505",,CC,20,,12598897781578542113,"1751438058.110105",ClusterRecoveryRetrying,"aep_core_identity~1973~71DE388E-1E82-4C63-A082-CEEE4AE7C79C","1973:1435119795",none,1751438064,"{  ""Severity"": ""20"", ""Time"": ""1751438058.110105"", ""DateTime"": ""2025-07-02T06:34:18Z"", ""Type"": ""ClusterRecoveryRetrying"", ""ID"": ""6e381e44f9c43dbf"", ""Error"": ""tlog_failed"", ""ErrorDescription"": ""Cluster recovery terminating because a TLog failed"", ""ErrorCode"": ""1205"", ""ThreadID"": ""12598897781578542113"", ""Machine"": ""w.x.y.z:4505"", ""LogGroup"": ""default"", ""Roles"": ""CC"" }",68,"
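To narrow down what happened between 06:34 and 06:37, one option is to pull every warning-level-or-higher event in that window from the raw trace files and read them in time order. Below is a minimal sketch of that idea, not a definitive tool. It assumes the trace files are in JSON format with one event per line, carrying the same `Time`, `Severity`, and `Type` fields as the event above; the helper names (`parse_events`, `events_in_window`, `to_epoch`) are made up for illustration.

```python
import json
from datetime import datetime, timezone


def to_epoch(iso_utc):
    """Convert a UTC timestamp like '2025-07-02T06:34:18Z' to epoch seconds."""
    parsed = datetime.strptime(iso_utc, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc).timestamp()


def parse_events(lines):
    """Yield one dict per well-formed JSON trace line; skip anything else."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # assumption: non-JSON lines are noise, not events
        if isinstance(event, dict) and "Time" in event:
            yield event


def events_in_window(lines, start, end, min_severity=20):
    """Return events with start <= Time <= end and Severity >= min_severity,
    sorted by Time. Field values are strings in the trace, so convert them."""
    selected = []
    for event in parse_events(lines):
        when = float(event["Time"])
        severity = int(event.get("Severity", 0))
        if start <= when <= end and severity >= min_severity:
            selected.append(event)
    selected.sort(key=lambda e: float(e["Time"]))
    return selected


if __name__ == "__main__":
    # Sample input: the first line reuses fields from the event in this post.
    sample = [
        '{"Severity": "20", "Time": "1751438058.110105", '
        '"Type": "ClusterRecoveryRetrying", "Error": "tlog_failed", '
        '"Machine": "w.x.y.z:4505", "Roles": "CC"}',
        "this line is not JSON and is skipped",
    ]
    window_start = to_epoch("2025-07-02T06:34:00Z")
    window_end = to_epoch("2025-07-02T06:38:00Z")
    for ev in events_in_window(sample, window_start, window_end):
        print(ev["Time"], ev["Type"], ev.get("Error", ""), ev.get("Machine", ""))
```

In practice `lines` could be an open file handle for each trace file from the cluster controller and the TLog processes, with the results merged before sorting. Lowering `min_severity` to 10 would also include informational events, which may help show what each process was doing while the recovery was retrying.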
