How to recover an FDB cluster with RecoveryStoppedTooManyOldGenerations

An FDB cluster became unavailable, and the status messages show “RecoveryStoppedTooManyOldGenerations”. We tried adding the --knob_max_generations_override parameter, but the issue persists; the only change is that "OldGenerations"="100" became "OldGenerations"="102".

Any thoughts on how we can bring this cluster back to a healthy state?

"messages" : [
                    {
                        "description" : "RecoveryStoppedTooManyOldGenerations at Thu Mar 21 18:02:31 2024",
                        "name" : "process_error",
                        "raw_log_message" : "\"Severity\"=\"40\", \"Time\"=\"1711044151.846410\", \"Type\"=\"RecoveryStoppedTooManyOldGenerations\", \"ID\"=\"0000000000000000\", \"OldGenerations\"=\"100\", \"Reason\"=\"Recovery stopped because too many recoveries have happened since the last time the cluster was fully_recovered. Set --knob_max_generations_override on your server processes to a value larger than OldGenerations to resume recovery once the underlying problem has been fixed.\", \"Backtrace\"=\"addr2line -e fdbserver.debug -p -C -f -i 0x19d19fc 0x19d11f8 0x19d12c1 0xf33cce 0xf363ff 0x6be078 0xf56503 0xf56673 0xa01890 0xa01ff3 0x9fbdf8 0x9fbfba 0x9fb2a8 0x9fb75a 0x9fb8e6 0x6be078 0x732d1e 0x6be078 0x9fb2a8 0x9f1ac3 0x9fb2a8 0x9f1f15 0x9fb2a8 0x9f2bad 0x9fb2a8 0x9f0c51 0x9fb2a8 0xa0438b 0x19212db 0x19213d5 0x7fbad0 0x1a0f3d0 0x6746e2 0x7fec95c4b083\", \"Machine\"=\"xx.xx.xx.xx:4300\", \"LogGroup\"=\"default\", \"Roles\"=\"MS\"",
                        "time" : 1711040000,
                        "type" : "RecoveryStoppedTooManyOldGenerations"
                    }
                ],

Before changing the knob, you need to find out what’s causing “too many old generations” (there can be many different reasons). After that, fix the underlying issue and set the knob to a value larger than the current OldGenerations to bring your cluster back.
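
Since the count in the question moved from 100 to 102 between attempts, it can help to read the current OldGenerations value out of the status message before choosing a knob value. Below is a minimal Python sketch, assuming the raw_log_message has the "Key"="Value" layout shown above. The helper names and the margin of 10 are my own assumptions for illustration, not anything FDB prescribes; the log message only says the knob must be larger than OldGenerations.

```python
import re

# Matches the "Key"="Value" pairs used in the raw_log_message field.
PAIR = re.compile(r'"([^"]+)"="([^"]*)"')


def parse_raw_log_message(raw: str) -> dict:
    """Turn a raw_log_message string into a dict of its key/value pairs."""
    return dict(PAIR.findall(raw))


def suggested_override(raw: str, margin: int = 10) -> int:
    """Return OldGenerations plus some headroom.

    The margin is a hypothetical choice: it only leaves room for a few
    more recovery attempts before the limit is hit again.
    """
    fields = parse_raw_log_message(raw)
    return int(fields["OldGenerations"]) + margin


# Shortened sample in the same layout as the message in the question.
sample = (
    '"Severity"="40", "Type"="RecoveryStoppedTooManyOldGenerations", '
    '"OldGenerations"="102", "Roles"="MS"'
)
print(suggested_override(sample))  # 112
```

The resulting number would then be passed as --knob_max_generations_override on the server processes, but only after the underlying problem is fixed.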

Are there any clues on how to find out what’s causing “too many old generations”?
There are no messages other than this one:

"messages" : [
            {
                "description" : "Unable to read database configuration.",
                "name" : "unreadable_configuration"
            }
        ],

unreadable_configuration indicates a problem reading data (i.e., the database configuration) from the storage servers. This could be because recovery has not finished (most likely) or because of storage server problems.

You can start by checking the MasterRecoveryState events in the logs and finding out why the state can’t reach 11 (accepting_commits). FYI, the complete list of states is:

RecoveryStatus {
	reading_coordinated_state,
	locking_coordinated_state,
	locking_old_transaction_servers,
	reading_transaction_system_state,
	configuration_missing,
	configuration_never_created,
	configuration_invalid,
	recruiting_transaction_servers,
	initializing_transaction_servers,
	recovery_transaction,
	writing_coordinated_state,
	accepting_commits,
	all_logs_recruited,
	storage_recovered,
	fully_recovered,
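
To see where recovery gets stuck, one option is to scan the trace logs for MasterRecoveryState events and map the reported code back to the list above. This is a rough Python sketch, assuming the default XML trace format with one event per line and a StatusCode attribute on each MasterRecoveryState event; the sample lines and function names are made up for illustration, so adjust them to what your logs actually contain.

```python
import re

# State names in the order listed above; the list index is the state number.
RECOVERY_STATES = [
    "reading_coordinated_state",
    "locking_coordinated_state",
    "locking_old_transaction_servers",
    "reading_transaction_system_state",
    "configuration_missing",
    "configuration_never_created",
    "configuration_invalid",
    "recruiting_transaction_servers",
    "initializing_transaction_servers",
    "recovery_transaction",
    "writing_coordinated_state",
    "accepting_commits",
    "all_logs_recruited",
    "storage_recovered",
    "fully_recovered",
]
ACCEPTING_COMMITS = RECOVERY_STATES.index("accepting_commits")

# Matches Name="value" attributes in one trace event line (assumed format).
ATTR = re.compile(r'(\w+)="([^"]*)"')


def recovery_codes(lines):
    """Yield the StatusCode of every MasterRecoveryState event, in file order."""
    for line in lines:
        attrs = dict(ATTR.findall(line))
        if attrs.get("Type") == "MasterRecoveryState" and "StatusCode" in attrs:
            yield int(attrs["StatusCode"])


def last_state(lines):
    """Return the name of the last recovery state seen, or None if there is none."""
    codes = list(recovery_codes(lines))
    if not codes or not 0 <= codes[-1] < len(RECOVERY_STATES):
        return None
    return RECOVERY_STATES[codes[-1]]


# Hypothetical trace lines; real events carry more attributes.
sample_lines = [
    '<Event Severity="10" Type="MasterRecoveryState" StatusCode="0" />',
    '<Event Severity="10" Type="MasterRecoveryState" StatusCode="3" />',
    '<Event Severity="10" Type="SomeOtherEvent" />',
    '<Event Severity="10" Type="MasterRecoveryState" StatusCode="7" />',
]
print(last_state(sample_lines))  # recruiting_transaction_servers
```

If the last state reported keeps falling short of accepting_commits, the events logged around that point are where to look for the underlying cause.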