How to recover a FDB cluster with Recovery Stopped TooMany Old Generations

highline · March 21, 2024, 7:07pm

An FDB cluster is unavailable, and the status messages show “RecoveryStoppedTooManyOldGenerations”. We tried adding the --knob_max_generations_override parameter, but the issue persists; the only change is that "OldGenerations"="100" became "OldGenerations"="102".

Any thoughts on how we can bring this cluster back to a healthy state?

"messages" : [
                    {
                        "description" : "RecoveryStoppedTooManyOldGenerations at Thu Mar 21 18:02:31 2024",
                        "name" : "process_error",
                        "raw_log_message" : "\"Severity\"=\"40\", \"Time\"=\"1711044151.846410\", \"Type\"=\"RecoveryStoppedTooManyOldGenerations\", \"ID\"=\"0000000000000000\", \"OldGenerations\"=\"100\", \"Reason\"=\"Recovery stopped because too many recoveries have happened since the last time the cluster was fully_recovered. Set --knob_max_generations_override on your server processes to a value larger than OldGenerations to resume recovery once the underlying problem has been fixed.\", \"Backtrace\"=\"addr2line -e fdbserver.debug -p -C -f -i 0x19d19fc 0x19d11f8 0x19d12c1 0xf33cce 0xf363ff 0x6be078 0xf56503 0xf56673 0xa01890 0xa01ff3 0x9fbdf8 0x9fbfba 0x9fb2a8 0x9fb75a 0x9fb8e6 0x6be078 0x732d1e 0x6be078 0x9fb2a8 0x9f1ac3 0x9fb2a8 0x9f1f15 0x9fb2a8 0x9f2bad 0x9fb2a8 0x9f0c51 0x9fb2a8 0xa0438b 0x19212db 0x19213d5 0x7fbad0 0x1a0f3d0 0x6746e2 0x7fec95c4b083\", \"Machine\"=\"xx.xx.xx.xx:4300\", \"LogGroup\"=\"default\", \"Roles\"=\"MS\"",
                        "time" : 1711040000,
                        "type" : "RecoveryStoppedTooManyOldGenerations"
                    }
                ],

jzhou · March 21, 2024, 10:18pm

Before changing the knob, you need to find out what’s causing “too many old generations” (could be many different reasons). After that, fix the underlying issue and change the knob to bring your cluster back.
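The error text itself says the override must be larger than the reported OldGenerations, and the count can keep climbing while recovery keeps failing (it went from 100 to 102 in this thread). As a rough sketch, the number can be pulled out of the decoded raw_log_message string like this. The function names and the margin of 10 are illustrative assumptions, not anything FoundationDB prescribes, and the sample string is a shortened copy of the message above.

```python
import re

# Matches "Key"="Value" pairs as they appear in a decoded raw_log_message.
PAIR_RE = re.compile(r'"([^"]+)"="([^"]*)"')

def parse_raw_log_message(raw):
    """Return the "Key"="Value" pairs of a raw_log_message as a dict."""
    return dict(PAIR_RE.findall(raw))

def suggested_override(raw, margin=10):
    """Return a value strictly larger than OldGenerations.

    The margin is an arbitrary assumption; it only leaves headroom in case
    more recoveries happen before the processes are restarted.
    """
    fields = parse_raw_log_message(raw)
    return int(fields["OldGenerations"]) + margin

# Shortened copy of the message from the status output above.
raw = ('"Severity"="40", "Type"="RecoveryStoppedTooManyOldGenerations", '
       '"OldGenerations"="100", "Roles"="MS"')

print(suggested_override(raw))  # prints 110
```

This only helps choose the number. It does not replace finding and fixing the underlying cause first.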

highline · March 21, 2024, 10:33pm

Are there any clues on how to find out what’s causing “too many old generations”?
There are no other messages besides this one:

"messages" : [
            {
                "description" : "Unable to read database configuration.",
                "name" : "unreadable_configuration"
            }
        ],

jzhou · March 22, 2024, 4:36pm

unreadable_configuration indicates a problem reading data (i.e., the database configuration) from the storage servers. The cause could be that recovery has not finished (most likely) or that there are storage server problems.
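To see the cluster-level messages, the per-process messages, and the reported recovery state side by side, a small summary of the status document can help. This is a minimal sketch that assumes the output of `fdbcli --exec "status json"` has been saved and parsed, and that it has the `cluster.messages`, `cluster.processes.<id>.messages`, and `cluster.recovery_state.name` layout. The sample below is a made-up, trimmed document with a hypothetical process ID, not output from the cluster in this thread.

```python
import json

def summarize_status(status):
    """Collect recovery state and message names from a parsed status document."""
    cluster = status.get("cluster", {})
    summary = {
        "recovery_state": cluster.get("recovery_state", {}).get("name"),
        "cluster_messages": [m.get("name") for m in cluster.get("messages", [])],
        "process_messages": {},
    }
    # Only keep processes that actually report messages.
    for proc_id, proc in cluster.get("processes", {}).items():
        names = [m.get("type", m.get("name")) for m in proc.get("messages", [])]
        if names:
            summary["process_messages"][proc_id] = names
    return summary

# Hypothetical, trimmed sample; the field layout is an assumption.
sample = json.loads("""
{
  "cluster": {
    "messages": [{"name": "unreadable_configuration"}],
    "processes": {
      "proc-a": {"messages": [{"name": "process_error",
                               "type": "RecoveryStoppedTooManyOldGenerations"}]},
      "proc-b": {"messages": []}
    }
  }
}
""")

print(summarize_status(sample))
```

With a real file, the same function would be called on `json.load(open("status.json"))`, where `status.json` is whatever name the output was saved under.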

You can start by checking the MasterRecoveryState events in the logs. Find out why the state can’t reach 11 (accepting_commits, counting from 0). FYI, the complete list of states is:

RecoveryStatus {
	reading_coordinated_state,
	locking_coordinated_state,
	locking_old_transaction_servers,
	reading_transaction_system_state,
	configuration_missing,
	configuration_never_created,
	configuration_invalid,
	recruiting_transaction_servers,
	initializing_transaction_servers,
	recovery_transaction,
	writing_coordinated_state,
	accepting_commits,
	all_logs_recruited,
	storage_recovered,
	fully_recovered,
}
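The check described above can be sketched as a small scan over trace log lines that reports the highest recovery state reached. It assumes the XML trace format, with each MasterRecoveryState event carrying a numeric `StatusCode` attribute; both are assumptions to verify against your own trace files. The sample lines are made up for illustration.

```python
import re

# State names in the order listed above; the list index is the numeric state.
RECOVERY_STATES = [
    "reading_coordinated_state", "locking_coordinated_state",
    "locking_old_transaction_servers", "reading_transaction_system_state",
    "configuration_missing", "configuration_never_created",
    "configuration_invalid", "recruiting_transaction_servers",
    "initializing_transaction_servers", "recovery_transaction",
    "writing_coordinated_state", "accepting_commits",
    "all_logs_recruited", "storage_recovered", "fully_recovered",
]

# Matches Key="Value" attributes on an XML trace event line.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')

def recovery_codes(lines):
    """Yield the StatusCode of each MasterRecoveryState event line."""
    for line in lines:
        attrs = dict(ATTR_RE.findall(line))
        if attrs.get("Type") == "MasterRecoveryState" and "StatusCode" in attrs:
            yield int(attrs["StatusCode"])

def highest_state(lines):
    """Return (code, name) of the furthest state reached, or None if no events."""
    codes = list(recovery_codes(lines))
    if not codes:
        return None
    top = max(codes)
    name = RECOVERY_STATES[top] if 0 <= top < len(RECOVERY_STATES) else "unknown"
    return top, name

# Made-up sample lines; real trace events carry more attributes.
sample_lines = [
    '<Event Severity="10" Time="1.0" Type="MasterRecoveryState" StatusCode="0" />',
    '<Event Severity="10" Time="2.0" Type="MasterRecoveryState" StatusCode="3" />',
    '<Event Severity="10" Time="3.0" Type="SomethingElse" />',
]

print(highest_state(sample_lines))  # prints (3, 'reading_transaction_system_state')
```

If the highest code stays below 11 across repeated recoveries, the events logged around that state are the place to look for the underlying cause.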
