Stuck in "Initializing new transaction servers and recovering transaction logs"

We got into a situation where a log queue file became very large (320GB) and the file system became almost full. The situation is similar to what’s described in this post:

I tried the suggestions in the post: I created a new Transaction process and then excluded the old process. The exclusion ran for quite some time without finishing.

Then I tried the second measure: killing the old TX process. That put the cluster into the “Initializing new transaction servers and recovering transaction logs” state (per the output of the “status” command), and the db has been in that state for 3+ hours.

Please advise what I can do next. Is there a way to bring the db out of that state forcefully?

Thanks.
Leo

What you did first was the correct thing to do. If you look at TLogMetrics on the full transaction log, you should be able to find QueuePoppedVersion and MinPoppedTagVersion. If they’re equal, then look at MinPoppedTagLocality and MinPoppedTagId to start to figure out what storage server is causing the problem. If they’re largely unequal, then this is a bug in FDB.
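
For example, something like the following should pull those fields out of the most recent TLogMetrics events. This is just a rough sketch: the trace log path /var/log/foundationdb/ is an assumption about your deployment, and you’d run it on the host (or pod) of the full transaction log:

# Sketch, assuming the default trace log directory; adjust the path for your setup.
# Grab the latest TLogMetrics events and show only the pop-related fields.
grep -h 'Type="TLogMetrics"' /var/log/foundationdb/trace.*.xml | tail -n 3 | \
  egrep -o '(QueuePoppedVersion|MinPoppedTagVersion|MinPoppedTagLocality|MinPoppedTagId)="[^"]*"'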

What you did second could work if you’re at double or higher replication factor, and you know no other TLog had issues. As killing it stopped your database from recovering, this doesn’t appear to have been safe. So, you’ll need to bring the full transaction log back alive, and then we’d need to debug why recovery isn’t finishing. Can I have trace logs from the Master and the full Transaction Log once you’ve brought it back online?

Hey Alex,

I really appreciate your prompt reply. Are the four metrics you mentioned to be found in our 300GB file? Searching a file that big could be a little daunting, so please confirm before I start the big search.

We are at triple replication, with a 3-DC, 2-region config. Yeah, I can upload trace logs and maybe the big full TLog file once the db is back online.

Leo

The metrics are in the trace file. If your trace.*.xml file is >300GB, then there’s an easy fix here…

If you have a multi-region triple replication setup, it sounds quite odd that killing one extra TLog would prevent recovery, as you should have 5 other copies of the data…

It’s possible that the Master logs from your cluster right now would be helpful. (As long as they include the Type="MasterRecover*" TrackLatestType="Original" events.)

I was confusing the sys (trace) log files with the db transaction log files. I searched the trace logs with this command:
egrep -o '(QueuePoppedVersion|MinPoppedTagVersion|MinPoppedTagLocality|MinPoppedTagId).....' *.xml

And got these values:

QueuePoppedVersion="0"
MinPoppedTagVersion="0"

MinPoppedTagLocality="-1"
MinPoppedTagId="0"

I’ll identify Master logs after dinner. Thanks.

Actually, all transaction pods in the cluster have a very big tx log file. I chose the transaction-01 pod in DC3 to kill, and that caused the stuck situation.

I scanned our stateless pods and found one pod containing the events you mentioned. Here are the matching lines (edited for readability). The log file is here.

root@fdbxid-stateless-09-56f7478fb5-rwsg6:~# egrep -o 'Type="MasterRecover.*TrackLatestType="Original"' /var/log/foundationdb/*.xml 
/var/log/foundationdb/trace.10.104.219.138.4300.1581376157.qDcAfp.2.695.xml:

Type="MasterRecoveryState" ID="27e0060cd26eb47b" StatusCode="0" Status="reading_coordinated_state" Machine="10.104.219.138:4300" LogGroup="default" Roles="MP,MS" TrackLatestType="Original"
Type="MasterRecoveryState" ID="27e0060cd26eb47b" StatusCode="1" Status="locking_coordinated_state" TLogs="1" MyRecoveryCount="99" ForceRecovery="0" Machine="10.104.219.138:4300" LogGroup="default" Roles="MP,MS" TrackLatestType="Original"
Type="MasterRecoveryState" ID="27e0060cd26eb47b" StatusCode="3" Status="reading_transaction_system_state" Machine="10.104.219.138:4300" LogGroup="default" Roles="MS" TrackLatestType="Original"
Type="MasterRecoveredConfig" ID="27e0060cd26eb47b" Conf="{"log_spill":2,"logs":15,"proxies":22,"redundancy_mode":"triple","storage_engine":"ssd-2","usable_regions":1}" Machine="10.104.219.138:4300" LogGroup="default" Roles="MS" TrackLatestType="Original"
Type="MasterRecoveryState" ID="27e0060cd26eb47b" StatusCode="7" Status="recruiting_transaction_servers" RequiredTLogs="3" DesiredTLogs="15" RequiredProxies="1" DesiredProxies="22" RequiredResolvers="1" DesiredResolvers="1" StoreType="ssd-2" Machine="10.104.219.138:4300" LogGroup="default" Roles="MS" TrackLatestType="Original"
Type="MasterRecoveryState" ID="27e0060cd26eb47b" StatusCode="8" Status="initializing_transaction_servers" Proxies="6" TLogs="15" Resolvers="1" Machine="10.104.219.138:4300" LogGroup="default" Roles="MS" TrackLatestType="Original"

Alex or others:
I need urgent help with this issue. I am wondering if we can establish more direct communication, such as over the phone. I would highly appreciate it.

Leo

If all of the tlogs are large, the most likely explanations are that the remote DC could not keep up with newly ingested mutations, or that data movement could not remove a failed server for some reason, so new mutations just kept building up in the tlogs.

In “TLogMetrics” you should be able to see “MinPoppedTagVersion”, “MinPoppedTagLocality”, and “MinPoppedTagId”, which will tell you which tag has not been popped and is therefore causing mutations to build up.

You can also look at “DatacenterVersionDifference” on the cluster controller to determine if one region has fallen behind by a lot.
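
For instance (again just a sketch, with the same assumed trace log path), on the process that is currently the cluster controller:

# Sketch, assuming the default trace log directory on the cluster controller’s pod.
# Show the most recent DatacenterVersionDifference events.
grep -h 'DatacenterVersionDifference' /var/log/foundationdb/trace.*.xml | tail -n 5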

Feel free to give me a call.