Stuck in "Initializing new transaction servers and recovering transaction logs"

We got into a situation where a log queue file became very large (320GB) and the file system became almost full. The situation is similar to what’s described in this post:

I tried the suggestions in the post: I created a new Transaction process and then excluded the old process. The exclusion ran for quite some time without finishing.

Then I tried the second measure: killing the old TX process. That put the cluster into the “Initializing new transaction servers and recovering transaction logs” state (per the output of the “status” command), and the db has been in that state for 3+ hours.

Please advise what I can do next. Is there a way to bring the db out of that state forcefully?

Thanks.
Leo

What you did first was the correct thing to do. If you look at TLogMetrics on the full transaction log, you should be able to find QueuePoppedVersion and MinPoppedTagVersion. If they’re equal, then look at MinPoppedTagLocality and MinPoppedTagId to start to figure out what storage server is causing the problem. If they’re largely unequal, then this is a bug in FDB.
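
For example, something like the following should pull those fields out of the most recent TLogMetrics events. This is just a rough sketch: the trace log path /var/log/foundationdb/ is an assumption about your deployment, and you’d run it on the host (or pod) of the full transaction log:

# Sketch, assuming the default trace log directory; adjust the path for your setup.
# Grab the latest TLogMetrics events and show only the pop-related fields.
grep -h 'Type="TLogMetrics"' /var/log/foundationdb/trace.*.xml | tail -n 3 | \
  egrep -o '(QueuePoppedVersion|MinPoppedTagVersion|MinPoppedTagLocality|MinPoppedTagId)="[^"]*"'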

What you did second could work if you’re at double or higher replication factor, and you know no other TLog had issues. As killing it stopped your database from recovering, this doesn’t appear to have been safe. So, you’ll need to bring the full transaction log back alive, and then we’d need to debug why recovery isn’t finishing. Can I have trace logs from the Master and the full Transaction Log once you’ve brought it back online?

Hey Alex,

I really appreciate your prompt reply. Are the four metrics you mentioned to be found in our 300GB file? Searching a file that big could be a little daunting, so please confirm before I start the big search.

We are at triple replication, with a 3-DC, 2-region config. Yeah, I can upload trace logs and maybe the big full TLog file once the db is back online.

Leo

The metrics are in the trace file. If your trace.*.xml file is >300GB, then there’s an easy fix here…

If you have a multi-region triple replication setup, it sounds quite odd that killing one extra TLog would prevent recovery, as you should have 5 other copies of the data…

It’s possible that the Master logs from your cluster right now would be helpful. (As long as they include the Type="MasterRecover*" TrackLatestType="Original" events.)

I was confusing the sys (trace) log files with the db transaction log files. I searched the trace logs with this command:
egrep -o '(QueuePoppedVersion|MinPoppedTagVersion|MinPoppedTagLocality|MinPoppedTagId).....' *.xml

And got these values:

QueuePoppedVersion="0"
MinPoppedTagVersion="0"

MinPoppedTagLocality="-1"
MinPoppedTagId="0"

I’ll identify Master logs after dinner. Thanks.

Actually, all transaction pods in the cluster have a very big tx log file. I chose the transaction-01 pod in DC3 to kill, and that caused the stuck situation.

I scanned our stateless pods and found one pod containing the events you mentioned. Here are the matching lines (edited for readability). The log file is here.

root@fdbxid-stateless-09-56f7478fb5-rwsg6:~# egrep -o 'Type="MasterRecover.*TrackLatestType="Original"' /var/log/foundationdb/*.xml 
/var/log/foundationdb/trace.10.104.219.138.4300.1581376157.qDcAfp.2.695.xml:

Type="MasterRecoveryState" ID="27e0060cd26eb47b" StatusCode="0" Status="reading_coordinated_state" Machine="10.104.219.138:4300" LogGroup="default" Roles="MP,MS" TrackLatestType="Original"
Type="MasterRecoveryState" ID="27e0060cd26eb47b" StatusCode="1" Status="locking_coordinated_state" TLogs="1" MyRecoveryCount="99" ForceRecovery="0" Machine="10.104.219.138:4300" LogGroup="default" Roles="MP,MS" TrackLatestType="Original"
Type="MasterRecoveryState" ID="27e0060cd26eb47b" StatusCode="3" Status="reading_transaction_system_state" Machine="10.104.219.138:4300" LogGroup="default" Roles="MS" TrackLatestType="Original"
Type="MasterRecoveredConfig" ID="27e0060cd26eb47b" Conf="{"log_spill":2,"logs":15,"proxies":22,"redundancy_mode":"triple","storage_engine":"ssd-2","usable_regions":1}" Machine="10.104.219.138:4300" LogGroup="default" Roles="MS" TrackLatestType="Original"
Type="MasterRecoveryState" ID="27e0060cd26eb47b" StatusCode="7" Status="recruiting_transaction_servers" RequiredTLogs="3" DesiredTLogs="15" RequiredProxies="1" DesiredProxies="22" RequiredResolvers="1" DesiredResolvers="1" StoreType="ssd-2" Machine="10.104.219.138:4300" LogGroup="default" Roles="MS" TrackLatestType="Original"
Type="MasterRecoveryState" ID="27e0060cd26eb47b" StatusCode="8" Status="initializing_transaction_servers" Proxies="6" TLogs="15" Resolvers="1" Machine="10.104.219.138:4300" LogGroup="default" Roles="MS" TrackLatestType="Original"

Alex or others:
I need urgent help with this issue. I am wondering if we can establish more direct communication, such as over the phone. I would highly appreciate it.

Leo

If all of the tlogs are large, the most likely explanations are that the remote DC could not keep up with newly ingested mutations, or that data movement could not remove a failed server for some reason, so new mutations just kept building up in the tlogs.

In “TLogMetrics” you should be able to see “MinPoppedTagVersion”, “MinPoppedTagLocality”, and “MinPoppedTagId”, which will tell you which tag has not been popped and is therefore causing mutations to build up.

You can also look at “DatacenterVersionDifference” on the cluster controller to determine if one region has fallen behind by a lot.
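
For instance (again just a sketch, with the same assumed trace log path), on the process that is currently the cluster controller:

# Sketch, assuming the default trace log directory on the cluster controller’s pod.
# Show the most recent DatacenterVersionDifference events.
grep -h 'DatacenterVersionDifference' /var/log/foundationdb/trace.*.xml | tail -n 5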

Feel free to give me a call.