SharedTLogFailed: internal_error

It is a 24-node cluster with 6 nodes for the transaction service. The cluster went down two days ago. The disk was found to be 97% full on 3 of the TS nodes. On all 3 nodes I found a logqueue-xxxxxxx-1.fdq file of ~70 GB taking up the filesystem. I have tried deleting the file on all 3 nodes and restarting the service, but it keeps getting regenerated. I have added 3 more TS nodes, but that didn't help; similar log files are now being generated on the other TS nodes too. You can see the status of the cluster below. The cluster goes down as soon as the TS nodes are more than 95% full. Every time I delete the logs, the cluster comes back up until the logs are regenerated, then it goes down again. Please suggest.

Found this entry in a trace.xxxx.xml file on a storage node: trace.10.9.5.5.4501.1696950154.9GdRuw.0.1.xml:<Event Severity="10" Time="1696956175.102324" DateTime="2023-10-10T16:42:55Z" Type="FKBlockFail" ID="76d391c91aa0b77a" Error="transaction_too_old" ErrorDescription="Transaction is too old to perform reads or be committed" ErrorCode="1007" SuppressedEventCount="191" FKID="aee0d1d0dad58733" Machine="10.49.75.5:4501" LogGroup="default" Roles="SS"

"Please restart following tlog interfaces, otherwise storage servers may never be able to catch up." - Does this mean restarting the fdb service on transaction servers?

fdb> status details

WARNING: Long delay (Ctrl-C to interrupt)

Using cluster file `/etc/foundationdb/fdb.cluster’.

Unable to start default priority transaction after 2 seconds.

Unable to start batch priority transaction after 2 seconds.

Unable to retrieve all status information.

Configuration:
Redundancy mode - three_datacenter
Storage engine - ssd-2
Coordinators - 7
Desired Logs - 6
Usable Regions - 1

Cluster:
FoundationDB processes - 123 (less 0 excluded; 13 with errors)
Zones - 27
Machines - 27
Memory availability - 6.3 GB per process on machine with least available
Retransmissions rate - 0 Hz
Fault Tolerance - -1 machines

Warning: the database may have data loss and availability loss. Please restart following tlog interfaces, otherwise storage servers may never be able to catch up.
Old log epoch: 21 begin: 3367764062141 end: 3222016262026, missing log interfaces(id,address): 31f3bcb3fb24b233, 6d62230e43f672b0, 2162ced3712f04ef, 1bc3781a82022276,

Server time - 11/11/23 12:26:30

Data:
Replication health - HEALING: Restoring replication factor
Moving data - 806.308 GB
Sum of key-value sizes - 1.718 TB
Disk space used - 13.226 TB

Operating space:
Storage server - 4420.7 GB free on most full server
Log server - 0.0 GB free on most full server

Workload:
Read rate - 3022 Hz
Write rate - 0 Hz
Transactions started - 136 Hz
Transactions committed - 0 Hz
Conflict rate - 0 Hz
Performance limited by process: Log server MVCC memory.
Most limiting process: 11.46.76.166:4200

Backup and DR:
Running backups - 0
Running DRs - 0

Process performance details:
11.46.73.6:4200 ( 3% cpu; 3% machine; 0.001 Gbps; 0% disk IO; 2.0 GB / 6.4 GB RAM )
11.46.3.2:4200 ( 0% cpu; 0% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.3.2:4201 ( 0% cpu; 0% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.3.2:4202 ( 1% cpu; 0% machine; 0.000 Gbps; 0% disk IO; 0.2 GB / 8.0 GB RAM )
11.46.3.2:4203 ( 0% cpu; 0% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.3.2:4204 ( 0% cpu; 0% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.3.2:4202 ( 0% cpu; 0% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.3.2:4206 ( 0% cpu; 0% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.74.163:4200 ( 3% cpu; 2% machine; 0.001 Gbps; 0% disk IO; 2.0 GB / 6.4 GB RAM )
11.46.72.2:4200 ( 3% cpu; 4% machine; 0.032 Gbps; 12% disk IO; 4.6 GB / 8.0 GB RAM )
11.46.72.2:4201 ( 6% cpu; 4% machine; 0.032 Gbps; 12% disk IO; 2.2 GB / 8.0 GB RAM )
Storage server lagging by 243668 seconds.
11.46.72.2:4202 ( 6% cpu; 4% machine; 0.032 Gbps; 12% disk IO; 2.2 GB / 8.0 GB RAM )
Storage server lagging by 244023 seconds.
11.46.72.2:4203 ( 7% cpu; 4% machine; 0.032 Gbps; 12% disk IO; 2.2 GB / 8.0 GB RAM )
Storage server lagging by 243776 seconds.
11.46.72.2:4204 ( 1% cpu; 4% machine; 0.032 Gbps; 12% disk IO; 4.6 GB / 8.0 GB RAM )
11.46.72.2:4202 ( 12% cpu; 4% machine; 0.032 Gbps; 12% disk IO; 4.6 GB / 8.0 GB RAM )
11.46.72.2:4206 ( 2% cpu; 4% machine; 0.032 Gbps; 17% disk IO; 2.6 GB / 8.0 GB RAM )
11.46.72.2:4207 ( 2% cpu; 4% machine; 0.032 Gbps; 12% disk IO; 4.2 GB / 8.0 GB RAM )
11.46.72.2:4208 ( 2% cpu; 4% machine; 0.032 Gbps; 12% disk IO; 4.2 GB / 8.0 GB RAM )
11.46.72.2:4206 ( 3% cpu; 4% machine; 0.032 Gbps; 12% disk IO; 4.2 GB / 8.0 GB RAM )
11.46.72.2:4211 ( 1% cpu; 4% machine; 0.032 Gbps; 12% disk IO; 4.6 GB / 8.0 GB RAM )
11.46.78.83:4200 ( 0% cpu; 0% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.78.83:4201 ( 1% cpu; 0% machine; 0.000 Gbps; 0% disk IO; 0.2 GB / 8.0 GB RAM )
11.46.78.83:4202 ( 0% cpu; 0% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.78.83:4203 ( 0% cpu; 0% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.78.83:4204 ( 0% cpu; 0% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.78.83:4202 ( 0% cpu; 0% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.78.83:4206 ( 0% cpu; 0% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.78.170:4200 ( 2% cpu; 1% machine; 0.000 Gbps; 0% disk IO; 0.3 GB / 6.3 GB RAM )
11.46.76.76:4200 ( 2% cpu; 7% machine; 0.036 Gbps; 31% disk IO; 4.2 GB / 8.0 GB RAM )
11.46.76.76:4201 ( 2% cpu; 7% machine; 0.036 Gbps; 31% disk IO; 4.6 GB / 8.0 GB RAM )
11.46.76.76:4202 ( 1% cpu; 7% machine; 0.036 Gbps; 31% disk IO; 4.2 GB / 8.0 GB RAM )
11.46.76.76:4203 ( 1% cpu; 7% machine; 0.036 Gbps; 31% disk IO; 4.2 GB / 8.0 GB RAM )
11.46.76.76:4204 ( 2% cpu; 7% machine; 0.036 Gbps; 31% disk IO; 4.6 GB / 8.0 GB RAM )
11.46.76.76:4202 ( 8% cpu; 7% machine; 0.036 Gbps; 31% disk IO; 4.6 GB / 8.0 GB RAM )
11.46.76.76:4206 ( 1% cpu; 7% machine; 0.036 Gbps; 31% disk IO; 4.8 GB / 8.0 GB RAM )
11.46.76.76:4207 ( 8% cpu; 7% machine; 0.036 Gbps; 32% disk IO; 2.7 GB / 8.0 GB RAM )
11.46.76.76:4208 ( 6% cpu; 7% machine; 0.036 Gbps; 31% disk IO; 4.1 GB / 8.0 GB RAM )
Storage server lagging by 244022 seconds.
11.46.76.76:4206 ( 2% cpu; 7% machine; 0.036 Gbps; 31% disk IO; 4.2 GB / 8.0 GB RAM )
11.46.76.76:4211 ( 1% cpu; 7% machine; 0.036 Gbps; 31% disk IO; 4.2 GB / 8.0 GB RAM )
11.46.76.113:4200 ( 2% cpu; 2% machine; 0.001 Gbps; 0% disk IO; 0.3 GB / 6.3 GB RAM )
11.46.76.166:4200 ( 3% cpu; 34% machine; 0.001 Gbps; 0% disk IO; 1.6 GB / 6.8 GB RAM )
11.46.80.66:4200 ( 3% cpu; 40% machine; 0.001 Gbps; 0% disk IO; 1.6 GB / 6.4 GB RAM )
Last logged error: SharedTLogFailed: internal_error at Tue Oct 11 12:02:46 2023
11.46.81.142:4200 ( 3% cpu; 40% machine; 0.001 Gbps; 0% disk IO; 1.8 GB / 6.8 GB RAM )
11.46.81.120:4200 ( 1% cpu; 1% machine; 0.000 Gbps; 0% disk IO; 0.3 GB / 6.3 GB RAM )
11.46.82.183:4200 ( 2% cpu; 6% machine; 0.074 Gbps; 27% disk IO; 4.2 GB / 8.0 GB RAM )
11.46.82.183:4201 ( 22% cpu; 6% machine; 0.074 Gbps; 26% disk IO; 4.6 GB / 8.0 GB RAM )
11.46.82.183:4202 ( 2% cpu; 6% machine; 0.074 Gbps; 28% disk IO; 4.2 GB / 8.0 GB RAM )
11.46.82.183:4203 ( 11% cpu; 6% machine; 0.074 Gbps; 28% disk IO; 4.6 GB / 8.0 GB RAM )
11.46.82.183:4204 ( 12% cpu; 6% machine; 0.074 Gbps; 26% disk IO; 4.2 GB / 8.0 GB RAM )
11.46.82.183:4202 ( 2% cpu; 6% machine; 0.074 Gbps; 27% disk IO; 2.1 GB / 8.0 GB RAM )
Storage server lagging by 243840 seconds.
11.46.82.183:4206 ( 4% cpu; 6% machine; 0.074 Gbps; 28% disk IO; 4.8 GB / 8.0 GB RAM )
11.46.82.183:4207 ( 2% cpu; 6% machine; 0.074 Gbps; 28% disk IO; 4.6 GB / 8.0 GB RAM )
11.46.82.183:4208 ( 3% cpu; 6% machine; 0.074 Gbps; 28% disk IO; 2.6 GB / 8.0 GB RAM )
11.46.82.183:4206 ( 3% cpu; 6% machine; 0.074 Gbps; 27% disk IO; 4.2 GB / 8.0 GB RAM )
11.46.82.183:4211 ( 2% cpu; 6% machine; 0.074 Gbps; 30% disk IO; 2.7 GB / 8.0 GB RAM )
11.46.82.2:4200 ( 1% cpu; 18% machine; 0.000 Gbps; 0% disk IO; 0.3 GB / 6.3 GB RAM )
11.46.82.68:4200 ( 0% cpu; 1% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.82.68:4201 ( 0% cpu; 1% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.82.68:4202 ( 0% cpu; 1% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.82.68:4203 ( 0% cpu; 1% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.82.68:4204 ( 0% cpu; 1% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.82.68:4202 ( 0% cpu; 1% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.82.68:4206 ( 0% cpu; 1% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.82.117:4200 ( 0% cpu; 0% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.82.117:4201 ( 0% cpu; 0% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.82.117:4202 ( 0% cpu; 0% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.82.117:4203 ( 0% cpu; 0% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.82.117:4204 ( 0% cpu; 0% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.82.117:4202 ( 0% cpu; 0% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.82.117:4206 ( 0% cpu; 0% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.82.142:4200 ( 2% cpu; 3% machine; 0.038 Gbps; 23% disk IO; 4.7 GB / 8.0 GB RAM )
11.46.82.142:4201 ( 1% cpu; 3% machine; 0.038 Gbps; 23% disk IO; 4.2 GB / 8.0 GB RAM )
11.46.82.142:4202 ( 2% cpu; 3% machine; 0.038 Gbps; 23% disk IO; 2.2 GB / 8.0 GB RAM )
11.46.82.142:4203 ( 6% cpu; 3% machine; 0.038 Gbps; 22% disk IO; 4.0 GB / 8.0 GB RAM )
Storage server lagging by 244026 seconds.
11.46.82.142:4204 ( 1% cpu; 3% machine; 0.038 Gbps; 23% disk IO; 4.6 GB / 8.0 GB RAM )
11.46.82.142:4202 ( 3% cpu; 3% machine; 0.038 Gbps; 23% disk IO; 2.2 GB / 8.0 GB RAM )
11.46.82.142:4206 ( 3% cpu; 3% machine; 0.038 Gbps; 22% disk IO; 4.2 GB / 8.0 GB RAM )
11.46.82.142:4207 ( 1% cpu; 3% machine; 0.038 Gbps; 23% disk IO; 4.6 GB / 8.0 GB RAM )
11.46.82.142:4208 ( 2% cpu; 3% machine; 0.038 Gbps; 23% disk IO; 4.2 GB / 8.0 GB RAM )
11.46.82.142:4206 ( 13% cpu; 3% machine; 0.038 Gbps; 23% disk IO; 4.2 GB / 8.0 GB RAM )
11.46.82.142:4211 ( 2% cpu; 3% machine; 0.038 Gbps; 23% disk IO; 4.6 GB / 8.0 GB RAM )
11.46.87.122:4200 ( 3% cpu; 4% machine; 0.001 Gbps; 0% disk IO; 2.0 GB / 6.4 GB RAM )
Last logged error: SharedTLogFailed: internal_error at Tue Oct 11 12:02:46 2023
11.46.88.2:4200 ( 1% cpu; 1% machine; 0.001 Gbps; 0% disk IO; 0.3 GB / 6.3 GB RAM )
11.46.88.217:4200 ( 14% cpu; 2% machine; 0.026 Gbps; 22% disk IO; 4.2 GB / 8.0 GB RAM )
11.46.88.217:4201 ( 2% cpu; 2% machine; 0.026 Gbps; 22% disk IO; 4.2 GB / 8.0 GB RAM )
11.46.88.217:4202 ( 3% cpu; 2% machine; 0.026 Gbps; 22% disk IO; 4.2 GB / 8.0 GB RAM )
11.46.88.217:4203 ( 2% cpu; 2% machine; 0.026 Gbps; 22% disk IO; 4.6 GB / 8.0 GB RAM )
11.46.88.217:4204 ( 1% cpu; 2% machine; 0.026 Gbps; 24% disk IO; 4.2 GB / 8.0 GB RAM )
11.46.88.217:4202 ( 11% cpu; 2% machine; 0.026 Gbps; 24% disk IO; 4.6 GB / 8.0 GB RAM )
11.46.88.217:4206 ( 3% cpu; 2% machine; 0.026 Gbps; 22% disk IO; 4.2 GB / 8.0 GB RAM )
11.46.88.217:4207 ( 7% cpu; 2% machine; 0.026 Gbps; 22% disk IO; 4.2 GB / 8.0 GB RAM )
11.46.88.217:4208 ( 7% cpu; 2% machine; 0.026 Gbps; 24% disk IO; 4.6 GB / 8.0 GB RAM )
11.46.88.217:4206 ( 2% cpu; 2% machine; 0.026 Gbps; 22% disk IO; 4.6 GB / 8.0 GB RAM )
11.46.88.217:4211 ( 2% cpu; 2% machine; 0.026 Gbps; 22% disk IO; 2.2 GB / 8.0 GB RAM )
Storage server lagging by 243826 seconds.
11.46.88.223:4200 ( 3% cpu; 40% machine; 0.001 Gbps; 0% disk IO; 1.6 GB / 6.8 GB RAM )
11.46.86.160:4200 ( 3% cpu; 22% machine; 0.001 Gbps; 0% disk IO; 1.6 GB / 6.3 GB RAM )
Last logged error: SharedTLogFailed: internal_error at Tue Oct 11 12:02:46 2023
11.46.60.112:4200 ( 1% cpu; 4% machine; 0.000 Gbps; 0% disk IO; 0.3 GB / 6.3 GB RAM )
11.46.60.166:4200 ( 11% cpu; 3% machine; 0.007 Gbps; 0% disk IO; 0.2 GB / 8.0 GB RAM )
11.46.60.166:4201 ( 3% cpu; 3% machine; 0.007 Gbps; 0% disk IO; 0.2 GB / 8.0 GB RAM )
11.46.60.166:4202 ( 2% cpu; 3% machine; 0.007 Gbps; 0% disk IO; 0.2 GB / 8.0 GB RAM )
11.46.60.166:4203 ( 0% cpu; 3% machine; 0.007 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.60.166:4204 ( 0% cpu; 3% machine; 0.007 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.60.166:4202 ( 6% cpu; 3% machine; 0.007 Gbps; 0% disk IO; 0.2 GB / 8.0 GB RAM )
11.46.60.166:4206 ( 1% cpu; 3% machine; 0.007 Gbps; 0% disk IO; 0.2 GB / 8.0 GB RAM )
11.46.64.233:4200 ( 0% cpu; 3% machine; 0.008 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.64.233:4201 ( 0% cpu; 3% machine; 0.008 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.64.233:4202 ( 0% cpu; 3% machine; 0.008 Gbps; 0% disk IO; 0.2 GB / 8.0 GB RAM )
11.46.64.233:4203 ( 24% cpu; 3% machine; 0.008 Gbps; 0% disk IO; 0.3 GB / 8.0 GB RAM )
11.46.64.233:4204 ( 6% cpu; 3% machine; 0.008 Gbps; 0% disk IO; 0.2 GB / 8.0 GB RAM )
11.46.64.233:4202 ( 0% cpu; 3% machine; 0.008 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.64.233:4206 ( 0% cpu; 3% machine; 0.008 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.62.31:4200 ( 3% cpu; 8% machine; 0.001 Gbps; 0% disk IO; 1.6 GB / 6.3 GB RAM )
Last logged error: SharedTLogFailed: internal_error at Tue Oct 11 12:02:46 2023
11.46.62.114:4200 ( 4% cpu; 3% machine; 0.046 Gbps; 30% disk IO; 4.6 GB / 8.0 GB RAM )
11.46.62.114:4201 ( 11% cpu; 3% machine; 0.046 Gbps; 26% disk IO; 4.1 GB / 8.0 GB RAM )
Storage server lagging by 243632 seconds.
11.46.62.114:4202 ( 1% cpu; 3% machine; 0.046 Gbps; 31% disk IO; 4.6 GB / 8.0 GB RAM )
11.46.62.114:4203 ( 2% cpu; 3% machine; 0.046 Gbps; 31% disk IO; 4.6 GB / 8.0 GB RAM )
11.46.62.114:4204 ( 2% cpu; 3% machine; 0.046 Gbps; 31% disk IO; 4.2 GB / 8.0 GB RAM )
11.46.62.114:4202 ( 3% cpu; 3% machine; 0.046 Gbps; 31% disk IO; 2.2 GB / 8.0 GB RAM )
11.46.62.114:4206 ( 2% cpu; 3% machine; 0.046 Gbps; 30% disk IO; 4.6 GB / 8.0 GB RAM )
11.46.62.114:4207 ( 2% cpu; 3% machine; 0.046 Gbps; 31% disk IO; 4.2 GB / 8.0 GB RAM )
11.46.62.114:4208 ( 6% cpu; 3% machine; 0.046 Gbps; 30% disk IO; 2.2 GB / 8.0 GB RAM )
Storage server lagging by 244021 seconds.
11.46.62.114:4206 ( 1% cpu; 3% machine; 0.046 Gbps; 30% disk IO; 4.2 GB / 8.0 GB RAM )
11.46.62.114:4211 ( 2% cpu; 3% machine; 0.046 Gbps; 30% disk IO; 4.6 GB / 8.0 GB RAM )

Coordination servers:
11.46.78.170:4200 (reachable)
11.46.76.113:4200 (reachable)
11.46.81.120:4200 (reachable)
11.46.82.2:4200 (reachable)
11.46.87.122:4200 (reachable)
11.46.88.2:4200 (reachable)
11.46.60.112:4200 (reachable)

Client time: 11/11/23 12:26:18

It was found that the disk was 97% full on 3 of the TS nodes. On all 3 nodes I found a logqueue-xxxxxxx-1.fdq file of ~70 GB taking up the filesystem. I have tried deleting the file on all 3 nodes and restarting the service, but it keeps getting regenerated.

In general it's a bad idea to just delete files; this can cause data loss.

"Please restart following tlog interfaces, otherwise storage servers may never be able to catch up." - Does this mean restarting the fdb service on transaction servers?

This means you have to restart the fdbserver processes that have the transaction/log class/role and that contain the data.
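Concretely, here is a hedged sketch of one way to bounce specific tlog processes with fdbcli's kill command (the addresses below are just examples taken from the processes in your status output that reported SharedTLogFailed; on some versions you need the bare kill first so fdbcli populates its process list when driven via --exec):

fdbcli --exec 'kill; kill 11.46.80.66:4200 11.46.87.122:4200; status'

Since fdbmonitor restarts any fdbserver process it manages, killing the OS process on the host (or restarting the foundationdb service) achieves the same bounce.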

11.46.72.2:4201 ( 6% cpu; 4% machine; 0.032 Gbps; 12% disk IO; 2.2 GB / 8.0 GB RAM )
Storage server lagging by 243668 seconds.

Have you verified why those SS are lagging that much? If the SS processes are not pulling the committed data from the log processes, the WAL on the log processes will grow (assuming you still have clients writing to the cluster) until all mutations are fetched.
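As a rough way to watch that lag outside of status details, you could pull it from status json; this sketch assumes the data_lag field is exposed for storage roles in your FDB version and that jq is installed:

fdbcli --exec 'status json' > status.json
jq -r '.cluster.processes[] | .address as $a | .roles[]? | select(.role == "storage") | "\($a) lag_seconds=\(.data_lag.seconds)"' status.json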

I don't see any error messages in the storage server logs or in the transaction server logs. Since this is our load-testing cluster, I had to clean up the data and logs to make the cluster available again immediately.

But the same issue came up again. Could you help me understand why adding a few more nodes running the transaction service doesn't help? Even with 9 TS nodes, only 2 or 3 of them have full disk utilization. Doesn't this TLog data get distributed across all 9 TS nodes?

As I mentioned before, this issue surfaced again, but this time the status only says "Most limiting process: 12.9.73.6:4500".

So I searched for the above IP in one of the storage server logs and only found entries like the ones below:

Instead of cleaning up the cluster, I tried killing the fdb transaction process on the TS server with the issue, and that helped. It seems the storage servers read all the logs from that TS server immediately and the logs got cleared up.

I am trying to understand the root cause here. Why can't the storage servers read the logs until I bounce the TS service? The impacted TS node was 95% full.

How can I avoid a similar situation in the future, since adding more nodes doesn't seem to help? Please suggest.

It looks like when a TLog's disk is nearly full, i.e., 95% in your case, the TLog exhibits fail-stuck behavior. As a result, the lagging storage servers could not fetch data from this particular TLog.

After you killed the TLog, the storage servers started fetching data from the other replicas on different TLogs. That's why you saw immediate recovery.

In essence, don't run into a disk-full situation. You should probably alert at 60% full or less, which gives you time to react before reaching 90% or higher.
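A minimal sketch of such an alert as a cron-able shell check (the 60% threshold and the data directory are assumptions; adjust both for your layout and wire the echo into whatever alerting you use):

#!/bin/sh
THRESHOLD=60                                  # warn early so there is time to react
DATADIR=/var/lib/foundationdb/data            # assumed default data directory
USED=$(df --output=pcent "$DATADIR" | tail -1 | tr -dc '0-9')
if [ "$USED" -ge "$THRESHOLD" ]; then
  echo "WARNING: $DATADIR is ${USED}% full on $(hostname)"
fi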

@jzhou Thank you for your response.

It seems the application is making bulk writes periodically and the database can't keep up with them, so the transaction logs keep growing until the disks hit 95% and the cluster goes down.

Since there are multiple transaction servers in my cluster, why can't the writes be distributed instead of piling up on a single transaction server? I am not sure whether fdb is simply not designed that way. Please suggest.

Also, rather than tuning the application to reduce the insert rate, what can be done on the database side so that it can handle those bulk application loads? I see no benefit in adding more transaction nodes to the cluster. Would increasing the IOPS for these transaction servers help in coping with the application load?

Thank you for your time and help.

It might be useful to look at the log of the problematic tlog, since there are messages like SharedTLogFailed: internal_error. This might give clues about why the tlog behaves as if stuck. You can look for Severity="40" events; some events have backtraces.
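For example, on the suspect tlog host (assuming the default trace directory; the IP and port in the filename glob are placeholders for the process that logged the error):

grep -h 'Severity="40"' /var/log/foundationdb/trace.11.46.80.66.4200.*.xml | less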

Generally, storage servers pull mutation log data from tlogs. After processing it, storage servers pop the mutation data from the tlogs. So a tlog filling up is due to storage server lag. The fact that the logs got cleared up immediately after you killed the problematic tlog just means the storage servers are capable of processing the mutation log. The problem seems to be that the tlog is probably not returning the mutation log to the storage servers.
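If you want to confirm that behavior directly, you could watch whether the logqueue files on the tlog shrink once storage servers start popping again (the path is an assumption based on the default data directory layout):

watch -n 60 'du -sh /var/lib/foundationdb/data/*/logqueue-*.fdq'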