This is a 24-node cluster with 6 nodes for the transaction service (TS). The cluster went down two days ago. The disk was 97% full on 3 of the TS nodes. On all 3 nodes, I found a logqueue-xxxxxxx-1.fdq file of ~70 GB filling the filesystem. I tried deleting the file on all 3 nodes and restarting the service, but it keeps getting regenerated. I added 3 more TS nodes, but that didn't help; similar logs seem to be generated on the other TS nodes too. The cluster status is shown below. The cluster goes down as soon as the TS nodes are more than 95% full. Every time I delete the logs, the cluster responds until the logs are regenerated, and then it goes down again. Please suggest how to resolve this.
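For reference, the disk check described above can be reproduced with a small read-only script. This is only a sketch: the data directory `/var/lib/foundationdb/data` is assumed to be the packaged default (the real one is whatever `datadir` is set to in foundationdb.conf), and the helper names `find_large_queue_files` and `percent_used` are made up for illustration.

```python
import os
import shutil
from pathlib import Path


def find_large_queue_files(data_dir, min_bytes):
    """Return (path, size) for logqueue-*.fdq files of at least min_bytes, largest first."""
    hits = []
    for path in Path(data_dir).rglob("logqueue-*.fdq"):
        size = path.stat().st_size
        if size >= min_bytes:
            hits.append((str(path), size))
    return sorted(hits, key=lambda item: item[1], reverse=True)


def percent_used(path):
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


if __name__ == "__main__":
    # Assumption: default package data directory; adjust to the datadir in foundationdb.conf.
    data_dir = "/var/lib/foundationdb/data"
    if os.path.isdir(data_dir):
        print(f"{percent_used(data_dir):.1f}% used")
        # List queue files of 1 GiB or more.
        for name, size in find_large_queue_files(data_dir, 1 << 30):
            print(f"{size / 1e9:8.1f} GB  {name}")
```

The script only reads file sizes and filesystem usage; it does not delete or modify anything.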
Found this entry in a trace.xxxx.xml file on a storage node: trace.10.9.5.5.4501.1696950154.9GdRuw.0.1.xml:<Event Severity="10" Time="1696956175.102324" DateTime="2023-10-10T16:42:55Z" Type="FKBlockFail" ID="76d391c91aa0b77a" Error="transaction_too_old" ErrorDescription="Transaction is too old to perform reads or be committed" ErrorCode="1007" SuppressedEventCount="191" FKID="aee0d1d0dad58733" Machine="10.49.75.5:4501" LogGroup="default" Roles="SS"
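To pull the interesting fields out of a trace line like the one above, a regex over the attributes is enough. This is a minimal sketch, not a full trace parser: `parse_trace_attrs` is a hypothetical helper, the sample line is shortened from the entry above, and the pattern accepts both straight and typographic quotes because pasted log lines often carry the latter.

```python
import re

# Accept straight quotes and typographic quotes (U+201C / U+201D).
ATTR_RE = re.compile(r'(\w+)=["\u201c]([^"\u201d]*)["\u201d]')


def parse_trace_attrs(line):
    """Return the attributes of one trace <Event ...> line as a dict."""
    return dict(ATTR_RE.findall(line))


# Shortened copy of the trace entry quoted above.
line = '<Event Severity="10" Type="FKBlockFail" Error="transaction_too_old" ErrorCode="1007" Roles="SS"'
event = parse_trace_attrs(line)
print(event["Type"], event["Error"], event["ErrorCode"])
```

Because it does not rely on an XML parser, it also works on lines that were cut off before the closing `/>`.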
The status output below warns: "Please restart following tlog interfaces, otherwise storage servers may never be able to catch up." Does this mean restarting the FoundationDB service on the transaction servers?
fdb> status details
WARNING: Long delay (Ctrl-C to interrupt)
Using cluster file `/etc/foundationdb/fdb.cluster'.
Unable to start default priority transaction after 2 seconds.
Unable to start batch priority transaction after 2 seconds.
Unable to retrieve all status information.
Configuration:
Redundancy mode - three_datacenter
Storage engine - ssd-2
Coordinators - 7
Desired Logs - 6
Usable Regions - 1
Cluster:
FoundationDB processes - 123 (less 0 excluded; 13 with errors)
Zones - 27
Machines - 27
Memory availability - 6.3 GB per process on machine with least available
Retransmissions rate - 0 Hz
Fault Tolerance - -1 machines
Warning: the database may have data loss and availability loss. Please restart following tlog interfaces, otherwise storage servers may never be able to catch up.
Old log epoch: 21 begin: 3367764062141 end: 3222016262026, missing log interfaces(id,address): 31f3bcb3fb24b233, 6d62230e43f672b0, 2162ced3712f04ef, 1bc3781a82022276,
Server time - 11/11/23 12:26:30
Data:
Replication health - HEALING: Restoring replication factor
Moving data - 806.308 GB
Sum of key-value sizes - 1.718 TB
Disk space used - 13.226 TB
Operating space:
Storage server - 4420.7 GB free on most full server
Log server - 0.0 GB free on most full server
Workload:
Read rate - 3022 Hz
Write rate - 0 Hz
Transactions started - 136 Hz
Transactions committed - 0 Hz
Conflict rate - 0 Hz
Performance limited by process: Log server MVCC memory.
Most limiting process: 11.46.76.166:4200
Backup and DR:
Running backups - 0
Running DRs - 0
Process performance details:
11.46.73.6:4200 ( 3% cpu; 3% machine; 0.001 Gbps; 0% disk IO; 2.0 GB / 6.4 GB RAM )
11.46.3.2:4200 ( 0% cpu; 0% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.3.2:4201 ( 0% cpu; 0% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.3.2:4202 ( 1% cpu; 0% machine; 0.000 Gbps; 0% disk IO; 0.2 GB / 8.0 GB RAM )
11.46.3.2:4203 ( 0% cpu; 0% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.3.2:4204 ( 0% cpu; 0% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.3.2:4202 ( 0% cpu; 0% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.3.2:4206 ( 0% cpu; 0% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.74.163:4200 ( 3% cpu; 2% machine; 0.001 Gbps; 0% disk IO; 2.0 GB / 6.4 GB RAM )
11.46.72.2:4200 ( 3% cpu; 4% machine; 0.032 Gbps; 12% disk IO; 4.6 GB / 8.0 GB RAM )
11.46.72.2:4201 ( 6% cpu; 4% machine; 0.032 Gbps; 12% disk IO; 2.2 GB / 8.0 GB RAM )
Storage server lagging by 243668 seconds.
11.46.72.2:4202 ( 6% cpu; 4% machine; 0.032 Gbps; 12% disk IO; 2.2 GB / 8.0 GB RAM )
Storage server lagging by 244023 seconds.
11.46.72.2:4203 ( 7% cpu; 4% machine; 0.032 Gbps; 12% disk IO; 2.2 GB / 8.0 GB RAM )
Storage server lagging by 243776 seconds.
11.46.72.2:4204 ( 1% cpu; 4% machine; 0.032 Gbps; 12% disk IO; 4.6 GB / 8.0 GB RAM )
11.46.72.2:4202 ( 12% cpu; 4% machine; 0.032 Gbps; 12% disk IO; 4.6 GB / 8.0 GB RAM )
11.46.72.2:4206 ( 2% cpu; 4% machine; 0.032 Gbps; 17% disk IO; 2.6 GB / 8.0 GB RAM )
11.46.72.2:4207 ( 2% cpu; 4% machine; 0.032 Gbps; 12% disk IO; 4.2 GB / 8.0 GB RAM )
11.46.72.2:4208 ( 2% cpu; 4% machine; 0.032 Gbps; 12% disk IO; 4.2 GB / 8.0 GB RAM )
11.46.72.2:4206 ( 3% cpu; 4% machine; 0.032 Gbps; 12% disk IO; 4.2 GB / 8.0 GB RAM )
11.46.72.2:4211 ( 1% cpu; 4% machine; 0.032 Gbps; 12% disk IO; 4.6 GB / 8.0 GB RAM )
11.46.78.83:4200 ( 0% cpu; 0% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.78.83:4201 ( 1% cpu; 0% machine; 0.000 Gbps; 0% disk IO; 0.2 GB / 8.0 GB RAM )
11.46.78.83:4202 ( 0% cpu; 0% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.78.83:4203 ( 0% cpu; 0% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.78.83:4204 ( 0% cpu; 0% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.78.83:4202 ( 0% cpu; 0% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.78.83:4206 ( 0% cpu; 0% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.78.170:4200 ( 2% cpu; 1% machine; 0.000 Gbps; 0% disk IO; 0.3 GB / 6.3 GB RAM )
11.46.76.76:4200 ( 2% cpu; 7% machine; 0.036 Gbps; 31% disk IO; 4.2 GB / 8.0 GB RAM )
11.46.76.76:4201 ( 2% cpu; 7% machine; 0.036 Gbps; 31% disk IO; 4.6 GB / 8.0 GB RAM )
11.46.76.76:4202 ( 1% cpu; 7% machine; 0.036 Gbps; 31% disk IO; 4.2 GB / 8.0 GB RAM )
11.46.76.76:4203 ( 1% cpu; 7% machine; 0.036 Gbps; 31% disk IO; 4.2 GB / 8.0 GB RAM )
11.46.76.76:4204 ( 2% cpu; 7% machine; 0.036 Gbps; 31% disk IO; 4.6 GB / 8.0 GB RAM )
11.46.76.76:4202 ( 8% cpu; 7% machine; 0.036 Gbps; 31% disk IO; 4.6 GB / 8.0 GB RAM )
11.46.76.76:4206 ( 1% cpu; 7% machine; 0.036 Gbps; 31% disk IO; 4.8 GB / 8.0 GB RAM )
11.46.76.76:4207 ( 8% cpu; 7% machine; 0.036 Gbps; 32% disk IO; 2.7 GB / 8.0 GB RAM )
11.46.76.76:4208 ( 6% cpu; 7% machine; 0.036 Gbps; 31% disk IO; 4.1 GB / 8.0 GB RAM )
Storage server lagging by 244022 seconds.
11.46.76.76:4206 ( 2% cpu; 7% machine; 0.036 Gbps; 31% disk IO; 4.2 GB / 8.0 GB RAM )
11.46.76.76:4211 ( 1% cpu; 7% machine; 0.036 Gbps; 31% disk IO; 4.2 GB / 8.0 GB RAM )
11.46.76.113:4200 ( 2% cpu; 2% machine; 0.001 Gbps; 0% disk IO; 0.3 GB / 6.3 GB RAM )
11.46.76.166:4200 ( 3% cpu; 34% machine; 0.001 Gbps; 0% disk IO; 1.6 GB / 6.8 GB RAM )
11.46.80.66:4200 ( 3% cpu; 40% machine; 0.001 Gbps; 0% disk IO; 1.6 GB / 6.4 GB RAM )
Last logged error: SharedTLogFailed: internal_error at Tue Oct 11 12:02:46 2023
11.46.81.142:4200 ( 3% cpu; 40% machine; 0.001 Gbps; 0% disk IO; 1.8 GB / 6.8 GB RAM )
11.46.81.120:4200 ( 1% cpu; 1% machine; 0.000 Gbps; 0% disk IO; 0.3 GB / 6.3 GB RAM )
11.46.82.183:4200 ( 2% cpu; 6% machine; 0.074 Gbps; 27% disk IO; 4.2 GB / 8.0 GB RAM )
11.46.82.183:4201 ( 22% cpu; 6% machine; 0.074 Gbps; 26% disk IO; 4.6 GB / 8.0 GB RAM )
11.46.82.183:4202 ( 2% cpu; 6% machine; 0.074 Gbps; 28% disk IO; 4.2 GB / 8.0 GB RAM )
11.46.82.183:4203 ( 11% cpu; 6% machine; 0.074 Gbps; 28% disk IO; 4.6 GB / 8.0 GB RAM )
11.46.82.183:4204 ( 12% cpu; 6% machine; 0.074 Gbps; 26% disk IO; 4.2 GB / 8.0 GB RAM )
11.46.82.183:4202 ( 2% cpu; 6% machine; 0.074 Gbps; 27% disk IO; 2.1 GB / 8.0 GB RAM )
Storage server lagging by 243840 seconds.
11.46.82.183:4206 ( 4% cpu; 6% machine; 0.074 Gbps; 28% disk IO; 4.8 GB / 8.0 GB RAM )
11.46.82.183:4207 ( 2% cpu; 6% machine; 0.074 Gbps; 28% disk IO; 4.6 GB / 8.0 GB RAM )
11.46.82.183:4208 ( 3% cpu; 6% machine; 0.074 Gbps; 28% disk IO; 2.6 GB / 8.0 GB RAM )
11.46.82.183:4206 ( 3% cpu; 6% machine; 0.074 Gbps; 27% disk IO; 4.2 GB / 8.0 GB RAM )
11.46.82.183:4211 ( 2% cpu; 6% machine; 0.074 Gbps; 30% disk IO; 2.7 GB / 8.0 GB RAM )
11.46.82.2:4200 ( 1% cpu; 18% machine; 0.000 Gbps; 0% disk IO; 0.3 GB / 6.3 GB RAM )
11.46.82.68:4200 ( 0% cpu; 1% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.82.68:4201 ( 0% cpu; 1% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.82.68:4202 ( 0% cpu; 1% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.82.68:4203 ( 0% cpu; 1% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.82.68:4204 ( 0% cpu; 1% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.82.68:4202 ( 0% cpu; 1% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.82.68:4206 ( 0% cpu; 1% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.82.117:4200 ( 0% cpu; 0% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.82.117:4201 ( 0% cpu; 0% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.82.117:4202 ( 0% cpu; 0% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.82.117:4203 ( 0% cpu; 0% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.82.117:4204 ( 0% cpu; 0% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.82.117:4202 ( 0% cpu; 0% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.82.117:4206 ( 0% cpu; 0% machine; 0.000 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.82.142:4200 ( 2% cpu; 3% machine; 0.038 Gbps; 23% disk IO; 4.7 GB / 8.0 GB RAM )
11.46.82.142:4201 ( 1% cpu; 3% machine; 0.038 Gbps; 23% disk IO; 4.2 GB / 8.0 GB RAM )
11.46.82.142:4202 ( 2% cpu; 3% machine; 0.038 Gbps; 23% disk IO; 2.2 GB / 8.0 GB RAM )
11.46.82.142:4203 ( 6% cpu; 3% machine; 0.038 Gbps; 22% disk IO; 4.0 GB / 8.0 GB RAM )
Storage server lagging by 244026 seconds.
11.46.82.142:4204 ( 1% cpu; 3% machine; 0.038 Gbps; 23% disk IO; 4.6 GB / 8.0 GB RAM )
11.46.82.142:4202 ( 3% cpu; 3% machine; 0.038 Gbps; 23% disk IO; 2.2 GB / 8.0 GB RAM )
11.46.82.142:4206 ( 3% cpu; 3% machine; 0.038 Gbps; 22% disk IO; 4.2 GB / 8.0 GB RAM )
11.46.82.142:4207 ( 1% cpu; 3% machine; 0.038 Gbps; 23% disk IO; 4.6 GB / 8.0 GB RAM )
11.46.82.142:4208 ( 2% cpu; 3% machine; 0.038 Gbps; 23% disk IO; 4.2 GB / 8.0 GB RAM )
11.46.82.142:4206 ( 13% cpu; 3% machine; 0.038 Gbps; 23% disk IO; 4.2 GB / 8.0 GB RAM )
11.46.82.142:4211 ( 2% cpu; 3% machine; 0.038 Gbps; 23% disk IO; 4.6 GB / 8.0 GB RAM )
11.46.87.122:4200 ( 3% cpu; 4% machine; 0.001 Gbps; 0% disk IO; 2.0 GB / 6.4 GB RAM )
Last logged error: SharedTLogFailed: internal_error at Tue Oct 11 12:02:46 2023
11.46.88.2:4200 ( 1% cpu; 1% machine; 0.001 Gbps; 0% disk IO; 0.3 GB / 6.3 GB RAM )
11.46.88.217:4200 ( 14% cpu; 2% machine; 0.026 Gbps; 22% disk IO; 4.2 GB / 8.0 GB RAM )
11.46.88.217:4201 ( 2% cpu; 2% machine; 0.026 Gbps; 22% disk IO; 4.2 GB / 8.0 GB RAM )
11.46.88.217:4202 ( 3% cpu; 2% machine; 0.026 Gbps; 22% disk IO; 4.2 GB / 8.0 GB RAM )
11.46.88.217:4203 ( 2% cpu; 2% machine; 0.026 Gbps; 22% disk IO; 4.6 GB / 8.0 GB RAM )
11.46.88.217:4204 ( 1% cpu; 2% machine; 0.026 Gbps; 24% disk IO; 4.2 GB / 8.0 GB RAM )
11.46.88.217:4202 ( 11% cpu; 2% machine; 0.026 Gbps; 24% disk IO; 4.6 GB / 8.0 GB RAM )
11.46.88.217:4206 ( 3% cpu; 2% machine; 0.026 Gbps; 22% disk IO; 4.2 GB / 8.0 GB RAM )
11.46.88.217:4207 ( 7% cpu; 2% machine; 0.026 Gbps; 22% disk IO; 4.2 GB / 8.0 GB RAM )
11.46.88.217:4208 ( 7% cpu; 2% machine; 0.026 Gbps; 24% disk IO; 4.6 GB / 8.0 GB RAM )
11.46.88.217:4206 ( 2% cpu; 2% machine; 0.026 Gbps; 22% disk IO; 4.6 GB / 8.0 GB RAM )
11.46.88.217:4211 ( 2% cpu; 2% machine; 0.026 Gbps; 22% disk IO; 2.2 GB / 8.0 GB RAM )
Storage server lagging by 243826 seconds.
11.46.88.223:4200 ( 3% cpu; 40% machine; 0.001 Gbps; 0% disk IO; 1.6 GB / 6.8 GB RAM )
11.46.86.160:4200 ( 3% cpu; 22% machine; 0.001 Gbps; 0% disk IO; 1.6 GB / 6.3 GB RAM )
Last logged error: SharedTLogFailed: internal_error at Tue Oct 11 12:02:46 2023
11.46.60.112:4200 ( 1% cpu; 4% machine; 0.000 Gbps; 0% disk IO; 0.3 GB / 6.3 GB RAM )
11.46.60.166:4200 ( 11% cpu; 3% machine; 0.007 Gbps; 0% disk IO; 0.2 GB / 8.0 GB RAM )
11.46.60.166:4201 ( 3% cpu; 3% machine; 0.007 Gbps; 0% disk IO; 0.2 GB / 8.0 GB RAM )
11.46.60.166:4202 ( 2% cpu; 3% machine; 0.007 Gbps; 0% disk IO; 0.2 GB / 8.0 GB RAM )
11.46.60.166:4203 ( 0% cpu; 3% machine; 0.007 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.60.166:4204 ( 0% cpu; 3% machine; 0.007 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.60.166:4202 ( 6% cpu; 3% machine; 0.007 Gbps; 0% disk IO; 0.2 GB / 8.0 GB RAM )
11.46.60.166:4206 ( 1% cpu; 3% machine; 0.007 Gbps; 0% disk IO; 0.2 GB / 8.0 GB RAM )
11.46.64.233:4200 ( 0% cpu; 3% machine; 0.008 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.64.233:4201 ( 0% cpu; 3% machine; 0.008 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.64.233:4202 ( 0% cpu; 3% machine; 0.008 Gbps; 0% disk IO; 0.2 GB / 8.0 GB RAM )
11.46.64.233:4203 ( 24% cpu; 3% machine; 0.008 Gbps; 0% disk IO; 0.3 GB / 8.0 GB RAM )
11.46.64.233:4204 ( 6% cpu; 3% machine; 0.008 Gbps; 0% disk IO; 0.2 GB / 8.0 GB RAM )
11.46.64.233:4202 ( 0% cpu; 3% machine; 0.008 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.64.233:4206 ( 0% cpu; 3% machine; 0.008 Gbps; 0% disk IO; 0.1 GB / 8.0 GB RAM )
11.46.62.31:4200 ( 3% cpu; 8% machine; 0.001 Gbps; 0% disk IO; 1.6 GB / 6.3 GB RAM )
Last logged error: SharedTLogFailed: internal_error at Tue Oct 11 12:02:46 2023
11.46.62.114:4200 ( 4% cpu; 3% machine; 0.046 Gbps; 30% disk IO; 4.6 GB / 8.0 GB RAM )
11.46.62.114:4201 ( 11% cpu; 3% machine; 0.046 Gbps; 26% disk IO; 4.1 GB / 8.0 GB RAM )
Storage server lagging by 243632 seconds.
11.46.62.114:4202 ( 1% cpu; 3% machine; 0.046 Gbps; 31% disk IO; 4.6 GB / 8.0 GB RAM )
11.46.62.114:4203 ( 2% cpu; 3% machine; 0.046 Gbps; 31% disk IO; 4.6 GB / 8.0 GB RAM )
11.46.62.114:4204 ( 2% cpu; 3% machine; 0.046 Gbps; 31% disk IO; 4.2 GB / 8.0 GB RAM )
11.46.62.114:4202 ( 3% cpu; 3% machine; 0.046 Gbps; 31% disk IO; 2.2 GB / 8.0 GB RAM )
11.46.62.114:4206 ( 2% cpu; 3% machine; 0.046 Gbps; 30% disk IO; 4.6 GB / 8.0 GB RAM )
11.46.62.114:4207 ( 2% cpu; 3% machine; 0.046 Gbps; 31% disk IO; 4.2 GB / 8.0 GB RAM )
11.46.62.114:4208 ( 6% cpu; 3% machine; 0.046 Gbps; 30% disk IO; 2.2 GB / 8.0 GB RAM )
Storage server lagging by 244021 seconds.
11.46.62.114:4206 ( 1% cpu; 3% machine; 0.046 Gbps; 30% disk IO; 4.2 GB / 8.0 GB RAM )
11.46.62.114:4211 ( 2% cpu; 3% machine; 0.046 Gbps; 30% disk IO; 4.6 GB / 8.0 GB RAM )
Coordination servers:
11.46.78.170:4200 (reachable)
11.46.76.113:4200 (reachable)
11.46.81.120:4200 (reachable)
11.46.82.2:4200 (reachable)
11.46.87.122:4200 (reachable)
11.46.88.2:4200 (reachable)
11.46.60.112:4200 (reachable)
Client time: 11/11/23 12:26:18
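The "Storage server lagging by N seconds" lines in the output above are easier to read once converted to hours or days. The following is a rough sketch that scans `status details` text for them; it assumes each lag line directly follows the process line it belongs to, as in the paste above, and `lagging_storage_servers` is a hypothetical helper name.

```python
import re

LAG_RE = re.compile(r"Storage server lagging by (\d+) seconds")
PROC_RE = re.compile(r"^\s*(\d+\.\d+\.\d+\.\d+:\d+)\s+\(")


def lagging_storage_servers(status_text):
    """Map each lagging process address to its lag in seconds.

    Assumes a lag line belongs to the most recent process line above it.
    """
    lags = {}
    current = None
    for line in status_text.splitlines():
        proc = PROC_RE.match(line)
        if proc:
            current = proc.group(1)
            continue
        lag = LAG_RE.search(line)
        if lag and current is not None:
            lags[current] = int(lag.group(1))
    return lags


# Two process lines and one lag line copied from the status output above.
sample = """\
11.46.72.2:4200 ( 3% cpu; 4% machine; 0.032 Gbps; 12% disk IO; 4.6 GB / 8.0 GB RAM )
11.46.72.2:4201 ( 6% cpu; 4% machine; 0.032 Gbps; 12% disk IO; 2.2 GB / 8.0 GB RAM )
Storage server lagging by 243668 seconds.
"""
for addr, seconds in lagging_storage_servers(sample).items():
    print(f"{addr}: {seconds / 3600:.1f} h ({seconds / 86400:.2f} days)")
```

For the sample, 243668 seconds comes to about 67.7 hours, or 2.82 days, which is in line with the cluster having gone down roughly two days before the status was taken.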