Over the last 7 days, our FDB cluster experienced a large increase in key-value size and disk usage. We have 85 storage pods with 340GB disks. During that week, our developer loaded about 300GB of data. However, the key-value size grew by 1.1TB, from 4.9TB to 6.0TB, which is far more than the 300GB that was loaded.
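To put the numbers side by side, here is a rough back-of-the-envelope sketch. It uses only the figures quoted above; decimal units (1TB = 1000GB) and one 340GB disk per storage pod are assumptions, and the variable names are just for illustration.

```python
# Figures quoted in the post. Decimal units (1 TB = 1000 GB) are an assumption.
GB_PER_TB = 1000

loaded_gb = 300          # data loaded over the week
kv_before_tb = 4.9       # key-value size before the load
kv_after_tb = 6.0        # key-value size at the peak
storage_pods = 85
disk_per_pod_gb = 340    # assumes one disk per storage pod

# Growth in key-value size, rounded to whole GB to avoid float noise.
kv_growth_gb = round((kv_after_tb - kv_before_tb) * GB_PER_TB)
growth_ratio = kv_growth_gb / loaded_gb
raw_capacity_tb = storage_pods * disk_per_pod_gb / GB_PER_TB

print(f"KV growth: {kv_growth_gb} GB ({growth_ratio:.2f}x the loaded volume)")
print(f"Raw storage capacity: {raw_capacity_tb:.1f} TB")
```

So the observed growth is roughly 3.7 times the volume we believe was loaded.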
The loading stopped on August 24 at 13:00. Then on August 25 at 05:30, the key-value size dropped suddenly to below 5.0TB.
We have a daily backup job for the cluster that has been running for a long time. Last week’s data load was a batch, and it was much bigger than usual.
We scanned the trace logs and see a lot of errors like this:
<Event Severity="20" Time="1661403748.485664" Type="TLogQueueCommitSlow" ID="f2acefd5229f05de" LateProcessCount="6" LoggingDelay="\ 1" Machine="10.104.220.130:4000" LogGroup="default" Roles="TL" />
The event TLogQueueCommitSlow does show up in the log, but it has a severity of 20, so it is a warning rather than an error. Another event that occurs often is SlowSSLoopx100.
We still have logs; if we want to look for the root cause, what events should we look into?
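As a starting point for narrowing this down, here is a minimal sketch that tallies trace events by type and severity, so the most frequent warnings stand out. It assumes XML-format trace files with one `<Event ... />` element per line, as in the sample above. The function name `tally_events`, the `min_severity` threshold, and the second sample event are hypothetical and only there for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from typing import Iterable


def tally_events(lines: Iterable[str], min_severity: int = 20) -> Counter:
    """Count trace events by (Type, Severity), skipping events below min_severity."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line.startswith("<Event "):
            continue  # skip wrapper tags, blank lines, and anything else
        try:
            event = ET.fromstring(line)
        except ET.ParseError:
            continue  # e.g. a truncated line from a file still being written
        severity = int(event.get("Severity", "0"))
        if severity >= min_severity:
            counts[(event.get("Type", "?"), severity)] += 1
    return counts


# Two sample lines: the first is modeled on the event above, the second is made up.
sample = [
    '<Event Severity="20" Time="1661403748.485664" Type="TLogQueueCommitSlow" '
    'ID="f2acefd5229f05de" Machine="10.104.220.130:4000" LogGroup="default" Roles="TL" />',
    '<Event Severity="10" Time="1661403749.000000" Type="HypotheticalInfoEvent" '
    'Machine="10.104.220.130:4000" />',
]

for (event_type, severity), n in tally_events(sample).most_common():
    print(event_type, severity, n)
```

To run it on a real file, pass an open file handle instead of `sample`, for example `tally_events(open(path))`, where `path` points at one of the trace files we still have.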
Also, disk usage rose even more dramatically than the key-value size, but its drop has been very slow (unlike the sudden drop in key-value size).
We don’t understand why KV size and disk space usage increased so much faster than the data was loaded, or why the KV size dropped suddenly while disk usage is dropping slowly. Any insight?