Big increase and sudden drop in key-value size and disk usage over one week

Over the last week, our FDB cluster experienced a large increase in key-value size and disk usage. We have 85 storage pods with 340GB disks each. During those 7 days, our developers loaded about 300GB of data. However, the key-value size grew by 1.1TB (from 4.9TB to 6.0TB), far more than the 300GB loaded.
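
For scale, here is a quick back-of-the-envelope check of those numbers (all values copied from above, no other assumptions):

```python
# Sanity check of the figures quoted in the post.
storage_pods = 85
disk_per_pod_gb = 340
raw_capacity_tb = storage_pods * disk_per_pod_gb / 1000  # ~28.9 TB of raw disk

loaded_tb = 300 / 1000                      # ~0.3 TB loaded by the batch job
kv_growth_tb = 6.0 - 4.9                    # 1.1 TB of observed KV-size growth
unexplained_tb = kv_growth_tb - loaded_tb   # ~0.8 TB unaccounted for

print(f"raw disk capacity: {raw_capacity_tb:.1f} TB")
print(f"KV growth: {kv_growth_tb:.1f} TB ({unexplained_tb:.1f} TB unexplained)")
```

So roughly 0.8TB of the growth is not explained by the loaded data alone.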

The loading stopped on August 24 at 13:00. Then on August 25 at 05:30, the key-value size suddenly dropped back below 5.0TB.

We have a daily backup job that has been running on the cluster for a long time. Last week's data load was a batch job, much bigger than usual.

We scanned the trace logs and saw many entries like this:

```
<Event Severity="20" Time="1661403748.485664" Type="TLogQueueCommitSlow" ID="f2acefd5229f05de" LateProcessCount="6" LoggingDelay="1" Machine="10.104.220.130:4000" LogGroup="default" Roles="TL" />
```

The TLogQueueCommitSlow event does show up in the logs, but it has severity 20, so it is a warning rather than an error. Another event that occurred often is SlowSSLoopx100.
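
To get an overview of which warning- and error-level events dominate, something like the following sketch can tally them. It assumes the trace logs are the usual one-Event-per-line XML files and that the attribute order matches the sample above; the log path is a placeholder for your deployment:

```python
import collections
import glob
import re

# Placeholder path; point this at your FDB trace-log directory.
TRACE_GLOB = "/var/log/foundationdb/trace.*.xml"

# Matches Severity and Type in the attribute order used by the sample event.
event_re = re.compile(r'<Event Severity="(\d+)"[^>]*\bType="([^"]+)"')

counts = collections.Counter()
for path in glob.glob(TRACE_GLOB):
    with open(path, errors="replace") as f:
        for line in f:
            m = event_re.search(line)
            if m:
                counts[(int(m.group(1)), m.group(2))] += 1

# Print warnings (severity 20) and errors (severity 40),
# most severe and most frequent first.
for (severity, ev_type), n in sorted(counts.items(),
                                     key=lambda kv: (-kv[0][0], -kv[1])):
    if severity >= 20:
        print(f"Severity {severity}  {ev_type}: {n}")
```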

We still have the logs; if we want to find the root cause, which events should we look into?

Also, the disk usage rose even more dramatically than the key-value size, but its drop has been very slow (nothing like the sudden drop in key-value size).

We don’t understand why the increases were so fast, why the KV size dropped suddenly, or why the disk usage is shrinking so slowly. Any insight?

The extra KV bytes could be from the backup. If a backup was running during the data load, then a copy of the mutation logs for that entire period, including all of the loaded data, would be stored in the database as KV data until it is flushed to the backup destination and then deleted from the database. It would also be deleted if the backup were aborted. That deletion would cause a very fast drop in KV size.
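
One way to check this theory would be to see whether a backup was active around the time of the drop, e.g. with `fdbbackup status`. A minimal sketch (the cluster-file path and the tag name "default" are assumptions for your deployment):

```python
import subprocess

# Placeholder cluster file and backup tag; adjust to your deployment.
# `fdbbackup status` reports whether a backup is running and how far it
# has progressed flushing mutation logs to the backup destination.
result = subprocess.run(
    ["fdbbackup", "status",
     "-C", "/etc/foundationdb/fdb.cluster",
     "-t", "default"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```

If the time the backup finished (or was aborted) lines up with August 25 at 05:30, that would corroborate this explanation.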

That sounds like a good explanation for what happened. Thank you, Steve.