FDB degraded log processes. What does degraded mean?

v-pap · September 22, 2020, 1:28pm

As a follow up of this post Storage Server CPU bottleneck - Growing data lag, I managed to stabilize the cluster by developing a module which applies a retention policy that prunes old data, which in turn increases the locality of the inserted data. This in combination with the more aggressive knobs seem to have fixed the problem of the increasing data lag.

After running two clusters with the same configuration and a similar workload, I noticed that at several intervals, the clusters produced timeouts in writes, while 1-2 logs processes where marked as “degraded”.

Questions:

What exactly degraded means in the FDB context?
Is there a way to find out why a process is degraded?
When does a process stops being degraded?
If a log process is degraded why FDB doesn’t use the spare log instead?

Relevant data:

The cluster during the current workload:

Log stats from the same time range:

Logs with Severity > 10 from a log process which is marked as degraded (log-2 in the graph)

<Event Severity="20" Time="1600773328.062849" Type="TLogQueueCommitSlow" ID="d20368ec5628ec29" LateProcessCount="7" LoggingDelay="1" Machine="10.16.45.5:4501" LogGroup="turing2-mist-fdb" Roles="TL" />
<Event Severity="20" Time="1600774230.094659" Type="TLogQueueCommitSlow" ID="d20368ec5628ec29" LateProcessCount="27" LoggingDelay="1" Machine="10.16.45.5:4501" LogGroup="turing2-mist-fdb" Roles="TL" />
<Event Severity="20" Time="1600774337.097655" Type="TLogQueueCommitSlow" ID="d20368ec5628ec29" LateProcessCount="9" LoggingDelay="1" Machine="10.16.45.5:4501" LogGroup="turing2-mist-fdb" Roles="TL" />
<Event Severity="20" Time="1600774606.106858" Type="TLogQueueCommitSlow" ID="d20368ec5628ec29" LateProcessCount="6" LoggingDelay="1" Machine="10.16.45.5:4501" LogGroup="turing2-mist-fdb" Roles="TL" />
<Event Severity="20" Time="1600774609.106982" Type="TLogQueueCommitSlow" ID="d20368ec5628ec29" LateProcessCount="2" LoggingDelay="1" Machine="10.16.45.5:4501" LogGroup="turing2-mist-fdb" Roles="TL" />
<Event Severity="20" Time="1600774610.106999" Type="TLogQueueCommitSlow" ID="d20368ec5628ec29" LateProcessCount="10" LoggingDelay="1" Machine="10.16.45.5:4501" LogGroup="turing2-mist-fdb" Roles="TL" />
<Event Severity="20" Time="1600774881.114514" Type="TLogQueueCommitSlow" ID="d20368ec5628ec29" LateProcessCount="4" LoggingDelay="1" Machine="10.16.45.5:4501" LogGroup="turing2-mist-fdb" Roles="TL" />
<Event Severity="20" Time="1600774942.117677" Type="TLogQueueCommitSlow" ID="d20368ec5628ec29" LateProcessCount="7" LoggingDelay="1" Machine="10.16.45.5:4501" LogGroup="turing2-mist-fdb" Roles="TL" />
<Event Severity="20" Time="1600775624.158769" Type="TLogQueueCommitSlow" ID="d20368ec5628ec29" LateProcessCount="7" LoggingDelay="1" Machine="10.16.45.5:4501" LogGroup="turing2-mist-fdb" Roles="TL" />
<Event Severity="20" Time="1600777911.238035" Type="TLogQueueCommitSlow" ID="d20368ec5628ec29" LateProcessCount="6" LoggingDelay="1" Machine="10.16.45.5:4501" LogGroup="turing2-mist-fdb" Roles="TL" />
<Event Severity="20" Time="1600777912.238060" Type="TLogQueueCommitSlow" ID="d20368ec5628ec29" LateProcessCount="6" LoggingDelay="1" Machine="10.16.45.5:4501" LogGroup="turing2-mist-fdb" Roles="TL" />
<Event Severity="20" Time="1600777913.238087" Type="TLogQueueCommitSlow" ID="d20368ec5628ec29" LateProcessCount="1" LoggingDelay="1" Machine="10.16.45.5:4501" LogGroup="turing2-mist-fdb" Roles="TL" />
<Event Severity="20" Time="1600777927.238446" Type="TLogQueueCommitSlow" ID="d20368ec5628ec29" LateProcessCount="5" LoggingDelay="1" Machine="10.16.45.5:4501" LogGroup="turing2-mist-fdb" Roles="TL" />

Notes:

The timestamp of the last Event matches the timestamp in which one log process is marked as degraded (1->2 degraded process in the relevant graph).
The whole cluster is deployed on a GCE cluster with the FDB Kubernetes Operator

amanda · October 14, 2020, 9:12pm

I hit the same thing on one of my clusters and found this: https://github.com/apple/foundationdb/pull/1270

Guessing “degraded” has different meanings in fdb depending on context, and for TLogs it means that at some point it couldn’t commit for >5s. The degraded state remains for 2 days then gets reset. (Knobs: https://github.com/apple/foundationdb/pull/1270/files#diff-f409cd478d8df5100b9cc31fbc8a1f13cfab637930155952b7e01aa089bd2ea4)

I’m not sure what metrics/logs I can look at to debug why it couldn’t commit for >5s, if anyone has any advice here

alexmiller · October 15, 2020, 1:02am

Good code sleuthing! Currently only transaction logs have a degraded state, but it’s been discussed extending it to other roles.

There isn’t particularly anything within FDB that you can look at to find why a disk isn’t fsync()ing. You’d be better off examining the systems itself, and checking things like dmesg for drive-related system errors.

Topic		Replies	Views
Processes stop logging but appear to be doing work? Using FoundationDB	3	796	December 24, 2019
Foundation db process degrade alert for a cluster Running FoundationDB	3	59	June 25, 2025
SharedTLogFailed: internal_error Using FoundationDB performance	5	415	October 25, 2023
Sudden spike in TLogQueueCommitSlow warning Using FoundationDB	0	26	June 18, 2025
Storage Server CPU bottleneck - Growing data lag Using FoundationDB performance	22	3029	December 13, 2021

FDB degraded log processes. What does degraded mean?

Related topics