FDB degraded log processes. What does degraded mean?

As a follow up of this post Storage Server CPU bottleneck - Growing data lag, I managed to stabilize the cluster by developing a module which applies a retention policy that prunes old data, which in turn increases the locality of the inserted data. This in combination with the more aggressive knobs seem to have fixed the problem of the increasing data lag.

After running two clusters with the same configuration and a similar workload, I noticed that at several intervals, the clusters produced timeouts in writes, while 1-2 logs processes where marked as “degraded”.

Questions:

  1. What exactly degraded means in the FDB context?
  2. Is there a way to find out why a process is degraded?
  3. When does a process stops being degraded?
  4. If a log process is degraded why FDB doesn’t use the spare log instead?

Relevant data:

The cluster during the current workload:


Log stats from the same time range:

Logs with Severity > 10 from a log process which is marked as degraded (log-2 in the graph)

<Event Severity="20" Time="1600773328.062849" Type="TLogQueueCommitSlow" ID="d20368ec5628ec29" LateProcessCount="7" LoggingDelay="1" Machine="10.16.45.5:4501" LogGroup="turing2-mist-fdb" Roles="TL" />
<Event Severity="20" Time="1600774230.094659" Type="TLogQueueCommitSlow" ID="d20368ec5628ec29" LateProcessCount="27" LoggingDelay="1" Machine="10.16.45.5:4501" LogGroup="turing2-mist-fdb" Roles="TL" />
<Event Severity="20" Time="1600774337.097655" Type="TLogQueueCommitSlow" ID="d20368ec5628ec29" LateProcessCount="9" LoggingDelay="1" Machine="10.16.45.5:4501" LogGroup="turing2-mist-fdb" Roles="TL" />
<Event Severity="20" Time="1600774606.106858" Type="TLogQueueCommitSlow" ID="d20368ec5628ec29" LateProcessCount="6" LoggingDelay="1" Machine="10.16.45.5:4501" LogGroup="turing2-mist-fdb" Roles="TL" />
<Event Severity="20" Time="1600774609.106982" Type="TLogQueueCommitSlow" ID="d20368ec5628ec29" LateProcessCount="2" LoggingDelay="1" Machine="10.16.45.5:4501" LogGroup="turing2-mist-fdb" Roles="TL" />
<Event Severity="20" Time="1600774610.106999" Type="TLogQueueCommitSlow" ID="d20368ec5628ec29" LateProcessCount="10" LoggingDelay="1" Machine="10.16.45.5:4501" LogGroup="turing2-mist-fdb" Roles="TL" />
<Event Severity="20" Time="1600774881.114514" Type="TLogQueueCommitSlow" ID="d20368ec5628ec29" LateProcessCount="4" LoggingDelay="1" Machine="10.16.45.5:4501" LogGroup="turing2-mist-fdb" Roles="TL" />
<Event Severity="20" Time="1600774942.117677" Type="TLogQueueCommitSlow" ID="d20368ec5628ec29" LateProcessCount="7" LoggingDelay="1" Machine="10.16.45.5:4501" LogGroup="turing2-mist-fdb" Roles="TL" />
<Event Severity="20" Time="1600775624.158769" Type="TLogQueueCommitSlow" ID="d20368ec5628ec29" LateProcessCount="7" LoggingDelay="1" Machine="10.16.45.5:4501" LogGroup="turing2-mist-fdb" Roles="TL" />
<Event Severity="20" Time="1600777911.238035" Type="TLogQueueCommitSlow" ID="d20368ec5628ec29" LateProcessCount="6" LoggingDelay="1" Machine="10.16.45.5:4501" LogGroup="turing2-mist-fdb" Roles="TL" />
<Event Severity="20" Time="1600777912.238060" Type="TLogQueueCommitSlow" ID="d20368ec5628ec29" LateProcessCount="6" LoggingDelay="1" Machine="10.16.45.5:4501" LogGroup="turing2-mist-fdb" Roles="TL" />
<Event Severity="20" Time="1600777913.238087" Type="TLogQueueCommitSlow" ID="d20368ec5628ec29" LateProcessCount="1" LoggingDelay="1" Machine="10.16.45.5:4501" LogGroup="turing2-mist-fdb" Roles="TL" />
<Event Severity="20" Time="1600777927.238446" Type="TLogQueueCommitSlow" ID="d20368ec5628ec29" LateProcessCount="5" LoggingDelay="1" Machine="10.16.45.5:4501" LogGroup="turing2-mist-fdb" Roles="TL" />

Notes:

  1. The timestamp of the last Event matches the timestamp in which one log process is marked as degraded (1->2 degraded process in the relevant graph).
  2. The whole cluster is deployed on a GCE cluster with the FDB Kubernetes Operator

I hit the same thing on one of my clusters and found this: https://github.com/apple/foundationdb/pull/1270

Guessing “degraded” has different meanings in fdb depending on context, and for TLogs it means that at some point it couldn’t commit for >5s. The degraded state remains for 2 days then gets reset. (Knobs: https://github.com/apple/foundationdb/pull/1270/files#diff-f409cd478d8df5100b9cc31fbc8a1f13cfab637930155952b7e01aa089bd2ea4)

I’m not sure what metrics/logs I can look at to debug why it couldn’t commit for >5s, if anyone has any advice here

Good code sleuthing! Currently only transaction logs have a degraded state, but it’s been discussed extending it to other roles.

There isn’t particularly anything within FDB that you can look at to find why a disk isn’t fsync()ing. You’d be better off examining the systems itself, and checking things like dmesg for drive-related system errors.

1 Like