As a follow up of this post Storage Server CPU bottleneck - Growing data lag, I managed to stabilize the cluster by developing a module which applies a retention policy that prunes old data, which in turn increases the locality of the inserted data. This in combination with the more aggressive knobs seem to have fixed the problem of the increasing data lag.
After running two clusters with the same configuration and a similar workload, I noticed that at several intervals, the clusters produced timeouts in writes, while 1-2 logs processes where marked as “degraded”.
Questions:
- What exactly degraded means in the FDB context?
- Is there a way to find out why a process is degraded?
- When does a process stops being degraded?
- If a log process is degraded why FDB doesn’t use the spare log instead?
Relevant data:
The cluster during the current workload:
Log stats from the same time range:
Logs with Severity > 10
from a log process which is marked as degraded (log-2 in the graph)
<Event Severity="20" Time="1600773328.062849" Type="TLogQueueCommitSlow" ID="d20368ec5628ec29" LateProcessCount="7" LoggingDelay="1" Machine="10.16.45.5:4501" LogGroup="turing2-mist-fdb" Roles="TL" />
<Event Severity="20" Time="1600774230.094659" Type="TLogQueueCommitSlow" ID="d20368ec5628ec29" LateProcessCount="27" LoggingDelay="1" Machine="10.16.45.5:4501" LogGroup="turing2-mist-fdb" Roles="TL" />
<Event Severity="20" Time="1600774337.097655" Type="TLogQueueCommitSlow" ID="d20368ec5628ec29" LateProcessCount="9" LoggingDelay="1" Machine="10.16.45.5:4501" LogGroup="turing2-mist-fdb" Roles="TL" />
<Event Severity="20" Time="1600774606.106858" Type="TLogQueueCommitSlow" ID="d20368ec5628ec29" LateProcessCount="6" LoggingDelay="1" Machine="10.16.45.5:4501" LogGroup="turing2-mist-fdb" Roles="TL" />
<Event Severity="20" Time="1600774609.106982" Type="TLogQueueCommitSlow" ID="d20368ec5628ec29" LateProcessCount="2" LoggingDelay="1" Machine="10.16.45.5:4501" LogGroup="turing2-mist-fdb" Roles="TL" />
<Event Severity="20" Time="1600774610.106999" Type="TLogQueueCommitSlow" ID="d20368ec5628ec29" LateProcessCount="10" LoggingDelay="1" Machine="10.16.45.5:4501" LogGroup="turing2-mist-fdb" Roles="TL" />
<Event Severity="20" Time="1600774881.114514" Type="TLogQueueCommitSlow" ID="d20368ec5628ec29" LateProcessCount="4" LoggingDelay="1" Machine="10.16.45.5:4501" LogGroup="turing2-mist-fdb" Roles="TL" />
<Event Severity="20" Time="1600774942.117677" Type="TLogQueueCommitSlow" ID="d20368ec5628ec29" LateProcessCount="7" LoggingDelay="1" Machine="10.16.45.5:4501" LogGroup="turing2-mist-fdb" Roles="TL" />
<Event Severity="20" Time="1600775624.158769" Type="TLogQueueCommitSlow" ID="d20368ec5628ec29" LateProcessCount="7" LoggingDelay="1" Machine="10.16.45.5:4501" LogGroup="turing2-mist-fdb" Roles="TL" />
<Event Severity="20" Time="1600777911.238035" Type="TLogQueueCommitSlow" ID="d20368ec5628ec29" LateProcessCount="6" LoggingDelay="1" Machine="10.16.45.5:4501" LogGroup="turing2-mist-fdb" Roles="TL" />
<Event Severity="20" Time="1600777912.238060" Type="TLogQueueCommitSlow" ID="d20368ec5628ec29" LateProcessCount="6" LoggingDelay="1" Machine="10.16.45.5:4501" LogGroup="turing2-mist-fdb" Roles="TL" />
<Event Severity="20" Time="1600777913.238087" Type="TLogQueueCommitSlow" ID="d20368ec5628ec29" LateProcessCount="1" LoggingDelay="1" Machine="10.16.45.5:4501" LogGroup="turing2-mist-fdb" Roles="TL" />
<Event Severity="20" Time="1600777927.238446" Type="TLogQueueCommitSlow" ID="d20368ec5628ec29" LateProcessCount="5" LoggingDelay="1" Machine="10.16.45.5:4501" LogGroup="turing2-mist-fdb" Roles="TL" />
Notes:
- The timestamp of the last
Event
matches the timestamp in which one log process is marked as degraded (1->2 degraded process in the relevant graph). - The whole cluster is deployed on a GCE cluster with the FDB Kubernetes Operator