100% CPU usage on an idle cluster

losfair · April 25, 2023, 10:38am

Sometimes we are seeing 100% CPU usage on DR destination clusters and a high rate of “transaction started”, even when no client is connected and dr_agent has been stopped:

fdb> status details

Using cluster file `/etc/foundationdb/fdb.cluster'.

Configuration:
  Storage engine         - ssd-2
  Coordinators           - 1
  Exclusions             - 2 (type `exclude' for details)
  Usable Regions         - 1

Cluster:
  FoundationDB processes - 1
  Zones                  - 1
  Machines               - 1
  Memory availability    - 7.3 GB per process on machine with least available
  Fault Tolerance        - 0 machines
  Server time            - 04/25/23 10:31:29

Data:
  Replication health     - Healthy
  Moving data            - 0.000 GB
  Sum of key-value sizes - 20 MB
  Disk space used        - 238 MB

Operating space:
  Storage server         - 46.0 GB free on most full server
  Log server             - 46.0 GB free on most full server

Workload:
  Read rate              - 9 Hz
  Write rate             - 0 Hz
  Transactions started   - 35190 Hz
  Transactions committed - 0 Hz
  Conflict rate          - 0 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 1 as secondary

Running DR tags (as secondary):
  default                - <redacted>

Process performance details:
  10.220.11.2:4500:tls   (100% cpu; 51% machine; 0.000 Gbps;  0% disk IO; 3.1 GB / 7.3 GB RAM  )

Coordination servers:
  10.220.11.2:4500:tls  (reachable)

Client time: 04/25/23 10:31:25

Restarting the cluster by running kill all recovers it out of the bad state. The issue is hard to reproduce, since we only see it happen about once per cluster per week randomly.

Logs are mostly of type TransactionMetrics, and sometimes GlobalConfig_RetryableMigrationError:

<Event Severity="10" Time="1682418199.886493" DateTime="2023-04-25T10:23:19Z" Type="TransactionMetrics" ID="04108c7367c3f85a" Elapsed="5.00019" Cluster="" Internal="1" ReadVersions="1.39995 0.118028 63556" ReadVersionsThrottled="0 -1 0" ReadVersionsCompleted="0 -1 1" ReadVersionBatches="1.39995 0.118031 63556" BatchPriorityReadVersions="0 -1 0" DefaultPriorityReadVersions="1.39995 0.118028 63555" ImmediatePriorityReadVersions="0 -1 1" BatchPriorityReadVersionsCompleted="0 -1 0" DefaultPriorityReadVersionsCompleted="0 -1 0" ImmediatePriorityReadVersionsCompleted="0 -1 1" LogicalUncachedReads="1.39995 0.118028 63556" PhysicalReadRequests="0 -1 1" PhysicalReadRequestsCompleted="0 -1 1" GetKeyRequests="0 -1 0" GetValueRequests="1.39995 0.118028 63556" GetRangeRequests="0 -1 0" GetMappedRangeRequests="0 -1 0" GetRangeStreamRequests="0 -1 0" WatchRequests="0 -1 0" GetAddressesForKeyRequests="0 -1 0" BytesRead="0 -1 26" KeysRead="0 -1 1" MetadataVersionReads="0 -1 0" CommittedMutations="0 -1 0" CommittedMutationBytes="0 -1 0" SetMutations="0 -1 0" ClearMutations="0 -1 0" AtomicMutations="0 -1 0" CommitStarted="0 -1 0" CommitCompleted="0 -1 0" KeyServerLocationRequests="0 -1 1" KeyServerLocationRequestsCompleted="0 -1 1" StatusRequests="0 -1 0" TooOld="0 -1 0" FutureVersions="0 -1 0" NotCommitted="0 -1 0" MaybeCommitted="0 -1 0" ResourceConstrained="0 -1 0" ProcessBehind="0 -1 0" Throttled="0 -1 0" ExpensiveClearCostEstCount="0 -1 0" NumGrvFullBatches="0 -1 0" NumGrvTimedOutBatches="1.39995 0.118031 63556" CommitVersionNotFoundForSS="0 -1 0" LocationCacheEntryCount="2" MeanLatency="0" MedianLatency="0" Latency90="0" Latency98="0" MaxLatency="0" MeanRowReadLatency="0" MedianRowReadLatency="0" MaxRowReadLatency="0" MeanGRVLatency="0.132474" MedianGRVLatency="0.12978" MaxGRVLatency="0.154077" MeanCommitLatency="0" MedianCommitLatency="0" MaxCommitLatency="0" MeanMutationsPerCommit="0" MedianMutationsPerCommit="0" MaxMutationsPerCommit="0" MeanBytesPerCommit="0" MedianBytesPerCommit="0" MaxBytesPerCommit="0" NumLocalityCacheEntries="2" ThreadID="14882012433472669136" Machine="10.220.11.2:4500" LogGroup="default" Roles="CC,CD,CP,DD,GP,MS,RK,RV,SS,TL" />
<Event Severity="10" Time="1682418199.886493" DateTime="2023-04-25T10:23:19Z" Type="TransactionMetrics" ID="5d941bfd98485f66" Elapsed="5.00019" Cluster="" Internal="1" ReadVersions="1.39995 0.113238 65301" ReadVersionsThrottled="0 -1 0" ReadVersionsCompleted="0 -1 1" ReadVersionBatches="1.39995 0.113251 65301" BatchPriorityReadVersions="0 -1 0" DefaultPriorityReadVersions="1.39995 0.113238 65300" ImmediatePriorityReadVersions="0 -1 1" BatchPriorityReadVersionsCompleted="0 -1 0" DefaultPriorityReadVersionsCompleted="0 -1 0" ImmediatePriorityReadVersionsCompleted="0 -1 1" LogicalUncachedReads="1.39995 0.113238 65301" PhysicalReadRequests="0 -1 1" PhysicalReadRequestsCompleted="0 -1 1" GetKeyRequests="0 -1 0" GetValueRequests="1.39995 0.113238 65301" GetRangeRequests="0 -1 0" GetMappedRangeRequests="0 -1 0" GetRangeStreamRequests="0 -1 0" WatchRequests="0 -1 0" GetAddressesForKeyRequests="0 -1 0" BytesRead="0 -1 26" KeysRead="0 -1 1" MetadataVersionReads="0 -1 0" CommittedMutations="0 -1 0" CommittedMutationBytes="0 -1 0" SetMutations="0 -1 0" ClearMutations="0 -1 0" AtomicMutations="0 -1 0" CommitStarted="0 -1 0" CommitCompleted="0 -1 0" KeyServerLocationRequests="0 -1 1" KeyServerLocationRequestsCompleted="0 -1 1" StatusRequests="0 -1 0" TooOld="0 -1 0" FutureVersions="0 -1 0" NotCommitted="0 -1 0" MaybeCommitted="0 -1 0" ResourceConstrained="0 -1 0" ProcessBehind="0 -1 0" Throttled="0 -1 0" ExpensiveClearCostEstCount="0 -1 0" NumGrvFullBatches="0 -1 0" NumGrvTimedOutBatches="1.39995 0.113251 65301" CommitVersionNotFoundForSS="0 -1 0" LocationCacheEntryCount="2" MeanLatency="0" MedianLatency="0" Latency90="0" Latency98="0" MaxLatency="0" MeanRowReadLatency="0" MedianRowReadLatency="0" MaxRowReadLatency="0" MeanGRVLatency="0.131402" MedianGRVLatency="0.130134" MaxGRVLatency="0.14512" MeanCommitLatency="0" MedianCommitLatency="0" MaxCommitLatency="0" MeanMutationsPerCommit="0" MedianMutationsPerCommit="0" MaxMutationsPerCommit="0" MeanBytesPerCommit="0" MedianBytesPerCommit="0" MaxBytesPerCommit="0" NumLocalityCacheEntries="2" ThreadID="14882012433472669136" Machine="10.220.11.2:4500" LogGroup="default" Roles="CC,CD,CP,DD,GP,MS,RK,RV,SS,TL" />
<Event Severity="10" Time="1682418199.889216" DateTime="2023-04-25T10:23:19Z" Type="GlobalConfig_RetryableMigrationError" ID="0000000000000000" Error="database_locked" ErrorDescription="Database is locked" ErrorCode="1038" SuppressedEventCount="32350" ThreadID="14882012433472669136" Machine="10.220.11.2:4500" LogGroup="default" Roles="CC,CD,CP,DD,GP,MS,RK,RV,SS,TL" />

The nodes are running FoundationDB 7.1.31.

mengxu · April 26, 2023, 12:47am

after 6.3, FDB is not maintaining the old-style DR (which has two FDB clusters connected by dr_agent).
FDB uses HA to replace DR, which uses less resource and provide no-data loss guarantee in the failover

losfair · April 26, 2023, 2:11am

Is it possible to have more than two regions with the new HA configuration?

Thanks for the reply! @mengxu

mengxu · April 26, 2023, 3:58am

Nope. only 2 regions supported so far.
Region 1 has 1 primary dc and 1 satellite DC; and
Region 2 has 1 dc and 1 satellite DC.

losfair · May 2, 2023, 2:43pm

We have a fix for the issue with DR itself, but it seems that the FDB repository is closed for pull requests. Would you be willing to review and accept a patch? @mengxu

mengxu · May 2, 2023, 4:38pm

@ammolitor is managing the permission. He can do that.

losfair · May 3, 2023, 3:02pm

Thanks!

Hi @ammolitor , can I get permission to open pull requests to the foundationdb repository?

ammolitor · May 3, 2023, 3:15pm

invite sent, thanks for contributing!

losfair · May 3, 2023, 3:42pm

Didn’t receive the email, where should I check for the invite?

My GitHub username is losfair

ammolitor · May 3, 2023, 3:55pm

my mistake, I assumed you were the owner of the fork where the referenced commit was shared from. you should have it now.

Topic		Replies	Views
Newly added storage nodes have disk usage at 98% Using FoundationDB	15	1532	August 14, 2020
CPU limited storage processes Using FoundationDB performance	9	1713	May 18, 2021
Debugging abnormally high IO load Using FoundationDB	5	1507	October 31, 2019
FDB 6.2 - proxies processes have 100% CPU usage Running FoundationDB performance	1	445	October 10, 2022
Curious about CPU utilization of Single-core Write Test #1 Using FoundationDB performance	6	628	August 8, 2022

100% CPU usage on an idle cluster

Related topics