100% CPU usage on an idle cluster

Sometimes we are seeing 100% CPU usage on DR destination clusters and a high rate of “transaction started”, even when no client is connected and dr_agent has been stopped:

fdb> status details

Using cluster file `/etc/foundationdb/fdb.cluster'.

Configuration:
  Storage engine         - ssd-2
  Coordinators           - 1
  Exclusions             - 2 (type `exclude' for details)
  Usable Regions         - 1

Cluster:
  FoundationDB processes - 1
  Zones                  - 1
  Machines               - 1
  Memory availability    - 7.3 GB per process on machine with least available
  Fault Tolerance        - 0 machines
  Server time            - 04/25/23 10:31:29

Data:
  Replication health     - Healthy
  Moving data            - 0.000 GB
  Sum of key-value sizes - 20 MB
  Disk space used        - 238 MB

Operating space:
  Storage server         - 46.0 GB free on most full server
  Log server             - 46.0 GB free on most full server

Workload:
  Read rate              - 9 Hz
  Write rate             - 0 Hz
  Transactions started   - 35190 Hz
  Transactions committed - 0 Hz
  Conflict rate          - 0 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 1 as secondary

Running DR tags (as secondary):
  default                - <redacted>

Process performance details:
  10.220.11.2:4500:tls   (100% cpu; 51% machine; 0.000 Gbps;  0% disk IO; 3.1 GB / 7.3 GB RAM  )

Coordination servers:
  10.220.11.2:4500:tls  (reachable)

Client time: 04/25/23 10:31:25

Restarting the cluster by running kill all recovers it out of the bad state. The issue is hard to reproduce, since we only see it happen about once per cluster per week randomly.

Logs are mostly of type TransactionMetrics, and sometimes GlobalConfig_RetryableMigrationError:

<Event Severity="10" Time="1682418199.886493" DateTime="2023-04-25T10:23:19Z" Type="TransactionMetrics" ID="04108c7367c3f85a" Elapsed="5.00019" Cluster="" Internal="1" ReadVersions="1.39995 0.118028 63556" ReadVersionsThrottled="0 -1 0" ReadVersionsCompleted="0 -1 1" ReadVersionBatches="1.39995 0.118031 63556" BatchPriorityReadVersions="0 -1 0" DefaultPriorityReadVersions="1.39995 0.118028 63555" ImmediatePriorityReadVersions="0 -1 1" BatchPriorityReadVersionsCompleted="0 -1 0" DefaultPriorityReadVersionsCompleted="0 -1 0" ImmediatePriorityReadVersionsCompleted="0 -1 1" LogicalUncachedReads="1.39995 0.118028 63556" PhysicalReadRequests="0 -1 1" PhysicalReadRequestsCompleted="0 -1 1" GetKeyRequests="0 -1 0" GetValueRequests="1.39995 0.118028 63556" GetRangeRequests="0 -1 0" GetMappedRangeRequests="0 -1 0" GetRangeStreamRequests="0 -1 0" WatchRequests="0 -1 0" GetAddressesForKeyRequests="0 -1 0" BytesRead="0 -1 26" KeysRead="0 -1 1" MetadataVersionReads="0 -1 0" CommittedMutations="0 -1 0" CommittedMutationBytes="0 -1 0" SetMutations="0 -1 0" ClearMutations="0 -1 0" AtomicMutations="0 -1 0" CommitStarted="0 -1 0" CommitCompleted="0 -1 0" KeyServerLocationRequests="0 -1 1" KeyServerLocationRequestsCompleted="0 -1 1" StatusRequests="0 -1 0" TooOld="0 -1 0" FutureVersions="0 -1 0" NotCommitted="0 -1 0" MaybeCommitted="0 -1 0" ResourceConstrained="0 -1 0" ProcessBehind="0 -1 0" Throttled="0 -1 0" ExpensiveClearCostEstCount="0 -1 0" NumGrvFullBatches="0 -1 0" NumGrvTimedOutBatches="1.39995 0.118031 63556" CommitVersionNotFoundForSS="0 -1 0" LocationCacheEntryCount="2" MeanLatency="0" MedianLatency="0" Latency90="0" Latency98="0" MaxLatency="0" MeanRowReadLatency="0" MedianRowReadLatency="0" MaxRowReadLatency="0" MeanGRVLatency="0.132474" MedianGRVLatency="0.12978" MaxGRVLatency="0.154077" MeanCommitLatency="0" MedianCommitLatency="0" MaxCommitLatency="0" MeanMutationsPerCommit="0" MedianMutationsPerCommit="0" MaxMutationsPerCommit="0" MeanBytesPerCommit="0" MedianBytesPerCommit="0" MaxBytesPerCommit="0" NumLocalityCacheEntries="2" ThreadID="14882012433472669136" Machine="10.220.11.2:4500" LogGroup="default" Roles="CC,CD,CP,DD,GP,MS,RK,RV,SS,TL" />
<Event Severity="10" Time="1682418199.886493" DateTime="2023-04-25T10:23:19Z" Type="TransactionMetrics" ID="5d941bfd98485f66" Elapsed="5.00019" Cluster="" Internal="1" ReadVersions="1.39995 0.113238 65301" ReadVersionsThrottled="0 -1 0" ReadVersionsCompleted="0 -1 1" ReadVersionBatches="1.39995 0.113251 65301" BatchPriorityReadVersions="0 -1 0" DefaultPriorityReadVersions="1.39995 0.113238 65300" ImmediatePriorityReadVersions="0 -1 1" BatchPriorityReadVersionsCompleted="0 -1 0" DefaultPriorityReadVersionsCompleted="0 -1 0" ImmediatePriorityReadVersionsCompleted="0 -1 1" LogicalUncachedReads="1.39995 0.113238 65301" PhysicalReadRequests="0 -1 1" PhysicalReadRequestsCompleted="0 -1 1" GetKeyRequests="0 -1 0" GetValueRequests="1.39995 0.113238 65301" GetRangeRequests="0 -1 0" GetMappedRangeRequests="0 -1 0" GetRangeStreamRequests="0 -1 0" WatchRequests="0 -1 0" GetAddressesForKeyRequests="0 -1 0" BytesRead="0 -1 26" KeysRead="0 -1 1" MetadataVersionReads="0 -1 0" CommittedMutations="0 -1 0" CommittedMutationBytes="0 -1 0" SetMutations="0 -1 0" ClearMutations="0 -1 0" AtomicMutations="0 -1 0" CommitStarted="0 -1 0" CommitCompleted="0 -1 0" KeyServerLocationRequests="0 -1 1" KeyServerLocationRequestsCompleted="0 -1 1" StatusRequests="0 -1 0" TooOld="0 -1 0" FutureVersions="0 -1 0" NotCommitted="0 -1 0" MaybeCommitted="0 -1 0" ResourceConstrained="0 -1 0" ProcessBehind="0 -1 0" Throttled="0 -1 0" ExpensiveClearCostEstCount="0 -1 0" NumGrvFullBatches="0 -1 0" NumGrvTimedOutBatches="1.39995 0.113251 65301" CommitVersionNotFoundForSS="0 -1 0" LocationCacheEntryCount="2" MeanLatency="0" MedianLatency="0" Latency90="0" Latency98="0" MaxLatency="0" MeanRowReadLatency="0" MedianRowReadLatency="0" MaxRowReadLatency="0" MeanGRVLatency="0.131402" MedianGRVLatency="0.130134" MaxGRVLatency="0.14512" MeanCommitLatency="0" MedianCommitLatency="0" MaxCommitLatency="0" MeanMutationsPerCommit="0" MedianMutationsPerCommit="0" MaxMutationsPerCommit="0" MeanBytesPerCommit="0" MedianBytesPerCommit="0" MaxBytesPerCommit="0" NumLocalityCacheEntries="2" ThreadID="14882012433472669136" Machine="10.220.11.2:4500" LogGroup="default" Roles="CC,CD,CP,DD,GP,MS,RK,RV,SS,TL" />
<Event Severity="10" Time="1682418199.889216" DateTime="2023-04-25T10:23:19Z" Type="GlobalConfig_RetryableMigrationError" ID="0000000000000000" Error="database_locked" ErrorDescription="Database is locked" ErrorCode="1038" SuppressedEventCount="32350" ThreadID="14882012433472669136" Machine="10.220.11.2:4500" LogGroup="default" Roles="CC,CD,CP,DD,GP,MS,RK,RV,SS,TL" />

The nodes are running FoundationDB 7.1.31.

after 6.3, FDB is not maintaining the old-style DR (which has two FDB clusters connected by dr_agent).
FDB uses HA to replace DR, which uses less resource and provide no-data loss guarantee in the failover

Is it possible to have more than two regions with the new HA configuration?

Thanks for the reply! @mengxu

Nope. only 2 regions supported so far.
Region 1 has 1 primary dc and 1 satellite DC; and
Region 2 has 1 dc and 1 satellite DC.

We have a fix for the issue with DR itself, but it seems that the FDB repository is closed for pull requests. Would you be willing to review and accept a patch? @mengxu

@ammolitor is managing the permission. He can do that.

1 Like

Thanks!

Hi @ammolitor , can I get permission to open pull requests to the foundationdb repository?

invite sent, thanks for contributing!

Didn’t receive the email, where should I check for the invite?

My GitHub username is losfair

my mistake, I assumed you were the owner of the fork where the referenced commit was shared from. you should have it now.

1 Like