Sometimes we are seeing 100% CPU usage on DR destination clusters and a high rate of “transaction started”, even when no client is connected and dr_agent
has been stopped:
fdb> status details
Using cluster file `/etc/foundationdb/fdb.cluster'.
Configuration:
Storage engine - ssd-2
Coordinators - 1
Exclusions - 2 (type `exclude' for details)
Usable Regions - 1
Cluster:
FoundationDB processes - 1
Zones - 1
Machines - 1
Memory availability - 7.3 GB per process on machine with least available
Fault Tolerance - 0 machines
Server time - 04/25/23 10:31:29
Data:
Replication health - Healthy
Moving data - 0.000 GB
Sum of key-value sizes - 20 MB
Disk space used - 238 MB
Operating space:
Storage server - 46.0 GB free on most full server
Log server - 46.0 GB free on most full server
Workload:
Read rate - 9 Hz
Write rate - 0 Hz
Transactions started - 35190 Hz
Transactions committed - 0 Hz
Conflict rate - 0 Hz
Backup and DR:
Running backups - 0
Running DRs - 1 as secondary
Running DR tags (as secondary):
default - <redacted>
Process performance details:
10.220.11.2:4500:tls (100% cpu; 51% machine; 0.000 Gbps; 0% disk IO; 3.1 GB / 7.3 GB RAM )
Coordination servers:
10.220.11.2:4500:tls (reachable)
Client time: 04/25/23 10:31:25
Restarting the cluster by running kill all
recovers it out of the bad state. The issue is hard to reproduce, since we only see it happen about once per cluster per week randomly.
Logs are mostly of type TransactionMetrics
, and sometimes GlobalConfig_RetryableMigrationError
:
<Event Severity="10" Time="1682418199.886493" DateTime="2023-04-25T10:23:19Z" Type="TransactionMetrics" ID="04108c7367c3f85a" Elapsed="5.00019" Cluster="" Internal="1" ReadVersions="1.39995 0.118028 63556" ReadVersionsThrottled="0 -1 0" ReadVersionsCompleted="0 -1 1" ReadVersionBatches="1.39995 0.118031 63556" BatchPriorityReadVersions="0 -1 0" DefaultPriorityReadVersions="1.39995 0.118028 63555" ImmediatePriorityReadVersions="0 -1 1" BatchPriorityReadVersionsCompleted="0 -1 0" DefaultPriorityReadVersionsCompleted="0 -1 0" ImmediatePriorityReadVersionsCompleted="0 -1 1" LogicalUncachedReads="1.39995 0.118028 63556" PhysicalReadRequests="0 -1 1" PhysicalReadRequestsCompleted="0 -1 1" GetKeyRequests="0 -1 0" GetValueRequests="1.39995 0.118028 63556" GetRangeRequests="0 -1 0" GetMappedRangeRequests="0 -1 0" GetRangeStreamRequests="0 -1 0" WatchRequests="0 -1 0" GetAddressesForKeyRequests="0 -1 0" BytesRead="0 -1 26" KeysRead="0 -1 1" MetadataVersionReads="0 -1 0" CommittedMutations="0 -1 0" CommittedMutationBytes="0 -1 0" SetMutations="0 -1 0" ClearMutations="0 -1 0" AtomicMutations="0 -1 0" CommitStarted="0 -1 0" CommitCompleted="0 -1 0" KeyServerLocationRequests="0 -1 1" KeyServerLocationRequestsCompleted="0 -1 1" StatusRequests="0 -1 0" TooOld="0 -1 0" FutureVersions="0 -1 0" NotCommitted="0 -1 0" MaybeCommitted="0 -1 0" ResourceConstrained="0 -1 0" ProcessBehind="0 -1 0" Throttled="0 -1 0" ExpensiveClearCostEstCount="0 -1 0" NumGrvFullBatches="0 -1 0" NumGrvTimedOutBatches="1.39995 0.118031 63556" CommitVersionNotFoundForSS="0 -1 0" LocationCacheEntryCount="2" MeanLatency="0" MedianLatency="0" Latency90="0" Latency98="0" MaxLatency="0" MeanRowReadLatency="0" MedianRowReadLatency="0" MaxRowReadLatency="0" MeanGRVLatency="0.132474" MedianGRVLatency="0.12978" MaxGRVLatency="0.154077" MeanCommitLatency="0" MedianCommitLatency="0" MaxCommitLatency="0" MeanMutationsPerCommit="0" MedianMutationsPerCommit="0" MaxMutationsPerCommit="0" MeanBytesPerCommit="0" MedianBytesPerCommit="0" MaxBytesPerCommit="0" NumLocalityCacheEntries="2" ThreadID="14882012433472669136" Machine="10.220.11.2:4500" LogGroup="default" Roles="CC,CD,CP,DD,GP,MS,RK,RV,SS,TL" />
<Event Severity="10" Time="1682418199.886493" DateTime="2023-04-25T10:23:19Z" Type="TransactionMetrics" ID="5d941bfd98485f66" Elapsed="5.00019" Cluster="" Internal="1" ReadVersions="1.39995 0.113238 65301" ReadVersionsThrottled="0 -1 0" ReadVersionsCompleted="0 -1 1" ReadVersionBatches="1.39995 0.113251 65301" BatchPriorityReadVersions="0 -1 0" DefaultPriorityReadVersions="1.39995 0.113238 65300" ImmediatePriorityReadVersions="0 -1 1" BatchPriorityReadVersionsCompleted="0 -1 0" DefaultPriorityReadVersionsCompleted="0 -1 0" ImmediatePriorityReadVersionsCompleted="0 -1 1" LogicalUncachedReads="1.39995 0.113238 65301" PhysicalReadRequests="0 -1 1" PhysicalReadRequestsCompleted="0 -1 1" GetKeyRequests="0 -1 0" GetValueRequests="1.39995 0.113238 65301" GetRangeRequests="0 -1 0" GetMappedRangeRequests="0 -1 0" GetRangeStreamRequests="0 -1 0" WatchRequests="0 -1 0" GetAddressesForKeyRequests="0 -1 0" BytesRead="0 -1 26" KeysRead="0 -1 1" MetadataVersionReads="0 -1 0" CommittedMutations="0 -1 0" CommittedMutationBytes="0 -1 0" SetMutations="0 -1 0" ClearMutations="0 -1 0" AtomicMutations="0 -1 0" CommitStarted="0 -1 0" CommitCompleted="0 -1 0" KeyServerLocationRequests="0 -1 1" KeyServerLocationRequestsCompleted="0 -1 1" StatusRequests="0 -1 0" TooOld="0 -1 0" FutureVersions="0 -1 0" NotCommitted="0 -1 0" MaybeCommitted="0 -1 0" ResourceConstrained="0 -1 0" ProcessBehind="0 -1 0" Throttled="0 -1 0" ExpensiveClearCostEstCount="0 -1 0" NumGrvFullBatches="0 -1 0" NumGrvTimedOutBatches="1.39995 0.113251 65301" CommitVersionNotFoundForSS="0 -1 0" LocationCacheEntryCount="2" MeanLatency="0" MedianLatency="0" Latency90="0" Latency98="0" MaxLatency="0" MeanRowReadLatency="0" MedianRowReadLatency="0" MaxRowReadLatency="0" MeanGRVLatency="0.131402" MedianGRVLatency="0.130134" MaxGRVLatency="0.14512" MeanCommitLatency="0" MedianCommitLatency="0" MaxCommitLatency="0" MeanMutationsPerCommit="0" MedianMutationsPerCommit="0" MaxMutationsPerCommit="0" MeanBytesPerCommit="0" MedianBytesPerCommit="0" MaxBytesPerCommit="0" NumLocalityCacheEntries="2" ThreadID="14882012433472669136" Machine="10.220.11.2:4500" LogGroup="default" Roles="CC,CD,CP,DD,GP,MS,RK,RV,SS,TL" />
<Event Severity="10" Time="1682418199.889216" DateTime="2023-04-25T10:23:19Z" Type="GlobalConfig_RetryableMigrationError" ID="0000000000000000" Error="database_locked" ErrorDescription="Database is locked" ErrorCode="1038" SuppressedEventCount="32350" ThreadID="14882012433472669136" Machine="10.220.11.2:4500" LogGroup="default" Roles="CC,CD,CP,DD,GP,MS,RK,RV,SS,TL" />
The nodes are running FoundationDB 7.1.31.