We are seeing below RecordCore timeouts exceptions in several deployments that talk to FDB. We use a MetaDataVersionStampStoreStateCache
We started seeing this recently. There was no change to application logic/ cluster config. This seems to slow down the application and we are seeing increased latency.
Any pointers to troubleshoot/ narrow down the issue is much appreciated.
Caused by: com.apple.foundationdb.record.RecordCoreException: null
at com.apple.foundationdb.record.provider.foundationdb.FDBExceptions.wrapException(FDBExceptions.java:199)
at com.apple.foundationdb.record.provider.foundationdb.FDBDatabase.lambda$new$0(FDBDatabase.java:162)
at com.apple.foundationdb.record.provider.foundationdb.FDBDatabase.asyncToSync(FDBDatabase.java:1002)
at com.apple.foundationdb.record.provider.foundationdb.FDBRecordContext.asyncToSync(FDBRecordContext.java:986)
at com.apple.foundationdb.record.provider.foundationdb.FDBRecordStoreBase$BaseBuilder.createOrOpen(FDBRecordStoreBase.java:1948)
at io.branch.service.persona.fdb.IPRecordStore.lambda$getGeoIPRecordStore$1(IPRecordStore.java:62)
at io.branch.service.persona.fdb.IPRecordStore.getFDBRecordStore(IPRecordStore.java:47)
at io.branch.ip.store.provider.IPStoreCandidateImpl.lambda$batchWriteRecord$3(IPStoreCandidateImpl.java:171)
at com.apple.foundationdb.record.provider.foundationdb.FDBDatabaseRunnerImpl$RunRetriable.lambda$runAsync$4(FDBDatabaseRunnerImpl.java:331)
at com.apple.foundationdb.async.AsyncUtil$LoopPartial.apply(AsyncUtil.java:355)
Caused by: java.util.concurrent.TimeoutException: null
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
at com.apple.foundationdb.record.provider.foundationdb.FDBDatabase.asyncToSync(FDBDatabase.java:993)
at com.apple.foundationdb.record.provider.foundationdb.FDBRecordContext.asyncToSync(FDBRecordContext.java:986)
at com.apple.foundationdb.record.provider.foundationdb.FDBRecordStoreBase$BaseBuilder.createOrOpen(FDBRecordStoreBase.java:1948)
at io.branch.service.persona.fdb.IPRecordStore.lambda$getGeoIPRecordStore$1(IPRecordStore.java:62)
at io.branch.service.persona.fdb.IPRecordStore.getFDBRecordStore(IPRecordStore.java:47)
at io.branch.ip.store.provider.IPStoreCandidateImpl.lambda$batchWriteRecord$3(IPStoreCandidateImpl.java:171)
at com.apple.foundationdb.record.provider.foundationdb.FDBDatabaseRunnerImpl$RunRetriable.lambda$runAsync$4(FDBDatabaseRunnerImpl.java:331)
at com.apple.foundationdb.async.AsyncUtil$LoopPartial.apply(AsyncUtil.java:355)
fdbRecordLayer = “2.8.110.0”
fdbJava = “6.1.8”
fdbExtensions =“2.8.110.0”
Cluster details:
FoundationDB 6.2 (v6.2.15)
source version 20566f2ff06a7e822b30e8cfd91090fbd863a393
protocol fdb00b062010001
Configuration:
Redundancy mode - double
Storage engine - ssd-2
Coordinators - 6
Exclusions - 28 (type `exclude' for details)
Desired Proxies - 12
Desired Resolvers - 8
Desired Logs - 12
Cluster:
FoundationDB processes - 316 (less 2 excluded; 0 with errors)
Zones - 24
Machines - 24
Memory availability - 5.6 GB per process on machine with least available
Retransmissions rate - 4 Hz
Fault Tolerance - 1 machine
Server time - 05/10/22 00:38:00
Data:
Replication health - Healthy
Moving data - 0.000 GB
Sum of key-value sizes - 839.649 GB
Disk space used - 2.332 TB
Operating space:
Storage server - 1618.5 GB free on most full server
Log server - 825.8 GB free on most full server
Workload:
Read rate - 102118 Hz
Write rate - 56493 Hz
Transactions started - 60974 Hz
Transactions committed - 195 Hz
Conflict rate - 2 Hz
Backup and DR:
Running backups - 0
Running DRs - 0
Issue started on 04/28 at 04:50 AM PST.
Few graphs between 04:00 - 06:00 AM PST on 04/28
We do not see RateKeeper ratelimiting application.
GRV latency bands:
Metric - fdb_cluster_processes_roles_proxy_grv_latency_bands_0_XXXX
Proxy Commit latency bands
Metric - fdb_cluster_processes_roles_proxy_commit_latency_bands_0_XXXX
Tx Log Queue:
Read/Write Rate:
Tx started/committed/conflicts