FDB TimeoutExceptions degrading application performance

We are seeing the RecordCore timeout exceptions below in several deployments that talk to FDB. We use a MetaDataVersionStampStoreStateCache.

We started seeing this recently. There was no change to application logic or cluster configuration. This seems to be slowing down the application, and we are seeing increased latency.
Any pointers to troubleshoot or narrow down the issue would be much appreciated.

Caused by: com.apple.foundationdb.record.RecordCoreException: null
	at com.apple.foundationdb.record.provider.foundationdb.FDBExceptions.wrapException(FDBExceptions.java:199)
	at com.apple.foundationdb.record.provider.foundationdb.FDBDatabase.lambda$new$0(FDBDatabase.java:162)
	at com.apple.foundationdb.record.provider.foundationdb.FDBDatabase.asyncToSync(FDBDatabase.java:1002)
	at com.apple.foundationdb.record.provider.foundationdb.FDBRecordContext.asyncToSync(FDBRecordContext.java:986)
	at com.apple.foundationdb.record.provider.foundationdb.FDBRecordStoreBase$BaseBuilder.createOrOpen(FDBRecordStoreBase.java:1948)
	at io.branch.service.persona.fdb.IPRecordStore.lambda$getGeoIPRecordStore$1(IPRecordStore.java:62)
	at io.branch.service.persona.fdb.IPRecordStore.getFDBRecordStore(IPRecordStore.java:47)
	at io.branch.ip.store.provider.IPStoreCandidateImpl.lambda$batchWriteRecord$3(IPStoreCandidateImpl.java:171)
	at com.apple.foundationdb.record.provider.foundationdb.FDBDatabaseRunnerImpl$RunRetriable.lambda$runAsync$4(FDBDatabaseRunnerImpl.java:331)
	at com.apple.foundationdb.async.AsyncUtil$LoopPartial.apply(AsyncUtil.java:355)
Caused by: java.util.concurrent.TimeoutException: null
	at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771)
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
	at com.apple.foundationdb.record.provider.foundationdb.FDBDatabase.asyncToSync(FDBDatabase.java:993)
	at com.apple.foundationdb.record.provider.foundationdb.FDBRecordContext.asyncToSync(FDBRecordContext.java:986)
	at com.apple.foundationdb.record.provider.foundationdb.FDBRecordStoreBase$BaseBuilder.createOrOpen(FDBRecordStoreBase.java:1948)
	at io.branch.service.persona.fdb.IPRecordStore.lambda$getGeoIPRecordStore$1(IPRecordStore.java:62)
	at io.branch.service.persona.fdb.IPRecordStore.getFDBRecordStore(IPRecordStore.java:47)
	at io.branch.ip.store.provider.IPStoreCandidateImpl.lambda$batchWriteRecord$3(IPStoreCandidateImpl.java:171)
	at com.apple.foundationdb.record.provider.foundationdb.FDBDatabaseRunnerImpl$RunRetriable.lambda$runAsync$4(FDBDatabaseRunnerImpl.java:331)
	at com.apple.foundationdb.async.AsyncUtil$LoopPartial.apply(AsyncUtil.java:355)
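For context on how this exception arises: the bottom of the trace is the Record Layer's blocking asyncToSync wrapper, which calls CompletableFuture.get with a deadline and surfaces a TimeoutException when the underlying future (here, the createOrOpen on the record store) does not complete in time; FDBExceptions.wrapException then wraps it in a RecordCoreException. A minimal stdlib sketch of that mechanism (class and method names here are illustrative, not actual Record Layer code):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class AsyncToSyncSketch {
    // Mimics the shape of FDBDatabase.asyncToSync: block on an async result
    // with a deadline, and rethrow a timeout as a wrapped runtime exception
    // (the Record Layer's wrapper is RecordCoreException).
    static <T> T asyncToSync(CompletableFuture<T> future, long timeoutMillis) throws Exception {
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            throw new RuntimeException("asyncToSync timed out", e);
        }
    }

    public static void main(String[] args) throws Exception {
        // A future that never completes stands in for a stalled createOrOpen().
        CompletableFuture<String> stalled = new CompletableFuture<>();
        try {
            asyncToSync(stalled, 50);
        } catch (RuntimeException e) {
            // prints: cause: TimeoutException
            System.out.println("cause: " + e.getCause().getClass().getSimpleName());
        }
    }
}
```

The point being that the wrapper itself is not the problem; the question is what kept the wrapped future (the record store open at FDBRecordStoreBase$BaseBuilder.createOrOpen) from completing within the configured asyncToSync timeout.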

fdbRecordLayer = "2.8.110.0"
fdbJava = "6.1.8"
fdbExtensions = "2.8.110.0"

Cluster details:

FoundationDB 6.2 (v6.2.15)
source version 20566f2ff06a7e822b30e8cfd91090fbd863a393
protocol fdb00b062010001
Configuration:
  Redundancy mode        - double
  Storage engine         - ssd-2
  Coordinators           - 6
  Exclusions             - 28 (type `exclude' for details)
  Desired Proxies        - 12
  Desired Resolvers      - 8
  Desired Logs           - 12

Cluster:
  FoundationDB processes - 316 (less 2 excluded; 0 with errors)
  Zones                  - 24
  Machines               - 24
  Memory availability    - 5.6 GB per process on machine with least available
  Retransmissions rate   - 4 Hz
  Fault Tolerance        - 1 machine
  Server time            - 05/10/22 00:38:00

Data:
  Replication health     - Healthy
  Moving data            - 0.000 GB
  Sum of key-value sizes - 839.649 GB
  Disk space used        - 2.332 TB

Operating space:
  Storage server         - 1618.5 GB free on most full server
  Log server             - 825.8 GB free on most full server

Workload:
  Read rate              - 102118 Hz
  Write rate             - 56493 Hz
  Transactions started   - 60974 Hz
  Transactions committed - 195 Hz
  Conflict rate          - 2 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Issue started on 04/28 at 04:50 AM PST.

A few graphs between 04:00 and 06:00 AM PST on 04/28:
We do not see Ratekeeper rate-limiting the application.

GRV latency bands:
Metric - fdb_cluster_processes_roles_proxy_grv_latency_bands_0_XXXX

Proxy commit latency bands:
Metric - fdb_cluster_processes_roles_proxy_commit_latency_bands_0_XXXX

Tx Log Queue:

Read/Write Rate:

Tx started/committed/conflicts

@amehta

Found a networking issue that was affecting a subset of the application pods. Resolved.