Cluster gets stuck

During performance testing, when I install a cluster and immediately run a heavy write workload, the cluster hangs in less than 5 seconds.

If I first create about 900 partitions with the ‘redistribute’ command (pre-splitting) and then run the workload, the cluster does not hang.
Likewise, if I run a light workload for a few minutes (as a warmup) and then run the heavy workload, the cluster does not hang.

Once a hang occurs, the time it takes to recover varies greatly.
Sometimes it took 2 minutes, sometimes 10, 30, or 40 minutes, and sometimes the cluster did not recover until over an hour later.

When I checked the status while the cluster was hung, most fields showed ‘unknown’.

fdb> status

Using cluster file `/etc/foundationdb/fdb.cluster'.

Unable to read database configuration.

Configuration:
  Redundancy mode        - unknown
  Storage engine         - unknown
  Coordinators           - unknown
  Usable Regions         - unknown

Cluster:
  FoundationDB processes - 192
  Zones                  - 12
  Machines               - 12
  Memory availability    - 6.2 GB per process on machine with least available
  Retransmissions rate   - 1 Hz
  Server time            - 05/16/24 19:58:03

Data:
  Replication health     - unknown
  Moving data            - unknown
  Sum of key-value sizes - unknown
  Disk space used        - unknown

Operating space:
  Unable to retrieve operating space status

Workload:
  Read rate              - unknown
  Write rate             - unknown
  Transactions started   - unknown
  Transactions committed - unknown
  Conflict rate          - unknown

Backup and DR:
  Running backups        - 0
  Running DRs            - 0 
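
To put numbers on how long a hang lasts, one option is to poll the machine-readable status instead of reading the text output by eye. The following is a minimal sketch, assuming `fdbcli` is on the PATH and that `status json` exposes `client.database_status.available`; the helper names (`database_available`, `fetch_status`, `wait_for_recovery`) are hypothetical and not part of any FDB tooling.

```python
import json
import subprocess
import time


def database_available(status: dict) -> bool:
    """Return True if the status document reports the database as available."""
    client = status.get("client", {})
    return bool(client.get("database_status", {}).get("available", False))


def fetch_status(cluster_file: str = "/etc/foundationdb/fdb.cluster") -> dict:
    """Run `fdbcli --exec "status json"` and parse whatever JSON it prints."""
    out = subprocess.run(
        ["fdbcli", "-C", cluster_file, "--exec", "status json"],
        capture_output=True, text=True, timeout=60, check=False,
    ).stdout
    # Skip any banner text that might precede the JSON document (assumption).
    return json.loads(out[out.find("{"):])


def wait_for_recovery(poll_seconds: float = 5.0, fetch=fetch_status) -> float:
    """Poll until the database is available; return the elapsed seconds."""
    start = time.monotonic()
    while True:
        try:
            if database_available(fetch()):
                return time.monotonic() - start
        except (json.JSONDecodeError, subprocess.TimeoutExpired):
            pass  # treat an unreadable status as "still unavailable"
        time.sleep(poll_seconds)
```

Calling `wait_for_recovery()` right after the hang starts would give one recovery-time sample per test run, which makes the 2-minute versus 1-hour cases easier to compare.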

I looked at the trace log events with Severity 30 and 40 recorded during the hangs and found the following.

#Severity=30
<Event Severity="30" Time="1718006673.081693" DateTime="2024-06-10T08:04:33Z" Type="RkSSListFetchTimeout" ID="4026439550c60a85" SuppressedEventCount="19" ThreadID="11668067143732054327" Machine="10.91.22.43:4501" LogGroup="default" Roles="RK,SS" />
<Event Severity="30" Time="1718006673.578616" DateTime="2024-06-10T08:04:33Z" Type="RatekeeperGetSSListLongLatency" ID="4026439550c60a85" Latency="1542.49" ThreadID="11668067143732054327" Machine="10.91.22.43:4501" LogGroup="default" Roles="RK,SS" />

#Severity=40
<Event Severity="40" ErrorKind="DiskIssue" Time="1718005992.754601" DateTime="2024-06-10T07:53:12Z" Type="StorageServerFailed" ID="4e64c59561ab3bb7" Error="io_timeout" ErrorDescription="A disk IO operation failed to complete in a timely manner" ErrorCode="1521" Reason="Error" ThreadID="10388333767681656538" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x4440098 0x443e8cb 0x443ead1 0x27ce337 cee02 0x27cf35e 0x279fa21 0x279fcd3 0xe64b20 0x27b9e94 0x27bac7a 0x13034d0 0x13037b9 0xe64b20 0x27b1665 0x13034d0 0x13037b9 0xe64b20 0x2389bfd 0x43ca6d8 0xdd2086 0x7f467d844555" Machine="10.91.22.41:4510" LogGroup="default" Roles="SS" />
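
For anyone repeating this kind of log triage, the filtering step can be scripted. This is a rough sketch, assuming the XML trace format shown above with one `<Event ... />` element per line; the function names are hypothetical, and lines that fail to parse (for example, truncated ones) are simply skipped.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def parse_events(lines, min_severity=30):
    """Return attribute dicts for events at or above min_severity."""
    events = []
    for line in lines:
        line = line.strip()
        if not line.startswith("<Event"):
            continue
        try:
            attrs = ET.fromstring(line).attrib
        except ET.ParseError:
            continue  # skip truncated or garbled lines
        if int(attrs.get("Severity", "0")) >= min_severity:
            events.append(dict(attrs))
    return events


def count_by_type(events):
    """Count events per (Severity, Type) pair to see which warnings dominate."""
    return Counter((e.get("Severity"), e.get("Type")) for e in events)
```

Feeding it the lines of a trace file, for example `count_by_type(parse_events(open(path)))` with `path` pointing at one trace file, gives a quick summary of which warnings and errors dominate during a hang.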

I tried changing the knobs below, but it had no effect.

knob_storage_hard_limit_bytes
knob_target_bytes_per_storage_server
knob_server_list_delay
knob_storage_server_list_fetch_timeout

I tried profiling with perf, but didn’t get any meaningful results.

My guess is that the storage servers can no longer process requests because of the overwhelming write load on a single storage server team, but it is strange that the entire cluster hangs and takes so long to recover.

I am curious about the fundamental cause of this phenomenon and a solution.

Test Environment

FDB Cluster

  • FoundationDB 7.1.59
  • 12 machines
    • 16-core CPU per machine, one process per core (192 processes in total)
    • 4GB Memory, 1TB Disk * 2

Client

  • YCSB (FDB 7.1.59)
  • Write workload command
YCSB_FDB_OPTION='-p foundationdb.batchsize=100 -p foundationdb.apiversion=710'
bin/ycsb.sh load foundationdb -s -p workload=site.ycsb.workloads.CoreWorkload -p recordcount=50000000 $YCSB_FDB_OPTION -threads 64

I made a mistake in my earlier testing: a KeyValueStoreMemory_OutOfSpace error had occurred before the StorageServerFailed error. The storage servers were still running the memory storage engine, and the cluster stopped because the in-memory storage queue ran out of memory (OOM). After changing storage_migration_type to aggressive and converting all storage servers to the SSD storage engine, the problem of the cluster failing to recover did not reoccur.

However, the cluster still halts under a write-heavy workload when there is only one shard. The causes I identified are as follows:

  • Initially, there is 1 shard, and there are 3 Storage Servers (SS) holding this shard (1 ServerTeam = 3 SS).
  • When the client starts a high load of writes,
    • Data starts accumulating rapidly in the TLog.
    • The 3 SSs begin fetching data from the TLog.
    • However, the rate at which data accumulates in the TLog is faster than the rate at which SSs can fetch it.
  • An SS enters the process_behind state if it satisfies any of the following conditions:
      1. The storage queue is full and the SS cannot write data to disk quickly enough.
      2. On a check performed every BEHIND_CHECK_DELAY seconds, the “read version received from the GRV proxy” is greater than the “most recent version fetched from the TLog” + “BEHIND_CHECK_VERSION” more than BEHIND_CHECK_COUNT times.
      • BEHIND_CHECK_DELAY = 2 seconds
      • BEHIND_CHECK_VERSION = 5 * VERSIONS_PER_SECOND
      • BEHIND_CHECK_COUNT = 2 times
  • In the process_behind state,
    • The SS does not handle read requests and only processes writes.
    • The Ratekeeper (RK) task that periodically collects the list of SSs fails, because it needs to read the server key range stored in the SS.
      • If the STORAGE_SERVER_LIST_FETCH_TIMEOUT (default 20s) is exceeded, the TPSLimit is set to 0, preventing the TLog from receiving transactions.
    • The Data Distributor (DD) attempts to split the shard but fails.
    • In this state, both read/write requests are blocked, and traffic redistribution is not possible.
  • Eventually, one must wait until the SSs in the initial ServerTeam fully process the backlog of data in the TLog.
  • Once the backlog is cleared, normal operation resumes.
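
The version-lag check and the Ratekeeper timeout described above can be illustrated with a toy model. This is a simplified sketch of the description in this post, not the actual fdbserver code: the constants use the values listed above, VERSIONS_PER_SECOND is taken as 1,000,000, the function names are hypothetical, and treating the count as consecutive failing checks is an assumption.

```python
VERSIONS_PER_SECOND = 1_000_000
BEHIND_CHECK_DELAY = 2.0                         # seconds between checks
BEHIND_CHECK_VERSION = 5 * VERSIONS_PER_SECOND   # allowed version lag
BEHIND_CHECK_COUNT = 2                           # failing checks tolerated
STORAGE_SERVER_LIST_FETCH_TIMEOUT = 20.0         # seconds (default)


def first_behind_check(read_versions, fetched_versions):
    """Return the 1-based check index at which the SS counts as behind, or None.

    read_versions[i]    : read version from the GRV proxy at check i
    fetched_versions[i] : most recent version the SS fetched from the TLog
    """
    consecutive = 0
    pairs = zip(read_versions, fetched_versions)
    for i, (read_v, fetched_v) in enumerate(pairs, start=1):
        if read_v > fetched_v + BEHIND_CHECK_VERSION:
            consecutive += 1
        else:
            consecutive = 0  # assumption: a passing check resets the count
        # "more than BEHIND_CHECK_COUNT times", as described in the post
        if consecutive > BEHIND_CHECK_COUNT:
            return i
    return None


def tps_limit_after(seconds_without_ss_list, normal_limit):
    """Model RK dropping the TPS limit to 0 once the SS list fetch times out."""
    if seconds_without_ss_list > STORAGE_SERVER_LIST_FETCH_TIMEOUT:
        return 0.0
    return normal_limit


# Example: versions advance at 1M/s, but the SS only keeps up with 40% of them.
checks = range(1, 11)
read = [int(k * BEHIND_CHECK_DELAY * VERSIONS_PER_SECOND) for k in checks]
fetched = [int(0.4 * v) for v in read]
print(first_behind_check(read, fetched))
```

In this example the lag first exceeds 5 seconds' worth of versions at the fifth check, so under these assumptions the SS is flagged at the seventh check, roughly 14 seconds after the load starts. The hang in the original report appears much faster than that, which would be consistent with the first condition (a full storage queue) triggering before the version-lag check does.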