During performance testing, I installed a cluster and immediately ran a high write workload; the cluster hung in less than 5 seconds.
If instead I first created about 900 partitions with the ‘redistribute’ command, pre-split them, and then ran the workload, the cluster did not hang.
Likewise, if I ran a light workload for a few minutes (as a warmup) and then ran the high write workload, the cluster did not hang.
Once a hangup occurs, the time it takes to recover varies greatly.
Sometimes it took 2 minutes, sometimes 10, 30, or 40 minutes, and sometimes the cluster did not recover for over an hour.
When I checked the status while the cluster was hung, many fields showed ‘unknown’:
fdb> status

Using cluster file `/etc/foundationdb/fdb.cluster'.

Unable to read database configuration.

Configuration:
  Redundancy mode        - unknown
  Storage engine         - unknown
  Coordinators           - unknown
  Usable Regions         - unknown

Cluster:
  FoundationDB processes - 192
  Zones                  - 12
  Machines               - 12
  Memory availability    - 6.2 GB per process on machine with least available
  Retransmissions rate   - 1 Hz
  Server time            - 05/16/24 19:58:03

Data:
  Replication health     - unknown
  Moving data            - unknown
  Sum of key-value sizes - unknown
  Disk space used        - unknown

Operating space:
  Unable to retrieve operating space status

Workload:
  Read rate              - unknown
  Write rate             - unknown
  Transactions started   - unknown
  Transactions committed - unknown
  Conflict rate          - unknown

Backup and DR:
  Running backups        - 0
  Running DRs            - 0
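The same status can also be dumped in machine-readable form for later inspection, e.g. (the output file name is just an example):

fdbcli -C /etc/foundationdb/fdb.cluster --exec 'status json' > status-during-hang.json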
Looking at the trace logs with Severity 30 and 40 during the hangups, I found events like these:
#Severity=30
<Event Severity="30" Time="1718006673.081693" DateTime="2024-06-10T08:04:33Z" Type="RkSSListFetchTimeout" ID="4026439550c60a85" SuppressedEventCount="19" ThreadID="11668067143732054327" 0.91. 22.43:4501" LogGroup="default" Roles="RK,SS" />
<Event Severity="30" Time="1718006673.578616" DateTime="2024-06-10T08:04:33Z" Type="RatekeeperGetSSListLongLatency" ID="4026439550c60a85" Latency="1542.49" ThreadID="11668067143732054327 " Machine="10.91. 22.43:4501" LogGroup="default" Roles="RK,SS" />
#Severity=40
<Event Severity="40" ErrorKind="DiskIssue" Time="1718005992.754601" DateTime="2024-06-10T07:53:12Z" Type="StorageServerFailed" ID="4e64c59561ab3bb7" Error="io_timeout" ErrorDescription="A disk IO operation failed to complete in a timely manner" ErrorCode="1521" Reason="Error" ThreadID="10388333767681656538" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x4440098 0x443e8cb 0x443ead1 0x27ce337 cee02 0x27cf35e 0x279fa21 0x279fcd3 0xe64b20 0x27b9e94 0x27bac7a 0x13034d0 0x13037b9 0xe64b20 0x27b1665 0x13034d0 0x13037b9 0xe64b20 0x2389bfd 0x43ca6d8 0xdd2086 0x7f467d844555" Machine="10.91.22.41:4510" LogGroup="default" Roles="SS" />
I tried changing the knobs below, but they had no effect.
knob_storage_hard_limit_bytes
knob_target_bytes_per_storage_server
knob_server_list_delay
knob_storage_server_list_fetch_timeout
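For reference, knob overrides like these can be set in /etc/foundationdb/foundationdb.conf under the [fdbserver] section and applied by restarting the processes; the values below are illustrative placeholders, not the exact values I tested:

[fdbserver]
# knob_* options in this section are passed as knob overrides to every fdbserver process
knob_storage_hard_limit_bytes = 2000000000
knob_target_bytes_per_storage_server = 1500000000
knob_server_list_delay = 1.0
knob_storage_server_list_fetch_timeout = 60.0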
I tried profiling with perf, but didn’t get any meaningful results.
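The profiling was along these lines: attach perf to one of the fdbserver storage processes during a hangup and inspect the report (the PID is a placeholder):

perf record -F 99 -g -p <fdbserver-pid> -- sleep 30
perf report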
My guess is that the storage servers in one team can no longer keep up with the overwhelming write load, but it is strange that the entire cluster hangs and takes so long to recover.
I am curious about the root cause of this behavior and how it can be resolved.
Test Environment
FDB Cluster
- FoundationDB 7.1.59
- 12 machines
- 16-core CPUs, 16 processes per machine (192 processes in total)
- 4 GB memory, 2 × 1 TB disks
Client
- YCSB (FDB 7.1.59)
- Write workload command
YCSB_FDB_OPTION='-p foundationdb.batchsize=100 -p foundationdb.apiversion=710'
bin/ycsb.sh load foundationdb -s -p workload=site.ycsb.workloads.CoreWorkload -p recordcount=50000000 $YCSB_FDB_OPTION -threads 64
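For scale: with the CoreWorkload defaults of 10 fields × 100 bytes per record, recordcount=50000000 works out to roughly 50 GB of raw key-value data, loaded as fast as the 64 client threads can commit.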