Hello experts,
I’m reaching out to diagnose an issue with the FoundationDB operator environment I’m using for testing. Right now I cannot get rid of write conflicts no matter what configuration I put the cluster in. I’m using fdbcli with the ‘status details’ command to get information from the cluster, and I’m also running fdbexplorer/fdbexporter to gather metrics.
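In case it helps, this is roughly how I pull the conflict numbers out of the machine-readable status (a minimal sketch; it assumes jq is available and that the committed/conflicted transaction rates sit at the usual place in the ‘status json’ output):
# Sketch: pull committed vs. conflicted transaction rates from 'status json'.
# Assumes the rates are exposed under cluster.workload.transactions.
fdbcli -C "$FDB_CLUSTER_FILE" --exec 'status json' \
  | jq '{committed_hz: .cluster.workload.transactions.committed.hz,
         conflicted_hz: .cluster.workload.transactions.conflicted.hz}'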
We’re testing this on an ARO cluster; the relevant worker node config is:
vmSize: Standard_L8s_v3
storageAccountType: Premium_LRS
diskSizeGB: 128
We also run all of the stateless processes on Azure VMs:
vmSize: Standard_F8s_v2
Here is a snapshot of fdbcli output taken in the middle of a test:
Redundancy mode - triple
Storage engine - ssd-redwood-1-experimental
Coordinators - 9
Desired Commit Proxies - 5
Desired GRV Proxies - 5
Desired Resolvers - 1
Desired Logs - 12
Desired Remote Logs - -1
Desired Log Routers - -1
Usable Regions - 2
Regions:
Primary -
Datacenter - dc1
Satellite datacenters - dc2, dc3
Satellite Logs - 3
Remote -
Datacenter - dc3
Satellite datacenters - dc2, dc1
Satellite Logs - 3
Cluster:
FoundationDB processes - 296
Zones - 59
Machines - 59
Memory availability - 6.2 GB per process on machine with least available
Retransmissions rate - 154 Hz
Fault Tolerance - 2 machines
Server time - 01/29/24 22:00:00
Data:
Replication health - (Re)initializing automatic data distribution
Moving data - unknown (initializing)
Sum of key-value sizes - unknown
Disk space used - 4.207 TB
Operating space:
Storage server - 1658.9 GB free on most full server
Log server - 1807.4 GB free on most full server
Workload:
Read rate - 59 Hz
Write rate - 34079 Hz
Transactions started - 441 Hz
Transactions committed - 169 Hz
Conflict rate - 222 Hz # <------ This is the important bit, why do we have conflicts???
Backup and DR:
Running backups - 0
Running DRs - 0
Our YCSB configuration is of course also important. It lives in our Kubernetes StatefulSet and consists of environment variables, bash commands, and the run loop we use to start multiple processes:
YCSB Setup:
update_proportion=0.0
read_proportion=0.0
insert_proportion=1.0
read_modify_write_proportion=0.0
num_keys=300000000
value_size_bytes=2000
batch_size=200
num_clients=20
threads_per_process=16
max_execution_time_seconds=900
keys_per_host=$((num_keys / num_clients))
operation_count=$((keys_per_host / batch_size))
run_phase() {
  ./bin/ycsb.sh run foundationdb -s \
    -P $workloadrun \
    -p foundationdb.clusterfile=$FDB_CLUSTER_FILE \
    -p recordcount=$num_keys \
    -p operationcount=$operation_count \
    -p foundationdb.batchsize=$batch_size \
    -p maxexecutiontime=$max_execution_time_seconds \
    -p fieldcount=$field_count \
    -p fieldlength=$field_length \
    -p readproportion=$read_proportion \
    -p insertproportion=$insert_proportion \
    -p updateproportion=$update_proportion \
    -p readmodifywriteproportion=$read_modify_write_proportion \
    -p requestdistribution=uniform \
    -p threadcount=$threads_per_process
}
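The wrapper that actually launches the processes is one of the pieces I trimmed; purely to show its shape (a simplified sketch with placeholder names, not the exact script), it amounts to:
processes_per_pod=5
# Simplified sketch: launch the YCSB processes in the background and wait for
# all of them to finish. The log paths here are placeholders.
for i in $(seq 1 "$processes_per_pod"); do
  run_phase > "/tmp/ycsb_run_${i}.log" 2>&1 &
done
wait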
I’ve omitted the least important things. The important bits are:
- 5 processes
- 16 threads per process
- 20 pods on 20 separate nodes in kubernetes
- batch_size of 200
- workload is 100% inserts
- 15 minute runtime
- Not all of the env vars are included, but everything the ycsb command uses is defined
As you can see in the fdbcli output, we are getting conflicts when writing to the database. I don’t believe this should happen, because we split up the key space the way YCSB expects, so the various threads shouldn’t try to overwrite each other.
I’ve dug into the YCSB logs at process start-up and the keys appear to be split up correctly: every YCSB thread starts its key range one key after the previous thread’s range ends. Example: YCSB thread 1: keys 1-100, YCSB thread 2: keys 101-200. Sorry for the improvised example; I don’t have a log on hand at the moment.
Given that, if something looks obviously wrong with how I split up the keys, that could be part of the problem; maybe I just missed it in the logs.
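To make the split I have in mind concrete, the arithmetic looks roughly like this (illustrative sketch only; pod_index, proc_index, and thread_index are placeholders for the StatefulSet ordinal, the run-loop counter, and the YCSB thread number, not our exact variables):
# Illustrative sketch of the contiguous, non-overlapping per-thread ranges described above.
processes_per_pod=5
total_threads=$((num_clients * processes_per_pod * threads_per_process))  # 20 * 5 * 16 = 1600
keys_per_thread=$((num_keys / total_threads))                             # 300000000 / 1600 = 187500
global_thread=$(( (pod_index * processes_per_pod + proc_index) * threads_per_process + thread_index ))
range_start=$((global_thread * keys_per_thread))
range_end=$((range_start + keys_per_thread - 1))
echo "thread ${global_thread}: keys ${range_start}-${range_end}"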
Otherwise, is there something we can look into in the operator configs to eliminate the conflicts? We’re seeing slower performance compared to our VM setup, and we think this might be the primary reason, but any operator tuning recommendations would be helpful.
Thanks all!