Write conflicts impossible to eliminate? FDB operator + YCSB testing tool

Hello experts,

I am reaching out today to try to diagnose an issue with my FoundationDB operator environment that I'm using for testing. Right now, I cannot get rid of write conflicts no matter what configuration I put the cluster in. I'm using the fdbcli command with the 'status details' argument to get information from the cluster, as well as running fdbexplorer/fdbexporter to gather metrics.

We're testing this on an ARO cluster; the relevant worker node configs are:

vmSize: Standard_L8s_v3
storageAccountType: Premium_LRS
diskSizeGB: 128

We are also running all stateless processes on Azure VMs:

vmSize: Standard_F8s_v2

And here is a snapshot of the fdbcli output taken in the middle of a test:

  Redundancy mode        - triple
  Storage engine         - ssd-redwood-1-experimental
  Coordinators           - 9
  Desired Commit Proxies - 5
  Desired GRV Proxies    - 5
  Desired Resolvers      - 1
  Desired Logs           - 12
  Desired Remote Logs    - -1
  Desired Log Routers    - -1
  Usable Regions         - 2
    Primary -
        Datacenter                    - dc1
        Satellite datacenters         - dc2, dc3
        Satellite Logs                - 3
    Remote -
        Datacenter                    - dc3
        Satellite datacenters         - dc2, dc1
        Satellite Logs                - 3
  FoundationDB processes - 296
  Zones                  - 59
  Machines               - 59
  Memory availability    - 6.2 GB per process on machine with least available
  Retransmissions rate   - 154 Hz
  Fault Tolerance        - 2 machines
  Server time            - 01/29/24 22:00:00
  Replication health     - (Re)initializing automatic data distribution
  Moving data            - unknown (initializing)
  Sum of key-value sizes - unknown
  Disk space used        - 4.207 TB
Operating space:
  Storage server         - 1658.9 GB free on most full server
  Log server             - 1807.4 GB free on most full server
  Read rate              - 59 Hz
  Write rate             - 34079 Hz
  Transactions started   - 441 Hz
  Transactions committed - 169 Hz
  Conflict rate          - 222 Hz # <------ This is the important bit, why do we have conflicts???
Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Of course our YCSB configs are also important. These come from our StatefulSet in Kubernetes: environment variables, bash commands, and the run loop we use to launch multiple processes:

YCSB Setup:
          keys_per_process=$((num_keys / num_clients))
          operation_count=$((keys_per_process / batch_size))
          run_phase() {
            ./bin/ycsb.sh run foundationdb -s \
              -P $workloadrun \
              -p foundationdb.clusterfile=$FDB_CLUSTER_FILE \
              -p recordcount=$num_keys \
              -p operationcount=$operation_count \
              -p foundationdb.batchsize=$batch_size \
              -p maxexecutiontime=$max_execution_time_seconds \
              -p fieldcount=$field_count \
              -p fieldlength=$field_length \
              -p readproportion=$read_proportion \
              -p insertproportion=$insert_proportion \
              -p updateproportion=$update_proportion \
              -p readmodifywriteproportion=$read_modify_write_proportion \
              -p requestdistribution=uniform \
              -p threadcount=$threads_per_process
          }

I've omitted the less important parts. The key details are:

- 5 processes
- 16 threads per process
- 20 pods on 20 separate nodes in kubernetes
- batch_size of 200
- workload is 100% inserts
- 15 minute runtime
- Not all of the env vars are included, but everything the ycsb command uses is defined
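In case it helps, here is roughly how we derive each client's slice of the key space and feed it to YCSB's `insertstart`/`insertcount` properties. The numbers are illustrative, and `pod_index`/`proc_index` are hypothetical stand-ins for however you obtain the StatefulSet ordinal and the run-loop counter:

```shell
# Sketch: give each YCSB client a disjoint slice of the key space.
# pod_index / proc_index are hypothetical; in our setup they'd come from
# the StatefulSet ordinal and the run-loop iteration counter.
num_keys=100000000
num_pods=20
procs_per_pod=5
num_clients=$((num_pods * procs_per_pod))        # 100 clients total
keys_per_process=$((num_keys / num_clients))
pod_index=3                                      # example ordinal
proc_index=2                                     # example loop counter
client_id=$((pod_index * procs_per_pod + proc_index))
insert_start=$((client_id * keys_per_process))
echo "client $client_id: insertstart=$insert_start insertcount=$keys_per_process"
```

We then pass `-p insertstart=$insert_start -p insertcount=$keys_per_process` to each invocation. (I'm assuming the core workload applies this partitioning to our insert-only run phase the same way it does for the load phase; corrections welcome.)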

As you can see in the fdbcli output, we are getting conflicts when writing to the database. I don't believe this should happen, because we are splitting up the key space the way YCSB expects, so the various threads shouldn't try to overwrite each other.

I've dug into the YCSB logs at process startup and I can see that the keys are being split up correctly: every YCSB thread starts its key range one key after the previous thread's range ends. For example, thread 1 gets keys 1-100 and thread 2 gets keys 101-200. Sorry for the impromptu example; I don't have a log handy at the moment.

Given that, if something looks obviously wrong with how I split up the keys, that could be part of the problem; maybe I just missed it in the logs.
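For reference, this is the kind of contiguity check I ran against the logged ranges (the values below are toy stand-ins; the real start/end pairs would be scraped from the YCSB startup logs):

```shell
# Toy contiguity check: each entry is "start end" for one thread, in order.
# Real values would come from the YCSB startup logs.
ok=1
prev_end=-1
for range in "0 99" "100 199" "200 299"; do
  set -- $range            # split "start end" into $1 and $2
  start=$1
  end=$2
  if [ "$start" -ne "$((prev_end + 1))" ]; then
    echo "gap or overlap before key $start"
    ok=0
  fi
  prev_end=$end
done
[ "$ok" -eq 1 ] && echo "ranges are contiguous"
```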

Otherwise, is there something we can look into in the operator configs to eliminate the conflicts? We're seeing some speed problems compared to our VM setup, and we suspect this might be the primary reason, but any operator tuning recommendations would be helpful.

Thanks all!

Do you do queries with KeySelector/lastLessOrEqual or KeySelector/firstGreaterOrEqual? Those may read past your intended key range, especially while populating the database.