Single read and write latency in a three-data-center cluster exceeds 160 ms

I have a cluster in three_datacenter mode, and each datacenter has 6 nodes.

The three DCs of the cluster are ningbo1, ningbo2, and zhengzhou.

The network latency between ningbo1 and ningbo2 is 0.1 ms, and the latency between them and zhengzhou is 20 ms.

The primary DC is ningbo2.

When running a performance test (using go-ycsb) from zhengzhou, the single-write latency gradually rises to 450 ms:

./go-ycsb load foundationdb -P workloada --threads 1

The single-read latency starts at around 160 ms and gradually drops to 100 ms.
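To see which phase dominates, it can help to time the GRV, read, and commit phases of one transaction separately. Below is a minimal, hypothetical timing harness; the three callables are stand-ins (with the FDB Python bindings they would wrap something like `tr.get_read_version().wait()`, a single key read, and `tr.commit().wait()` — here they are stubbed with sleeps so the sketch runs on its own):

```python
import time

def time_phases(phases):
    """Run each named callable once and return its wall-clock time in ms."""
    results = {}
    for name, fn in phases.items():
        start = time.perf_counter()
        fn()
        results[name] = (time.perf_counter() - start) * 1000.0
    return results

# Against a real cluster the callables would be, for example:
#   "grv":    lambda: tr.get_read_version().wait()
#   "read":   lambda: tr[b"usertable/some_key"]   # blocks on the value future
#   "commit": lambda: tr.commit().wait()
# The sleeps below are placeholders that mimic plausible cross-DC numbers.
timings = time_phases({
    "grv":    lambda: time.sleep(0.020),  # ~20 ms round trip to the primary DC
    "read":   lambda: time.sleep(0.001),  # read served by a local replica
    "commit": lambda: time.sleep(0.040),  # commit path through the primary DC
})
for name, ms in timings.items():
    print(f"{name}: {ms:.1f} ms")
```

If GRV alone accounts for most of the read latency, the client is paying the zhengzhou-to-ningbo2 round trip on every transaction start even when the read itself is served locally.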

Here’s the output of fdbcli:

fdb> status

Using cluster file `/etc/foundationdb/fdb.cluster'.

Configuration:
  Redundancy mode        - three_datacenter
  Storage engine         - ssd-2
  Coordinators           - 7
  Desired Commit Proxies - 3
  Desired GRV Proxies    - 1
  Desired Resolvers      - 1
  Desired Logs           - 18
  Usable Regions         - 1

Cluster:
  FoundationDB processes - 483
  Zones                  - 483
  Machines               - 18
  Memory availability    - 8.0 GB per process on machine with least available
  Retransmissions rate   - 46 Hz
  Fault Tolerance        - 3 zones
  Server time            - 01/21/23 15:11:23

Data:
  Replication health     - Healthy
  Moving data            - 0.000 GB
  Sum of key-value sizes - 124.486 GB
  Disk space used        - 1.291 TB

Storage wiggle:
  Wiggle server addresses- 10.195.154.7:5513
  Wiggle server count    - 1

Operating space:
  Storage server         - 1513.1 GB free on most full server
  Log server             - 1689.9 GB free on most full server

Workload:
  Read rate              - 548 Hz
  Write rate             - 0 Hz
  Transactions started   - 339 Hz
  Transactions committed - 0 Hz
  Conflict rate          - 0 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Client time: 01/21/23 15:10:29

fdb> status details

Using cluster file `/etc/foundationdb/fdb.cluster'.

Configuration:
  Redundancy mode        - three_datacenter
  Storage engine         - ssd-2
  Coordinators           - 7
  Desired Commit Proxies - 3
  Desired GRV Proxies    - 1
  Desired Resolvers      - 1
  Desired Logs           - 18
  Usable Regions         - 1

Cluster:
  FoundationDB processes - 483
  Zones                  - 483
  Machines               - 18
  Memory availability    - 8.0 GB per process on machine with least available
  Retransmissions rate   - 47 Hz
  Fault Tolerance        - 3 zones
  Server time            - 01/21/23 15:11:28

Data:
  Replication health     - Healthy
  Moving data            - 0.000 GB
  Sum of key-value sizes - 124.486 GB
  Disk space used        - 1.291 TB

Storage wiggle:
  Wiggle server addresses- 10.195.154.7:5513
  Wiggle server count    - 1

Operating space:
  Storage server         - 1513.1 GB free on most full server
  Log server             - 1689.9 GB free on most full server

Workload:
  Read rate              - 471 Hz
  Write rate             - 0 Hz
  Transactions started   - 615 Hz
  Transactions committed - 0 Hz
  Conflict rate          - 0 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Process performance details:
  10.181.159.70:5500     (  2% cpu;  9% machine; 0.014 Gbps;  0% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.181.159.70:5501     (  2% cpu;  9% machine; 0.014 Gbps;  0% disk IO; 3.3 GB / 8.0 GB RAM  )
  10.181.159.70:5502     (  2% cpu;  9% machine; 0.014 Gbps;  0% disk IO; 3.3 GB / 8.0 GB RAM  )
...

Here’s status json: https://gist.githubusercontent.com/Rjerk/459aef7b339fd62106f5c3d95789d4c9/raw/ed1fd65dc82f5e02f4c5ce5522df1ed363ed1a5a/cluster.json

Here’s the proxy info:

+--------------------+------------+--------------+---------------+-------------+
|         ip         | datacenter |     role     | run_loop_busy | latency_p99 |
+--------------------+------------+--------------+---------------+-------------+
| 10.195.154.11:7500 |  ningbo2   | commit_proxy |      2.5%     |   22.62 ms  |
| 10.195.154.7:7500  |  ningbo2   | commit_proxy |      1.7%     |   29.37 ms  |
| 10.195.154.9:7500  |  ningbo2   |  grv_proxy   |     23.1%     |   0.56 ms   |
| 10.195.154.9:7501  |  ningbo2   | commit_proxy |      0.9%     |   25.78 ms  |
+--------------------+------------+--------------+---------------+-------------+

Each machine runs one tlog on a dedicated NVMe; here are the 18 tlogs:

+--------------------+------------+---------+-------------+---------------+-----------+------------+-----------+-----------+---------------+
|         ip         | datacenter |   disk  | input_bytes | durable_bytes | used_size | total_size | disk_busy | core_used | run_loop_busy |
+--------------------+------------+---------+-------------+---------------+-----------+------------+-----------+-----------+---------------+
| 10.195.154.9:6500  |  ningbo2   | nvme3n1 |    0 B/s    |    99.0 B/s   |   6.3 GB  |   1.9 TB   |    0.0%   |    0.1    |      4.7%     |
| 10.195.154.8:6500  |  ningbo2   | nvme3n1 |    0 B/s    |     0 B/s     |   8.9 GB  |   1.9 TB   |    0.0%   |    0.1    |      4.4%     |
| 10.195.154.7:6500  |  ningbo2   | nvme3n1 |    0 B/s    |     0 B/s     |   8.9 GB  |   1.9 TB   |    0.0%   |    0.1    |      8.4%     |
| 10.195.154.12:6500 |  ningbo2   | nvme3n1 |    0 B/s    |     0 B/s     |   9.4 GB  |   1.9 TB   |    0.0%   |    0.1    |      4.3%     |
| 10.195.154.11:6500 |  ningbo2   | nvme3n1 |    0 B/s    |    99.0 B/s   |   8.7 GB  |   1.9 TB   |    0.0%   |    0.1    |      4.1%     |
| 10.195.154.10:6500 |  ningbo2   | nvme3n1 |    0 B/s    |     0 B/s     |   8.1 GB  |   1.9 TB   |    0.0%   |    0.1    |      4.5%     |
| 10.195.152.8:6500  |  ningbo1   | nvme3n1 |    0 B/s    |     0 B/s     |   6.3 GB  |   1.9 TB   |    0.0%   |    0.1    |      4.4%     |
| 10.195.152.7:6500  |  ningbo1   | nvme3n1 |    0 B/s    |    99.0 B/s   |   6.9 GB  |   1.9 TB   |    0.0%   |    0.1    |      4.2%     |
| 10.195.152.12:6500 |  ningbo1   | nvme3n1 |    0 B/s    |    99.0 B/s   |   7.8 GB  |   1.9 TB   |    0.0%   |    0.1    |      4.0%     |
| 10.195.152.11:6500 |  ningbo1   | nvme3n1 |    0 B/s    |     0 B/s     |   8.6 GB  |   1.9 TB   |    0.0%   |    0.1    |      4.3%     |
| 10.195.152.10:6500 |  ningbo1   | nvme3n1 |    0 B/s    |    99.0 B/s   |   6.5 GB  |   1.9 TB   |    0.0%   |    0.1    |      4.2%     |
| 10.181.159.75:6500 | zhengzhou  | nvme3n1 |    0 B/s    |     0 B/s     |   6.5 GB  |   1.9 TB   |    0.1%   |    0.1    |      9.8%     |
| 10.181.159.74:6500 | zhengzhou  | nvme3n1 |    0 B/s    |     0 B/s     |   9.6 GB  |   1.9 TB   |    0.0%   |    0.1    |      4.3%     |
| 10.181.159.73:6500 | zhengzhou  | nvme3n1 |    0 B/s    |     0 B/s     |   7.6 GB  |   1.9 TB   |    0.0%   |    0.1    |      9.7%     |
| 10.181.159.72:6500 | zhengzhou  | nvme3n1 |    0 B/s    |     0 B/s     |   7.0 GB  |   1.9 TB   |    0.1%   |    0.1    |      9.8%     |
| 10.181.159.71:6500 | zhengzhou  | nvme3n1 |    0 B/s    |    99.0 B/s   |   7.6 GB  |   1.9 TB   |    0.0%   |    0.1    |      9.9%     |
| 10.181.159.70:6500 | zhengzhou  | nvme3n1 |    0 B/s    |     0 B/s     |   6.7 GB  |   1.9 TB   |    0.0%   |    0.1    |      9.5%     |
+--------------------+------------+---------+-------------+---------------+-----------+------------+-----------+-----------+---------------+

Each machine has 6 NVMe drives for storage, and each NVMe hosts 4 storage servers.

The performance of three_datacenter mode is not as good as we expected.
Is there any optimization we can apply so that read/write latency stays within 80 ms?
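For reference, here is a rough lower bound on write latency from zhengzhou computed from the RTTs alone. This is an assumption-laden estimate, not an authoritative model of the commit path: it assumes one client-to-GRV-proxy round trip, one client-to-commit-proxy round trip (all proxies are in ningbo2 per the table above), and one proxy-to-remote-tlog round trip for synchronous cross-DC replication back to zhengzhou:

```python
# Round-trip times in ms, taken from the numbers above.
RTT_ZZ_NB = 20.0   # zhengzhou <-> ningbo1/ningbo2
RTT_NB_NB = 0.1    # ningbo1 <-> ningbo2

def write_latency_floor():
    grv = RTT_ZZ_NB          # client in zhengzhou -> GRV proxy in ningbo2
    commit_rpc = RTT_ZZ_NB   # client -> commit proxy in ningbo2
    tlog_fanout = RTT_ZZ_NB  # commit proxy waits for zhengzhou tlogs to be durable
    return grv + commit_rpc + tlog_fanout

print(write_latency_floor())  # 60.0 ms, before any batching or queueing delays
```

Even under this optimistic model the floor is around 60 ms, which makes the 80 ms target tight but not obviously impossible; the observed 450 ms suggests queueing or batching somewhere rather than raw network distance.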

@jzhou