Hey FoundationDB-ers,
I’m trying to debug my performance cluster a bit. I’m running on AWS: 3 x c5n.2xlarge machines, each with a 4 TB EBS io1 volume with 10k provisioned IOPS.
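Each machine runs five fdbserver processes under fdbmonitor, all with their data directories on the same /storage mount (the single io1 volume). For reference, the relevant part of my foundationdb.conf looks roughly like this (reconstructed from memory, so treat the exact keys as approximate):

[general]
cluster_file = /storage/var/conf/fdb.cluster

[fdbserver]
command = /usr/sbin/fdbserver
datadir = /storage/var/data/$ID
logdir = /storage/var/log
tls_certificate_file = /storage/var/secrets/fdb_cert.pem
tls_key_file = /storage/var/secrets/fdb_privkey.pem

[fdbserver.4500]
[fdbserver.4501]
[fdbserver.4502]
[fdbserver.4503]
[fdbserver.4504]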
When I run status details in fdbcli, I get the following printout:
fdb> status details
Using cluster file `./var/conf/fdb.cluster'.

Configuration:
  Redundancy mode        - double
  Storage engine         - ssd-2
  Coordinators           - 3
  Desired Proxies        - 2
  Desired Logs           - 2

Cluster:
  FoundationDB processes - 15
  Zones                  - 3
  Machines               - 3
  Memory availability    - 4.2 GB per process on machine with least available
  Fault Tolerance        - 1 machine
  Server time            - 10/30/19 03:39:51

Data:
  Replication health     - Healthy (Repartitioning.)
  Moving data            - 13.739 GB
  Sum of key-value sizes - 2.227 TB
  Disk space used        - 5.368 TB

Operating space:
  Storage server         - 2143.7 GB free on most full server
  Log server             - 4176.9 GB free on most full server

Workload:
  Read rate              - 7061 Hz
  Write rate             - 2135 Hz
  Transactions started   - 331 Hz
  Transactions committed - 84 Hz
  Conflict rate          - 0 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Process performance details:
  10.67.246.22:4500:tls  (  3% cpu;  7% machine; 0.071 Gbps; 46% disk IO; 4.0 GB / 4.2 GB RAM  )
  10.67.246.22:4501:tls  (  5% cpu;  7% machine; 0.071 Gbps;  0% disk IO; 0.5 GB / 4.2 GB RAM  )
  10.67.246.22:4502:tls  (  3% cpu;  7% machine; 0.071 Gbps; 46% disk IO; 4.0 GB / 4.2 GB RAM  )
  10.67.246.22:4503:tls  (  6% cpu;  7% machine; 0.071 Gbps; 46% disk IO; 4.0 GB / 4.2 GB RAM  )
  10.67.246.22:4504:tls  (  9% cpu;  7% machine; 0.071 Gbps; 46% disk IO; 4.2 GB / 4.2 GB RAM  )
  10.67.246.24:4500:tls  ( 10% cpu; 11% machine; 0.043 Gbps; 75% disk IO; 4.0 GB / 4.2 GB RAM  )
  10.67.246.24:4501:tls  (  5% cpu; 11% machine; 0.043 Gbps;  0% disk IO; 0.5 GB / 4.2 GB RAM  )
  10.67.246.24:4502:tls  ( 23% cpu; 11% machine; 0.043 Gbps; 74% disk IO; 4.1 GB / 4.2 GB RAM  )
  10.67.246.24:4503:tls  (  4% cpu; 11% machine; 0.043 Gbps; 75% disk IO; 4.0 GB / 4.2 GB RAM  )
  10.67.246.24:4504:tls  (  4% cpu; 11% machine; 0.043 Gbps; 75% disk IO; 4.1 GB / 4.2 GB RAM  )
  10.67.246.32:4500:tls  (  5% cpu; 13% machine; 0.070 Gbps; 75% disk IO; 3.8 GB / 4.2 GB RAM  )
  10.67.246.32:4501:tls  ( 19% cpu; 13% machine; 0.070 Gbps; 76% disk IO; 3.8 GB / 4.2 GB RAM  )
  10.67.246.32:4502:tls  ( 24% cpu; 13% machine; 0.070 Gbps; 76% disk IO; 3.7 GB / 4.2 GB RAM  )
  10.67.246.32:4503:tls  (  8% cpu; 13% machine; 0.070 Gbps; 76% disk IO; 3.8 GB / 4.2 GB RAM  )
  10.67.246.32:4504:tls  ( 15% cpu; 13% machine; 0.070 Gbps; 75% disk IO; 4.0 GB / 4.2 GB RAM  )

Coordination servers:
  10.67.246.22:4500:tls  (reachable)
  10.67.246.24:4500:tls  (reachable)
  10.67.246.32:4500:tls  (reachable)

Client time: 10/30/19 03:40:58
When I go to the 10.67.246.32 machine and run iotop, I get the following printout:
Total DISK READ : 15.40 M/s | Total DISK WRITE : 39.11 M/s
Actual DISK READ: 15.41 M/s | Actual DISK WRITE: 39.22 M/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
5271 be/4 root 1746.51 K/s 14.03 M/s 0.00 % 1.33 % fdbserver --cluster_file /storage/var/conf/fdb.cluster --datadir /storage/var/data/4501 ~ile /storage/var/secrets/fdb_cert.pem --tls_key_file /storage/var/secrets/fdb_privkey.pem
5273 be/4 root 5.46 M/s 2.31 M/s 0.00 % 0.51 % fdbserver --cluster_file /storage/var/conf/fdb.cluster --datadir /storage/var/data/4503 ~ile /storage/var/secrets/fdb_cert.pem --tls_key_file /storage/var/secrets/fdb_privkey.pem
5272 be/4 root 3.65 M/s 19.78 M/s 0.00 % 0.39 % fdbserver --cluster_file /storage/var/conf/fdb.cluster --datadir /storage/var/data/4502 ~ile /storage/var/secrets/fdb_cert.pem --tls_key_file /storage/var/secrets/fdb_privkey.pem
5274 be/4 root 2.96 M/s 1439.69 K/s 0.00 % 0.27 % fdbserver --cluster_file /storage/var/conf/fdb.cluster --datadir /storage/var/data/4504 ~ile /storage/var/secrets/fdb_cert.pem --tls_key_file /storage/var/secrets/fdb_privkey.pem
5270 be/4 root 1659.97 K/s 1600.97 K/s 0.00 % 0.22 % fdbserver --cluster_file /storage/var/conf/fdb.cluster --datadir /storage/var/data/4500 ~ile /storage/var/secrets/fdb_cert.pem --tls_key_file /storage/var/secrets/fdb_privkey.pem
5301 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.02 % fdbserver --cluster_file /storage/var/conf/fdb.cluster --datadir /storage/var/data/4501 ~/secrets/fdb_cert.pem --tls_key_file /storage/var/secrets/fdb_privkey.pem [fdbserver/eio]
5294 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.01 % fdbserver --cluster_file /storage/var/conf/fdb.cluster --datadir /storage/var/data/4504 ~/secrets/fdb_cert.pem --tls_key_file /storage/var/secrets/fdb_privkey.pem [fdbserver/eio]
5303 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % fdbserver --cluster_file /storage/var/conf/fdb.cluster --datadir /storage/var/data/4503 ~/secrets/fdb_cert.pem --tls_key_file /storage/var/secrets/fdb_privkey.pem [fdbserver/eio]
5281 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % fdbserver --cluster_file /storage/var/conf/fdb.cluster --datadir /storage/var/data/4500 ~/secrets/fdb_cert.pem --tls_key_file /storage/var/secrets/fdb_privkey.pem [fdbserver/eio]
5282 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % fdbserver --cluster_file /storage/var/conf/fdb.cluster --datadir /storage/var/data/4500 ~/secrets/fdb_cert.pem --tls_key_file /storage/var/secrets/fdb_privkey.pem [fdbserver/eio]
5297 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % fdbserver --cluster_file /storage/var/conf/fdb.cluster --datadir /storage/var/data/4500 ~/secrets/fdb_cert.pem --tls_key_file /storage/var/secrets/fdb_privkey.pem [fdbserver/eio]
5288 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % fdbserver --cluster_file /storage/var/conf/fdb.cluster --datadir /storage/var/data/4503 ~/secrets/fdb_cert.pem --tls_key_file /storage/var/secrets/fdb_privkey.pem [fdbserver/eio]
5302 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % fdbserver --cluster_file /storage/var/conf/fdb.cluster --datadir /storage/var/data/4503 ~/secrets/fdb_cert.pem --tls_key_file /storage/var/secrets/fdb_privkey.pem [fdbserver/eio]
5305 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % fdbserver --cluster_file /storage/var/conf/fdb.cluster --datadir /storage/var/data/4504 ~/secrets/fdb_cert.pem --tls_key_file /storage/var/secrets/fdb_privkey.pem [fdbserver/eio]
5299 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % fdbserver --cluster_file /storage/var/conf/fdb.cluster --datadir /storage/var/data/4502 ~/secrets/fdb_cert.pem --tls_key_file /storage/var/secrets/fdb_privkey.pem [fdbserver/eio]
5304 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % fdbserver --cluster_file /storage/var/conf/fdb.cluster --datadir /storage/var/data/4504 ~/secrets/fdb_cert.pem --tls_key_file /storage/var/secrets/fdb_privkey.pem [fdbserver/eio]
5292 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % fdbserver --cluster_file /storage/var/conf/fdb.cluster --datadir /storage/var/data/4501 ~/secrets/fdb_cert.pem --tls_key_file /storage/var/secrets/fdb_privkey.pem [fdbserver/eio]
5293 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % fdbserver --cluster_file /storage/var/conf/fdb.cluster --datadir /storage/var/data/4502 ~/secrets/fdb_cert.pem --tls_key_file /storage/var/secrets/fdb_privkey.pem [fdbserver/eio]
5300 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % fdbserver --cluster_file /storage/var/conf/fdb.cluster --datadir /storage/var/data/4501 ~/secrets/fdb_cert.pem --tls_key_file /storage/var/secrets/fdb_privkey.pem [fdbserver/eio]
And running iostat -dx 1, I see the following:
Linux 4.14.123-111.109.amzn2.x86_64 (lsfdb3) 10/30/2019 _x86_64_ (8 CPU)

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
nvme2n1           0.00     0.00    0.02    0.00     0.43     0.00    48.74     0.00   0.70    0.70    0.00   0.09   0.00
nvme1n1           0.77  6198.20 4767.44 2304.22 19073.65 33940.40    14.99     4.26   0.71    0.66    0.82   0.11  75.62
nvme0n1           0.00     0.47    2.91    1.49    86.65    66.44    69.58     0.00   0.77    0.65    0.99   0.14   0.06

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
nvme2n1           0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
nvme1n1           0.00  9625.00 4770.00 3180.00 19080.00 51144.00    17.67     3.17   0.50    0.41    0.62   0.10  82.80
nvme0n1           0.00     0.00    0.00    2.00     0.00   104.00   104.00     0.00   0.00    0.00    0.00   0.00   0.00

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
nvme2n1           0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
nvme1n1           0.00 12703.00 4051.00 3490.00 16204.00 64684.00    21.45     3.40   0.56    0.40    0.73   0.11  80.80
nvme0n1           0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00   0.00    0.00    0.00   0.00
My question: for a cluster of this size and under this load, I would have expected significantly less IO load. Is that expectation correct? fdbcli consistently shows the replication health as “Healthy (Repartitioning.)” - does this indicate that my cluster is suffering from a hot spot? Are there any debugging steps I can take to identify this?
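In case it's useful, here is roughly how I've been watching the data movement: polling status json and pulling out the moving-data counters with jq (a quick sketch - I believe these are the right JSON paths, but I may be off, and I've left out the TLS flags fdbcli needs for this cluster):

# Poll 'status json' every 10s and print the data-movement counters
# (TLS options for fdbcli omitted for brevity)
while true; do
    fdbcli -C /storage/var/conf/fdb.cluster --exec 'status json' | jq '{
        state: .cluster.data.state.name,
        moving_gb: (.cluster.data.moving_data.in_flight_bytes / 1e9),
        queued_gb: (.cluster.data.moving_data.in_queue_bytes / 1e9)
    }'
    sleep 10
done

If those in-flight bytes never drain, I'm guessing that points at data distribution continuously splitting and moving shards rather than at my workload itself - but I'd appreciate confirmation on whether that's the right way to read it.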
Thanks,
Jared.