Debugging abnormally high IO load

Hey FoundationDB-ers,

I’m trying to debug the performance of my cluster a bit. I’m running on AWS with 3 x c5n.2xlarge machines, each with a 4 TB EBS io1 volume provisioned with 10k IOPS.

When I run fdbcli status details I get the following printout:

fdb> status

Using cluster file `./var/conf/fdb.cluster'.

Configuration:
  Redundancy mode        - double
  Storage engine         - ssd-2
  Coordinators           - 3
  Desired Proxies        - 2
  Desired Logs           - 2

Cluster:
  FoundationDB processes - 15
  Zones                  - 3
  Machines               - 3
  Memory availability    - 4.2 GB per process on machine with least available
  Fault Tolerance        - 1 machine
  Server time            - 10/30/19 03:39:51

Data:
  Replication health     - Healthy (Repartitioning.)
  Moving data            - 13.739 GB
  Sum of key-value sizes - 2.227 TB
  Disk space used        - 5.368 TB

Operating space:
  Storage server         - 2143.7 GB free on most full server
  Log server             - 4176.9 GB free on most full server

Workload:
  Read rate              - 7061 Hz
  Write rate             - 2135 Hz
  Transactions started   - 331 Hz
  Transactions committed - 84 Hz
  Conflict rate          - 0 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Process performance details:
  10.67.246.22:4500:tls  (  3% cpu;  7% machine; 0.071 Gbps; 46% disk IO; 4.0 GB / 4.2 GB RAM  )
  10.67.246.22:4501:tls  (  5% cpu;  7% machine; 0.071 Gbps;  0% disk IO; 0.5 GB / 4.2 GB RAM  )
  10.67.246.22:4502:tls  (  3% cpu;  7% machine; 0.071 Gbps; 46% disk IO; 4.0 GB / 4.2 GB RAM  )
  10.67.246.22:4503:tls  (  6% cpu;  7% machine; 0.071 Gbps; 46% disk IO; 4.0 GB / 4.2 GB RAM  )
  10.67.246.22:4504:tls  (  9% cpu;  7% machine; 0.071 Gbps; 46% disk IO; 4.2 GB / 4.2 GB RAM  )
  10.67.246.24:4500:tls  ( 10% cpu; 11% machine; 0.043 Gbps; 75% disk IO; 4.0 GB / 4.2 GB RAM  )
  10.67.246.24:4501:tls  (  5% cpu; 11% machine; 0.043 Gbps;  0% disk IO; 0.5 GB / 4.2 GB RAM  )
  10.67.246.24:4502:tls  ( 23% cpu; 11% machine; 0.043 Gbps; 74% disk IO; 4.1 GB / 4.2 GB RAM  )
  10.67.246.24:4503:tls  (  4% cpu; 11% machine; 0.043 Gbps; 75% disk IO; 4.0 GB / 4.2 GB RAM  )
  10.67.246.24:4504:tls  (  4% cpu; 11% machine; 0.043 Gbps; 75% disk IO; 4.1 GB / 4.2 GB RAM  )
  10.67.246.32:4500:tls  (  5% cpu; 13% machine; 0.070 Gbps; 75% disk IO; 3.8 GB / 4.2 GB RAM  )
  10.67.246.32:4501:tls  ( 19% cpu; 13% machine; 0.070 Gbps; 76% disk IO; 3.8 GB / 4.2 GB RAM  )
  10.67.246.32:4502:tls  ( 24% cpu; 13% machine; 0.070 Gbps; 76% disk IO; 3.7 GB / 4.2 GB RAM  )
  10.67.246.32:4503:tls  (  8% cpu; 13% machine; 0.070 Gbps; 76% disk IO; 3.8 GB / 4.2 GB RAM  )
  10.67.246.32:4504:tls  ( 15% cpu; 13% machine; 0.070 Gbps; 75% disk IO; 4.0 GB / 4.2 GB RAM  )

Coordination servers:
  10.67.246.22:4500:tls  (reachable)
  10.67.246.24:4500:tls  (reachable)
  10.67.246.32:4500:tls  (reachable)

Client time: 10/30/19 03:40:58

When I go to the 10.67.246.32 machine and run iotop I get the following printout:

foundationdb
lsfdb3:/data
Total DISK READ :      15.40 M/s | Total DISK WRITE :      39.11 M/s
Actual DISK READ:      15.41 M/s | Actual DISK WRITE:      39.22 M/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
 5271 be/4 root     1746.51 K/s   14.03 M/s  0.00 %  1.33 % fdbserver --cluster_file /storage/var/conf/fdb.cluster --datadir /storage/var/data/4501 ~ile /storage/var/secrets/fdb_cert.pem --tls_key_file /storage/var/secrets/fdb_privkey.pem
 5273 be/4 root        5.46 M/s    2.31 M/s  0.00 %  0.51 % fdbserver --cluster_file /storage/var/conf/fdb.cluster --datadir /storage/var/data/4503 ~ile /storage/var/secrets/fdb_cert.pem --tls_key_file /storage/var/secrets/fdb_privkey.pem
 5272 be/4 root        3.65 M/s   19.78 M/s  0.00 %  0.39 % fdbserver --cluster_file /storage/var/conf/fdb.cluster --datadir /storage/var/data/4502 ~ile /storage/var/secrets/fdb_cert.pem --tls_key_file /storage/var/secrets/fdb_privkey.pem
 5274 be/4 root        2.96 M/s 1439.69 K/s  0.00 %  0.27 % fdbserver --cluster_file /storage/var/conf/fdb.cluster --datadir /storage/var/data/4504 ~ile /storage/var/secrets/fdb_cert.pem --tls_key_file /storage/var/secrets/fdb_privkey.pem
 5270 be/4 root     1659.97 K/s 1600.97 K/s  0.00 %  0.22 % fdbserver --cluster_file /storage/var/conf/fdb.cluster --datadir /storage/var/data/4500 ~ile /storage/var/secrets/fdb_cert.pem --tls_key_file /storage/var/secrets/fdb_privkey.pem
 5301 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.02 % fdbserver --cluster_file /storage/var/conf/fdb.cluster --datadir /storage/var/data/4501 ~/secrets/fdb_cert.pem --tls_key_file /storage/var/secrets/fdb_privkey.pem [fdbserver/eio]
 5294 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.01 % fdbserver --cluster_file /storage/var/conf/fdb.cluster --datadir /storage/var/data/4504 ~/secrets/fdb_cert.pem --tls_key_file /storage/var/secrets/fdb_privkey.pem [fdbserver/eio]
 5303 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % fdbserver --cluster_file /storage/var/conf/fdb.cluster --datadir /storage/var/data/4503 ~/secrets/fdb_cert.pem --tls_key_file /storage/var/secrets/fdb_privkey.pem [fdbserver/eio]
 5281 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % fdbserver --cluster_file /storage/var/conf/fdb.cluster --datadir /storage/var/data/4500 ~/secrets/fdb_cert.pem --tls_key_file /storage/var/secrets/fdb_privkey.pem [fdbserver/eio]
 5282 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % fdbserver --cluster_file /storage/var/conf/fdb.cluster --datadir /storage/var/data/4500 ~/secrets/fdb_cert.pem --tls_key_file /storage/var/secrets/fdb_privkey.pem [fdbserver/eio]
 5297 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % fdbserver --cluster_file /storage/var/conf/fdb.cluster --datadir /storage/var/data/4500 ~/secrets/fdb_cert.pem --tls_key_file /storage/var/secrets/fdb_privkey.pem [fdbserver/eio]
 5288 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % fdbserver --cluster_file /storage/var/conf/fdb.cluster --datadir /storage/var/data/4503 ~/secrets/fdb_cert.pem --tls_key_file /storage/var/secrets/fdb_privkey.pem [fdbserver/eio]
 5302 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % fdbserver --cluster_file /storage/var/conf/fdb.cluster --datadir /storage/var/data/4503 ~/secrets/fdb_cert.pem --tls_key_file /storage/var/secrets/fdb_privkey.pem [fdbserver/eio]
 5305 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % fdbserver --cluster_file /storage/var/conf/fdb.cluster --datadir /storage/var/data/4504 ~/secrets/fdb_cert.pem --tls_key_file /storage/var/secrets/fdb_privkey.pem [fdbserver/eio]
 5299 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % fdbserver --cluster_file /storage/var/conf/fdb.cluster --datadir /storage/var/data/4502 ~/secrets/fdb_cert.pem --tls_key_file /storage/var/secrets/fdb_privkey.pem [fdbserver/eio]
 5304 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % fdbserver --cluster_file /storage/var/conf/fdb.cluster --datadir /storage/var/data/4504 ~/secrets/fdb_cert.pem --tls_key_file /storage/var/secrets/fdb_privkey.pem [fdbserver/eio]
 5292 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % fdbserver --cluster_file /storage/var/conf/fdb.cluster --datadir /storage/var/data/4501 ~/secrets/fdb_cert.pem --tls_key_file /storage/var/secrets/fdb_privkey.pem [fdbserver/eio]
 5293 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % fdbserver --cluster_file /storage/var/conf/fdb.cluster --datadir /storage/var/data/4502 ~/secrets/fdb_cert.pem --tls_key_file /storage/var/secrets/fdb_privkey.pem [fdbserver/eio]
 5300 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % fdbserver --cluster_file /storage/var/conf/fdb.cluster --datadir /storage/var/data/4501 ~/secrets/fdb_cert.pem --tls_key_file /storage/var/secrets/fdb_privkey.pem [fdbserver/eio]

And running iostat -dx 1 I see the following:

Linux 4.14.123-111.109.amzn2.x86_64 (lsfdb3) 	10/30/2019 	_x86_64_	(8 CPU)

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme2n1           0.00     0.00    0.02    0.00     0.43     0.00    48.74     0.00    0.70    0.70    0.00   0.09   0.00
nvme1n1           0.77  6198.20 4767.44 2304.22 19073.65 33940.40    14.99     4.26    0.71    0.66    0.82   0.11  75.62
nvme0n1           0.00     0.47    2.91    1.49    86.65    66.44    69.58     0.00    0.77    0.65    0.99   0.14   0.06

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme2n1           0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
nvme1n1           0.00  9625.00 4770.00 3180.00 19080.00 51144.00    17.67     3.17    0.50    0.41    0.62   0.10  82.80
nvme0n1           0.00     0.00    0.00    2.00     0.00   104.00   104.00     0.00    0.00    0.00    0.00   0.00   0.00

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme2n1           0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
nvme1n1           0.00 12703.00 4051.00 3490.00 16204.00 64684.00    21.45     3.40    0.56    0.40    0.73   0.11  80.80
nvme0n1           0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

My question: for a cluster of this size and under this load, I would have expected significantly less IO load. Is that expectation correct? fdbcli is consistently showing the replication health as “Healthy (Repartitioning.)” - does this indicate that my cluster is suffering from a hot spot in the keyspace? Are there any debugging steps I can take to identify this?
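
For reference, here’s roughly how I’ve been pulling numbers out of the machine-readable status while watching this. The jq field paths are from memory, so treat them as an approximation and double-check them against the status json your version actually emits:

  # how much data the data distributor is currently moving
  fdbcli --exec 'status json' | jq '.cluster.data.moving_data'

  # per-process disk busyness and roles, to see which processes are doing the IO
  fdbcli --exec 'status json' | \
    jq -r '.cluster.processes[] | "\(.address)  disk.busy=\(.disk.busy)  roles=\([.roles[].role] | join(","))"'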

Thanks,
Jared.

When I remove all external load from the cluster and let it sit for 5 minutes, it still shows a large amount of IO load:

fdb> status details

Using cluster file `./var/conf/fdb.cluster'.

Configuration:
  Redundancy mode        - double
  Storage engine         - ssd-2
  Coordinators           - 3
  Desired Proxies        - 2
  Desired Logs           - 2

Cluster:
  FoundationDB processes - 15
  Zones                  - 3
  Machines               - 3
  Memory availability    - 4.1 GB per process on machine with least available
  Fault Tolerance        - 1 machine
  Server time            - 10/30/19 06:21:36

Data:
  Replication health     - Healthy (Repartitioning.)
  Moving data            - 2.695 GB
  Sum of key-value sizes - 2.228 TB
  Disk space used        - 5.405 TB

Operating space:
  Storage server         - 2103.7 GB free on most full server
  Log server             - 4176.9 GB free on most full server

Workload:
  Read rate              - 156 Hz
  Write rate             - 0 Hz
  Transactions started   - 5 Hz
  Transactions committed - 0 Hz
  Conflict rate          - 0 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Process performance details:
  10.67.246.22:4500:tls  (  6% cpu;  3% machine; 0.032 Gbps; 41% disk IO; 4.0 GB / 4.1 GB RAM  )
  10.67.246.22:4501:tls  (  1% cpu;  3% machine; 0.032 Gbps;  0% disk IO; 0.5 GB / 4.1 GB RAM  )
  10.67.246.22:4502:tls  (  6% cpu;  3% machine; 0.032 Gbps; 35% disk IO; 4.1 GB / 4.1 GB RAM  )
  10.67.246.22:4503:tls  (  8% cpu;  3% machine; 0.032 Gbps; 35% disk IO; 4.1 GB / 4.1 GB RAM  )
  10.67.246.22:4504:tls  (  5% cpu;  3% machine; 0.032 Gbps; 42% disk IO; 4.0 GB / 4.1 GB RAM  )
  10.67.246.24:4500:tls  (  8% cpu;  3% machine; 0.028 Gbps; 42% disk IO; 4.1 GB / 4.1 GB RAM  )
  10.67.246.24:4501:tls  (  1% cpu;  3% machine; 0.028 Gbps;  0% disk IO; 0.4 GB / 4.1 GB RAM  )
  10.67.246.24:4502:tls  (  6% cpu;  3% machine; 0.028 Gbps; 42% disk IO; 4.2 GB / 4.1 GB RAM  )
  10.67.246.24:4503:tls  (  6% cpu;  3% machine; 0.028 Gbps; 42% disk IO; 3.9 GB / 4.1 GB RAM  )
  10.67.246.24:4504:tls  (  3% cpu;  3% machine; 0.028 Gbps; 42% disk IO; 4.1 GB / 4.1 GB RAM  )
  10.67.246.32:4500:tls  (  1% cpu;  1% machine; 0.018 Gbps; 11% disk IO; 4.1 GB / 4.2 GB RAM  )
  10.67.246.32:4501:tls  (  0% cpu;  1% machine; 0.018 Gbps; 11% disk IO; 4.2 GB / 4.2 GB RAM  )
  10.67.246.32:4502:tls  (  3% cpu;  1% machine; 0.018 Gbps; 11% disk IO; 4.2 GB / 4.2 GB RAM  )
  10.67.246.32:4503:tls  (  1% cpu;  1% machine; 0.018 Gbps; 11% disk IO; 3.9 GB / 4.2 GB RAM  )
  10.67.246.32:4504:tls  (  4% cpu;  1% machine; 0.018 Gbps;  9% disk IO; 4.2 GB / 4.2 GB RAM  )

Coordination servers:
  10.67.246.22:4500:tls  (reachable)
  10.67.246.24:4500:tls  (reachable)
  10.67.246.32:4500:tls  (reachable)

Client time: 10/30/19 06:21:36

fdb>

One thought: we did apply a workload that performed a fair number of range deletes, so perhaps the load is background vacuuming? I tailed the trace logs a bit:

<Event Severity="10" Time="1572418409.724511" Type="SpringCleaningMetrics" ID="536f4a4d0da13ee4" SpringCleaningCount="59416" LazyDeletePages="6298860" VacuumedPages="59521" SpringCleaningTime="2506.94" LazyDeleteTime="2469.56" VacuumTime="37.3717" Machine="10.67.246.22:4503" LogGroup="default" Roles="SS" />
<Event Severity="10" Time="1572418414.724521" Type="SpringCleaningMetrics" ID="536f4a4d0da13ee4" SpringCleaningCount="59451" LazyDeletePages="6302360" VacuumedPages="59521" SpringCleaningTime="2508.4" LazyDeleteTime="2471.02" VacuumTime="37.3717" Machine="10.67.246.22:4503" LogGroup="default" Roles="SS" />
<Event Severity="10" Time="1572418419.724531" Type="SpringCleaningMetrics" ID="536f4a4d0da13ee4" SpringCleaningCount="59482" LazyDeletePages="6305460" VacuumedPages="59521" SpringCleaningTime="2510.18" LazyDeleteTime="2472.81" VacuumTime="37.3717" Machine="10.67.246.22:4503" LogGroup="default" Roles="SS" />
<Event Severity="10" Time="1572418424.724544" Type="SpringCleaningMetrics" ID="536f4a4d0da13ee4" SpringCleaningCount="59514" LazyDeletePages="6308660" VacuumedPages="59521" SpringCleaningTime="2511.79" LazyDeleteTime="2474.42" VacuumTime="37.3717" Machine="10.67.246.22:4503" LogGroup="default" Roles="SS" />
<Event Severity="10" Time="1572418429.724554" Type="SpringCleaningMetrics" ID="536f4a4d0da13ee4" SpringCleaningCount="59539" LazyDeletePages="6311160" VacuumedPages="59521" SpringCleaningTime="2513.77" LazyDeleteTime="2476.39" VacuumTime="37.3717" Machine="10.67.246.22:4503" LogGroup="default" Roles="SS" />

Each of these events is emitted every 5 seconds… it doesn’t look like that much load?

[edit]: actually, now that I think about it, we have 4 FDB processes sharing a single 10k IOPS EBS volume, so if each of them is spending ~1.5 s of every 5 s period on lazy deletion / vacuuming, that could amount to a lot of IOPS.
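
Running the numbers from the events above: LazyDeletePages climbs by roughly 2,500–3,500 pages per 5-second sample, i.e. ~500–700 pages/s on that one storage process. If each lazily deleted page costs at least one page read and one page write (my assumption about how the lazy deletion works), then four processes doing this against the same volume could plausibly burn a few thousand of the 10k provisioned IOPS on cleanup alone, which is in the right ballpark for the ~75% disk IO figures above. Here’s the quick-and-dirty way I turned the trace events into a rate (the trace log directory below is a placeholder; point it at wherever your logs actually live):

  # pages lazily deleted per second by one storage process, assuming one event every 5 s
  grep -h 'SpringCleaningMetrics' /path/to/fdb/logs/trace.*.xml | \
    grep 'Machine="10.67.246.22:4503"' | \
    sed 's/.*LazyDeletePages="\([0-9]*\)".*/\1/' | \
    awk 'NR > 1 { printf "%.0f pages/s\n", ($1 - prev) / 5 } { prev = $1 }'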

I left the cluster sitting unloaded for ~30 min, and then re-applied the load. Everything looks healthy now:

fdb> status details

Using cluster file `./var/conf/fdb.cluster'.

Configuration:
  Redundancy mode        - double
  Storage engine         - ssd-2
  Coordinators           - 3
  Desired Proxies        - 2
  Desired Logs           - 2

Cluster:
  FoundationDB processes - 15
  Zones                  - 3
  Machines               - 3
  Memory availability    - 4.1 GB per process on machine with least available
  Retransmissions rate   - 1 Hz
  Fault Tolerance        - 1 machine
  Server time            - 10/30/19 07:57:41

Data:
  Replication health     - Healthy
  Moving data            - 0.000 GB
  Sum of key-value sizes - 2.229 TB
  Disk space used        - 5.403 TB

Operating space:
  Storage server         - 2104.1 GB free on most full server
  Log server             - 4176.9 GB free on most full server

Workload:
  Read rate              - 5330 Hz
  Write rate             - 1135 Hz
  Transactions started   - 313 Hz
  Transactions committed - 92 Hz
  Conflict rate          - 0 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Process performance details:
  10.67.246.22:4500:tls  (  2% cpu;  5% machine; 0.010 Gbps;  3% disk IO; 4.0 GB / 4.1 GB RAM  )
  10.67.246.22:4501:tls  (  4% cpu;  5% machine; 0.010 Gbps;  0% disk IO; 0.6 GB / 4.1 GB RAM  )
  10.67.246.22:4502:tls  (  3% cpu;  5% machine; 0.010 Gbps;  2% disk IO; 4.1 GB / 4.1 GB RAM  )
  10.67.246.22:4503:tls  (  2% cpu;  5% machine; 0.010 Gbps;  2% disk IO; 4.1 GB / 4.1 GB RAM  )
  10.67.246.22:4504:tls  (  2% cpu;  5% machine; 0.010 Gbps;  3% disk IO; 4.0 GB / 4.1 GB RAM  )
  10.67.246.24:4500:tls  (  2% cpu;  6% machine; 0.014 Gbps;  2% disk IO; 3.9 GB / 4.2 GB RAM  )
  10.67.246.24:4501:tls  (  5% cpu;  6% machine; 0.014 Gbps;  0% disk IO; 0.4 GB / 4.2 GB RAM  )
  10.67.246.24:4502:tls  (  8% cpu;  6% machine; 0.014 Gbps;  2% disk IO; 4.2 GB / 4.2 GB RAM  )
  10.67.246.24:4503:tls  (  2% cpu;  6% machine; 0.014 Gbps;  2% disk IO; 3.9 GB / 4.2 GB RAM  )
  10.67.246.24:4504:tls  (  2% cpu;  6% machine; 0.014 Gbps;  2% disk IO; 3.8 GB / 4.2 GB RAM  )
  10.67.246.32:4500:tls  (  2% cpu;  8% machine; 0.015 Gbps;  5% disk IO; 4.1 GB / 4.2 GB RAM  )
  10.67.246.32:4501:tls  ( 11% cpu;  8% machine; 0.015 Gbps;  4% disk IO; 4.2 GB / 4.2 GB RAM  )
  10.67.246.32:4502:tls  (  6% cpu;  8% machine; 0.015 Gbps;  5% disk IO; 4.2 GB / 4.2 GB RAM  )
  10.67.246.32:4503:tls  (  2% cpu;  8% machine; 0.015 Gbps;  5% disk IO; 3.9 GB / 4.2 GB RAM  )
  10.67.246.32:4504:tls  ( 11% cpu;  8% machine; 0.015 Gbps;  4% disk IO; 4.3 GB / 4.2 GB RAM  )

Coordination servers:
  10.67.246.22:4500:tls  (reachable)
  10.67.246.24:4500:tls  (reachable)
  10.67.246.32:4500:tls  (reachable)

Client time: 10/30/19 07:57:41

I tailed the trace logs and I see it’s not spending much time vacuuming anymore:

<Event Severity="10" Time="1572422279.500701" Type="SpringCleaningMetrics" ID="68b25cd9eb96c2ee" SpringCleaningCount="12959" LazyDeletePages="309" VacuumedPages="1025" SpringCleaningTime="0.000506401" LazyDeleteTime="0.000506401" VacuumTime="0" Machine="10.67.246.22:4501" LogGroup="default" Roles="TL" />
<Event Severity="10" Time="1572422284.500712" Type="SpringCleaningMetrics" ID="68b25cd9eb96c2ee" SpringCleaningCount="12964" LazyDeletePages="309" VacuumedPages="1025" SpringCleaningTime="0.000506401" LazyDeleteTime="0.000506401" VacuumTime="0" Machine="10.67.246.22:4501" LogGroup="default" Roles="TL" />
<Event Severity="10" Time="1572422289.500721" Type="SpringCleaningMetrics" ID="68b25cd9eb96c2ee" SpringCleaningCount="12969" LazyDeletePages="309" VacuumedPages="1025" SpringCleaningTime="0.000506401" LazyDeleteTime="0.000506401" VacuumTime="0" Machine="10.67.246.22:4501" LogGroup="default" Roles="TL" />
<Event Severity="10" Time="1572422294.500729" Type="SpringCleaningMetrics" ID="68b25cd9eb96c2ee" SpringCleaningCount="12974" LazyDeletePages="309" VacuumedPages="1025" SpringCleaningTime="0.000506401" LazyDeleteTime="0.000506401" VacuumTime="0" Machine="10.67.246.22:4501" LogGroup="default" Roles="TL" />
<Event Severity="10" Time="1572422299.500741" Type="SpringCleaningMetrics" ID="68b25cd9eb96c2ee" SpringCleaningCount="12979" LazyDeletePages="309" VacuumedPages="1025" SpringCleaningTime="0.000506401" LazyDeleteTime="0.000506401" VacuumTime="0" Machine="10.67.246.22:4501" LogGroup="default" Roles="TL" />

So my guess is that the large delete workload I had applied caused a big backlog of vacuuming work to build up and run in the background. I guess this problem is solved! Any hints on other ways to debug / track this kind of cluster load would be great. One thing to note is that I started debugging this because one of the machines would repeatedly become unavailable and require a hard reset via the AWS console.
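
One other thing I’m planning to do is snapshot the machine-readable status periodically, so I have some history to look at the next time a machine goes sideways. A minimal sketch (the interval and file naming are arbitrary):

  # crude sampler: dump status json once a minute for later inspection
  while true; do
    fdbcli --exec 'status json' > "fdb-status-$(date +%s).json"
    sleep 60
  done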

What version of FDB are you running?

But it’s starting to sound like this is a problem that only manifests on EBS. The “Relax consistency guarantees” thread has ended up in a discussion of a similar sort of thing.

I was running 6.1.8 for a bit when I had this problem. Halfway through I upgraded to 6.2.7 to see if it would help, and I don’t think it did. Eventually the cluster settled down and I kept it on 6.2.7.