Unknown FDB downtime


Our database was unresponsive for 6 hours. Our application log shows a burst of transaction_timeout errors. The downtime lines up with the HA Moving Data graph and appears to have been caused by the Role Changes.

Our setup:

  Redundancy mode        - double
  Storage engine         - ssd-1
  Coordinators           - 3

  FoundationDB processes - 6
  Zones                  - 3
  Machines               - 3

The two processes on each machine do not have roles (process classes) specified.
The FDB version is 6.2 (v6.2.20).

Unfortunately, we do not have trace logs from that time. Can you provide insight into what might have caused this downtime? Are we looking at the right metrics?

Our status details also show high %disk IO, but our other metrics indicate that there is plenty of IO headroom. How is this metric calculated?

Process performance details:

  ( 51% cpu; 36% machine; 0.114 Gbps; 98% disk IO; 4.3 GB / 13.2 GB RAM )
  ( 29% cpu; 36% machine; 0.114 Gbps; 98% disk IO; 5.9 GB / 13.2 GB RAM )
  ( 26% cpu; 43% machine; 0.222 Gbps; 99% disk IO; 6.4 GB / 13.2 GB RAM )
  ( 73% cpu; 43% machine; 0.222 Gbps; 98% disk IO; 4.9 GB / 13.2 GB RAM )
  ( 86% cpu; 48% machine; 0.226 Gbps; 98% disk IO; 5.3 GB / 13.3 GB RAM )
  ( 37% cpu; 48% machine; 0.226 Gbps; 99% disk IO; 4.5 GB / 13.3 GB RAM )

We also noticed that during the downtime, reads continued at reduced capacity, but no commits were happening.

We read this from /proc/diskstats. How good an indication this value is depends a lot, we've found, on the type of disk you're using. What kind of disk are you using? Is this a local SSD or something like EBS?

This is on AWS, on EBS volumes. We read the commit and read values from fdbcli itself:
fdbcli.cluster.workload.transactions.committed.hz was 0.
fdbcli.cluster.workload.operations.reads.hz was at 10% of its normal rate.
We also noticed that fdbcli.cluster.degraded.processes increased by 1 at the start of the downtime.
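For reference, those metric names are just flattened paths into the `status json` document that fdbcli emits (in the raw JSON the degraded-process count is `cluster.degraded_processes`, assuming 6.2's status layout). A minimal sketch of reading them directly; the sample document below is hand-written for illustration:

```python
def get_path(doc, dotted_path, default=None):
    """Walk a nested dict using a dotted path like the flattened metric names."""
    node = doc
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node

# Hand-written sample mimicking the relevant slice of `status json`; in
# practice you would load it with json.loads() from the output of
# `fdbcli --exec "status json"`.
sample = {
    "cluster": {
        "workload": {
            "transactions": {"committed": {"hz": 0.0}},
            "operations": {"reads": {"hz": 120.0}},
        },
        "degraded_processes": 1,
    }
}

if __name__ == "__main__":
    for path in ("cluster.workload.transactions.committed.hz",
                 "cluster.workload.operations.reads.hz",
                 "cluster.degraded_processes"):
        print(path, "=", get_path(sample, path))
```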

Regarding the disk IO metrics: we read them from the Telegraf diskio plugin, which I believe reads from /proc/diskstats on Linux systems.

That is weird; we found that on EBS the disk metrics are quite reliable – but they are not exactly the same as IOPS (as reported by AWS). You can find the code for how we collect these here

Thanks. If the disk IO % is disk_time_spent_in_io / elapsed_time, then the metric does match up. I had thought it was IOPS used / IOPS provisioned for the EBS volume. High disk utilization is somewhat expected, then.
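For what it's worth, that arithmetic can be sketched directly from two /proc/diskstats samples: the 13th column of each device line is milliseconds spent doing I/Os (per the kernel's iostats documentation), so utilization is its delta divided by the wall-clock delta. A rough sketch, where the device name and the two sample lines are hypothetical:

```python
def parse_busy_ms(diskstats_line, device):
    """Extract 'milliseconds spent doing I/Os' (column 13 of a
    /proc/diskstats line) for the named device, or None if it doesn't match."""
    fields = diskstats_line.split()
    if len(fields) < 13 or fields[2] != device:
        return None
    return int(fields[12])

def io_utilization_pct(busy_ms_start, busy_ms_end, elapsed_seconds):
    """disk_time_spent_in_io / elapsed_time, expressed as a percentage."""
    return 100.0 * (busy_ms_end - busy_ms_start) / (elapsed_seconds * 1000.0)

# Two hypothetical samples for a device named nvme1n1, taken 10 s apart:
t0 = "259 0 nvme1n1 1000 0 8000 500 2000 0 16000 900 0 4000 1400"
t1 = "259 0 nvme1n1 1100 0 8800 560 2200 0 17600 990 0 13800 1550"

util = io_utilization_pct(parse_busy_ms(t0, "nvme1n1"),
                          parse_busy_ms(t1, "nvme1n1"), 10.0)
print(f"disk busy: {util:.0f}%")  # 9800 ms busy over 10000 ms elapsed -> 98%
```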

It seems that during the downtime, disk utilization was also at its minimum. Going back to the main question: do you have any clue about this issue?

I don’t think I can help here as I am not really familiar with FDB’s HA solution :frowning:

To add more context here: before the database went into recovery mode (meaning it served read requests in a degraded mode and absolutely no write requests could be committed), we saw a sudden jump (about 4 GB) in the op space (we are using the ssd-2 storage engine here). The jump aligns with the free disk metrics the OS collected, but the total KV size did not change accordingly. Is there any possible explanation for the sudden op space jump?
The green line —> op space (4 GB jump within 1 minute)
The blue line —> KV size (continuous)

After reading the doc (foundationdb/recovery-internals.md at master · apple/foundationdb · GitHub),
I feel like this is exactly what happened on our FDB cluster during the Saturday incident.
We saw a sudden jump in op space —> then a bunch of role changes —> FDB seemed stuck in recovery mode (no commits at all), with no network or hardware failures during the recovery window.
My questions would be:
(1) Is there a possible explanation for why the FDB cluster was stuck in recovery mode for so long (about 6 hours) before it simply recovered on its own?
(2) Is there anything we could do to make the recovery process faster, if we can find out which phase the recovery is stuck at?
(3) Are there any specific events we should search for in the trace logs (other than the MasterRecoveryState event)?

Can you post the MasterRecoveryState traces from that time period? I think one of two things could have happened, and MasterRecoveryState will tell you which one it is:

  1. FDB is stuck somewhere in recovery. This usually would happen if it can’t recruit enough tlogs to satisfy the replication policy (either because not enough resources are available or because there was some network partition).
  2. FDB recovers but repeatedly fails, going back into another recovery. This could be caused by a flaky network.

Another possible cause would be a bad disk on a tlog. That would explain why you can read but not commit. Maybe after 6 hours the tlog finally got marked as degraded?
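To tell the two cases apart, one way is to pull the MasterRecoveryState events out of the trace files and look at how the status sequence evolves: a single recovery that never advances past one status suggests case 1, while many short generations that keep restarting from the early statuses suggest case 2. A rough sketch, assuming XML trace files where each event is one line; the sample events below are hand-written, modeled on 6.2-style output:

```python
import xml.etree.ElementTree as ET

# Hand-written sample events; real ones live in trace.*.xml in each
# process's log directory.
SAMPLE = [
    '<Event Severity="10" Time="100.0" Type="MasterRecoveryState" Status="reading_coordinated_state"/>',
    '<Event Severity="10" Time="101.0" Type="MasterRecoveryState" Status="locking_coordinated_state"/>',
    '<Event Severity="10" Time="102.0" Type="MasterRecoveryState" Status="recruiting_transaction_servers"/>',
]

def recovery_timeline(trace_lines):
    """Collect (time, status) pairs from MasterRecoveryState trace events."""
    timeline = []
    for line in trace_lines:
        line = line.strip()
        if line.startswith("<Event"):
            event = ET.fromstring(line)
            if event.get("Type") == "MasterRecoveryState":
                timeline.append((float(event.get("Time")), event.get("Status")))
    return timeline

if __name__ == "__main__":
    for t, status in recovery_timeline(SAMPLE):
        print(f"{t:>8.1f}  {status}")
    # A timeline stuck at recruiting_transaction_servers points at case 1
    # (can't recruit enough tlogs); a timeline that repeatedly restarts from
    # reading_coordinated_state points at case 2 (recover-then-fail loop).
```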