Unknown FDB downtime

kentwang929 · June 14, 2021, 8:59pm

Screen Shot 2021-06-14 at 1.49.33 PM

Our database was unresponsive for 6 hrs. From our application log, we see a bunch of transaction_timeout. The downtime corresponds to the HA Moving Data graph and seems to be caused by the Role Changes.

Our setup:

Configuration:
  Redundancy mode        - double
  Storage engine         - ssd-1
  Coordinators           - 3

Cluster:
  FoundationDB processes - 6
  Zones                  - 3
  Machines               - 3

The two processes on each machine do not have roles specified.
FDB version is 6.2 (v6.2.20).

Unfortunately, we do not have trace logs at the time. Can you provide insight as to what might be the cause of this downtime? Are we looking at the right metrics?

kentwang929 · June 14, 2021, 11:02pm

Our status details also show high %disk IO, but our other metrics indicate that there is plenty of IO headroom. How is this metric calculated?

Process performance details:
10.25.0.30:4800 ( 51% cpu; 36% machine; 0.114 Gbps; 98% disk IO; 4.3 GB / 13.2 GB RAM )
10.25.0.30:4801 ( 29% cpu; 36% machine; 0.114 Gbps; 98% disk IO; 5.9 GB / 13.2 GB RAM )
10.25.0.95:4800 ( 26% cpu; 43% machine; 0.222 Gbps; 99% disk IO; 6.4 GB / 13.2 GB RAM )
10.25.0.95:4801 ( 73% cpu; 43% machine; 0.222 Gbps; 98% disk IO; 4.9 GB / 13.2 GB RAM )
10.25.0.155:4800 ( 86% cpu; 48% machine; 0.226 Gbps; 98% disk IO; 5.3 GB / 13.3 GB RAM )
10.25.0.155:4801 ( 37% cpu; 48% machine; 0.226 Gbps; 99% disk IO; 4.5 GB / 13.3 GB RAM )

kentwang929 · June 15, 2021, 5:18pm

We also noticed that during the down time, read was at a reduced capacity, but no commit was happening.

markus.pilman · June 15, 2021, 5:23pm

We read this from /proc/diskstats. How good of an indication this value is, we found, depends a lot on the type of disk you’re using. What kind of disk are you using? Is this a local SSD or something like EBS?

kentwang929 · June 15, 2021, 5:30pm

This is on AWS, on EBS volume. We read the commit and read value from the fdbcli itself.
fdbcli.cluster.workload.transactions.committed.hz was 0
fdbcli.cluster.workload.operations.reads.hz was 10% of its original rate.
We also noticed that fdbcli.cluster.degraded.processes was incremented by 1 at the start of the downtime.

kentwang929 · June 15, 2021, 5:35pm

Regarding to the diskio metrics, we read it from telegraf diskio plugin, which I believe read from /proc/diskstats on linux system.

markus.pilman · June 15, 2021, 5:45pm

that is weird, we found on EBS that the disk metrics are quite reliable – but they are not exactly the same as IOPS (as reported by AWS). You can find the code how we collect these here

kentwang929 · June 15, 2021, 7:24pm

thanks. given that if diskio % is disk_time_spent_in_io / elapsed_time, then the metric does match up. I thought it was IOPS used / IOPS available via EBS provision. High disk util is somewhat expected.

Seems like during the downtime, disk util is also minimum. Going back to the main question, do you have a clue on this issue?

markus.pilman · June 15, 2021, 8:17pm

I don’t think I can help here as I am not really familiar with FDB’s HA solution

mengranewo · June 16, 2021, 1:03am

and to add more context here, before the database went into the recovery mode (meaning serving read requests in a degraded mode and absolute no write requests can be committed), we saw a sudden jump (about 4 GB) in the op space (we are using ssd-2 storage engine here). The jump aligns with the free disk metrics the OS collected, but the total KV size did not change accordingly. Is there any possible explanation for the sudden op space jump?
the green line —> op space (4GB jump within 1 minute)
the blue line —> kv size (continuous)

mengranewo · June 17, 2021, 4:39pm

after reading the doc: foundationdb/recovery-internals.md at master · apple/foundationdb · GitHub
i feel like this is exactly what happened on our FDB cluster during the Saturday incident.
We saw a sudden jump on op space —> then bunch of role changes —> FDB seemed stuck in the recovery mode (no commit at all) (also no network or hardware failures during the recovery time)
My question would be:
(1) Is there possible explanation for why the FDB cluster stuck for such a long time (about 6 hours) in recovery mode and then it simply self recovered?
(2) Is there anything we could do to help make the recovery process faster? (If we were able to find which phase the recovery is stuck at)
(3) any specific events we can search for in the trace logs? (other than the MasterRecoveryState event)

markus.pilman · June 17, 2021, 4:52pm

Can you post MasterRecoveryState traces from that time period? I think two things could’ve happened and MasterRecoveryState will tell you which one it is:

FDB is stuck somewhere in recovery. This usually would happen if it can’t recruit enough tlogs to satisfy the replication policy (either because not enough resources are available or because there was some network partition).
FDB recovers but repeatedly fails – going back into another recovery. This could be cause by flaky network.

Another possible cause for this would be a bad disk on a tlog. That would explain why you can read but not commit. Maybe after 6 hours the tlog got finally marked as degraded?

Topic		Replies	Views
Cluster stuck in recovery after crash of one node Using FoundationDB	1	554	March 18, 2022
Any way to gracefully stop fdbserver process temporarily without affecting traffic? Using FoundationDB	8	1545	December 9, 2019
Database unavailable after shutting down a foundationdb node Using FoundationDB	17	8554	February 5, 2021
Fdb cluster is unavailable after delete a disk Using FoundationDB	3	1147	July 9, 2020
FDB cluster freeze Using FoundationDB	12	439	March 22, 2023

Unknown FDB downtime

Related topics