We had an FDB cluster incident today. It is resolved, but we would like to understand why it happened.
We have a 3-DC FDB cluster, with 23 stateless pods and 110 storage pods in each of the first and third DCs. After a DC-level switch operation, we got a couple of errors. The first was that the database was unavailable. After we restarted all 23 stateless pods in DC1, the cluster became available again.
However, we still had the second error: “Unable to start batch priority transaction after 5 seconds.”
We thought restarting the transaction logs might speed up batch-priority transactions, so we restarted one transaction pod. The cluster then became unavailable again. We identified one stateless pod (stateless-04) that had been recreated yesterday and had many errors in its trace logs. We tried to exclude it, but got the following error.
fdb> exclude 10.xxx.yyy.45:4300
WARNING: Long delay (Ctrl-C to interrupt)
ERROR: This exclude may cause the total free space in the cluster to drop below 10%.
Type `exclude FORCE <ADDRESS>*' to exclude without checking free space.
The error looked scary, so we did not use FORCE. Instead, we removed the pod from the cluster (by scaling its K8s deployment down to 0), and the cluster returned to normal.
But the second error persisted after we brought the database back up. The cluster was slow for writes and for fetching status. We scanned all FDB trace logs and found one storage pod that was logging thousands of “N2_ConnectError” events in rapid succession. That pod had also been recreated yesterday.
We restarted the pod, but it didn’t help. As with the stateless pod, we removed it from the cluster; the batch-priority transaction error then disappeared, and the cluster returned to normal.
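For anyone who wants to repeat the trace log scan described above, here is a minimal sketch of how such a scan could be done. It is not the exact tooling we used. It assumes the default XML trace format, with one `<Event ... />` element per line carrying `Type` and `Machine` attributes; the function names are made up for illustration.

```python
import re
import sys
from collections import Counter

# Assumption: default XML trace format, one <Event .../> element per line,
# with Type="..." and Machine="ip:port" attributes.
TYPE_RE = re.compile(r'\bType="([^"]*)"')
MACHINE_RE = re.compile(r'\bMachine="([^"]*)"')


def count_events(lines, event_type="N2_ConnectError"):
    """Return a Counter mapping Machine address -> number of matching events."""
    counts = Counter()
    for line in lines:
        type_match = TYPE_RE.search(line)
        if type_match is None or type_match.group(1) != event_type:
            continue
        machine_match = MACHINE_RE.search(line)
        # Events without a Machine attribute are grouped under "unknown".
        machine = machine_match.group(1) if machine_match else "unknown"
        counts[machine] += 1
    return counts


def count_events_in_files(paths, event_type="N2_ConnectError"):
    """Aggregate matching event counts across several trace files."""
    total = Counter()
    for path in paths:
        with open(path, errors="replace") as handle:
            total.update(count_events(handle, event_type))
    return total


if __name__ == "__main__":
    # Usage: python scan_traces.py <trace files...>
    for machine, n in count_events_in_files(sys.argv[1:]).most_common(10):
        print(f"{machine}\t{n}")
```

Run against the trace files collected from every pod, the process with an outsized count stands out at the top of the output.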
In the post “Batch priority transactions”, Alex wrote: “Batch priority transactions can be indefinitely starved while one storage server is failed”. It looks like we hit exactly that today.
A few questions:
- Why would one stateless pod out of 23 make the cluster unavailable?
- Why would one storage pod out of 110 indefinitely starve batch-priority transactions?
- Have any changes related to these issues been made after v6.2.27 (the version we use)?
Thank you in advance for helping us understand these issues so we can prevent them in the future.