One faulty stateless pod made the cluster unavailable, and one storage server made the cluster slow

lehu · November 5, 2022, 4:10am

We had an fdb cluster incident today. It's resolved, but we'd like to understand why it happened.

We have a 3-DC fdb cluster, with 23 stateless pods and 110 storage pods in each of the first and third DCs. After a DC-level switching operation, we got a couple of errors. The first error was that the database was unavailable. After we restarted all 23 stateless pods in DC1, the cluster became available.

However, we still had the second error: “Unable to start batch priority transaction after 5 seconds.”

We thought restarting the transaction logs would speed up batch-priority transactions, so we restarted one transaction pod. The cluster then became unavailable again. We identified one stateless pod (stateless-04) that had been recreated yesterday and had many errors in its trace logs. We wanted to exclude it, but got the following error.

fdb> exclude 10.xxx.yyy.45:4300
WARNING: Long delay (Ctrl-C to interrupt)
ERROR: This exclude may cause the total free space in the cluster to drop below 10%.
Type `exclude FORCE <ADDRESS>*' to exclude without checking free space.

The error looked scary, so we didn’t use FORCE. Instead, we removed the pod from the cluster (by scaling the K8s deployment down to 0), and the cluster went back to normal.

But the second error persisted after we brought the database back up. The cluster was slow for writes and for retrieving status. We scanned all fdb trace logs and found one storage pod that was generating thousands of “N2_ConnectError” events in quick succession. That pod had also been recreated yesterday.
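For anyone who wants to repeat this kind of scan, here is a minimal sketch of how the trace logs could be tallied per process. It assumes the default XML trace format, where each event is a single `<Event ... />` line carrying `Type` and `Machine` attributes; the function name `count_events_by_machine` is hypothetical, not part of any fdb tooling.

```python
import re
from collections import Counter
from typing import Iterable

# Matches key="value" attribute pairs on a single trace line.
_ATTR = re.compile(r'(\w+)="([^"]*)"')


def count_events_by_machine(lines: Iterable[str], event_type: str) -> Counter:
    """Count trace events of one Type, grouped by the Machine attribute.

    Works line by line with a regex rather than an XML parser, because a
    trace file that is still being written may lack its closing tag.
    """
    counts: Counter = Counter()
    for line in lines:
        if not line.lstrip().startswith("<Event"):
            continue  # skip the XML header, <Trace> wrapper, blank lines
        attrs = dict(_ATTR.findall(line))
        if attrs.get("Type") == event_type:
            counts[attrs.get("Machine", "unknown")] += 1
    return counts
```

Feeding it the lines of every trace file and calling `most_common()` on the result puts the noisiest process at the top, which is how an outlier like the pod above would stand out.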

We restarted the pod, but it didn’t help. As before, we removed the pod from the cluster, after which the batch-priority transaction error disappeared and the cluster went back to normal.

Alex wrote in the post "Batch priority transactions": “Batch priority transactions can be indefinitely starved while one storage server is failed.” It looks like we hit exactly that today.
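To confirm that a message like this has really cleared, one option is to check the machine-readable status rather than read it by eye. The sketch below assumes the status document lists current issues under `cluster.messages`, each with a `description` field; the helper name `find_cluster_messages` is hypothetical.

```python
from typing import Any, Dict, List


def find_cluster_messages(status: Dict[str, Any], needle: str) -> List[str]:
    """Return descriptions of cluster messages containing `needle`.

    `status` is the parsed JSON status document. Missing keys are treated
    as "no messages" so a partial status does not raise.
    """
    messages = status.get("cluster", {}).get("messages", [])
    return [
        m.get("description", "")
        for m in messages
        if needle.lower() in m.get("description", "").lower()
    ]
```

The input could come from parsing the output of `fdbcli --exec 'status json'` with `json.loads`. An empty result for `"batch priority"` would suggest the warning is no longer being reported.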

A few questions:

Why would one stateless pod out of 23 make the cluster unavailable?
Why would one storage pod out of 110 indefinitely starve batch-priority transactions?
Have any changes related to these issues been made after v6.2.27 (the version we use)?

Thank you in advance for helping us understand the issues for future prevention.
