The cluster is continuously Restoring replication factor, and Moving data has not decreased

Northwo · October 10, 2024, 10:04am

Title: Help needed with replication issues and storage process errors after continuous data writes

Description:
I have a FoundationDB cluster (version 7.1.5) with 3 machines. Each machine has four RAID 5 volumes, each made up of 6 SSDs. On each disk, there are four FDB processes. The redundancy mode is set to double replication.

Last week, we performed continuous data writes for 5 consecutive days. Today, we noticed that many queries are returning error 1037 (process_behind): “Storage process does not have recent mutations.” Checking the FDB cluster status, we found the replication status message: “Only one replica remains of some data.” One of the machines also had many io_timeout warnings on several storage servers (SS).

We decided to restart all processes on this machine. After the restart, the status changed to “HEALING: Restoring replication factor”, but many storage servers are showing “Storage server lagging by xxx seconds.” The moving data size is over 700GB. After waiting the whole afternoon, the moving data size has not decreased at all, and it seems that FDB cannot complete the recovery on its own.
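
To track whether the moving data is really stuck, the numbers above can be pulled from the machine-readable status instead of being read by eye. The following is a minimal sketch, not a tested tool: it assumes `fdbcli` is on the PATH, that `fdbcli --exec "status json"` prints only the JSON document, and that the fields `cluster.data.moving_data` and the per-role `data_lag.seconds` are present in this version's status output. The function names and the 60-second threshold are made up for illustration.

```python
import json
import subprocess


def fetch_status(fdbcli="fdbcli"):
    """Run `fdbcli --exec "status json"` and return the parsed document.

    Assumes the command prints nothing but JSON on stdout.
    """
    result = subprocess.run(
        [fdbcli, "--exec", "status json"],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)


def summarize(status, lag_threshold_s=60.0):
    """Extract data-movement counters and lagging storage roles.

    Missing fields are treated as absent rather than as errors, because
    the exact schema may differ between FDB versions.
    """
    cluster = status.get("cluster", {})
    data = cluster.get("data", {})
    moving = data.get("moving_data", {})

    summary = {
        "state": data.get("state", {}).get("name"),
        "in_flight_bytes": moving.get("in_flight_bytes", 0),
        "in_queue_bytes": moving.get("in_queue_bytes", 0),
        "lagging": [],
    }

    # Each process lists its roles; only storage roles report data_lag here.
    for proc in cluster.get("processes", {}).values():
        for role in proc.get("roles", []):
            if role.get("role") != "storage":
                continue
            lag = role.get("data_lag", {}).get("seconds", 0.0)
            if lag > lag_threshold_s:
                summary["lagging"].append((proc.get("address"), lag))

    # Worst offenders first.
    summary["lagging"].sort(key=lambda item: item[1], reverse=True)
    return summary
```

Calling `summarize(fetch_status())` every few minutes and comparing `in_flight_bytes + in_queue_bytes` between samples would show whether the 700GB figure is flat or only shrinking slowly, and which storage server addresses account for the lag.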

Could anyone help me understand the potential cause of this issue? Is there a way to recover from this state? Would excluding the lagging processes be effective in resolving this?
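
For concreteness, the exclusion asked about here could be issued roughly as follows. This is only a sketch to make the question precise, not a recommendation: it assumes `fdbcli` is on the PATH and uses its `exclude` command via `--exec`; the helper name and the addresses in the usage note are placeholders, not values from this cluster.

```python
def build_exclude_command(addresses, fdbcli="fdbcli"):
    """Build the argv for `fdbcli --exec "exclude <addr> <addr> ..."`.

    `addresses` is a list of "ip:port" strings for the lagging storage
    processes. The command is only constructed here, not executed, so it
    can be reviewed before it is run against a degraded cluster.
    """
    if not addresses:
        raise ValueError("at least one address is required")
    return [fdbcli, "--exec", "exclude " + " ".join(addresses)]
```

For example, `build_exclude_command(["10.0.0.1:4500", "10.0.0.1:4501"])` yields an argument list that could be passed to `subprocess.run`. Whether excluding is safe while only one replica of some data remains is exactly the open question.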

Northwo · October 10, 2024, 10:05am
