Title: Help needed with replication issues and storage process errors after continuous data writes
Description:
I have a FoundationDB cluster (7.1.5) with 3 machines. Each machine has four RAID 5 volumes, each made up of 6 SSDs. On each disk, there are four FDB processes. The redundancy mode is set to double replication.
Last week, we performed continuous data writes for 5 consecutive days. Today, we noticed that many queries are returning the error “Storage process does not have recent mutations” (1037). Checking the FDB cluster status, we found the replication status message: “Only one replica remains of some data.” One of the machines also had many io_timeout warnings on several storage servers (SS).
We decided to restart all processes on this machine. After the restart, the status changed to “HEALING: Restoring replication factor”, but many storage servers are showing “Storage Server lagging by xxx seconds”. The moving data size is over 700 GB. However, after waiting the whole afternoon, the moving data size hasn’t decreased at all, and it seems like FDB cannot complete the recovery on its own.
Could anyone help me understand the potential cause of this issue? Is there a way to recover from this state? Would excluding the lagging processes be effective in resolving this?
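For anyone trying to narrow this down, the lag and moving-data figures mentioned above can be pulled from the machine-readable status instead of being read off the text summary. The following is a minimal sketch, not a definitive tool. It assumes that `fdbcli --exec "status json"` prints a single JSON document, and that the field names used (`cluster.data.moving_data.in_flight_bytes`, `in_queue_bytes`, and `data_lag.seconds` on storage roles) match the status schema of the running version; these should be checked against the real output on 7.1.5. The function names and the 60-second threshold are hypothetical.

```python
import json
import subprocess


def summarize_status(status, lag_threshold_s=60.0):
    """Summarize moving data and lagging storage roles from a parsed status document.

    Field names are assumptions based on the FoundationDB status JSON schema;
    missing fields are treated as zero / absent rather than raising.
    """
    cluster = status.get("cluster", {})
    moving = cluster.get("data", {}).get("moving_data", {})

    lagging = []
    for proc_id, proc in cluster.get("processes", {}).items():
        for role in proc.get("roles", []):
            if role.get("role") != "storage":
                continue
            lag = role.get("data_lag", {}).get("seconds", 0.0)
            if lag > lag_threshold_s:
                # Fall back to the process id if no address is reported.
                lagging.append((proc.get("address", proc_id), lag))

    # Worst offenders first.
    lagging.sort(key=lambda item: item[1], reverse=True)
    return {
        "in_flight_bytes": moving.get("in_flight_bytes", 0),
        "in_queue_bytes": moving.get("in_queue_bytes", 0),
        "lagging": lagging,
    }


def fetch_status():
    """Run fdbcli and parse its output (assumes fdbcli is on PATH and prints only JSON)."""
    result = subprocess.run(
        ["fdbcli", "--exec", "status json"],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)
```

Calling `summarize_status(fetch_status())` every few minutes and comparing the byte counts would show whether data movement is actually stalled or just slow, and whether the lagging storage servers are concentrated on the machine that reported io_timeout.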