Hi, I had an incident with one of my fdb clusters. The cluster has 3 machines, each running 5-6 processes, 3 of which are storage processes. It's on double replication. The storage servers came close to filling up, so we expanded our EBS volumes and ran xfs_growfs. Shortly after, transactions started failing because they were exceeding the 5-second transaction limit.
status details showed that the storage processes on all machines were failing with StorageServerFailed: io_error, and replication health reported UNHEALTHY: No replicas remain of some data. We tried restarting the whole cluster; it would report the data as repairing, or Healthy (Removing storage server), for a while, until the storage processes once again began failing with io_error.
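For reference, here is roughly how we've been watching this programmatically (a minimal sketch using the Python bindings; the api_version is an assumption about our client version, and \xff\xff/status/json is the same JSON document that fdbcli's status json returns):

```python
import json
import fdb

fdb.api_version(710)  # assumption: set this to your cluster's API version
db = fdb.open()       # uses the default cluster file location

# The special key \xff\xff/status/json holds the same JSON as `status json`.
raw = db[b'\xff\xff/status/json']
status = json.loads(bytes(raw))

# Per-process messages are where io_error shows up in the status document.
for addr, proc in status['cluster']['processes'].items():
    roles = ','.join(r['role'] for r in proc.get('roles', []))
    for msg in proc.get('messages', []):
        print(addr, roles, msg.get('name'), msg.get('description', ''))
```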
My understanding of the storage servers dropping out is that once a storage server hits an io_error, its storage role is terminated. And since storage servers on all machines were failing, we lost all replicas of some of our data.
Looking through the codebase, it seems (?) that io_error is generally thrown due to either I/O problems or file corruption. I/O load wasn't especially high, and we had cut off incoming requests from the client application. We've expanded our drives many times before, and I find it unlikely that a resize would corrupt files, much less on all machines at once. What could cause io_error on all machines?
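In case it helps with diagnosis, this is roughly how I've been pulling the relevant events out of the trace logs (a minimal sketch; /var/log/foundationdb is the default logdir from the Linux packages and may differ per foundationdb.conf, and Severity="40" corresponds to SevError events):

```python
import glob

# Assumption: default logdir from the Linux packages; adjust to match
# the logdir set in foundationdb.conf.
TRACE_GLOB = "/var/log/foundationdb/trace.*.xml"

# SevError events are written with Severity="40"; the events surrounding
# an io_error usually name the failing file and carry an OS error code.
for path in sorted(glob.glob(TRACE_GLOB)):
    with open(path, errors="replace") as f:
        for line in f:
            if 'Severity="40"' in line or "io_error" in line:
                print(f"{path}: {line.strip()}")
```

The events around the io_error should show which file operation failed and the underlying OS error code, which would at least help distinguish an EBS/filesystem problem from actual file corruption.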