I run many FoundationDB instances in our testing environment. After upgrading FoundationDB to 6.2.30, I see the instances turn unhealthy more frequently with a StorageServerFailed error. Generally, restarting the FoundationDB processes brings the cluster back to a healthy state. Can this behavior be explained by the changes in the above PR?
I see from a comment on the PR that, due to the changes in the PR, the storage server will no longer be automatically restarted in the case of an io_error. What is the change when a storage server experiences an io_timeout? Is io_timeout treated as an io_error?
Did you see IoDegraded events more often?
What investigation did you do to figure out the behavior?
IIRC, there are two behavior changes in 6.2 (maybe 6.2.30):
SS prioritizes write requests when it sees a hot shard;
SS is not automatically re-recruited on io_timeout.
Neither behavior change seems to introduce a negative impact on the cluster. @Evan can comment more.
@mengxu Thanks for the reply. There were IoDegraded events while the storage failures were observed, which were not seen before. I also see other slowness-related errors at the same time.
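For reference, this is roughly how I spotted the IoDegraded events (assuming the default XML trace logs under /var/log/foundationdb; the sample event line below is illustrative, not copied from a real log):

```shell
# Illustrative sample trace event; real files live at /var/log/foundationdb/trace.*.xml
printf '<Event Severity="30" Type="IoDegraded" Machine="172.18.0.30:4500"/>\n' \
  > /tmp/trace.sample.xml

# Count IoDegraded events; the same grep works across the real trace files
grep -c 'Type="IoDegraded"' /tmp/trace.sample.xml
```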
The output of status details is as follows:
Using cluster file `/etc/foundationdb/fdb.cluster'.

Configuration:
Redundancy mode - double
Storage engine - ssd-2
Coordinators - 3
Desired Proxies - 1
Desired Logs - 2
Usable Regions - 1

Cluster:
FoundationDB processes - 10 (less 0 excluded; 2 with errors)
Zones - 5
Machines - 5
Memory availability - 15.3 GB per process on machine with least available
Fault Tolerance - 0 machines
Server time - 06/01/21 22:08:36

Data:
Replication health - UNHEALTHY: No replicas remain of some data
Moving data - 102.309 GB
Sum of key-value sizes - 143.216 GB
Disk space used - 219.980 GB

Operating space:
Storage server - 376.3 GB free on most full server
Log server - 396.2 GB free on most full server

Workload:
Read rate - 309 Hz
Write rate - 78 Hz
Transactions started - 176 Hz
Transactions committed - 27 Hz
Conflict rate - 0 Hz

Backup and DR:
Running backups - 0
Running DRs - 0

Process performance details:
172.18.0.30:4500 ( 1% cpu; 1% machine; 0.007 Gbps; 0% disk IO; 3.5 GB / 15.3 GB RAM )
Last logged error: StorageServerFailed: io_timeout at Tue Jun 1 02:58:35 2021
172.18.0.30:4501 ( 1% cpu; 1% machine; 0.007 Gbps; 0% disk IO; 0.3 GB / 15.3 GB RAM )
172.18.0.31:4500 ( 2% cpu; 2% machine; 0.015 Gbps; 1% disk IO; 3.6 GB / 15.7 GB RAM )
172.18.0.31:4501 ( 3% cpu; 2% machine; 0.015 Gbps; 1% disk IO; 0.4 GB / 15.7 GB RAM )
172.18.0.32:4500 ( 1% cpu; 1% machine; 0.006 Gbps; 0% disk IO; 3.5 GB / 16.3 GB RAM )
172.18.0.32:4501 ( 2% cpu; 1% machine; 0.006 Gbps; 0% disk IO; 0.4 GB / 16.3 GB RAM )
172.18.0.35:4500 ( 3% cpu; 1% machine; 0.003 Gbps; 14% disk IO; 4.6 GB / 16.8 GB RAM )
172.18.0.35:4501 ( 0% cpu; 1% machine; 0.003 Gbps; 14% disk IO; 0.2 GB / 16.8 GB RAM )
172.18.0.36:4500 ( 2% cpu; 1% machine; 0.007 Gbps; 21% disk IO; 4.6 GB / 17.9 GB RAM )
Last logged error: StorageServerFailed: io_timeout at Tue Jun 1 02:58:39 2021
172.18.0.36:4501 ( 0% cpu; 1% machine; 0.007 Gbps; 11% disk IO; 0.3 GB / 17.9 GB RAM )

Coordination servers:
172.18.0.30:4501 (reachable)
172.18.0.31:4501 (reachable)
172.18.0.32:4501 (reachable)

Client time: 06/01/21 22:08:36

WARNING: A single process is both a transaction log and a storage server.
For best performance use dedicated disks for the transaction logs by setting process classes.
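As an aside, to pull just the failing processes out of a long status dump, something like this works (the saved-file path is hypothetical, and the two-line sample stands in for the full output above):

```shell
# Save the output first, e.g.:
#   fdbcli -C /etc/foundationdb/fdb.cluster --exec 'status details' > /tmp/status.txt
# A two-line sample keeps this snippet self-contained:
cat > /tmp/status.txt <<'EOF'
172.18.0.30:4500 ( 1% cpu; 1% machine; 0.007 Gbps; 0% disk IO; 3.5 GB / 15.3 GB RAM )
Last logged error: StorageServerFailed: io_timeout at Tue Jun 1 02:58:35 2021
EOF

# Show each "Last logged error" with the process line above it
grep -B1 'Last logged error' /tmp/status.txt
```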
The clients connecting to FDB experience a lot of transaction failures during this time, which are resolved only when the FDB processes are manually restarted.