Behaviour of Storage Server in case of bad disk from 6.2.30 onwards

Can someone answer the following queries in the context of the PR below?

  1. From 6.2.30 onwards, will the SS get removed and restarted when an io_timeout error is thrown (due to a commit in the SS taking more than 2 minutes)?
  2. What happens to the SS process in the case of an io_error? Will it now be left in a hung state?

Does anyone have any thoughts on my query?

I run many FoundationDB instances in our testing environment. After upgrading FoundationDB to 6.2.30, I see the instances turn unhealthy more frequently with a StorageServerFailed error. Generally, restarting the FoundationDB processes brings the cluster back to a healthy state. Can this behavior be explained by the changes in the above PR?

I see from a comment on the PR that, due to the changes in the PR, the StorageServer will not be automatically restarted in the case of an io_error. What changes in the case where the StorageServer experiences an io_timeout? Is io_timeout treated as an io_error?

Did you see IoDegraded events more often?
What investigation did you do to figure out the behavior?
IIRC, there are two behavior changes in 6.2 (maybe 6.2.30):

  1. The SS will prioritize write requests when it sees a hot shard;
  2. The SS does not automatically get recruited on io_timeout.

Neither behavior change seems to introduce a negative impact on the cluster. @Evan can comment more.

@mengxu Thanks for the reply. There were IoDegraded events while the storage failure was observed, which were not seen before. I also see other slowness-related errors at the same time.

[screenshot: trace events showing IoDegraded and other slowness-related errors]
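For reference, a minimal sketch of how these events can be pulled out of the trace logs (it assumes the default XML trace format and a log directory of /var/log/foundationdb; both are assumptions that may differ per deployment):

```python
# Sketch: count IoDegraded / StorageServerFailed events in FDB trace logs.
# Assumes the default XML trace format; the log directory is an assumption.
import glob
import xml.etree.ElementTree as ET
from collections import Counter

LOG_GLOB = "/var/log/foundationdb/trace.*.xml"   # assumption: default log dir
EVENTS_OF_INTEREST = {"IoDegraded", "StorageServerFailed"}

counts = Counter()
for path in glob.glob(LOG_GLOB):
    # Each trace file is a stream of <Event .../> elements under a <Trace> root.
    try:
        root = ET.parse(path).getroot()
    except ET.ParseError:
        continue  # the file currently being written may not be well-formed yet
    for event in root.iter("Event"):
        etype = event.get("Type")
        if etype in EVENTS_OF_INTEREST:
            counts[(event.get("Machine"), etype)] += 1

for (machine, etype), n in sorted(counts.items()):
    print(f"{machine}  {etype}: {n}")
```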

The output of status details is as follows:

Using cluster file `/etc/foundationdb/fdb.cluster'.

Configuration:
  Redundancy mode        - double
  Storage engine         - ssd-2
  Coordinators           - 3
  Desired Proxies        - 1
  Desired Logs           - 2
  Usable Regions         - 1

Cluster:
  FoundationDB processes - 10 (less 0 excluded; 2 with errors)
  Zones                  - 5
  Machines               - 5
  Memory availability    - 15.3 GB per process on machine with least available
  Fault Tolerance        - 0 machines
  Server time            - 06/01/21 22:08:36

Data:
  Replication health     - UNHEALTHY: No replicas remain of some data
  Moving data            - 102.309 GB
  Sum of key-value sizes - 143.216 GB
  Disk space used        - 219.980 GB

Operating space:
  Storage server         - 376.3 GB free on most full server
  Log server             - 396.2 GB free on most full server

Workload:
  Read rate              - 309 Hz
  Write rate             - 78 Hz
  Transactions started   - 176 Hz
  Transactions committed - 27 Hz
  Conflict rate          - 0 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Process performance details:
  172.18.0.30:4500       (  1% cpu;  1% machine; 0.007 Gbps;  0% disk IO; 3.5 GB / 15.3 GB RAM  )
    Last logged error: StorageServerFailed: io_timeout at Tue Jun  1 02:58:35 2021
  172.18.0.30:4501       (  1% cpu;  1% machine; 0.007 Gbps;  0% disk IO; 0.3 GB / 15.3 GB RAM  )
  172.18.0.31:4500       (  2% cpu;  2% machine; 0.015 Gbps;  1% disk IO; 3.6 GB / 15.7 GB RAM  )
  172.18.0.31:4501       (  3% cpu;  2% machine; 0.015 Gbps;  1% disk IO; 0.4 GB / 15.7 GB RAM  )
  172.18.0.32:4500       (  1% cpu;  1% machine; 0.006 Gbps;  0% disk IO; 3.5 GB / 16.3 GB RAM  )
  172.18.0.32:4501       (  2% cpu;  1% machine; 0.006 Gbps;  0% disk IO; 0.4 GB / 16.3 GB RAM  )
  172.18.0.35:4500       (  3% cpu;  1% machine; 0.003 Gbps; 14% disk IO; 4.6 GB / 16.8 GB RAM  )
  172.18.0.35:4501       (  0% cpu;  1% machine; 0.003 Gbps; 14% disk IO; 0.2 GB / 16.8 GB RAM  )
  172.18.0.36:4500       (  2% cpu;  1% machine; 0.007 Gbps; 21% disk IO; 4.6 GB / 17.9 GB RAM  )
    Last logged error: StorageServerFailed: io_timeout at Tue Jun  1 02:58:39 2021
  172.18.0.36:4501       (  0% cpu;  1% machine; 0.007 Gbps; 11% disk IO; 0.3 GB / 17.9 GB RAM  )

Coordination servers:
  172.18.0.30:4501  (reachable)
  172.18.0.31:4501  (reachable)
  172.18.0.32:4501  (reachable)

Client time: 06/01/21 22:08:36

WARNING: A single process is both a transaction log and a storage server.
  For best performance use dedicated disks for the transaction logs by setting process classes.
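As an aside, this is roughly how the "Last logged error" lines can be watched without eyeballing the full status output each time; a small sketch that shells out to fdbcli, where the cluster file path is an assumption:

```python
# Sketch: print only the processes reporting a "Last logged error" in
# `status details`. The cluster file path is an assumption.
import re
import subprocess

CLUSTER_FILE = "/etc/foundationdb/fdb.cluster"  # assumption

out = subprocess.run(
    ["fdbcli", "-C", CLUSTER_FILE, "--exec", "status details"],
    capture_output=True, text=True, check=True,
).stdout

current_process = None
for line in out.splitlines():
    m = re.match(r"\s+(\d+\.\d+\.\d+\.\d+:\d+)\s+\(", line)
    if m:
        current_process = m.group(1)
    elif "Last logged error" in line and current_process:
        print(f"{current_process}  {line.strip()}")
```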

The clients connecting to FDB experience a lot of transaction failures during this time, which are only resolved when the FDB processes are manually restarted.
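(For what it's worth, here is a minimal sketch of how the two affected storage processes could be bounced individually via fdbcli's `kill` command instead of restarting every FDB process. It assumes the processes are managed by fdbmonitor, which brings a killed fdbserver back up; the addresses and cluster file are placeholders taken from the status output above.)

```python
# Sketch: bounce specific fdbserver processes with fdbcli's `kill` command.
# Assumes fdbmonitor is managing the processes and will restart them.
import subprocess

CLUSTER_FILE = "/etc/foundationdb/fdb.cluster"      # placeholder
FAILED = ["172.18.0.30:4500", "172.18.0.36:4500"]   # from `status details`

# The first `kill` populates fdbcli's list of known processes;
# the second kills the listed addresses.
cmd = "kill; kill " + " ".join(FAILED)
subprocess.run(["fdbcli", "-C", CLUSTER_FILE, "--exec", cmd], check=True)
```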