Storage node failure test

One further thought: even without a significant increase in client reads, data distribution will be moving data from the remaining storage servers in order to re-replicate it. It’s supposed to run at a fairly low rate, but if you were already running close to the performance limit before removing the storage servers, that extra load could push you over the edge. You could gauge the effect of data distribution by disabling it during this test. Probably the most surgical way to do that would be to use maintenance mode, which is discussed a bit here:

You could also just disable data movement for all storage server failures by running the following in fdbcli:

fdb> datadistribution disable ssfailure
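
Once the test is finished, you would presumably want to turn that data movement back on so the cluster can heal. A minimal sketch, assuming the same fdbcli session as above:

fdb> datadistribution enable ssfailure

For the maintenance mode route, a rough sketch might look like the following, where <ZONEID> and <SECONDS> are placeholders for the zone ID of the machine you are taking down and the length of the maintenance window. It's worth checking the fdbcli help on your version for the exact arguments:

fdb> maintenance on <ZONEID> <SECONDS>
fdb> maintenance off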

If data distribution does turn out to be pushing you over the edge, then I think your immediate options would be to:

  1. Decrease the client workload
  2. Increase the cluster size (and/or maybe increase replication)
  3. Tweak some knobs to slow down data movement (which would result in slower healing)