Storage node failure test

Would love to get clarifications/suggestions on the results of a storage node failure test we did.

Cluster Details:
16 nodes, 3 i3.xl (nodes 1-3) + 13 i3.4xl (nodes 4-16)
All 3 i3.xl act as Tx processes (2 in each node)
All 13 i3.4xl have 12 storage servers + 4 stateless each node

FoundationDB - 6.2

The load to cluster is ~210K RPS + ~150K WPS.

Configuration:
Redundancy mode - double
Storage engine - ssd-2
Coordinators - 6 (First 6 nodes, 1 per each node)
Desired Resolvers - 3
Desired Logs - 6

Cluster:
FoundationDB processes - 220
Zones - 16
Machines - 16
Memory availability - 7.3 GB per process on machine with least available
Fault Tolerance - 1 machine

Test:

  1. Stop one of the storage nodes (12 storage server + 4 stateless processes)
  2. Wait for ~15 minutes
  3. Start the node back

Rough timeline of events:
12:10p - Stopped EC2 node
12:25p - Started EC2 node back up

Overall status:
Cluster stayed UP, but reported UNHEALTHY (log_server_write_queue bottleneck).
Data replication status showed UNHEALTHY as expected.
The client application built up lag. We saw lag for ~2 hours.
The Tx log queue was at 1.5G during the time and logs were spilled to disk.
The lag built up for first ~1.5hours and came down suddenly as Tx log queue came down.

Is this expected? I was expecting it to go much smoother, meaning didn’t expect the lag to build up/last for so long.

Both disk read/ write went up.

Transactions started dropped, while transactions committed did not change.

I also noticed is the number of client connections dropped to almost half and it came back to normal once the lag was fully/almost recovered. Not sure why that happened.

Any help in connecting dots here is much appreciated. Also what can be done to make recovery process go faster? More Tx processes?

Any insights here is much appreciated.

We also noticed that during the time of lag build up, couple of storage servers in 2 different machines had higher lag compared to the other 150+ storage processes.

As per my understanding taking down storage servers like we did, should not cause this long performance issues. Any insights here are much appreciated. Please let me know if we need more info.

What is the read workload like here? One consequence of losing a storage node is that all reads that could go to that node now have one fewer replica to choose (so in your case with double replication, only 1 choice remains). If that read workload were heavy enough, it could overwhelm 1 storage server while 2 could handle it.

This won’t necessarily be as bad as it sounds if your reads are well-distributed amongst the key-space. Even though any given key will have one copy remaining, each storage server should be responsible for multiple key ranges (called shards internally), each of which could be shared with a different storage server. Therefore with well-distributed reads you might expect to see a more modest increase on a lot of storage servers.

If your reads aren’t well-distributed and instead are mainly happening in a narrow range, then it’s possible that they could all be going to a single shard. In that case, all of those reads only have 1 alternative, and so you would see the read workload double on that alternative when the other dies.

If your read workload is non-negligible, it would be interesting to see how it behaves on each of the storage servers, even if it should be well-distributed. You can check some of the data in the StorageMetrics trace event, such as QueryQueue, RowsQueried, and BytesQueried, which will tell you the number of read requests, keys returned, and bytes returned by each storage server, respectively. Note that the value in each of these is a triple, with the first value being the rate per second.

I should also add that you can check to see if your storage servers (in particular, your most affected ones) see heavy CPU usage during this time. You can look at ProcessMetrics.CPUSeconds compared to ProcessMetrics.Elapsed, or similarly NetworkMetrics.PriorityBusy1 vs. NetworkMetrics.Elapsed. If reads are overwhelming the storage server, it could happen that you become CPU saturated and the time spent busy will be roughly equal to the elapsed time.

One further thought is that even in the absence of a significant increase in client reads, data distribution will be trying to move data from the storage servers that remain in order to re-replicate it. It’s supposed to run at a low-ish speed, but in the event that you are running near enough to the performance limit prior to removing the storage servers, it could push you over the top. You could potentially gauge the effects of data distribution by disabling it during this test. Probably the most surgical way to do that would be to use maintenance mode, which is discussed a bit here:

You could also just disable data movement for all storage server failures by running the following in fdbcli:

fdb> datadistribution disable ssfailure

If data distribution is pushing you over the edge, then I think the immediate options you have available would be to:

  1. Decrease the client workload
  2. Increase the cluster size (and/or maybe increase replication)
  3. Tweak some knobs to slow down data movement (which would result in slower healing)

Thank You @ajbeamon
Just wanted to update that we are dealing with some other issues in the cluster. So waiting till those are resolved before we resume tests based on your suggestions.