Storage node failure test

Would love to get clarifications/suggestions on the results of a storage node failure test we did.

Cluster Details:
16 nodes, 3 i3.xl (nodes 1-3) + 13 i3.4xl (nodes 4-16)
All 3 i3.xl nodes run Tx processes (2 per node)
All 13 i3.4xl nodes run 12 storage servers + 4 stateless processes each

FoundationDB - 6.2

The load on the cluster is ~210K RPS (reads/sec) + ~150K WPS (writes/sec).

Configuration:
Redundancy mode - double
Storage engine - ssd-2
Coordinators - 6 (first 6 nodes, 1 per node)
Desired Resolvers - 3
Desired Logs - 6
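
(For reference, a configuration like this would be applied through fdbcli's configure command; the line below is a sketch, not necessarily the exact command we ran:)

fdb> configure double ssd-2 logs=6 resolvers=3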

Cluster:
FoundationDB processes - 220
Zones - 16
Machines - 16
Memory availability - 7.3 GB per process on machine with least available
Fault Tolerance - 1 machine

Test:

  1. Stop one of the storage nodes (12 storage servers + 4 stateless processes)
  2. Wait for ~15 minutes
  3. Start the node back

Rough timeline of events:
12:10p - Stopped EC2 node
12:25p - Started EC2 node back up

Overall status:
Cluster stayed UP, but reported UNHEALTHY (log_server_write_queue bottleneck).
Data replication status showed UNHEALTHY as expected.
The client application built up lag. We saw lag for ~2 hours.
The Tx log queue sat at 1.5 GB during that time and the logs spilled to disk.
The lag built up for the first ~1.5 hours and came down suddenly as the Tx log queue drained.

Is this expected? I was expecting it to go much more smoothly; I didn't expect the lag to build up and last for so long.

Both disk reads and writes went up.

The rate of transactions started dropped, while the rate of transactions committed did not change.

I also noticed that the number of client connections dropped to almost half, and it came back to normal once the lag had (almost) fully recovered. Not sure why that happened.

Any help connecting the dots here is much appreciated. Also, what can be done to make the recovery process go faster? More Tx processes?

We also noticed that during the lag build-up, a couple of storage servers on 2 different machines had higher lag than the other 150+ storage processes.

As per my understanding, taking down storage servers the way we did should not cause performance issues that last this long. Any insights are much appreciated. Please let me know if you need more info.

What is the read workload like here? One consequence of losing a storage node is that all reads that could go to that node now have one fewer replica to choose from (so in your case with double replication, only 1 choice remains). If that read workload were heavy enough, it could overwhelm 1 storage server where 2 could have handled it.

This won’t necessarily be as bad as it sounds if your reads are well-distributed amongst the key-space. Even though any given key will have one copy remaining, each storage server should be responsible for multiple key ranges (called shards internally), each of which could be shared with a different storage server. Therefore with well-distributed reads you might expect to see a more modest increase on a lot of storage servers.

If your reads aren’t well-distributed and instead are mainly happening in a narrow range, then it’s possible that they could all be going to a single shard. In that case, all of those reads only have 1 alternative, and so you would see the read workload double on that alternative when the other dies.

If your read workload is non-negligible, it would be interesting to see how it behaves on each of the storage servers, even if it should be well-distributed. You can check some of the data in the StorageMetrics trace event, such as QueryQueue, RowsQueried, and BytesQueried, which will tell you the number of read requests, keys returned, and bytes returned by each storage server, respectively. Note that the value in each of these is a triple, with the first value being the rate per second.
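
If you're using the default XML trace format, a rough sketch of pulling those per-second rates out of the trace files might look like the following (the trace path and the ID attribute are assumptions about your setup, so adjust as needed):

import glob
import xml.etree.ElementTree as ET
from collections import defaultdict

TRACE_GLOB = "/var/log/foundationdb/trace.*.xml"  # assumed default trace location

read_hz = defaultdict(list)  # storage server ID -> per-second read request rates

for path in glob.glob(TRACE_GLOB):
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line.startswith("<Event"):
                continue
            try:
                event = ET.fromstring(line)
            except ET.ParseError:
                continue  # skip partially written lines
            if event.get("Type") != "StorageMetrics":
                continue
            # QueryQueue (like RowsQueried/BytesQueried) is a "rate roughness counter"
            # triple; the first field is the per-second rate.
            hz = float(event.get("QueryQueue", "0 0 0").split()[0])
            read_hz[event.get("ID", "unknown")].append(hz)

for ss, rates in sorted(read_hz.items(), key=lambda kv: -max(kv[1])):
    print(f"{ss}: peak {max(rates):.0f}/s, mean {sum(rates) / len(rates):.0f}/s")

Running that over traces from before and during the failure should make it obvious whether a handful of storage servers absorbed most of the redirected reads.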

I should also add that you can check to see if your storage servers (in particular, your most affected ones) see heavy CPU usage during this time. You can look at ProcessMetrics.CPUSeconds compared to ProcessMetrics.Elapsed, or similarly NetworkMetrics.PriorityBusy1 vs. NetworkMetrics.Elapsed. If reads are overwhelming the storage server, it could happen that you become CPU saturated and the time spent busy will be roughly equal to the elapsed time.
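
Using the same line-by-line trace parsing as above, the CPU check is basically a per-event ratio (again just a sketch):

def busy_fraction(event):
    # event: a parsed <Event Type="ProcessMetrics" .../> element.
    # Values near 1.0 mean the process was CPU-saturated for that interval.
    elapsed = float(event.get("Elapsed", "0"))
    return float(event.get("CPUSeconds", "0")) / elapsed if elapsed > 0 else 0.0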

One further thought is that even in the absence of a significant increase in client reads, data distribution will be trying to move data from the storage servers that remain in order to re-replicate it. It’s supposed to run at a low-ish speed, but in the event that you are running near enough to the performance limit prior to removing the storage servers, it could push you over the top. You could potentially gauge the effects of data distribution by disabling it during this test. Probably the most surgical way to do that would be to use maintenance mode, which is discussed a bit here:
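
(For illustration, the basic fdbcli commands look roughly like this; the zone ID and duration are placeholders:)

fdb> maintenance on <zone-id> <seconds>
fdb> maintenance
fdb> maintenance off

While a zone is in maintenance, data distribution won't start re-replicating data away from the storage servers in that zone.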

You could also just disable data movement for all storage server failures by running the following in fdbcli:

fdb> datadistribution disable ssfailure

If data distribution is pushing you over the edge, then I think the immediate options you have available would be to:

  1. Decrease the client workload
  2. Increase the cluster size (and/or maybe increase replication)
  3. Tweak some knobs to slow down data movement (which would result in slower healing)

Thank you @ajbeamon.
Just wanted to update that we are dealing with some other issues in the cluster, so we're waiting until those are resolved before we resume tests based on your suggestions.

We did another round of testing. This time the node to be taken down was first put under maintenance mode. From what I observed, it looks like data migration is what's pushing us over the edge. Please let me know if I am missing something.

In short, the following was performed:

  1. Put the node's zone into maintenance mode (11:43 AM, 30-minute window)
  2. Stop EC2 node - Node 12 (11:47 AM)
  3. Observe
  4. Bring back EC2 node within maintenance window (12:04 PM)
  5. Maintenance window lifted (12:13 PM)

Observations:

  • Ratekeeper slowed down transactions a bit when the node was stopped
  • The in-memory Tx log queue was filling up (hitting 1.5 GB and then spilling to disk)
  • Although we saw a reduction in transactions started, the client application was still able to keep up

(Chart: fdb_cluster_qos_transactions_per_second_limit and fdb_cluster_qos_released_transactions_per_second)

(Chart: fdb_cluster_workload_operations_read_requests_hz)

We saw a drop in RPS when the node was started back up.
We do not understand why we observe this behavior.

Application Lag

Lag started ~12:04 PM, around the time the stopped EC2 node was started back up.

Similar to our last test, once data migration started we began seeing storage lag on a couple of storage servers. We do range reads, but they should be well distributed.

Once the maintenance window was lifted, we saw data migration kick in. My expectation was that it would only copy delta data (meaning the data the node fell behind on while it was out). However, it did not seem that way looking at the data-in-queue metric. This in turn caused more application lag. The application lag came down quickly once the storage lag on the affected servers came down.

fdb_cluster_processes_roles_storage_query_queue_max spiked, but did not show skew across storage servers.

Is this the number of reads queued per storage server?
Also, we did not notice skew in fdb_trace_bytes_queried_value.

(Chart: fdb_trace_query_queue_value)

Could you please help me understand the difference between fdb_cluster_processes_roles_storage_query_queue_max and fdb_trace_bytes_queried_value?

More detailed timeline of events performed/observed:

11:43a Maintenance on for node12 zone for 30 minutes
11:47a Instance stop issued on node12
11:49a Database reported unhealthy due to log_server_write_queue
11:50a Tx log queue size reached 1.5GB for all Tx processes
11:56a Tx log started spilling to disk
12:04p Started node back up. Still inside maintenance window.
12:05p Read rate started dropping (~220K RPS at 12:04p)
12:13p Lag started building up at client side
12:14p Maintenance window over
12:14p Data migration started
12:16p Read rate - ~150K RPS
12:32p Storage lag starting to build up for a couple of storage servers
12:33p Read rate climbed back up to 230K RPS
12:48p Ratekeeper started throttling clients
12:50p Read rate starts to drop again
12:53p Read rate ~120K RPS
1:00p Read rate ~30K RPS
1:04p Ratekeeper throttling stops
1:05p Read rate back up to ~250K RPS
1:20p Tx log queue starting to come down slowly
2:00p Application lag completely recovered
2:30p Tx log queue back to normal
2:30p Data migration completed

@ajbeamon any thoughts?