Cluster:
FoundationDB processes - 220
Zones - 16
Machines - 16
Memory availability - 7.3 GB per process on machine with least available
Fault Tolerance - 1 machine
Test:
Stop one of the storage nodes (12 storage server processes + 4 stateless processes)
Wait for ~15 minutes
Start the node back
Rough timeline of events:
12:10 PM - Stopped EC2 node
12:25 PM - Started EC2 node back up
Overall status:
Cluster stayed UP, but reported UNHEALTHY (log_server_write_queue bottleneck).
Data replication status showed UNHEALTHY as expected.
The client application built up lag. We saw lag for ~2 hours.
The Tx log queue was at 1.5 GB during this time and logs were spilled to disk.
The lag built up for the first ~1.5 hours and came down suddenly as the Tx log queue drained.
I also noticed that the number of client connections dropped to almost half, then returned to normal once the lag was fully (or almost fully) recovered. Not sure why that happened.
We also noticed that during the lag build-up, a couple of storage servers on 2 different machines had higher lag than the other 150+ storage processes.
As per my understanding, taking down storage servers the way we did should not cause performance issues for this long. Any insights here are much appreciated. Please let me know if you need more info.
What is the read workload like here? One consequence of losing a storage node is that all reads that could go to that node now have one fewer replica to choose from (so in your case with double replication, only 1 choice remains). If that read workload were heavy enough, it could overwhelm 1 storage server while 2 could handle it.
This won’t necessarily be as bad as it sounds if your reads are well-distributed amongst the key-space. Even though any given key will have one copy remaining, each storage server should be responsible for multiple key ranges (called shards internally), each of which could be shared with a different storage server. Therefore with well-distributed reads you might expect to see a more modest increase on a lot of storage servers.
If your reads aren’t well-distributed and instead are mainly happening in a narrow range, then it’s possible that they could all be going to a single shard. In that case, all of those reads only have 1 alternative, and so you would see the read workload double on that alternative when the other dies.
If your read workload is non-negligible, it would be interesting to see how it behaves on each of the storage servers, even if it should be well-distributed. You can check some of the data in the StorageMetrics trace event, such as QueryQueue, RowsQueried, and BytesQueried, which will tell you the number of read requests, keys returned, and bytes returned by each storage server, respectively. Note that the value in each of these is a triple, with the first value being the rate per second.
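If it helps, here is a minimal sketch (not part of the original discussion) for summarizing those per-storage-server read rates from the trace logs. It assumes the default XML trace format, a log directory of /var/log/foundationdb, and that counters like QueryQueue are space-separated triples whose first element is the per-second rate, as described above:

```python
# Sketch: summarize per-process read rates from StorageMetrics trace events.
# Assumes default XML trace files under /var/log/foundationdb (adjust as needed).
import glob
from collections import defaultdict
import xml.etree.ElementTree as ET

rates = defaultdict(list)  # (machine, counter) -> list of per-second rates

for path in glob.glob("/var/log/foundationdb/trace.*.xml"):
    try:
        for _, elem in ET.iterparse(path, events=("end",)):
            if elem.get("Type") == "StorageMetrics":
                machine = elem.get("Machine", "unknown")
                for counter in ("QueryQueue", "RowsQueried", "BytesQueried"):
                    value = elem.get(counter)
                    if value:
                        # Counter attributes look like "<rate> <roughness> <total>".
                        rates[(machine, counter)].append(float(value.split()[0]))
            elem.clear()
    except ET.ParseError:
        pass  # a trace file still being written may not be well-formed yet

for (machine, counter), samples in sorted(rates.items()):
    print(f"{machine:22} {counter:13} peak={max(samples):,.0f}/s avg={sum(samples)/len(samples):,.0f}/s")
```

Comparing the peaks on the most affected storage servers against the rest should make any read skew obvious.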
I should also add that you can check to see if your storage servers (in particular, your most affected ones) see heavy CPU usage during this time. You can look at ProcessMetrics.CPUSeconds compared to ProcessMetrics.Elapsed, or similarly NetworkMetrics.PriorityBusy1 vs. NetworkMetrics.Elapsed. If reads are overwhelming the storage server, it could happen that you become CPU saturated and the time spent busy will be roughly equal to the elapsed time.
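Along the same lines, a rough sketch (same assumptions about trace location and format) for estimating the CPU busy fraction per process from ProcessMetrics:

```python
# Sketch: approximate CPU saturation per process from ProcessMetrics trace events.
import glob
from collections import defaultdict
import xml.etree.ElementTree as ET

cpu = defaultdict(float)      # machine -> total CPU seconds
elapsed = defaultdict(float)  # machine -> total elapsed seconds

for path in glob.glob("/var/log/foundationdb/trace.*.xml"):
    try:
        for _, elem in ET.iterparse(path, events=("end",)):
            if elem.get("Type") == "ProcessMetrics":
                machine = elem.get("Machine", "unknown")
                cpu[machine] += float(elem.get("CPUSeconds", 0))
                elapsed[machine] += float(elem.get("Elapsed", 0))
            elem.clear()
    except ET.ParseError:
        pass

for machine in sorted(cpu):
    if elapsed[machine]:
        # A value near 100% means the process was CPU saturated over the logged window.
        print(f"{machine:22} cpu_busy={cpu[machine] / elapsed[machine]:.0%}")
```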
One further thought is that even in the absence of a significant increase in client reads, data distribution will be trying to move data from the storage servers that remain in order to re-replicate it. It’s supposed to run at a low-ish speed, but in the event that you are running near enough to the performance limit prior to removing the storage servers, it could push you over the top. You could potentially gauge the effects of data distribution by disabling it during this test. Probably the most surgical way to do that would be to use maintenance mode, which is discussed a bit here:
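For reference, maintenance mode can be driven from fdbcli; <zone-id> below is a placeholder for the zone of the machine you plan to take down, and 1800 is the window duration in seconds:

fdb> maintenance on <zone-id> 1800
fdb> maintenance off

Running maintenance with no arguments should report whether a maintenance window is currently active.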
You could also just disable data movement for all storage server failures by running the following in fdbcli:
fdb> datadistribution disable ssfailure
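If you go that route, remember to re-enable data movement for storage server failures once the test is over:

fdb> datadistribution enable ssfailure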
If data distribution is pushing you over the edge, then I think the immediate options you have available would be to:
Decrease the client workload
Increase the cluster size (and/or maybe increase replication)
Tweak some knobs to slow down data movement (which would result in slower healing; see the sketch below)
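Purely as an illustration of the knob mechanism (the knob name below is an assumption; knob names and defaults vary by FDB version, so verify against your release before applying anything), server knobs can be set in the [fdbserver] section of foundationdb.conf, e.g. to reduce the parallelism of key fetches during data movement:

```ini
[fdbserver]
  # Assumed knob name; check your FDB version's server knobs before using.
  knob_fetch_keys_parallelism_bytes = 2000000
```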
Thank You @ajbeamon
Just wanted to update that we are dealing with some other issues in the cluster, so we are waiting until those are resolved before we resume tests based on your suggestions.
We did another round of testing. This time, the node to be taken down was put in maintenance mode. From what I observed, it looks like data migration is what pushes us over the edge. Please let me know if I am missing something.
In short, the following was performed:
Put node in maintenance mode (11:43 AM, 30-minute window)
Stop EC2 node - Node 12 (11:47 AM)
Observe
Bring back EC2 node within maintenance window (12:04 PM)
Maintenance window lifted (12:13 PM)
Observations:
Ratekeeper slowed down transactions a bit when the node was stopped
The in-memory Tx log queue was filling up (hitting 1.5 GB and then spilling to disk)
Although we saw a reduction in transactions started, the client application was still able to keep up (comparing fdb_cluster_qos_transactions_per_second_limit with fdb_cluster_qos_released_transactions_per_second)
Lag started at ~12:04 PM, around the time the stopped EC2 node was started back up.
Similar to our last test, once data migration started we began seeing storage lag on a couple of storage servers. We do range reads, but they should be well distributed.
Once the maintenance window was lifted, we saw data migration kick in. My expectation was that it would only copy the delta (i.e., the data the node fell behind on while it was out). However, looking at the data-in-queue metrics, that did not seem to be the case. This in turn caused more application lag. The application lag came down quickly after the storage lag on the affected servers came down.
fdb_cluster_processes_roles_storage_query_queue_max spiked, but we did not see skew across storage servers.