Would love to get clarification/suggestions on the results of a storage node failure test we ran.
Cluster Details:
16 nodes: 3 i3.xl (nodes 1-3) + 13 i3.4xl (nodes 4-16)
The 3 i3.xl nodes run the Tx processes (2 per node)
The 13 i3.4xl nodes run 12 storage servers + 4 stateless processes each
FoundationDB - 6.2
The load on the cluster is ~210K RPS + ~150K WPS.
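In case it helps to picture the load shape, here is a minimal Python sketch of single-key transactional reads and writes through the client bindings. It is purely illustrative; the keys, values, and single-key access pattern are assumptions, not our actual workload.

```python
import fdb

fdb.api_version(620)   # 6.2 API version
db = fdb.open()        # default cluster file

@fdb.transactional
def write_one(tr, key, value):
    # One write of the kind counted in the ~150K WPS figure (illustrative only).
    tr[key] = value

@fdb.transactional
def read_one(tr, key):
    # One read of the kind counted in the ~210K RPS figure (illustrative only).
    val = tr[key]
    return bytes(val) if val.present() else None

write_one(db, b'demo-key', b'demo-value')   # placeholder key/value
print(read_one(db, b'demo-key'))
```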
Configuration:
Redundancy mode - double
Storage engine - ssd-2
Coordinators - 6 (first 6 nodes, 1 per node)
Desired Resolvers - 3
Desired Logs - 6
Cluster:
FoundationDB processes - 220
Zones - 16
Machines - 16
Memory availability - 7.3 GB per process on machine with least available
Fault Tolerance - 1 machine
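For reference, the configuration and fault-tolerance numbers above can also be read back from the status JSON special key. A minimal sketch using the Python bindings; the field paths follow the 6.2 status schema as best I understand it, so treat them as assumptions:

```python
import json
import fdb

fdb.api_version(620)
db = fdb.open()

# The same status document fdbcli shows is exposed at this special key.
status = json.loads(bytes(db[b'\xff\xff/status/json']))

cfg = status['cluster']['configuration']
print('redundancy_mode  :', cfg.get('redundancy_mode'))     # double
print('storage_engine   :', cfg.get('storage_engine'))      # ssd-2
print('desired logs     :', cfg.get('logs'))                # 6
print('desired resolvers:', cfg.get('resolvers'))           # 3
print('coordinators     :', cfg.get('coordinators_count'))  # 6

ft = status['cluster'].get('fault_tolerance', {})
print('zone failures tolerated (data):',
      ft.get('max_zone_failures_without_losing_data'))
```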
Test:
- Stop one of the storage nodes (12 storage servers + 4 stateless processes)
- Wait for ~15 minutes
- Start the node back
Rough timeline of events:
12:10p - Stopped EC2 node
12:25p - Started EC2 node back up
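The health/replication state described under "Overall status" below comes from `status`; the same signals can be polled from the status JSON while the node is down and rejoining. A rough sketch, with the same 6.2-schema assumptions as above:

```python
import json
import time
import fdb

fdb.api_version(620)
db = fdb.open()

# Poll health and data-movement progress; interval/iterations are arbitrary.
for _ in range(60):
    status = json.loads(bytes(db[b'\xff\xff/status/json']))
    data = status['cluster'].get('data', {})
    state = data.get('state', {})
    moving = data.get('moving_data', {})
    print(time.strftime('%H:%M:%S'),
          'state=%s healthy=%s' % (state.get('name'), state.get('healthy')),
          'in_flight=%s in_queue=%s' % (moving.get('in_flight_bytes'),
                                        moving.get('in_queue_bytes')))
    time.sleep(15)
```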
Overall status:
Cluster stayed UP, but reported UNHEALTHY (log_server_write_queue bottleneck).
Data replication status showed UNHEALTHY as expected.
The client application built up lag, which persisted for ~2 hours.
The Tx log queue sat at ~1.5 GB during this time, and the logs spilled to disk.
The lag built up for the first ~1.5 hours and then came down suddenly as the Tx log queue drained.
Is this expected? I was expecting it to go much more smoothly, i.e., I didn't expect the lag to build up and last for so long.
Both disk reads and writes went up.
Transactions started dropped, while transactions committed did not change.
I also noticed that the number of client connections dropped to almost half and came back to normal once the lag was (almost) fully recovered. Not sure why that happened.
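For completeness, the signals mentioned above (bottleneck reason, log queue bytes, storage lag, client connection count) also show up in the status JSON; a sketch of pulling them, with the field names again being my reading of the 6.2 schema:

```python
import json
import fdb

fdb.api_version(620)
db = fdb.open()

status = json.loads(bytes(db[b'\xff\xff/status/json']))
cluster = status['cluster']

qos = cluster.get('qos', {})
print('limited by        :', qos.get('performance_limited_by', {}).get('name'))
print('worst log queue   :', qos.get('worst_queue_bytes_log_server'))
print('worst SS queue    :', qos.get('worst_queue_bytes_storage_server'))
print('worst SS data lag :', qos.get('worst_data_lag_storage_server', {}).get('seconds'))

print('client connections:', cluster.get('clients', {}).get('count'))
```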
Any help connecting the dots here is much appreciated. Also, what can be done to make the recovery process go faster? More Tx processes?