I’ve been working through failure scenarios with our operations team and I wanted to describe a storage topology and various maintenance scenarios.
We run in three_data_hall mode. Our minimum footprint is 6 tlog-oriented boxes (2 tlog processes each, plus stateless/proxy/resolver roles) and 3 storage machines (8 SSDs each, with 4 storage processes per SSD based on our benchmarking and SSD-saturation observations, i.e. 32 storage processes per machine, plus a couple of stateless processes). This machine/process layout lets us survive the loss of an entire AZ/data hall plus any single machine in another AZ/hall.
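To make the layout concrete, here’s roughly the shape of per-machine foundationdb.conf we have in mind; hostnames, ports, paths, and locality values are illustrative, and the real config would carry one [fdbserver.<port>] section per process (32 storage sections on a storage box):

```
# --- storage machine in data hall "hall1" (sketch, not our exact config) ---
[fdbserver]
locality_data_hall = hall1

[fdbserver.4500]
class = storage
datadir = /data/ssd0/4500      # 4 storage processes per SSD x 8 SSDs = 32 sections like this

[fdbserver.4600]
class = stateless

# --- tlog-oriented box in the same hall (sketch) ---
# two [fdbserver.<port>] sections with class = transaction (tlog-preferred),
# plus stateless sections for the proxy/resolver-style roles
```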
We are looking at launching a public environment with 6 storage nodes, partly to make up for the write-throughput hit that continuous backup causes, and partly to relieve the storage-queue bottlenecks we see at high ingest rates (300k+ writes per second is typically achievable on 3 storage nodes in our benchmarks without backup running; I’m hoping to crack 500k+ with 6).
But my question is about operations rather than performance. We are examining the case in which a storage node fails altogether, taking down all 8 of its 3.8 TB SSDs.
In the case of 3 storage nodes, we just run with 2 storage nodes while performing maintenance (see later tlog concerns though).
In the case of 6 storage nodes (2 per AZ/data hall) we’d be left with 1 functioning storage node in the affected AZ if its comrade goes down.
After DATA_DISTRIBUTION_FAILURE_REACTION_TIME (60s by default) we expect data distribution to start re-replicating the shards that were on the down storage machine to the survivor in the affected AZ. But this isn’t always desirable for us and raises disk-space concerns.
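If we just wanted a longer grace period before re-replication kicks in, my understanding is that knobs can be overridden in foundationdb.conf (or via --knob_... on the fdbserver command line); the value below is purely illustrative, and I’d welcome correction if this isn’t the supported way to set it:

```
[fdbserver]
# illustrative only: wait 10 minutes instead of the default 60s before reacting to a failed storage server
knob_data_distribution_failure_reaction_time = 600
```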
The most likely cause of an entire storage machine failure is either an OS disk failure or a networking issue. We run the OS disks in RAID1, so a machine hit by that failure mode can be brought back with a fairly quick rebuild. Network issues could be transient but still last 60s+, unusual as that may be.
The ops team raises the concern: would it be unreasonable to run these storage machines above ~50% disk usage, given that a storage failure will trigger re-replication of the now n=2 data to the survivor in the affected AZ?
Now, you can see these machines are very storage dense. At 5-10% full, sure, it’s nice to be able to re-replicate to the survivor if recovery is taking a while for some reason. But at 50% full we’re talking about roughly 15 TB to re-replicate (half of a machine’s 8 × 3.8 TB = 30.4 TB of raw capacity), which would leave the survivor holding close to its full 30.4 TB. Filling the survivor’s disks aside, that’s just a long time and a lot of data to move around.
In this case it’s likely we want to leverage maintenance mode for zones. While we haven’t configured this in our test environments yet, I’m thinking we’d want to set zone == data hall == AZ for us. That way an operator who is paged about a storage machine failure could put the affected zone into maintenance and prevent redistribution of many TBs when we know it’s just going to be a short wait on an OS RAID rebuild or the like.
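As a concrete sketch of what that runbook step might look like in fdbcli (the zone id `hall1` and the duration are illustrative; my understanding is the zone id has to match the affected processes’ locality_zoneid):

```
fdb> maintenance on hall1 3600   # suppress data movement for failures in this zone for 1 hour
fdb> maintenance                 # check which zone (if any) is currently under maintenance
fdb> maintenance off             # clear it once the machine is back
```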
Does that sound right? Or are folks in the wild embracing the data redistribution? With storage nodes this dense (read: cost effective and performant) it seems a bit inefficient to let redistribution run unless the affected machine has been melted into a puddle and a replacement won’t be available, for $REASONS, for much longer than the time it takes to redistribute the data.
Finally, one more concern regarding tlogs and storage during a storage failure. I mentioned we have the separate tlog-oriented boxes. During a storage failure in a data hall, FDB may naturally attempt to recruit storage processes on those boxes. In our machine layout this isn’t really desirable: the tlog-oriented boxes have much less disk space. I worry that a storage failure followed by redistribution will push storage data onto both the tlog boxes in the affected AZ and the surviving storage machine (in the 2-storage-per-AZ scenario). If the tlogs are busy, this could also squeeze CPU between the tlog and storage processes on those machines once storage is recruited onto the tlog-intended disks, and it would hurt IO performance by mixing high-frequency fsyncs with storage access. And after the failed storage machine is recovered, I worry about the ‘unique’ state the cluster might be left in, mixing tlogs and storage until FDB is (maybe) able to roughly restore the pre-failure layout.
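For what it’s worth, the first thing we’d reach for is making process-class intent explicit and then watching where roles actually land during a failure; a sketch of the fdbcli side (the address is made up, and I’m not certain that class alone prevents storage recruitment on the tlog boxes, hence the questions below):

```
fdb> status details                        # shows the roles each process is currently running
fdb> setclass                              # with no arguments, lists every process's configured class
fdb> setclass 10.0.1.21:4500 transaction   # re-assert tlog intent on a process on a tlog box
```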
Is maintenance mode for a zone the answer to preventing storage recruitment on the tlog boxes during a storage-machine failure? And in architectures with specialized machines for tlog/storage hosting, have operators seen the cluster eventually return to the original desired role assignment after the affected storage processes come back?
Alright, that was quite a few thoughts, so I’ll leave it there for now. Thanks for your time,