Questions regarding maintenance for multiple storage-oriented machines in a data hall

Hey all,

I’ve been working through failure scenarios with our operations team and I wanted to describe a storage topology and various maintenance scenarios.

We run in three data hall mode with a minimum footprint of 6 tlog-oriented boxes (2 tlogs each, plus stateless/proxy/resolver roles) and 3 storage machines (each hosting 8 SSDs at 4 storage processes per SSD, based on our benchmarking and SSD saturation observations – 32 storage processes per machine – plus a couple of stateless processes). This machine/process layout lets us survive an AZ/hall failure plus any other machine in another AZ/hall nicely.

We are looking at launching a public environment with 6 storage nodes to help make up for the write throughput hit that continuous backups cause, and to generally relieve the storage queue bottlenecks we see at high ingest rates (300k+ writes per second on 3 storage nodes is typically achievable in our benchmarks without backup running – I’m hoping to crack 500k+ with 6).

But rather than performance, my question is about operations. We are examining the case in which a storage node fails altogether, taking down all 8 of the 3.8 TB SSDs.

In the case of 3 storage nodes, we just run with 2 storage nodes while performing maintenance (see later tlog concerns though).

In the case of 6 storage nodes (2 per AZ/data hall) we’d be left with 1 functioning storage node in the affected AZ if its comrade goes down.

Now, after DATA_DISTRIBUTION_FAILURE_REACTION_TIME (60s by default) we expect the surviving replicas affected by the down storage machine to begin replicating data to the survivor in the affected AZ. But this isn’t always desirable for us and raises disk-space concerns.

The most likely cause of an entire storage machine failure is either an OS disk failure or a networking issue. We run the OS on RAID1, so we can bring back a machine suffering from that failure mode with a pretty quick rebuild. Network issues could be transient – unusual, but not unheard of – and could easily last 60s+.

The ops team raises the concern: would it be unreasonable to run above ~50% disk usage on these storage machines, given that a storage failure will trigger replication of the now n=2 replica data to the survivor in the affected AZ?

Now, you can see these machines are very storage dense. When we’re 5-10% full, sure, it’s nice to be able to replicate data to the survivor if recovery is taking a while for some reason. But at 50% full we’re talking about ~15 TB of data to replicate onto a survivor whose 30.4 TB of raw capacity is already half used. Filling the survivor’s disks aside, that’s just a long time and a lot of data for redistribution.

In this case it’s likely we want to leverage the maintenance mode ability for zones. While we haven’t yet configured this in our test environments, I’m thinking we’d want to set zones == data halls == AZs for us. That way an operator who is paged about a storage machine failure could apply the maintenance state to the affected zone to prevent data redistribution (of many TBs) when we know it’s just going to be a short wait on an OS RAID rebuild or whatever.

Does that sound right? Are folks in the wild embracing the data redistribution? With these dense storage nodes (read: cost effective and performant) it seems a bit inefficient to embrace data redistribution unless the affected storage machine has been melted into a puddle and a replacement isn’t available for $REASONS for some amount of time much greater than the time to redistribute the data.

Finally, one more concern regarding tlogs and storage during a storage failure. I mentioned we have the separate tlog-oriented boxes. Naturally, during storage failures in a data hall, FDB may attempt to recruit storage processes on these boxes. In our FDB machine layout this isn’t really desirable – the tlog-oriented boxes have much less overall disk space. I worry that storage failures leading to data redistribution will wind up pushing storage data to both the tlog boxes in an affected AZ and the storage survivor (in the 2-storage-per-AZ scenario). If the tlogs are busy, this could also squeeze the CPU available to the tlog/storage processes sharing the tlog machines once storage is recruited on the tlog-intended disks, and lower IO performance with the mix of high-frequency fsyncs and storage access. After recovering the failed storage machine I worry about the ‘unique’ state a cluster might be left in, mixing tlogs and storage until (maybe) FDB is able to restore the state it had before the storage failure.

Is maintenance mode for a zone the answer to prevent the recruitment of storage on the tlog boxes in a storage-machine failure scenario? In architectures with specialized machines for tlog/storage hosting, have operators observed an eventual restoration of the original, desired process role assignment after restoring the affected storage processes?

Alright that was quite a few thoughts so I’ll leave it there for now. Thanks for your time,

Kyle

Ah yes, I think I recall now that the extra write hit might be happening in the tlog layer? So maybe the extra storage machines won’t help with the backup write hit vs. extra tlogs. We’re testing that ~soon regardless.

The extra write throughput (disregarding backup) is still appealing to us.

I’m going to summarize this into a few questions:

  1. How should I best handle storage servers running on very large, reliable disks?
  2. How can I prevent storage servers from being recruited on my transaction logs?
  3. What is the overhead of continuous backups and where is it paid?

Large and reliable disks

There are a few things to think about here, all of which are about how various pieces of the system will react to a failed storage server. One aspect is how data distribution will react, as you’ve discussed. The tradeoff to consider is the impact on transaction logs.

For moving data, I agree with your assessment that, given the strong belief that you can bring back a very large amount of data “lost” from a failed machine, beginning data redistribution after 60s isn’t the best for you. I think that in your case, changing the exact knob that you found, DATA_DISTRIBUTION_FAILURE_REACTION_TIME, to the 90th percentile of what a RAID rebuild takes could make sense. In a sense, rather than having a person or automation invoke maintenance mode in reaction to a host failure, changing the failure reaction time would automate that response, but inside of FDB. If you’re intentionally removing a machine, you should be excluding it first, and then whether or not its failure is noticed when it’s finally removed shouldn’t matter. This is assuming that in your specific situation, you can RAID-rebuild your way out of most failures.
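For reference, knobs like that can be set per process in foundationdb.conf (or passed directly to fdbserver). A rough sketch, with a made-up 30-minute value standing in for your RAID-rebuild p90:

  # /etc/foundationdb/foundationdb.conf, on every fdbserver
  [fdbserver]
  # wait 30 minutes (default 60s) before data distribution reacts to a failed storage server
  knob_data_distribution_failure_reaction_time = 1800

Knobs are only read at process startup, and this one matters on whichever process ends up running data distribution, so setting it uniformly and rolling a restart is the safe approach.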

However, there is a tradeoff. When you bring your dead machine back online, and the storage servers rejoin the cluster, they begin pulling all the mutations that they missed from the transaction logs so that they can catch up. This means that transaction logs are required to hold all mutations destined for an offline storage server until it comes back online. Data distribution trying to heal the storage server failure by copying the data to a new machine doesn’t only restore the desired replication factor, it also frees the transaction logs from having to keep the mutation log for the failed, and now unnecessary, storage server. Whatever you end up setting DATA_DISTRIBUTION_FAILURE_REACTION_TIME to, you also need to make sure that your TLogs can buffer the incoming writes for that amount of time plus the time it will take to do the full data copy between storage servers. We discussed this some in “Quick question on tlog disk space for large clusters”, and I’ll assume that math is still roughly valid.
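To put some hedged numbers on that (these are placeholders for illustration, not measurements from your cluster):

  # Assumed: ~100 MB/s of mutations destined for the failed machine's storage servers,
  # a reaction time raised to 1800s, and ~2h (7200s) for the eventual data copy.
  #   100 MB/s * (1800s + 7200s) = ~900 GB
  # i.e. roughly a terabyte of mutation backlog the TLogs would need to be able to spill.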

If you do have a permanent machine failure and add a new machine to the cluster to replace the now-forever-gone one, do be aware that disks will on average get more full for a bit before space usage starts to decline. Data distribution does not currently do a great job of deciding what teams of servers to build, and will do more data shuffling between servers than the minimum necessary to replace the failed machine. In the next major release, there will be a failed parameter that you can add to exclude, which has FDB drop the queued mutations in the transaction logs for an excluded machine that you know is never coming back, rather than holding them until data distribution finishes.
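For concreteness, a sketch of what that flow might look like once it lands (the address is made up, and the exact syntax could change before release):

  fdbcli> exclude failed 10.0.4.7
  fdbcli> exclude

Running exclude with no arguments just lists the current exclusions so you can confirm the machine was marked.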

And lastly, to cover some of your specific questions:

I think data redistribution as a reaction to a failure is embraced by most people because they fall into one or more of three categories:

  1. They run larger clusters, where the free space reserve that’s needed for a failure is less than 50%.
  2. They run on ephemeral or un-RAID’d disks, where there’s no strong belief that a machine that died will reappear after a minute with its data intact.
  3. They have smaller disks for tlogs, and thus can’t buffer as much data during storage server failures.

Your case sounds somewhat unique in that you’re planning for small clusters on very reliable disks, and thus there are fun quirks to examine about that setup as a result. :slight_smile:

Storage server recruitment

The code agrees with your fears. If you lose both storage servers in one datahall, but not the transaction log, then FDB will try to recruit storage servers on the transaction log processes. This will probably murder your cluster.

Specifically, the loss of a datahall will cause all teams to appear unhealthy. In the data distribution code, this will cause us to flag our request for storage servers as a critical recruitment. On the cluster controller side, when answering that request, we then consider any alive stateful process. There’s also… no knob to toggle to avoid this. Uniquely in this situation, a DDRecruitingEmergency trace event will be logged, which you could use as a paging event. Please file an issue if you’d like a knob added so that these storage recruitments aren’t treated as critical.
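If you want to page on that, something like the following against the trace logs should do it (assuming the default XML trace format and log directory):

  # the event shows up as Type="DDRecruitingEmergency" in the trace files
  grep -l 'Type="DDRecruitingEmergency"' /var/log/foundationdb/trace.*.xml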

It looks like fdbcli> datadistribution off will give you an emergency out in this situation while you prepare other mitigations, but note the warning below about continuous backup. The proper solution here is what you already mentioned: use fdbcli> maintenance. Maintenance mode will mark the failed zones as healthy, and keep them that way for a specified amount of time, but you’ll need to know the exact zoneID(s) that were affected. This should undo any effects of having recruited a storage server on a transaction log, as those storage servers will be viewed as undesirable once the failed storage teams become healthy again.
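As a sketch of the fdbcli flow (the zone id and duration are placeholders; each process’s zoneid is visible in its locality in status json):

  fdbcli> maintenance on <zoneid-of-the-failed-machine> 3600
  fdbcli> maintenance
  fdbcli> maintenance off

The bare maintenance command reports which zone, if any, is currently under maintenance, and maintenance off clears it early if the machine comes back sooner than expected.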

Continuous backup

The overhead of continuous backup is that it doubles the writes done to the cluster. Every key and value is written once as the kv pair, and once as a block of mutations into a subspace in \xff\x02 that holds the backup data waiting to be read by backup or DR agents. These \xff\x02 keys also get written to the storage servers responsible for shards in that range, so this is a tax paid by all TLogs and by some storage servers. If your backup falls behind, or your S3-like store is unavailable, then mutations will accumulate in the mutation stream subspace, which will show up as an increasing amount of space used by storage servers, and status will show your backup lag as increasing.
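If you want to keep an eye on that from the tooling side, a quick check (assuming the standard backup agents and the default cluster file path):

  # reports the current state and recent progress of the cluster's backups
  fdbbackup status -C /etc/foundationdb/fdb.cluster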

The mutations for backup are batched together, so it’s not quite a 2x overhead, but it’s a reasonable approximation. The commit version divided by 1,000,000 is “hashed” into 256 buckets to provide a somewhat even load distribution over the keyspace. As we’ve discovered in running backup for some time, backup creates a lot of data distribution churn, as it dumps writes into a bucket for 1s, and then some seconds later a backup agent clears the data after consuming it. This puts a decent amount of pressure on data distribution to do splits and merges, which normally is fine.

Now, an important aspect of three_datahall_mode that we haven’t discussed yet is that when one datahall fails (including its TLogs, in this case), other data movement (i.e. splits and merges) won’t be allowed for any current shard, because there are no longer any healthy teams to move data into. When combined with backup, this means that losing a datahall can mean that backup will start to hotspot a shard, and data distribution won’t be able to split it to save you. Because this shows up as a high storage queue, or as the storage server failing to keep up in applying versions, ratekeeper will throttle based on it. This has bitten us in the past when we disabled data distribution because it was causing other issues.

Also be aware that there are incoming changes to how backup is implemented that solve a good number of these issues. @jzhou is implementing a version of backup, to be included in the next release, that reuses the mechanisms by which mutations are streamed in multi-region setups, so that backup doesn’t pay a 2x write bandwidth tax. Backup agents in that model become FDB processes that tail the mutation stream directly from the transaction logs, which means less overhead all around.


That was a lot, and I might have missed something, so if there was a question you had that I missed above, please quote it, and I’ll get back to you on that also. :slight_smile:


Alex, thank you for the detailed and thoughtful reply. We’ll digest it and come back with any questions or comments.

I’ll clarify a little bit too, for posterity, that only the OS disks are in RAID1; all the storage disks are just JBOD.

We still haven’t made a final decision on bigger/smaller TLogs, so we’ll consider the implications you’ve outlined here.

We are open to smaller and more storage oriented machines as well so we’ll continue to ponder that, although the current setup has its advantages for us.

Thank you!

Quick question: we can set the zone ids to the data halls in order to use maintenance mode correct? That’s not verboten when running with data halls?

ZoneID and Data Hall are two different components of locality, with ZoneID being hierarchically nested within Data Hall. maintenance mode operates on ZoneIDs, so if you wish to put a data hall into maintenance mode, you’ll need to put each individual zoneID that’s in that data hall into maintenance mode.

If you set zoneid == datahall, then your three_datahall_mode cluster will break, because the TLog policy for three datahall mode is “In each of two different data halls, give me two different zoneids”. So you’ll need at least 2 zoneIDs per datahall. Be careful that you have enough TLog processes such that you can survive the loss of one datahall and one machine at the same time, and still be able to satisfy the TLog recruitment policy.
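To make that concrete, a sketch of the per-machine locality settings in foundationdb.conf (the names and values here are illustrative, and if I remember right zoneid already defaults to the machine id when left unset, which satisfies the policy without any extra work):

  # storage machine "a" in hall az1; repeat per machine with a distinct zoneid
  [fdbserver]
  locality_data_hall = az1
  locality_zoneid = az1-storage-a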

OK great, thanks Alex, yeah it was smelling fishy to me to use a hall for the zone id. We run four tlogs across two machines per AZ, in three AZs, so we can survive that failure mode and be available.

EDIT: Alex points out to me that because the zone ids map to machines, what I say above isn’t really the clean case I described, but is a bit messier, e.g. tlogs being recruited on storage-oriented machines in the estate.