Questions regarding maintenance for multiple storage-oriented machines in a data hall

I’m going to summarize this as a few questions:

  1. How should I best handle storage servers running on very large, reliable disks?
  2. How can I prevent storage servers from being recruited on my transaction logs?
  3. What is the overhead of continuous backups and where is it paid?

Large and reliable disks

There are a few things to think about here, all of which concern how various pieces of the system will react to a failed storage server. One aspect is how data distribution will react, as you’ve discussed. The tradeoff to consider is the impact on transaction logs.

For moving data, I agree with your assessment that, given the strong belief that you can bring back a very large amount of data “lost” from a failed machine, beginning data redistribution after 60s isn’t the best for you. I think that in your case, changing the exact knob that you found, DATA_DISTRIBUTION_FAILURE_REACTION_TIME, to the 90th percentile of what a RAID rebuild takes could make sense. In a sense, rather than having a person or automation invoke maintenance mode in reaction to a host failure, changing the failure reaction time automates that response, but inside of FDB. If you’re intentionally removing a machine, you should be excluding it first, and then whether or not its failure is noticed when it’s finally removed shouldn’t matter. This is all assuming that, in your specific situation, you can RAID-rebuild your way out of most failures.
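As a rough sketch of what that would look like (the 3600 is a placeholder; substitute your own 90th-percentile rebuild time in seconds, and the address below is hypothetical):

```
# Knobs can be passed on the fdbserver command line, or set in foundationdb.conf
# under [fdbserver] as: knob_data_distribution_failure_reaction_time = 3600
fdbserver --knob_data_distribution_failure_reaction_time=3600 ...  # plus your usual arguments

# For a planned removal, exclude the machine first so its departure isn't
# treated as an unnoticed failure:
fdbcli --exec 'exclude 10.0.4.1:4500'   # address is a placeholder
```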

However, there is a tradeoff. When you bring your dead machine back online, and the storage servers rejoin the cluster, they begin pulling all the mutations that they missed from the transaction logs so that they can catch up. This means that transaction logs are required to hold all mutations destined for an offline storage server until it comes back online. Data distribution healing the storage server failure by copying the data to a new machine doesn’t only restore the desired replication factor; it also frees the transaction logs from having to keep the mutation log for the failed, and now unnecessary, storage server. Whatever you end up setting DATA_DISTRIBUTION_FAILURE_REACTION_TIME to, you also need to make sure that your TLogs can buffer the incoming writes for that amount of time plus the time it will take to do the full data copy between storage servers. We discussed this some on “Quick question on tlog disk space for large clusters”, and I’ll assume that math is still roughly valid.
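To make the sizing concrete, here’s a back-of-the-envelope sketch with made-up numbers (your write rate, reaction time, and copy time will differ, and this conservatively assumes the offline storage server’s shards see the full write rate):

```
# Hypothetical figures, purely for illustration.
WRITE_RATE_MB_S=50      # sustained mutation rate hitting the tlogs
REACTION_TIME_S=3600    # your DATA_DISTRIBUTION_FAILURE_REACTION_TIME
COPY_TIME_S=7200        # time for the full data copy between storage servers
echo "$(( WRITE_RATE_MB_S * (REACTION_TIME_S + COPY_TIME_S) / 1024 )) GB of tlog headroom needed"
# => 527 GB with these numbers
```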

If you do have a permanent machine failure and add a new machine to the cluster to replace the one that’s gone forever, do be aware that disks will, on average, get more full for a while before space usage starts to decline. Data distribution does not currently do a great job of deciding what teams of servers to build, and will shuffle more data between servers than the minimum necessary to replace the failed machine. In the next major release, there will be a failed parameter that you can add to exclude, which tells FDB to drop the mutations queued in the transaction logs for an excluded machine that you know is never coming back, rather than holding them until data distribution finishes.
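Once that release lands, the invocation will presumably look something like this (the exact syntax is my assumption based on the parameter described above; the address is a placeholder):

```
# Hypothetical usage of the upcoming `failed` parameter: tells FDB the machine
# is permanently gone, so tlogs can drop its queued mutations instead of
# buffering them until data distribution finishes healing.
fdbcli --exec 'exclude failed 10.0.4.1:4500'
```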

And lastly, to cover some of your specific questions:

I think data redistribution as a reaction to a failure is embraced by most people because they fall into one or more of three categories:

  1. They run larger clusters, where the free space reserve that’s needed for a failure is less than 50%.
  2. They run on ephemeral or un-RAID’d disks, where there’s no strong belief that a machine that died will reappear after a minute with its data intact.
  3. They have smaller disks for tlogs, and thus can’t buffer as much data during storage server failures.

Your case sounds somewhat unique in that you’re planning for small clusters on very reliable disks, and as a result there are fun quirks to examine about that setup. :slight_smile:

Storage server recruitment

The code agrees with your fears. If you lose both storage servers in one datahall, but not the transaction log, then FDB will try to recruit storage servers on the transaction log processes. This will probably murder your cluster.

Specifically, the loss of a datahall will cause all teams to appear unhealthy. In the data distribution code, this will cause us to flag our request for storage servers as a critical recruitment. On the cluster controller side answering that request, we then consider any alive stateful process. There’s also… no knob to toggle to avoid this. Uniquely in this situation, a DDRecruitingEmergency trace event will be logged, which you could use as a paging event. Please file an issue if you’d like a knob added so that these storage recruitments aren’t treated as critical.
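If you want to page on that event, something along these lines should do it (the log directory is the default packaged path; adjust for your deployment, and for JSON trace format grep the file accordingly):

```
# fdbserver writes trace events as XML files in its log directory by default.
grep -l 'Type="DDRecruitingEmergency"' /var/log/foundationdb/trace.*.xml
```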

It looks like fdbcli> datadistribution off will give you an emergency out in this situation while you prepare other mitigations, but note the warning below about continuous backup. The proper solution here is what you already mentioned: use fdbcli> maintenance. Maintenance mode will mark the failed zones as healthy, and keep them that way for a specified amount of time, but you’ll need to know the exact zoneID(s) that were affected. This should undo any effects of having recruited a storage server on a transaction log, as those storage servers will be viewed as undesirable once the failed storage teams become healthy again.
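As a concrete sketch (the zone ID and the one-hour duration are placeholders; your processes’ zone IDs show up in the locality fields of status json):

```
# Mark the affected zone healthy for a bounded window; repeat per affected zone.
fdbcli --exec 'maintenance on my-failed-zone-id 3600'
fdbcli --exec 'maintenance'        # check which zones are currently under maintenance
fdbcli --exec 'maintenance off'    # clear it once the datahall is back

# Emergency hammer, with the continuous-backup caveat discussed below:
fdbcli --exec 'datadistribution off'
```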

Continuous backup

The overhead of continuous backup is that it doubles the writes done to the cluster. Every key and value is written once as the kv pair, and once as a block of mutations into a subspace in \xff\x02 that holds the backup data waiting to be read by backup or DR agents. These \xff\x02 keys also get written to the storage servers responsible for shards in that range, so this is a tax paid by all TLogs, and some storage servers. If your backup falls behind, or your S3-like store is unavailable, then mutations will accumulate in the mutation stream subspace, which will show up as an increasing amount of space used by storage servers and status will show your backup lag as increasing.
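If you want to keep an eye on that from the operations side, something like the following should work (command forms and paths are assumptions about a fairly standard deployment; “default” is the backup tag name if you didn’t choose one):

```
# Backup-side view of progress and lag for a given tag.
fdbbackup status -t default -C /etc/foundationdb/fdb.cluster

# Cluster-side view: growing space usage on storage servers shows up here.
fdbcli --exec 'status details'
```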

The mutations for backup are batched together, so it’s not quite a 2x overhead, but it’s a reasonable approximation. The commit version/1000000 is “hashed” into 256 buckets to provide a somewhat even load distribution over the keyspace. As we’ve discovered in running backup for some time, backup creates a lot of data distribution churn, as it dumps writes into a bucket for 1s, and then some seconds later a backup agent clears the data after consuming it. This puts a decent amount of pressure on data distribution to do splits and merges, which normally is fine.
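Purely as an illustration of the bucketing (I’m assuming a simple modulo here, which may not be the exact hash used):

```
# Commit versions advance at roughly 1,000,000 per second, so each ~1s slice of
# mutations lands in one of 256 buckets.
COMMIT_VERSION=73786976294838   # hypothetical commit version
echo $(( (COMMIT_VERSION / 1000000) % 256 ))
```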

Now, an important aspect of three_datahall_mode that we haven’t discussed yet is that when one datahall fails (including its TLogs in this case), other data movement (i.e. splits and merges) won’t be allowed for any current shard, because there are no longer any healthy teams to move data into. When combined with backup, this means that losing a datahall can mean that backup will start to hotspot a shard, and data distribution won’t be able to split it to save you. Because this shows up as a high storage queue, or as the storage server failing to keep up in applying versions, ratekeeper will throttle based on it. This has bitten us in the past when we disabled data distribution because it was causing other issues.

Also be aware that there are incoming changes to how backup is implemented that solve a good number of these issues. @jzhou is implementing a version of backup, to be included in the next release, that reuses the mechanism by which mutations are streamed in multi-region setups, so that backup no longer has a 2x write bandwidth tax. Backup agents in that model become FDB processes that tail the mutation stream directly from the transaction logs, which means less overhead all around.


That was a lot, and I might have missed something, so if there was a question you had that I missed above, please quote it, and I’ll get back to you on that also. :slight_smile: