We are planning to build a 7.1.x cluster with a replication factor (RF) of 2 on AWS. If we were to lose two storage process nodes (ephemeral storage), what would be the impact on cluster health and operability? While I expect there would be some data loss, would the cluster remain operational with the remaining available data? cc: @amehta @samitsawant
Would really appreciate any feedback/recommendations here.
There is some chance of losing data, i.e., when both replicas of a key range happen to be on the failed process nodes. If some critical data, e.g., \xff\serverList, is lost, then the database will become unavailable and hard to recover. If the lost data is not critical, e.g., user data, then the database can still function and status will report shard loss.
It’s recommended to run the triple redundancy configuration if you expect to lose 2 storage nodes.
Thank you @jzhou for the reply. We will most likely end up with RF=3. However, in the worst case, if we lose 3 nodes and critical data like \xff\serverList ends up on those servers, how can we recover? Is recovery even possible?
If \xff\serverList is lost, we don’t have a tool to recover from that. Your data is still on the available storage servers, and the best chance is to use fdbserver -r kvfiledump to dump the data from the remaining storage servers.
I haven’t used fdbserver -r kvfiledump before. Does this mean that we’ll need to develop applications to read the data dump and restore it to a new cluster, or are there alternative mechanisms for recovering data from the available dump?
The -r kvfiledump option was contributed by the community. So yes, you would need to develop an application to restore the data to a new cluster. Alternatively, you could consider the backup and restore solutions, which are already documented. Snowflake uses a snapshot backup that takes advantage of AWS’s disk snapshot capability, which you could also ask around about.
Some more info about data loss probability - FDB stores data on groups of ReplicationFactor storage servers, called “teams.” Teams are created such that each storage server is a member of many teams, but an arbitrary set of ReplicationFactor storage servers most likely does not constitute a team. The number of teams that exist is a very low percentage of the number of possible teams. This is why with RF=2, if you lose 2 storage servers, you have a possibility of data loss but it’s not certain.
Thank you, @SteavedHams, for providing additional information. I’m not entirely sure I grasp this concept fully. Where can I read more about this? How will RF=2 vs RF=3 affect this?
I’m not sure if there’s a comment block or any docs about team selection, but the concept can be explained as:
If you have N storage hosts, there are (N choose RF) possible groups of hosts that could be represented in a storage team of size RF. For N=30 and RF=3, N choose RF would be 4,060, but FDB will only choose a small percentage of those possibilities as storage teams. If you lose 3 hosts that comprise one of those teams, you would lose data, but if you lose 3 random hosts, you have a low chance of there being a team that includes all of them.
If I recall correctly the team usage percentage is something like 5% for large clusters with RF=3, but I don’t remember what it would be for RF=2.
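To make the concept above concrete, here is a small sketch that builds a simplified random set of teams and estimates the chance that losing RF random hosts destroys a team. It is not FDB’s actual team builder (which also considers fault domains and load balance); the teams_per_server value is an assumption loosely inspired by FDB’s DESIRED_TEAMS_PER_SERVER knob, chosen only for illustration.

```python
import math
import random

def data_loss_probability(n_hosts, rf, teams_per_server=5, trials=100_000, seed=1):
    """Monte Carlo estimate of the chance that losing `rf` random hosts
    wipes out at least one storage team.

    NOTE: this is a toy model. Real FDB team selection is constrained by
    fault domains and balance; teams_per_server=5 is an assumption here.
    """
    rng = random.Random(seed)
    hosts = range(n_hosts)

    # Build a simplified team set: random rf-sized groups of hosts,
    # sized so each host belongs to roughly teams_per_server teams.
    n_teams = n_hosts * teams_per_server // rf
    teams = set()
    while len(teams) < n_teams:
        teams.add(frozenset(rng.sample(hosts, rf)))

    # Probability that rf simultaneously failed hosts happen to form a team.
    losses = sum(frozenset(rng.sample(hosts, rf)) in teams
                 for _ in range(trials))
    return len(teams), math.comb(n_hosts, rf), losses / trials

teams, possible, p = data_loss_probability(30, 3)
print(f"{teams} teams out of {possible} possible; "
      f"estimated loss probability per 3-host failure ~ {p:.3%}")
```

With N=30 and RF=3 there are 4,060 possible teams; with only ~50 actual teams, losing 3 random hosts destroys a team only about 1% of the time in this toy model, which is the intuition behind the low-but-nonzero data loss probability described above.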