How to Speed up Bulk Loading for the FDB Cluster with Multiple DCs and Triple Data Copy?

(Hieu Nguyen) #1

Our FDB cluster configuration has two regions and three datacenters. Region 1 (West Coast) contains DC1 and DC2, and Region 2 (East Coast) contains DC3. DC1 is the primary DC, DC2 is the satellite DC, and DC3 is the standby DC.

In our multi-DC FDB cluster, bulk loading populates the initial data before the database starts serving traffic, and we are currently trying to improve our bulk loading tool. Following the strategy described in the “Migrating a database to use a region configuration” section of the FDB architecture document, we only need the West Coast region set up during bulk loading; afterwards we can use the FDB data synchronization protocol to make a full copy to DC3 in the East Coast region.

Question 1: Can we use only the primary DC1 for bulk loading, so that we avoid synchronizing log-store data to the satellite DC2? During this initial data load, we do not need to worry about having two log stores. How can we configure a setup with only the primary DC?

Question 2: To load the data into the primary DC, can we start with a configuration that keeps only 2 data copies (replication factor = 2) and then, once the data loading is finished, increase the replication factor to 3, forcing the FDB cluster to rebalance data to meet the replication requirement? Would this work?

(Meng Xu) #2

It is hard to say. When the replication factor changes, a new set of 3-storage-server teams is created (each team holds the same range of data). Although data distribution tries its best to keep data where it is in order to minimize the amount of data moved, the data size on each team also affects the movement. In the worst case, data distribution reshuffles the data across servers, which uses more disk space during the movement and causes a lot of disk cleanup (such as file truncation). So it is hard to predict how quickly the data movement will finish.

If the input data can be evenly distributed as it is loaded, it is probably faster to just run with triple replication from the beginning.
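
For reference, the two alternatives being compared can be sketched as the `fdbcli` commands each would issue. This is only an illustrative sketch: `configure double` and `configure triple` are the standard redundancy modes, but the helper names, the strategy labels, and the idea of driving `fdbcli` from a script are assumptions, not part of the original question. The sketch only builds the commands and does not run them.

```python
import shlex

def configure_cmd(mode, cluster_file=None):
    """Build (but do not run) an fdbcli command that sets the redundancy mode."""
    cmd = ["fdbcli"]
    if cluster_file:
        cmd += ["-C", cluster_file]  # optional path to the cluster file
    cmd += ["--exec", f"configure {mode}"]
    return cmd

def plan(strategy):
    """Return (when, command) steps for a hypothetical loading strategy."""
    if strategy == "triple_from_start":
        return [("before load", configure_cmd("triple"))]
    if strategy == "double_then_triple":
        return [
            ("before load", configure_cmd("double")),
            # This step triggers team rebuilding and data movement.
            ("after load", configure_cmd("triple")),
        ]
    raise ValueError(f"unknown strategy: {strategy}")

if __name__ == "__main__":
    for when, cmd in plan("double_then_triple"):
        print(f"{when}: {shlex.join(cmd)}")
```

The second strategy writes less during the load, but the final `configure triple` step is where the unpredictable data movement described above would occur.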

(Alex Miller) #3

Yes, you could bulk load what will be Primary DC1 as a non-region single DC cluster, and then convert it into a multi-region setup. You could also just write a regions configuration that is one region, and only one DC, which is Primary DC1. The two approaches will be roughly equivalent.
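
A minimal sketch of what the one-region, one-DC file and the later full file might look like, generated from Python so the two stages are easy to compare. The datacenter ids (`dc1`, `dc2`, `dc3`), the priority values, the `one_satellite_double` satellite mode, and the function names are all placeholder assumptions; the ids must match whatever datacenter id the processes were actually started with.

```python
import json

def single_dc_regions(primary_id="dc1"):
    """Regions config with one region that contains only the primary DC."""
    return {"regions": [{"datacenters": [{"id": primary_id, "priority": 1}]}]}

def full_regions(primary_id="dc1", satellite_id="dc2", remote_id="dc3",
                 remote_priority=-1):
    """Two-region config: primary plus satellite, and a remote DC.

    A negative remote priority keeps the remote region from becoming
    primary while it is still receiving its first full copy.
    """
    return {"regions": [
        {
            "datacenters": [
                {"id": primary_id, "priority": 1},
                {"id": satellite_id, "priority": 1, "satellite": 1},
            ],
            "satellite_redundancy_mode": "one_satellite_double",
        },
        {"datacenters": [{"id": remote_id, "priority": remote_priority}]},
    ]}

if __name__ == "__main__":
    # Stage 1: file to use while bulk loading into the primary DC only.
    print(json.dumps(single_dc_regions(), indent=2))
    # Stage 2: file to apply once the load has finished.
    print(json.dumps(full_regions(), indent=2))
```

Each JSON document would be saved to a file and applied with `fileconfigure <file>` in `fdbcli`; per the migration section of the documentation, `configure usable_regions=2` is what then starts the copy to the remote region.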

Meng answered most of this, so I won’t repeat it, but there is a related trick that you can use to save yourself some WAN bandwidth as well. Data distribution isn’t smart enough yet to copy the database across only once when replicating a triple-replicated primary cluster into a triple-replicated secondary cluster. If you add the secondary DC as a single-replicated DC and then, once a full copy of the database is in the secondary, convert it to triple replication, you’ll save yourself the WAN cost of the other two copies.
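
To make the saving concrete, here is a back-of-the-envelope calculation. The 10 TB database size is a made-up figure and the function name is hypothetical; the model simply assumes that every replica created in the secondary while it is being populated from the primary crosses the WAN once.

```python
def wan_traffic_tb(db_size_tb, remote_replicas_during_copy):
    """Data sent over the WAN if each remote replica is copied from the primary."""
    return db_size_tb * remote_replicas_during_copy

db_size_tb = 10.0  # hypothetical logical database size

direct = wan_traffic_tb(db_size_tb, 3)  # secondary is triple-replicated from the start
staged = wan_traffic_tb(db_size_tb, 1)  # secondary starts single-replicated
saved = direct - staged                 # the "other two copies" are made locally

if __name__ == "__main__":
    print(f"direct: {direct} TB, staged: {staged} TB, saved: {saved} TB")
```

Under these assumptions the staged approach moves one third of the data over the WAN; the remaining two replicas are created inside the secondary DC after the conversion to triple replication.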

(Hieu Nguyen) #4

Thanks, @mengxu and @alexmiller. I will try these tricks to see whether they can speed up our bulk loading.