Data location constraints


While FoundationDB meets most of our requirements, one challenge we are having is figuring out how to “pin” data to a particular geo region and make sure it is not replicated outside of its home region. Our testing shows that having the data colocated in the same region as the clients delivers the fastest query performance. We already have this working with regional SQL databases replicated across multiple data centers within the region. We tried a variety of FoundationDB replication options (datacenter, data_hall, etc.) but we are still seeing the data being replicated globally. This leads to worse performance than what we have in place today. We were hoping to turn all these siloed databases into a global cluster and enable the data to flow freely without having to do lots of ETL.

Has anyone else tried to make this work?

Any guidance will be greatly appreciated.

Thanks in advance,
Aurelian ‘AD’ Dumitru
VP Engineering

I previously asked a similar question regarding data protection and this doesn’t appear to be on the roadmap. You would have to build this yourself using FoundationDB as a building block.

FoundationDB does not offer locale-aware replication in the sense you are using the term. The closest thing it has is something like the three_datacenter mode, which makes sure that everything is stored in three different data centers. That solves the problem of “I want to be able to survive a datacenter going down and not lose availability or durability”. (It is locale aware in that it keeps the locale in mind when choosing where to place your data.) It doesn’t have something like “store these data (about my eastern US customers) in this (eastern US) datacenter and these data (about my western US customers) in this (western US) datacenter”.

As @ryanworl alludes to, if you wanted to separate your data like that, you would still have to do something like run one FDB cluster per region or per DC. So you could use FDB clusters as the building block, but it doesn’t work like that out of the box.
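To make the “one cluster per region” idea concrete, here is a rough sketch of what the routing layer in the application might look like. Everything in it is made up for illustration: the region names, the cluster file paths, and the helper names `cluster_file_for` and `open_regional_db` are assumptions, not anything FDB provides. The `opener` argument stands in for whatever opens a database from a cluster file (with the Python bindings that would presumably be `fdb.open`).

```python
# Hypothetical mapping from a customer's home region to the cluster file of
# that region's own FoundationDB cluster. Names and paths are illustrative.
REGION_CLUSTER_FILES = {
    "us-east": "/etc/foundationdb/us-east.cluster",
    "us-west": "/etc/foundationdb/us-west.cluster",
}


def cluster_file_for(region):
    """Return the cluster file for a region, or fail loudly if none exists."""
    try:
        return REGION_CLUSTER_FILES[region]
    except KeyError:
        raise ValueError(f"no FoundationDB cluster configured for region {region!r}")


def open_regional_db(region, opener):
    """Open the database that holds this region's data.

    `opener` is any callable that takes a cluster file path and returns a
    database handle, so data for a region only ever goes to that region's
    cluster.
    """
    return opener(cluster_file_for(region))
```

Because each region is its own cluster, nothing is replicated across regions, but anything that spans regions (cross-region queries, moving a customer) is then the application's job.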

That being said, FoundationDB does allow you to specify locality information about your clients and servers, which enables things like “DC-aware” queries. In particular, you need to set the --locality_datacenter_id and (for good measure) --locality_machine_id command line arguments to fdbserver. If you also call set_datacenter_id and set_machine_id on all of your clients, then any client that sees that the data it wants is available from its own datacenter (or its own machine, if you sometimes run clients and servers on the same machine) will read from that copy. This doesn’t help with commit latencies (or with get-read-version latencies…), but it can be useful for making reads significantly faster in multi-DC configurations.
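On the client side, that could look roughly like the following with the Python bindings. This is only a sketch: `apply_client_locality` is a hypothetical helper, the ID values in the usage comment are placeholders, and the IDs you pass have to match whatever was given to the fdbserver processes in the same datacenter or on the same machine.

```python
def apply_client_locality(db, datacenter_id, machine_id=None):
    """Tell an already-open database handle where this client is running.

    `db` is expected to expose `options.set_datacenter_id` and
    `options.set_machine_id`, as the database object in the FoundationDB
    Python bindings does. `machine_id` is optional because it only matters
    when clients and servers share a machine.
    """
    db.options.set_datacenter_id(datacenter_id)
    if machine_id is not None:
        db.options.set_machine_id(machine_id)
    return db


# Usage with the real bindings (not executed here; the API version and the
# ID values are assumptions for illustration):
#
#   import fdb
#   fdb.api_version(600)
#   db = apply_client_locality(fdb.open(), "dc1", machine_id="host42")
```

This only affects which replica a client prefers to read from. It does not change where the data is stored.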

So, if your use case were instead, “I have one global DB that I want to be able to query from multiple DCs with only single-DC latencies,” then FDB could probably do it, though you’d still be putting the data everywhere.

I will also say that a lot of our multi-DC story gets better in 6.0 (with more work to make more things DC-local, so most things are LAN latencies rather than WAN latencies). There is still some work to this effect currently on the 6.1 milestone:

Thanks Alec! Much appreciated!

We’ll be more than happy to take those new features for a test drive. Please keep us posted.

Aurelian ‘AD’ Dumitru
VP Engineering