I’m investigating FDB as a database engine for apps that will have a relatively small number of users and relatively small datasets, but which have the following characteristics:
Users may be spread around the world yet,
They need low read latencies
Servers will run on relatively cheap clouds or dedicated machines
I’m wondering what the best configuration is because despite reading the docs a few times and watching the multi-region tech talk by Evan in 2018, I find I’m still not entirely sure. All I want is (say) 3 replicas of the dataset, separated by a WAN, in which the loss of one isn’t fatal, and in which if the database grows an additional machine can be added to each replica to add storage capacity.
Multi-region sounds good, but, only two regions are supported. But it may be that some projects need at least three (Europe, USA, Asia) and quite possibly more, to ensure app servers are close to users. So it seems this is out.
three_data_hall mode doesn’t seem right, because then you need at least 4 machines to make progress. But I want only 3 to make progress, with 1 lost (they’ll probably be VMs so I expect “loss” to be a rare and user-driven event).
None of the single datacenter modes seem right. E.g. triple replication is good but you need 5 machines to make progress.
three_datacenter mode therefore seems the best, at least for apps that want low latencies in Asia as well as Europe/USA, or want to split the US into two coastal regions. But it says data is replicated six times, which seems wasteful. Why isn’t it replicated three times? And what if I wanted four datacenters e.g. east-us/west-us/europe/taiwan but am OK with the same level of fault tolerance as 3, i.e. the 4th doesn’t take part in leader election so having an even number isn’t fatal?
Maybe FDB isn’t really designed/optimised for many replicas of a small dataset?
You could use the Disaster Recovery solution to asynchronously replicate one FDB database into multiple other clusters, but that would leave you with potential data loss if your main geographic site disappeared completely.
Currently, multi-region essentially gives you a seamless failover between a hot primary cluster and a warm secondary cluster, not a full geo-replication solution. I think what you’re looking we’ve discussed as part of extending multi-region to >2 regions. This has been on the roadmap, but as of now in 2020, it isn’t implemented.
I’d recommend looking into CockroachDB, YugaByte, or TiDB, as alternatives that currently geo-replicate better. (Cassandra or CouchDB if you’re looking for your regions to operate independently and sync data later.)
If I understand the low latency requirements correctly, @mikehearn wants to have several clusters with active-active multimaster replication, and the user application should select the nearest cluster for connection.
Hmm, that’s a pity, as FDB is exactly the right fit for these apps otherwise (I have a library similar to the records layer I want to use). And honestly I feel more secure with Apple funding the DB than some VC-d to the hilt startup that’s on Series D funding already.
Multi-master isn’t a requirement: it’s OK if commit latencies are somewhat high as long as reads are fast, because when mapping data models to key/value spaces, you end up blocking on reads all the time, whereas writes commit in a batch at the end anyway. And I guess it’s OK for apps to be using cached read versions and other tricks to avoid WAN roundtrips for every single read-only tx, if the client knows it hasn’t done any writes recently (and e.g. the user hasn’t pushed a refresh button). IIRC Facebook used to use a technique like this, where a user would be routed to a lagging slave datacenter until they did a write transaction, at that point a cookie would be used to transparently route them back to the master DC for a few minutes, where “a few” was set to be longer than the async replication delay. I suppose that might also work. However then you lose the fast reads for however long the DR cluster lags by, which isn’t ideal, and of course you’d ideally have an automated mechanism for detecting when the replica has caught up (not entirely sure how beyond incrementing a global atomic counter with every write … maybe using fdb’s own version numbers)?
Perhaps multi-region can be combined with DR? So you have two regions for robustness purposes, and then DR for getting data closer to users. If a user is hitting a lagged replica and does a write, that tx and all subsequent are sent to the master region until the lagged replica is caught up.
honestly I feel more secure with Apple funding the DB than some VC-d to the hilt startup that’s on Series D funding already.
Cassandra and CouchDB don’t quite fit that category…being Apache projects. CouchDB is specifically a strategic focus for IBM. In fact, CouchDB 4.0 is being developed on FDB.
This is a possibility. I’m hesitant to recommend it because I don’t think I’m aware of anyone else running with >1 DRs, and only O(days) of running DR and multi-region concurrently. It’s in the “it should work” category, but it’s possible that there’s lurking bugs.
But even if it does work in terms of running stably, it’s still not going to be great in terms of writing an application against it. From the FDB client perspective, this would look like having two different clusters. You would be able to create read-only transactions against your local (DR) cluster, and create read-write onces against the global cluster (the reads will also have high latency, and you’ll be bound to the 5s limit still), and your application would need to be able to decide which one to do.
Closer on the multi-region roadmap was getting a read version across the WAN, doing all reads locally, and then committing the transaction across the WAN, which sounds like more of what you’re looking for. But again, not yet implemented.
It kind of depends… The main problem I see is that operationalizing something like this would be quite painful… Also the high latencies could be a huge problem: if a replica ever falls behind by more than 5 seconds, you will not be able to read from it. There are other weird issues that you would need to address - for example currently past_version exceptions are handled by the client under the assumptions that the storage is falling behind which wouldn’t necessarily be true anymore. This could lead to some surprising behavior.
An alternative would be the following:
Get a read version from the replica.
Commit to the global cluster.
That way you would only pay for the commit - everything else would be low latency (and read-only transactions would be as fast as running locally). However, this would come at a significant cost:
You lose strict serializability across regions (this one might be ok - your transactions will still be serializable, but you might occasionally read stale data. Sometimes you might not even see your own writes, which might be a bigger issue).
Depending on your workload, the conflict rate could become unacceptably high - to the point where your application stops making progress. Especially if you do not read your own writes, the sharding won’t help you to mitigate this.
We have a project in the pipeline that might be of better use for you which is basically a multi-cluster configuration. The basic idea is to split the global key space into ranges and each FDB cluster will own a set of ranges. That way you could do something like this:
Shard your data by users.
Each FDB cluster will be responsible for all users in its region.
Have DR (or even two-datacenter mode) to get a hot standby in a second region - but just for fail over. Depending on your HA needs, just having a good backup story could be a cheaper alternative.
If you have some global ranges, try to keep access to them minimal and just take the latency-hit.
Obviously without more knowledge about your workload, it’s hard to tell whether this would be an acceptable solution. Also it will take more time to get this feature implemented and working (though we always look for external contributors if you want to speed stuff up ).
If you have DR’s, then you don’t need to have multi-region. You can obtain some level of robustness with DR.
Seems several DRs satisfy you needs. But there are some concerns
There are several cluster files for connection to different DRs or to the master. You application would be responsible for choosing the necessary cluster files depending on the latency and the necessarity of writing.
By default DR clusters are forbidden for queries against them. You application need to enable READ_LOCK_AWARE transaction option for overriding this.
The data in DR clusters may be not consistent (when some mutations from the same transaction have already been applied to the DR, but some haven’t yet).
When you application makes writing to the master and then reads data from the nearest DR, it may see the old data that were before writting.