When to use multiple FDB clusters

All the talks I see from large corporations using FoundationDB (Apple, Datadog, etc.) mention that they run multiple FDB clusters.

FDB being a distributed database, a newcomer like me would expect it to just keep growing as you add nodes. So at what point do you hit the scalability limit with FDB and have to start thinking about multiple FDB clusters?

Here is some information I can think of, but I believe there might be additional reasons.

From a pure software point of view, FoundationDB has limitations in terms of scaling (doc). The most limiting factor is the 100 TB data size: you could go bigger than that, but FDB's simulation testing hasn't covered larger sizes. In addition, components like proxies, data distributors, and storage servers can scale, but again they have only been tested within those limits.

Moreover, because of FoundationDB's limited authorisation system, if the cluster's credentials leak, any software holding them has access to the whole database and thus to all the data: FDB doesn't intend to provide AuthZ/AuthN features. The closest thing to authorisation is the tenancy support added in 7.3. It is already used in production by some companies, but multi-tenancy may require another cluster (called a metacluster) that holds metadata in order to work with specific architectures. For that reason, you'd also want your clusters to be in isolated networks.
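Tenants can be created and used from fdbcli; a hedged sketch (exact command names differ between 7.1-era and newer fdbcli builds, so check `help` in your version):

```shell
# Create an isolated keyspace for one tenant (7.1-style command name)
fdbcli --exec "createtenant acme"
fdbcli --exec "listtenants"
# In an interactive fdbcli session, `usetenant acme` scopes subsequent
# reads and writes to that tenant's keyspace.
```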

For those two points, you can check this thread. It's somewhat out of date, but it can give you some more hints about the security and scalability limits of clusters.

Then, for clusters holding personal information, GDPR and similar laws restrict where you can store data related to a person. Those laws differ between regions (e.g. US vs. EU), and as FoundationDB replicates data by design, you might not want EU data in a US region (and vice versa).
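One blunt way to enforce residency at the application layer is to select a cluster file per region before opening a connection; a minimal sketch (the region names and paths are hypothetical):

```python
# Hypothetical routing table: one FDB cluster file per data-residency region,
# so EU personal data is only ever written to the EU cluster.
CLUSTER_FILES = {
    "eu": "/etc/foundationdb/eu.cluster",
    "us": "/etc/foundationdb/us.cluster",
}

def cluster_file_for(user_region: str) -> str:
    """Pick the cluster file for a user's region; fail closed on unknown regions."""
    try:
        return CLUSTER_FILES[user_region]
    except KeyError:
        raise ValueError(f"no cluster configured for region {user_region!r}")
```

An application would then call `fdb.open(cluster_file_for(region))` instead of using a single default cluster file.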

Finally, there are cases where you don't want the same replication factor everywhere; it depends on the criticality of the data. On top of that, for network-latency reasons, you'd rather have a cluster close to the customer than a single global one for the whole region.

100 TB and 500 cores seem tiny. Can anyone share anecdotes of production usage at scale?

I think these limitations relate to versions prior to 7.x. I'm not sure whether they are still relevant :thinking:
Our biggest cluster, designed to hold time-series data, has:

  • 110k w/s 24/7
  • from 5k to 9M r/s
  • 40 TB of key-values
  • 73 processes spread across 43 virtual machines.

Prior to Apple acquiring FoundationDB, we ran 3.0.7 with ~40 large bare-metal machines (64 GB RAM, 4-core Xeon E3), each with 3 × 800 GB SSDs. I can't remember how much KV storage we had, but it was quite a bit.

We’ve hit a scalability limit in testing: DD (data distribution) crashed when storage servers exceeded 1,200.

When there are too many processes, FDB will crash.

Can you share your database configuration?

We have 10 nodes with:

  • 48 cores, 376GB memory
  • 4 × 3.2 TB NVMe

Each node has 1 NVMe for logs (1 log process per NVMe) and 3 NVMe for storage (8 storage processes per NVMe), plus 1 GRV proxy + 3 commit proxies, in triple redundancy mode.
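For reference, the layout described above would map to a foundationdb.conf along these lines on each node (the ports and paths here are illustrative, not a validated config):

```ini
# One log process on the dedicated log NVMe
[fdbserver.4500]
class = log
datadir = /mnt/nvme0/fdb/4500

# Eight storage processes per storage NVMe (4510-4517 on the first one, etc.)
[fdbserver.4510]
class = storage
datadir = /mnt/nvme1/fdb/4510

# Stateless processes to host the GRV/commit proxies
[fdbserver.4600]
class = stateless
```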

We did some tests and found that the performance was not very good: the cluster could only reach 30k+ ops.

Any advice for deployment configuration?

We have been running FDB successfully for many years. >150TB KV store across our largest clusters.

Your configuration all depends on the workload. For our case we have found the following to work:

  • 2 to 3 log procs per volume (~100GB size volumes)
  • 2 to 3 stateless procs per volume (sharing log process volume)
  • 1 storage proc per volume (anything up to 3TB sized volumes, although that size is rare)
  • 1 master proc
  • 1 dd proc
  • 1 cluster controller proc

One important point is to set a correct PRL (proxies/logs/resolvers) configuration. This is usually adjusted according to traffic, but in our case it’s typically:

  • logs = number of transaction processes in cluster
  • resolvers = 3 (never needed more)
  • grv_proxies = (total number of stateless) / 4
  • commit_proxies = remaining stateless
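The rule of thumb above can be sketched as a small helper; the input process counts here are hypothetical examples, not a recommendation:

```python
def recommend_prl(transaction_procs: int, stateless_procs: int) -> dict:
    """Apply the heuristic above: logs = number of transaction-class processes,
    resolvers = 3, and split stateless procs ~1:3 between GRV and commit proxies."""
    grv = max(1, stateless_procs // 4)
    return {
        "logs": transaction_procs,
        "resolvers": 3,
        "grv_proxies": grv,
        "commit_proxies": max(1, stateless_procs - grv),
    }

print(recommend_prl(transaction_procs=12, stateless_procs=16))
# -> {'logs': 12, 'resolvers': 3, 'grv_proxies': 4, 'commit_proxies': 12}
```

With those numbers you would then run something like `configure grv_proxies=4 commit_proxies=12 logs=12 resolvers=3` in fdbcli (7.x option names).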

We found that setting cache_memory helps when you have a large KV store. @PierreZ wrote a neat post about this, “Redwood’s memory tuning in FoundationDB”, under the Redwood section.
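For example, `cache_memory` can be raised in foundationdb.conf; the figure below is an arbitrary illustration, not a tuning recommendation:

```ini
[fdbserver]
cache_memory = 8GiB
```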

Again it all depends on your traffic, ymmv.

If I understand correctly, you have described how you configure each of your clusters.

But why have multiple clusters in the first place? Why not just one large FDB cluster? What specific metrics cause you to split the FDB cluster from one into many?

We use multiple clusters to keep data isolated. But even if that weren’t my concern, I would be wary of creating a large cluster for operational reasons. I don’t trust hardware enough not to fail in a way that could cause data loss. Storing a lot of data in FDB increases recovery time when a machine or disk fails (long rebuild times, degraded cluster state, etc.). We ran a >900-process cluster a few times briefly during migrations. It worked fine because it was “peacetime”; I never had a chance to test failure scenarios on a >1,000-process cluster.
As for splitting, that is entirely subjective. It depends on the type of traffic and the amount of data stored on disk. Keeping your KV store <=10 TB would be a good start: it reduces recovery time and lets you migrate to other instance types, if needed, in a sensible timescale.
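The “long rebuild times” concern is easy to put rough numbers on; a back-of-the-envelope sketch (the data size and throughput figures are assumptions, not measurements):

```python
def rebuild_hours(kv_bytes_per_machine: float, aggregate_mb_per_s: float) -> float:
    """Hours to re-replicate one failed machine's share of data, assuming the
    cluster sustains the given aggregate re-replication throughput."""
    return kv_bytes_per_machine / (aggregate_mb_per_s * 1e6) / 3600

# A machine holding 3 TB of KV data, re-replicated at 200 MB/s aggregate:
print(round(rebuild_hours(3e12, 200), 1))  # ~4.2 hours
```

The bigger each machine’s share of the KV store, the longer the cluster sits in a degraded state after a failure, which is one concrete argument for capping cluster size.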

what specific metrics cause you to split the fdb cluster from one to many?

We had two concerns with large FDB clusters, and ran tests with up to 600 cores under a heavy write workload to investigate them specifically.

  • worse commit latency - Batched writes are sent in parallel to all TLogs, and it’s possible that latency could grow as the TLog count grows. The TLog count is 22 in our case; it’s the smallest we found acceptable after testing. Larger clusters may need more TLogs. (ymmv)
  • longer recovery time - The size of the transaction state store grows with the cluster size (more shards to track), which could in turn impact recovery times when it is copied from the previous TLog generation.
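The first concern can be put in rough numbers: a commit waits for all TLogs, so commit latency behaves like the maximum of N parallel appends, which grows with N. A toy model assuming i.i.d. exponentially distributed append latencies (the 1 ms mean is arbitrary):

```python
def expected_max_exponential(n: int, mean_ms: float) -> float:
    """E[max of n i.i.d. Exp(mean)] = mean * H_n (the n-th harmonic number)."""
    return mean_ms * sum(1.0 / k for k in range(1, n + 1))

# Expected slowest-of-N append latency for a few TLog counts:
for tlogs in (8, 22, 64):
    print(tlogs, round(expected_max_exponential(tlogs, 1.0), 2), "ms")
```

In this model the growth is only logarithmic in the TLog count, which is at least consistent with not observing a commit-latency increase at 22 TLogs.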

We did not see recovery times or write latency increase. But it looks like other users are experimenting with larger cluster sizes than we tried, so those metrics are worth monitoring in such experiments. I see it mentioned in the forum (cited above) that DD struggles to keep up with clusters larger than we tried.

It is important to note that even if FDB scaled perfectly linearly, there are other good reasons to split an FDB cluster into many (this is true for any DB, btw):

  1. Blast radius of failures: A cluster can become unavailable for some time due to hardware failure, bugs, workload issues (ratekeeper…), human mistakes… If you have a single cluster, all of your data will be unavailable during these times. If you have many small clusters, only a small portion of your data will be unavailable (this can make a significant difference in customer impact).
  2. Forcing better data isolation: If you have all your data in one FDB cluster, you will be able to run transactions across all your data. This is an awesome feature of FDB; however, it might not be what you want. For example, if you look at how Apple uses the record layer, they explicitly try to isolate data (I am not working for Apple, but this is my understanding from the record layer paper). There are good reasons to do this: it allows you to move data more quickly between regions, it introduces logical boundaries, etc. Whether you want to do this depends a lot on your workload/use-case though.
  3. Security: A blunt but effective way to ensure that you don’t accidentally read data you shouldn’t read is to keep more sensitive data in separate clusters you don’t have access to. The fact that it’s a blunt tool can even be a very good thing: a customer might tell you something like “I don’t want my data to be on the same physical machine as the data of other customers”. Putting them on separate clusters is an easy way to give them peace of mind (it’s also expensive, but depending on the nature of the data a customer will be more than happy to pay for that – they might even be forced to do so due to regulations in their industry and the country they’re operating in; think of health-care data, insurance companies, etc.).
  4. Easier and safer management: An example here is upgrades. When you upgrade FDB one risk is always that the new version degrades in unexpected ways (we don’t know your workloads, so we can’t test your specific workload – and even if you do good testing, it’s still somewhat risky). But now you can for example first upgrade clusters with less important data. Or if you run into problems it might be easier to identify the problematic workload since you have some logical separation of workloads.
  5. Cost: Running FDB can be expensive. Especially if you’re paranoid. If 0.1% of your data is so important that even minor corruptions or data loss would be a business risk (this is often true for metadata like encryption keys or usernames and password hashes) you want this data to be super safe. The way to do that is to run on expensive reliable hardware, have a high replication factor, enable HA and DR, run frequent backups etc. This costs money. Probably more than you want to spend if you store large amounts of data in FDB.
    One possible, again quite blunt, solution to this problem is to move this super important data to a super expensive but reliable and small cluster. It only needs to hold 0.1% of your data. The rest of the data can go into cheaper clusters – with maybe a lower replication factor, cheaper hardware, no HA but only DR, etc.
    Btw: this is also true for performance requirements: explicitly partitioning your data and storing some on a cluster with the in-memory engine and other on a cluster configured with redwood is also a good option.
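Point 1 is easy to quantify; a minimal sketch, assuming K equal-sized clusters that fail independently (the 0.1% outage probability is an illustrative assumption):

```python
def single_incident_impact(clusters: int) -> float:
    """Fraction of all data affected when one cluster goes down."""
    return 1.0 / clusters

def prob_any_outage(clusters: int, p: float) -> float:
    """Probability that at least one cluster is down, with independent outages."""
    return 1.0 - (1.0 - p) ** clusters

# One big cluster vs. ten small ones, each down 0.1% of the time:
for k in (1, 10):
    print(k, single_incident_impact(k), round(prob_any_outage(k, 0.001), 5))
```

Splitting trades a slightly higher chance that *some* incident is ongoing for a much smaller share of data affected per incident.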

The drawback of having multiple clusters is a bit more obvious (it’s MUCH harder to model your workloads this way, since you lose the ability to run transactions across partitions), but I would argue that if you can model your data into independent partitions, you should always do that. Even if you then store all your partitions on the same cluster, it will just give you more flexibility down the road.