How are the IDs being generated? Are they sequential or random?
Well, they are sequential-ish in groups. Some groups are pretty large, some are smaller.
It looks like this in the database
namespace/source/cache/id or in a real example
discover/platform1/cache/8n7M-UYSIJA discover/platform1/cache/_YOg9_pbsnk discover/platform1/cache/4HN41of8WK8 discover/platform1/cache/5u85ljWc8ck discover/platform1/cache/R15juwpxYu0 discover/platform2/cache/1808674919186284 discover/platform2/cache/1930138083685906 discover/platform1/cache/UVf55-LvRHQ discover/platform1/cache/9sT0AMIL88E discover/platform2/cache/1990978757644243 discover/platform2/cache/1363389550372795 discover/platform1/cache/hmYu-Cjk3CI discover/platform1/cache/n8uXo5Scwfk discover/platform1/cache/h0arxNpGdXE discover/platform2/cache/693874064116539 discover/platform2/cache/10215744445793392 discover/platform2/cache/10215965264421613 discover/platform2/cache/1885408654837611 discover/platform2/cache/1796496660388522 discover/platform2/cache/1541671129225742 ....
If you’re writing sequentially to a range of keys, it can produce the effect you’re seeing where one disk is being disproportionately stressed. From what you’re describing that could be what’s happening.
Try writing to a completely random range of keys and see if you see the same issues.
I realize this isn’t useful for your application! Just as a diagnosis.
That is unfortunately not possible. We have no control over the keys as they are coming from external services at a pace that we also have no control over.
I may be wrong, but this may not be the a big issue. If you look at the servers, it’s not great but it’s also not super uneven.
Could you maybe took a shot at my questions about the general structure of the cluster? What’s the best setup, how to maybe think about it, anything else that could be helpful?
Unfortunately, I am not a FoundationDB expert, I’ve just ran across that specific issue before and had it manifest in that way with high IO on one server due to a hot key range.
You can try separating the transaction logging disk from the storage disk which I do know helps.
Oh gotcha. Appreciate your pointers though!
The transaction logging disk is separate, if you are referring to the log dir
datadir = /var/lib/foundationdb/data/$ID logdir = /var/log/foundationdb
Filesystem Size Used Avail Use% Mounted on udev 7.4G 0 7.4G 0% /dev tmpfs 1.5G 12M 1.5G 1% /run /dev/sda1 30G 5.1G 24G 18% / tmpfs 7.4G 4.0K 7.4G 1% /dev/shm tmpfs 5.0M 0 5.0M 0% /run/lock tmpfs 7.4G 0 7.4G 0% /sys/fs/cgroup tmpfs 1.5G 0 1.5G 0% /run/user/1010 /dev/nvme0n1 369G 31G 319G 9% /var/lib/foundationdb/data
What I’d actually suggest optimizing for is a process:disk ratio. 32 processes : 1 disk is probably a bit too much. Doing 1-2 processes per SSD is a thing that I’ve generally seen able to reach saturation of an SSD before, and anything above that just means an increase in disk latency but not throughput.
FDB will, by default, recruit a storage server on every process, so having 32 processes connected in a cluster will, if process classes aren’t specified, mean 32 storage servers all competing for disk accesses.
This advice applies a bit more to bare metal than VMs. I’ve generally only seem VM providers offer you one disk, so in that sense, it’s better to run more VMs that are smaller than fewer larger VMs because that gives you a better process:disk ratio.
Running 1 log per disk, or 2 storage servers per disk, and setting the rest of the processes you might run as stateless will probably give you full throughput, and the best latency.
You would need to set them up manually. Did you already see this thread?
Admittedly, these really do need to get easier to use, or preferably not necessary to understand at all to get a well performing FDB cluster. But until then, it helps to specify your cluster layout.
I would expect NVMe disks to be better for FDB, and by the metrics that FDB, or any database should care about, NVMe drives should be much better. If you do
iotop, or some equivalent process that can break down disk usage by process, can you double check it is indeed
logdir is where our textual logs go, but not the transaction log data files. Those still go into datadir. The only way to prevent a storage server being recruited on the same process as a transaction log is to use process classes.
Thanks @alexmiller for the answer. I read and re-read it multiple times including the thread you pointed me to and I think I have more questions than I got answers to.
Maybe you could help me (and others) to understand how to properly structure a cluster. I read here the suggestion how to split 8 servers, but what am I not understanding is how would you deploy and expand something like this. What if you don’t have 8 but 100 servers?
So here are some questions that I’m struggling with (lets consider that I have 10 servers with a preference to expand to more):
You suggested that each server should be dedicated to a very specific class. So now if you have 4 core server, then 1 process will be lets say
class=storage and the rest would be
stateless. If it’s 4 core server, then for 10 servers, there are 40 cores, but only 10 assigned to storage/transactions and lets say 8 proxies and 5 logs.
- What about the rest? What are those going to do?
- Additionally how would you structure those servers, specially hard-drives? Are they equal (e.g. 1TB space each?)
- What about coordinators? Is it sensible to make one process from each “physical” server a coordinator? What is coordinator anyway. Is it a process that is dedicated to something, or just a discovery mechanism and it truly doesn’t matter how many are there?
- What’s the best spread of classes? Based on this guide it looks like 10 for logs, 9 for proxies. From your example you had 2 for logs, 6 for storage. So extrapolating that on 10 servers: 7 storage, 3 logs, 30 stateless with 10 of them used as proxies (on 4 core server)
- Would it be more sensible to have just 2 core server then?
Also one other question. I just altered the cluster to
Configuration: Redundancy mode - double Storage engine - ssd-2 Coordinators - 9 Desired Proxies - 5 Desired Logs - 8
but there is no information on which process is utilized and what class is being assigned.
I still have each server to have both data and log class
[fdbserver.4500] class=stateless [fdbserver.4501] class=storage [fdbserver.4502] class=transaction [fdbserver.4503] class=stateless
as I need to figure out how to do the deployment across equal servers unequally.
Thank you for your help!
Probably 80-90 of the 100 severs are just going to be running storage servers. It’s probably slightly better to not assign them any process class, rather than assign them storage process class, so that in the case of failures, you have processes to fall back on that you can recruit other roles from.
It doesn’t really matter, depending on your workload. Data distribution aims to keep the used space percent roughly equal among storage servers. So nothing bad will happen if your run a cluster, half with 1TB drives and half with 10TB drives.
You will, however, have created a cluster where you have 10x more disk time for some data than other data. If you’re randomly reading keys from your database, 1TB storage servers are 10x more likely to have that data in memory, and if they don’t, have 10x less requests going to their drives. If you have largely cold data, then this won’t matter. If you’re latency sensitive, then you want to size your drives equally, and preferably to match your IO/s,
Think of coordinators as a zookeeper cluster. You’ll want to run 2n+1 of then, where n is the number of failures you can take and keep running. If you’re running in triple replication, then n=2. If you’re running double replication, then n=1.
If you’re running across multiple DC’s, it’s best to adjust this math a little bit, so that you have 2n+1 coordinators per DC, as always having a local coordinator will save you WAN traffic.
The exact best number of logs and proxies is going to vary strongly based on your workload. You essentially want the minimum number of them that can still support your load with some overhead left for safety.
My very rough guess would be to say when running with with the
ssd storage engine, you’ll probably want about 1 log for every 8 storage servers, and probably around 1 proxy per log. But I would strongly suggest running some representative performance test, and varying the number of proxies and logs until you find an optimal spot.
Quite possibly, yes. The extra cores could be useful to run stateless roles, but they have their own demands. Proxies and logs are both network bandwidth hungry, as all commits in the system flow through them, so it’s good to plan out their network bandwidth allocations as well.
I’m used to thinking about this in a bare metal world, and not one in which your disk (and network?) scales with the number of cores you have. FDB, overall, should be more disk and network bottlenecked than CPU, as its primary job is to do IO.
If you’re writing any layers to run on top of FDB, then extra cores alongside your storage servers are great places to run you layers processes – both from a latency and binpacking perspective.
Thanks for that write up @alexmiller, I too had questions about good baseline ratios of different process classes to start with when deploying more processes into the cluster. So far I had just been doing that by trial and error until I got a passable result compared to what my intuition said should be happening.
The 1 log : 8 ssd storage ratio is only roughly what I’ve seen. I’m being very specific about
ssd, because the memory storage engine can apply mutations to durable storage faster, so the ratio changes. I think it’s closer to 1:2, but it’s not a thing I’ve benchmarked as often.
Ideally, you’d run a
single ssd proxy=5 log=5 1 storage server cluster, run a write heavy benchmark, and look at the trace files from the storage server to determine what rate of mutations your storage server can apply on your hardware. Then do the same with a
single ssd proxy=5 log=1 10 storage server cluster, and see what rate of mutations one tlog can support.
These two figures then give you your ratios. Each additional tlog and proxy you add isn’t going to give you quite as much benefit as the last, but your goal is to make sure that you have enough tlogs to be able to feed your storage servers at their full rate, and then give yourself an extra little bit of headroom so that you aren’t running your cluster at 100% and thus getting poor latency.
This is, of course, assuming you’re targeting being able to do full write throughput. If you have a lot of cold data, and care more about commit latency, then running with less proxies or logs than your maximum would be a better option.
I filed Storage Server recruitment should consider existing recruited roles #552 a bit ago, as explicitly assigning process classes is mostly about keeping storage servers away from other latency-critical parts of the system. I don’t really have any ideas on how we would do better auto-configuration of proxies and logs, as changing them invokes a recovery, and a recovery means O(hundreds of milliseconds) write downtime / latency spike.
@alexmiller this is all incredibly helpful. I spend weeks reading through the docs and preparing for spinning the cluster, I got a bit blindsided by the challenges of actually optimizing the cluster. But this helps a lot.
I still run a very suboptimal cluster at this point, but even with it I see a lot of improvement. My next step is going to be try to follow your guidance and rebuild the cluster with those parameters, specifically [24 servers, each 2cpu/8GB ram, 375GB local ssd] and running them with these settings:
- double ssd
- 20x with 1 storage and 1 stateless
- 4x with 1 transaction and 1 stateless
- 4 coordinators
- 4x proxy
Also need to figure out backup somehow, but that’s for later. This means that I will have 24 processes in
stateless, 4 recruited as
proxy and the rest just idling.
Does it sound about right to you?
(Oh, sorry, apparently I never hit reply.)
I’m glad to have helped.
Your proposed layout sounds good to me. It’d be an interesting test to see if 2x storage servers per host runs better than 1x storage server.
The FDB documentation should probably be scrubbed clean of the
transaction class, which should probably be replaced with the
transaction allows either a transaction log or a proxy to be recruited there, oddly. Looking back over the documentation on setting process classes, much of the above thread should probably be shuffled into the documentation also.
In that case I will stick with 4 CPUs, 2 storage, 2 stateless.
Would you also suggest to do 2x log?
Oh, good one. Didn’t know that. Will replace transaction with
I made changes to the existing cluster (still running
transaction on the same server) and getting incredible performance.
I have been asking myself those same questions. It’s good to see I wasn’t the only one. Thank you very much @alexmiller for your inputs!
I have followed your thinking process, and I am now wondering if you deployed those FDB machines with different configurations depending on each role your were assigning them. And if so, how you achieved that unequal deployment (to reuse your term)? Manually or with some sort of automation/reproducibility… ?
And do you mind confirming that you organised your 24 servers in 2 groups (configuration template):
- 20 x configured with 2 processes (class=storage, class=stateless)
- 4 x configured with 2 processes (class=log, class=stateless)
Then applied the following fdbcli configure commands:
coordinators addr1 addr2 addr3 addr4
Looking forward to hearing from you
Thank you so much / Matt
I am interested in running your suggested benchmarks however I am wondering if I am reading this properly…
You ask to run a first test with
single ssd proxy=5 log=5 and then a second one with
single ssd proxy=5 log=1 but is it a typo that you are changing the cluster size? You’re mentioning a cluster of 1 storage server for the first test, and then a cluster of 10 storage servers for the second time. I’m a bit puzzled why would that be… I thought I would stick to a cluster of 10 for both tests.
Thank you for your assistance!
Actually no. I deployed equal servers,
4 vCPU, 16GB RAM, 500GB SSD with 4 fdb processes, 2 for
storage and 2 for
stateless. On two servers I swapped the 2x
storage for 2x
log and then I went with
ssd double, 4 coordinators and 4 desired logs, 2 resolvers and 3 proxies.
We’re still testing this deployment. I’m going to try one more, where we swap SSD for Local SSD on all servers. We are particularly trying to find out how many concurrent reads from how many external servers can we do.
A smaller cluster that is running for a while is doing actually pretty well (9 servers, 4 vcpu, 16GB RAM, …).
Thank you Thomas for your fast response, it’s awesome. Everything makes sense, I’ll try the same cluster on AWS with NVME SSD disks.
Have you compared the performance of both clusters (9 instances vs 24), and do you see a linear(-fish) performance improvement? I still don’t understand how (or have verified that) performance is improved with larger cluster size. In my case I was getting the same results with a cluster of 10 instances, and one of 20 instances (but I had left the process class unspecified (so default to storage server)).
I didn’t have the opportunity to test it side by side yet. Our main goal is to see how it performs with much larger deployment (we need to store more than 100TB of data) and see if we can get thousands of servers to operate on the cluster.