FoundationDB

Production optimizations


(Thomas Johson) #1

Hi,

Finally my cluster is up and running so I was able to do some significant testing. At this point, we’re using it as a cache, with a very simple transaction logic

_, err = db.Transact(func(tr fdb.Transaction) (ret interface{}, err error) {
    future := tr.Get(sub.Pack(tuple.Tuple{"cache", ID}))
    res := future.MustGet()

    if len(res) == 0 {
        tr.Set(sub.Pack(tuple.Tuple{"cache", ID}), []byte(strconv.FormatInt(time.Now().UnixNano(), 10)))

        processChan <- id
    }

    return
})

I tested three different deployments, all on GCP (google cloud)

  1. 3 servers - each 4 vcpu, 16GB ram, 1TB ssd
  2. 3 servers - each 32 vcpu, 120GB ram, 1TB ssd
  3. 11 servers - each 4 vcpu, 16GB ram, 375GB NVME SSD

I’m running FDB with double redundancy and ssd mode. Each server has equivalent of processes to vcpus (4 has 4, 32 has 32 …). Each servers also has one coordinator, so 3 servers will have 3 coordinators).

There are 80 servers running that run the above code and do some other stuff, with each at 60 gourutines (each goroutines creates a single connection to FDB).

With the 1) setting, I started running into issues that the disk IO was too high and the latency was way up.

Then I switched to 2), especially because GCP has different SSD performance based on how many cores the server has. The max you can get out of the setup is around 32 cores and 1TB, especially for writes.

This setup immediately showed improvements, while the HDD throughput (for writes) was running as high as 300MB/s on those three servers.

And finally I switched to 3) where I got way more servers all running NVME disks. This setup is running in testing “production” for last 3 days so I have the most data for it. One of the confusing things I’m seeing is that the server is reporting ~90% disk IO but nor IOPS nor the throughput is at any significant number and I can’t get it any higher. I though that NVME will be showing the best results, but it’s far from it.

Here is a dump from status details

fdb> status details

Using cluster file `fdb.cluster'.

Could not communicate with all of the coordination servers.
  The database will remain operational as long as we
  can connect to a quorum of servers, however the fault
  tolerance of the system is reduced as long as the
  servers remain disconnected.

  10.240.0.252:4500  (reachable)
  10.240.1.34:4500  (unreachable)
  10.240.1.155:4500  (reachable)
  10.240.1.171:4500  (reachable)
  10.240.2.103:4500  (reachable)
  10.240.3.144:4500  (reachable)
  10.240.3.192:4500  (reachable)
  10.240.4.228:4500  (unreachable)
  10.240.5.32:4500  (reachable)
  10.240.5.59:4500  (reachable)
  10.240.10.139:4500  (reachable)

Configuration:
  Redundancy mode        - double
  Storage engine         - ssd-2
  Coordinators           - 11

Cluster:
  FoundationDB processes - 36
  Machines               - 9
  Memory availability    - 3.8 GB per process on machine with least available
                           >>>>> (WARNING: 4.0 GB recommended) <<<<<
  Retransmissions rate   - 2 Hz
  Fault Tolerance        - 1 machine
  Server time            - 07/30/18 14:10:22

Data:
  Replication health     - Healthy
  Moving data            - 0.000 GB
  Sum of key-value sizes - 114.105 GB
  Disk space used        - 298.159 GB

Operating space:
  Storage server         - 321.6 GB free on most full server
  Log server             - 321.6 GB free on most full server

Workload:
  Read rate              - 90375 Hz
  Write rate             - 7996 Hz
  Transactions started   - 90130 Hz
  Transactions committed - 2664 Hz
  Conflict rate          - 200 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Process performance details:
  10.240.0.252:4500      ( 29% cpu; 31% machine; 0.018 Gbps; 50% disk IO; 2.9 GB / 3.8 GB RAM  )
  10.240.0.252:4501      ( 26% cpu; 31% machine; 0.018 Gbps; 49% disk IO; 2.9 GB / 3.8 GB RAM  )
  10.240.0.252:4502      ( 26% cpu; 31% machine; 0.018 Gbps; 49% disk IO; 2.8 GB / 3.8 GB RAM  )
  10.240.0.252:4503      ( 30% cpu; 31% machine; 0.018 Gbps; 49% disk IO; 2.8 GB / 3.8 GB RAM  )
  10.240.1.155:4500      ( 64% cpu; 48% machine; 0.049 Gbps; 46% disk IO; 3.1 GB / 3.9 GB RAM  )
  10.240.1.155:4501      ( 64% cpu; 48% machine; 0.049 Gbps; 48% disk IO; 2.8 GB / 3.9 GB RAM  )
  10.240.1.155:4502      ( 27% cpu; 48% machine; 0.049 Gbps; 48% disk IO; 2.7 GB / 3.9 GB RAM  )
  10.240.1.155:4503      ( 27% cpu; 48% machine; 0.049 Gbps; 49% disk IO; 3.0 GB / 3.9 GB RAM  )
  10.240.1.171:4500      ( 31% cpu; 30% machine; 0.017 Gbps; 46% disk IO; 3.0 GB / 3.8 GB RAM  )
  10.240.1.171:4501      ( 28% cpu; 30% machine; 0.017 Gbps; 47% disk IO; 2.7 GB / 3.8 GB RAM  )
  10.240.1.171:4502      ( 30% cpu; 30% machine; 0.017 Gbps; 47% disk IO; 2.9 GB / 3.8 GB RAM  )
  10.240.1.171:4503      ( 29% cpu; 30% machine; 0.017 Gbps; 47% disk IO; 2.7 GB / 3.8 GB RAM  )
  10.240.2.103:4500      ( 32% cpu; 36% machine; 0.030 Gbps; 48% disk IO; 3.1 GB / 3.9 GB RAM  )
  10.240.2.103:4501      ( 29% cpu; 36% machine; 0.030 Gbps; 48% disk IO; 2.8 GB / 3.9 GB RAM  )
  10.240.2.103:4502      ( 36% cpu; 36% machine; 0.030 Gbps; 49% disk IO; 2.9 GB / 3.9 GB RAM  )
  10.240.2.103:4503      ( 37% cpu; 36% machine; 0.030 Gbps; 49% disk IO; 2.8 GB / 3.9 GB RAM  )
  10.240.3.144:4500      ( 31% cpu; 32% machine; 0.017 Gbps; 47% disk IO; 2.8 GB / 3.9 GB RAM  )
  10.240.3.144:4501      ( 30% cpu; 32% machine; 0.017 Gbps; 46% disk IO; 2.9 GB / 3.9 GB RAM  )
  10.240.3.144:4502      ( 31% cpu; 32% machine; 0.017 Gbps; 47% disk IO; 2.9 GB / 3.9 GB RAM  )
  10.240.3.144:4503      ( 29% cpu; 32% machine; 0.017 Gbps; 46% disk IO; 2.9 GB / 3.9 GB RAM  )
  10.240.3.192:4500      ( 35% cpu; 53% machine; 0.042 Gbps; 84% disk IO; 3.0 GB / 3.9 GB RAM  )
  10.240.3.192:4501      ( 68% cpu; 53% machine; 0.042 Gbps; 84% disk IO; 3.0 GB / 3.9 GB RAM  )
  10.240.3.192:4502      ( 33% cpu; 53% machine; 0.042 Gbps; 84% disk IO; 2.9 GB / 3.9 GB RAM  )
  10.240.3.192:4503      ( 74% cpu; 53% machine; 0.042 Gbps; 84% disk IO; 2.8 GB / 3.9 GB RAM  )
  10.240.5.32:4500       ( 65% cpu; 43% machine; 0.034 Gbps; 88% disk IO; 3.0 GB / 3.9 GB RAM  )
  10.240.5.32:4501       ( 36% cpu; 43% machine; 0.034 Gbps; 89% disk IO; 2.9 GB / 3.9 GB RAM  )
  10.240.5.32:4502       ( 33% cpu; 43% machine; 0.034 Gbps; 86% disk IO; 3.0 GB / 3.9 GB RAM  )
  10.240.5.32:4503       ( 30% cpu; 43% machine; 0.034 Gbps; 84% disk IO; 2.9 GB / 3.9 GB RAM  )
  10.240.5.59:4500       ( 31% cpu; 29% machine; 0.018 Gbps; 47% disk IO; 3.0 GB / 3.8 GB RAM  )
  10.240.5.59:4501       ( 27% cpu; 29% machine; 0.018 Gbps; 47% disk IO; 2.9 GB / 3.8 GB RAM  )
  10.240.5.59:4502       ( 29% cpu; 29% machine; 0.018 Gbps; 47% disk IO; 2.9 GB / 3.8 GB RAM  )
  10.240.5.59:4503       ( 28% cpu; 29% machine; 0.018 Gbps; 47% disk IO; 2.7 GB / 3.8 GB RAM  )
  10.240.10.139:4500     ( 26% cpu; 36% machine; 0.035 Gbps; 87% disk IO; 3.1 GB / 3.9 GB RAM  )
  10.240.10.139:4501     ( 23% cpu; 36% machine; 0.035 Gbps; 83% disk IO; 2.8 GB / 3.9 GB RAM  )
  10.240.10.139:4502     ( 29% cpu; 36% machine; 0.035 Gbps; 86% disk IO; 2.9 GB / 3.9 GB RAM  )
  10.240.10.139:4503     ( 58% cpu; 36% machine; 0.035 Gbps; 87% disk IO; 3.1 GB / 3.9 GB RAM  )

Client time: 07/30/18 14:10:20

I would appreciate any pointers regarding the best setup that allows the highest utilization of the resources.

Some of my questions

  • is it better to have many smaller servers or a few large ones?
  • what is the most optimal configuration for a single server?
  • I’m still pretty confused by the process class. Do I set them up manually? Do I let the client/server to decide which ones to dedicate or should I choose them manually?
  • It seems like NVME disks may not be optimal for FDB. Is it better to use just SSD?

Anything else that you would suggest?

BTW I must say the resiliency of the system if mind-blowing. I took down one server to expand it and everything worked incredibly well. Each time I brought the server back the whole cluster synced within couple of minutes and I could move on on the next one.


(Ryan Worl) #2

How are the IDs being generated? Are they sequential or random?


(Thomas Johson) #3

Well, they are sequential-ish in groups. Some groups are pretty large, some are smaller.

It looks like this in the database
namespace/source/cache/id or in a real example

discover/platform1/cache/8n7M-UYSIJA
discover/platform1/cache/_YOg9_pbsnk
discover/platform1/cache/4HN41of8WK8
discover/platform1/cache/5u85ljWc8ck
discover/platform1/cache/R15juwpxYu0
discover/platform2/cache/1808674919186284
discover/platform2/cache/1930138083685906
discover/platform1/cache/UVf55-LvRHQ
discover/platform1/cache/9sT0AMIL88E
discover/platform2/cache/1990978757644243
discover/platform2/cache/1363389550372795
discover/platform1/cache/hmYu-Cjk3CI
discover/platform1/cache/n8uXo5Scwfk
discover/platform1/cache/h0arxNpGdXE
discover/platform2/cache/693874064116539
discover/platform2/cache/10215744445793392
discover/platform2/cache/10215965264421613
discover/platform2/cache/1885408654837611
discover/platform2/cache/1796496660388522
discover/platform2/cache/1541671129225742
....

(Ryan Worl) #4

If you’re writing sequentially to a range of keys, it can produce the effect you’re seeing where one disk is being disproportionately stressed. From what you’re describing that could be what’s happening.

Try writing to a completely random range of keys and see if you see the same issues.


(Ryan Worl) #5

I realize this isn’t useful for your application! Just as a diagnosis.


(Thomas Johson) #6

That is unfortunately not possible. We have no control over the keys as they are coming from external services at a pace that we also have no control over.

I may be wrong, but this may not be the a big issue. If you look at the servers, it’s not great but it’s also not super uneven.

Could you maybe took a shot at my questions about the general structure of the cluster? What’s the best setup, how to maybe think about it, anything else that could be helpful?


(Ryan Worl) #7

Unfortunately, I am not a FoundationDB expert, I’ve just ran across that specific issue before and had it manifest in that way with high IO on one server due to a hot key range.

You can try separating the transaction logging disk from the storage disk which I do know helps.


(Thomas Johson) #8

Oh gotcha. Appreciate your pointers though!

The transaction logging disk is separate, if you are referring to the log dir

datadir = /var/lib/foundationdb/data/$ID
logdir = /var/log/foundationdb
Filesystem      Size  Used Avail Use% Mounted on
udev            7.4G     0  7.4G   0% /dev
tmpfs           1.5G   12M  1.5G   1% /run
/dev/sda1        30G  5.1G   24G  18% /
tmpfs           7.4G  4.0K  7.4G   1% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           7.4G     0  7.4G   0% /sys/fs/cgroup
tmpfs           1.5G     0  1.5G   0% /run/user/1010
/dev/nvme0n1    369G   31G  319G   9% /var/lib/foundationdb/data

(Alex Miller) #9

What I’d actually suggest optimizing for is a process:disk ratio. 32 processes : 1 disk is probably a bit too much. Doing 1-2 processes per SSD is a thing that I’ve generally seen able to reach saturation of an SSD before, and anything above that just means an increase in disk latency but not throughput.

FDB will, by default, recruit a storage server on every process, so having 32 processes connected in a cluster will, if process classes aren’t specified, mean 32 storage servers all competing for disk accesses.

This advice applies a bit more to bare metal than VMs. I’ve generally only seem VM providers offer you one disk, so in that sense, it’s better to run more VMs that are smaller than fewer larger VMs because that gives you a better process:disk ratio.

Running 1 log per disk, or 2 storage servers per disk, and setting the rest of the processes you might run as stateless will probably give you full throughput, and the best latency.

You would need to set them up manually. Did you already see this thread?

Admittedly, these really do need to get easier to use, or preferably not necessary to understand at all to get a well performing FDB cluster. But until then, it helps to specify your cluster layout.

I would expect NVMe disks to be better for FDB, and by the metrics that FDB, or any database should care about, NVMe drives should be much better. If you do iotop, or some equivalent process that can break down disk usage by process, can you double check it is indeed fdbserver?

logdir is where our textual logs go, but not the transaction log data files. Those still go into datadir. The only way to prevent a storage server being recruited on the same process as a transaction log is to use process classes.


(Thomas Johson) #10

Thanks @alexmiller for the answer. I read and re-read it multiple times including the thread you pointed me to and I think I have more questions than I got answers to.

Maybe you could help me (and others) to understand how to properly structure a cluster. I read here the suggestion how to split 8 servers, but what am I not understanding is how would you deploy and expand something like this. What if you don’t have 8 but 100 servers?

So here are some questions that I’m struggling with (lets consider that I have 10 servers with a preference to expand to more):

You suggested that each server should be dedicated to a very specific class. So now if you have 4 core server, then 1 process will be lets say class=storage and the rest would be stateless. If it’s 4 core server, then for 10 servers, there are 40 cores, but only 10 assigned to storage/transactions and lets say 8 proxies and 5 logs.

  • What about the rest? What are those going to do?
  • Additionally how would you structure those servers, specially hard-drives? Are they equal (e.g. 1TB space each?)
  • What about coordinators? Is it sensible to make one process from each “physical” server a coordinator? What is coordinator anyway. Is it a process that is dedicated to something, or just a discovery mechanism and it truly doesn’t matter how many are there?
  • What’s the best spread of classes? Based on this guide it looks like 10 for logs, 9 for proxies. From your example you had 2 for logs, 6 for storage. So extrapolating that on 10 servers: 7 storage, 3 logs, 30 stateless with 10 of them used as proxies (on 4 core server)
  • Would it be more sensible to have just 2 core server then?

Also one other question. I just altered the cluster to

Configuration:
  Redundancy mode        - double
  Storage engine         - ssd-2
  Coordinators           - 9
  Desired Proxies        - 5
  Desired Logs           - 8

but there is no information on which process is utilized and what class is being assigned.

I still have each server to have both data and log class

[fdbserver.4500]
class=stateless
[fdbserver.4501]
class=storage
[fdbserver.4502]
class=transaction
[fdbserver.4503]
class=stateless

as I need to figure out how to do the deployment across equal servers unequally.

Thank you for your help!


(Alex Miller) #11

Probably 80-90 of the 100 severs are just going to be running storage servers. It’s probably slightly better to not assign them any process class, rather than assign them storage process class, so that in the case of failures, you have processes to fall back on that you can recruit other roles from.

It doesn’t really matter, depending on your workload. Data distribution aims to keep the used space percent roughly equal among storage servers. So nothing bad will happen if your run a cluster, half with 1TB drives and half with 10TB drives.

You will, however, have created a cluster where you have 10x more disk time for some data than other data. If you’re randomly reading keys from your database, 1TB storage servers are 10x more likely to have that data in memory, and if they don’t, have 10x less requests going to their drives. If you have largely cold data, then this won’t matter. If you’re latency sensitive, then you want to size your drives equally, and preferably to match your IO/s,

Think of coordinators as a zookeeper cluster. You’ll want to run 2n+1 of then, where n is the number of failures you can take and keep running. If you’re running in triple replication, then n=2. If you’re running double replication, then n=1.

If you’re running across multiple DC’s, it’s best to adjust this math a little bit, so that you have 2n+1 coordinators per DC, as always having a local coordinator will save you WAN traffic.

The exact best number of logs and proxies is going to vary strongly based on your workload. You essentially want the minimum number of them that can still support your load with some overhead left for safety.

My very rough guess would be to say when running with with the ssd storage engine, you’ll probably want about 1 log for every 8 storage servers, and probably around 1 proxy per log. But I would strongly suggest running some representative performance test, and varying the number of proxies and logs until you find an optimal spot.

Quite possibly, yes. The extra cores could be useful to run stateless roles, but they have their own demands. Proxies and logs are both network bandwidth hungry, as all commits in the system flow through them, so it’s good to plan out their network bandwidth allocations as well.

I’m used to thinking about this in a bare metal world, and not one in which your disk (and network?) scales with the number of cores you have. FDB, overall, should be more disk and network bottlenecked than CPU, as its primary job is to do IO.

If you’re writing any layers to run on top of FDB, then extra cores alongside your storage servers are great places to run you layers processes – both from a latency and binpacking perspective.


(Ryan Worl) #12

Thanks for that write up @alexmiller, I too had questions about good baseline ratios of different process classes to start with when deploying more processes into the cluster. So far I had just been doing that by trial and error until I got a passable result compared to what my intuition said should be happening.


(Alex Miller) #13

The 1 log : 8 ssd storage ratio is only roughly what I’ve seen. I’m being very specific about ssd, because the memory storage engine can apply mutations to durable storage faster, so the ratio changes. I think it’s closer to 1:2, but it’s not a thing I’ve benchmarked as often.

Ideally, you’d run a single ssd proxy=5 log=5 1 storage server cluster, run a write heavy benchmark, and look at the trace files from the storage server to determine what rate of mutations your storage server can apply on your hardware. Then do the same with a single ssd proxy=5 log=1 10 storage server cluster, and see what rate of mutations one tlog can support.

These two figures then give you your ratios. Each additional tlog and proxy you add isn’t going to give you quite as much benefit as the last, but your goal is to make sure that you have enough tlogs to be able to feed your storage servers at their full rate, and then give yourself an extra little bit of headroom so that you aren’t running your cluster at 100% and thus getting poor latency.

This is, of course, assuming you’re targeting being able to do full write throughput. If you have a lot of cold data, and care more about commit latency, then running with less proxies or logs than your maximum would be a better option.

I filed Storage Server recruitment should consider existing recruited roles #552 a bit ago, as explicitly assigning process classes is mostly about keeping storage servers away from other latency-critical parts of the system. I don’t really have any ideas on how we would do better auto-configuration of proxies and logs, as changing them invokes a recovery, and a recovery means O(hundreds of milliseconds) write downtime / latency spike.


(Thomas Johson) #14

@alexmiller this is all incredibly helpful. I spend weeks reading through the docs and preparing for spinning the cluster, I got a bit blindsided by the challenges of actually optimizing the cluster. But this helps a lot.

I still run a very suboptimal cluster at this point, but even with it I see a lot of improvement. My next step is going to be try to follow your guidance and rebuild the cluster with those parameters, specifically [24 servers, each 2cpu/8GB ram, 375GB local ssd] and running them with these settings:

  • double ssd
  • 20x with 1 storage and 1 stateless
  • 4x with 1 transaction and 1 stateless
  • 4 coordinators
  • 4x proxy

Also need to figure out backup somehow, but that’s for later. This means that I will have 24 processes in stateless, 4 recruited as proxy and the rest just idling.

Does it sound about right to you?


(Alex Miller) #15

(Oh, sorry, apparently I never hit reply.)

I’m glad to have helped. :slightly_smiling_face:

Your proposed layout sounds good to me. It’d be an interesting test to see if 2x storage servers per host runs better than 1x storage server.

The FDB documentation should probably be scrubbed clean of the transaction class, which should probably be replaced with the log class. transaction allows either a transaction log or a proxy to be recruited there, oddly. Looking back over the documentation on setting process classes, much of the above thread should probably be shuffled into the documentation also.


(Thomas Johson) #16

In that case I will stick with 4 CPUs, 2 storage, 2 stateless.

Would you also suggest to do 2x log?

Oh, good one. Didn’t know that. Will replace transaction with log.

I made changes to the existing cluster (still running storage and transaction on the same server) and getting incredible performance.


(Matt Lohier) #17

Hi Thomas,

I have been asking myself those same questions. It’s good to see I wasn’t the only one. Thank you very much @alexmiller for your inputs!
I have followed your thinking process, and I am now wondering if you deployed those FDB machines with different configurations depending on each role your were assigning them. And if so, how you achieved that unequal deployment (to reuse your term)? Manually or with some sort of automation/reproducibility… ?

And do you mind confirming that you organised your 24 servers in 2 groups (configuration template):

  • 20 x configured with 2 processes (class=storage, class=stateless)
  • 4 x configured with 2 processes (class=log, class=stateless)

Then applied the following fdbcli configure commands:
configure proxies=4
coordinators addr1 addr2 addr3 addr4

Looking forward to hearing from you
Thank you so much / Matt


(Matt Lohier) #18

Hi Alex,

I am interested in running your suggested benchmarks however I am wondering if I am reading this properly…

You ask to run a first test with single ssd proxy=5 log=5 and then a second one with single ssd proxy=5 log=1 but is it a typo that you are changing the cluster size? You’re mentioning a cluster of 1 storage server for the first test, and then a cluster of 10 storage servers for the second time. I’m a bit puzzled why would that be… I thought I would stick to a cluster of 10 for both tests.

Thank you for your assistance!


(Thomas Johson) #19

Actually no. I deployed equal servers, 4 vCPU, 16GB RAM, 500GB SSD with 4 fdb processes, 2 for storage and 2 for stateless. On two servers I swapped the 2x storage for 2x log and then I went with ssd double, 4 coordinators and 4 desired logs, 2 resolvers and 3 proxies.

Correct.

We’re still testing this deployment. I’m going to try one more, where we swap SSD for Local SSD on all servers. We are particularly trying to find out how many concurrent reads from how many external servers can we do.

A smaller cluster that is running for a while is doing actually pretty well (9 servers, 4 vcpu, 16GB RAM, …).


(Matt Lohier) #20

Thank you Thomas for your fast response, it’s awesome. Everything makes sense, I’ll try the same cluster on AWS with NVME SSD disks.

Have you compared the performance of both clusters (9 instances vs 24), and do you see a linear(-fish) performance improvement? I still don’t understand how (or have verified that) performance is improved with larger cluster size. In my case I was getting the same results with a cluster of 10 instances, and one of 20 instances (but I had left the process class unspecified (so default to storage server)).

Thanks!