Configuring FoundationDB to Use More Than One Resolver

performance
(Ricky Saltzer) #1

Hi!

I’m currently running a POC for a new project that will use FoundationDB as its backend. The workload is inherently bursty, with a mix of reads and writes (roughly 20/80, I’d say).

My current configuration is the following

Configuration:
Redundancy mode        - triple
Storage engine         - ssd-2
Coordinators           - 7
Desired Proxies        - 7
Desired Resolvers      - 3
Desired Logs           - 3

Cluster:
FoundationDB processes - 72
Machines               - 18
Memory availability    - 3.7 GB per process on machine with least available
                       >>>>> (WARNING: 4.0 GB recommended) <<<<<
Retransmissions rate   - 0 Hz
Fault Tolerance        - 2 machines
Server time            - 05/30/19 16:46:55

Each machine is a c5d.2xlarge instance. I currently have three machines (4 processes each) dedicated to the stateless class, so 12 stateless processes in total.

My issue is the following:

I cannot get the cluster to recruit more than one resolver. I’ve set the desired amount to 3. I’ve even gone as far as manually running setclass <ip:port> resolution, but for some reason FDB refuses to recruit any additional process as a resolver.

The odd part is, some of my stateless processes have zero roles assigned to them. So you’d think they’d be perfect candidates for the role.

Normally this wouldn’t be an issue, but it seems that the resolver is in fact becoming my bottleneck. My write throughput starts hitting a ceiling at around 250k Hz with my resolver process’s CPU maxed out at 100%.

I’ve attached a recent [1] status json report for anybody who is willing to help :slight_smile:

Most notably, if you search for 10.49.58.119:4503 you will see that the class type is set to resolution, yet the process has zero roles assigned to it.

[1] https://gist.github.com/rickysaltzer/cf6327a26a7fc45a553253f1ed51e19a

Ricky

(Bhaskar Muppana) #2

I don’t know much about increasing resolvers, so I’ll let someone else answer that.

But if the resolver is the only CPU-bound process and all the other processes are doing fine, it’s probably worth reviewing the data model. You should check and make sure you are not generating too many conflict ranges.

Many point gets or sets can create many conflict ranges, and the resolvers then have to do many range comparisons. If you can instead find the smallest range covering them and add that as the conflict range, resolver performance will be better.

Document Layer inserts can be a very good example here. Document Layer stores each JSON field in a different FDB key with the same prefix. For example, let’s say I have a document in the collection employee looking like

{
  "_id": 345,
  "name": "Eric",
  "marks": 90,
  "grade": "A"
}

Note that _id is the primary key, so the document is stored keyed by its _id. It would be stored under 4 different keys:

employee:345: -> _
employee:345:name -> Eric
employee:345:marks -> 90
employee:345:grade -> A

When you insert these keys, that creates 4 write conflict ranges, one for each key. If the document has embedded documents with deep arrays, this gets much worse, and it keeps the resolver very busy. To avoid this, the Document Layer explicitly adds a single write conflict range covering all the keys; in this case, the conflict range for the prefix employee:345:. That cuts the resolver’s work by 4x here, and depending on document size it can make or break performance.
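To make the trick concrete, here is a minimal sketch in Python. The key names are illustrative, and strinc is a local helper mirroring how FDB computes the exclusive end of a prefix range; the commented-out calls at the bottom assume the FoundationDB Python bindings:

```python
# Sketch: collapse N per-key write conflict ranges into one covering range.
# Key names are illustrative, not from any real schema.

def strinc(key: bytes) -> bytes:
    """Smallest key strictly greater than every key with this prefix
    (mirrors FDB's prefix-range end computation)."""
    key = key.rstrip(b"\xff")
    if not key:
        raise ValueError("key must contain a byte != 0xff")
    return key[:-1] + bytes([key[-1] + 1])

def covering_write_conflict_range(keys):
    """Given the keys a transaction writes under one document prefix,
    return a single (begin, end) range covering all of them."""
    begin = min(keys)
    end = strinc(max(keys))
    return begin, end

doc_keys = [
    b"employee:345:",
    b"employee:345:name",
    b"employee:345:marks",
    b"employee:345:grade",
]
begin, end = covering_write_conflict_range(doc_keys)

# With a real fdb transaction you would then do something like:
#   tr.options.set_next_write_no_write_conflict_range()  # before each set
#   tr[key] = value
#   ...
#   tr.add_write_conflict_range(begin, end)  # one range instead of four
```

The resolver then compares one range per document instead of one per field.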

If you already made sure that’s not the case, just ignore this :slight_smile:

(Ricky Saltzer) #3

Hey thanks for your response.

To be transparent, I am using RecordLayer for the workload.

I designed the model to have as few conflicts as possible. The last time I looked, during the peak of the workload there was maybe 1–3 Hz of conflicts.

(Alex Miller) #4

Can you fdbcli> kill; kill all and see if you get a second resolver? (or, really, anything to cause a recovery)

My suspicion is that BetterMasterExists didn’t catch that set_class changing a storage server to a resolver generates a cluster layout that better fits your desired configuration, and thus didn’t cause a recovery to trigger recruiting a second resolver. Otherwise, Evan and I don’t see anything in the recruitment logic that would explain why you aren’t getting a second resolver recruited. Admittedly, our operational tooling doesn’t use set_class, so it wouldn’t entirely surprise me if there was some small bug/bad interaction like this lurking.

(Ryan Worl) #5

This is what I recommended yesterday when this issue first came up for them, though not with that exact syntax; I just said to kill -9 a proxy to trigger a recovery. They were on the three_datacenter config (and have since changed to triple), which may be related (I’m not sure how that interacts with recruitment).

Is there anything that would be exposed in the logs to help diagnose this?

(Ricky Saltzer) #6

That’s a neat trick for force-reinitializing the database topology! Unfortunately, it doesn’t seem to solve my current predicament.

>>> kill all
Attempted to kill 72 processes
...
$ fdbcli -C /tmp/fdb.cluster.grants --exec "status json" | grep -B3 '\"role\" : \"resolver\"'
                "roles" : [
                    {
                        "id" : "bc51311e0dedf06f",
                        "role" : "resolver"

Since this is looking more and more like a bug, I think it might be worth mentioning the origin of this database’s topology.

Back when we first started using FoundationDB for other use cases, we created a Terraform script that orchestrated everything. Unfortunately, we didn’t fully understand the best way to deploy a cluster, so every node was homogeneous, with the same configuration:

  • 4 processes per machine (three_datacenter mode)
    • (2) storage processes
    • (1) transaction process
    • (1) stateless process

That said, with 18 machines we originally had 18 stateless processes, 18 transaction processes, and 36 storage processes.

We’ve since learned we don’t need so many transaction and stateless processes, so we scaled down to running only 7 stateless and 3 transaction processes.

Some of this was done via foundationdb.conf and some through the setclass command.

FWIW - our other cluster, which uses three_datacenter mode and has 56 machines / 216 processes, still only has ONE resolver. That cluster still uses our old topology, so there are a total of 54 stateless processes running in it.

Thank you so much for your help!
Ricky

(Ricky Saltzer) #7

To be fair, we were having the same issue in three_datacenter mode. However, changing to triple did increase our write throughput; as you pointed out (off thread), this was due to the replication factor decreasing from 6x to 3x. Our issue remains the same, though: write throughput bottlenecks at the resolver before anything else.

(Ricky Saltzer) #8

Since I’ve manually set some processes to the resolution class, I should set the record straight. Here is the final setclass listing (minus storage processes).

➜  /tmp fdbcli -C fdb.cluster.grants --exec "setclass" | grep -v 'storage'
There are currently 72 processes in the database:
  10.49.58.58:4500: stateless (set_class)
  10.49.58.58:4501: resolution (set_class)
  10.49.58.58:4502: transaction (command_line)
  10.49.58.58:4503: stateless (command_line)
  10.49.58.119:4500: stateless (set_class)
  10.49.58.119:4501: stateless (set_class)
  10.49.58.119:4502: transaction (command_line)
  10.49.58.119:4503: resolution (set_class)
  10.49.58.190:4500: stateless (set_class)
  10.49.58.190:4501: stateless (set_class)
  10.49.58.190:4502: transaction (command_line)
  10.49.58.190:4503: stateless (command_line)
(Ryan Worl) #9

I did some investigation myself this morning with a 3 node cluster (each 2 cores). Spun it up, got coordinators into a good state, ran configure triple memory and verified transactions worked.

When I ran configure resolvers=2; kill; kill all with unset as the process class in a 3 process cluster, I only ever got one resolver.

When I ran configure resolvers=2; kill; kill all with unset as the process class for 3 processes and 1 additional process set as resolution (each VM is 2 cores), I still only ever got one resolver. The one chosen is the one process set to resolution class, thankfully.

When I ran configure resolvers=2; kill; kill all with unset as the process class for 3 processes and 2 process set as resolution, I finally got 2 resolvers.

I understand “Desired Resolvers” is not “Must Set This Number Resolvers”, but that behavior is not what I would have expected.

(Ricky Saltzer) #10

I’m having trouble reproducing this result.

[1]

➜  /tmp fdbcli -C fdb.cluster.grants --exec "setclass" | egrep 'resolution|unset'
  10.49.58.58:4500: unset (set_class)
  10.49.58.58:4501: resolution (set_class)
  10.49.58.58:4503: unset (set_class)
  10.49.58.119:4500: unset (set_class)
  10.49.58.119:4501: unset (set_class)
  10.49.58.119:4503: resolution (set_class)
  10.49.58.190:4500: unset (set_class)
  10.49.58.190:4501: unset (set_class)
  10.49.58.190:4503: unset (set_class)

[2]

➜  /tmp fdbcli -C fdb.cluster.grants --exec "configure resolvers=2; kill; kill all;"
>>> configure resolvers=2
Configuration changed
>>> kill
...
...
>>> kill all
Attempted to kill 72 processes

[3]

Configuration:
  Redundancy mode        - triple
  Storage engine         - ssd-2
  Coordinators           - 7
  Desired Resolvers      - 2

Cluster:
  FoundationDB processes - 72
  Machines               - 18
  Memory availability    - 3.7 GB per process on machine with least available
                           >>>>> (WARNING: 4.0 GB recommended) <<<<<
  Retransmissions rate   - 8 Hz
  Fault Tolerance        - 2 machines
  Server time            - 05/31/19 10:18:16

Data:
  Replication health     - Healthy (Rebalancing)
  Moving data            - 0.000 GB
  Sum of key-value sizes - 3 MB
  Disk space used        - 20.265 GB

Operating space:
  Storage server         - 174.6 GB free on most full server
  Log server             - 174.6 GB free on most full server

Workload:
  Read rate              - 32 Hz
  Write rate             - 1 Hz
  Transactions started   - 6 Hz
  Transactions committed - 1 Hz
  Conflict rate          - 0 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

[4]

➜  /tmp fdbcli -C fdb.cluster.grants --exec "status json" | grep resolver
            "resolvers" : 2,
                        "role" : "resolver"
(Ryan Worl) #11

Yes, I assume we’d already tried something equivalent to that previously. I just wanted to start from zero and see if I could actually make two resolvers happen. Still, very strange.

Edit: For anyone interested in investigating this as well, Ricky is on 6.0.15 and I just tested 6.1.8. If there are differences in how recruitment works introduced between those two that would be helpful to know.

(Ricky Saltzer) #12

I upgraded the cluster, but I’m still stuck with one resolver :frowning:

(Alex Miller) #13

Ah, I had missed this.

You have the cluster controller and proxy recruited in one process that is

                "locality" : {
                    "dcid" : "7df98dcf291d4eed3c4e863c2f07a715",
                    "machineid" : "1e1b5d03332c24896d2abc1f03f83657",
                    "processid" : "99ac06f627656fe6f4d6aa05a69d0d7b",
                    "zoneid" : "1e1b5d03332c24896d2abc1f03f83657"
                },

your master and proxy are in

                "locality" : {
                    "dcid" : "7df98dcf291d4eed3c4e863c2f07a715",
                    "machineid" : "1e1b5d03332c24896d2abc1f03f83657",
                    "processid" : "8be186ec55e0f9cf9a2f0847ee08631b",
                    "zoneid" : "1e1b5d03332c24896d2abc1f03f83657"
                },

Your unused resolution class process is:

                "locality" : {
                    "dcid" : "643df5d94568f2f25772668935e05cf1",
                    "machineid" : "f942a3745a0086b2d1926a3bde011a2d",
                    "processid" : "411d623bf96e44b31c6516b87138e8a2",
                    "zoneid" : "f942a3745a0086b2d1926a3bde011a2d"
                },

and your used one is

                "locality" : {
                    "dcid" : "7df98dcf291d4eed3c4e863c2f07a715",
                    "machineid" : "1e1b5d03332c24896d2abc1f03f83657",
                    "processid" : "7b0abba967dd30eb4e86fbdaec217086",
                    "zoneid" : "1e1b5d03332c24896d2abc1f03f83657"
                },

So, everything that’s getting recruited is in dcid 7df98dcf291d4eed3c4e863c2f07a715

Which is intended. Definitely so for three_datacenter. I’m not actually sure how triple behaves with multiple dcIds specified, but apparently “like the above” is the answer. Recruiting proxies and resolvers across multiple datacenters would only increase latency (and likely decrease throughput) with little gain.

I strongly suspect that if you take another process in dcId 7df98dcf291d4eed3c4e863c2f07a715 and mark it as resolution, you’ll get a second resolver. You’d need to mark it as resolution and not stateless, because recruitment won’t let a process’s fit degrade with additional recruitments, so you’d need either multiple resolution processes or only stateless processes.

Relevant recruitment code: https://github.com/apple/foundationdb/blob/b30fe29c9fc2ea23d679bf4fdc88daaea067a346/fdbserver/ClusterController.actor.cpp#L675
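A toy model of that preference may help. This is NOT the real ClusterController logic, just an illustration (with made-up addresses) of the two behaviors described above: resolvers are only recruited from the primary dcId, and within it, better-fitting classes win:

```python
# Toy model (NOT the real ClusterController code) of dc-local resolver
# recruitment with class fitness. Addresses and dc names are made up.
from dataclasses import dataclass

FITNESS = {"resolution": 0, "stateless": 1, "unset": 2}  # lower is better

@dataclass
class Proc:
    addr: str
    dcid: str
    cls: str

def recruit_resolvers(procs, primary_dc, desired):
    """Pick up to `desired` resolvers from the primary dc only,
    best class fitness first."""
    candidates = [p for p in procs if p.dcid == primary_dc and p.cls in FITNESS]
    candidates.sort(key=lambda p: FITNESS[p.cls])  # stable sort keeps input order on ties
    return [p.addr for p in candidates[:desired]]

procs = [
    Proc("10.0.1.1:4500", "dc-a", "resolution"),
    Proc("10.0.2.1:4500", "dc-b", "resolution"),  # wrong dc: never chosen
    Proc("10.0.1.2:4500", "dc-a", "stateless"),
]
print(recruit_resolvers(procs, "dc-a", 2))
# → ['10.0.1.1:4500', '10.0.1.2:4500']
```

Note how the dc-b resolution-class process is filtered out before fitness is even considered, which matches the symptom in this thread: a resolution-class process in the "wrong" datacenter sits with zero roles.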

(Ricky Saltzer) #14

Totally worked!!!

➜  /tmp fdbcli -C fdb.cluster.grants --exec "status json" | grep 'resolution'
                "class_type" : "resolution",
                "class_type" : "resolution",

I guess this raises the question: why can’t I have a resolver in each datacenter?

(Alex Miller) #15

I… don’t think you’d want a resolver in each datacenter?

In your cluster, you have your master, proxies, and resolver in the same datacenter. You’ll only incur a WAN round trip as part of writing to the TLogs, but that’s unavoidable, because the entire point of three_datacenter mode is to have writes durably stored in two geographically distributed locations. Resolvers are stateless, so you gain no fault tolerance properties by having them geographically distributed.

If you added a resolver in another DC, then broadcasting to the resolvers and waiting for their replies would cost one WAN round trip, and then you’d pay a second one for the WAN TLog round trip.

I mean, yeah, maybe it’d be nice if a message were surfaced somewhere for the administrator saying “Actual resolvers less than desired resolvers. Please add more resolution/stateless processes to dcId NNN”. But I’m still of the opinion that automatically recruiting resolvers in far-away DCs for the transaction system isn’t universally desirable.

(Ryan Worl) #16

This cluster is deployed with three AZs in one region as the “datacenters”. That is why he was hoping to recruit resolvers in each datacenter, but I don’t think three_datacenter was intended for that use case.

(Ricky Saltzer) #17

Totally makes sense. Thank you for clearing this up for me.

(Ricky Saltzer) #18

Do we think there is a bug where triple replication respects the dcid when it really shouldn’t?

(Alex Miller) #19

With the goal of making the process layout on each AZ more symmetric? IIRC, AZ to AZ communication has a slightly higher price and latency, so my point still somewhat stands.

If the concern is about reserving two resolution processes in each AZ when you’d only ever use 1/3 of them, then I don’t have a great answer. You can run stateless processes instead of resolution, with likely no meaningful change in the resulting cluster’s behavior, as long as you supply enough stateless processes that proxies and resolvers don’t share. You can also technically run no stateless or resolution class processes in one of the three AZs, because if you lose one AZ you’d still be assured of having one left that can run your transaction subsystem.

… we’ve also totally talked through Ricky’s deployment before, and apparently all the details had left my head. Maybe I should go change my logo to a goldfish…

That’s what I’ve been thinking through now, and I haven’t come to a conclusion. My mental model of triple is that it should only care about zoneId, so I’m assuming that this was accidentally introduced when adding three_datacenter or multi-region, and they assumed that you wouldn’t specify dcId for a cluster that’s triple replicated. On the other hand, I feel like triple sort of did the “right” thing here, in that it “helped” you avoid a potentially sub-optimal database configuration. My idiomatic and pragmatic sides are at war.

(Ricky Saltzer) #20

That’s interesting. I wonder if triple replication ensures that the other two replicas are stored on storage processes in separate dcIds.