Can't scale proxies

Hello, I’m running a three-node ssd cluster in double redundancy mode (configuration posted below). I can’t seem to scale the proxies through fdbcli:

fdb> configure proxies=3
Configuration changed

Yet when I grep the roles, there is only one proxy. Not that I actually need to scale right now, but it makes me uneasy: why isn’t the proxy count increasing? There are plenty of stateless processes without roles:

13:44:27 rr-kv-db1.vinted.net /etc/foundationdb$ echo 'status json' | docker exec -i foundationdb fdbcli | tail -n +7 | jq '.cluster.processes | keys[] as $k | "\(.[$k].address) \([.[$k].roles[].role])"' | sort | grep '\[\]'
"10.33.14.18:4501 []"
"10.33.14.18:4502 []"
"10.33.14.18:4503 []"
"10.33.14.18:4504 []"
"10.33.14.18:4510 []"
"10.34.13.18:4512 []"
"10.35.12.19:4501 []"
"10.35.12.19:4502 []"
"10.35.12.19:4503 []"
"10.35.12.19:4504 []"
"10.35.12.19:4505 []"
"10.35.12.19:4506 []"
"10.35.12.19:4507 []"
"10.35.12.19:4508 []"
"10.35.12.19:4509 []"
"10.35.12.19:4511 []"

Yet only one proxy:

13:44:40 rr-kv-db1.vinted.net /etc/foundationdb$ echo 'status json' | docker exec -i foundationdb fdbcli | tail -n +7 | jq '.cluster.processes | keys[] as $k | "\(.[$k].address) \([.[$k].roles[].role])"' | sort | grep 'proxy'
"10.34.13.18:4502 [\"proxy\"]"
13:44:45 rr-kv-db1.vinted.net /etc/foundationdb$

Here is the status output:

fdb> status

Using cluster file `/etc/foundationdb/fdb.cluster'.

Configuration:
  Redundancy mode        - double
  Storage engine         - ssd-2
  Coordinators           - 3
  Desired Proxies        - 3
  Desired Resolvers      - 1
  Desired Logs           - 6
  Usable Regions         - 1

Cluster:
  FoundationDB processes - 149
  Zones                  - 3
  Machines               - 3
  Memory availability    - 7.1 GB per process on machine with least available
  Fault Tolerance        - 1 machines
  Server time            - 05/06/21 13:42:11

Data:
  Replication health     - Healthy
  Moving data            - 0.000 GB
  Sum of key-value sizes - 2.678 GB
  Disk space used        - 14.370 GB

Operating space:
  Storage server         - 1796.5 GB free on most full server
  Log server             - 1796.5 GB free on most full server

Workload:
  Read rate              - 1780 Hz
  Write rate             - 1713 Hz
  Transactions started   - 53 Hz
  Transactions committed - 43 Hz
  Conflict rate          - 2 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Client time: 05/06/21 13:42:10

Could anyone help me?

Here is the .conf file from one of the servers (identical on all servers except for the datacenter part):

## foundationdb.conf
##
## Configuration file for FoundationDB server processes
## Full documentation is available at
## https://apple.github.io/foundationdb/configuration.html#the-configuration-file

[fdbmonitor]
user = root
group = root

[general]
restart_delay = 0
## by default, restart_backoff = restart_delay_reset_interval = restart_delay
initial_restart_delay = 0
restart_backoff = 1.0
restart_delay_reset_interval = 60
cluster_file = /etc/foundationdb/fdb.cluster
# delete_envvars = FDB_CLUSTER_FILE_CONTENTS FDB_PORT FDB_NETWORKING_MODE FDB_PROCESS_CLASS FDB_COORDINATOR FDB_COORDINATOR_PORT
kill_on_configuration_change = true

## Default parameters for individual fdbserver processes
[fdbserver]
command = /usr/bin/fdbserver
public_address = 10.35.12.19:$ID
listen_address = public
# logsize = 10MiB
# maxlogssize = 100MiB
# machine_id =
datacenter_id = rr
# class =
memory = 8GiB
# storage_memory = 1GiB
cache_memory = 2GiB
datadir = /var/lib/foundationdb/data/$ID
logdir = /var/log/foundationdb
# metrics_cluster =
# metrics_prefix =

[fdbserver.4500]
class = storage

[fdbserver.4501]
class = stateless
[fdbserver.4502]
class = stateless
[fdbserver.4503]
class = stateless
[fdbserver.4504]
class = stateless
[fdbserver.4505]
class = stateless
[fdbserver.4506]
class = stateless
[fdbserver.4507]
class = stateless
[fdbserver.4508]
class = stateless
[fdbserver.4509]
class = stateless

[fdbserver.4510]
class = log
[fdbserver.4511]
class = log
[fdbserver.4512]
class = log

[fdbserver.4520]
class = storage
[fdbserver.4521]
class = storage
[fdbserver.4522]
class = storage
[fdbserver.4523]
class = storage
[fdbserver.4524]
class = storage
[fdbserver.4525]
class = storage
[fdbserver.4526]
class = storage
[fdbserver.4527]
class = storage
[fdbserver.4528]
class = storage
[fdbserver.4529]
class = storage
[fdbserver.4530]
class = storage
[fdbserver.4531]
class = storage
[fdbserver.4532]
class = storage
[fdbserver.4533]
class = storage
[fdbserver.4534]
class = storage
[fdbserver.4535]
class = storage
[fdbserver.4536]
class = storage
[fdbserver.4537]
class = storage
[fdbserver.4538]
class = storage
[fdbserver.4539]
class = storage
[fdbserver.4540]
class = storage
[fdbserver.4541]
class = storage
[fdbserver.4542]
class = storage
[fdbserver.4543]
class = storage
[fdbserver.4544]
class = storage
[fdbserver.4545]
class = storage
[fdbserver.4546]
class = storage
[fdbserver.4547]
class = storage
[fdbserver.4548]
class = storage
[fdbserver.4549]
class = storage
[fdbserver.4550]
class = storage
[fdbserver.4551]
class = storage
[fdbserver.4552]
class = storage
[fdbserver.4553]
class = storage
[fdbserver.4554]
class = storage
[fdbserver.4555]
class = storage
[fdbserver.4556]
class = storage
[fdbserver.4557]
class = storage
[fdbserver.4558]
class = storage
[fdbserver.4559]
class = storage

I removed datacenter_id from all nodes, and then it started working. Is this behaviour documented somewhere?
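
For reference, the edit was just removing (or commenting out) the datacenter_id line in the [fdbserver] defaults on each node, roughly like this (assuming, per my note above, that each node previously had its own id):

[fdbserver]
command = /usr/bin/fdbserver
public_address = 10.35.12.19:$ID
listen_address = public
# datacenter_id = rr   (was different on each node, so each node formed its own datacenter)
memory = 8GiB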

I can’t find anywhere in the documentation where this is explained, but I believe what’s happening here is that the cluster intentionally tries to recruit all of the stateless transaction subsystem processes in the same datacenter. This is because they all need to interact with each other frequently, and it is costly if that communication happens over a high-latency link.
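
As a sanity check, you can dump each process’s reported datacenter next to its roles with the same jq pattern used above (a sketch; I’m assuming the locality.dcid field of status json here):

echo 'status json' | docker exec -i foundationdb fdbcli | tail -n +7 | jq '.cluster.processes | keys[] as $k | "\(.[$k].address) \(.[$k].locality.dcid) \([.[$k].roles[].role])"' | sort

If the stateless roles all land under a single dcid, that would support this theory.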

I’m not actually sure why it would be unable to recruit multiple proxies on the same host when you have a bunch of stateless class processes there. I didn’t think there were any restrictions that prevented that, but I could be mistaken.

Yes, I added more stateless processes to the nodes, and now I can scale. Nice feature, but it should be documented.
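
For anyone verifying the same thing, here is a quick way to compare the desired proxy count with what actually got recruited (a sketch; it assumes the proxies field appears under .cluster.configuration in status json, which should be the case once it has been set explicitly):

echo 'status json' | docker exec -i foundationdb fdbcli | tail -n +7 | jq '{desired: .cluster.configuration.proxies, actual: [.cluster.processes[].roles[].role | select(. == "proxy")] | length}'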

It seems there is a bug in FoundationDB where it doesn’t recruit the desired number of proxies on the available stateless processes.

This problem has also been mentioned in Identifying number of proxies - #10 by osamarin.

I submitted PR Fixed unability to run more than one commit proxy in small configurations by oleg68 · Pull Request #10411 · apple/foundationdb · GitHub, which should fix this bug upstream.