Setting up a cluster with 5 servers

I’ve been attempting to spin up a 5-server cluster and I’m running into issues. Each server has 12 vCPUs and 48 GB of RAM.

Here are the steps I’ve taken:

  • Install fdb via “foundationdb-client_5.2.5-1_amd64.deb” and “foundationdb-server_5.2.5-1_amd64.deb”

  • Update foundationdb.conf to spawn 10 fdbserver processes on each server, 2 of them with class transaction and 3 with class stateless (RAM per server process is set to 4Gi):

    [fdbserver.4500]

    [fdbserver.4501]
    class = transaction

    [fdbserver.4502]
    class = transaction

    [fdbserver.4503]
    class = stateless

    [fdbserver.4504]
    class = stateless

    [fdbserver.4505]
    class = stateless

    [fdbserver.4506]

    [fdbserver.4507]

    [fdbserver.4508]

    [fdbserver.4509]
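
A layout like this can be double-checked with a short script. The following is a minimal Python sketch, assuming the per-process sections follow the `[fdbserver.<port>]` pattern shown above; `SAMPLE_CONF` and `count_process_classes` are hypothetical names used only for illustration, and the sketch ignores any `class` set in a general `[fdbserver]` section.

```python
import configparser
from collections import Counter

# Hypothetical sample that mirrors the sections posted above.
SAMPLE_CONF = """\
[fdbserver.4500]

[fdbserver.4501]
class = transaction

[fdbserver.4502]
class = transaction

[fdbserver.4503]
class = stateless

[fdbserver.4504]
class = stateless

[fdbserver.4505]
class = stateless

[fdbserver.4506]

[fdbserver.4507]

[fdbserver.4508]

[fdbserver.4509]
"""


def count_process_classes(conf_text):
    """Count [fdbserver.<port>] sections by their class setting."""
    # Interpolation is disabled so literal '%' or '$' in values is left alone.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    counts = Counter()
    for section in parser.sections():
        # Skip anything that is not a per-process section.
        if not section.startswith("fdbserver."):
            continue
        # Sections without an explicit class are reported as "unset".
        counts[parser.get(section, "class", fallback="unset")] += 1
    return counts


if __name__ == "__main__":
    for process_class, n in sorted(count_process_classes(SAMPLE_CONF).items()):
        print(f"{process_class}: {n}")
```

Running it against the real file contents instead of the sample should make it easy to confirm that the 2 transaction / 3 stateless split is what was actually written.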

I’m then able to get all 10 server processes running on each machine and I’m able to:

  • set the storage engine as ssd
  • configure all the coordinators with IP1:4500 IP2:4500 IP3:4500 IP4:4500 IP5:4500

I then tried to set redundancy to triple and things stopped working on all machines.

The output from fdbcli was:

Coordination state changed
fdb> configure triple

WARNING: Long delay (Ctrl-C to interrupt)

The database is unavailable; type `status' for more information.
WARNING: The cluster file is not up to date. Type 'status' for more information.

^Cfdb> status

WARNING: Long delay (Ctrl-C to interrupt)

# fdbcli
Using cluster file `/etc/foundationdb/fdb.cluster'.

The database is unavailable; type `status' for more information.

Welcome to the fdbcli. For help, type `help'.
fdb> status

WARNING: Long delay (Ctrl-C to interrupt)

Using cluster file `/etc/foundationdb/fdb.cluster'.

Recruiting new transaction servers.

Need at least 3 log servers, 1 proxies and 1 resolvers.

Have 10 processes on 1 machines.

Timed out trying to retrieve storage servers.

fdb> configure proxies=5

WARNING: Long delay (Ctrl-C to interrupt)

The database is unavailable; type `status' for more information.


^Cfdb> configure logs=8

WARNING: Long delay (Ctrl-C to interrupt)
^Cfdb> status

WARNING: Long delay (Ctrl-C to interrupt)

Using cluster file `/etc/foundationdb/fdb.cluster'.

Recruiting new transaction servers.

Need at least 3 log servers, 1 proxies and 1 resolvers.

Have 10 processes on 1 machines.

Timed out trying to retrieve storage servers.

I’m now getting this status:

  "messages" : [
        {
            "description" : "Unable to start immediate priority transaction after 5 seconds.",
            "name" : "immediate_priority_transaction_start_probe_timeout"
        },
        {
            "description" : "Unable to start default priority transaction after 5 seconds.",
            "name" : "transaction_start_probe_timeout"
        },
        {
            "description" : "Unable to start batch priority transaction after 5 seconds.",
            "name" : "batch_priority_transaction_start_probe_timeout"
        },
        {
            "description" : "Unable to read after 5 seconds.",
            "name" : "read_probe_timeout"
        },
        {
            "description" : "Unable to commit after 5 seconds.",
            "name" : "commit_probe_timeout"
        },
        {
            "description" : "Timed out trying to retrieve storage servers.",
            "name" : "storage_servers_error"
        },
        {
            "description" : "Unable to retrieve all status information.",
            "name" : "status_incomplete",
            "reasons" : [
                {
                    "description" : "Unable to determine if database is locked after 5 seconds."
                },
                {
                    "description" : "Unable to retrieve layer status (Operation aborted because the transaction timed out)."
                },
                {
                    "description" : "Unknown performance state."
                },
                {
                    "description" : "Unknown read state."
                }
            ]
        }
    ],
    "recovery_state" : {
        "description" : "Recruiting new transaction servers.",
        "name" : "recruiting_transaction_servers",
        "required_logs" : 3,
        "required_proxies" : 1,
        "required_resolvers" : 1
    },

What did I do wrong? Should I be setting logs=8 and proxies=5 before trying to configure redundancy? Is there some status I should have waited for after setting coordinators?

It would also be useful to understand whether my configuration is optimal. Should I be setting the classes differently? Are more, smaller machines better?

Did you make sure that all the machines had the same fdb.cluster file (copied from the first one to the others) before configuring triple replication? What exactly do you mean by “configure all the coordinators with IP1:4500 IP2:4500 IP3:4500 IP4:4500 IP5:4500”?

Usually, you are supposed to set up the first machine, edit its fdb.cluster to use its public IP address (restart the service and make sure that you see the public IP in fdbcli), install the other four hosts, replace their fdb.cluster with the one from the first machine (which still has a single coordinator), remove all content in the data folder, make sure they all connect to each other, and only then update the list of coordinators. Once this is done, you can safely change the replication to double or triple.
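
To make the ordering concrete, here is a minimal Python sketch of two of those checks: confirming that every host carries the same fdb.cluster content as the first one, and only then building the fdbcli commands. The function names, host names, and sample cluster-file contents are hypothetical; the script only compares strings and prints commands, it does not talk to the cluster.

```python
def hosts_with_stale_cluster_file(cluster_files):
    """Return hosts whose fdb.cluster text differs from the first host's.

    cluster_files maps host name -> fdb.cluster contents; the first entry
    is treated as the reference (the machine that was set up first).
    """
    hosts = list(cluster_files)
    reference = cluster_files[hosts[0]].strip()
    return [h for h in hosts[1:] if cluster_files[h].strip() != reference]


def setup_commands(coordinator_ips, port=4500, redundancy="triple"):
    """Build the fdbcli commands in the order described above:
    change coordinators first, then raise the redundancy mode."""
    addresses = " ".join(f"{ip}:{port}" for ip in coordinator_ips)
    return [f"coordinators {addresses}", f"configure {redundancy}"]


if __name__ == "__main__":
    # Hypothetical contents; real files use description:ID@ip:port.
    files = {
        "host1": "demo:abc123@10.0.0.1:4500\n",
        "host2": "demo:abc123@10.0.0.1:4500\n",
        "host3": "other:zzz999@127.0.0.1:4500\n",
    }
    stale = hosts_with_stale_cluster_file(files)
    if stale:
        print("Copy fdb.cluster to these hosts first:", ", ".join(stale))
    else:
        for command in setup_commands(["10.0.0.1", "10.0.0.2", "10.0.0.3"]):
            print(command)
```

The point of the guard is that the coordinator change is only suggested once no host is still pointing at its own freshly installed single-node cluster.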

Ah, I missed copying over fdb.cluster so I did that. I then deleted the data/ directory and restarted all services.

Now I’m getting an issue about old transaction servers:

fdb> status

Using cluster file `fdb.cluster'.

Locking old transaction servers. Verify that at least one transaction server
from the previous generation is running.

Need one or more of the following log servers:
c1d9b9ff01d4288324325d7d2d8c8d72, cdd15624702edffeccae6c7e45dcc7b7

I’m going to try deleting everything in data/ again and restarting the services.

You can actually use a node from another cluster as a coordinator for the current cluster. So when you forced all the nodes to be coordinators, most probably the others were assisting the first one with this task without actually joining it. You ended up with 5 hosts, each in its own single-host, 10-process cluster.

So when you ran configure triple, most probably you asked the host you were connected to to recruit at least two other hosts, but since it was seeing only one (itself), it had to wait for new friends to join so that it had at least three distinct machine ids.
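
The symptom was visible in the status output quoted earlier ("Have 10 processes on 1 machines."). As a rough illustration, the Python sketch below pulls the machine count out of that line and compares it with the three machines mentioned above. The function names are hypothetical, and the threshold of 3 is taken from this thread rather than from any official sizing rule.

```python
import re

# Matches the summary line printed by `status`, e.g.
# "Have 10 processes on 1 machines."
STATUS_LINE = re.compile(r"Have (\d+) processes on (\d+) machines")


def machines_seen(status_text):
    """Return the machine count reported in the status text, or None."""
    match = STATUS_LINE.search(status_text)
    return int(match.group(2)) if match else None


def enough_machines_for_triple(status_text, required=3):
    """True only if the status text reports at least `required` machines."""
    machines = machines_seen(status_text)
    return machines is not None and machines >= required


if __name__ == "__main__":
    status = "Recruiting new transaction servers.\nHave 10 processes on 1 machines."
    print(machines_seen(status))
    print(enough_machines_for_triple(status))
```

Checking this number before running configure triple would have shown that each host was still alone in its own cluster.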

Thanks, good to know. Everything appears to be up and running after deleting data/ again. I forgot to do that before re-launching after copying over fdb.cluster.