How to revert role assignments changed after system upgrades

After a Kubernetes version upgrade and a host OS upgrade in our DCs, I found that the roles of the fdb processes in DC2 had changed. In our 3-DC architecture, DC2 contains only TX pods; we have 15 TX pods there. Before the system upgrades, the role assignments looked like the following.

DC=dc2, IP=xxx:4000:tls, type=transaction, role=coordinator role=log
DC=dc2, IP=xxx:4000:tls, type=transaction, role=log
DC=dc2, IP=xxx:4000:tls, type=transaction, role=log
DC=dc2, IP=xxx:4000:tls, type=transaction, role=log
DC=dc2, IP=xxx:4000:tls, type=transaction, role=log
DC=dc2, IP=xxx:4000:tls, type=transaction, role=log
DC=dc2, IP=xxx:4000:tls, type=transaction, role=coordinator role=log
DC=dc2, IP=xxx:4000:tls, type=transaction, role=log
DC=dc2, IP=xxx:4000:tls, type=transaction, role=log
DC=dc2, IP=xxx:4000:tls, type=transaction, role=log
DC=dc2, IP=xxx:4000:tls, type=transaction, role=log
DC=dc2, IP=xxx:4000:tls, type=transaction, role=log
DC=dc2, IP=xxx:4000:tls, type=transaction, role=log
DC=dc2, IP=xxx:4000:tls, type=transaction, role=coordinator role=log
DC=dc2, IP=xxx:4000:tls, type=transaction, role=log

After the system upgrades, they became like this:

DC=dc2, IP=xxx:4000:tls, type=transaction, role=coordinator
DC=dc2, IP=xxx:4000:tls, type=transaction,
DC=dc2, IP=xxx:4000:tls, type=transaction, role=log
DC=dc2, IP=xxx:4000:tls, type=transaction,
DC=dc2, IP=xxx:4000:tls, type=transaction,
DC=dc2, IP=xxx:4000:tls, type=transaction,
DC=dc2, IP=xxx:4000:tls, type=transaction,
DC=dc2, IP=xxx:4000:tls, type=transaction, role=coordinator
DC=dc2, IP=xxx:4000:tls, type=transaction,
DC=dc2, IP=xxx:4000:tls, type=transaction, role=coordinator
DC=dc2, IP=xxx:4000:tls, type=transaction, role=log
DC=dc2, IP=xxx:4000:tls, type=transaction,
DC=dc2, IP=xxx:4000:tls, type=transaction,
DC=dc2, IP=xxx:4000:tls, type=transaction,
DC=dc2, IP=xxx:4000:tls, type=transaction, role=log

Our fdb cluster is running fine and is performant, but I’d like to change the role assignments back. How do I achieve that?

Thank you.

It seems this issue is rare in the FDB community, but it’s important to us. We developed monitoring programs to keep an eye on any changes to the total number of fdb processes and their role assignments. Most of the time a change results from an fdb node being down, and we send alerts on these changes. This is why I’d like to revert the role assignments.
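Roughly, the check is along these lines (a simplified sketch; the cluster file path and the way we route alerts are placeholders for our internal setup):

    import json
    import subprocess

    def role_snapshot(cluster_file="/etc/foundationdb/fdb.cluster"):
        # Pull machine-readable status from fdbcli (cluster file path is a placeholder).
        out = subprocess.check_output(
            ["fdbcli", "-C", cluster_file, "--exec", "status json"])
        processes = json.loads(out)["cluster"]["processes"]
        # Map each process address to the roles it currently holds.
        return {p["address"]: sorted(r["role"] for r in p.get("roles", []))
                for p in processes.values()}

    def changes(previous, current):
        # Any process that appeared, disappeared, or changed roles triggers an alert.
        return [(addr, previous.get(addr), current.get(addr))
                for addr in previous.keys() | current.keys()
                if previous.get(addr) != current.get(addr)]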

The 3-DC architecture I mentioned above is the 2-Region/3-AZ architecture recommended by FDB.

@mengxu @alexmiller Can you help look into the issue? Thank you.

Hi @lehu ,

I didn’t quite get the question. Did you mean:

  1. after the change, the number of tLogs in dc2 decreased and you want to keep the same number of tLogs in dc2? or
  2. the coordinators’ IPs in dc2 changed? or
  3. both?

cc. our SRE team @john_brownlee and @mbhaskar

Hi @mengxu, it’s your #1. The number of processes assigned the “log” role decreased, from 15 “log” processes to 3.

I’d like to get back to the original, i.e., 15 processes with the “log” role.

Did the total number of recruited tLogs decrease?

That’s right. We still have 15 processes, but only 3 are assigned the log role now, vs. 15 previously.

I see. My guess is that the CC recruited a different set of tLogs when the cluster bounced.
The recruitment logic is in ClusterController.actor.cpp: foundationdb/ClusterController.actor.cpp at c98c6ac3be0022d68fc725fee849ebc56415fba3 · apple/foundationdb · GitHub

A recent PR touches the recruitment logic and can point you to the related code.

@alexmiller Do you know a way to force a cluster to a fixed layout?

(If we reduce the number of transaction processes in a DC, the CC will recruit tLogs in other DCs. But I don’t think that’s an ideal approach.)

A different set of TLogs would be quite common. A different number of TLogs is quite suspicious.

To double check the simplest case, if you used some automation to do the upgrade, are you sure it didn’t also change the configuration of the database? Can you confirm that the desired number of TLogs is still 15?

No, we didn’t change the db config.

The way we upgraded FDB was:
(1) build a Docker image with the new fdb version;
(2) restart the pods so they pull the new image.
We didn’t change anything else in the db.

Here is the current config from status json:

    "configuration" : {
        "coordinators_count" : 9,
        "excluded_servers" : [
        ],
        "log_spill" : 2,
        "logs" : 14,
        "proxies" : 16, ......

BTW, the fdbcli “configure” command (v6.2.27) only prints the usage message.
Can we make it print the current settings as well when no new value is given?

    fdb> configure
    Usage: configure [new] <single|double|triple|three_data_hall|three_datacenter|ssd|memory|proxies=<PROXIES>|logs=<LOGS>|resolvers=<RESOLVERS>>*
    fdb> configure logs
    Usage: configure [new] <single|double|triple|three_data_hall|three_datacenter|ssd|memory|proxies=<PROXIES>|logs=<LOGS>|resolvers=<RESOLVERS>>*

The status json there is clipped off a bit too soon. Given that you said that DC2 is the satellite in your multi-region configuration, the relevant setting would be "satellite_logs" in the region config. Would you mind pastebin’ing the whole configuration?

And I noticed as well, while double-checking the commands, that it’s weirdly difficult to pull out the full configuration setting. Filing an issue as a feature request for configure to print the full configuration would be pretty reasonable.
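In the meantime, a quick sketch of pulling the full configuration stanza straight out of status json (assuming fdbcli is on the PATH and can reach the cluster with the default cluster file):

    import json
    import subprocess

    # Print only the "configuration" section of status json, nicely formatted.
    out = subprocess.check_output(["fdbcli", "--exec", "status json"])
    configuration = json.loads(out)["cluster"]["configuration"]
    print(json.dumps(configuration, indent=4, sort_keys=True))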

Here you go. But there is no satellite_logs in the config. I searched the whole json; it isn’t anywhere else either.

    "configuration" : {
        "coordinators_count" : 9,
        "excluded_servers" : [
        ],
        "log_spill" : 2,
        "logs" : 14,
        "proxies" : 16,
        "redundancy_mode" : "triple",
        "regions" : [
            {
                "datacenters" : [
                    {
                        "id" : "dc1",
                        "priority" : 2
                    },
                    {
                        "id" : "dc2",
                        "priority" : 0,
                        "satellite" : 1
                    }
                ],
                "satellite_redundancy_mode" : "one_satellite_double"
            },
            {
                "datacenters" : [
                    {
                        "id" : "dc3",
                        "priority" : 1
                    }
                ]
            }
        ],
        "storage_engine" : "ssd-2",
        "usable_regions" : 2
    },

satellite_logs defaults to 3, so that explains why you have 3 satellite logs. If you’d like it to be something else, you’ll need to modify your region config to add "satellite_logs": N.

Configuration — FoundationDB 6.2 has some explanation of satellite_logs and examples of how to set it.
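For example, the first region stanza from your config would become roughly the following (15 here is just a stand-in for whatever count you actually want):

    {
        "datacenters" : [
            {
                "id" : "dc1",
                "priority" : 2
            },
            {
                "id" : "dc2",
                "priority" : 0,
                "satellite" : 1
            }
        ],
        "satellite_redundancy_mode" : "one_satellite_double",
        "satellite_logs" : 15
    }

and the full "regions" document is applied with fdbcli’s fileconfigure command against a JSON file containing it, as described on that Configuration page.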

It’s still interesting that you had 15 log processes before. How are you determining which processes are running a log role? Is that coming from somewhere in status?

I wrote a script to parse the status json and extract roles.
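The gist of it is roughly the following (simplified; it assumes the default cluster file, so fdbcli needs no -C flag):

    import json
    import subprocess

    # Print one line per process in the same form as the listings above.
    out = subprocess.check_output(["fdbcli", "--exec", "status json"])
    status = json.loads(out)

    for proc in status["cluster"]["processes"].values():
        dc = proc.get("locality", {}).get("dcid", "?")
        roles = " ".join("role=" + r["role"] for r in proc.get("roles", []))
        print("DC={}, IP={}, type={}, {}".format(
            dc, proc["address"], proc.get("class_type", "?"), roles))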