How to revert role assignments changed after system upgrades

After a Kubernetes version upgrade and a host OS upgrade in our DCs, I found that the roles of the fdb processes in DC2 had changed. In our 3-DC architecture, DC2 contains only TX pods; we have 15 TX pods there. Before the system upgrades, the role assignments looked like the following.

DC=dc2, IP=xxx:4000:tls, type=transaction, role=coordinator role=log
DC=dc2, IP=xxx:4000:tls, type=transaction, role=log
DC=dc2, IP=xxx:4000:tls, type=transaction, role=log
DC=dc2, IP=xxx:4000:tls, type=transaction, role=log
DC=dc2, IP=xxx:4000:tls, type=transaction, role=log
DC=dc2, IP=xxx:4000:tls, type=transaction, role=log
DC=dc2, IP=xxx:4000:tls, type=transaction, role=coordinator role=log
DC=dc2, IP=xxx:4000:tls, type=transaction, role=log
DC=dc2, IP=xxx:4000:tls, type=transaction, role=log
DC=dc2, IP=xxx:4000:tls, type=transaction, role=log
DC=dc2, IP=xxx:4000:tls, type=transaction, role=log
DC=dc2, IP=xxx:4000:tls, type=transaction, role=log
DC=dc2, IP=xxx:4000:tls, type=transaction, role=log
DC=dc2, IP=xxx:4000:tls, type=transaction, role=coordinator role=log
DC=dc2, IP=xxx:4000:tls, type=transaction, role=log

After the system upgrades, they became like this:

DC=dc2, IP=xxx:4000:tls, type=transaction, role=coordinator
DC=dc2, IP=xxx:4000:tls, type=transaction,
DC=dc2, IP=xxx:4000:tls, type=transaction, role=log
DC=dc2, IP=xxx:4000:tls, type=transaction,
DC=dc2, IP=xxx:4000:tls, type=transaction,
DC=dc2, IP=xxx:4000:tls, type=transaction,
DC=dc2, IP=xxx:4000:tls, type=transaction,
DC=dc2, IP=xxx:4000:tls, type=transaction, role=coordinator
DC=dc2, IP=xxx:4000:tls, type=transaction,
DC=dc2, IP=xxx:4000:tls, type=transaction, role=coordinator
DC=dc2, IP=xxx:4000:tls, type=transaction, role=log
DC=dc2, IP=xxx:4000:tls, type=transaction,
DC=dc2, IP=xxx:4000:tls, type=transaction,
DC=dc2, IP=xxx:4000:tls, type=transaction,
DC=dc2, IP=xxx:4000:tls, type=transaction, role=log

Our fdb cluster is running fine and is performant, but I’d like to change the role assignments back. How do I achieve that?

Thank you.

It seems this issue is rare in the FDB community, but it’s important to us. We developed monitoring programs to keep an eye on any changes to the total number of fdb processes and their role assignments. Most of the time a change results from an fdb node being down, and we send alerts on these changes. This is why I’d like to revert the role assignments.
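Roughly, the check is along these lines (a simplified sketch; the cluster file path and the way we route alerts are placeholders for our internal setup):

    import json
    import subprocess

    def role_snapshot(cluster_file="/etc/foundationdb/fdb.cluster"):
        # Pull machine-readable status from fdbcli (cluster file path is a placeholder).
        out = subprocess.check_output(
            ["fdbcli", "-C", cluster_file, "--exec", "status json"])
        processes = json.loads(out)["cluster"]["processes"]
        # Map each process address to the roles it currently holds.
        return {p["address"]: sorted(r["role"] for r in p.get("roles", []))
                for p in processes.values()}

    def changes(previous, current):
        # Any process that appeared, disappeared, or changed roles triggers an alert.
        return [(addr, previous.get(addr), current.get(addr))
                for addr in previous.keys() | current.keys()
                if previous.get(addr) != current.get(addr)]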

The 3-DC architecture I mentioned above is the 2-Region/3-AZ architecture recommended by FDB.

@mengxu @alexmiller Can you help look into the issue? Thank you.

Hi @lehu ,

I didn’t quite get the question. Did you mean:

  1. after the change, the number of tLogs in dc2 decreased and you want to keep the same number of tLogs in dc2? or
  2. the coordinators’ IPs in dc2 changed? or
  3. both?

cc. our SRE team @john_brownlee and @mbhaskar

Hi @mengxu, it’s your #1. The number of processes assigned the “log” role decreased, from 15 “log” processes to 3.

I’d like to get back to the original, i.e., 15 processes with the “log” role.

Did the total number of recruited tLogs decrease?

That’s right. We still have 15 processes, but only 3 are assigned the log role now, vs. 15 previously.

I see. My guess is that the CC recruited a different set of tLogs when the cluster bounced.
The recruitment logic is in ClusterController.actor.cpp: foundationdb/ClusterController.actor.cpp at c98c6ac3be0022d68fc725fee849ebc56415fba3 · apple/foundationdb · GitHub

A recent PR touches the recruitment logic and can point you to the related code.

@alexmiller Do you know a way to force a cluster to a fixed layout?

(If we reduce the number of transaction processes in a DC, the CC will recruit tLogs in other DCs. But I don’t think that’s an ideal approach.)

A different set of TLogs would be quite common. A different number of TLogs is quite suspicious.

To double check the simplest case, if you used some automation to do the upgrade, are you sure it didn’t also change the configuration of the database? Can you confirm that the desired number of TLogs is still 15?

No, we didn’t change the db config.

The way we upgraded FDB was:
(1) build a Docker image with the new fdb version;
(2) restart the pods so they pull the new image.
We didn’t change anything else in the db.

Here is the current config from status json:

    "configuration" : {
        "coordinators_count" : 9,
        "excluded_servers" : [
        ],
        "log_spill" : 2,
        "logs" : 14,
        "proxies" : 16, ......

BTW, the fdbcli “configure” command (v6.2.27) only prints the usage message.
Can we make it print the current settings as well when no new value is given?

    fdb> configure
    Usage: configure [new] <single|double|triple|three_data_hall|three_datacenter|ssd|memory|proxies=<PROXIES>|logs=<LOGS>|resolvers=<RESOLVERS>>*
    fdb> configure logs
    Usage: configure [new] <single|double|triple|three_data_hall|three_datacenter|ssd|memory|proxies=<PROXIES>|logs=<LOGS>|resolvers=<RESOLVERS>>*

The status json there is clipped off a bit too soon. Given that you said that DC2 is the satellite in your multi-region configuration, the relevant setting would be "satellite_logs" in the region config. Would you mind pastebin’ing the whole configuration?

And I noticed as well, while double-checking the commands, that it’s weirdly difficult to pull out the full configuration setting. Filing an issue as a feature request for configure to print the full configuration would be pretty reasonable.
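In the meantime, a quick sketch of pulling the full configuration stanza straight out of status json (assuming fdbcli is on the PATH and can reach the cluster with the default cluster file):

    import json
    import subprocess

    # Print only the "configuration" section of status json, nicely formatted.
    out = subprocess.check_output(["fdbcli", "--exec", "status json"])
    configuration = json.loads(out)["cluster"]["configuration"]
    print(json.dumps(configuration, indent=4, sort_keys=True))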

Here you go. But there is no satellite_logs in the config. I searched the whole json; it isn’t anywhere else either.

    "configuration" : {
        "coordinators_count" : 9,
        "excluded_servers" : [
        ],
        "log_spill" : 2,
        "logs" : 14,
        "proxies" : 16,
        "redundancy_mode" : "triple",
        "regions" : [
            {
                "datacenters" : [
                    {
                        "id" : "dc1",
                        "priority" : 2
                    },
                    {
                        "id" : "dc2",
                        "priority" : 0,
                        "satellite" : 1
                    }
                ],
                "satellite_redundancy_mode" : "one_satellite_double"
            },
            {
                "datacenters" : [
                    {
                        "id" : "dc3",
                        "priority" : 1
                    }
                ]
            }
        ],
        "storage_engine" : "ssd-2",
        "usable_regions" : 2
    },

satellite_logs defaults to 3, so that explains why you have 3 satellite logs. If you’d like it to be something else, you’ll need to modify your region config to add "satellite_logs": N.

Configuration — FoundationDB 6.2 has some explanation of satellite_logs and examples of how to set it.
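For example, the first region stanza from your config would become roughly the following (15 here is just a stand-in for whatever count you actually want):

    {
        "datacenters" : [
            {
                "id" : "dc1",
                "priority" : 2
            },
            {
                "id" : "dc2",
                "priority" : 0,
                "satellite" : 1
            }
        ],
        "satellite_redundancy_mode" : "one_satellite_double",
        "satellite_logs" : 15
    }

and the full "regions" document is applied with fdbcli’s fileconfigure command against a JSON file containing it, as described on that Configuration page.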

It’s still interesting that you had 15 log processes before. How are you determining which processes are running a log role? Is that coming from somewhere in status?

I wrote a script to parse the status json and extract roles.
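The gist of it is roughly the following (simplified; it assumes the default cluster file, so fdbcli needs no -C flag):

    import json
    import subprocess

    # Print one line per process in the same form as the listings above.
    out = subprocess.check_output(["fdbcli", "--exec", "status json"])
    status = json.loads(out)

    for proc in status["cluster"]["processes"].values():
        dc = proc.get("locality", {}).get("dcid", "?")
        roles = " ".join("role=" + r["role"] for r in proc.get("roles", []))
        print("DC={}, IP={}, type={}, {}".format(
            dc, proc["address"], proc.get("class_type", "?"), roles))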