'Locking coordination' state after process removal

The output of fdbcli> status json includes each process's class settings and the roles it has been recruited for.

See https://pastebin.com/kj5XCNPM (from the thread "Are spikes of 500ms+ MaxRowReadLatency normal?") for an example. The relevant part for one process looks like this:

            "54a7a3995096944c1ecb563e81ff61d9" : {
                "class_source" : "command_line",
                "class_type" : "storage",
                [snip]
                "roles" : [
                    {
                        "role" : "storage",
                        [snip]
                    }
                ],
            },
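If you want to pull those fields out for every process at once, here is a minimal Python sketch. It assumes the per-process entries live under cluster.processes in the status json document (the excerpt above doesn't show the enclosing keys), and SAMPLE_STATUS and summarize_processes are illustrative names, not part of any FoundationDB API.

```python
import json

# Sample shaped like the excerpt above; a real status document has many more
# fields. The enclosing "cluster" / "processes" keys are an assumption here.
SAMPLE_STATUS = """
{
  "cluster": {
    "processes": {
      "54a7a3995096944c1ecb563e81ff61d9": {
        "class_source": "command_line",
        "class_type": "storage",
        "roles": [{"role": "storage"}]
      }
    }
  }
}
"""


def summarize_processes(status):
    """Return {process_id: (class_type, class_source, [role names])}."""
    processes = status.get("cluster", {}).get("processes", {})
    summary = {}
    for pid, info in processes.items():
        # A process with no recruited roles simply yields an empty list.
        roles = [r.get("role") for r in info.get("roles", [])]
        summary[pid] = (info.get("class_type"), info.get("class_source"), roles)
    return summary


if __name__ == "__main__":
    summary = summarize_processes(json.loads(SAMPLE_STATUS))
    for pid, (ctype, source, roles) in summary.items():
        print(f"{pid}: class={ctype} (from {source}), roles={roles}")
```

To run it against a live cluster, you would replace SAMPLE_STATUS with the JSON text produced by status json.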

If you remove as many machines as your replication factor, or more, from a cluster without first excluding them and waiting for the exclude to finish, then you’re going to break your cluster, because some data (including system metadata) will be permanently missing.

The locking_coordinated_state recovery step also waits for the previous generation of TLogs to come back, so that the system metadata can be read out of them. Since you’ve removed at least as many machines as your replication factor, that step is never going to finish.

(I’ve also been confused by this naming, so maybe we should go rename this step sometime…)
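As a rough illustration of the rule above, here is a small Python sketch. It is a rule of thumb based on this post, not an official check: the mapping covers only the single, double, and triple redundancy modes, and removal_risks_data_loss is a hypothetical helper name.

```python
# Number of copies of each piece of data kept by the common redundancy modes.
REPLICATION_FACTOR = {"single": 1, "double": 2, "triple": 3}


def removal_risks_data_loss(redundancy_mode, machines_removed_without_exclude):
    """True if enough un-excluded machines were removed to hold every copy of some data."""
    factor = REPLICATION_FACTOR[redundancy_mode]
    return machines_removed_without_exclude >= factor


if __name__ == "__main__":
    # With "single", removing even one machine without excluding it is unsafe.
    print(removal_risks_data_loss("single", 1))
    # With "triple", two removed machines still leave one copy of everything.
    print(removal_risks_data_loss("triple", 2))
```

Excluding the machines first and waiting for the exclude to finish avoids the problem, because the data is moved off them before they go away.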

I’m confused, though: fdbcli> configure single ssd shouldn’t be enough to bring you back to a working cluster. Running fdbcli> configure new single ssd, and thus throwing away the previous database, might. Did you happen to elide the new by accident when posting, or should I go think harder?