'Locking coordination' state after process removal

I have a fresh database with ‘configure new single ssd’. I start with 4 processes in foundationdb.conf. Then I remove 3 of them. Result of 'status ’ is now:

Locking coordination state. Verify that a majority of coordination server
processes are active.

  127.0.0.1:4500  (reachable)

The database doesnt work anymore and does not seem to ever recover.
It does work again after I do ‘configure single ssd’ again, but why is that necessary when it works fine to remove 2 or 1 process but not 3. There is no indication of what is wrong here.
I think this must be a case of some internal statistics that are causing conflicts, I heard about process classes
Is there a way to show the roles/classes in the status output for processes?

A majority of coordinators must be available for the database to be available. By removing three processes from a four process cluster there is no way for a majority to be available.

Running fdbcli --exec "status details" will show you where your coordinators are and if they are reachable or not.

There were only 4 processes and the first one, the 4500 coordinator is still available. So that’s my point, it shows the same as what I already posted above. In my opinion it makes no sense that the cluster is erroring out in this case. 1 out of 1 processes is still a coordinator, it should work fine. It also doesnt answer my question about how to see which of the processes are log servers, coordinators or something else.

fdbcli> status json includes the process class settings and role recruitments.

See https://pastebin.com/kj5XCNPM (from Are spikes of 500ms+ MaxRowReadLatency normal?) for an example.

            "54a7a3995096944c1ecb563e81ff61d9" : {
                "class_source" : "command_line",
                "class_type" : "storage",
                [snip]
                "roles" : [
                    {
                        "role" : "storage",
                        [snip]
                    }
                ],
            },

If you remove Replication Factor machines or more from a cluster without excluding them first, and waiting for the exclude to finish, then you’re going to break your cluster, because there’s data (including system metadata) that will be permanently missing.

The recovery step of locking_coordinated_state also waits for the previous generation of TLogs to come back, so that we can read out the system metadata. As you’ve removed >=Replication Factor number of machines, that’s never going to finish.

(I’ve also been confused by this naming, so maybe we should go rename this step sometime…)

I’m confused though that fdbcli> configure single ssd shouldn’t bring you back to a working cluster. Running fdbcli> configure new single ssd and thus throwing away the previous database might? Did you happen to elide the new by accident when posting, or should I go think harder?

If you remove Replication Factor machines or more from a cluster without excluding them first, and waiting for the exclude to finish, then you’re going to break your cluster, because there’s data (including system metadata) that will be permanently missing.

I didnt remove any machines, I removed processes! Because when looking at the docs, I dont see how “1 machine 1 process” suddenly shouldnt be a workable state just because I temporarily increase the amount of processes and then remove them again.
In my understanding I dont have any replication here to begin with. If I am violating something then it is invisible to me and the docs dont mention it.

I dont see any indication that I’m violating something called a ‘Replication Factor’ when reading this: https://apple.github.io/foundationdb/configuration.html#single-datacenter-modes

I’m confused though that fdbcli> configure single ssd shouldn’t bring you back to a working cluster. Running fdbcli> configure new single ssd and thus throwing away the previous database might? Did you happen to elide the new by accident when posting, or should I go think harder?

well today I tried it again and now it seems completely arbitrary to me. Sometimes it locks up even if I just remove 1 of 4 processes. Sometimes only when I remove 2. It definitely always happen when I remove 3 at the same time.

I made two videos to show the behavior: (at first it seems fine, processes just say “no metrics available” but seconds later the whole thing goes into error mode)
https://webm.red/Fyct

second video I cant believe that it’s working again at first but then I try removing 2 at the same time and it goes into error mode again:
https://webm.red/u80p

I cant replicate the behavior of the ‘configure single ssd’ thing again, I do believe that it happened just like I said.

Hm. That’s a fair point. Let me give you more context to explain why it’s described this way, but I do see your point that this is confusing for the case of single replication.

single, double, and triple ask FDB to spread 1, 2, of 3 copies of your data across “zones”, where a zone is however you wish to define your failure domain. The zone is settable on the command line via --zone_id.

If you don’t provide a zone, the default is that a machine is a failure domain. This is coordinated via fdbserver opening a shared memory segment (/dev/shm/fdbserver, IIRC) and generates a random UUID and stashes it in the shared memory file. Future fdbserver processes will open the same file, and then see the UUID, and use that as its zone. Thus, all fdbserver processes will agree on a zoneId, and thus present themselves as one failure domain, and FDB will appropriately not place all three copies of triple replicated data on the same machine.

Tangentially but notably, if one uses docker, then it’s likely that the /dev/shm/fdbserver file won’t be shared across your fdbserver instances, and one might actually trick FDB into putting multiple copies of data in one failure domain.

When using single replication, you’re only storing one copy of each piece of data. On-disk storage is tied to a process. This replication is important, because if one removes a process with double or triple, FDB can make another copy of the data from the copy that still remained in some other process. As you’re running with single replication, if you remove a process, there’s no other place that FDB can go to re-replicate it. Thus, FDB must instead wait for an fdbserver process to reappear that can offer the missing data.

Adding processes to a cluster will allow data distribution to redistribute data to the new processes. Once data is assigned to these processes, removing them will cause a problem due to everything described above.


If in the docs, I change it to:

single double triple
Best for 1-2 machines 3-4 machines 5+ machines
Replication 1 copy 2 copy 3 copy
# live machines to make progress 1 2 3
Minimum # of machines for fault tolerance impossible 3 4
Ideal # of coordination servers 1 3 5
# simultaneous failures after which data may be lost any process 2+ machines 3+ machines

Would that have made this clearer?


It’s possible that with an effectively empty database, there won’t be shards of data to assign to all four of the processes. Or, the processes were added and removed quick enough that redistributing data to them didn’t yet complete. If you kill/remove one of the processes that was assigned any data, then you would indeed break your database. If you kill a process that didn’t have any assigned data and also wasn’t the transaction log, then it sounds quite plausible to me that the database will actually recover and run.

I would change ‘#’ to ‘nr’ because # is a python comment in my mind first. Also another thing that is confusing is that it says Replication: 1 copy. To me that sounds like it does replicate… one time. So I end up with 2 instances of data, one of which would be a copy of the real data.
In my opinion to be crystal clear you have to remove the word ‘replication’ and just say ‘total nr of copies’ instead. Because replication in my mind strongly implies that it RE-did something. So it took something that exists and did something on top of that. So that would mean that 1 replicated copy means that I have 2 total copies of data. I know this is probably better understood by database people who work with this terminology all day but in my opinion it’s very misleading and unnecessarily forces me to think about what the author actually means here.
I also dont understand this advice that single ssd is “best for” 1-2 machines. Are the docs trying to tell me that I dont need fault tolerance? I can just rely on my disk never failing? I dont understand what is being advised here, it seems contrary to popular advice elsewhere on the internet to always have backups. Do you advise therefore that I use the fdbbackup tool instead as long as I have less than 3 machines?
Yes the last line is definitely clearer when you say any process.

If you have one machine, then running double will result in a broken cluster, as FDB won’t have a place to put a second copy. If you have two machines, then running double will give you two copies, but if once machine dies we’re back in the previous case of having a broken cluster because there’s no place to put a new second copy. You need at least three machines for double to actually mean that FDB can tolerate the failure of one machine. Similarly, triple is only useful for 5 or more machines. I agree the documentation could be more explicit in explaining this.

Thanks, I greatly appreciate the explanations of what the docs are communicating to you. I’ve filed a GitHub issue to make sure we don’t forget to do the suggested documentation updates. It’s rather difficult for me to temporarily forget the years of immersion in database speak. :slight_smile:

Replication, and Multi-Region or fdbdr, gives additional availability guarantees, but don’t protect against unintended modifications to the database. Backups are generally advisable to protect against application bugs, while also providing a way out from horrific unavailability events.