Database unavailable after shutting down a foundationdb node

tuk · May 11, 2020, 4:09pm

I am running foundation 6.1.12 on Ubuntu 16.04.6 in one of my single node test setup.

After shutting down the node (with foundationdb running) and starting it I am observing that database is not available. Can someone let me know what could be happening here?
Below is the output of status details

fdb> status details

Using cluster file `/etc/foundationdb/fdb.cluster'.

Locking coordination state. Verify that a majority of coordination server
processes are active.

  127.0.0.1:4500  (reachable)

Unable to locate the data distributor worker.

Unable to locate the ratekeeper worker.

The output of status json is below

{
    "client" : {
        "cluster_file" : {
            "path" : "/etc/foundationdb/fdb.cluster",
            "up_to_date" : true
        },
        "coordinators" : {
            "coordinators" : [
                {
                    "address" : "127.0.0.1:4500",
                    "reachable" : true
                }
            ],
            "quorum_reachable" : true
        },
        "database_status" : {
            "available" : false,
            "healthy" : false
        },
        "messages" : [
        ],
        "timestamp" : 1589212043
    },
    "cluster" : {
        "clients" : {
            "count" : 7,
            "supported_versions" : [
                {
                    "client_version" : "6.1.12",
                    "connected_clients" : [
                        {
                            "address" : "127.0.0.1:55764",
                            "connected_coordinators" : 1,
                            "log_group" : "default"
                        },
                        {
                            "address" : "127.0.0.1:55900",
                            "connected_coordinators" : 1,
                            "log_group" : "default"
                        },
                        {
                            "address" : "127.0.0.1:55914",
                            "connected_coordinators" : 1,
                            "log_group" : "default"
                        },
                        {
                            "address" : "127.0.0.1:56078",
                            "connected_coordinators" : 1,
                            "log_group" : "default"
                        },
                        {
                            "address" : "127.0.0.1:56148",
                            "connected_coordinators" : 1,
                            "log_group" : "default"
                        },
                        {
                            "address" : "127.0.0.1:56314",
                            "connected_coordinators" : 1,
                            "log_group" : "default"
                        },
                        {
                            "address" : "127.0.0.1:56450",
                            "connected_coordinators" : 1,
                            "log_group" : "default"
                        }
                    ],
                    "count" : 7,
                    "protocol_version" : "fdb00b061060001",
                    "source_version" : "90ba203c166c62ae9173b2523081a0c6afaa1604"
                }
            ]
        },
        "cluster_controller_timestamp" : 1589212048,
        "connection_string" : "IqgQyLoo:XrIlR1QY@127.0.0.1:4500",
        "datacenter_version_difference" : 0,
        "degraded_processes" : 0,
        "incompatible_connections" : [
        ],
        "layers" : {
            "_error" : "configurationMissing",
            "_valid" : false
        },
        "machines" : {
            "3b2ef843117d3dbdfe93ffc023fbf374" : {
                "address" : "127.0.0.1",
                "contributing_workers" : 2,
                "cpu" : {
                    "logical_core_utilization" : 0.0027520600000000002
                },
                "excluded" : false,
                "locality" : {
                    "machineid" : "3b2ef843117d3dbdfe93ffc023fbf374",
                    "processid" : "4e9fa38cd41ba4893090e422144b30b6",
                    "zoneid" : "3b2ef843117d3dbdfe93ffc023fbf374"
                },
                "machine_id" : "3b2ef843117d3dbdfe93ffc023fbf374",
                "memory" : {
                    "committed_bytes" : 1567232000,
                    "free_bytes" : 15258861568,
                    "total_bytes" : 16826093568
                },
                "network" : {
                    "megabits_received" : {
                        "hz" : 0.25043499999999996
                    },
                    "megabits_sent" : {
                        "hz" : 0.25043499999999996
                    },
                    "tcp_segments_retransmitted" : {
                        "hz" : 0
                    }
                }
            }
        },
        "messages" : [
            {
                "description" : "Unable to locate the data distributor worker.",
                "name" : "unreachable_dataDistributor_worker"
            },
            {
                "description" : "Unable to locate the ratekeeper worker.",
                "name" : "unreachable_ratekeeper_worker"
            },
            {
                "description" : "Unable to read database configuration.",
                "name" : "unreadable_configuration"
            }
        ],
        "processes" : {
            "4e9fa38cd41ba4893090e422144b30b6" : {
                "address" : "127.0.0.1:4500",
                "class_source" : "command_line",
                "class_type" : "unset",
                "command_line" : "/usr/bin/fdbserver --cluster_file=/etc/foundationdb/fdb.cluster --datadir=/var/lib/foundationdb/data/4500 --listen_address=public --logdir=/var/log/foundationdb --public_address=auto:4500",
                "cpu" : {
                    "usage_cores" : 0.0031999800000000003
                },
                "disk" : {
                    "busy" : 0,
                    "free_bytes" : 111117496320,
                    "reads" : {
                        "counter" : 52138,
                        "hz" : 0,
                        "sectors" : 0
                    },
                    "total_bytes" : 117510316032,
                    "writes" : {
                        "counter" : 208806,
                        "hz" : 4.3999699999999997,
                        "sectors" : 176
                    }
                },
                "fault_domain" : "3b2ef843117d3dbdfe93ffc023fbf374",
                "locality" : {
                    "machineid" : "3b2ef843117d3dbdfe93ffc023fbf374",
                    "processid" : "4e9fa38cd41ba4893090e422144b30b6",
                    "zoneid" : "3b2ef843117d3dbdfe93ffc023fbf374"
                },
                "machine_id" : "3b2ef843117d3dbdfe93ffc023fbf374",
                "memory" : {
                    "available_bytes" : 7881066496,
                    "limit_bytes" : 8589934592,
                    "unused_allocated_memory" : 0,
                    "used_bytes" : 361541632
                },
                "messages" : [
                ],
                "network" : {
                    "connection_errors" : {
                        "hz" : 0
                    },
                    "connections_closed" : {
                        "hz" : 0.19999899999999998
                    },
                    "connections_established" : {
                        "hz" : 0
                    },
                    "current_connections" : 5,
                    "megabits_received" : {
                        "hz" : 0.0218783
                    },
                    "megabits_sent" : {
                        "hz" : 0.011631899999999999
                    }
                },
                "roles" : [
                ],
                "uptime_seconds" : 90985.699999999997,
                "version" : "6.1.12"
            },
            "c9c79c494b819706b0eadafdbc15dfd0" : {
                "address" : "127.0.0.1:4501",
                "class_source" : "command_line",
                "class_type" : "stateless",
                "command_line" : "/usr/bin/fdbserver --class=stateless --cluster_file=/etc/foundationdb/fdb.cluster --datadir=/var/lib/foundationdb/data/4501 --listen_address=public --logdir=/var/log/foundationdb --public_address=auto:4501",
                "cpu" : {
                    "usage_cores" : 0.0047999799999999997
                },
                "disk" : {
                    "busy" : 0,
                    "free_bytes" : 111117496320,
                    "reads" : {
                        "counter" : 52138,
                        "hz" : 0,
                        "sectors" : 0
                    },
                    "total_bytes" : 117510316032,
                    "writes" : {
                        "counter" : 208806,
                        "hz" : 4.3999800000000002,
                        "sectors" : 176
                    }
                },
                "fault_domain" : "3b2ef843117d3dbdfe93ffc023fbf374",
                "locality" : {
                    "machineid" : "3b2ef843117d3dbdfe93ffc023fbf374",
                    "processid" : "c9c79c494b819706b0eadafdbc15dfd0",
                    "zoneid" : "3b2ef843117d3dbdfe93ffc023fbf374"
                },
                "machine_id" : "3b2ef843117d3dbdfe93ffc023fbf374",
                "memory" : {
                    "available_bytes" : 7881066496,
                    "limit_bytes" : 8589934592,
                    "unused_allocated_memory" : 0,
                    "used_bytes" : 141729792
                },
                "messages" : [
                ],
                "network" : {
                    "connection_errors" : {
                        "hz" : 0
                    },
                    "connections_closed" : {
                        "hz" : 0.19999899999999998
                    },
                    "connections_established" : {
                        "hz" : 0
                    },
                    "current_connections" : 5,
                    "megabits_received" : {
                        "hz" : 0.034758199999999996
                    },
                    "megabits_sent" : {
                        "hz" : 0.047089399999999997
                    }
                },
                "roles" : [
                    {
                        "id" : "c7a0a4b137cf2015",
                        "role" : "master"
                    },
                    {
                        "id" : "956b61b1e846510d",
                        "role" : "cluster_controller"
                    }
                ],
                "uptime_seconds" : 90985.699999999997,
                "version" : "6.1.12"
            }
        },
        "protocol_version" : "fdb00b061060001",
        "recovery_state" : {
            "description" : "Locking coordination state. Verify that a majority of coordination server processes are active.",
            "name" : "locking_coordinated_state"
        }
    }
}

I am not seeing any trace with Severity 30 and Severity 40 in the fdb traces log. I have placed the logs here

My foundationdb.conf looks like below

[fdbmonitor]
user = ubuntu
group = ubuntu


[general]
restart_delay = 60
cluster_file = /etc/foundationdb/fdb.cluster


[fdbserver]
command = /usr/bin/fdbserver
public_address = auto:$ID
listen_address = public
datadir = /var/lib/foundationdb/data/$ID
logdir = /var/log/foundationdb


[fdbserver.4500]


[fdbserver.4501]
class = stateless


[backup_agent]
command = /usr/lib/foundationdb/backup_agent/backup_agent
logdir = /var/log/foundationdb

[backup_agent.1]

tuk · May 12, 2020, 5:29am

@ajbeamon @alexmiller - I see someone reported something similar in Can I remove a process? . But then no conclusion has been reached there. Any suggestion what may have gone wrong here?

ajbeamon · May 12, 2020, 4:42pm

The logs here are a bit incomplete (they don’t contain the events at the beginning of the recovery), but I do see a lot of messages about connections failing. If your processes can’t talk to each other, then that could prevent the cluster from recovering.

tuk · May 12, 2020, 5:14pm

Thanks @ajbeamon for replying.

This is a single node setup.

Any suggestion how can I recover the database?

I tried stopping and starting fdbmonitor multiple times but still same issue.

ajbeamon · May 12, 2020, 5:17pm

You do have multiple processes running, and they are reporting connection failures, possibly with talking to each other. It’s hard for me to say why that is happening, but it could be relevant.

tuk · June 9, 2020, 7:17pm

@ajbeamon I got this issue again. Shutting down a VM running foundationdb and then starting it, making the database unavailable.

ubuntu@vrni-proxy-release:~$ fdbcli
Using cluster file `fdb.cluster'.

The database is unavailable; type `status' for more information.

Welcome to the fdbcli. For help, type `help'.
fdb> status details

Using cluster file `fdb.cluster'.

Locking coordination state. Verify that a majority of coordination server
processes are active.

  127.0.0.1:4500  (reachable)

Unable to locate the data distributor worker.

Unable to locate the ratekeeper worker.

fdb>

foundationdb.conf same as in posted above.

ubuntu@vrni-proxy-release:~$ who -b
         system boot  2020-06-09 18:40
ubuntu@vrni-proxy-release:~$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 16.04.6 LTS
Release:	16.04
Codename:	xenial

FoundationDB Version - 6.1.12

Once I hit this issue . Then I could not find anyway to recover the database.

I have to stop the foundationdb , delete the entire data directory of foundationdb. Create a new /etc/foundationdb/fdb.cluster with content like below and start the foundationdb again.

IqgQyLoo:XrIlR1QY@127.0.0.1:4500

Once I perform the above steps then foundationdb again becomes operational. So I think when I am shutting down the VM some data may be getting corrupted in foundationdb which is making the database completely unusable. The fdb logs are also not helping me much.

I have placed the content of the fdb data directory, logs and fdb.cluster file here.

Output of status json

{
    "client" : {
        "cluster_file" : {
            "path" : "fdb.cluster",
            "up_to_date" : true
        },
        "coordinators" : {
            "coordinators" : [
                {
                    "address" : "127.0.0.1:4500",
                    "reachable" : true
                }
            ],
            "quorum_reachable" : true
        },
        "database_status" : {
            "available" : false,
            "healthy" : false
        },
        "messages" : [
        ],
        "timestamp" : 1591729978
    },
    "cluster" : {
        "clients" : {
            "count" : 8,
            "supported_versions" : [
                {
                    "client_version" : "6.1.12",
                    "connected_clients" : [
                        {
                            "address" : "127.0.0.1:53980",
                            "connected_coordinators" : 1,
                            "log_group" : "default"
                        },
                        {
                            "address" : "127.0.0.1:54076",
                            "connected_coordinators" : 1,
                            "log_group" : "default"
                        },
                        {
                            "address" : "127.0.0.1:54086",
                            "connected_coordinators" : 1,
                            "log_group" : "default"
                        },
                        {
                            "address" : "127.0.0.1:54266",
                            "connected_coordinators" : 1,
                            "log_group" : "default"
                        },
                        {
                            "address" : "127.0.0.1:54364",
                            "connected_coordinators" : 1,
                            "log_group" : "default"
                        },
                        {
                            "address" : "127.0.0.1:54504",
                            "connected_coordinators" : 1,
                            "log_group" : "default"
                        },
                        {
                            "address" : "127.0.0.1:54656",
                            "connected_coordinators" : 1,
                            "log_group" : "default"
                        },
                        {
                            "address" : "127.0.0.1:54756",
                            "connected_coordinators" : 1,
                            "log_group" : "default"
                        }
                    ],
                    "count" : 8,
                    "protocol_version" : "fdb00b061060001",
                    "source_version" : "90ba203c166c62ae9173b2523081a0c6afaa1604"
                }
            ]
        },
        "cluster_controller_timestamp" : 1591729983,
        "connection_string" : "IqgQyLoo:XrIlR1QY@127.0.0.1:4500",
        "datacenter_version_difference" : 0,
        "degraded_processes" : 0,
        "incompatible_connections" : [
        ],
        "layers" : {
            "_error" : "configurationMissing",
            "_valid" : false
        },
        "machines" : {
            "b1da0c3824d31c5a3f861b1df636d474" : {
                "address" : "127.0.0.1",
                "contributing_workers" : 2,
                "cpu" : {
                    "logical_core_utilization" : 0.010020000000000001
                },
                "excluded" : false,
                "locality" : {
                    "machineid" : "b1da0c3824d31c5a3f861b1df636d474",
                    "processid" : "cbffdce3940166e1b61a4b5253a05a49",
                    "zoneid" : "b1da0c3824d31c5a3f861b1df636d474"
                },
                "machine_id" : "b1da0c3824d31c5a3f861b1df636d474",
                "memory" : {
                    "committed_bytes" : 1253486592,
                    "free_bytes" : 11345350656,
                    "total_bytes" : 12598837248
                },
                "network" : {
                    "megabits_received" : {
                        "hz" : 0.30287799999999998
                    },
                    "megabits_sent" : {
                        "hz" : 0.30287799999999998
                    },
                    "tcp_segments_retransmitted" : {
                        "hz" : 0
                    }
                }
            }
        },
        "messages" : [
            {
                "description" : "Unable to locate the data distributor worker.",
                "name" : "unreachable_dataDistributor_worker"
            },
            {
                "description" : "Unable to locate the ratekeeper worker.",
                "name" : "unreachable_ratekeeper_worker"
            },
            {
                "description" : "Unable to read database configuration.",
                "name" : "unreadable_configuration"
            }
        ],
        "processes" : {
            "1562ed9ef68b4b22c344ad858ed8876a" : {
                "address" : "127.0.0.1:4501",
                "class_source" : "command_line",
                "class_type" : "stateless",
                "command_line" : "/usr/bin/fdbserver --class=stateless --cluster_file=/etc/foundationdb/fdb.cluster --datadir=/var/lib/foundationdb/data/4501 --listen_address=public --logdir=/var/log/foundationdb --public_address=auto:4501",
                "cpu" : {
                    "usage_cores" : 0.0055999800000000001
                },
                "disk" : {
                    "busy" : 0,
                    "free_bytes" : 111130923008,
                    "reads" : {
                        "counter" : 802,
                        "hz" : 0,
                        "sectors" : 0
                    },
                    "total_bytes" : 117510316032,
                    "writes" : {
                        "counter" : 1461,
                        "hz" : 2.3999899999999998,
                        "sectors" : 96
                    }
                },
                "fault_domain" : "b1da0c3824d31c5a3f861b1df636d474",
                "locality" : {
                    "machineid" : "b1da0c3824d31c5a3f861b1df636d474",
                    "processid" : "1562ed9ef68b4b22c344ad858ed8876a",
                    "zoneid" : "b1da0c3824d31c5a3f861b1df636d474"
                },
                "machine_id" : "b1da0c3824d31c5a3f861b1df636d474",
                "memory" : {
                    "available_bytes" : 5919866880,
                    "limit_bytes" : 8589934592,
                    "unused_allocated_memory" : 0,
                    "used_bytes" : 133255168
                },
                "messages" : [
                ],
                "network" : {
                    "connection_errors" : {
                        "hz" : 0
                    },
                    "connections_closed" : {
                        "hz" : 0.19999899999999998
                    },
                    "connections_established" : {
                        "hz" : 0
                    },
                    "current_connections" : 5,
                    "megabits_received" : {
                        "hz" : 0.0335535
                    },
                    "megabits_sent" : {
                        "hz" : 0.046278199999999999
                    }
                },
                "roles" : [
                    {
                        "id" : "d8b1d66f1128667d",
                        "role" : "master"
                    },
                    {
                        "id" : "169b474905a518f7",
                        "role" : "cluster_controller"
                    }
                ],
                "uptime_seconds" : 1905.01,
                "version" : "6.1.12"
            },
            "cbffdce3940166e1b61a4b5253a05a49" : {
                "address" : "127.0.0.1:4500",
                "class_source" : "command_line",
                "class_type" : "unset",
                "command_line" : "/usr/bin/fdbserver --cluster_file=/etc/foundationdb/fdb.cluster --datadir=/var/lib/foundationdb/data/4500 --listen_address=public --logdir=/var/log/foundationdb --public_address=auto:4500",
                "cpu" : {
                    "usage_cores" : 0.0031999900000000002
                },
                "disk" : {
                    "busy" : 0,
                    "free_bytes" : 111130923008,
                    "reads" : {
                        "counter" : 802,
                        "hz" : 0,
                        "sectors" : 0
                    },
                    "total_bytes" : 117510316032,
                    "writes" : {
                        "counter" : 1461,
                        "hz" : 2.3999999999999999,
                        "sectors" : 96
                    }
                },
                "fault_domain" : "b1da0c3824d31c5a3f861b1df636d474",
                "locality" : {
                    "machineid" : "b1da0c3824d31c5a3f861b1df636d474",
                    "processid" : "cbffdce3940166e1b61a4b5253a05a49",
                    "zoneid" : "b1da0c3824d31c5a3f861b1df636d474"
                },
                "machine_id" : "b1da0c3824d31c5a3f861b1df636d474",
                "memory" : {
                    "available_bytes" : 5919866880,
                    "limit_bytes" : 8589934592,
                    "unused_allocated_memory" : 0,
                    "used_bytes" : 361127936
                },
                "messages" : [
                ],
                "network" : {
                    "connection_errors" : {
                        "hz" : 0
                    },
                    "connections_closed" : {
                        "hz" : 0.20000000000000001
                    },
                    "connections_established" : {
                        "hz" : 0
                    },
                    "current_connections" : 5,
                    "megabits_received" : {
                        "hz" : 0.021672
                    },
                    "megabits_sent" : {
                        "hz" : 0.011384
                    }
                },
                "roles" : [
                ],
                "uptime_seconds" : 1905.0699999999999,
                "version" : "6.1.12"
            }
        },
        "protocol_version" : "fdb00b061060001",
        "recovery_state" : {
            "description" : "Locking coordination state. Verify that a majority of coordination server processes are active.",
            "name" : "locking_coordinated_state"
        }
    }
}

I am maintaining the VM in that state. I can help in performing some tests that may help in debugging this issue.

Do you think I should file an issue for this?

ajbeamon · June 9, 2020, 8:37pm

Any idea what’s going on with the trace log trace.127.0.0.1.4500.1591509270.KBigDP.1.xml? It just has a bunch of null bytes in it. I’m also wondering whether something weird happened during that incarnation of the process, as things looked ok before it and didn’t work after it.

tuk · June 10, 2020, 6:02am

@ajbeamon - Just to add this just a fdb single node cluster with very less data. Probably this situation can be recreated if someone tries to start a single node fdb cluster using the given Cluster and data files.

Any idea what’s going on with the trace log trace.127.0.0.1.4500.1591509270.KBigDP.1.xml ?

No. I did not modify anything on the files. It was like that only.

I’m also wondering whether something weird happened during that incarnation of the process, as things looked ok before it and didn’t work after it.

Yes everything was fine before. Now if I wipe out everything from foundationdb data directory fdb will again start to work properly.

tuk · June 10, 2020, 12:14pm

I am able to repro the same cluster state in a healthy fdb single node cluster (on ubuntu 16.04.6) by using the data and cluster file which I have shared. Steps followed

Shutdown foundation db
Replace /etc/foundationdb/foundationdb.conf with the one posted above.
Note the permission of all files under /var/lib/foundationdb/data
Replace /var/lib/foundationdb/data with the content of the fdb_data.tar.gz which was shared.
Make the permission of all files under /var/lib/foundationdb/data same as the permission noted in Step 3.
Replace /etc/foundationdb/fdb.cluster with the content of the fdb.cluster file I shared.
Start foundationdb

Just to rule out that there are no errors in the above steps. I have performed the above steps by using data and cluster file from a healthy single node fdb cluster. Things work fine and this one becomes a copy of the cluster from where data is copied. Only when I use this faulty data and cluster file I see the problem.

ajbeamon · June 10, 2020, 3:16pm

Yeah, I was able to reproduce the stuck cluster from your data files as well (these were helpful, thank you), from which I can see there is some problem trying to recover from a transaction log. It’s not clear to me yet what exactly has happened though, and I suspect some of the interesting details occurred during the period when the transaction logs are invalid.

If something weird has happened to the log files, I wonder if something bad has also happened to your data files, and what the nature of that damage might be. It doesn’t seem as if your data files are completely destroyed, though, like the log file is. If you ever get a reproduction where the log files are complete, I’d be very interested to see them.

tuk · June 10, 2020, 4:15pm

@ajbeamon - Is there a way to remove/reset the tlog files, with an understanding that it will cause data loss for any transaction that is not yet applied on storage servers.?

Some of our deployments are single node, so if we hit this issue I would like to come out with whatever can be salvaged.

ajbeamon · June 10, 2020, 4:22pm

At present, no, though there’s a proposed issue for recovering from a partial set of files:

In your case, though, I’d really like to understand what’s happening and resolve the root problem such that you aren’t running into these issues. Is there anything of note you can say about the process that causes this? How easily can you reproduce the behavior starting from a working cluster?

tuk · June 10, 2020, 5:00pm

It is not very frequent.

In our environment Fdb runs on a Ubuntu (16.04.6) VM on VMWare ESXi virtualized environment.

The pattern was same in all the cases. In a single node deployment of foundationdb shut down the VM with foundationdb processes running (using shutdown command). We have not seen this issue in any of our clustered deployment so far.

tuk · June 11, 2020, 12:35pm

I will try to repro the problem and once I am able to do it, I will get back with more logs.

ajbeamon · June 11, 2020, 9:04pm

Ok, thank you. I’ll take a further look at the results that you provided to see if anything else pops out at me.

amanda · January 28, 2021, 8:38pm

Hi, I hit this issue on one of my kubernetes FDB clusters. (We’re not using the official fdb operator)

The TLog error seemed to be because we ran out of memory:

<Event Severity="30" Time="1611857459.869919" Type="AbsolutePathError" ID="0000000000000000" Path="/storage/var/data/4500/logqueue-V_3_LS_2-82e69f55d06bdc2fdbe35467c6618dda-" UnixErrorCode="c" UnixError="Cannot allocate memory" Error="platform_error" ErrorDescription="Platform error" ErrorCode="1500" Machine="10.0.1.64:4500" LogGroup="default" Roles="MP,RV,SS,TL" />

(Something else on the host started eating memory, the memory usage reported by FDB was around 40% of its 8GB max)

I checked that the log file its trying to access exists and similarly to what was reported here, after restarts and ensuring proper memory resources I wasn’t able to get the DB back.

No idea if this was a coincidence, doesn’t seem to make much sense to me–but I ran a setclass in fdbcli on another node just to list the processes. It took a super long time but did return properly, and then the cluster immediately became functional after that. Would something about getting the workers force the bad fdbserver to retry the tlog or something like that?

alexmiller · January 28, 2021, 10:50pm

That’s an error that calling abspath() returned with an error that it couldn’t allocate the memory required to hold the string for the return value. That’s an indication that something either in the kubernetes container, or on the host outside of the container, was eating so much memory that malloc() failed.

Listing processes requires a cluster controller to exist, which also probably means the cluster to be mostly up and available. I’d be suspicious that a similar thing was happening on other machines, or that your machines were swapping themselves to death and thus exceedingly slow, and that something like that impeded FDB’s ability to recover and get back to being available.

amanda · February 5, 2021, 10:16pm

Ah, this happened again and I got to investigate a bit more. Same thing - one fdbserver ran out of memory. I noticed something interesting - When this happens it totally prevents any log/proxy/storage from being recruited. I checked status json while the cluster was in its unavailable state and there were no log/proxy/storage roles in the cluster. Looks like CC (which was not on the fdbserver that went down) was getting RegisterWorkerRequests but not RecruitStorageRequests? I first wondered if CC was in a weird state and stopped retrying so I killed that fdbserver and another one got recruited as CC and had the same issue.

Also fdbservers were showing SS roles in their logs but not in the status json output.

It makes sense that a fdbserver running out of memory requires it to be restarted, but I’m surprised it surfaced as a Locking coordination state. Verify that a majority of coordination server processes are active. error when it was not a coordinator (it was DD and ratekeeper)

Also, I believe fdbserver will bounce itself if it runs out of its configured memory, but it won’t do this if it runs out of available memory (if the full amount of configured memory is not available on the host). It does track the available memory though, I saw it being reported in the log as <1GB when it went down. Has anyone considered also bouncing fdbserver if it runs low on available memory as well?

Topic		Replies	Views
Error: The database is unavailable Using FoundationDB	15	4161	September 30, 2021
Locking coordination state. Verify that a majority of coordinattion server process are active. Single machine Using FoundationDB	4	1165	March 8, 2021
Fdb database is unavailable Running FoundationDB	2	441	August 30, 2023
How to troubleshoot foundation db 6.0 startup issue Using FoundationDB	5	839	May 23, 2019
How to restore cluster after accidentally dropping coordinators Using FoundationDB	9	2274	February 11, 2021

Database unavailable after shutting down a foundationdb node

Related topics