Database unavailable after shutting down a foundationdb node

I am running foundation 6.1.12 on Ubuntu 16.04.6 in one of my single node test setup.

After shutting down the node (with foundationdb running) and starting it I am observing that database is not available. Can someone let me know what could be happening here?
Below is the output of status details

fdb> status details

Using cluster file `/etc/foundationdb/fdb.cluster'.

Locking coordination state. Verify that a majority of coordination server
processes are active.

  127.0.0.1:4500  (reachable)

Unable to locate the data distributor worker.

Unable to locate the ratekeeper worker.

The output of status json is below

{
    "client" : {
        "cluster_file" : {
            "path" : "/etc/foundationdb/fdb.cluster",
            "up_to_date" : true
        },
        "coordinators" : {
            "coordinators" : [
                {
                    "address" : "127.0.0.1:4500",
                    "reachable" : true
                }
            ],
            "quorum_reachable" : true
        },
        "database_status" : {
            "available" : false,
            "healthy" : false
        },
        "messages" : [
        ],
        "timestamp" : 1589212043
    },
    "cluster" : {
        "clients" : {
            "count" : 7,
            "supported_versions" : [
                {
                    "client_version" : "6.1.12",
                    "connected_clients" : [
                        {
                            "address" : "127.0.0.1:55764",
                            "connected_coordinators" : 1,
                            "log_group" : "default"
                        },
                        {
                            "address" : "127.0.0.1:55900",
                            "connected_coordinators" : 1,
                            "log_group" : "default"
                        },
                        {
                            "address" : "127.0.0.1:55914",
                            "connected_coordinators" : 1,
                            "log_group" : "default"
                        },
                        {
                            "address" : "127.0.0.1:56078",
                            "connected_coordinators" : 1,
                            "log_group" : "default"
                        },
                        {
                            "address" : "127.0.0.1:56148",
                            "connected_coordinators" : 1,
                            "log_group" : "default"
                        },
                        {
                            "address" : "127.0.0.1:56314",
                            "connected_coordinators" : 1,
                            "log_group" : "default"
                        },
                        {
                            "address" : "127.0.0.1:56450",
                            "connected_coordinators" : 1,
                            "log_group" : "default"
                        }
                    ],
                    "count" : 7,
                    "protocol_version" : "fdb00b061060001",
                    "source_version" : "90ba203c166c62ae9173b2523081a0c6afaa1604"
                }
            ]
        },
        "cluster_controller_timestamp" : 1589212048,
        "connection_string" : "IqgQyLoo:XrIlR1QY@127.0.0.1:4500",
        "datacenter_version_difference" : 0,
        "degraded_processes" : 0,
        "incompatible_connections" : [
        ],
        "layers" : {
            "_error" : "configurationMissing",
            "_valid" : false
        },
        "machines" : {
            "3b2ef843117d3dbdfe93ffc023fbf374" : {
                "address" : "127.0.0.1",
                "contributing_workers" : 2,
                "cpu" : {
                    "logical_core_utilization" : 0.0027520600000000002
                },
                "excluded" : false,
                "locality" : {
                    "machineid" : "3b2ef843117d3dbdfe93ffc023fbf374",
                    "processid" : "4e9fa38cd41ba4893090e422144b30b6",
                    "zoneid" : "3b2ef843117d3dbdfe93ffc023fbf374"
                },
                "machine_id" : "3b2ef843117d3dbdfe93ffc023fbf374",
                "memory" : {
                    "committed_bytes" : 1567232000,
                    "free_bytes" : 15258861568,
                    "total_bytes" : 16826093568
                },
                "network" : {
                    "megabits_received" : {
                        "hz" : 0.25043499999999996
                    },
                    "megabits_sent" : {
                        "hz" : 0.25043499999999996
                    },
                    "tcp_segments_retransmitted" : {
                        "hz" : 0
                    }
                }
            }
        },
        "messages" : [
            {
                "description" : "Unable to locate the data distributor worker.",
                "name" : "unreachable_dataDistributor_worker"
            },
            {
                "description" : "Unable to locate the ratekeeper worker.",
                "name" : "unreachable_ratekeeper_worker"
            },
            {
                "description" : "Unable to read database configuration.",
                "name" : "unreadable_configuration"
            }
        ],
        "processes" : {
            "4e9fa38cd41ba4893090e422144b30b6" : {
                "address" : "127.0.0.1:4500",
                "class_source" : "command_line",
                "class_type" : "unset",
                "command_line" : "/usr/bin/fdbserver --cluster_file=/etc/foundationdb/fdb.cluster --datadir=/var/lib/foundationdb/data/4500 --listen_address=public --logdir=/var/log/foundationdb --public_address=auto:4500",
                "cpu" : {
                    "usage_cores" : 0.0031999800000000003
                },
                "disk" : {
                    "busy" : 0,
                    "free_bytes" : 111117496320,
                    "reads" : {
                        "counter" : 52138,
                        "hz" : 0,
                        "sectors" : 0
                    },
                    "total_bytes" : 117510316032,
                    "writes" : {
                        "counter" : 208806,
                        "hz" : 4.3999699999999997,
                        "sectors" : 176
                    }
                },
                "fault_domain" : "3b2ef843117d3dbdfe93ffc023fbf374",
                "locality" : {
                    "machineid" : "3b2ef843117d3dbdfe93ffc023fbf374",
                    "processid" : "4e9fa38cd41ba4893090e422144b30b6",
                    "zoneid" : "3b2ef843117d3dbdfe93ffc023fbf374"
                },
                "machine_id" : "3b2ef843117d3dbdfe93ffc023fbf374",
                "memory" : {
                    "available_bytes" : 7881066496,
                    "limit_bytes" : 8589934592,
                    "unused_allocated_memory" : 0,
                    "used_bytes" : 361541632
                },
                "messages" : [
                ],
                "network" : {
                    "connection_errors" : {
                        "hz" : 0
                    },
                    "connections_closed" : {
                        "hz" : 0.19999899999999998
                    },
                    "connections_established" : {
                        "hz" : 0
                    },
                    "current_connections" : 5,
                    "megabits_received" : {
                        "hz" : 0.0218783
                    },
                    "megabits_sent" : {
                        "hz" : 0.011631899999999999
                    }
                },
                "roles" : [
                ],
                "uptime_seconds" : 90985.699999999997,
                "version" : "6.1.12"
            },
            "c9c79c494b819706b0eadafdbc15dfd0" : {
                "address" : "127.0.0.1:4501",
                "class_source" : "command_line",
                "class_type" : "stateless",
                "command_line" : "/usr/bin/fdbserver --class=stateless --cluster_file=/etc/foundationdb/fdb.cluster --datadir=/var/lib/foundationdb/data/4501 --listen_address=public --logdir=/var/log/foundationdb --public_address=auto:4501",
                "cpu" : {
                    "usage_cores" : 0.0047999799999999997
                },
                "disk" : {
                    "busy" : 0,
                    "free_bytes" : 111117496320,
                    "reads" : {
                        "counter" : 52138,
                        "hz" : 0,
                        "sectors" : 0
                    },
                    "total_bytes" : 117510316032,
                    "writes" : {
                        "counter" : 208806,
                        "hz" : 4.3999800000000002,
                        "sectors" : 176
                    }
                },
                "fault_domain" : "3b2ef843117d3dbdfe93ffc023fbf374",
                "locality" : {
                    "machineid" : "3b2ef843117d3dbdfe93ffc023fbf374",
                    "processid" : "c9c79c494b819706b0eadafdbc15dfd0",
                    "zoneid" : "3b2ef843117d3dbdfe93ffc023fbf374"
                },
                "machine_id" : "3b2ef843117d3dbdfe93ffc023fbf374",
                "memory" : {
                    "available_bytes" : 7881066496,
                    "limit_bytes" : 8589934592,
                    "unused_allocated_memory" : 0,
                    "used_bytes" : 141729792
                },
                "messages" : [
                ],
                "network" : {
                    "connection_errors" : {
                        "hz" : 0
                    },
                    "connections_closed" : {
                        "hz" : 0.19999899999999998
                    },
                    "connections_established" : {
                        "hz" : 0
                    },
                    "current_connections" : 5,
                    "megabits_received" : {
                        "hz" : 0.034758199999999996
                    },
                    "megabits_sent" : {
                        "hz" : 0.047089399999999997
                    }
                },
                "roles" : [
                    {
                        "id" : "c7a0a4b137cf2015",
                        "role" : "master"
                    },
                    {
                        "id" : "956b61b1e846510d",
                        "role" : "cluster_controller"
                    }
                ],
                "uptime_seconds" : 90985.699999999997,
                "version" : "6.1.12"
            }
        },
        "protocol_version" : "fdb00b061060001",
        "recovery_state" : {
            "description" : "Locking coordination state. Verify that a majority of coordination server processes are active.",
            "name" : "locking_coordinated_state"
        }
    }
}

I am not seeing any trace with Severity 30 and Severity 40 in the fdb traces log. I have placed the logs here

My foundationdb.conf looks like below

[fdbmonitor]
user = ubuntu
group = ubuntu


[general]
restart_delay = 60
cluster_file = /etc/foundationdb/fdb.cluster


[fdbserver]
command = /usr/bin/fdbserver
public_address = auto:$ID
listen_address = public
datadir = /var/lib/foundationdb/data/$ID
logdir = /var/log/foundationdb


[fdbserver.4500]


[fdbserver.4501]
class = stateless


[backup_agent]
command = /usr/lib/foundationdb/backup_agent/backup_agent
logdir = /var/log/foundationdb

[backup_agent.1]

@ajbeamon @alexmiller - I see someone reported something similar in Can I remove a process? . But then no conclusion has been reached there. Any suggestion what may have gone wrong here?

The logs here are a bit incomplete (they don’t contain the events at the beginning of the recovery), but I do see a lot of messages about connections failing. If your processes can’t talk to each other, then that could prevent the cluster from recovering.

Thanks @ajbeamon for replying.

This is a single node setup.

Any suggestion how can I recover the database?

I tried stopping and starting fdbmonitor multiple times but still same issue.

You do have multiple processes running, and they are reporting connection failures, possibly with talking to each other. It’s hard for me to say why that is happening, but it could be relevant.

@ajbeamon I got this issue again. Shutting down a VM running foundationdb and then starting it, making the database unavailable.

ubuntu@vrni-proxy-release:~$ fdbcli
Using cluster file `fdb.cluster'.

The database is unavailable; type `status' for more information.

Welcome to the fdbcli. For help, type `help'.
fdb> status details

Using cluster file `fdb.cluster'.

Locking coordination state. Verify that a majority of coordination server
processes are active.

  127.0.0.1:4500  (reachable)

Unable to locate the data distributor worker.

Unable to locate the ratekeeper worker.

fdb>

foundationdb.conf same as in posted above.

ubuntu@vrni-proxy-release:~$ who -b
         system boot  2020-06-09 18:40
ubuntu@vrni-proxy-release:~$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 16.04.6 LTS
Release:	16.04
Codename:	xenial

FoundationDB Version - 6.1.12

Once I hit this issue . Then I could not find anyway to recover the database.

I have to stop the foundationdb , delete the entire data directory of foundationdb. Create a new /etc/foundationdb/fdb.cluster with content like below and start the foundationdb again.

IqgQyLoo:XrIlR1QY@127.0.0.1:4500

Once I perform the above steps then foundationdb again becomes operational. So I think when I am shutting down the VM some data may be getting corrupted in foundationdb which is making the database completely unusable. The fdb logs are also not helping me much.

I have placed the content of the fdb data directory, logs and fdb.cluster file here.

Output of status json

{
    "client" : {
        "cluster_file" : {
            "path" : "fdb.cluster",
            "up_to_date" : true
        },
        "coordinators" : {
            "coordinators" : [
                {
                    "address" : "127.0.0.1:4500",
                    "reachable" : true
                }
            ],
            "quorum_reachable" : true
        },
        "database_status" : {
            "available" : false,
            "healthy" : false
        },
        "messages" : [
        ],
        "timestamp" : 1591729978
    },
    "cluster" : {
        "clients" : {
            "count" : 8,
            "supported_versions" : [
                {
                    "client_version" : "6.1.12",
                    "connected_clients" : [
                        {
                            "address" : "127.0.0.1:53980",
                            "connected_coordinators" : 1,
                            "log_group" : "default"
                        },
                        {
                            "address" : "127.0.0.1:54076",
                            "connected_coordinators" : 1,
                            "log_group" : "default"
                        },
                        {
                            "address" : "127.0.0.1:54086",
                            "connected_coordinators" : 1,
                            "log_group" : "default"
                        },
                        {
                            "address" : "127.0.0.1:54266",
                            "connected_coordinators" : 1,
                            "log_group" : "default"
                        },
                        {
                            "address" : "127.0.0.1:54364",
                            "connected_coordinators" : 1,
                            "log_group" : "default"
                        },
                        {
                            "address" : "127.0.0.1:54504",
                            "connected_coordinators" : 1,
                            "log_group" : "default"
                        },
                        {
                            "address" : "127.0.0.1:54656",
                            "connected_coordinators" : 1,
                            "log_group" : "default"
                        },
                        {
                            "address" : "127.0.0.1:54756",
                            "connected_coordinators" : 1,
                            "log_group" : "default"
                        }
                    ],
                    "count" : 8,
                    "protocol_version" : "fdb00b061060001",
                    "source_version" : "90ba203c166c62ae9173b2523081a0c6afaa1604"
                }
            ]
        },
        "cluster_controller_timestamp" : 1591729983,
        "connection_string" : "IqgQyLoo:XrIlR1QY@127.0.0.1:4500",
        "datacenter_version_difference" : 0,
        "degraded_processes" : 0,
        "incompatible_connections" : [
        ],
        "layers" : {
            "_error" : "configurationMissing",
            "_valid" : false
        },
        "machines" : {
            "b1da0c3824d31c5a3f861b1df636d474" : {
                "address" : "127.0.0.1",
                "contributing_workers" : 2,
                "cpu" : {
                    "logical_core_utilization" : 0.010020000000000001
                },
                "excluded" : false,
                "locality" : {
                    "machineid" : "b1da0c3824d31c5a3f861b1df636d474",
                    "processid" : "cbffdce3940166e1b61a4b5253a05a49",
                    "zoneid" : "b1da0c3824d31c5a3f861b1df636d474"
                },
                "machine_id" : "b1da0c3824d31c5a3f861b1df636d474",
                "memory" : {
                    "committed_bytes" : 1253486592,
                    "free_bytes" : 11345350656,
                    "total_bytes" : 12598837248
                },
                "network" : {
                    "megabits_received" : {
                        "hz" : 0.30287799999999998
                    },
                    "megabits_sent" : {
                        "hz" : 0.30287799999999998
                    },
                    "tcp_segments_retransmitted" : {
                        "hz" : 0
                    }
                }
            }
        },
        "messages" : [
            {
                "description" : "Unable to locate the data distributor worker.",
                "name" : "unreachable_dataDistributor_worker"
            },
            {
                "description" : "Unable to locate the ratekeeper worker.",
                "name" : "unreachable_ratekeeper_worker"
            },
            {
                "description" : "Unable to read database configuration.",
                "name" : "unreadable_configuration"
            }
        ],
        "processes" : {
            "1562ed9ef68b4b22c344ad858ed8876a" : {
                "address" : "127.0.0.1:4501",
                "class_source" : "command_line",
                "class_type" : "stateless",
                "command_line" : "/usr/bin/fdbserver --class=stateless --cluster_file=/etc/foundationdb/fdb.cluster --datadir=/var/lib/foundationdb/data/4501 --listen_address=public --logdir=/var/log/foundationdb --public_address=auto:4501",
                "cpu" : {
                    "usage_cores" : 0.0055999800000000001
                },
                "disk" : {
                    "busy" : 0,
                    "free_bytes" : 111130923008,
                    "reads" : {
                        "counter" : 802,
                        "hz" : 0,
                        "sectors" : 0
                    },
                    "total_bytes" : 117510316032,
                    "writes" : {
                        "counter" : 1461,
                        "hz" : 2.3999899999999998,
                        "sectors" : 96
                    }
                },
                "fault_domain" : "b1da0c3824d31c5a3f861b1df636d474",
                "locality" : {
                    "machineid" : "b1da0c3824d31c5a3f861b1df636d474",
                    "processid" : "1562ed9ef68b4b22c344ad858ed8876a",
                    "zoneid" : "b1da0c3824d31c5a3f861b1df636d474"
                },
                "machine_id" : "b1da0c3824d31c5a3f861b1df636d474",
                "memory" : {
                    "available_bytes" : 5919866880,
                    "limit_bytes" : 8589934592,
                    "unused_allocated_memory" : 0,
                    "used_bytes" : 133255168
                },
                "messages" : [
                ],
                "network" : {
                    "connection_errors" : {
                        "hz" : 0
                    },
                    "connections_closed" : {
                        "hz" : 0.19999899999999998
                    },
                    "connections_established" : {
                        "hz" : 0
                    },
                    "current_connections" : 5,
                    "megabits_received" : {
                        "hz" : 0.0335535
                    },
                    "megabits_sent" : {
                        "hz" : 0.046278199999999999
                    }
                },
                "roles" : [
                    {
                        "id" : "d8b1d66f1128667d",
                        "role" : "master"
                    },
                    {
                        "id" : "169b474905a518f7",
                        "role" : "cluster_controller"
                    }
                ],
                "uptime_seconds" : 1905.01,
                "version" : "6.1.12"
            },
            "cbffdce3940166e1b61a4b5253a05a49" : {
                "address" : "127.0.0.1:4500",
                "class_source" : "command_line",
                "class_type" : "unset",
                "command_line" : "/usr/bin/fdbserver --cluster_file=/etc/foundationdb/fdb.cluster --datadir=/var/lib/foundationdb/data/4500 --listen_address=public --logdir=/var/log/foundationdb --public_address=auto:4500",
                "cpu" : {
                    "usage_cores" : 0.0031999900000000002
                },
                "disk" : {
                    "busy" : 0,
                    "free_bytes" : 111130923008,
                    "reads" : {
                        "counter" : 802,
                        "hz" : 0,
                        "sectors" : 0
                    },
                    "total_bytes" : 117510316032,
                    "writes" : {
                        "counter" : 1461,
                        "hz" : 2.3999999999999999,
                        "sectors" : 96
                    }
                },
                "fault_domain" : "b1da0c3824d31c5a3f861b1df636d474",
                "locality" : {
                    "machineid" : "b1da0c3824d31c5a3f861b1df636d474",
                    "processid" : "cbffdce3940166e1b61a4b5253a05a49",
                    "zoneid" : "b1da0c3824d31c5a3f861b1df636d474"
                },
                "machine_id" : "b1da0c3824d31c5a3f861b1df636d474",
                "memory" : {
                    "available_bytes" : 5919866880,
                    "limit_bytes" : 8589934592,
                    "unused_allocated_memory" : 0,
                    "used_bytes" : 361127936
                },
                "messages" : [
                ],
                "network" : {
                    "connection_errors" : {
                        "hz" : 0
                    },
                    "connections_closed" : {
                        "hz" : 0.20000000000000001
                    },
                    "connections_established" : {
                        "hz" : 0
                    },
                    "current_connections" : 5,
                    "megabits_received" : {
                        "hz" : 0.021672
                    },
                    "megabits_sent" : {
                        "hz" : 0.011384
                    }
                },
                "roles" : [
                ],
                "uptime_seconds" : 1905.0699999999999,
                "version" : "6.1.12"
            }
        },
        "protocol_version" : "fdb00b061060001",
        "recovery_state" : {
            "description" : "Locking coordination state. Verify that a majority of coordination server processes are active.",
            "name" : "locking_coordinated_state"
        }
    }
}

I am maintaining the VM in that state. I can help in performing some tests that may help in debugging this issue.

Do you think I should file an issue for this?

Any idea what’s going on with the trace log trace.127.0.0.1.4500.1591509270.KBigDP.1.xml? It just has a bunch of null bytes in it. I’m also wondering whether something weird happened during that incarnation of the process, as things looked ok before it and didn’t work after it.

@ajbeamon - Just to add this just a fdb single node cluster with very less data. Probably this situation can be recreated if someone tries to start a single node fdb cluster using the given Cluster and data files.

Any idea what’s going on with the trace log trace.127.0.0.1.4500.1591509270.KBigDP.1.xml ?

No. I did not modify anything on the files. It was like that only.

I’m also wondering whether something weird happened during that incarnation of the process, as things looked ok before it and didn’t work after it.

Yes everything was fine before. Now if I wipe out everything from foundationdb data directory fdb will again start to work properly.

I am able to repro the same cluster state in a healthy fdb single node cluster (on ubuntu 16.04.6) by using the data and cluster file which I have shared. Steps followed

  1. Shutdown foundation db
  2. Replace /etc/foundationdb/foundationdb.conf with the one posted above.
  3. Note the permission of all files under /var/lib/foundationdb/data
  4. Replace /var/lib/foundationdb/data with the content of the fdb_data.tar.gz which was shared.
  5. Make the permission of all files under /var/lib/foundationdb/data same as the permission noted in Step 3.
  6. Replace /etc/foundationdb/fdb.cluster with the content of the fdb.cluster file I shared.
  7. Start foundationdb

Just to rule out that there are no errors in the above steps. I have performed the above steps by using data and cluster file from a healthy single node fdb cluster. Things work fine and this one becomes a copy of the cluster from where data is copied. Only when I use this faulty data and cluster file I see the problem.

Yeah, I was able to reproduce the stuck cluster from your data files as well (these were helpful, thank you), from which I can see there is some problem trying to recover from a transaction log. It’s not clear to me yet what exactly has happened though, and I suspect some of the interesting details occurred during the period when the transaction logs are invalid.

If something weird has happened to the log files, I wonder if something bad has also happened to your data files, and what the nature of that damage might be. It doesn’t seem as if your data files are completely destroyed, though, like the log file is. If you ever get a reproduction where the log files are complete, I’d be very interested to see them.

@ajbeamon - Is there a way to remove/reset the tlog files, with an understanding that it will cause data loss for any transaction that is not yet applied on storage servers.?

Some of our deployments are single node, so if we hit this issue I would like to come out with whatever can be salvaged.

At present, no, though there’s a proposed issue for recovering from a partial set of files:

In your case, though, I’d really like to understand what’s happening and resolve the root problem such that you aren’t running into these issues. Is there anything of note you can say about the process that causes this? How easily can you reproduce the behavior starting from a working cluster?

It is not very frequent.

In our environment Fdb runs on a Ubuntu (16.04.6) VM on VMWare ESXi virtualized environment.

The pattern was same in all the cases. In a single node deployment of foundationdb shut down the VM with foundationdb processes running (using shutdown command). We have not seen this issue in any of our clustered deployment so far.

I will try to repro the problem and once I am able to do it, I will get back with more logs.

Ok, thank you. I’ll take a further look at the results that you provided to see if anything else pops out at me.