Triple ssd fdb cluster on 3 nodes: one node powered off, and the fdb cluster is unavailable!

Hi, I run an fdb cluster on 3 CentOS 7 nodes configured as triple ssd; the fdb version is 5.2.5.

When I power off 1 machine, the fdb database status changes to unavailable. What happened, and why?

foundationdb.conf is the same on all 3 machines:

`## foundationdb.conf
##
## Configuration file for FoundationDB server processes
## Full documentation is available at
## https://apple.github.io/foundationdb/configuration.html#the-configuration-file

[fdbmonitor]
user = foundationdb
group = foundationdb

[general]
restart_delay = 60

## by default, restart_backoff = restart_delay_reset_interval = restart_delay

initial_restart_delay = 0

restart_backoff = 60

restart_delay_reset_interval = 60

cluster_file = /etc/foundationdb/fdb.cluster

delete_envvars =

kill_on_configuration_change = true

## Default parameters for individual fdbserver processes

[fdbserver]
command = /usr/sbin/fdbserver
public_address = auto:$ID
listen_address = public
logdir = /var/log/foundationdb

## An individual fdbserver process with id 4500
## Parameters set here override defaults from the [fdbserver] section

[fdbserver.4500]
datadir = /var/lib/foundationdb/data/$ID
storage_memory = 10GiB

[fdbserver.4501]
datadir = /var/lib/foundationdb/data/$ID
storage_memory = 10GiB
`

fdbcli --exec "status json" output is:

`{
“client” : {
“cluster_file” : {
“path” : “/etc/foundationdb/fdb.cluster”,
“up_to_date” : true
},
“coordinators” : {
“coordinators” : [
{
“address” : “172.16.9.186:4500”,
“reachable” : true
},
{
“address” : “172.16.57.12:4500”,
“reachable” : false
},
{
“address” : “178.104.163.99:4500”,
“reachable” : true
}
],
“quorum_reachable” : true
},
“database_status” : {
“available” : false,
“healthy” : false
},
“messages” : [
],
“timestamp” : 1594021260
},
“cluster” : {
“clients” : {
“count” : 1,
“supported_versions” : [
{
“client_version” : “5.2.5”,
“connected_clients” : [
{
“address” : “178.104.163.99:40356”,
“log_group” : “default”
}
],
“count” : 1,
“protocol_version” : “fdb00a552000001”,
“source_version” : “4e48018437df4506aa5ed0c7f5976b9412b0145f”
}
]
},
“cluster_controller_timestamp” : 1594021266,
“generation” : 463,
“incompatible_connections” : [
],
“layers” : {
“_error” : “configurationMissing”,
“_valid” : false
},
“machines” : {
“ae5132ed0959f5fbd16edad8584e03ae” : {
“address” : “178.104.163.99”,
“contributing_workers” : 7,
“cpu” : {
“logical_core_utilization” : 0.027834043319133618
},
“locality” : {
“machineid” : “ae5132ed0959f5fbd16edad8584e03ae”,
“processid” : “077beda5c638bb83effae78e30222afc”,
“zoneid” : “ae5132ed0959f5fbd16edad8584e03ae”
},
“machine_id” : “ae5132ed0959f5fbd16edad8584e03ae”,
“memory” : {
“committed_bytes” : 1259991040,
“free_bytes” : 2874003456,
“total_bytes” : 4133994496
},
“network” : {
“megabits_received” : {
“hz” : 0.51068500000000006
},
“megabits_sent” : {
“hz” : 0.437585
},
“tcp_segments_retransmitted” : {
“hz” : 8.999820003599929
}
}
},
“d3d7f561de2bf8943ab08504b3655a0b” : {
“address” : “172.16.9.186”,
“contributing_workers” : 14,
“cpu” : {
“logical_core_utilization” : 0.094262257377426245
},
“locality” : {
“machineid” : “d3d7f561de2bf8943ab08504b3655a0b”,
“processid” : “9ba9f030e1e76f850524a7dc84046526”,
“zoneid” : “d3d7f561de2bf8943ab08504b3655a0b”
},
“machine_id” : “d3d7f561de2bf8943ab08504b3655a0b”,
“memory” : {
“committed_bytes” : 1036357632,
“free_bytes” : 891998208,
“total_bytes” : 1928355840
},
“network” : {
“megabits_received” : {
“hz” : 0.45399900000000004
},
“megabits_sent” : {
“hz” : 0.53055800000000009
},
“tcp_segments_retransmitted” : {
“hz” : 9.9999000009999897
}
}
}
},
“messages” : [
{
“description” : “Unable to read database configuration.”,
“name” : “unreadable_configuration”
}
],
“processes” : {
“077beda5c638bb83effae78e30222afc” : {
“address” : “178.104.163.99:4500”,
“class_source” : “command_line”,
“class_type” : “storage”,
“command_line” : “/usr/sbin/fdbserver --class=storage --cluster_file=/etc/foundationdb/fdb.cluster --datadir=/var/lib/foundationdb/data/4500 --listen_address=public --logdir=/var/log/foundationdb --logsize=100MiB --maxlogssize=1000MiB --public_address=auto:4500 --storage_memory=2GiB”,
“cpu” : {
“usage_cores” : 0.012318353632927345
},
“disk” : {
“busy” : 0,
“free_bytes” : 57510076416,
“reads” : {
“counter” : 20732,
“hz” : 0,
“sectors” : 0
},
“total_bytes” : 94463066112,
“writes” : {
“counter” : 1404875,
“hz” : 0,
“sectors” : 0
}
},
“fault_domain” : “ae5132ed0959f5fbd16edad8584e03ae”,
“locality” : {
“machineid” : “ae5132ed0959f5fbd16edad8584e03ae”,
“processid” : “077beda5c638bb83effae78e30222afc”,
“zoneid” : “ae5132ed0959f5fbd16edad8584e03ae”
},
“machine_id” : “ae5132ed0959f5fbd16edad8584e03ae”,
“memory” : {
“available_bytes” : 781585261,
“limit_bytes” : 8589934592,
“used_bytes” : 594219008
},
“messages” : [
],
“network” : {
“connection_errors” : {
“hz” : 0
},
“connections_closed” : {
“hz” : 0
},
“connections_established” : {
“hz” : 0
},
“current_connections” : 24,
“megabits_received” : {
“hz” : 0.031529000000000001
},
“megabits_sent” : {
“hz” : 0.024975500000000001
}
},
“roles” : [
],
“uptime_seconds” : 607.29499999999996,
“version” : “5.2.5”
},
“16cc4712fdbb548394c2eed61d122b3f” : {
“address” : “178.104.163.99:4506”,
“class_source” : “command_line”,
“class_type” : “stateless”,
“command_line” : “/usr/sbin/fdbserver --class=stateless --cluster_file=/etc/foundationdb/fdb.cluster --datadir=/var/lib/foundationdb/data/4506 --listen_address=public --logdir=/var/log/foundationdb --logsize=100MiB --maxlogssize=1000MiB --public_address=auto:4506 --storage_memory=2GiB”,
“cpu” : {
“usage_cores” : 0.0087347126528734709
},
“disk” : {
“busy” : 0,
“free_bytes” : 57510076416,
“reads” : {
“counter” : 20732,
“hz” : 0,
“sectors” : 0
},
“total_bytes” : 94463066112,
“writes” : {
“counter” : 1404875,
“hz” : 0,
“sectors” : 0
}
},
“fault_domain” : “ae5132ed0959f5fbd16edad8584e03ae”,
“locality” : {
“machineid” : “ae5132ed0959f5fbd16edad8584e03ae”,
“processid” : “16cc4712fdbb548394c2eed61d122b3f”,
“zoneid” : “ae5132ed0959f5fbd16edad8584e03ae”
},
“machine_id” : “ae5132ed0959f5fbd16edad8584e03ae”,
“memory” : {
“available_bytes” : 781611593,
“limit_bytes” : 8589934592,
“used_bytes” : 195780608
},
“messages” : [
],
“network” : {
“connection_errors” : {
“hz” : 0
},
“connections_closed” : {
“hz” : 0
},
“connections_established” : {
“hz” : 0
},
“current_connections” : 17,
“megabits_received” : {
“hz” : 0.013567900000000001
},
“megabits_sent” : {
“hz” : 0.014398300000000001
}
},
“roles” : [
],
“uptime_seconds” : 605.00800000000004,
“version” : “5.2.5”
},
…
…
…
“e90d5b0682731a5243691dfd80572e86” : {
“address” : “178.104.163.99:4501”,
“class_source” : “command_line”,
“class_type” : “storage”,
“command_line” : “/usr/sbin/fdbserver --class=storage --cluster_file=/etc/foundationdb/fdb.cluster --datadir=/var/lib/foundationdb/data/4501 --listen_address=public --logdir=/var/log/foundationdb --logsize=100MiB --maxlogssize=1000MiB --public_address=auto:4501 --storage_memory=2GiB”,
“cpu” : {
“usage_cores” : 0.0096214460568630911
},
“disk” : {
“busy” : 0,
“free_bytes” : 57510076416,
“reads” : {
“counter” : 20732,
“hz” : 0,
“sectors” : 0
},
“total_bytes” : 94463066112,
“writes” : {
“counter” : 1404875,
“hz” : 0,
“sectors” : 0
}
},
“fault_domain” : “ae5132ed0959f5fbd16edad8584e03ae”,
“locality” : {
“machineid” : “ae5132ed0959f5fbd16edad8584e03ae”,
“processid” : “e90d5b0682731a5243691dfd80572e86”,
“zoneid” : “ae5132ed0959f5fbd16edad8584e03ae”
},
“machine_id” : “ae5132ed0959f5fbd16edad8584e03ae”,
“memory” : {
“available_bytes” : 781611593,
“limit_bytes” : 8589934592,
“used_bytes” : 453988352
},
“messages” : [
],
“network” : {
“connection_errors” : {
“hz” : 0
},
“connections_closed” : {
“hz” : 0
},
“connections_established” : {
“hz” : 0
},
“current_connections” : 17,
“megabits_received” : {
“hz” : 0.0136078
},
“megabits_sent” : {
“hz” : 0.014567800000000001
}
},
“roles” : [
],
“uptime_seconds” : 605.01499999999999,
“version” : “5.2.5”
}
},
“protocol_version” : “fdb00a552000001”,
“recovery_state” : {
“description” : “Recruiting new transaction servers.”,
“name” : “recruiting_transaction_servers”,
“required_logs” : 3,
“required_proxies” : 1,
“required_resolvers” : 1
}
}
}

`

Running in triple replication requires at least 3 separate machines for the cluster to be available (see https://apple.github.io/foundationdb/configuration.html#single-datacenter-modes). When you power off one of your 3 machines, the cluster can no longer recruit the needed roles and becomes unavailable. Your status output shows this: recovery is stuck at recruiting_transaction_servers with required_logs set to 3, while only two machines are listed.

In order to be able to tolerate failures, you could run in a reduced redundancy mode (e.g. double) or add more machines (we recommend at least 5 for two fault tolerance in triple mode).
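As a rough illustration of the recruitment shortfall described above, here is a minimal Python sketch that compares the number of machines in a `status json` document with `recovery_state.required_logs`. The function name `recruitment_shortfall` is hypothetical, and the check assumes one transaction log per machine, which is a simplification of how FoundationDB actually places roles. The sample input is a trimmed-down stand-in for the output posted in the question, not the full document.

```python
import json


def recruitment_shortfall(status: dict) -> int:
    """Return how many more machines would be needed to place the required logs.

    Assumption (for illustration only): each required log must land on a
    different machine, so the shortfall is required_logs minus machine count.
    """
    cluster = status["cluster"]
    machines = len(cluster.get("machines", {}))
    required_logs = cluster.get("recovery_state", {}).get("required_logs", 0)
    return max(0, required_logs - machines)


# Trimmed-down stand-in for the status output in the question:
# two machines remain, and recovery still asks for three logs.
sample = json.loads("""
{
  "cluster": {
    "machines": {
      "ae5132ed0959f5fbd16edad8584e03ae": {},
      "d3d7f561de2bf8943ab08504b3655a0b": {}
    },
    "recovery_state": {
      "name": "recruiting_transaction_servers",
      "required_logs": 3
    }
  }
}
""")

print(recruitment_shortfall(sample))  # prints 1
```

To try it against a live cluster, you could save the output of `fdbcli --exec "status json"` to a file and pass the parsed result to the function. A non-zero result only hints at the problem; the `recovery_state` description in the status output remains the authoritative signal.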


Thank you very much, and sorry for missing the configuration page.