FDB cluster is unavailable after deleting a disk

Hi, I am testing a double memory FDB cluster on 2 nodes (172.16.9.186 and 178.104.163.99). Node 178.104.163.99 has 1 HDD, and node 172.16.9.186 has 2 HDDs.

The FoundationDB version is 5.2.5.

The cluster's database status changed to unavailable after I deleted one disk from 172.16.9.186.
(I deleted the disk with the command: echo 1 > /sys/block/sdb/device/delete; the disk's SCSI ID is 6000c29b458680dc333dad5910de37e7.)

After I removed the fdbserver processes related to the deleted disk from foundationdb.conf, the cluster became available again.

What is happening while the cluster is unavailable?

foundationdb.conf on 172.16.9.186 is:

[fdbserver.4500]
class = storage
datadir = /var/lib/foundationdb/6000c29b458680dc333dad5910de37e7/data/$ID

[fdbserver.4501]
class = storage
datadir = /var/lib/foundationdb/6000c29b458680dc333dad5910de37e7/data/$ID

[fdbserver.4502]
class = storage
datadir = /var/lib/foundationdb/6000c29b458680dc333dad5910de37e7/data/$ID

[fdbserver.4503]
class = storage
datadir = /var/lib/foundationdb/6000c29b458680dc333dad5910de37e7/data/$ID

[fdbserver.4505]
class = transaction
datadir = /var/lib/foundationdb/6000c29b458680dc333dad5910de37e7/journal/$ID

[fdbserver.4506]
class = stateless
datadir = /var/lib/foundationdb/6000c29b458680dc333dad5910de37e7/data/$ID

[fdbserver.4507]
class = stateless
datadir = /var/lib/foundationdb/6000c29b458680dc333dad5910de37e7/data/$ID

[fdbserver.4510]
class = storage
datadir = /var/lib/foundationdb/6000c29221528cb42980551360cb127b/data/$ID

[fdbserver.4511]
class = storage
datadir = /var/lib/foundationdb/6000c29221528cb42980551360cb127b/data/$ID

[fdbserver.4512]
class = storage
datadir = /var/lib/foundationdb/6000c29221528cb42980551360cb127b/data/$ID

[fdbserver.4513]
class = storage
datadir = /var/lib/foundationdb/6000c29221528cb42980551360cb127b/data/$ID

[fdbserver.4515]
class = transaction
datadir = /var/lib/foundationdb/6000c29221528cb42980551360cb127b/journal/$ID

[fdbserver.4516]
class = stateless
datadir = /var/lib/foundationdb/6000c29221528cb42980551360cb127b/data/$ID

[fdbserver.4517]
class = stateless
datadir = /var/lib/foundationdb/6000c29221528cb42980551360cb127b/data/$ID
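To see which fdbserver processes depend on a given disk, the sections of foundationdb.conf can be filtered by datadir. This is only a minimal sketch: the function name processes_on_disk is my own, the sample is a trimmed copy of the config above, and it assumes the disk's SCSI ID appears in the datadir path, as it does in this setup.

```python
import configparser


def processes_on_disk(conf_text, disk_id):
    """Return the fdbserver section names whose datadir path mentions disk_id."""
    # interpolation=None keeps values such as "$ID" untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    return [
        section
        for section in parser.sections()
        if section.startswith("fdbserver.")
        and disk_id in parser.get(section, "datadir", fallback="")
    ]


# Trimmed copy of the config above (three of the fourteen sections).
SAMPLE_CONF = """
[fdbserver.4500]
class = storage
datadir = /var/lib/foundationdb/6000c29b458680dc333dad5910de37e7/data/$ID

[fdbserver.4505]
class = transaction
datadir = /var/lib/foundationdb/6000c29b458680dc333dad5910de37e7/journal/$ID

[fdbserver.4510]
class = storage
datadir = /var/lib/foundationdb/6000c29221528cb42980551360cb127b/data/$ID
"""

affected = processes_on_disk(SAMPLE_CONF, "6000c29b458680dc333dad5910de37e7")
print(affected)
```

With the full file, this lists every section that would have to be removed or commented out for the deleted disk.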

foundationdb.conf on 178.104.163.99 is:

[fdbserver.4500]
class = storage
datadir = /var/lib/foundationdb/data/$ID

[fdbserver.4501]
class = storage
datadir = /var/lib/foundationdb/data/$ID

[fdbserver.4502]
class = storage
datadir = /var/lib/foundationdb/data/$ID

[fdbserver.4503]
class = storage
datadir = /var/lib/foundationdb/data/$ID

[fdbserver.4505]
class = transaction
datadir = /var/lib/foundationdb/journal/$ID

[fdbserver.4506]
class = stateless
datadir = /var/lib/foundationdb/data/$ID

[fdbserver.4507]
class = stateless
datadir = /var/lib/foundationdb/data/$ID

[backup_agent]
command = /usr/lib/foundationdb/backup_agent/backup_agent
logdir = /var/log/foundationdb

After deleting one disk on 172.16.9.186, the FDB cluster status is:

[root@ssb-workspace ~]# fdbcli --exec "status details"
Using cluster file `/etc/foundationdb/fdb.cluster'.

Initializing new transaction servers and recovering transaction logs.

{
“client” : {
“cluster_file” : {
“path” : “/etc/foundationdb/fdb.cluster”,
“up_to_date” : true
},
“coordinators” : {
“coordinators” : [
{
“address” : “178.104.163.99:4500”,
“reachable” : true
}
],
“quorum_reachable” : true
},
“database_status” : {
“available” : false,
“healthy” : false
},
“messages” : [
],
“timestamp” : 1594112767
},
“cluster” : {
“clients” : {
“count” : 0
},
“cluster_controller_timestamp” : 1594112772,
“generation” : 4975,
“incompatible_connections” : [
],
“layers” : {
“_error” : “configurationMissing”,
“_valid” : false
},
“machines” : {
“ae5132ed0959f5fbd16edad8584e03ae” : {
“address” : “178.104.163.99”,
“contributing_workers” : 7,
“cpu” : {
“logical_core_utilization” : 0.034062059379406213
},
“locality” : {
“machineid” : “ae5132ed0959f5fbd16edad8584e03ae”,
“processid” : “077beda5c638bb83effae78e30222afc”,
“zoneid” : “ae5132ed0959f5fbd16edad8584e03ae”
},
“machine_id” : “ae5132ed0959f5fbd16edad8584e03ae”,
“memory” : {
“committed_bytes” : 1444225024,
“free_bytes” : 2689769472,
“total_bytes” : 4133994496
},
“network” : {
“megabits_received” : {
“hz” : 0.58364800000000006
},
“megabits_sent” : {
“hz” : 1.30023
},
“tcp_segments_retransmitted” : {
“hz” : 3.1999680003199971
}
}
},
“e91cc7a7e3749709c3d606dc1c28774b” : {
“address” : “172.16.9.186”,
“contributing_workers” : 14,
“cpu” : {
“logical_core_utilization” : 1
},
“locality” : {
“machineid” : “e91cc7a7e3749709c3d606dc1c28774b”,
“processid” : “9ba9f030e1e76f850524a7dc84046526”,
“zoneid” : “e91cc7a7e3749709c3d606dc1c28774b”
},
“machine_id” : “e91cc7a7e3749709c3d606dc1c28774b”,
“memory” : {
“committed_bytes” : 788647936,
“free_bytes” : 1139707904,
“total_bytes” : 1928355840
},
“network” : {
“megabits_received” : {
“hz” : 1.4523699999999999
},
“megabits_sent” : {
“hz” : 0.58311400000000002
},
“tcp_segments_retransmitted” : {
“hz” : 0
}
}
}
},
“messages” : [
{
“description” : “Unable to read database configuration.”,
“name” : “unreadable_configuration”
}
],
“processes” : {
“077beda5c638bb83effae78e30222afc” : {
“address” : “178.104.163.99:4500”,
“class_source” : “command_line”,
“class_type” : “storage”,
“command_line” : “/usr/sbin/fdbserver --class=storage --cluster_file=/etc/foundationdb/fdb.cluster --datadir=/var/lib/foundationdb/data/4500 --listen_address=public --logdir=/var/log/foundationdb --logsize=100MiB --maxlogssize=1000MiB --public_address=auto:4500 --storage_memory=2GiB”,
“cpu” : {
“usage_cores” : 0.015684443155568446
},
“disk” : {
“busy” : 0.011399886001139888,
“free_bytes” : 56330371072,
“reads” : {
“counter” : 23163,
“hz” : 0,
“sectors” : 0
},
“total_bytes” : 94463066112,
“writes” : {
“counter” : 1778556,
“hz” : 5.9999400005999943,
“sectors” : 1880
}
},
“fault_domain” : “ae5132ed0959f5fbd16edad8584e03ae”,
“locality” : {
“machineid” : “ae5132ed0959f5fbd16edad8584e03ae”,
“processid” : “077beda5c638bb83effae78e30222afc”,
“zoneid” : “ae5132ed0959f5fbd16edad8584e03ae”
},
“machine_id” : “ae5132ed0959f5fbd16edad8584e03ae”,
“memory” : {
“available_bytes” : 786547273,
“limit_bytes” : 8589934592,
“used_bytes” : 621899776
},
“messages” : [
],
“network” : {
“connection_errors” : {
“hz” : 0
},
“connections_closed” : {
“hz” : 0
},
“connections_established” : {
“hz” : 0
},
“current_connections” : 22,
“megabits_received” : {
“hz” : 0.060538600000000005
},
“megabits_sent” : {
“hz” : 0.0421932
}
},
“roles” : [
],
“uptime_seconds” : 92113.600000000006,
“version” : “5.2.5”
},



“9ce32d5522c4689d511a80848c91c427” : {
“address” : “172.16.9.186:4502”,
“class_source” : “command_line”,
“class_type” : “storage”,
“command_line” : “/usr/sbin/fdbserver --class=storage --cluster_file=/etc/foundationdb/fdb.cluster --datadir=/var/lib/foundationdb/6000c29b458680dc333dad5910de37e7/data/4502 --listen_address=public --logdir=/var/log/foundationdb --logsize=100MiB --maxlogssize=1000MiB --public_address=172.16.9.186:4502 --storage_memory=2GiB”,
“cpu” : {
“usage_cores” : 0.013561985966497964
},
“disk” : {
“busy” : 0,
“free_bytes” : 22780743680,
“reads” : {
“counter” : 3779,
“hz” : 0,
“sectors” : 0
},
“total_bytes” : 24173232128,
“writes” : {
“counter” : 9197,
“hz” : 0,
“sectors” : 0
}
},
“fault_domain” : “e91cc7a7e3749709c3d606dc1c28774b”,
“locality” : {
“machineid” : “e91cc7a7e3749709c3d606dc1c28774b”,
“processid” : “9ce32d5522c4689d511a80848c91c427”,
“zoneid” : “e91cc7a7e3749709c3d606dc1c28774b”
},
“machine_id” : “e91cc7a7e3749709c3d606dc1c28774b”,
“memory” : {
“available_bytes” : 365362614,
“limit_bytes” : 8589934592,
“used_bytes” : 301342720
},
“messages” : [
{
“description” : “StorageServerFailed: io_error at Tue Jul 7 17:05:25 2020”,
“name” : “io_error”,
“raw_log_message” : “<Event Severity=“40” Time=“1594112725.845479” Type=“StorageServerFailed” Machine=“172.16.9.186:4502” ID=“261ebd9150b89020” Reason=“Error” Error=“io_error” ErrorDescription=“Disk i/o operation failed” ErrorCode=“1510” logGroup=“default” Backtrace=“addr2line -e fdbserver.debug -p -C -f -i 0x12abcc4 0x12aad32 0x9be387 0x9beb71 0x9bed34 0x9b17cb 0x9b1af7 0x476d89 0x9bc62c 0x9bc9c8 0x446628 0x446967 0x476d89 0x9b15b3 0x9b761d 0x446628 0x446967 0x476d89 0x92e9c8 0x92ef27 0x476d89 0x9123d3 0x476d89 0x1257e2c 0x125681b 0x1256d07 0x1256155 0x476d89 0x90e202 0x476d89 0x5d0a43 0x5d0de4 0x476d89 0x56e97d 0x476d89 0x813928 0x11bf7b9 0x11bfa05 0x47a2a0 0x1281d25 0x43178d 0x7f957d215505”/>\r\n latestError=“1”/>\r\n”,
“time” : 1594112725.845479,
“type” : “StorageServerFailed”
}
],
“network” : {
“connection_errors” : {
“hz” : 0
},
“connections_closed” : {
“hz” : 0
},
“connections_established” : {
“hz” : 0
},
“current_connections” : 12,
“megabits_received” : {
“hz” : 0.031778300000000002
},
“megabits_sent” : {
“hz” : 0.0125194
}
},
“roles” : [
],
“uptime_seconds” : 591.19500000000005,
“version” : “5.2.5”
},
“a322d46ece5bc1327a39a226a98ebb38” : {
“address” : “172.16.9.186:4512”,
“class_source” : “command_line”,
“class_type” : “storage”,
“command_line” : “/usr/sbin/fdbserver --class=storage --cluster_file=/etc/foundationdb/fdb.cluster --datadir=/var/lib/foundationdb/6000c29221528cb42980551360cb127b/data/4512 --listen_address=public --logdir=/var/log/foundationdb --logsize=100MiB --maxlogssize=1000MiB --public_address=172.16.9.186:4512 --storage_memory=2GiB”,
“cpu” : {
“usage_cores” : 0.0057225138743062847
},
“disk” : {
“busy” : 0,
“free_bytes” : 17619632128,
“reads” : {
“counter” : 6250,
“hz” : 0,
“sectors” : 0
},
“total_bytes” : 18888736768,
“writes” : {
“counter” : 9268,
“hz” : 0,
“sectors” : 0
}
},
“fault_domain” : “e91cc7a7e3749709c3d606dc1c28774b”,
“locality” : {
“machineid” : “e91cc7a7e3749709c3d606dc1c28774b”,
“processid” : “a322d46ece5bc1327a39a226a98ebb38”,
“zoneid” : “e91cc7a7e3749709c3d606dc1c28774b”
},
“machine_id” : “e91cc7a7e3749709c3d606dc1c28774b”,
“memory” : {
“available_bytes” : 365359981,
“limit_bytes” : 8589934592,
“used_bytes” : 337129472
},
“messages” : [
],
“network” : {
“connection_errors” : {
“hz” : 0
},
“connections_closed” : {
“hz” : 0
},
“connections_established” : {
“hz” : 0
},
“current_connections” : 12,
“megabits_received” : {
“hz” : 0.032009599999999999
},
“megabits_sent” : {
“hz” : 0.0126938
}
},
“roles” : [
],
“uptime_seconds” : 594.12699999999995,
“version” : “5.2.5”
},
“b75f49ec6b59c09986da9aea260f6b8b” : {
“address” : “172.16.9.186:4505”,
“class_source” : “command_line”,
“class_type” : “transaction”,
“command_line” : “/usr/sbin/fdbserver --class=transaction --cluster_file=/etc/foundationdb/fdb.cluster --datadir=/var/lib/foundationdb/6000c29b458680dc333dad5910de37e7/journal/4505 --listen_address=public --logdir=/var/log/foundationdb --logsize=100MiB --maxlogssize=1000MiB --public_address=172.16.9.186:4505 --storage_memory=2GiB”,
“cpu” : {
“usage_cores” : 0.01111080009517088
},
“disk” : {
“busy” : 0,
“free_bytes” : 1944702976,
“reads” : {
“counter” : 655,
“hz” : 0,
“sectors” : 0
},
“total_bytes” : 2079145984,
“writes” : {
“counter” : 4069,
“hz” : 0,
“sectors” : 0
}
},
“fault_domain” : “e91cc7a7e3749709c3d606dc1c28774b”,
“locality” : {
“machineid” : “e91cc7a7e3749709c3d606dc1c28774b”,
“processid” : “b75f49ec6b59c09986da9aea260f6b8b”,
“zoneid” : “e91cc7a7e3749709c3d606dc1c28774b”
},
“machine_id” : “e91cc7a7e3749709c3d606dc1c28774b”,
“memory” : {
“available_bytes” : 365292982,
“limit_bytes” : 8589934592,
“used_bytes” : 350474240
},
“messages” : [
{
“description” : “SharedTLogFailed: io_error at Tue Jul 7 17:06:06 2020”,
“name” : “io_error”,
“raw_log_message” : “<Event Severity=“40” Time=“1594112766.945861” Type=“SharedTLogFailed” Machine=“172.16.9.186:4505” ID=“cdda6e7738d4529e” Reason=“Error” Error=“io_error” ErrorDescription=“Disk i/o operation failed” ErrorCode=“1510” logGroup=“default” Backtrace=“addr2line -e fdbserver.debug -p -C -f -i 0x12abcc4 0x12aad32 0x9be387 0x9beb71 0x9bed34 0x9b17cb 0x9b695e 0x9cb158 0x9cb8f6 0x4d825c 0x11906e8 0x1190939 0x47a2a0 0x1281d25 0x43178d 0x7fe249633505”/>\r\n latestError=“1”/>\r\n”,
“time” : 1594112766.9458611,
“type” : “SharedTLogFailed”
}
],
“network” : {
“connection_errors” : {
“hz” : 0
},
“connections_closed” : {
“hz” : 0
},
“connections_established” : {
“hz” : 0
},
“current_connections” : 20,
“megabits_received” : {
“hz” : 0.047868500000000001
},
“megabits_sent” : {
“hz” : 0.017444300000000003
}
},
“roles” : [
],
“uptime_seconds” : 591.25199999999995,
“version” : “5.2.5”
},
“b9d1c4d99e567c534776e5be00601994” : {
“address” : “172.16.9.186:4503”,
“class_source” : “command_line”,
“class_type” : “storage”,
“command_line” : “/usr/sbin/fdbserver --class=storage --cluster_file=/etc/foundationdb/fdb.cluster --datadir=/var/lib/foundationdb/6000c29b458680dc333dad5910de37e7/data/4503 --listen_address=public --logdir=/var/log/foundationdb --logsize=100MiB --maxlogssize=1000MiB --public_address=172.16.9.186:4503 --storage_memory=2GiB”,
“cpu” : {
“usage_cores” : 0.0060274996054647612
},
“disk” : {
“busy” : 0,
“free_bytes” : 22780743680,
“reads” : {
“counter” : 3779,
“hz” : 0,
“sectors” : 0
},
“total_bytes” : 24173232128,
“writes” : {
“counter” : 9197,
“hz” : 0,
“sectors” : 0
}
},
“fault_domain” : “e91cc7a7e3749709c3d606dc1c28774b”,
“locality” : {
“machineid” : “e91cc7a7e3749709c3d606dc1c28774b”,
“processid” : “b9d1c4d99e567c534776e5be00601994”,
“zoneid” : “e91cc7a7e3749709c3d606dc1c28774b”
},
“machine_id” : “e91cc7a7e3749709c3d606dc1c28774b”,
“memory” : {
“available_bytes” : 365363200,
“limit_bytes” : 8589934592,
“used_bytes” : 304640000
},
“messages” : [
{
“description” : “StorageServerFailed: io_error at Tue Jul 7 17:05:25 2020”,
“name” : “io_error”,
“raw_log_message” : “<Event Severity=“40” Time=“1594112725.854921” Type=“StorageServerFailed” Machine=“172.16.9.186:4503” ID=“a72091b48a5b73c8” Reason=“Error” Error=“io_error” ErrorDescription=“Disk i/o operation failed” ErrorCode=“1510” logGroup=“default” Backtrace=“addr2line -e fdbserver.debug -p -C -f -i 0x12abcc4 0x12aad32 0x9be387 0x9beb71 0x9bed34 0x9b17cb 0x9b1af7 0x476d89 0x9bc62c 0x9bc9c8 0x446628 0x446967 0x476d89 0x9b15b3 0x9b761d 0x446628 0x446967 0x476d89 0x92e9c8 0x92ef27 0x476d89 0x9123d3 0x476d89 0x1257e2c 0x125681b 0x1256d07 0x1256155 0x476d89 0x90e202 0x476d89 0x5d0a43 0x5d0de4 0x476d89 0x56e97d 0x476d89 0x813928 0x11bf7b9 0x11bfa05 0x47a2a0 0x1281d25 0x43178d 0x7f7702f8b505”/>\r\n latestError=“1”/>\r\n”,
“time” : 1594112725.8549211,
“type” : “StorageServerFailed”
}
],
“network” : {
“connection_errors” : {
“hz” : 0
},
“connections_closed” : {
“hz” : 0
},
“connections_established” : {
“hz” : 0
},
“current_connections” : 12,
“megabits_received” : {
“hz” : 0.031666699999999999
},
“megabits_sent” : {
“hz” : 0.0125197
}
},
“roles” : [
],
“uptime_seconds” : 591.36599999999999,
“version” : “5.2.5”
},
“e0a1ee3c5011edf3cf48036bc9632783” : {
“address” : “172.16.9.186:4510”,
“class_source” : “command_line”,
“class_type” : “storage”,
“command_line” : “/usr/sbin/fdbserver --class=storage --cluster_file=/etc/foundationdb/fdb.cluster --datadir=/var/lib/foundationdb/6000c29221528cb42980551360cb127b/data/4510 --listen_address=public --logdir=/var/log/foundationdb --logsize=100MiB --maxlogssize=1000MiB --public_address=172.16.9.186:4510 --storage_memory=2GiB”,
“cpu” : {
“usage_cores” : 0.0056918241508649306
},
“disk” : {
“busy” : 0,
“free_bytes” : 17619632128,
“reads” : {
“counter” : 6250,
“hz” : 0,
“sectors” : 0
},
“total_bytes” : 18888736768,
“writes” : {
“counter” : 9268,
“hz” : 0,
“sectors” : 0
}
},
“fault_domain” : “e91cc7a7e3749709c3d606dc1c28774b”,
“locality” : {
“machineid” : “e91cc7a7e3749709c3d606dc1c28774b”,
“processid” : “e0a1ee3c5011edf3cf48036bc9632783”,
“zoneid” : “e91cc7a7e3749709c3d606dc1c28774b”
},
“machine_id” : “e91cc7a7e3749709c3d606dc1c28774b”,
“memory” : {
“available_bytes” : 365367881,
“limit_bytes” : 8589934592,
“used_bytes” : 332488704
},
“messages” : [
],
“network” : {
“connection_errors” : {
“hz” : 0
},
“connections_closed” : {
“hz” : 0
},
“connections_established” : {
“hz” : 0
},
“current_connections” : 12,
“megabits_received” : {
“hz” : 0.031887700000000005
},
“megabits_sent” : {
“hz” : 0.0125209
}
},
“roles” : [
],
“uptime_seconds” : 593.65899999999999,
“version” : “5.2.5”
},
“e90d5b0682731a5243691dfd80572e86” : {
“address” : “178.104.163.99:4501”,
“class_source” : “command_line”,
“class_type” : “storage”,
“command_line” : “/usr/sbin/fdbserver --class=storage --cluster_file=/etc/foundationdb/fdb.cluster --datadir=/var/lib/foundationdb/data/4501 --listen_address=public --logdir=/var/log/foundationdb --logsize=100MiB --maxlogssize=1000MiB --public_address=auto:4501 --storage_memory=2GiB”,
“cpu” : {
“usage_cores” : 0.012006031915553182
},
“disk” : {
“busy” : 0.01119984320219518,
“free_bytes” : 56330371072,
“reads” : {
“counter” : 23163,
“hz” : 0,
“sectors” : 0
},
“total_bytes” : 94463066112,
“writes” : {
“counter” : 1778544,
“hz” : 6.3999104012543828,
“sectors” : 1880
}
},
“fault_domain” : “ae5132ed0959f5fbd16edad8584e03ae”,
“locality” : {
“machineid” : “ae5132ed0959f5fbd16edad8584e03ae”,
“processid” : “e90d5b0682731a5243691dfd80572e86”,
“zoneid” : “ae5132ed0959f5fbd16edad8584e03ae”
},
“machine_id” : “ae5132ed0959f5fbd16edad8584e03ae”,
“memory” : {
“available_bytes” : 786578870,
“limit_bytes” : 8589934592,
“used_bytes” : 480972800
},
“messages” : [
],
“network” : {
“connection_errors” : {
“hz” : 0
},
“connections_closed” : {
“hz” : 0
},
“connections_established” : {
“hz” : 0
},
“current_connections” : 21,
“megabits_received” : {
“hz” : 0.037066700000000001
},
“megabits_sent” : {
“hz” : 0.018055700000000001
}
},
“roles” : [
],
“uptime_seconds” : 92111.300000000003,
“version” : “5.2.5”
}
},
“protocol_version” : “fdb00a552000001”,
“recovery_state” : {
“description” : “Initializing new transaction servers and recovering transaction logs.”,
“name” : “initializing_transaction_servers”
}
}
}
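The status output is long, so here is a minimal sketch that pulls out only the processes that report messages. It assumes the JSON has been captured with plain straight quotes (for example from fdbcli --exec "status json"); the text pasted above has curly quotes from the forum and will not parse as is. The function name processes_with_errors is my own, and the sample is a trimmed copy of three of the processes above.

```python
import json


def processes_with_errors(status):
    """Map each process address to the names of the messages it reports."""
    failed = {}
    for proc in status.get("cluster", {}).get("processes", {}).values():
        names = [msg.get("name") for msg in proc.get("messages", [])]
        if names:
            failed[proc.get("address")] = names
    return failed


# Trimmed copy of three process entries from the status output above.
SAMPLE = json.loads("""
{
  "cluster": {
    "processes": {
      "9ce32d5522c4689d511a80848c91c427": {
        "address": "172.16.9.186:4502",
        "class_type": "storage",
        "messages": [{"name": "io_error", "type": "StorageServerFailed"}]
      },
      "b75f49ec6b59c09986da9aea260f6b8b": {
        "address": "172.16.9.186:4505",
        "class_type": "transaction",
        "messages": [{"name": "io_error", "type": "SharedTLogFailed"}]
      },
      "a322d46ece5bc1327a39a226a98ebb38": {
        "address": "172.16.9.186:4512",
        "class_type": "storage",
        "messages": []
      }
    }
  }
}
""")

print(processes_with_errors(SAMPLE))
```

On the sample, only 4502 and 4505 are reported, both of which have their datadir on the deleted disk.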

See this thread for the effects of a disk running out on any of the storage servers (SS).

Hello!

While the cluster is unavailable, the master role prints the following logs in a loop:

<Event Severity="10" Time="1594191671.815474" Type="EndpointNotFound" Machine="172.16.9.186:4516" ID="0000000000000000" Address="172.16.9.186:4506" Token="6c1737ef6dd69fd9" SuppressedEventCount="0" logGroup="default"/>
<Event Severity="10" Time="1594191671.815474" Type="CCWDB" Machine="172.16.9.186:4516" ID="a7185cdb49a4f1bd" logGroup="default"/>
<Event Severity="10" Time="1594191671.815474" Type="CCWDB" Machine="172.16.9.186:4516" ID="a7185cdb49a4f1bd" Recruiting="Master" logGroup="default"/>
<Event Severity="10" Time="1594191671.815885" Type="CCWDB" Machine="172.16.9.186:4516" ID="a7185cdb49a4f1bd" Recruited="9a9c53aa431aafe1" logGroup="default"/>
<Event Severity="10" Time="1594191671.815885" Type="RecruitedMasterWorker" Machine="172.16.9.186:4516" ID="a7185cdb49a4f1bd" Address="172.16.9.186:4506" logGroup="default" TrackLatestType="Original"/>
<Event Severity="10" Time="1594191671.815885" Type="CCWDB" Machine="172.16.9.186:4516" ID="a7185cdb49a4f1bd" Lifetime="a7185cdb49a4f1bd#3" ChangeID="719075b9e13f29b6" logGroup="default"/>
<Event Severity="10" Time="1594191671.815885" Type="GotServerDBInfoChange" Machine="172.16.9.186:4516" ID="0000000000000000" ChangeID="719075b9e13f29b6" MasterID="9a9c53aa431aafe1" logGroup="default"/>
<Event Severity="10" Time="1594191672.012351" Type="MasterRegistrationReceived" Machine="172.16.9.186:4516" ID="a7185cdb49a4f1bd" dbName="DB" MasterId="9a9c53aa431aafe1" Master="zoneid=948936f4795067051c61fe19a5105367 processid=54b460781b6f513d6e3b7665c9135b06 machineid=948936f4795067051c61fe19a5105367" Tlogs="eeb740bfd8692df182e8d71949785ddc,71d797746a39bbc3f5a88fa1fd866208,19b2db9e2fc2ceb9e00369513ea68011" Resolvers="0" RecoveryState="3" RegistrationCount="1" Proxies="0" RecoveryCount="6516" logGroup="default"/>
<Event Severity="10" Time="1594191672.012351" Type="GotServerDBInfoChange" Machine="172.16.9.186:4516" ID="0000000000000000" ChangeID="83637f7243df9578" MasterID="9a9c53aa431aafe1" logGroup="default"/>
<Event Severity="10" Time="1594191672.018168" Type="GetTLogTeamDone" Machine="172.16.9.186:4516" ID="4952dff3ccb92c5c" Completed="1" Policy="zoneid^2 x 1" Results="3" Processes="3" Workers="21" Replication="2" Desired="3" RatingTests="200" PolicyGenerations="100" InterfaceId="a7185cdb49a4f1bd" logGroup="default"/>
<Event Severity="10" Time="1594191672.018168" Type="GetTLogTeamWorker" Machine="172.16.9.186:4516" ID="4952dff3ccb92c5c" Class="transaction" Address="178.104.163.99:4505" Zone="ae5132ed0959f5fbd16edad8584e03ae" DataHall="[not set]" isExcludedServer="0" isAvailable="1" logGroup="default"/>
<Event Severity="10" Time="1594191672.018168" Type="GetTLogTeamWorker" Machine="172.16.9.186:4516" ID="4952dff3ccb92c5c" Class="transaction" Address="172.16.9.186:4515" Zone="948936f4795067051c61fe19a5105367" DataHall="[not set]" isExcludedServer="0" isAvailable="1" logGroup="default"/>
<Event Severity="10" Time="1594191672.018168" Type="GetTLogTeamWorker" Machine="172.16.9.186:4516" ID="4952dff3ccb92c5c" Class="transaction" Address="172.16.9.186:4505" Zone="948936f4795067051c61fe19a5105367" DataHall="[not set]" isExcludedServer="0" isAvailable="1" logGroup="default"/>
<Event Severity="10" Time="1594191672.018168" Type="findWorkersForConfig" Machine="172.16.9.186:4516" ID="0000000000000000" replication="2" desiredLogs="3" actualLogs="3" desiredProxies="3" actualProxies="3" desiredResolvers="1" actualResolvers="1" logGroup="default"/>
<Event Severity="10" Time="1594191672.819949" Type="CCWDB" Machine="172.16.9.186:4516" ID="a7185cdb49a4f1bd" Watching="9a9c53aa431aafe1" logGroup="default"/>
<Event Severity="10" Time="1594191672.820138" Type="EndpointNotFound" Machine="172.16.9.186:4516" ID="0000000000000000" Address="172.16.9.186:4506" Token="0d972419b4b545fd" SuppressedEventCount="0" logGroup="default"/>

From the log, we can see that the endpoint 172.16.9.186:4506 is not found (it runs on the deleted disk). But why does the master role select a failed endpoint every time?
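To check how often the same address shows up in the loop, the trace events can be counted per type and address. This is a minimal sketch, assuming the trace file holds one event per line with straight-quoted attributes; the helper names parse_event and count_by are my own, and the sample lines are shortened copies of the events above.

```python
import re
from collections import Counter

# Matches name="value" pairs inside one trace event line.
ATTR = re.compile(r'(\w+)="([^"]*)"')


def parse_event(line):
    """Return the attributes of one trace event line as a dict."""
    return dict(ATTR.findall(line))


def count_by(lines, event_type, attribute):
    """Count events of event_type, grouped by the value of attribute."""
    counts = Counter()
    for line in lines:
        event = parse_event(line)
        if event.get("Type") == event_type:
            counts[event.get(attribute)] += 1
    return counts


# Shortened copies of three events from the log above.
SAMPLE_LINES = [
    '<Event Severity="10" Time="1594191671.815474" Type="EndpointNotFound" '
    'Machine="172.16.9.186:4516" Address="172.16.9.186:4506"/>',
    '<Event Severity="10" Time="1594191671.815885" Type="RecruitedMasterWorker" '
    'Machine="172.16.9.186:4516" Address="172.16.9.186:4506"/>',
    '<Event Severity="10" Time="1594191672.820138" Type="EndpointNotFound" '
    'Machine="172.16.9.186:4516" Address="172.16.9.186:4506"/>',
]

not_found = count_by(SAMPLE_LINES, "EndpointNotFound", "Address")
recruited = count_by(SAMPLE_LINES, "RecruitedMasterWorker", "Address")
print(not_found, recruited)
```

If the same address dominates both counters over a longer stretch of the trace file, it confirms that the same worker is being picked again on each round.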

UPD1: With 3 machines, the same test leaves the cluster available. According to the thread in your reply, the cluster should be unavailable in that case too.

UPD2: I ran the same test on FoundationDB version 6.2.19, and the cluster stayed available.

The root cause may be related to the tlog bound to the deleted disk.
In further testing, I removed transaction server 4505, which is bound to the deleted disk, from foundationdb.conf by commenting it out:

#[fdbserver.4505]
#class = transaction
#datadir = /var/lib/foundationdb/6000c29b458680dc333dad5910de37e7/journal/$ID

With this change, the FDB cluster stays available after the disk is deleted.

The cluster is double memory, and the tlog configuration is:

                   |--------- tlog1 (disk1)
172.16.9.186 ------|
                   |--------- tlog2 (disk2)

178.104.163.99 ----|--------- tlog3 (disk3)

When disk1 is deleted, tlog1 fails. A double replication cluster should be able to work with 2 tlogs, so why does the failure of 1 tlog bring down the cluster?
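To make my expectation concrete, here is a toy model of the Policy="zoneid^2 x 1" value from the GetTLogTeamDone event, read as "the tlogs must cover at least 2 distinct zones". This is only my reading, not FoundationDB's actual recruitment or recovery logic; the function name satisfies_zone_policy is my own, and the addresses and zone IDs are taken from the GetTLogTeamWorker events above.

```python
def satisfies_zone_policy(tlogs, required_zones=2):
    """Toy check: do the given (address, zone) pairs span enough distinct zones?"""
    zones = {zone for _address, zone in tlogs}
    return len(zones) >= required_zones


# (address, zone) pairs from the GetTLogTeamWorker events above.
ALL_TLOGS = [
    ("172.16.9.186:4505", "948936f4795067051c61fe19a5105367"),    # tlog1 on disk1
    ("172.16.9.186:4515", "948936f4795067051c61fe19a5105367"),    # tlog2 on disk2
    ("178.104.163.99:4505", "ae5132ed0959f5fbd16edad8584e03ae"),  # tlog3 on disk3
]

# disk1 is deleted, so tlog1 is gone.
survivors = [t for t in ALL_TLOGS if t[0] != "172.16.9.186:4505"]

print(satisfies_zone_policy(survivors))
```

Under this reading, tlog2 and tlog3 still span two zones, which is why I expected the cluster to stay available. The model says nothing about what recovery itself needs from the previous set of logs.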

Could someone please explain why this happens? Is it a bug, or is it by design for data protection?
What is the solution for fdbserver process failures, and where can I find its implementation in the source code?

Thanks very much!!