Once upon a time, I frequently used the ability to set multiple machine_id=XXX values in the same foundationdb.conf to fake having multiple physical hosts on a single host. This is very useful when testing ‘double’ or ‘triple’ redundancy on a single dev box (with multiple hard disks), and it has also saved me multiple times in production.
It no longer seems to work with 6.x? I’m using 6.0.15; see the conf and logs at the end.
I saw that machine_id has been renamed to locality_machineid in the .conf file and updated my config accordingly, but this still does not work: after a “configure double”, the cluster does not come back online and keeps trying, unsuccessfully, to recruit more log workers.
Looking at the code that throws no_more_servers errors, I see that it uses dcId() or zoneId(), but not the machine id? It also uses the IP address of the node, so I tried using 3 different IPs for the fake nodes (127.0.0.1 and two other IP addresses that my dev box is using), and the results are the same…
Example code that tries to find storage workers:
// Returns the first available worker that matches the request
for( auto& it : id_worker )
    if( workerAvailable( it.second, false ) &&
            // the "excludedMachines" set is checked against the zone id, not the machine id
            !excludedMachines.count(it.second.interf.locality.zoneId()) &&
            ( includeDCs.size() == 0 || includeDCs.count(it.second.interf.locality.dcId()) ) &&
            !addressExcluded(excludedAddresses, it.second.interf.address()) &&
            it.second.processClass.machineClassFitness( ProcessClass::Storage ) <= ProcessClass::UnsetFit ) {
        return std::make_pair(it.second.interf, it.second.processClass);
    }
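To make it easier to see which locality key the loop actually filters on, here is a standalone sketch of the same selection with simplified stand-in types. `Locality`, `Worker` and `pickStorageWorker` are hypothetical names, not FoundationDB's API; `workerAvailable()` is reduced to a flag, `addressExcluded()` to an exact `ip:port` match, and the process-class fitness check is left out.

```cpp
#include <cassert>
#include <optional>
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for FDB's locality and worker data.
// Only the fields the excerpt above reads are modelled.
struct Locality {
    std::string machineId;  // what locality_machineid sets
    std::string zoneId;     // what locality_zoneid sets (or its default)
    std::string dcId;       // what locality_dcid sets
};

struct Worker {
    std::string address;    // "ip:port"
    Locality locality;
    bool available = true;  // stands in for workerAvailable()
};

// Same shape as the loop above: returns the first usable worker, or
// std::nullopt where the real code would end up throwing no_more_servers.
// Exclusion is checked against zoneId; machineId is never read.
std::optional<Worker> pickStorageWorker(const std::vector<Worker>& workers,
                                        const std::set<std::string>& excludedZones,
                                        const std::set<std::string>& includeDCs,
                                        const std::set<std::string>& excludedAddresses) {
    for (const auto& w : workers) {
        if (w.available &&
            !excludedZones.count(w.locality.zoneId) &&
            (includeDCs.empty() || includeDCs.count(w.locality.dcId)) &&
            !excludedAddresses.count(w.address)) {
            return w;
        }
    }
    return std::nullopt;
}
```

With two workers that have different machine ids but the same zone id, excluding that one zone rules out both of them, which is the behaviour the excerpt suggests.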
Is this “feature” still a thing? It is so helpful that I’m a bit sad it seems to be gone from 6.x.
Is there another way to fake this? I could also try changing the zone or DC id, but I’m afraid that this would create other issues. For now, my only goal is to get at least double redundancy, so that I can test a few things…
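If the zone id route is worth trying, writing twelve sections by hand gets tedious. The sketch below generates them, giving each fake node a `locality_zoneid` equal to its `locality_machineid`. `locality_zoneid` is a regular foundationdb.conf locality option, but whether setting it is enough to make recruitment succeed is exactly the untested idea from the question; `fakeNodeSections` is a hypothetical helper, and the port layout (4500-4503, 4510-4513, ...) simply copies the conf below.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Emits the per-process sections of a foundationdb.conf that fakes `nodes`
// machines with `procsPerNode` processes each. Assumption (untested): giving
// every fake node its own locality_zoneid, not just its own
// locality_machineid, is what lets the cluster treat them as separate hosts.
std::string fakeNodeSections(int nodes, int procsPerNode, int basePort = 4500) {
    // Ports are spaced by 10 per node, so more than 10 processes would overlap.
    assert(nodes > 0 && procsPerNode > 0 && procsPerNode <= 10);
    std::ostringstream out;
    for (int n = 1; n <= nodes; ++n) {
        std::string id = (n < 10 ? "NODE0" : "NODE") + std::to_string(n);
        out << "# Fake " << id << "\n";
        for (int p = 0; p < procsPerNode; ++p) {
            int port = basePort + (n - 1) * 10 + p;
            out << "[fdbserver." << port << "]\n"
                << "locality_machineid = " << id << "\n"
                << "locality_zoneid = " << id << "\n";
        }
    }
    return out.str();
}
```

Calling `fakeNodeSections(3, 4)` produces the same twelve sections as the conf below, each with an extra `locality_zoneid` line.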
–
Here is the foundationdb.conf on my single machine, which attempts to simulate 3 machines with 4 processes each:
[fdbmonitor]
restart_delay = 20
[general]
cluster_file=C:\ProgramData\foundationdb\fdb.cluster
## Default parameters for individual fdbserver processes
[fdbserver]
public_address = auto:$ID
listen_address = public
parentpid = $PID
command=C:\Program Files\foundationdb\bin\fdbserver.exe
datadir=C:\ProgramData\foundationdb\data\$ID
logdir=C:\ProgramData\foundationdb\logs
# Fake NODE01
[fdbserver.4500]
locality_machineid = NODE01
[fdbserver.4501]
locality_machineid = NODE01
[fdbserver.4502]
locality_machineid = NODE01
[fdbserver.4503]
locality_machineid = NODE01
# Fake NODE02
[fdbserver.4510]
locality_machineid = NODE02
[fdbserver.4511]
locality_machineid = NODE02
[fdbserver.4512]
locality_machineid = NODE02
[fdbserver.4513]
locality_machineid = NODE02
# Fake NODE03
[fdbserver.4520]
locality_machineid = NODE03
[fdbserver.4521]
locality_machineid = NODE03
[fdbserver.4522]
locality_machineid = NODE03
[fdbserver.4523]
locality_machineid = NODE03
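A quick way to double-check what a conf like this one claims is to group its `[fdbserver.<port>]` sections by a locality key. The hypothetical helpers below do that with a deliberately minimal reader (no include files, no quoting, numeric ports assumed). Grouped by `locality_machineid`, the conf above yields 3 distinct values, matching "Machines - 3" in the status output; grouped by `locality_zoneid`, it yields none, since no section sets it. Whether that gap matters for recruitment is the open question, not something this sketch proves.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <sstream>
#include <string>

// Minimal reader for the [fdbserver.<port>] sections of a foundationdb.conf.
// Returns, for each port, the value of `key` (e.g. "locality_machineid"), or
// an empty string when that section does not set it. Comments, other
// sections and the [fdbserver] defaults are skipped.
std::map<int, std::string> localityByPort(const std::string& conf, const std::string& key) {
    auto trim = [](std::string s) {
        s.erase(0, s.find_first_not_of(" \t"));
        s.erase(s.find_last_not_of(" \t") + 1);
        return s;
    };
    const std::string prefix = "[fdbserver.";
    std::map<int, std::string> result;
    std::istringstream in(conf);
    std::string line;
    int port = -1;  // -1 means "not inside an [fdbserver.<port>] section"
    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r') line.pop_back();  // tolerate CRLF
        if (line.empty() || line[0] == '#') continue;
        if (line[0] == '[') {
            port = -1;
            if (line.compare(0, prefix.size(), prefix) == 0 && line.back() == ']') {
                port = std::stoi(line.substr(prefix.size()));
                result[port];  // record the section even if it sets nothing
            }
            continue;
        }
        auto eq = line.find('=');
        if (port < 0 || eq == std::string::npos) continue;
        if (trim(line.substr(0, eq)) == key) result[port] = trim(line.substr(eq + 1));
    }
    return result;
}

// Number of distinct non-empty values across the sections.
std::size_t distinctValues(const std::map<int, std::string>& byPort) {
    std::set<std::string> seen;
    for (const auto& entry : byPort)
        if (!entry.second.empty()) seen.insert(entry.second);
    return seen.size();
}
```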
Here is what I get in fdbcli after wiping everything and starting over:
C:\WINDOWS\system32>net start fdbmonitor
The FoundationDB Server Monitor (fdbmonitor) service is starting..
The FoundationDB Server Monitor (fdbmonitor) service was started successfully.
C:\WINDOWS\system32>fdbcli
Using cluster file `C:\ProgramData\foundationdb\fdb.cluster'.
The database is unavailable; type `status' for more information.
Welcome to the fdbcli. For help, type `help'.
fdb> configure single new ssd
Database created
fdb> status details
Using cluster file `C:\ProgramData\foundationdb\fdb.cluster'.
Configuration:
Redundancy mode - single
Storage engine - ssd-2
Coordinators - 1
Cluster:
FoundationDB processes - 12
Machines - 3
Memory availability - 3.3 GB per process on machine with least available
>>>>> (WARNING: 4.0 GB recommended) <<<<<
Fault Tolerance - 0 machines
Server time - 12/21/18 19:20:01
Data:
Replication health - (Re)initializing automatic data distribution
Moving data - unknown (initializing)
Sum of key-value sizes - unknown
Disk space used - 0 MB
Operating space:
Storage server - 45.2 GB free on most full server
Log server - 45.2 GB free on most full server
Workload:
Read rate - 0 Hz
Write rate - 0 Hz
Transactions started - 0 Hz
Transactions committed - 0 Hz
Conflict rate - 0 Hz
Backup and DR:
Running backups - 0
Running DRs - 0
Process performance details:
10.10.0.173:4500 ( 1% cpu; 6% machine; 0.002 Gbps; 4% disk IO; 0.1 GB / 3.3 GB RAM )
10.10.0.173:4501 ( 1% cpu; 6% machine; 0.002 Gbps; 4% disk IO; 0.0 GB / 3.3 GB RAM )
10.10.0.173:4502 ( 0% cpu; 6% machine; 0.002 Gbps; 4% disk IO; 0.0 GB / 3.3 GB RAM )
10.10.0.173:4503 ( 0% cpu; 6% machine; 0.002 Gbps; 4% disk IO; 0.0 GB / 3.3 GB RAM )
10.10.0.173:4510 ( 0% cpu; 6% machine; 0.002 Gbps; 4% disk IO; 0.0 GB / 3.3 GB RAM )
10.10.0.173:4511 ( 0% cpu; 6% machine; 0.002 Gbps; 4% disk IO; 0.0 GB / 3.3 GB RAM )
10.10.0.173:4512 ( 0% cpu; 6% machine; 0.002 Gbps; 4% disk IO; 0.0 GB / 3.3 GB RAM )
10.10.0.173:4513 ( 1% cpu; 6% machine; 0.002 Gbps; 4% disk IO; 0.0 GB / 3.3 GB RAM )
10.10.0.173:4520 ( 0% cpu; 6% machine; 0.002 Gbps; 4% disk IO; 0.0 GB / 3.4 GB RAM )
10.10.0.173:4521 ( 2% cpu; 6% machine; 0.002 Gbps; 4% disk IO; 0.1 GB / 3.3 GB RAM )
10.10.0.173:4522 ( 0% cpu; 6% machine; 0.002 Gbps; 4% disk IO; 0.0 GB / 3.3 GB RAM )
10.10.0.173:4523 ( 1% cpu; 6% machine; 0.002 Gbps; 4% disk IO; 0.1 GB / 3.3 GB RAM )
Coordination servers:
10.10.0.173:4500 (reachable)
Client time: 12/21/18 19:20:01
fdb> configure double
ERROR: The database is unavailable
Type `configure FORCE <TOKEN>*' to configure without this check
fdb> status details
WARNING: Long delay (Ctrl-C to interrupt)
Using cluster file `C:\ProgramData\foundationdb\fdb.cluster'.
Recruiting new transaction servers.
Need at least 2 log servers, 1 proxies and 1 resolvers.
Have 12 processes on 3 machines.
Timed out trying to retrieve storage servers.
fdb>
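The status output says double mode needs at least 2 log servers, even though 12 processes on 3 machines are visible, and "Fault Tolerance - 0 machines" was already reported in single mode. One rough way to reason about that is sketched below: each mode keeps 1, 2 or 3 copies, and (as a simplifying assumption) each copy must land in a different fault domain, modelled here as one locality string per process. `copiesFor` and `enoughFaultDomains` are hypothetical helpers; FoundationDB's real replication policies are richer than a distinct-value count.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

// Copies kept by each redundancy mode mentioned in this thread.
int copiesFor(const std::string& mode) {
    static const std::map<std::string, int> copies = {
        {"single", 1}, {"double", 2}, {"triple", 3}};
    auto it = copies.find(mode);
    if (it == copies.end()) throw std::invalid_argument("unknown mode: " + mode);
    return it->second;
}

// True when the processes span enough distinct fault domains for `mode`,
// under the simplifying assumption that every copy needs its own domain.
bool enoughFaultDomains(const std::string& mode,
                        const std::vector<std::string>& domainPerProcess) {
    std::set<std::string> domains(domainPerProcess.begin(), domainPerProcess.end());
    return static_cast<int>(domains.size()) >= copiesFor(mode);
}
```

In this model, 12 processes spread over NODE01..NODE03 satisfy double and triple, while the same 12 processes all reporting one shared zone only satisfy single, which would be consistent with what the cluster does here if recruitment counts zones.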
Here are some of the events I see in the cluster controller (CC) log:
Line 559: <Event Severity="20" Time="1545416397.826630" Type="RecruitStorageNotAvailable" ID="4bcd542f38ab421f" Error="no_more_servers" ErrorDescription="Not enough physical servers available" ErrorCode="1008" Machine="10.10.0.173:4500" LogGroup="default" Roles="CC,TL" />
Line 801: <Event Severity="20" Time="1545416398.426781" Type="RecruitStorageNotAvailable" ID="4bcd542f38ab421f" Error="no_more_servers" ErrorDescription="Not enough physical servers available" ErrorCode="1008" Machine="10.10.0.173:4500" LogGroup="default" Roles="CC,SS,TL" />
Line 1639: <Event Severity="20" Time="1545416480.117065" Type="RecruitFromConfigurationNotAvailable" ID="4bcd542f38ab421f" Error="no_more_servers" ErrorDescription="Not enough physical servers available" ErrorCode="1008" Machine="10.10.0.173:4500" LogGroup="default" Roles="CC,SS,TL" />
Line 1651: <Event Severity="20" Time="1545416480.617592" Type="RecruitTLogMatchingSetNotAvailable" ID="4bcd542f38ab421f" Error="no_more_servers" ErrorDescription="Not enough physical servers available" ErrorCode="1008" Machine="10.10.0.173:4500" LogGroup="default" Roles="CC,SS,TL" />
Line 1653: <Event Severity="20" Time="1545416480.617592" Type="RecruitStorageNotAvailable" ID="4bcd542f38ab421f" Error="no_more_servers" ErrorDescription="Not enough physical servers available" ErrorCode="1008" Machine="10.10.0.173:4500" LogGroup="default" Roles="CC,SS,TL" />
Line 1671: <Event Severity="20" Time="1545416481.619681" Type="RecruitTLogMatchingSetNotAvailable" ID="4bcd542f38ab421f" Error="no_more_servers" ErrorDescription="Not enough physical servers available" ErrorCode="1008" Machine="10.10.0.173:4500" LogGroup="default" Roles="CC,SS,TL" />
Line 1673: <Event Severity="20" Time="1545416481.619681" Type="RecruitStorageNotAvailable" ID="4bcd542f38ab421f" Error="no_more_servers" ErrorDescription="Not enough physical servers available" ErrorCode="1008" Machine="10.10.0.173:4500" LogGroup="default" Roles="CC,SS,TL" />
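To summarize a longer trace file than the excerpt above, a small scanner can tally the events whose `Error` attribute is `no_more_servers`, keyed by event `Type`. `countErrorsByType` is a hypothetical helper that relies on these traces having one event per line with double-quoted attributes; it is a regex scan, not a general XML parser.

```cpp
#include <cassert>
#include <map>
#include <regex>
#include <sstream>
#include <string>

// Tallies trace events whose Error attribute equals `error`, keyed by Type.
// The leading space in each pattern keeps " Error=" from matching
// ErrorCode= or ErrorDescription=.
std::map<std::string, int> countErrorsByType(const std::string& log, const std::string& error) {
    static const std::regex typeAttr(R"re( Type="([^"]*)")re");
    static const std::regex errorAttr(R"re( Error="([^"]*)")re");
    std::map<std::string, int> counts;
    std::istringstream in(log);
    std::string line;
    std::smatch type, err;
    while (std::getline(in, line)) {
        if (std::regex_search(line, err, errorAttr) && err[1].str() == error &&
            std::regex_search(line, type, typeAttr))
            ++counts[type[1].str()];
    }
    return counts;
}
```

Run over the seven lines quoted above, it would report RecruitStorageNotAvailable 4 times, RecruitTLogMatchingSetNotAvailable twice and RecruitFromConfigurationNotAvailable once.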