Simulation testing of FoundationDb

Hello!

I’ve just build fdb 6.3.9 from sources and I’m trying to run simulation tests on it.

First of all, I started the correctness-6.3.9.tar tarball with joshua (GitHub - FoundationDB/fdb-joshua: FoundationDB Correctness service) and received some errors

[oleg@oleg2 FdbJoshua]$ python3 -m joshua.joshua -C ../devops/clusters/joshua/fdb.cluster tail 20210414-154515-oleg-9a72672d246b2754
Results for test ensemble: 20210414-154515-oleg-9a72672d246b2754
0x1000000435db6fba10000 103 11c1fb3ab7a8 5315918842548694494 '<Test TestUID="e3149994-ac82-4703-be22-65b7f00ab230" RandomSeed="732940594" SourceVersion="2a5fb92f7f481ce2c7cfcaa300e952512bc38c56" Time="1618415118" BuggifyEnabled="1" DeterminismCheck="0" OldBinary="fdbserver-6.2.24" TestFile="/var/joshua/ensembles/20210414-154515-oleg-9a72672d246b2754/tests/restarting/from_6.2.0/SnapCycleRestart-1.txt" SimElapsedTime="139.481" RealElapsedTime="11.2693" RandomUnseed="53567" Passed="1" Failed="1" PeakMemory="199954432" OK="true"><BuggifySection File="flow/Knobs.cpp" Line="39" /><BuggifySection File="flow/Knobs.cpp" Line="56" /><BuggifySection File="flow/Knobs.cpp" Line="57" /><BuggifySection File="fdbclient/Knobs.cpp" Line="73" /><BuggifySection File="fdbclient/Knobs.cpp" Line="75" /><BuggifySection File="fdbserver/Knobs.cpp" Line="35" /><BuggifySection File="fdbserver/Knobs.cpp" Line="36" /><BuggifySection File="fdbserver/Knobs.cpp" Line="42" /><BuggifySection File="fdbserver/Knobs.cpp" Line="45" /><BuggifySection File="fdbserver/Knobs.cpp" Line="73" /><BuggifySection File="fdbserver/Knobs.cpp" Line="76" /><BuggifySection File="fdbserver/Knobs.cpp" Line="90" /><BuggifySection File="fdbserver/Knobs.cpp" Line="135" /><BuggifySection File="fdbserver/Knobs.cpp" Line="141" /><BuggifySection File="fdbserver/Knobs.cpp" Line="194" /><BuggifySection File="fdbserver/Knobs.cpp" Line="205" /><BuggifySection File="fdbserver/Knobs.cpp" Line="206" /><BuggifySection File="fdbserver/Knobs.cpp" Line="222" /><BuggifySection File="fdbserver/Knobs.cpp" Line="275" /><BuggifySection File="fdbserver/Knobs.cpp" Line="277" /><BuggifySection File="fdbserver/Knobs.cpp" Line="279" /><BuggifySection File="fdbserver/Knobs.cpp" Line="311" /><BuggifySection File="fdbserver/Knobs.cpp" Line="318" /><BuggifySection File="fdbserver/Knobs.cpp" Line="324" /><BuggifySection File="fdbserver/Knobs.cpp" Line="330" /><BuggifySection File="fdbserver/Knobs.cpp" Line="351" /><BuggifySection File="fdbserver/Knobs.cpp" Line="378" /><BuggifySection File="fdbserver/Knobs.cpp" Line="395" /><BuggifySection File="fdbserver/Knobs.cpp" Line="397" /><BuggifySection File="fdbserver/Knobs.cpp" Line="406" /><BuggifySection File="fdbserver/Knobs.cpp" Line="424" /><BuggifySection File="fdbserver/Knobs.cpp" Line="439" /><BuggifySection File="fdbserver/Knobs.cpp" Line="501" /><BuggifySection File="fdbserver/Knobs.cpp" Line="508" /><BuggifySection File="fdbserver/Knobs.cpp" Line="538" /><BuggifySection File="fdbrpc/AsyncFileCached.actor.cpp" Line="61" /><BuggifySection File="fdbrpc/sim2.actor.cpp" Line="804" /><BuggifySection File="fdbclient/NativeAPI.actor.cpp" Line="394" /><BuggifySection File="fdbrpc/sim2.actor.cpp" Line="241" /><BuggifySection File="fdbserver/CoordinatedState.actor.cpp" Line="38" /><BuggifySection File="fdbserver/CoordinatedState.actor.cpp" Line="41" /><BuggifySection File="fdbrpc/sim2.actor.cpp" Line="930" /><BuggifySection File="fdbserver/masterserver.actor.cpp" Line="725" /><BuggifySection File="fdbserver/masterserver.actor.cpp" Line="925" /><BuggifySection File="fdbserver/tester.actor.cpp" Line="831" /></Test>\n<Test TestFile="/var/joshua/ensembles/20210414-154515-oleg-9a72672d246b2754/tests/restarting/from_6.2.0/SnapCycleRestart-2.txt" RandomSeed="732940595" BuggifyEnabled="1" DeterminismCheck="0" OldBinary="fdbserver-6.2.24"><NoTraceFileGenerated Severity="30" Plugin="" MachineName="11c1fb3ab7a8" /><NoTraceFileGenerated Severity="30" Plugin="" MachineName="11c1fb3ab7a8" /></Test>\n'
0x1000000435db93f190000 103 11c1fb3ab7a8 6774011931387750987 '<Test TestFile="/var/joshua/ensembles/20210414-154515-oleg-9a72672d246b2754/tests/fast/BackupCorrectnessClean.txt" RandomSeed="291958968" BuggifyEnabled="1" DeterminismCheck="0" OldBinary=""><NoTraceFileGenerated Severity="30" Plugin="" MachineName="11c1fb3ab7a8" /><NoTraceFileGenerated Severity="30" Plugin="" MachineName="11c1fb3ab7a8" /></Test>\n'
0x1000000435dbb71630000 103 11c1fb3ab7a8 3061461588318534719 '<Test TestFile="/var/joshua/ensembles/20210414-154515-oleg-9a72672d246b2754/tests/slow/ParallelRestoreOldBackupCorrectnessCycle.txt" RandomSeed="75432848" BuggifyEnabled="1" DeterminismCheck="0" OldBinary=""><NoTraceFileGenerated Severity="30" Plugin="" MachineName="11c1fb3ab7a8" /><NoTraceFileGenerated Severity="30" Plugin="" MachineName="11c1fb3ab7a8" /></Test>\n'
0x1000000435dbd9f660000 103 11c1fb3ab7a8 3103890438703106204 '<Test TestFile="/var/joshua/ensembles/20210414-154515-oleg-9a72672d246b2754/tests/fast/KillRegionCycle.txt" RandomSeed="901454152" BuggifyEnabled="1" DeterminismCheck="0" OldBinary=""><NoTraceFileGenerated Severity="30" Plugin="" MachineName="11c1fb3ab7a8" /><NoTraceFileGenerated Severity="30" Plugin="" MachineName="11c1fb3ab7a8" /></Test>\n'
0x1000000435dbfc5240000 103 11c1fb3ab7a8 3102398388099496483 '<Test TestFile="/var/joshua/ensembles/20210414-154515-oleg-9a72672d246b2754/tests/slow/DifferentClustersSameRV.txt" RandomSeed="257028686" BuggifyEnabled="1" DeterminismCheck="0" OldBinary=""><NoTraceFileGenerated Severity="30" Plugin="" MachineName="11c1fb3ab7a8" /><NoTraceFileGenerated Severity="30" Plugin="" MachineName="11c1fb3ab7a8" /></Test>\n'
0x1000000435dc21b250000 103 11c1fb3ab7a8 4988921788399198735 '<Test TestFile="/var/joshua/ensembles/20210414-154515-oleg-9a72672d246b2754/tests/fast/TaskBucketCorrectness.txt" RandomSeed="65957481" BuggifyEnabled="1" DeterminismCheck="0" OldBinary=""><NoTraceFileGenerated Severity="30" Plugin="" MachineName="11c1fb3ab7a8" /><NoTraceFileGenerated Severity="30" Plugin="" MachineName="11c1fb3ab7a8" /></Test>\n'
0x1000000435dc44dd10000 103 11c1fb3ab7a8 8196442410007927692 '<Test TestFile="/var/joshua/ensembles/20210414-154515-oleg-9a72672d246b2754/tests/fast/TimeKeeperCorrectness.txt" RandomSeed="1584826" BuggifyEnabled="1" DeterminismCheck="0" OldBinary=""><NoTraceFileGenerated Severity="30" Plugin="" MachineName="11c1fb3ab7a8" /><NoTraceFileGenerated Severity="30" Plugin="" MachineName="11c1fb3ab7a8" /></Test>\n'
0x1000000435dc681c00000 103 11c1fb3ab7a8 4673601395945998285 '<Test TestFile="/var/joshua/ensembles/20210414-154515-oleg-9a72672d246b2754/tests/fast/VersionStamp.txt" RandomSeed="37680121" BuggifyEnabled="1" DeterminismCheck="0" OldBinary=""><NoTraceFileGenerated Severity="30" Plugin="" MachineName="11c1fb3ab7a8" /><NoTraceFileGenerated Severity="30" Plugin="" MachineName="11c1fb3ab7a8" /></Test>\n'
0x1000000435dc8b55c0000 103 11c1fb3ab7a8 2993740394339652498 '<Test TestFile="/var/joshua/ensembles/20210414-154515-oleg-9a72672d246b2754/tests/fast/BackupCorrectness.txt" RandomSeed="34274817" BuggifyEnabled="0" DeterminismCheck="0" OldBinary=""><NoTraceFileGenerated Severity="30" Plugin="" MachineName="11c1fb3ab7a8" /><NoTraceFileGenerated Severity="30" Plugin="" MachineName="11c1fb3ab7a8" /></Test>\n'
0x1000000435dcb48db0000 103 11c1fb3ab7a8 2462806307335811719 '<Test TestFile="/var/joshua/ensembles/20210414-154515-oleg-9a72672d246b2754/tests/slow/ParallelRestoreNewBackupCorrectnessAtomicOp.txt" RandomSeed="814560244" BuggifyEnabled="1" DeterminismCheck="0" OldBinary=""><NoTraceFileGenerated Severity="30" Plugin="" MachineName="11c1fb3ab7a8" /><NoTraceFileGenerated Severity="30" Plugin="" MachineName="11c1fb3ab7a8" /></Test>\n'
Ensemble stopped

I have some questions:

  1. Is it normal behaviour, ore something went wrong?
  2. I see that 10 tests failed. How can I know how many tests passed?
  3. I’d like to rerun a test
bin/fdbserver -r simulation -f tests/fast/VersionStamp.txt

How can I force the same RandomSeed and BuggifyEnabled to the test?

  1. I tried to run ./joshua_test from correctness-6.3.9.tar manually (without joshua machine). But receive an error:
[oleg@oleg2 correctness-6.3.9]$ ./joshua_test 
<Test TestUID="f74d39fc-8e10-4e1f-967c-ca2b085373a7" Severity="40" Passed="0" Failed="-1" PeakMemory="619892736" OK="false"><SummarizationError Severity="40" ErrorMessage="Failed to parse &lt;Event Severity=&quot;10&quot; Time=&quot;0.000000&quot; Type=&quot;BuggifySection&quot; Activated=&quot;0&quot; File=&quot;/home/oleg/work/fdb/FoundationDb/flow/Knobs.cpp&quot; Line=&quot;43&quot; Machine=&quot;0.0.0.0:0&quot; LogGroup=&quot;default&quot; /&gt;" /><TestUnexpectedlyNotFinished /></Test>

what am I missing?

  1. There are lots of differnent tests in test subdirectory of the fdb source tree. Why only few of them are run with joshua?

I think something is wrong here. Usually, we don’t run joshua_test manually. It is intended to be ran by joshua_agent, which is part of the GitHub - FoundationDB/fdb-joshua: FoundationDB Correctness service repo.

The typical setup is:

  • Have a FoundationDB cluster running.
  • Build and run the docker image from this script, make sure the docker image uses the same FDB version as the cluster in the first step (current master branch uses the latest FDB version, which can be changed with this patch). Go inside the docker image and make sure it can talk to the cluster.
  • Submit correctness package to the cluster, and check its output.

For your questions:

  1. Something is wrong.
  2. Use --xml --errors argument in your joshua tail command.
  3. Simulation needs two additional arguments, i.e., -s 123 -b on (123 is an integer seed you can change, “on” can be replaced with “off” to turn on/off buggification).
  4. Probably the environment is not the same as Joshua agent’s docker image.
  5. Many are unit tests not included in correctness package, and some tests are outdated.

Note it’s important to use the Docker image, because it contains necessary old FDB binaries (and old TLS libraries) for upgrade tests. Without them, upgrade tests will fail (like the first one in your test needs 6.2.24 binary).

I think most people outside of Apple/Snowflake use ctest to run tests.

It does appear to be a huge gap in documentation that there’s no “how to run tests” guide in the readme, nor contributing guide, nor anywhere in the wiki. :confused:

1 Like

Note it’s important to use the Docker image, because it contains necessary old FDB binaries (and old TLS libraries) for upgrade tests. Without them, upgrade tests will fail (like the first one in your test needs 6.2.24 binary).

I use the docker image for running tests with joshua. But after joshua shows that tests failed, I need a step-by step investigation process for the failed tests. First of all I have to run failed tests manually, without joshua.

  1. Use --xml --errors argument in your joshua tail command.

I can not see any passed tests with python3 -m joshua.joshua -C ../devops/clusters/joshua/fdb.cluster tail 20210414-154515-oleg-9a72672d246b2754 --xml --errors, only the errors

I’d like to receive a list like this:

  • Test xxx - passed
  • Test xxy - passed
  • Test xyx - failed
  • Test xyy - passed

Is it possible with joshua client?

Many are unit tests not included in correctness package, and some tests are outdated.

But why did joshua run only 10 tests from thats are in the corectness package?

Is there any step-by-step guide, how to run the corectness tests with ctest?

1 Like

Hello. I managed to return to fdb tests with joshua.

When I start tests with joshua all ten tests failed with the same error 'NoTraceFileGenerated

Results for test ensemble: 20210712-143806-oleg-7b73538a29623d86
<Trace><Test TestFile="/var/joshua/ensembles/20210712-143806-oleg-7b73538a29623d86/tests/rare/CloggedCycleWithKills.txt" RandomSeed="792566021" BuggifyEnabled="1" DeterminismCheck="0" OldBinary=""><NoTraceFileGenerated Severity="30" Plugin="" MachineName="e724e28ab0af" /><NoTraceFileGenerated Severity="30" Plugin="" MachineName="e724e28ab0af" /></Test>
<Test TestFile="/var/joshua/ensembles/20210712-143806-oleg-7b73538a29623d86/tests/fast/RandomUnitTests.txt" RandomSeed="478368530" BuggifyEnabled="0" DeterminismCheck="0" OldBinary=""><NoTraceFileGenerated Severity="30" Plugin="" MachineName="e724e28ab0af" /><NoTraceFileGenerated Severity="30" Plugin="" MachineName="e724e28ab0af" /></Test>
<Test TestFile="/var/joshua/ensembles/20210712-143806-oleg-7b73538a29623d86/tests/fast/IncrementTest.txt" RandomSeed="886997489" BuggifyEnabled="0" DeterminismCheck="0" OldBinary=""><NoTraceFileGenerated Severity="30" Plugin="" MachineName="e724e28ab0af" /><NoTraceFileGenerated Severity="30" Plugin="" MachineName="e724e28ab0af" /></Test>
<Test TestFile="/var/joshua/ensembles/20210712-143806-oleg-7b73538a29623d86/tests/fast/CloggedSideband.txt" RandomSeed="35027166" BuggifyEnabled="1" DeterminismCheck="0" OldBinary=""><NoTraceFileGenerated Severity="30" Plugin="" MachineName="e724e28ab0af" /><NoTraceFileGenerated Severity="30" Plugin="" MachineName="e724e28ab0af" /></Test>
<Test TestFile="/var/joshua/ensembles/20210712-143806-oleg-7b73538a29623d86/tests/rare/CloggedCycleWithKills.txt" RandomSeed="308590453" BuggifyEnabled="1" DeterminismCheck="0" OldBinary=""><NoTraceFileGenerated Severity="30" Plugin="" MachineName="e724e28ab0af" /><NoTraceFileGenerated Severity="30" Plugin="" MachineName="e724e28ab0af" /></Test>
<Test TestFile="/var/joshua/ensembles/20210712-143806-oleg-7b73538a29623d86/tests/fast/SidebandWithStatus.txt" RandomSeed="75714368" BuggifyEnabled="1" DeterminismCheck="0" OldBinary=""><NoTraceFileGenerated Severity="30" Plugin="" MachineName="e724e28ab0af" /><NoTraceFileGenerated Severity="30" Plugin="" MachineName="e724e28ab0af" /></Test>
<Test TestFile="/var/joshua/ensembles/20210712-143806-oleg-7b73538a29623d86/tests/slow/ParallelRestoreOldBackupCorrectnessCycle.txt" RandomSeed="714193089" BuggifyEnabled="1" DeterminismCheck="0" OldBinary=""><NoTraceFileGenerated Severity="30" Plugin="" MachineName="e724e28ab0af" /><NoTraceFileGenerated Severity="30" Plugin="" MachineName="e724e28ab0af" /></Test>
<Test TestFile="/var/joshua/ensembles/20210712-143806-oleg-7b73538a29623d86/tests/fast/ReportConflictingKeys.txt" RandomSeed="932252579" BuggifyEnabled="1" DeterminismCheck="0" OldBinary=""><NoTraceFileGenerated Severity="30" Plugin="" MachineName="e724e28ab0af" /><NoTraceFileGenerated Severity="30" Plugin="" MachineName="e724e28ab0af" /></Test>
<Test TestFile="/var/joshua/ensembles/20210712-143806-oleg-7b73538a29623d86/tests/slow/ParallelRestoreOldBackupCorrectnessMultiCycles.txt" RandomSeed="697384618" BuggifyEnabled="1" DeterminismCheck="0" OldBinary=""><NoTraceFileGenerated Severity="30" Plugin="" MachineName="e724e28ab0af" /><NoTraceFileGenerated Severity="30" Plugin="" MachineName="e724e28ab0af" /></Test>
<Test TestFile="/var/joshua/ensembles/20210712-143806-oleg-7b73538a29623d86/tests/fast/VersionStamp.txt" RandomSeed="579330401" BuggifyEnabled="1" DeterminismCheck="0" OldBinary=""><NoTraceFileGenerated Severity="30" Plugin="" MachineName="e724e28ab0af" /><NoTraceFileGenerated Severity="30" Plugin="" MachineName="e724e28ab0af" /></Test>
Ensemble stopped

When I run any of these tests manually, it passes:

[oleg@oleg2 bld]$ bin/fdbserver -r simulation -s 478368530  -f correctness/tests/fast/RandomUnitTests.txt
Random seed is 478368530...
Datacenter 0: 4/12 machines, 1/1 coordinators
Datacenter 1: 4/12 machines, 0/1 coordinators
Datacenter 2: 4/12 machines, 0/1 coordinators
FDBD joined cluster.
FDBD joined cluster.
FDBD joined cluster.
FDBD joined cluster.
FDBD joined cluster.
FDBD joined cluster.
FDBD joined cluster.
FDBD joined cluster.
FDBD joined cluster.
FDBD joined cluster.
FDBD joined cluster.
FDBD joined cluster.
FDBD joined cluster.
FDBD joined cluster.
FDBD joined cluster.
FDBD joined cluster.
FDBD joined cluster.
startingConfiguration:new backup_worker_enabled:=1 log_routers:=4 log_spill:=2 log_version:=3 proxies:=2 single regions=[{"datacenters":[{"id":"0","priority":2},{"id":"1","priority":1,"satellite":1,"satellite_logs":3},{"id":"2","priority":1,"satellite":1,"satellite_logs":5}]}] memory-2 usable_regions:=1 start
Run test:UnitTests start
setting up test (UnitTests)...
Test received trigger for setup...
Test received trigger for setup...
Test received trigger for setup...
Test received trigger for setup...
Test received trigger for setup...
running test (UnitTests)...
Found 117 tests
Testing /fdbserver/tlogserver/VersionMessagesOverheadFactor
UnitTests complete
UnitTests complete
UnitTests complete
UnitTests complete
UnitTests complete
UnitTests complete
checking test (UnitTests)...
fetching metrics (UnitTests)...
Metric (0, 0): Test Cases Available, 117.000000, 117
Metric (0, 1): Test Cases Executed, 1.000000, 1
Metric (0, 2): Test Cases Failed, 0.000000, 0
Metric (0, 3): Total wall clock time (s), 8.585997, 8.59
Metric (0, 4): Total flow time (s), 0.000000, 0
Metric (1, 0): Test Cases Available, 0.000000, 0
Metric (1, 1): Test Cases Executed, 0.000000, 0
Metric (1, 2): Test Cases Failed, 0.000000, 0
Metric (1, 3): Total wall clock time (s), 0.000000, 0
Metric (1, 4): Total flow time (s), 0.000000, 0
Metric (2, 0): Test Cases Available, 0.000000, 0
Metric (2, 1): Test Cases Executed, 0.000000, 0
Metric (2, 2): Test Cases Failed, 0.000000, 0
Metric (2, 3): Total wall clock time (s), 0.000000, 0
Metric (2, 4): Total flow time (s), 0.000000, 0
Metric (3, 0): Test Cases Available, 0.000000, 0
Metric (3, 1): Test Cases Executed, 0.000000, 0
Metric (3, 2): Test Cases Failed, 0.000000, 0
Metric (3, 3): Total wall clock time (s), 0.000000, 0
Metric (3, 4): Total flow time (s), 0.000000, 0
Metric (4, 0): Test Cases Available, 0.000000, 0
Metric (4, 1): Test Cases Executed, 0.000000, 0
Metric (4, 2): Test Cases Failed, 0.000000, 0
Metric (4, 3): Total wall clock time (s), 0.000000, 0
Metric (4, 4): Total flow time (s), 0.000000, 0
5 test clients passed; 0 test clients failed
Run test:UnitTests Done.

1 tests passed; 0 tests failed.

Unseed: 53310
Elapsed: 6.387475 simsec, 8.826840 real seconds
Server 2.0.1.0:1: 0.000 Mclocks
Server 3.4.3.5:1: 0.000 Mclocks
Server 2.2.1.1:1: 0.000 Mclocks
Server 2.0.1.3:1: 0.000 Mclocks
Server 2.0.1.2:1: 0.000 Mclocks
Server 2.1.1.0:1: 0.000 Mclocks
TestSystem 1.1.1.1:1: 0.000 Mclocks
Server 3.4.3.1:1: 0.000 Mclocks
Server 3.4.3.4:1: 0.000 Mclocks
Server 2.2.1.2:1: 0.000 Mclocks
Server 2.1.1.1:1: 0.000 Mclocks
Server 2.2.1.0:1: 0.000 Mclocks
Server 2.1.1.3:1: 0.000 Mclocks
Server 2.1.1.2:1: 0.000 Mclocks
Server 2.2.1.3:1: 0.000 Mclocks
Server 3.4.3.2:1: 0.000 Mclocks
Server 3.4.3.3:1: 0.000 Mclocks
Server 2.0.1.1:1: 0.000 Mclocks

How to fix 'NoTraceFileGenerated in joshua?

I found the root cause of NoTraceFileGenerated:

  1. fdbserver could not start in the joshua container due the glibc version incompatibility
  2. TestHarness.exe does not log the root error in this case and raises NoTraceFileGenerated

I created PRs on TestHarness.exe:
- Fixed 'NoTraceFileGenerated' error in simulation tests when the fdbserver process could not start by oleg68 · Pull Request #5177 · apple/foundationdb · GitHub
- Fixed 'NoTraceFileGenerated' error in simulation tests when fdbserver process could not start by oleg68 · Pull Request #5178 · apple/foundationdb · GitHub
- Fixed 'NoTraceFileGenerated' error in simulation tests when fdbserver process could not start by oleg68 · Pull Request #5179 · apple/foundationdb · GitHub