Binding tester heisenbug with API version 710 and tenants

I have a CI setup for my Tokio/Rust bindings that continuously runs the binding tester.

Recently I encountered an unusual failure with FDB version 7.1.3 and API version 710.

The failing seed, according to CI, is 1191235632, and it fails with the following error message:

Incorrect result: 
  rust   - ('tester_output', 'stack', 630, 24440) = b'\x01GOT_RANGE_SPLIT_POINTS\x00'
  python - ('tester_output', 'stack', 630, 24440) = b'\x01\x01ERROR\x00\xff\x012131\x00\xff\x00'


Test with seed 1191235632 and concurrency 1 had 1 incorrect result(s) and 0 error(s) at API version 710

From the failure, it looks like the Python bindings detected a 2131 (tenant_not_found) error, whereas the Rust bindings were able to complete the get_range_split_points instruction successfully.

When I tried to replicate the failure locally using the failing seed, the error disappeared and the seed passed successfully.

What would be the recommended approach for dealing with such heisenbugs?

Some of these types of issues can be a bit tricky. Sometimes I’ll set up a bindingtester job to run the same seed repeatedly to see how often the failure reproduces. If it’s frequent enough, then that kind of reproduction can be enough to debug what’s happening and validate a fix.
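
As a rough sketch of that (assuming the usual bindingtester.py invocation, and relying on the summary-line format from your report), something like this can do the counting:

# Rough sketch: re-run the binding tester with a fixed seed and count how often
# the failure reproduces. The summary-line check matches the output format
# shown above; adjust the paths and arguments to your setup.
import subprocess

SEED = "1191235632"
RUNS = 50

failures = 0
for i in range(RUNS):
    result = subprocess.run(
        ["./bindings/bindingtester/bindingtester.py",
         "--num-ops", "1000", "--api-version", "710",
         "--test-name", "api", "--compare", "python", "rust",
         "--seed", SEED],
        capture_output=True, text=True)
    output = result.stdout + result.stderr
    if "had 0 incorrect result(s) and 0 error(s)" not in output:
        failures += 1
        print(f"iteration {i}: failed")

print(f"{failures}/{RUNS} iterations reproduced the failure")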

Another strategy is to look at the sequence of instructions leading up to the failed instruction, often with some extra debugging output added. For example, in this case you have a tenant that isn’t found, and so you could determine what tenant is being used and what kind of operations have been performed on that tenant. If you see that it was recently created or deleted for example, then there may be some sort of race between the creation/deletion and the use of the tenant. It can help to manually run through the instructions to determine what you think should happen, which can give clues to where things might go wrong.

If what you are seeing is that typically both bindings report a successful call to get the split points but only Python failed the one time, then it’s likely the problem isn’t the rust bindings. It could be something in the Python bindings or tester, something non-deterministic about the generated test, or even an issue in the core FoundationDB client. In any of these cases, I would expect the error to appear occasionally for us as well, so perhaps we’ll come across it soon.

Thanks @ajbeamon for the reply!

I just had another similar failure, so I decided to investigate further.

For the seed 1191235632, the sequence of instructions is as follows.

24433. 'RESET'
24435. 'TENANT_DELETE'
24436. 'WAIT_FUTURE'
24440. 'GET_RANGE_SPLIT_POINTS'
24443. 'READ_CONFLICT_RANGE'

For the seed 3275318271, the sequence of instructions is very similar and is as follows.

26949. 'RESET'
26951. 'TENANT_DELETE'
26952. 'WAIT_FUTURE'
26956. 'GET_RANGE_SPLIT_POINTS'
26957. 'RESET'

The code that implements the TENANT_DELETE instruction in Rust is very similar to the Java implementation. The Rust implementation is here, and the Java implementation is here. I was wondering if you might be able to spot any issue in the Rust implementation?

I could be wrong, but it looks to me that in the Rust implementation, for some reason, the C binding layer is reporting that the tenant deletion was successful and, in some cases, not returning an error for the subsequent GET_RANGE_SPLIT_POINTS instruction.

Also, while investigating this issue, I tried to figure out where the _TENANT-suffixed instructions were getting added in api.py, since I couldn’t find any instructions with that suffix.

Shouldn’t api.py's generate method contain something along these lines?

tenant_reads = [x + '_TENANT' for x in reads]
tenant_mutations = [x + '_TENANT' for x in mutations]

Good catch, yes that should be added.

I was wondering if you might be able to spot any issue in the Rust implementation?

I don’t spot anything on a first look. You said re-running the test doesn’t fail every time; do you know whether, in that case, both bindings report that the split points operation failed, or whether the operation succeeded in both?

AFAICT, it is only the Rust bindings that occasionally report GOT_RANGE_SPLIT_POINTS when they should be reporting an error.

The Tokio/Rust implementation of get_range_split_points is here and here. I am not really doing anything other than just calling the C API.

Funny story, I’m also experiencing 2131 errors during bindingTester :upside_down_face:
Digging :eyes:

@rajivr, did you disable some tests on your side?

I would be really interested if you are able to fix this issue. So far, I’ve had no luck.

The only test that I’ve disabled is the versionstamp v1 format, as that is not supported in my bindings. It is documented here.

Okay, after some digging, here’s what I found:

I’m experiencing many tenant-related errors on the Rust bindings. I decided to dig into the seed 3181802154 on FoundationDB 7.1.23. The bindingTester is at the latest commit on the release-7.1 branch. The seed passes on my laptop running Linux and fails on GitHub Actions:

 Incorrect result: 
   rust   - ('tester_output', 'stack', 1174, 19426) = b'\x01\x01ERROR\x00\xff\x011025\x00\xff\x00'
   python - ('tester_output', 'stack', 1174, 19426) = b'\x01\x01ERROR\x00\xff\x012131\x00\xff\x00'
 
 
 Test with seed 3181802154 and concurrency 1 had 1 incorrect result(s) and 0 error(s) at API version 710
 Completed api test with random seed 3181802154 and 1000 operations

We get error 1025 (transaction_cancelled) on Rust, whereas Python gets 2131 (tenant_not_found) on operation 19426. I forked my branch and started hacking.

I managed to reduce the scope of the problem to two jobs in the same job run:

They both run 50 iterations of the same seed. The only difference is that in one of them I’ve added an ugly 4ms sleep between each op near operation 19426 to make it pass.

I know tenants are an experimental feature, but I have the feeling that the Rust code is executing faster than Python, triggering a race condition somewhere :thinking:

Following this debug, I have some questions:

  • On the official CI, how many cores are used during bindingTester tests? GitHub Actions runners have only a 2-core CPU.
  • Are tenant operations async internally somewhere?
  • I cannot see any tenant ops in the Flow bindings; are they planned?

I could use some help or advice to fix this issue :eyes:

Thanks @PierreZ for the awesome work in helping to reproduce this issue. The CI for my bindings is on a previous version (7.1.12), and I too regularly see this error. Other than this one error, the binding tester is very stable.

My usual questions for starting an investigation into an issue like this are:

  1. Does the test always fail? If so, then that’s usually the easier case, and we mainly need to find out where the two errors are coming from.
  2. If not, is one of the binding testers producing a stable result? For example, do we most of the time get error 2131 in both bindings but occasionally 1025 in rust (or the opposite)? Or maybe both bindings occasionally give 1025 while usually giving 2131? If only rust gives the 1025 error, and then only some of the time, then it could be a subtle behavior difference in the bindings. That doesn’t necessarily mean that the binding implementation is to blame, but it could be revealing a non-determinism in the client behavior. This is of course harder to debug because of the difficulty in reproduction, and I’ll sometimes turn on more verbose logging and add some extra instrumentation before sending it off to try a bunch of runs (see the sketch after this list).
  3. What are the operations being run when the failure happens? The specific operation is useful, as well as any others in the same or concurrent transactions.
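
As a rough, hypothetical sketch of what I mean for the second point: re-run one seed a bunch of times and tally the error code each binding reports, parsed out of the comparison lines the tester prints (the command line and the parsing are based on the output shown in this thread; adjust them to your setup):

# Hypothetical sketch: tally which error code each binding reports across many
# runs of one seed, by parsing the comparison lines from the tester output.
import re
import subprocess
from collections import Counter

CMD = ["./bindings/bindingtester/bindingtester.py",
       "--num-ops", "1000", "--api-version", "710",
       "--test-name", "api", "--compare", "python", "rust",
       "--seed", "3181802154"]

counts = {"python": Counter(), "rust": Counter()}
for _ in range(50):
    result = subprocess.run(CMD, capture_output=True, text=True)
    for line in (result.stdout + result.stderr).splitlines():
        # Matches lines like:
        #   rust   - ('tester_output', 'stack', 1174, 19426) = b'\x01\x01ERROR\x00\xff\x012131\x00\xff\x00'
        m = re.search(r"(python|rust)\s+- \('tester_output'.*ERROR\\x00\\xff\\x01(\d+)", line)
        if m:
            counts[m.group(1)][m.group(2)] += 1

print(counts)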

In some of the cases I’ve seen where the same operation generated two different errors, both errors were valid in the context, and which one you get depends on which you hit first. Still, this can often be managed, and the first step is understanding why the operation would generate the errors.

I’ll add that I don’t think I’ve seen this problem (though I haven’t run the binding tester on 7.1 in a while, maybe something has been fixed since). If I had a reproduction I could play with using the standard bindings, I would try to lend a hand tracking it down, but otherwise I can try to offer suggestions based on any findings you have.

Thanks a lot @ajbeamon for your suggestions :smile: I will try to answer them the best way I can:

The test always fails on GitHub Actions; I’ve just tested 200 iterations of the same seed.

The latest run is:

  • failing without a sleep at the very first iteration,
  • succeeding with a 4ms sleep across 200 iterations.

It’s a hardware-dependent stable result :see_no_evil: For the seed 3181802154 on operation 19426:

  • in Python, I’m always experiencing 2131
  • in Rust, I’m experiencing either:
    • 2131 on my local, beefier machine,
    • 2131 on GitHub Actions with a 4ms sleep after each op between 18900 and 1950,
    • 1025 on GitHub Actions without a sleep.

I’m also seeing this error on previous runs with other seeds. The similarities are quite high:

  • same low-level Rust libraries,
  • both CIs based on GitHub Actions,
  • Python bindings experiencing 2131,
  • Rust bindings experiencing different behavior, ranging from other errors to a succeeding transaction.

I’m not sure what you mean by operations being run. As far as I know:

  • I’m not running any operations other than the previously mentioned ones,
  • I have no parallelism enabled in ./bindings/bindingtester/bindingtester.py --num-ops 1000 --api-version $fdb_api_version --test-name api --compare python rust --seed 3181802154

I meant to say the sequence of instructions run by the binding tester. You can use the --print option to print the sequence of instructions for a seed (and add --all I believe to include even the less interesting instructions).

In this output:

rust   - ('tester_output', 'stack', 1174, 19426)

The 19426 refers to the instruction number, so you can track it back down to the particular instruction that failed and look at what ran leading up to it.

This doesn’t always point you to the most relevant place, so if you have a stable result it’s also helpful to use --bisect. This will run the test repeatedly with different numbers of instructions until it finds the minimum number that produces your error. Then you pass that number into --num-ops and rerun the smaller reproduction and/or print the test instructions and see how it ends.
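
For example, roughly (reusing the invocation from earlier in this thread; the 875 in the second step is just a placeholder for whatever minimal count --bisect reports):

# Sketch of the --bisect workflow: let the tester search for the minimal number
# of operations that still fails, then re-run with just that many ops for a
# smaller reproduction.
import subprocess

base = ["./bindings/bindingtester/bindingtester.py",
        "--api-version", "710", "--test-name", "api",
        "--compare", "python", "rust", "--seed", "3181802154"]

# Step 1: bisect on the number of operations, starting from 1000.
subprocess.run(base + ["--num-ops", "1000", "--bisect"])

# Step 2: once a minimal failing count is reported (e.g. 875), re-run just that much.
subprocess.run(base + ["--num-ops", "875"])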

Thanks for the tips, they will help me display the relevant operations.

Running --bisect confirms that something strange is going on; here's an extract of the logs:

Test with seed 3181802154 and concurrency 1 had 0 incorrect result(s) and 0 error(s) at API version 710
Completed api test with random seed 3181802154 and 500 operations

Test with seed 3181802154 and concurrency 1 had 0 incorrect result(s) and 0 error(s) at API version 710
Completed api test with random seed 3181802154 and 750 operations

Test with seed 3181802154 and concurrency 1 had 1 incorrect result(s) and 0 error(s) at API version 710
Completed api test with random seed 3181802154 and 875 operations

Test with seed 3181802154 and concurrency 1 had 0 incorrect result(s) and 0 error(s) at API version 710
Completed api test with random seed 3181802154 and 813 operations

# Same logs with 0 errors and operations {844, 860, 852, 848, 846, 845}

Error finding minimal failing test for seed 3181802154. The failure may not be deterministic

We also started running integration tests (which start an fdb testcontainer) on our layers with tenants enabled, and we are hitting the same tenant_not_found error even though the tenant has been created. I’m not sure why, but sleeping before running the tests seems to fix the issue. I will keep digging, much faster now that I can reproduce locally through our integration tests.

To add more details: in our integration tests, we spin up an fdb container running 7.1.23. From several runs, I discovered that the tests always fail with a tenant_not_found error if I start them right after configure new single memory tenant_mode=optional_experimental;createtenant test. Waiting for the database to be healthy before starting the tests fixed the issue :see_no_evil:
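
For reference, the wait is roughly this (a sketch; it assumes fdbcli is on the PATH and that "status minimal" prints "The database is available" once the cluster is healthy):

# Sketch of the "wait until healthy" workaround used in our integration tests:
# poll fdbcli until it reports the database as available before running tests.
import subprocess
import time

def wait_for_healthy(timeout=60):
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = subprocess.run(["fdbcli", "--exec", "status minimal"],
                                capture_output=True, text=True)
        if "The database is available" in status.stdout:
            return True
        time.sleep(1)
    return False

if not wait_for_healthy():
    raise RuntimeError("database did not become healthy in time")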

I tried adding some sleep on the bindingTester, with no success.

I didn’t have time to look into the issue, but is there some tenant caching involved internally in fdbclient/fdb?

EDIT, 17 November 2022

I finally had the time to dig a bit more:

  • my assumption that I could reproduce it locally was wrong; it was another issue on our side :see_no_evil:
  • we have not encountered the bug in our layers, which are using the tenant branch,
  • it is still failing on CI. I added more pipelines to try to find more patterns:
    • ./bindings/bindingtester/bindingtester.py --num-ops 1000 --api-version $fdb_api_version --test-name api --compare python rust --seed 3181802154 fails without a sleep and succeeds with a sleep (Exhibits A and B; the code difference is here)
    • brute-forcing --bisect on multiple fdb versions shows that:
      • it fails randomly during bisect when the sleep is disabled, regardless of fdb’s version (Error finding minimal failing test for seed 3181802154. The failure may not be deterministic)
      • it always succeeds when the sleep is enabled, regardless of fdb’s version (No failing test found for seed 3181802154 with 1000 ops. Try specifying a larger --num-ops parameter.)

I’m not sure how I could debug this issue further. Any ideas, @ajbeamon?

My question above still remains: is there some tenant caching in libfdb/fdbserver?

I spent a little bit of time digging into the get range split points operation and what might go wrong with it. I don’t have an exact answer yet, but there are a couple things that seemed interesting. First, are you running your test against a cluster that has more than one process in it? I think that all of our binding tester runs involve only one process in the cluster, and if you are doing something different it could represent something worth exploring.

The other thing is that the getRangeSplitPoints function does not actually get a read version on the client to send to the cluster for this request. Instead, it gets answered based on the most recent data on the server process it talks to.

That means that if you delete a tenant and then succeed in sending a getRangeSplitPoints request to a storage server for that tenant, that server may get the request before it learns of the deletion. It would therefore not actually fail. Running more than one process in your cluster could make this scenario more likely, as there would be a longer delay between data being committed and getting sent to the storage server.

The last relevant bit involves what the client does when it wants to use a tenant as part of an operation. It needs to look up some details about the tenant, which it either learns from the commit proxy or gets locally from cache. Ordinarily reading the local cache would have been fine because the server process should reject the request if the cache was stale, but as mentioned above that isn’t happening here with an unversioned request.

That means it is also possible that there is some race involving this entry showing up in your client cache. This seems less likely (I think it would require that no successful operation on this tenant has been completed yet and that one is outstanding), but it may also be possible.
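
To make that concrete, here is a rough sketch of the kind of sequence that could hit this race, using the Python tenant management API as I understand it (the exact call names are my assumption); whether the final call fails with 2131 or quietly succeeds depends on whether the storage server has heard about the deletion yet:

# Rough sketch of the racy sequence; the tenant API calls are based on my
# understanding of the Python bindings and may need adjusting.
import fdb

fdb.api_version(710)
import fdb.tenant_management  # tenant management helpers (create/delete tenant)

db = fdb.open()
name = b"race_example_tenant"

fdb.tenant_management.create_tenant(db, name)
tenant = db.open_tenant(name)

# Use the tenant once so the client has its tenant mapping cached.
tr = tenant.create_transaction()
tr[b"k"] = b"v"
tr.commit().wait()

# Delete the tenant, then immediately ask for split points through the stale handle.
fdb.tenant_management.delete_tenant(db, name)

tr = tenant.create_transaction()
try:
    # This request is sent without a read version, so a storage server that has
    # not yet heard about the deletion may answer it successfully instead of
    # returning tenant_not_found (2131).
    tr.get_range_split_points(b"", b"\xff", 10000).wait()
    print("split points request succeeded")
except fdb.FDBError as e:
    print("error", e.code)  # typically 2131 (tenant_not_found)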

As a side note, some of what I described above (in particular, the caching behavior on the client) is likely going to change soon.

Thanks a lot for spending time on it :smile:

I’m not doing anything different.

That could explain the error when combining getRangeSplitPoints and tenants; thanks for digging that out.

If I put aside those types of errors, I’m often seeing a transaction_cancelled in Rust where it should be tenant_not_found:

 Incorrect result: 
  rust   - ('tester_output', 'stack', 1308, 26689) = b'\x01\x01ERROR\x00\xff\x011025\x00\xff\x00'
  python - ('tester_output', 'stack', 1308, 26689) = b'\x01\x01ERROR\x00\xff\x012131\x00\xff\x00'


Test with seed 2834048320 and concurrency 1 had 1 incorrect result(s) and 0 error(s) at API version 710
Completed api test with random seed 2834048320 and 1000 operations

Those errors only appear on the CI, and I can’t reproduce them locally. I will try to add a beefier CI server, because I have no idea how reliable those GitHub Actions VMs are :see_no_evil:

Also, could you share how the bindingTester is run on the official CI? I’m mostly looking for information like:

  • how many ops and which scenarios are run,
  • how many cores and how much RAM are available,
  • how the loop is handled,
  • whether there is a retry strategy on a seed,
  • whether you fail on each error.

My goal is to be closer to what you are using :smile:

I’m not really sure what kind of hardware we use, but I can point you to the scripts that we use to run the binding tester. I’m not super familiar with the details of what these scripts do, but I can give a summary from my understanding. The basic idea is that for a given run, we start with bindingTest.sh, which does a little setup and then runs bindingTestScript.sh. This creates a local cluster (using localClusterStart.sh) and then starts run_binding_tester.sh to actually run the binding tester. This last script has some logic to choose a particular set of arguments for the test and runs it.

We basically package all of this up and run it across a test framework that executes this whole process a bunch of times (with random seeds, etc., each time). I believe each run only does one run of the binding tester, and if the binding tester reports that the run failed (I think by exit code), then we will consider the test failed. We don’t really retry the same set of arguments in any automated way, but when we get reports of failures we would re-run the test manually.