Binding tester heisenbug with API version 710 and tenant

I’ve Tokio/Rust bindings CI setup to continuously run binding tester.

Recently I encountered an unusual failure with FDB version 7.1.3 and API version 710.

The failing seed according to CI is 1191235632 and it fails with the following error message

Incorrect result: 
  rust   - ('tester_output', 'stack', 630, 24440) = b'\x01GOT_RANGE_SPLIT_POINTS\x00'
  python - ('tester_output', 'stack', 630, 24440) = b'\x01\x01ERROR\x00\xff\x012131\x00\xff\x00'


Test with seed 1191235632 and concurrency 1 had 1 incorrect result(s) and 0 error(s) at API version 710

From the failure, it looks like Python bindings detected a 2131 (tenant_not_found) error, whereas the Rust bindings was able to complete the get_range_split_points instruction successfully.

When I tried to replicate failure locally, using the failing seed, the error disappeared, and the seed passed successfully.

What would be the recommended approach for dealing with such heisenbugs?

Some of these types of issues can be a bit tricky. Sometimes I’ll set up a bindingtester job to run the same seed repeatedly to see how common the failure reproduces. If it’s frequent enough, then that kind of reproduction can be enough to debug what’s happening and validate a fix.

Another strategy is to look at the sequence of instructions leading up to the failed instruction, often with some extra debugging output added. For example, in this case you have a tenant that isn’t found, and so you could determine what tenant is being used and what kind of operations have been performed on that tenant. If you see that it was recently created or deleted for example, then there may be some sort of race between the creation/deletion and the use of the tenant. It can help to manually run through the instructions to determine what you think should happen, which can give clues to where things might go wrong.

If what you are seeing is that typically both bindings report a successful call to get the split points but only Python failed the one time, then it’s likely the problem isn’t the rust bindings. It could be something in the Python bindings or tester, something non-deterministic about the generated test, or even an issue in the core FoundationDB client. In any of these cases, I would expect the error to appear occasionally for us as well, so perhaps we’ll come across it soon.

Thanks @ajbeamon for the reply!

I just had another similar failure, so I decided to investigate further.

For the seed 1191235632, the sequence of instructions is as follows.

24433. 'RESET'
24435. 'TENANT_DELETE'
24436. 'WAIT_FUTURE'
24440. 'GET_RANGE_SPLIT_POINTS'
24443. 'READ_CONFLICT_RANGE'

For the seed 3275318271, the sequence of instructions is very similar and is as follows.

26949. 'RESET'
26951. 'TENANT_DELETE'
26952. 'WAIT_FUTURE'
26956. 'GET_RANGE_SPLIT_POINTS'
26957. 'RESET'

The code that implements TENANT_DELETE instruction in Rust is very similar to the Java implementation. Rust implementation is here. Java implementation is here. I was wondering if you might be able to spot any issue in the Rust implementation?

I could be wrong, but It looks to me that in the Rust implementation, for some reason the C binding layer is returning that tenant deletion was successful, and in some cases not returning an error for the subsequent GET_RANGE_SPLIT_POINTS instruction.

Also when I was investigating this issue, I tried to figure out where _TENANT suffix instructions were getting added in api.py since I didn’t find any _TENANT suffix instructions.

Shouldn’t api.py's generate method contain something along the lines?

tenant_reads = [x + '_TENANT' for x in reads]
tenant_mutations = [x + '_TENANT' for x in mutations]

Good catch, yes that should be added.

I was wondering if you might be able to spot any issue in the Rust implementation?

I don’t spot anything in a first look. You said re-running the test doesn’t fail every time, do you know if in that case both bindings are reporting that the split points operation failed or if both cases had the operation succeed?

AFAICT, its is only the Rust bindings that occasionally reports GOT_RANGE_SPLIT_POINTS when it should be reporting an error.

The Tokio/Rust implementation of get_range_split_points is here and here. I am not really doing anything other than just calling the C API.