Binding tester SET_READ_VERSION instruction

For the past week, I’ve been trying to debug a rather difficult binding tester failure in my Tokio Rust bindings.

@PierreZ has also encountered a similar error in his Rust bindings.

At a high level, what seems to be happening is that in Rust binding tester I am seeing a transaction succeed, whereas in Python the transaction is failing. Binding tester generates the following error.

Incorrect result: 
  rust   - ('tester_output', 'workspace', None, 7.77163468821456e-132, True, (SingleFloat(2.781151311691776e+16),)) = b'\x01ERROR\x00\x014436\x00'
  python - None

Incorrect result: 
  rust   - ('tester_output', 'stack', 596, 34252) = b'\x01RESULT_NOT_PRESENT\x00'
  python - ('tester_output', 'stack', 596, 34252) = b'\x01\x01ERROR\x00\xff\x011020\x00\xff\x00'

The sequence of instructions that is causing this issue is as follows.

  34224. 'WAIT_FUTURE'
  34225. 'NEW_TRANSACTION'
  34226. 'PUSH' 4436
  34227. 'ON_ERROR'
  34228. 'WAIT_FUTURE'
  34229. 'NEW_TRANSACTION'
  34230. 'SET_READ_VERSION'
  34231. 'PUSH' ...
  34232. 'SET'
  34233. 'PUSH' ...
  34234. 'PUSH' ...
  34235. 'WRITE_CONFLICT_RANGE'

  [...]

  34250. 'GET_KEY'
  34251. 'WAIT_FUTURE'
  34252. 'COMMIT'
  34253. 'WAIT_FUTURE'
  34254. 'RESET'
  34255. 'NEW_TRANSACTION'

At instruction 34229 a new transaction is setup whose read version gets modified (34230) to the version stored in the state machine. This transaction then commits in instruction 34252 causing an error in Python but succeeds in Rust. I’ve verified the inputs to intermediary SET, WRITE_CONFLICT_RANGE and GET_KEY, and they are similar.

I also tried to setup tracing at the C library level. C level trace logs for the transaction at instruction 34229 is here: trace-py.json and trace-rs.json

As you can see from the trace logs, the main difference seems to be that in case of Rust, I am seeing that self-conflicting transactions (prefix \xFF/SC/..) gets inserted and transaction succeeds. However in case of Python, these self-conflicting transactions are missing.

Now, if I were to surgically disable instruction 34230. 'SET_READ_VERSION' by modifying Rust binding tester and also Python tester.py, then transaction 34229 succeeds in both Rust and Python.

The C level trace logs are here: trace-py-34230-disabled.json and trace-rs-34230-disabled.json.

Another observation from the above trace logs is that even when the transaction 34229 succeeds, Python binding tester is not generating self-conflicting transaction, while Rust binding tester does. I am not sure why that is the case.

The seed that causing this issue is 4238495939 and it is happening at num-op 430. I use a slightly modified version of binding tester (since I don’t have support for versionstamp v1). You can use the binding tester that is here if you would like to generate the full sequence of instructions.

At num-op 1000, SET_READ_VERSION instruction is generated 4 times. Its only the second SET_READ_VERSION at 34230 that is causing the issue.

I would really appreciate help in figuring out what might be wrong and how I can fix it. :slight_smile:

At a high level what I’m seeing from the logs you posted is that the rust transaction is missing a read conflict range. This is the read conflict range the python transaction has that the rust transaction seems to be missing:

{
  "Severity": "10",
  "Time": "1647325464.518947",
  "DateTime": "2022-03-15T06:24:24Z",
  "Type": "TransactionTrace_CommitError_ReadConflictRange",
  "ID": "0000000000000000",
  "TransactionID": "34229",
  "Begin": "\\x02tester_output\\x00\\x02workspace\\x00\\x1d\\x0an*\\xf2h\\xa3M\\x01\\xfdG\\x16&\\x1d\\xc9vv\"\\xc9#\\xcc\\xbcU\\xdc\\x87\\x03\\xa2\\x86\\xcc\\xd8\\xc9\\xc0\\xdb/\\x1b\\xc8\\xbax\\xa0R\\xbbBi\\xa3\\xb9+\\xc8\\xd9\\x98\\x18Rd_\\x94\\xddr:\\xba\\x82c\\xc0[\\x80\\x99\\xa5\\x8b\\x97;\\xd9~\\x85\\xd1\\x08\\xb5\\x97tn4\\x00\\xf0q4\\xff\\xa9\\xfb\\x82P\\x9b\\x1f\\x9c\\xe9z6\\x9d\\xe4\\x83\\xd5\\xda\\xe6\\x06dv\\xe9\\xf6\\xf9\\xb0#\\xafY\\x19E%mW\\x8dW<Ap\\xed\\x82\\xa9\\x12\\x93\\x1e\\xe7\\x97\\xd8\\x11X3Zp!.\\xe4\\xc4f\\x94\\x0d3\\xcd\\x82D\\xcc\\xcc6X\\xbb\\xd4\\x8f\\xfd\\x85\\xd4\\x8ei\\x88\\xf4\\x06\\xab5q\\xb8p\\x8d\\x1b\\x91H\\xae\\x07\\xf4>\\x17\\x12\\xa2\\xe0\\x9d\\xd6\\xb8\\xca\\xf4\\x7fq\\xa5\\\\x0a\\x0f\\xa0\\x83\\xb5\\xef\\xe3\\xa3\\x8f\\xa6,W\\x8ae\\xedC:E\\x02\\xa1\\xa4\\xac\\x0b\\xf8\\xf0\\xf0\\x7fS\\x9fi\"\\x01Zy;?\\xd7\\xdffM\\xa9\\xcf\"\\xcb\\x98\\xaa\\xe0\\xef\\xc1\\x19o\\x03\\xcd\\xde\\xb1\\xc5Dj\\xc1\\x93\\x8dk\\xcb\\xae9R\\xcbXx\\x00&",
  "End": "\\x02tester_output\\x00\\x02workspace\\x000\\xaa\\x0a\\xb6\\x99\\xafCMR\\x91f\\xcd\\x00\\xbf\\xbc\\xdf\\xaf\\x00",
  "Machine": "127.0.0.1:28829",
  "LogGroup": "default"
}

The reason you don’t see the \xFF/SC/ conflict ranges in the python is that the transaction is already self-conflicting.

Thanks @andrew.noyes for the reply. Since your pointer regarding the missing read conflict range, I went back to study the C logs again.

Is there a way to request detailed transaction log at the C API level?

Basically I want to be able to see all the mutations performed by a transaction along with the gets and range reads.

I am currently using the following settings.

# Enable tracing
fdb.options.set_trace_enable("trace-py")
fdb.options.set_trace_format("json")

def new_transaction_with_tracing(self, idx):
    with Tester.tr_map_lock:
        tr = self.db.create_transaction()

        tr.options.set_debug_transaction_identifier(str(idx))
        tr.options.set_transaction_logging_max_field_length(-1)
        tr.options.set_log_transaction()

        Tester.tr_map[self.tr_name] = tr

Is there some way that I can increase the verbosity even further?

The mutations should be in the Commit (and CommitError) event: foundationdb/ClientLogEvents.h at ecccfd0868cea448e3c6e457dbf9e7155cacc3cb · apple/foundationdb · GitHub

Thanks @andrew.noyes :slight_smile: I did look at this file before. I am missing some mutations in the logs. Hence I asked the previous question to see if there was some other way to get more logging information.

In stacktester, I was wondering it there was some other way a key might be updated outside of the SET and ATOMIC_OP instructions?

I am encountering a scenario where GET_KEY is returning a value, and I am unable to identify which SET instruction was setting that value both from Python and C logs.

I don’t know of any other way to get a key to have a value besides sets and atomic ops

My first guess as to why you see your get key request do a range read without a conflict range is that it is being called with the snapshot parameter enabled. Is it possible that the bindings are errantly passing the wrong snapshot value to fdb_transaction_get_key?

I am encountering a scenario where GET_KEY is returning a value, and I am unable to identify which SET instruction was setting that value both from Python and C logs.

Depending on the key, it’s possible that you’ve left the bounds of your tester’s output space and are seeing keys that were generated externally (e.g. via the binding tester setup logic). I don’t think that should be the case if your key has the \\x02tester_output prefix like above, though.

In stacktester, I was wondering it there was some other way a key might be updated outside of the SET and ATOMIC_OP instructions?

The unit tests can do this, though that shouldn’t apply in a non-scripted test. LOG_STACK will also write keys, but that should only happen at the end of the test. Some directory operations also write keys. Of course, SET_DATABASE and ATOMIC_OP_DATABASE will also write keys just in case you didn’t account for those.

@ajbeamon Thanks a lot for the reply! Your guess was indeed right. :slight_smile: Turns out there was a rather subtle error in the APIs that I had created to support binding tester (link 1, link 2) This API was causing reads to automatically become snapshot reads.

The seed is still failing: that is transaction succeeds in Rust, and fails in Python. But there is some progress.

Here are the updated C level traces for Rust and Python: trace-rs.json and trace-py.json.

As you can see, there is no longer self conflicting transactions in Rust, and get key request produces a GetRange and ReadConflictRange (though the End value is different).

Something I noticed from the trace is that there is GetVersion event for transaction 34229 in Rust, but a similar event is missing in Python. Is there some reason why that might be the case?

Here is the output of the binding tester.

Thanks again for the help. I really appreciate it!

@ajbeamon @andrew.noyes I am happy to share the seed is now passing. :slight_smile:

This has been a rather grueling bind tester issue that has taken almost two weeks to fix. I am really glad to see it pass.

Thanks again for all the help and suggestions.

Thanks @rajivr, @ajbeamon and @andrew.noyes for the multiple tips. I’m also digging a error on the Rust binding:

./bindings/bindingtester/bindingtester.py --num-ops 132 --api-version 630 --test-name directory --concurrency 1 rust --no-directory-snapshot-ops --compare python --seed 4142326254

Incorrect result: 
  rust   - ('tester_output', 'stack', 139, 1234) = b'\x01RESULT_NOT_PRESENT\x00'
  python - ('tester_output', 'stack', 139, 1234) = b'\x01\x01ERROR\x00\xff\x011020\x00\xff\x00'

Comparing results from '('tester_output', 'directory_log', 'directory')'...
Comparing results from '('tester_output', 'directory_log', 'subspace')'...

Test with seed 4142326254 and concurrency 1 had 1 incorrect result(s) and 0 error(s) at API version 630
Completed directory test with random seed 4142326254 and 132 operations

The instructions for the seeds are:

1216. Instruction is RESET
1220. Instruction is DIRECTORY_REMOVE_DATABASE
1224. Instruction is DIRECTORY_REMOVE_IF_EXISTS
1226. Instruction is DIRECTORY_CHANGE
1228. Instruction is DIRECTORY_LIST
1233. Instruction is DIRECTORY_MOVE_DATABASE
1234. Instruction is COMMIT

So we have 3 transactions:

  • The transaction created at 1216 during RESET that will be commited at 1234.
  • The two transactions for the _DATABASE instructions:
    • DIRECTORY_REMOVE_DATABASE at 1220,
    • DIRECTORY_MOVE_DATABASE at 1233.

Yes, something is wrong here, given FDB’s strict serializability :rofl:

I added some traces on both Python and Rust and I can cleary see some CommitError’s events on the Python’s side.

Do you know the difference between:

  • TransactionTrace_CommitError_ReadConflictRange,
  • TransactionTrace_CommitError_WriteConflictRange,
  • TransactionTrace_CommitError_Mutation?

There is a high level overview of conflict ranges available here.

For the purposes of this test, the read conflict ranges should correspond with the keys that your directory operations were required to read (either through point reads or range reads) or any read conflict ranges that they added. The write conflict ranges would correspond with any keys that were written during the directory operations (or likewise any write conflict ranges explicitly added). In the Python output, we can see that the one write conflict range corresponds with the single key that was set.

The mutations event will tell you about any particular write operations you perform, such as a set, clear, or atomic operation. These will usually generate write conflict ranges, though that can be disabled through a transaction option.

For a transaction that fails with a conflict, the most relevant detail will be the read conflict ranges of that transaction. A conflict means some other transaction wrote a key in one of those ranges.

If you observe a different set of read conflict ranges in your Rust transaction, that either means that you read different data or you set snapshot=true on your reads, which disables conflict ranges. You should be able to observe the reads performed using other TransactionTrace events.

If the read conflict ranges are the same, then that means one of your other transactions must be writing different keys. If that’s the case, you would need to look at the Mutation or WriteConflictRange sets for the other two transactions.

Strangely, on the Rust-side, I’m not seeing any TransactionTrace_Commit events for 1216 :upside_down_face:

/tmp/fdb/rust 
❯ grep 1216 trace.127.0.0.1.27174.1648028127.9WxKI6.0.1.json | jq | grep '"Type"' 
  "Type": "TransactionTrace_GetVersion",
  "Type": "TransactionTrace_Get",
  "Type": "TransactionTrace_Get",
  "Type": "TransactionTrace_Get",
  "Type": "TransactionTrace_Get",
  "Type": "TransactionTrace_Get",
  "Type": "TransactionTrace_GetRange",

/tmp/fdb/rust 
❯ cd ../python

/tmp/fdb/python 
❯ grep 1216 trace.127.0.0.1.27179.1648028127.k5UsF1.0.1.json | jq | grep '"Type"'
  "Type": "TransactionTrace_GetVersion",
  "Type": "TransactionTrace_Get",
  "Type": "TransactionTrace_Get",
  "Type": "TransactionTrace_Get",
  "Type": "TransactionTrace_Get",
  "Type": "TransactionTrace_Get",
  "Type": "TransactionTrace_GetRange",
  "Type": "TransactionTrace_CommitError_ReadConflictRange",
  "Type": "TransactionTrace_CommitError_ReadConflictRange",
  "Type": "TransactionTrace_CommitError_ReadConflictRange",
  "Type": "TransactionTrace_CommitError_ReadConflictRange",
  "Type": "TransactionTrace_CommitError_ReadConflictRange",
  "Type": "TransactionTrace_CommitError_ReadConflictRange",
  "Type": "TransactionTrace_CommitError_WriteConflictRange",
  "Type": "TransactionTrace_CommitError_Mutation",
  "Type": "TransactionTrace_CommitError",

The transaction seems to be commited from my point-of-view, but without events, which is weird, because I’m expecting events like TransactionTrace_Commit_ReadConflictRange or TransactionTrace_CommitError_ReadConflictRange.

Digging :upside_down_face:

EDIT

I digged this afternoon. Here’s what I learned:

  • It seems like I’m doing the Commit part on the right transaction,
  • It seems like I’m not calling some resets,
  • Creating a dummy transaction and calling commit directly without set will not generate commit traces. I think this explains why I’m not seeing TransactionTrace_Commit events, as I have no Set.

I will check that tomorrow :slight_smile:

The seed(and others) is now fixed, thanks a lot @ajbeamon for your help :slight_smile:

As the debug of this might be useful to others, I will continue my previous post.

This is the key-point that allowed me to debug this. On the Python-bindings, I was seeing a mutation:

{
    "Severity": "10",
    "Time": "1647961245.701304",
    "DateTime": "2022-03-22T15:00:45Z",
    "Type": "TransactionTrace_CommitError_Mutation",
    "ID": "0000000000000000",
    "TransactionID": "1216",
    "Mutation": "code: SetValue param1: \\x07\\x06\\x03\\x05\\x03\\x0b\\x05\\x0c\\x04\\x0f\\x04\\x12\\x0c\\x13\\x12\\x0d\\x0f\\x0c\\x10\\x08\\x0c\\x01\\x07\\x06\\x03\\x05\\x03\\x0b\\x05\\x0c\\x04\\x0f\\x04\\x12\\x0c\\x13\\x12\\x0d\\x0f\\x0c\\x10\\x08\\x0c\\x00\\x01version\\x00 param2: \\x01\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00",
    "ThreadID": "12023878897746492853",
    "Machine": "127.0.0.1:41025",
    "LogGroup": "default"
}

This mutation was not present in the Rust bindings. As they were no Set, the traces did not contain Commit-related events.

This key contains at the end version, which is some metadata used by the directory. Thanks to the ValueSizeBytes fields in the traces, I discovered that the key was wrongly set somewhere in my Rust bindings.

By grepping the key, I found the the transactionID, then the guilty operation. The key was set during a failed CREATE_OR_OPEN. Let’s look at the code!

Rust binding:

self.check_version(trx, allow_create).await?;

Python binding:

self._check_version(tr, write_access=False)

In my case, allow_create was true, triggering the creation of the key.

On a final note, I would like to thank the FDB community for developing the BindingTester. It allows binding maintainers (official or not) to be certain of the correctness of their library.

This silly bug would have been difficult to be catched with classic tests. With the BindingTester, the Rust-bindings is taking advantages of Github Action’s free-tier to run thousands of seeds every hour. This is a great way to find problems that are reproducible.

1 Like