How to achieve 1000 tx/sec (1000 reads each) on a single client

We re-use the same transaction.

We are doing something like this, but just copying the read version. In our tests, performance is similar for both variants (reusing the read version or reusing the transaction).

This is very different, and has orders of magnitude more overhead (because there are 1000x more requests).
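To put rough numbers on the 1000x, here is a back-of-the-envelope sketch. The 1000 tx/sec and 1000 reads per transaction come from the thread title; treating every point read as one request and every range read as one request is a simplifying assumption, and `requests_per_sec` is just an illustrative helper.

```python
# Back-of-the-envelope request counts.
# Assumption: one network request per point read, one per range read.
TX_PER_SEC = 1000      # from the thread title
READS_PER_TX = 1000    # from the thread title


def requests_per_sec(tx_per_sec, reads_per_tx, batched):
    """Requests/sec when each transaction's reads go out as a single
    request (batched) or as one request per key (unbatched)."""
    per_tx = 1 if batched else reads_per_tx
    return tx_per_sec * per_tx


point_reads = requests_per_sec(TX_PER_SEC, READS_PER_TX, batched=False)
range_reads = requests_per_sec(TX_PER_SEC, READS_PER_TX, batched=True)

# Ratio between the two workloads in number of requests.
print(point_reads, range_reads, point_reads // range_reads)
```

Under these assumptions the point-read workload issues a million requests per second against a thousand for range reads, even though the number of keys returned is identical.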

One of the selling points is intelligent batching, but it seems that it is not implemented for reads? If every key is on the same storage server (SS), then the workload is about the same as for range reads. The only overhead is on the network, and that could be avoided or batched. It is just very surprising that this is not the case.
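What I have in mind is roughly the following toy model. It is only a sketch of the idea, not how FoundationDB actually routes reads: the shard map, the server names `ss1`/`ss2`, and the helpers `owner_of` and `batch_by_server` are all made up for illustration.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical shard map: sorted shard start keys and the server owning each shard.
SHARD_STARTS = [b"", b"m"]
SHARD_OWNERS = ["ss1", "ss2"]


def owner_of(key):
    """Return the (hypothetical) storage server whose shard contains key."""
    return SHARD_OWNERS[bisect_right(SHARD_STARTS, key) - 1]


def batch_by_server(keys):
    """Group point reads so that each server receives one request
    carrying all of the keys it owns."""
    batches = defaultdict(list)
    for key in keys:
        batches[owner_of(key)].append(key)
    return dict(batches)


keys = [b"user/%04d" % i for i in range(1000)]
batches = batch_by_server(keys)
print(len(keys), "point reads ->", len(batches), "batched request(s)")
```

In this model, 1000 keys that all live on one storage server collapse into a single request, which is why I would expect the cost to be close to a range read.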