Given that you are seeing a 3x performance difference simply by changing from 10 bytes (base85) to 16 bytes (“utf-8”) or 8 bytes (tuple), I would think that the main difference is in the parsing of keys by the binding itself, rather than in the querying of the keys at the cluster level.
I’ve done a lot of profiling of the tuple encoding implementation in one of the bindings, and it can become a CPU bottleneck, especially in synthetic benchmarks where your thread spends most of its time encoding/decoding millions of keys (the rest is IO, waiting for the data to come back). In garbage-collected languages, the most significant waste is the allocation and garbage collection of temporary buffers, along with a lot of memory copying.
If, in real life, you will spend a lot of time reading and parsing millions of records at a time, then you should definitely try to optimize the handling of the keys. If, on the other hand, you will read/write a few keys at a time per request, the overhead of the rest of the code (HTTP request overhead, ACL checks, business logic, etc.) will dwarf the CPU required to encode/decode keys.
If you are only storing a single type of key, a 16-byte integer, then you should probably try to roll your own encoding. It can be as simple as storing your keys as a 16-byte big-endian byte blob (to preserve order). This will take 16 bytes per key but will be dirt cheap to encode/decode.
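To illustrate the fixed-size approach, here is a minimal Python sketch, assuming the identifier is a non-negative integer that fits in 128 bits. The names `encode_key` and `decode_key` are made up for this example and are not part of any binding.

```python
KEY_SIZE = 16  # bytes per key (128-bit identifier)


def encode_key(value: int) -> bytes:
    """Encode a non-negative integer as a fixed 16-byte big-endian blob.

    Big-endian puts the most significant byte first, so comparing the
    encoded keys byte by byte gives the same order as comparing the integers.
    int.to_bytes raises OverflowError if the value is negative or too large.
    """
    return value.to_bytes(KEY_SIZE, "big")


def decode_key(raw: bytes) -> int:
    """Decode a 16-byte big-endian blob back into an integer."""
    if len(raw) != KEY_SIZE:
        raise ValueError(f"expected {KEY_SIZE} bytes, got {len(raw)}")
    return int.from_bytes(raw, "big")
```

Because every key has the same length, there is no type code or length to parse, which is where the savings over a general-purpose encoder would come from.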
If your 16-byte identifier is a GUID then there’s probably no better encoding than that.
If your 16-byte identifier is a sequential number starting at 1, then you may look into some sort of varint encoding that preserves lexicographical order. The encoding used by the tuple layer can be a good candidate, and you could probably extract just the code that deals with integers, discarding all the overhead in the tuple layer that has to deal with variable-size tuples and dynamic typing, to get the best possible performance.
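As a rough idea of what an order-preserving variable-length encoding can look like, here is a simplified Python sketch: one leading byte holds the number of payload bytes, followed by the value in minimal big-endian form. This is similar in spirit to how the tuple layer handles integers, but it is not the tuple layer's exact byte format, and `encode_varint` / `decode_varint` are hypothetical names.

```python
MAX_BYTES = 16  # assumption: identifiers never exceed 128 bits


def encode_varint(value: int) -> bytes:
    """Length-prefixed, order-preserving encoding of a non-negative integer.

    A value that needs fewer bytes gets a smaller length prefix, so it sorts
    before any value that needs more bytes. Values of equal length are
    ordered by their big-endian payload.
    """
    if value < 0:
        raise ValueError("negative values are not supported in this sketch")
    length = (value.bit_length() + 7) // 8  # 0 bytes for value 0
    if length > MAX_BYTES:
        raise ValueError("value does not fit in 16 bytes")
    return bytes([length]) + value.to_bytes(length, "big")


def decode_varint(raw: bytes) -> int:
    """Inverse of encode_varint."""
    if not raw or len(raw) != 1 + raw[0]:
        raise ValueError("malformed key")
    return int.from_bytes(raw[1:], "big")
```

With this scheme, small sequential identifiers take 2 or 3 bytes instead of 16, at the cost of one extra byte for values that use the full 128 bits.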
If you need to store other data alongside, or do some sort of multi-tenant storage, then you can either simply use the tuple layer and hope that the performance will be sufficient, or just add a custom prefix to the 16-byte keys.
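The custom-prefix option could look something like the following Python sketch. The prefix values and the helper names (`make_key`, `split_key`, `prefix_bounds`) are assumptions for illustration only. In practice you would probably want prefixes of a fixed length, or at least such that no prefix is the start of another one, so that range reads for different tenants cannot overlap.

```python
ID_SIZE = 16  # bytes used by the identifier part of the key


def make_key(prefix: bytes, ident: int) -> bytes:
    """Build a key as <prefix><16-byte big-endian identifier>."""
    return prefix + ident.to_bytes(ID_SIZE, "big")


def split_key(prefix: bytes, key: bytes) -> int:
    """Strip the expected prefix and decode the identifier."""
    if not key.startswith(prefix) or len(key) != len(prefix) + ID_SIZE:
        raise ValueError("key does not belong to this prefix")
    return int.from_bytes(key[len(prefix):], "big")


def prefix_bounds(prefix: bytes) -> tuple[bytes, bytes]:
    """Smallest and largest possible keys under a prefix (both inclusive)."""
    return prefix + b"\x00" * ID_SIZE, prefix + b"\xff" * ID_SIZE
```

Keys under the same prefix still sort by identifier, so all the records of one tenant (or one kind of data) sit in a single contiguous range.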