Issue while querying special key for status json

Thanks, will you please change the shouldEscape function too as per Christophe’s comment below?

Saw 4 such cases in my crontab files. In all 4, the JSON truncated at "\xcd".
So far, I haven’t seen truncation in regular text. Will keep checking my crontab output file to find more cases.

> Thanks, will you please change the shouldEscape function too as per Christophe’s comment below?

Yes, I haven’t looked carefully at this aspect of the issue yet, but I will try to get in a fix for this problem too.

This and the others look like the start of multi-byte UTF-8 sequences, so… the theory that there is something bad regarding UTF-8 encoding or decoding is still alive :slight_smile:

I don’t think this will fix the underlying issue, but it could probably ensure that even if there is bad data somewhere, it won’t cause a cascade failure when reading the status json key (assuming this is the cause of the bug). Though it may now propagate the bad data to consumers of the API that may not handle it properly?

But anyway, it is usually recommended to whitelist the characters that do NOT need escaping in JSON: only pass through basic printable ASCII characters, and defensively escape everything else.

In my own implementation, I only let through “printable” characters from the ASCII, Latin-1, Chinese, Japanese, Arabic and Cyrillic alphabets to help a human read JSON exports, and escape all the rest. I think in fdb’s use case you could whitelist only alphanumerics plus the usual gang of punctuation that you would normally produce, and encode everything else.
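To make the idea concrete, here is a minimal sketch of that kind of defensive escaper (not fdb’s actual shouldEscape; the name and thresholds are my own): printable ASCII passes through, and every other byte becomes a \uXXXX escape so a stray byte can never break the surrounding string literal.

```cpp
// Minimal sketch of a whitelist-based escaper: printable ASCII passes through,
// everything else is escaped so corrupted bytes cannot break the JSON document.
#include <cstdio>
#include <string>

std::string escapeJsonString(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (unsigned char c : in) {
        switch (c) {
        case '"':  out += "\\\""; break;
        case '\\': out += "\\\\"; break;
        case '\n': out += "\\n";  break;
        case '\r': out += "\\r";  break;
        case '\t': out += "\\t";  break;
        default:
            if (c >= 0x20 && c < 0x7f) {
                out += static_cast<char>(c); // whitelisted printable ASCII
            } else {
                char buf[8];
                std::snprintf(buf, sizeof(buf), "\\u%04x", static_cast<unsigned>(c));
                out += buf; // control bytes and bytes >= 0x80 are escaped
            }
        }
    }
    return out;
}
```

Escaping byte-by-byte like this mangles legitimate multi-byte UTF-8 (it comes out as Latin-1-style \u00XX escapes), so a real implementation might decode UTF-8 first; for fields that should only ever contain ASCII, that trade-off looks acceptable.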

One unit test that could be added after that is to verify that all strings expected to contain ASCII text do not contain code points outside of that range (especially > 0x80), which are more likely to be due to some corruption. Maybe this assert could be added inside the various JsonObject/Array builders, behind some knob?
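As a rough illustration of that test (all names below are hypothetical), the check could be as simple as:

```cpp
#include <string>

// Hypothetical helper: true only if every byte is printable ASCII (0x20..0x7E).
// Control bytes or bytes >= 0x80 in a "version"-style field would point to
// corruption happening before the JSON document is built.
bool isPrintableAscii(const std::string& s) {
    for (unsigned char c : s) {
        if (c < 0x20 || c > 0x7e)
            return false;
    }
    return true;
}

// Sketch of the suggested assert inside a builder, gated behind a knob so it
// can be enabled in tests (the knob name is made up):
//     if (FLOW_KNOBS->ASSERT_ASCII_IN_STATUS_JSON)
//         ASSERT(isPrintableAscii(clientVersionString));
```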

Hi, any update on this?

The PR mentioned above has gone in and will come in the next patch release, but we haven’t yet tracked down the source of the original problem.

There’s been some evidence to suggest that the invalid status reports I’d previously heard of were the result of some sort of tooling issue and not the same as what’s happening here. So far I’ve been unable to reproduce the problem myself or spot in code what is going wrong.

I think I now understand how the document can become truncated in response to bad values. If there is a null byte (\x00) present in the version string, then the status document that you get when reading the key \xff\xff/status/json will be truncated at that point. Interestingly, this doesn’t seem to be a problem when querying status from fdbcli, as it converts the JSON object to a string using a pretty printer that escapes all unprintable characters.
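To illustrate the suspected mechanism (this is my guess at what is going on, not a trace through the fdb source): anywhere the status text passes through a strlen()-style C-string conversion, an embedded \x00 silently drops everything after it.

```cpp
#include <cstring>
#include <iostream>
#include <string>

int main() {
    // A "version" field with an embedded null byte, as if the client sent bad data.
    std::string version = std::string("6.2.15") + '\0' + "bytes-after-the-null";
    std::string statusJson = "{\"client_version\":\"" + version + "\"}";

    // Length-aware handling keeps the whole document intact...
    std::cout << "std::string size: " << statusJson.size() << "\n";

    // ...but anything that relies on strlen()/%s stops at the null byte, which is
    // exactly the kind of truncated, unparseable output seen when reading the
    // \xff\xff/status/json key.
    std::cout << "strlen():         " << std::strlen(statusJson.c_str()) << "\n";
}
```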

We could potentially do similar formatting for the JSON when it is read from our special key, or alternatively we could keep using the raw bytes but not truncate the document.

I was able to discover this by intentionally having the client send bad version strings. I did this using the hidden option SUPPORTED_CLIENT_VERSIONS, which has option code 1000. I’m still not sure how it’s happening that you would get invalid version strings from the client without doing something like this, so the root problem is still a bit of a mystery.
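For reference, this is roughly how a bad version string can be injected for testing (a sketch only; it assumes SUPPORTED_CLIENT_VERSIONS is reachable as network option code 1000 through the C API, and a normal client would never set it by hand):

```cpp
#define FDB_API_VERSION 620
#include <foundationdb/fdb_c.h>

#include <cstdint>

int main() {
    fdb_select_api_version(FDB_API_VERSION);

    // Deliberately malformed "supported client versions" payload containing raw
    // binary bytes; a well-behaved client fills this in internally.
    const char bad[] = "6.2.15\x00\xCD\xFF not-a-real-version";
    fdb_error_t err = fdb_network_set_option(
        static_cast<FDBNetworkOption>(1000), // SUPPORTED_CLIENT_VERSIONS (hidden)
        reinterpret_cast<const uint8_t*>(bad),
        static_cast<int>(sizeof(bad) - 1));

    // After setting up the network and connecting, the cluster's status json
    // would then report this garbage under supported_versions.
    return err != 0;
}
```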

I’ve created an issue now for this escaping problem:

First time for me reproducing this in real life, with v6.2.15 deployed on a Linux cluster:

Here is an extract of the “status json” that is returned by fdbcli. You can clearly see that it is (properly) encoding garbage for both protocol_version and source_version. My tiny JSON parser implementation in the .NET client also breaks when parsing the value obtained by reading the system key, complaining of an improperly closed string literal.

            "supported_versions" : [
                {
                    "client_version" : "6.2.15",
                    "connected_clients" : [
                        {
                            "address" : "10.10.0.22:5900",
                            "log_group" : "default"
                        },
                        {
                            "address" : "10.10.0.22:5913",
                            "log_group" : "default"
                        }
                    ],
                    "count" : 2,
                    "max_protocol_clients" : [
                        {
                            "address" : "10.10.0.22:5900",
                            "log_group" : "default"
                        },
                        {
                            "address" : "10.10.0.22:5913",
                            "log_group" : "default"
                        }
                    ],
                    "max_protocol_count" : 2,
                    "protocol_version" : "\u0000\u0000\u0002\u0000\u0000\u0000T\u0000\u0000\u0000\u0010\u0000\u0000\u0000\u0002",
                    "source_version" : "ÿ\u0002\u0000\u0000\u0000\u0010\u0000\u0000\u0000ÿÿÿÿ\u0002\u0000\u0000\u0000\u001E\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0002\u0000\u0000\u00005\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0012\u0000\u0000\u0000:\u0000\u0000\u0000\u0005\u0000\u0000\u0000\u0000\u0000\u0000\u0000M\u0000\u0000\u0000"
                },

The IP in question belongs to one of our TeamCity build agents and … it was executing unit tests against this same cluster while this was failing. A few minutes after it completed its run and disconnected from the cluster, all of a sudden “status json” no longer returned any bad values.

This agent is running the Windows version of the client, talking to a Linux version of the server. I don’t recall having seen this issue with Windows => Windows. Though in the past, the CI build agents would each run against a local instance of fdb, rather than all talking to the same cluster used for testing the complete app.

What is weird is that at the same time, I had an instance of my tool “FdbShell” connected to the same cluster, which uses the same version of the .NET client, also from a Windows client to the same Linux fdb cluster… so it does appear to be random?