Issue while querying special key for status json

Hi Team,

I am noticing that querying the special key “\xff\xff/status/json” sometimes doesn’t provide the complete json object.
I tried querying this key in fdbcli shell (get “\xff\xff/status/json”). Sometimes I get back the complete status json and sometimes a truncated one. The truncated json abruptly ends like {“client_version”:"\x95’

Could you please advise?

Thanks,
Pragya

It sounds like a bug to be getting a truncated document. Is the example you gave a real output that you got? That’s an unusual value for client_version, so if it is that might be some sort of clue.

Some other questions –

  1. Does the output always terminate in roughly the same place when it’s truncated?
  2. How often does this happen?
  3. Have you noticed any patterns for when it occurs?

Thank you for responding.

Yes, the example I gave is a real output. Pasting the whole output below. (Removing ip addresses):

Regarding the other questions:

  1. Does the output always terminate in roughly the same place when it’s truncated? - Yes
  2. How often does this happen? - Happens kind of randomly. I start seeing correct json after a few tries.
  3. Have you noticed any patterns for when it occurs? - Not much but status json command on fdbcli works fine.

Output
fdb> get “\xff\xff/status/json”
\xff\xff/status/json' is {“client”:{“cluster_file”:{“path”:"/xxx/ssst.cluster",“up_to_date”:true},“coordinators”:{“coordinators”:[{“address”:"",“reachable”:true},{“address”:"",“reachable”:true},{“address”:"",“reachable”:true}],“quorum_reachable”:true},“database_status”:{“available”:true,“healthy”:true},“messages”:,“timestamp”:1582707378},“cluster”:{“clients”:{“count”:12,“supported_versions”:[{“client_version”:“Unknown”,“connected_clients”:[{“address”:"",“log_group”:“default”}],“count”:1,“protocol_version”:“Unknown”,“source_version”:“Unknown”},{“client_version”:“6.2.10”,“connected_clients”:[{“address”:"",“log_group”:“default”},{“address”:"",“log_group”:“default”},{“address”:“10.xx.xxx.:50124”,“log_group”:“default”},{“address”:"",“log_group”:“default”},{“address”:"",“log_group”:“default”},{“address”:"",“log_group”:“default”},{“address”:"",“log_group”:“default”}],“count”:7,“max_protocol_clients”:[{“address”:"",“log_group”:“default”},{“address”:"",“log_group”:“default”},{“address”:"",“log_group”:“default”},{“address”:"",“log_group”:“default”},{“address”:"",“log_group”:“default”},{“address”:"",“log_group”:“default”},{“address”:"",“log_group”:“default”},{“address”:"",“log_group”:“default”},{“address”:"",“log_group”:“default”},{“address”:"",“log_group”:“default”},{“address”:"",“log_group”:“default”},{“address”:"",“log_group”:“default”}],“max_protocol_count”:12,“protocol_version”:“fdb00b062010001”,“source_version”:“Current version id not currently supported within Windows.”},{“client_version”:"\x95

@KrzysFR, just to double check, do you pull status at all via \xFF\xFF/status/json, or is the door open that this could possibly be a Windows-only problem that no one has run into before?

For what it’s worth, now, the json looks like below. As you can see, it doesn’t have the message regarding windows.

fdb> get “\xff\xff/status/json”
\xff\xff/status/json' is {“client”:{“cluster_file”:{“path”:"/xx/xx.cluster",“up_to_date”:true},“coordinators”:{“coordinators”:[{“address”:"",“reachable”:true},{“address”:",“reachable”:true},{“address”:"",“reachable”:true}],“quorum_reachable”:true},“database_status”:{“available”:true,“healthy”:true},“messages”:,“timestamp”:1582797952},“cluster”:{“clients”:{“count”:1,“supported_versions”:[{“client_version”:“Unknown”,“connected_clients”:[{“address”:"",“log_group”:“default”}],“count”:1,“protocol_version”:“Unknown”,“source_version”:“Unknown”},{“client_version”:"\xd4’

also, even in the complete json (when I get it, I have sometimes seen “Current version id not currently supported within Windows.” in the json.

I have exposed a GetStatusAsync(…) API on the .NET binding, but the implementation reads the \xff\xff/status/json key.

I have not encountered issues with truncation, but again our clusters are rather small (compared to others) so if the issue is due to some maximum size, I may not have seen it yet.

I just tested doing a lot of get “\xff\xff/status/json”, from a Windows client to a Windows server with 1 node, and the result is always complete (always ends with ...}}}}})

Same result when testing from Windows to a bigger Linux cluster with 8 servers and 64 processes: the returned value is complete everytime (and payload is much larger).

If you’re talking about the warning that appears when you run fdbcli --version … it’s been like this since forever (or at least 5.0+?) and only appears when querying the version. Don’t think that it’s used for anything else.

From windows:

C:\WINDOWS\system32>fdbcli --version
FoundationDB CLI 6.2 (v6.2.10)
source version Current version id not currently supported within Windows.
protocol fdb00b062010001

From linux:

doxense@fdb01:~$ fdbcli --version
FoundationDB CLI 6.2 (v6.2.7)
source version d3c4bd9c5ac29ef98c024317353e3c1a24a9d6b3
protocol fdb00b062010001

I think that the windows build script does not pass the current git commit hash to msvc or it’s not patched into a source file before building?

Interesting, our cluster is pretty small too - It is a test cluster - one machine, 5 processes, 800MB disk space used.

No, I see this written in the json sometimes when I do a get “\xff\xff/status/json”) - Just like it’s witten in my incomplete json below. I was saying that I sometimes see this in complete json as well.

I am actually running fdbcli on a linux host.

I think that it is same field that is used by fdbcli --version, and is right beside the protocol version which is also present on the next line.

Though, I’m not sure why in your samples above, they are always ending by \x95 or \xd4 which don’t look like valid UTF-8 encoding? Maybe this is an issue with a buffer overflow of some kind, and it is exposing what was before in memory? JSON is supposed to be utf-8 and maybe if there is some corrupted binary data somewhere, the generator would stop or choke on the first random byte that would not match any valid utf-8 code point?

In one of your sample you also have a weird square in the middle of ...“messages”:[],“timestamp”... though it may be an artefact of the forum comment syntax? Maybe if you add triple backquotes ``` on lines before and after the json output, we could see the exact text?

Anyway, I think this is the code that generates the status json is here: https://github.com/apple/foundationdb/blob/release-6.2/fdbserver/Status.actor.cpp#L923

What I can see is that it does write “count”, “client_version”, “protocol_version” and “source_version” (in this order) while in your output it is a bit different, we see “max_protocol_count” (instead of “count”), “protocol_version”, “source_version” and then “client_version” that is corrupted. Looking at the file history, this section of the code has not been touched in 3 years, except for the “count” property 7 months ago…

For reference, here is what I get at the same spot: ...,"count":2,"protocol_version":"Unknown","source_version":"Unknown"}]},"cluster_controller_timestamp":1...

It matches a bit more the code, except that I don’t have the ‘client_version’ field (maybe it is skipped because it is null?). I do have “Unknown” instead of “Current version id not currently supported within Windows.”

I’m also running 6.2.10 on this machine and my json output is slightly different, which is a bit worrying :slight_smile:

FYI, just updated the windows machine from 6.2.10 to 6.2.15 and I get the same result.

{
   "client":{
      "cluster_file":{
         "path":"C:\\ProgramData\\foundationdb\\fdb.cluster",
         "up_to_date":true
      },
      "coordinators":{
         "coordinators":[
            {
               "address":"127.0.0.1:4500",
               "reachable":true
            }
         ],
         "quorum_reachable":true
      },
      "database_status":{
         "available":true,
         "healthy":true
      },
      "messages":[

      ],
      "timestamp":1582807398
   },
   "cluster":{
      "clients":{
         "count":0,
         "supported_versions":[
            {
               "client_version":"Unknown",
               "connected_clients":[
                  {
                     "address":"127.0.0.1:58871",
                     "log_group":"default"
                  },
                  {
                     "address":"127.0.0.1:58918",
                     "log_group":"default"
                  }
               ],
               "count":2,
               "protocol_version":"Unknown",
               "source_version":"Unknown"
            }
         ]
      },
      "cluster_controller_timestamp":1582807398,
      ...
      // rest of the data
      ...
   }
}

I tried following the implementation of the Json Builder (json_spirit), looking for issues with handling utf-8 data, but it seems to be doing its job:

My idea was that for some reason, you have one connected client that is sending a corrupted client_version field, and this value would not be properly escaped into JSON, and when parsed back into the client, would be truncated (instead of throwing a syntax error), but I’m don’t see any clues in the code.

The only “weird” thing I can see is that the parser is only escaping a few characters into strings, and not escaping the rest: It is best practice to always escape characters with a code points >= 128:

The shouldEscape(...) helper should probably also return true for characters < 32 and >= 128, this would maybe prevent the json from being truncated, and would expose the garbage client_version?

Edit

I followed the code the produces the various client/protocol/source fields, and found this:

In my case, it seems that initUnknown() is called, but in your case, either the ClientVersionRef(std::string versionString) ctor is used to parse the string into 3 parts, using “,” as a separator, or the ClientVersionRef(Arena &arena, ClientVersionRef const& cv) is used.

Since the second ctor is looking for ‘,’ and is grabbing client_version first, it seems a bit unlucky that you’d get random binary data, followed by ‘,’ and then textual data. So probably the first ctor is used to copy already bad strings? My C+±foo is not strong enough to follow this ctor up the call tree, so someone with better knowledge of the code base is required!

I’ve heard of a similar issue happening in a Linux environment, though in that case I just know that the result was not a valid JSON document. The actual contents of the result were not preserved.

It looks like various of our binaries will produce an entry in the clients list with “Unknown” version, which could explain what you are seeing if you are using them. These would include backup and DR binaries as well as fdbcli.

This particular problem should be pretty easy to fix in a patch.

@pragya are you loading any external clients using the multi-version API?

No

Yes, even in my case, the environment is linux. I am running fdbcli on a linux host.
Also, did you never observe the issue again? Anything you did to fix it?

Yeah, it happens randomly for me. I works fine for long and then suddenly it stops working. I just now set up a cron which calls this key every 5 minutes and prints the output to a file. Let me see if I see something and how frequently do I see it

Yeah, that came only when I pasted on the forum. Have lost that particular output now. Will take care from next time.

Um, don’t know about this: see the below substring from a complete json output:

‘’’“cluster”:{“clients”:{“count”:1,“supported_versions”:[{“client_version”:“Unknown”,“connected_clients”:[{“address”:"",“log_group”:“default”}],“count”:1,“protocol_version”:“Unknown”,“source_version”:“Unknown”}’’’

The order is not followed here.

Could you please check this in a couple more cases and if everywhere you see it

I see, I am running 6.2.7

I am sorry, this is output when you run “status json” right - not get . Even in my case, when get gives truncated json, “status json” works completely fine.

I always used “get” for the samples above, I just formatted/indented the json to make it more legible (the result of the get is giant blob of json that fills the whole screen).

It seems that the client that uses “Unknown” is probably the backup_agent (cf @ajbeamon’s answer above).

I tried with an application connected to that cluster, and I now indeed see the log:

    "cluster" : {
        "clients" : {
            "count" : 1,
            "supported_versions" : [
                {
                    "client_version" : "Unknown",
                    "connected_clients" : [
                        {
                            "address" : "127.0.0.1:58871",
                            "log_group" : "default"
                        },
                        {
                            "address" : "127.0.0.1:63236",
                            "log_group" : "default"
                        }
                    ],
                    "count" : 2,
                    "protocol_version" : "Unknown", 
                    "source_version" : "Unknown"
                },
                {
                    "client_version" : "6.2.15",
                    "connected_clients" : [
                        {
                            "address" : "127.0.0.1:63027",
                            "log_group" : "default"
                        }
                    ],
                    "count" : 1,
                    "max_protocol_clients" : [
                        {
                            "address" : "127.0.0.1:63027",
                            "log_group" : "default"
                        }
                    ],
                    "max_protocol_count" : 1,
                    "protocol_version" : "fdb00b062010001",
                    "source_version" : "Current version id not currently supported within Windows."
                }
            ]
        },

First entry is probably the backup agent (“Unknown”), while the second one is a Windows application running locally, and it has the “Current version id not currently supported within Windows.” message as well.

Do you know if it is possible that there is another client, application or script that could connect from time to time, and that would inject an invalid value? Maybe when it works is when this client is not connected.

Maybe one way to find it is if, when the problem occurs, you can netstat the server to look for clients connected to the ports 4500, 4501, … and look for one that does not appear normally?

Ah, okay, got it!

Sure, let me check this whenever it happens again. But just so I understand, what’s the theory here - how would this client be able to change the output of this special key which calls an internal api to retrieve the status json.?

Right, so here in the second entry, order is not followed as expected, right?

My idea, and again I don’t have any proof of it, is that this client is sending a string containing its version, protocol version and so on, which is somehow corrupted (as in random bytes and not only text). Looking at the code, it’s supposed to be a string like “version,version,version” where version is probably "6.2.17", etc… The code that parses this string puts the corrupted data into string fields in memory. The code that produces the status JSON uses the content of these fields to fill a JSON object, which is sent back to the client as a string. But these random binary bytes would be seen as utf-8 text, except that they are probably not a legal utf-8 sequence, and borks the JSON parser (either server-side or client-side).

In your samples above, you can see \x95 and \xd4 which are above 128 and don’t look like a valid utf-8 multi-byte sequence. If the utf-8 reader in the json parser stops when it sees these bytes, but somehow the error is not caught, it would explain why you see a truncated output.

Maybe the status json command uses a different way to parse the json (since it needs to pretty print / indent the result) and does not stop on that byte?

One experience to invalidate this theory: if every time you see a truncated output, you always see the tail as a double quote followed by \xHH where HH is a number >= 0x80 and nothing after that, then I would bet on utf-8 decoding issues caused by improperly escaped “bad data”. If sometimes you don’t see ‘\xXX’ but regular text without anything after that, then it’s something else.

Now, why is there bad data in the client_version field, it’s another issue altogether.

I’m not sure. Maybe the JSON object builder class is using an hashtable to store the fields, that does not preserve the order when being serialized, which would explain the fact that the json fields are not in the same order as in the code.

I’ve started a PR to address this which has some fixes for things I’ve found so far:

  1. Our binaries like fdbcli will now report a version
  2. The version list could have been wrong when certain events occur while using the multi-version client

I don’t think these fixes solve the problem with there being a corrupted version, though I haven’t yet been able to identify why that is happening.

I also haven’t yet done any work towards fixing the problem with status truncating in response to characters it doesn’t like (assuming that is indeed the problem).

I didn’t observe this event myself, but it sounded like it happened multiple times and was resolved by trying to read status again later.

Thank you for this. I will definitely keep this in mind when I observe the issue next.