Add health metrics feature to special key space

Hello! We’d like to expose the functionality introduced in PR https://github.com/apple/foundationdb/pull/1114 via the special key space. A quick summary of this functionality:

Status requests are expensive for the cluster controller, so certain cluster health metrics can be sent from the ratekeeper to proxies and accessed by clients, avoiding the cluster controller altogether.

Aggregate health metrics:

int64_t worstStorageQueue;
int64_t worstStorageDurabilityLag;
int64_t worstTLogQueue;
double tpsLimit;
bool batchLimited;

Per-process health metrics:

int64_t storageQueue;
int64_t storageDurabilityLag;
double diskUsage;
double cpuUsage;
int64_t tLogQueue

Question 1:

Is it too late to add this to api version 630? I haven’t seen an official release of 6.3 yet so maybe it’s not too late. It looks like the conclusion of Versioning of special key space is that we should not change the behavior of a read without users opting-in by changing the api version. If it is too late, maybe we could add a default-off transaction option to control this.

Question 2:

How should we encode this as key-value pairs?

Some proposals:

  1. A single key with a json blob as a value. Something like
\xff\xff/metrics/health -> {
"worstStorageQueue": 12345,
"worstStorageDurabilityLag": 5012345,
"worstTLogQueue": 12345,
"tpsLimit": 123.45,
"batchLimited": false,
$process_id: {
  "storageQueue": 12345,
  "storageDurabilityLag": 5012345,
  "diskUsage": 0.12345,
  "cpuUsage": 0.12345,
  "tLogQueue": 12345
},
...
}

where tLogQueue will be absent for storage processes and storage* will be absent for tlogProcesses

  1. A key for each process, with a json blob as a value
\xff\xff/metrics/health/aggregates -> {
"worstStorageQueue": 12345,
"worstStorageDurabilityLag": 5012345,
"worstTLogQueue": 12345,
"tpsLimit": 123.45,
"batchLimited": false
}
\xff\xff/metrics/health/process/$process_id -> {
  "storageQueue": 12345,
  "storageDurabilityLag": 5012345,
  "diskUsage": 0.12345,
  "cpuUsage": 0.12345,
  "tLogQueue": 12345
}
...
  1. A key for each field
\xff\xff/metrics/health/batchLimited -> false
\xff\xff/metrics/health/process/$process_id/cpuUsage -> 0.12345
\xff\xff/metrics/health/process/$process_id/diskUsage -> 0.12345
\xff\xff/metrics/health/process/$process_id/storageDurabilityLag -> 5012345
\xff\xff/metrics/health/process/$process_id/storageQueue -> 12345
\xff\xff/metrics/health/process/$process_id/tlogQueue -> 12345
\xff\xff/metrics/health/tpsLimit -> 123.45
\xff\xff/metrics/health/worstStorageDurabilityLag -> 5012345
\xff\xff/metrics/health/worstStorageQueue -> 12345
\xff\xff/metrics/health/worstTLogQueue -> 12345
2 Likes

The proposal 3 for question 2 is most flexible and low-overhead. One concern is about upgrading:
Suppose we release this feature in 6.3 and clients (or layers) use this feature to get a cluster’s health metrics. Clients (or layers) can use getRange to get a set of health metrics.

In the future FDB 7.0, we introduce new fields in the health metrics. Now clients (layers) that do not explicitly change the api version will see new fields. Will this break clients (layers)?

@ajbeamon @alloc Since you have experience (I meant suffering from upgrading issues :wink: ), what do you think?

(BTW, about question 1, I don’t think it is too late to introduce the feature in 6.3 since “no one” has deployed 6.3 yet.)

This is often a bit of a grey area. We disallow changes to the protocol version at this point even though no one has deployed, and in theory the API version might be viewed similarly. On the other hand, it may not be that unreasonable to be a little bit flexible with something that’s lower risk like this while we’re still in the release candidate phase. I’d probably try bringing it up with the rest of the team, and if no one there or here on the forums objects you could go forward with it.

It’s possible if you do get objections that we could consider changing the API version within a patch to something like 631. I don’t think there’s any precedent for it yet or whether there’s some reason it would be a bad idea, but it could be worth a thought.

1 Like

I think the idea is that for libfdb_c 7.0, new fields would not show up unless the client changes the api version. Clients using api version 630 should not see a difference

1 Like

I feel the option 3 “A key for each field” is better because it gives more flexibility to clients.

I like option 2, because it makes it easy to explain how the value is encoded (valid json). I’m not opposed to 3, but I’m not sure how to encode the values. For example for a double, should we just use the same spec json uses? Do popular languages all support parsing such a thing in their standard libraries? Anyone mind doing option 3 “A key for each field” but still have values be json?

Good point, I didn’t realize this. Encoding a single value as json seems not ideal. I agree option 2 seems better.