Hello! We’d like to expose the functionality introduced in PR https://github.com/apple/foundationdb/pull/1114 via the special key space. A quick summary of this functionality:
Status requests are expensive for the cluster controller, so certain cluster health metrics can be sent from the ratekeeper to proxies and accessed by clients, avoiding the cluster controller altogether.
Aggregate health metrics:
int64_t worstStorageQueue;
int64_t worstStorageDurabilityLag;
int64_t worstTLogQueue;
double tpsLimit;
bool batchLimited;
Per-process health metrics:
int64_t storageQueue;
int64_t storageDurabilityLag;
double diskUsage;
double cpuUsage;
int64_t tLogQueue;
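For readers thinking about the client side, here is a minimal sketch of how the two groups of fields above might be modelled in Python. The class names `AggregateHealthMetrics` and `ProcessHealthMetrics` are hypothetical and not part of any FoundationDB binding. Making every per-process field optional is an assumption of this sketch, on the basis that a given process may report only some of them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AggregateHealthMetrics:
    # Field names mirror the aggregate list above; int64_t -> int, double -> float.
    worstStorageQueue: int
    worstStorageDurabilityLag: int
    worstTLogQueue: int
    tpsLimit: float
    batchLimited: bool


@dataclass
class ProcessHealthMetrics:
    # Assumption: any field may be missing for a given process, so all default to None.
    storageQueue: Optional[int] = None
    storageDurabilityLag: Optional[int] = None
    diskUsage: Optional[float] = None
    cpuUsage: Optional[float] = None
    tLogQueue: Optional[int] = None
```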
Question 1:
Is it too late to add this to API version 630? I haven’t seen an official release of 6.3 yet, so maybe there is still time. The conclusion of the “Versioning of special key space” thread seems to be that we should not change the behavior of a read unless users opt in by changing the API version. If it is too late, maybe we could add a default-off transaction option to control this.
Question 2:
How should we encode this as key-value pairs?
Some proposals:
- A single key with a JSON blob as its value, something like:
\xff\xff/metrics/health -> {
"worstStorageQueue": 12345,
"worstStorageDurabilityLag": 5012345,
"worstTLogQueue": 12345,
"tpsLimit": 123.45,
"batchLimited": false,
"$process_id": {
"storageQueue": 12345,
"storageDurabilityLag": 5012345,
"diskUsage": 0.12345,
"cpuUsage": 0.12345,
"tLogQueue": 12345
},
...
}
where tLogQueue will be absent for storage processes, and the storage* fields will be absent for tlog processes.
- A key for each process (plus one key for the aggregates), with a JSON blob as the value:
\xff\xff/metrics/health/aggregates -> {
"worstStorageQueue": 12345,
"worstStorageDurabilityLag": 5012345,
"worstTLogQueue": 12345,
"tpsLimit": 123.45,
"batchLimited": false
}
\xff\xff/metrics/health/process/$process_id -> {
"storageQueue": 12345,
"storageDurabilityLag": 5012345,
"diskUsage": 0.12345,
"cpuUsage": 0.12345,
"tLogQueue": 12345
}
...
- A key for each field
\xff\xff/metrics/health/batchLimited -> false
\xff\xff/metrics/health/process/$process_id/cpuUsage -> 0.12345
\xff\xff/metrics/health/process/$process_id/diskUsage -> 0.12345
\xff\xff/metrics/health/process/$process_id/storageDurabilityLag -> 5012345
\xff\xff/metrics/health/process/$process_id/storageQueue -> 12345
\xff\xff/metrics/health/process/$process_id/tLogQueue -> 12345
\xff\xff/metrics/health/tpsLimit -> 123.45
\xff\xff/metrics/health/worstStorageDurabilityLag -> 5012345
\xff\xff/metrics/health/worstStorageQueue -> 12345
\xff\xff/metrics/health/worstTLogQueue -> 12345
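
To help compare the three proposals, here is a rough sketch of what client-side decoding might look like for each. It does not talk to a real cluster: the key-value pairs are built by hand, and the process IDs `abc123` and `def456` are made up. It also assumes that per-field values in the third layout are encoded as JSON scalars (`false`, `123.45`, `12345`) and that field names contain no `/`. The function names are hypothetical.

```python
import json

PREFIX = b"\xff\xff/metrics/health/"


def decode_single_blob(value):
    """Proposal 1: one key whose value is a single JSON blob."""
    doc = json.loads(value)
    # Assumption: a top-level value that is a JSON object is a per-process entry;
    # everything else is an aggregate field.
    processes = {k: v for k, v in doc.items() if isinstance(v, dict)}
    aggregates = {k: v for k, v in doc.items() if not isinstance(v, dict)}
    return {"aggregates": aggregates, "processes": processes}


def decode_per_process(pairs):
    """Proposal 2: an 'aggregates' key plus one key per process."""
    result = {"aggregates": {}, "processes": {}}
    for key, value in pairs:
        suffix = key[len(PREFIX):].decode()
        if suffix == "aggregates":
            result["aggregates"] = json.loads(value)
        elif suffix.startswith("process/"):
            result["processes"][suffix[len("process/"):]] = json.loads(value)
    return result


def decode_per_field(pairs):
    """Proposal 3: one key per field."""
    result = {"aggregates": {}, "processes": {}}
    for key, value in pairs:
        suffix = key[len(PREFIX):].decode()
        if suffix.startswith("process/"):
            # Layout: process/$process_id/$field
            process_id, field = suffix[len("process/"):].rsplit("/", 1)
            result["processes"].setdefault(process_id, {})[field] = json.loads(value)
        else:
            result["aggregates"][suffix] = json.loads(value)
    return result


# Hand-built sample data using the example values from the proposals.
AGG = {"worstStorageQueue": 12345, "worstStorageDurabilityLag": 5012345,
       "worstTLogQueue": 12345, "tpsLimit": 123.45, "batchLimited": False}
PROCS = {
    "abc123": {"storageQueue": 12345, "storageDurabilityLag": 5012345,
               "diskUsage": 0.12345, "cpuUsage": 0.12345},
    "def456": {"diskUsage": 0.12345, "cpuUsage": 0.12345, "tLogQueue": 12345},
}
EXPECTED = {"aggregates": AGG, "processes": PROCS}

blob = json.dumps({**AGG, **PROCS}).encode()

per_process = [(PREFIX + b"aggregates", json.dumps(AGG).encode())] + [
    (PREFIX + b"process/" + pid.encode(), json.dumps(fields).encode())
    for pid, fields in PROCS.items()
]

per_field = [(PREFIX + name.encode(), json.dumps(v).encode()) for name, v in AGG.items()] + [
    (PREFIX + b"process/" + pid.encode() + b"/" + name.encode(), json.dumps(v).encode())
    for pid, fields in PROCS.items()
    for name, v in fields.items()
]
```

All three decoders recover the same structure from this sample data. One difference the sketch makes visible: in the single-blob layout, process IDs and aggregate field names share one namespace, so the decoder has to tell them apart by value type, whereas the other two layouts separate them by key path.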