We are now using FoundationDB as the database for our application, and we need to monitor its performance. Are there performance metrics for FoundationDB that we should monitor? And are there any tools, commands, APIs, or GUIs for checking and monitoring these metrics? Thanks!
We don’t have good documentation on this topic. Most metrics are captured in the log files written by fdbserver processes, e.g., ProcessMetrics, MachineMetrics, NetworkMetrics, ProxyMetrics, TLogMetrics, and StorageMetrics. Each of these events has many fields that expose information, and the log files also contain many histograms. Typically, you want to import these log files into a time-series database and then query them there, e.g., using Splunk.
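If you want to pull these events out yourself before loading them into a time-series store, a minimal sketch follows. It assumes the default XML trace format, where each event is written as a single `<Event ... Type="..." ... />` line, and that you only care about a few event types. The function name `parse_trace_events` is hypothetical, and the attribute values in any sample you feed it should be checked against your own trace files.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterable, Iterator, Set


def parse_trace_events(lines: Iterable[str], wanted_types: Set[str]) -> Iterator[Dict[str, str]]:
    """Yield the attributes of each single-line <Event .../> whose Type is wanted.

    Assumes the XML trace layout with one event per line (hypothetical helper).
    """
    for raw in lines:
        line = raw.strip()
        # Skip the XML prolog, the <Trace> wrapper, and lines still being written.
        if not line.startswith("<Event") or not line.endswith("/>"):
            continue
        try:
            elem = ET.fromstring(line)
        except ET.ParseError:
            # A malformed line should not stop the whole import.
            continue
        attrs = dict(elem.attrib)
        if attrs.get("Type") in wanted_types:
            yield attrs


if __name__ == "__main__":
    import sys

    # Usage: python parse_trace.py trace.*.xml
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as f:
            for event in parse_trace_events(f, {"ProcessMetrics", "StorageMetrics"}):
                print(event.get("Time"), event.get("Type"), event.get("Machine"))
```

Each yielded dict can then be turned into points keyed by `Time` and `Machine` for whichever time-series database you use.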
Another way is to use the status json command in fdbcli, which summarizes a lot of information about the database. Because generating the JSON is costly for the server, it should probably be run no more often than about once every 5 seconds.
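A polling loop along these lines is one way to do that. This is a minimal sketch, assuming `fdbcli` is on the `PATH` and prints only the JSON document for `--exec "status json"`. The field paths in `METRICS` are the ones I believe recent versions use, but treat them as assumptions and verify them against the status output of your own cluster; the helper names are hypothetical.

```python
import json
import subprocess
import time
from typing import Any, Dict, Optional

# Assumed field paths in the status JSON; verify against your FDB version.
METRICS = {
    "transactions_committed_hz": "cluster.workload.transactions.committed.hz",
    "reads_hz": "cluster.workload.operations.reads.hz",
    "commit_probe_seconds": "cluster.latency_probe.commit_seconds",
    "grv_probe_seconds": "cluster.latency_probe.transaction_start_seconds",
    "worst_storage_queue_bytes": "cluster.qos.worst_queue_bytes_storage_server",
    "worst_tlog_queue_bytes": "cluster.qos.worst_queue_bytes_log_server",
    "limited_by": "cluster.qos.performance_limited_by.name",
}


def get_path(obj: Any, path: str, default: Any = None) -> Any:
    """Walk a dotted path through nested dicts, returning default if any key is missing."""
    for key in path.split("."):
        if not isinstance(obj, dict) or key not in obj:
            return default
        obj = obj[key]
    return obj


def extract_metrics(status: Dict[str, Any]) -> Dict[str, Any]:
    """Pick the configured fields out of a parsed status document."""
    return {name: get_path(status, path) for name, path in METRICS.items()}


def fetch_status(cluster_file: Optional[str] = None, timeout: float = 10.0) -> Dict[str, Any]:
    """Run `fdbcli --exec "status json"` and parse its stdout."""
    cmd = ["fdbcli", "--exec", "status json"]
    if cluster_file:
        cmd[1:1] = ["-C", cluster_file]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True, timeout=timeout)
    return json.loads(result.stdout)


def poll(interval: float = 5.0, cluster_file: Optional[str] = None) -> None:
    """Print selected metrics roughly every `interval` seconds."""
    while True:
        started = time.monotonic()
        print(json.dumps(extract_metrics(fetch_status(cluster_file))))
        time.sleep(max(0.0, interval - (time.monotonic() - started)))
```

Missing fields come back as `None` rather than raising, so the same loop keeps working across versions whose schemas differ slightly.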
The information to monitor typically includes: CPU by role, memory, disk usage/space and dataset size, network connections/retransmits, client- and server-side latency (commit, GRV, read), throughput, availability (via probes), backup rate/status, TLog queue, storage queue, storage cache hit ratio, ratekeeper throttling reason and rate, data distribution moving rate, reachable coordinators, etc.
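As an example of deriving one of these from status json, CPU by role can be computed by summing each process's CPU usage under the roles it runs. This sketch assumes each entry under `cluster.processes` has `cpu.usage_cores` and a `roles` list of objects with a `role` field, which matches the status output I have seen but should be confirmed for your version. A process hosting several roles is counted toward each of them, so the per-role totals can add up to more than the cluster total.

```python
from collections import defaultdict
from typing import Any, Dict


def cpu_by_role(status: Dict[str, Any]) -> Dict[str, float]:
    """Sum usage_cores per role; processes without roles are grouped under "none".

    Assumes the cluster.processes.<id>.cpu.usage_cores / roles[].role layout.
    """
    totals: Dict[str, float] = defaultdict(float)
    processes = status.get("cluster", {}).get("processes", {})
    for proc in processes.values():
        cores = proc.get("cpu", {}).get("usage_cores")
        if cores is None:
            continue
        # Deduplicate roles so one process counts once per distinct role.
        roles = {r.get("role") for r in proc.get("roles", []) if r.get("role")}
        for role in roles or {"none"}:
            totals[role] += cores
    return dict(totals)
```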
Logs generally contain more detail. The status call returns information aggregated by the cluster controller, which collects data from all processes and reports it back to the client.
There’s a version of Telegraf I modified a few years ago to pull all of the status json info and turn it into metrics in a useful fashion (it understands the structure of the JSON, and of FDB in general).
I haven’t looked at it since I left that position, but they’ve got a branch labelled for FDB 6.3. Perhaps it can serve as a starting point for you.
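To illustrate the general idea of turning status json into metrics (this is not the Telegraf plugin's code, just a generic sketch), the snippet below flattens the nested document into dotted metric names with numeric values. Booleans become 1.0/0.0, and strings and nulls are skipped because they are labels rather than samples. The `fdb` prefix is an arbitrary choice.

```python
from typing import Any, Iterator, Tuple


def flatten(obj: Any, prefix: str = "fdb") -> Iterator[Tuple[str, float]]:
    """Yield (dotted_name, value) pairs for every numeric or boolean leaf."""
    # bool must be checked before int, since bool is a subclass of int.
    if isinstance(obj, bool):
        yield prefix, 1.0 if obj else 0.0
    elif isinstance(obj, (int, float)):
        yield prefix, float(obj)
    elif isinstance(obj, dict):
        for key, value in obj.items():
            yield from flatten(value, f"{prefix}.{key}")
    elif isinstance(obj, list):
        # List elements are indexed by position in the metric name.
        for i, value in enumerate(obj):
            yield from flatten(value, f"{prefix}.{i}")
    # Strings and None fall through and are skipped.
```

A structure-aware exporter goes further than this: for example, it would turn process IDs and list positions into tags (address, role) instead of embedding them in metric names, which keeps metric cardinality manageable as processes come and go.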