We want to monitor FDB nodes and alert when a node fails (either node failure or FDB process failure). Will a node failure be reflected in status()? If so, where can we find it?
We (Wavefront) do this through a couple different alerts & metrics that we consume from either “status json” or through the XML trace logs (using this).
If you need a monitoring/observability platform (and one that uses FDB as its telemetry store… at scale), I might know a guy.
System checks
We have the obvious instance checks (either from AWS/CloudWatch or lack of a reporting Telegraf “system.uptime”).
Node/process failures
You have two, maybe more ways of doing this.
- alert on a change in machine count
- alert on a change in process count
Could probably skip the first and just alert on a change in process counts over some time period (since a dead host is a drop in processes).
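If you'd rather poll “status json” yourself than go through a metrics pipeline, here is a minimal Python sketch of the count comparison. It assumes you have already parsed the JSON into a dict (for example from the output of fdbcli --exec 'status json') and that the machine and process maps live under cluster.machines and cluster.processes. The function names are made up for illustration.

```python
def count_members(status):
    """Return (machine_count, process_count) from a parsed status json dict."""
    cluster = status.get("cluster", {})
    return len(cluster.get("machines", {})), len(cluster.get("processes", {}))


def lost_members(previous, current):
    """Return how many machines and processes disappeared between two polls.

    A negative number means the count went up (members were added).
    """
    prev_machines, prev_processes = count_members(previous)
    curr_machines, curr_processes = count_members(current)
    return prev_machines - curr_machines, prev_processes - curr_processes
```

Alerting when the second value stays above zero for a few polls covers dead hosts too, since a dead host takes its processes with it.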
In production we alert on process restarts:
(ts(loghead.fdb.*.memory, (source='*db*' or source='*ha*'))) - lag(5m, (ts(loghead.fdb.*.memory, (source='*db*' or source='*ha*')))) < -2G
Note: “loghead” is the internal name for “wavefront-fdb-tailer”
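The idea behind that query is that a restarted process comes back with a much smaller memory footprint than it had five minutes earlier. Outside Wavefront the same check could be sketched roughly like this in Python. The sample format (a time-sorted list of (timestamp, bytes) pairs for one process) and the function name are assumptions for illustration, and 2G is taken as 2 * 10^9 bytes.

```python
from bisect import bisect_right


def memory_dropped(samples, lag_seconds=300, drop_bytes=2_000_000_000):
    """Return True if the newest sample is more than drop_bytes below the
    most recent sample taken at least lag_seconds earlier.

    samples: list of (unix_timestamp, memory_bytes), sorted by timestamp.
    """
    if not samples:
        return False
    now_ts, now_value = samples[-1]
    timestamps = [ts for ts, _ in samples]
    # Index of the last sample at or before (now - lag); -1 if there is none.
    idx = bisect_right(timestamps, now_ts - lag_seconds) - 1
    if idx < 0:
        return False
    return now_value - samples[idx][1] < -drop_bytes
```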
The Wavefront Way
On a node or process failure, the cluster will go into healing which can expose performance-related metrics. We obsess over metrics and alerting/monitoring so here’s a sampling of what we use:
Storage/Log queues
We alert when this is > 1G (and 1.5G for memory engine) - YMMV.
A Wavefront query for this would look like:
(max(ts("telegraf.exec.fdbcli-memory.cluster.qos.worst.queue.bytes.log.server", tag="production"), hosttags)) >= 1.5G or (max(ts("telegraf.exec.fdbcli-memory.cluster.qos.worst.queue.bytes.storage.server", tag="production"), hosttags)) >= 1.5G
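For anyone reading the queue sizes straight out of “status json”, a rough Python equivalent is below. It assumes the parsed status dict carries the worst queue sizes under cluster.qos as worst_queue_bytes_log_server and worst_queue_bytes_storage_server. The function name and the default threshold (1.5 * 10^9 bytes) are illustrative, so tune them to your own cluster.

```python
def queue_alerts(status, threshold_bytes=1_500_000_000):
    """Return the roles ("log", "storage") whose worst queue is at or above
    threshold_bytes in a parsed status json dict."""
    qos = status.get("cluster", {}).get("qos", {})
    worst = {
        "log": qos.get("worst_queue_bytes_log_server", 0),
        "storage": qos.get("worst_queue_bytes_storage_server", 0),
    }
    return [role for role, size in worst.items() if size >= threshold_bytes]
```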
Rapid change in disk space
Healing consumes “Operating space” at a quicker pace than normal usage. Again YMMV.
((-1 * (deriv((max(ts(telegraf.exec.fdbcli.cluster.data.least.operating.space.bytes.storage.server, (tag="production") and status=production), hosttags)))) ) > 10M)
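That query fires when the least operating space on any storage server is shrinking faster than 10M (bytes per second, assuming deriv() is a per-second rate). A hand-rolled version over two polls might look like the Python sketch below. It assumes you read cluster.data.least_operating_space_bytes_storage_server from “status json” on each poll, and the function names are hypothetical.

```python
def space_burn_rate(prev_bytes, prev_ts, curr_bytes, curr_ts):
    """Bytes of operating space consumed per second between two polls.

    Positive means free space is shrinking; negative means it is growing.
    """
    elapsed = curr_ts - prev_ts
    if elapsed <= 0:
        raise ValueError("polls must be in increasing time order")
    return (prev_bytes - curr_bytes) / elapsed


def is_burning_fast(prev_bytes, prev_ts, curr_bytes, curr_ts,
                    limit_bytes_per_sec=10_000_000):
    """True if operating space is being consumed faster than the limit."""
    return space_burn_rate(prev_bytes, prev_ts, curr_bytes, curr_ts) > limit_bytes_per_sec
```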
Moving a lot of data
Healing moves data and for some definition of “a lot of data” we alert. In production, we define “a lot of data” as 10% of the KV size (which excludes clusters where constant data rebalancing is the norm).
(aliasSource(max(ts(telegraf.exec.fdbcli*.cluster.data.moving.data.in.*.bytes, (role=db or role=ha)), cluster, role, mirror), tagk, cluster, 0)) / (aliasSource(max(ts(telegraf.exec.fdbcli*.cluster.data.total.kv.size.bytes, (role=db or role=ha)), cluster, role, mirror), tagk, cluster, 0)) * 100 > 10
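The same ratio can be computed directly from a parsed “status json” dict. The Python sketch below assumes the moving-data counters sit under cluster.data.moving_data (in_flight_bytes and in_queue_bytes) and the KV size under cluster.data.total_kv_size_bytes. It takes the larger of the two moving counters, as the max() in the query would, and the function name is made up.

```python
def moving_data_percent(status):
    """Moving data as a percentage of total KV size, or 0.0 if the KV size
    is missing or zero."""
    data = status.get("cluster", {}).get("data", {})
    moving = data.get("moving_data", {})
    total = data.get("total_kv_size_bytes", 0)
    if total <= 0:
        return 0.0
    # Use the larger of the in-flight and in-queue byte counts.
    worst = max(moving.get("in_flight_bytes", 0), moving.get("in_queue_bytes", 0))
    return 100.0 * worst / total
```

Alerting when this goes above 10 matches the threshold above.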
I missed one.
ts(telegraf.exec.fdbcli.cluster.processes.*.messages, tag=production) = 1
We also alert on any messages a process reports, which may (and often do) indicate a problem with that process.
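If you are scraping “status json” directly, the per-process messages can be pulled out with something like the Python sketch below. It assumes each entry under cluster.processes may carry a messages list whose items have a name field. The function name is hypothetical.

```python
def processes_with_messages(status):
    """Map process id -> list of message names, for processes that report
    at least one message in a parsed status json dict."""
    processes = status.get("cluster", {}).get("processes", {})
    return {
        pid: [msg.get("name", "unknown") for msg in info.get("messages", [])]
        for pid, info in processes.items()
        if info.get("messages")
    }
```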
Matthew, thanks for the information.