How to detect node failure?

We want to monitor FDB nodes and alert when a node fails (either a host failure or an FDB process failure). Is a node failure reflected in status()? If so, where can we find it?

We (Wavefront) do this through a couple of different alerts and metrics that we consume from either “status json” or the XML trace logs (using this).

If you need a monitoring/observability platform (and one that uses FDB as its telemetry store… at scale), I might know a guy.

System checks
We have the obvious instance checks (either from AWS/CloudWatch or lack of a reporting Telegraf “system.uptime”).

Node/process failures
You have two, maybe more ways of doing this.

  1. alert on a change in machine count
  2. alert on a change in process count

You could probably skip the first and just alert on a change in process count over some time period, since a dead host shows up as a drop in processes.
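For anyone not on Wavefront, here is a minimal Python sketch of the same idea. It assumes you already have the output of “status json” parsed into a dict, and that machines and processes appear as maps under cluster.machines and cluster.processes. The function names are made up for illustration.

```python
def member_counts(status):
    """Return (machine_count, process_count) from a parsed status document."""
    cluster = status.get("cluster", {})
    # Assumption: both sections are maps keyed by machine/process id.
    return len(cluster.get("machines", {})), len(cluster.get("processes", {}))


def counts_dropped(previous, current):
    """True if either the machine count or the process count went down."""
    return current[0] < previous[0] or current[1] < previous[1]
```

Poll on an interval, keep the previous pair, and alert when counts_dropped() returns True. A dead host reduces both numbers, a dead process only the second.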

In production we alert on process restarts:

(ts(loghead.fdb.*.memory, (source='*db*' or source='*ha*'))) - lag(5m, (ts(loghead.fdb.*.memory, (source='*db*' or source='*ha*')))) < -2G

Note: “loghead” is the internal name for “wavefront-fdb-tailer”.
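The logic of that query (memory now minus memory 5 minutes ago is below -2G) can be sketched in Python. This is only an illustration: the sample format, the function name, and treating “2G” as 2 * 10^9 bytes are all assumptions.

```python
def restart_suspected(samples, lag_seconds=300, drop_bytes=2_000_000_000):
    """Return timestamps where memory fell by more than drop_bytes
    compared with the latest sample at least lag_seconds older.

    samples: list of (unix_seconds, memory_bytes), sorted oldest first.
    """
    flagged = []
    for ts, value in samples:
        # Latest sample that is at least lag_seconds older than this one.
        earlier = [v for t, v in samples if t <= ts - lag_seconds]
        if earlier and value - earlier[-1] < -drop_bytes:
            flagged.append(ts)
    return flagged
```

A freshly restarted process starts with a small memory footprint, so a sharp drop is a cheap proxy for a restart.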

The Wavefront Way
On a node or process failure, the cluster will go into healing, which can show up in performance-related metrics. We obsess over metrics and alerting/monitoring, so here’s a sampling of what we use:

Storage/Log queues

We alert when the worst storage or log queue is > 1G (and 1.5G for the memory engine). YMMV.

A Wavefront query for this would look like:

(max(ts("telegraf.exec.fdbcli-memory.cluster.qos.worst.queue.bytes.log.server", tag="production"), hosttags)) >= 1.5G or (max(ts("telegraf.exec.fdbcli-memory.cluster.qos.worst.queue.bytes.storage.server", tag="production"), hosttags)) >= 1.5G
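If you read “status json” directly, a rough Python equivalent could look like the sketch below. It assumes the two values live at cluster.qos.worst_queue_bytes_log_server and cluster.qos.worst_queue_bytes_storage_server (which is what the metric names above suggest) and that “1.5G” means 1.5 * 10^9 bytes. The function name is hypothetical.

```python
# Assumed field names under cluster.qos in the parsed status document.
QUEUE_FIELDS = {
    "log": "worst_queue_bytes_log_server",
    "storage": "worst_queue_bytes_storage_server",
}


def queues_over_threshold(status, threshold_bytes=1_500_000_000):
    """Return the queue kinds ("log", "storage") at or above the threshold."""
    qos = status.get("cluster", {}).get("qos", {})
    return [
        kind
        for kind, field in QUEUE_FIELDS.items()
        if qos.get(field, 0) >= threshold_bytes
    ]
```

An empty list means both queues are below the threshold. Pick the threshold per storage engine, as described above.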

Rapid change in disk space
Healing consumes “Operating space” at a quicker pace than normal usage. Again YMMV.

((-1 * (deriv((max(ts(telegraf.exec.fdbcli.cluster.data.least.operating.space.bytes.storage.server, (tag="production") and status=production), hosttags)))) ) > 10M)
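The same check can be sketched in Python over two or more samples of the least operating space (assumed here to come from cluster.data.least_operating_space_bytes_storage_server in “status json”). The sketch also assumes the 10M threshold is in bytes per second. The names are illustrative only.

```python
def shrink_rate_bytes_per_sec(samples):
    """Average rate at which operating space shrinks, in bytes per second.

    samples: list of (unix_seconds, operating_space_bytes), oldest first.
    A positive result means space is being consumed.
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("need at least two samples with increasing timestamps")
    return (v0 - v1) / (t1 - t0)


def space_shrinking_fast(samples, limit_bytes_per_sec=10_000_000):
    """True if operating space shrinks faster than the limit."""
    return shrink_rate_bytes_per_sec(samples) > limit_bytes_per_sec
```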

Moving a lot of data
Healing moves data, and for some definition of “a lot of data” we alert. In production, we define “a lot of data” as 10% of the KV size (which excludes clusters where constant data rebalancing is the norm).

(aliasSource(max(ts(telegraf.exec.fdbcli*.cluster.data.moving.data.in.*.bytes, (role=db or role=ha)), cluster, role, mirror), tagk, cluster, 0)) / (aliasSource(max(ts(telegraf.exec.fdbcli*.cluster.data.total.kv.size.bytes, (role=db or role=ha)), cluster, role, mirror), tagk, cluster, 0)) * 100 > 10
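A small Python sketch of that ratio, for readers working from “status json” directly. It assumes the fields cluster.data.moving_data.in_flight_bytes, cluster.data.moving_data.in_queue_bytes and cluster.data.total_kv_size_bytes, and it takes the larger of the two moving-data values to mirror the max() in the query. The function name is made up.

```python
def moving_data_percent(status):
    """Moving data as a percentage of total KV size, or None if size is unknown."""
    data = status.get("cluster", {}).get("data", {})
    moving = data.get("moving_data", {})
    # Mirror the query: take the larger of the in-flight and in-queue values.
    moving_bytes = max(
        moving.get("in_flight_bytes", 0),
        moving.get("in_queue_bytes", 0),
    )
    total = data.get("total_kv_size_bytes", 0)
    if total <= 0:
        return None
    return moving_bytes / total * 100
```

Alert when the result is above 10, or whatever counts as “a lot of data” for your cluster.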

I missed one.

ts(telegraf.exec.fdbcli.cluster.processes.*.messages, tag=production) = 1

We also alert on any messages a process reports, since they may (and often do) indicate a problem with that process.
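Pulling those messages straight out of “status json” might look like this in Python. It is a sketch that assumes each entry under cluster.processes carries a “messages” list whose items have a “name” field. The function name is hypothetical.

```python
def processes_with_messages(status):
    """Map process id -> list of message names, for processes that report any."""
    processes = status.get("cluster", {}).get("processes", {})
    flagged = {}
    for process_id, info in processes.items():
        names = [m.get("name", "unknown") for m in info.get("messages", [])]
        if names:
            flagged[process_id] = names
    return flagged
```

Any non-empty result is worth a look.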

Matthew, thanks for the information.