What do you monitor?

For the teams deploying FoundationDB at scale, what are the most important metrics that you monitor, where do you get them (from the JSON status? the logs? the operating system?), and what alerting thresholds or heuristics do you use?

3 Likes

We at Wavefront parse the fdbcli JSON status and the fdb trace logs and generate metrics from them. We look mostly at storage queues, role changes, memory usage, lag (when it's near 5s), operating space, and moving data.
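
A minimal sketch (not Wavefront's actual pipeline) of pulling a few of those cluster-level numbers out of status json; the field names are assumed from the 5.x-era status schema and are worth double-checking against your version:

import json
import subprocess

# A minimal sketch: pull a few cluster-level numbers out of
# `fdbcli --exec 'status json'`. Field names are assumed from the
# 5.x-era status schema, so verify them against your own cluster.
status = json.loads(subprocess.check_output(["fdbcli", "--exec", "status json"]))
cluster = status["cluster"]

qos = cluster.get("qos", {})
print("worst storage queue bytes:", qos.get("worst_queue_bytes_storage_server"))
print("worst log queue bytes:", qos.get("worst_queue_bytes_log_server"))

data = cluster.get("data", {})
print("moving data in flight:", data.get("moving_data", {}).get("in_flight_bytes"))
print("least operating space (storage):", data.get("least_operating_space_bytes_storage_server"))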

Telegraf provides the infra metrics.

1 Like

We literally just pipe the json straight into telegraf:

[[inputs.exec]]
commands = ["/usr/bin/fdbcli --exec 'status json'"]
data_format = "json"

## measurement name suffix (for separating different commands)
name_suffix = ".fdbcli"
1 Like

And since this runs on a computer (in AWS no less), we have telemetry for everything on Linux, but most importantly disk and memory (and fdb's usage of them, per fdb process).

And because AWS, we also grab EBS telemetry (including things like burst balance).

(There’s probably a whole new discourse category on running this database.)

1 Like

We do additional translations on the JSON to turn non-friendly values into friendly values, as well as turning machine IDs and process IDs into point tags!

1 Like

Another important factor is to monitor the total storage_memory available to a transaction process in relation to the files held durable on disk (for the memory storage engine). This may only be an issue in the 3.x releases, but when the files are too large to be read back into memory, the process limits need to be increased. We take the entire system's available memory divided by the number of processes N to determine the available headroom for immediate triage purposes.
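
As a rough illustration of that triage arithmetic (the numbers below are made up for the example, not from a real cluster):

# Rough triage arithmetic (made-up example numbers):
total_system_memory = 64 * 2**30    # a 64 GiB machine
num_fdb_processes = 8               # N fdbserver processes on that machine
headroom_per_process = total_system_memory / num_fdb_processes   # 8 GiB each

storage_memory = 1 * 2**30          # the configured storage_memory limit
durable_file_bytes = 3 * 2**30      # size of the memory engine's files on disk

# If the durable files have outgrown the configured limit, the process limits
# need to be raised before the data can be read back into memory on restart.
needs_bigger_limit = durable_file_bytes > storage_memory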

1 Like

This presents itself as dedicated transaction processes essentially not taking on any workload and lying dormant (in terms of CPU seconds).

I wrote a couple of tools (fdbtop, fdbshell) to dogfood the .NET binding at the time, and to showcase what could be done with the existing API:

fdbtop

Inspired by all the *top tools out there. It uses everything contained in the JSON status details (there's probably more stuff in 5.1.x nowadays).

Screenshots show the KV metrics, transaction metrics, latency probes, and cluster topology.

You can see what is probably a bug on Windows: the 100% HDD busy time for the second host is clearly an error. It has been stuck there since the start.

fdbshell

Used to explore the content of the dataset and compute some metrics (counting keys, key/value size distribution, etc.).

One command samples a subset of the shards in the database to get some insight into the composition of keys and values:

The top and bottom are shards that store documents (almost always the same key size, but variable value size). The middle contains some indexes (empty values), and so on.

ASCII Charts For The Win!

3 Likes

Got a Linux-friendly version? Cuz that looks like something I want!

.NET is mostly fully portable to Linux these days 🙂

For transaction-level tracing, for tests or in production, I also implemented a "transaction logger" filter that wraps a transaction and, for each method called, traces and measures the arguments, results, and duration. When the transaction completes (successfully or not), it can then produce a complete log of every operation in the transaction. This log can be dumped to disk, serialized to some database, and of course displayed as nice ASCII tables in your debug output:

Transactions also have an .Annotate(string message) extension method that can insert comments into the log itself, which is very useful for debugging complex algorithms. The log above is one produced by a test of the Directory Layer.

It can be a bit difficult when trying to troubleshoot concurrent transactions, because you still need to put the logs of both transactions side by side, and the timings do not align well. This could be the job of some tool that would ingest the logs and merge them somehow.

The implementation itself could probably be replicated to other languages:
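
For example, here is a very stripped-down sketch of the same wrapping idea in Python (this is not the .NET implementation, just an illustration of the technique): proxy every call on the real transaction through a timer and keep the log on the side.

import time

class LoggedTransaction:
    # Sketch only: proxy every call to the underlying transaction object and
    # record the method name, arguments and call duration in a side log.
    def __init__(self, tr):
        self._tr = tr
        self.log = []

    def annotate(self, message):
        # Rough equivalent of the .Annotate() extension method.
        self.log.append(("//", (message,), 0.0))

    def __getattr__(self, name):
        attr = getattr(self._tr, name)
        if not callable(attr):
            return attr

        def timed(*args, **kwargs):
            start = time.perf_counter()
            try:
                return attr(*args, **kwargs)
            finally:
                self.log.append((name, args, time.perf_counter() - start))

        return timed

    def dump(self):
        for name, args, elapsed in self.log:
            print(f"{elapsed * 1000:8.3f} ms  {name}{args}")

A fuller version would also wrap the futures returned by reads, so it records when each result actually arrives rather than just how long the call itself took; that is where most of the interesting latency lives.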

2 Likes

There is now also a transaction option, transaction_logging_enable, that can produce some of this data. It logs some details about the transaction and its operations to the client trace log. It doesn't have much tooling for analysis at the moment, though. In Java, for example, see TransactionOptions (FoundationDB Java Client API).
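
The same option is exposed through the other bindings as well. A sketch using the Python binding (assuming a client recent enough to ship the option; the key, identifier, and trace directory below are just placeholders):

import fdb

fdb.api_version(510)                                  # match your installed client
fdb.options.set_trace_enable("/var/log/fdbclient")    # where the client trace files go
db = fdb.open()

@fdb.transactional
def set_with_logging(tr, key, value):
    # Attaches an identifier to this transaction; its operations are then
    # logged as events in the client trace file.
    tr.options.set_transaction_logging_enable("my-debug-txn")
    tr.set(key, value)

set_with_logging(db, b"example-key", b"example-value")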

1 Like

Is the format of this log documented somewhere? I could maybe try to implement at least a viewer, to be able to display them in the ASCII format shown above.

These get written to the XML trace files our clients produce, along with a variety of other events. I’m not sure we have much documentation on them, but the format is pretty straightforward. There are a bunch of β€œEvent” elements, each with a set of attributes that contain the details of the event. Some of the attributes are always present, like Type, Severity, ID, Machine, and Time, while others are event specific. Here’s an example event:

<Event Severity="10" Time="1522251262.482962" Type="ProcessTimeOffset" Machine="0.0.0.0:0" ID="0000000000000000" ProcessTime="1522251262.486520" SystemTime="1522251262.486623" OffsetFromSystemTime="-0.000103" logGroup=""/>
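
Since each event is a self-contained element on its own line, a few lines of Python are enough to pull them out for ad-hoc analysis (a sketch, not an official parser; the file name is a placeholder):

import xml.etree.ElementTree as ET

def events(path):
    # Yield the attribute dict of every <Event .../> line in a trace file.
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.startswith("<Event "):
                yield ET.fromstring(line).attrib

# Example: print the type and time of every severity-40 event.
for ev in events("trace.xml"):
    if ev.get("Severity") == "40":
        print(ev["Time"], ev["Type"])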

1 Like

We don’t have the details of the specific events that get logged documented either, though each event will include the identifier you provide to the option.

Update: found these in the .NET client repo.

Hi Christophe, are these tools available (open-sourced) for others to use? They look very useful!

Very late to the party (missed this thread somehow). We (Snowflake) are probably one of the larger FDB users, and FDB is operationally very critical for us (here's a blog post that talks a bit about how we use FDB: https://www.snowflake.com/how-foundationdb-powers-snowflake-metadata-forward/).

We collect many metrics. The default Wavefront dashboard has close to 70 metrics that are updated in real time. We get these from the logs and from fdbcli. The most important ones are (all per process; see the sketch after this list for pulling some of them out of status json):

  • Storage Queue
  • Non-Durable Versions
  • TLog queue
  • CPU utilization
  • Disk utilization
  • Number of IOPS per disk (each process has its own disk)
  • Moving data
  • The role(s) that each process is executing
  • Ratekeeper
  • FDB reads, writes, transactions started, and transactions committed
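
To make the "per process" part concrete, here is a sketch (not our actual collector) of deriving a few of these from status json; the field names are assumed from the 5.x/6.x schema, and the storage queue is simply the gap between what a storage role has ingested and what it has made durable:

import json
import subprocess

status = json.loads(subprocess.check_output(["fdbcli", "--exec", "status json"]))

for pid, proc in status["cluster"]["processes"].items():
    roles = proc.get("roles", [])
    role_names = ",".join(r["role"] for r in roles)
    cpu_cores = proc.get("cpu", {}).get("usage_cores")
    for r in roles:
        if r["role"] == "storage":
            # storage queue = bytes received but not yet made durable on disk
            queue = r["input_bytes"]["counter"] - r["durable_bytes"]["counter"]
            print(f"{pid} [{role_names}] cpu={cpu_cores} storage_queue={queue}")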

We also have many alerts. The most important ones are:

  • We alert whenever a process logs a Sev 40. Usually this only fires when a disk fails.
  • High storage queue.
  • High value for non-durable versions (this might be a bit surprising, but we have learned that a high NDV without a high SQ usually means there is a read hotspot that we need to fix).
  • We have a canary: several processes that check that they can communicate with the cluster. If they can't for some amount of time, they fire an alert.
  • We fire if the CPU on any process is saturated for a long time (a few minutes IIRC).

Some of these metrics are propagated to the clients, so if they detect a degradation, they back off their load. This has proved to be the most effective way of keeping FDB happy. It is also something we constantly try to improve, and it has often saved us. Basically, we never want ratekeeper to have to throttle; instead we try to identify processes that are unhappy and change the load on the fly.

Additionally, we load all fdb log files into a private Snowflake account so that we can query them with SQL. We also load a lot of client-side data into this account. This allows us to analyze how the system is used, which is crucial for hotspot detection. Having all this data in a central database has proved to be very useful for root-cause detection and for analyzing overall trends over time (we try to be as data-driven as possible). In addition to the normal logs, we also load a mapping from partition to process, which allows us to correlate workload with problems observed on individual processes.

8 Likes

We (Wavefront) will definitely be submitting an abstract on how we manage and monitor 70+ FDB clusters at the December summit. Hoping to open-source all the tools we have as well!

8 Likes

Also, we actually use CUBIC congestion control (http://www4.ncsu.edu/~rhee/export/bitcp/cubic-paper.pdf) when writing to FDB. We monitor the time it takes for us to get a read version (good is <1s) while checking the fdb status JSON (\xff\xff/status/json) every 5 seconds, and we only consider the cluster "uncongested" when the storage queue and log queue both have 250MB of headroom (worst storage queue of 750MB or less, and worst log queue of 1.75GB or less).
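
A sketch of that check with the Python binding (the thresholds are the ones from this post; the qos field names and binding calls are assumptions, so adapt them to your client version):

import json
import time
import fdb

fdb.api_version(510)                  # match your installed client
db = fdb.open()

STORAGE_QUEUE_LIMIT = 750 * 10**6     # 750 MB worst storage queue or less
LOG_QUEUE_LIMIT = 1750 * 10**6        # 1.75 GB worst log queue or less

def cluster_uncongested(db):
    tr = db.create_transaction()

    # Time how long it takes to get a read version; "good" is well under 1s.
    start = time.perf_counter()
    tr.get_read_version().wait()
    grv_seconds = time.perf_counter() - start

    # The special key returns the same JSON document as `fdbcli status json`.
    status = json.loads(tr[b"\xff\xff/status/json"].wait())
    qos = status["cluster"].get("qos", {})
    storage_ok = qos.get("worst_queue_bytes_storage_server", 0) <= STORAGE_QUEUE_LIMIT
    log_ok = qos.get("worst_queue_bytes_log_server", 0) <= LOG_QUEUE_LIMIT
    return grv_seconds < 1.0 and storage_ok and log_ok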

1 Like

I have come across "storage queue" and "log queue" in multiple threads. Could someone explain precisely what these are and how to monitor them? If there is an overview doc that explains these terms in context, it would be very helpful!