For the teams deploying FoundationDB at scale, what are the most important metrics that you monitor, where do you get them (from the JSON status? the logs? the operating system?), and what alerting thresholds or heuristics do you use?
We at Wavefront parse the fdbcli JSON status and the fdb trace logs, and generate metrics from them. We mostly look at storage queues, role changes, memory usage, latencies (when they near 5s), disk space, and moving data.
Telegraf provides the infra metrics.
We literally just pipe the JSON straight into Telegraf:
```toml
[[inputs.exec]]
  commands = ["/usr/bin/fdbcli --exec 'status json'"]
  data_format = "json"
  ## measurement name suffix (for separating different commands)
  name_suffix = ".fdbcli"
```
And since this runs on a computer (in AWS, no less), we have telemetry for everything on Linux, most importantly disk and memory (and fdb’s usage of each, per fdb process).
And because it’s AWS, we also grab EBS telemetry (including things like burst balance).
(There’s probably a whole new discourse category on running this database.)
We do additional translations on the JSON to turn non-friendly values into friendly ones, as well as turning machine IDs and process IDs into point tags!
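The translation step above might look something like the sketch below. This is only an illustrative flattener, not Wavefront’s actual tooling; the sample field names (`cpu.usage_cores`, `memory.used_bytes`) mirror what `status json` reports per process, but the mapping here is deliberately minimal.

```python
import json

def flatten_status(obj, prefix="fdb"):
    """Recursively flatten per-process `status json` output into
    (metric_name, value, tags) triples. The process ID becomes a point
    tag rather than part of the metric name."""
    metrics = []

    def walk(node, path, tags):
        if isinstance(node, dict):
            for key, value in node.items():
                walk(value, path + [key], tags)
        elif isinstance(node, (int, float)) and not isinstance(node, bool):
            metrics.append((".".join(path), node, dict(tags)))

    for pid, proc in obj.get("cluster", {}).get("processes", {}).items():
        walk(proc, [prefix, "process"], {"process_id": pid})
    return metrics

# A tiny excerpt shaped like real `status json` output:
status = json.loads("""{"cluster": {"processes": {
    "abc123": {"cpu": {"usage_cores": 0.42},
               "memory": {"used_bytes": 1048576}}}}}""")
for name, value, tags in flatten_status(status):
    print(name, value, tags)
```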
Another important factor is to monitor the total storage_memory available to a transaction process in relation to the files held durable on disk (for the memory storage engine). This may only be an issue in the 3.x releases, but when the files are too large to be read back into memory, process limits need to be increased. We take the entire system’s available memory divided by the number of processes N to determine the available headroom for immediate triage purposes.
This presents itself as dedicated transaction processes essentially not taking on any workload and lying dormant (in terms of CPU seconds).
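The triage rule of thumb above (total system memory divided by the number of fdb processes) can be sketched as follows. Parsing `/proc/meminfo` assumes a Linux host; the function itself just does the division described in the post.

```python
def memory_headroom_per_process(meminfo_text, num_fdb_processes):
    """Rough triage heuristic from the post above: divide the machine's
    total memory by the number of fdb processes to estimate the headroom
    available to each one, in bytes."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            total_kib = int(line.split()[1])  # /proc/meminfo reports KiB
            return (total_kib * 1024) // num_fdb_processes
    raise ValueError("MemTotal not found")

# On a real Linux host you would read the text from /proc/meminfo:
#   meminfo = open("/proc/meminfo").read()
sample = "MemTotal:       16384000 kB\nMemFree:         1234567 kB\n"
print(memory_headroom_per_process(sample, 8))  # bytes per process
```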
I wrote a couple of tools (fdbshell) to dogfood the .NET binding at the time, and to showcase what could be done with the existing API:
Inspired by all the xxtop tools out there. It uses everything contained in the JSON status details (there’s probably more stuff in 5.1.x nowadays).
You can see what is probably a bug on Windows, with the 100% HDD busy time for the second host that is clearly an error. It has been stuck there since the start.
Used to explore the content of the dataset, and to compute some metrics (counting keys, k/v size distribution, etc.)
One command samples a subset of the shards in the database to get some insight into the compositions of keys and values:
The top and bottom are shards that store documents (almost always the same key size, but variable value size). The middle contains some indexes (empty values), and so on.
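The composition profile described above boils down to bucketing sampled key and value sizes. A live version would sample real shards (the Python binding exposes `fdb.locality.get_boundary_keys` for finding shard boundaries); since that needs a running cluster, the sketch below shows only the histogram step on already-sampled pairs.

```python
import math
from collections import Counter

def size_histogram(kv_pairs):
    """Bucket key and value sizes by the next power of two, as a rough
    composition profile of sampled shards. Empty values (e.g. index
    entries) get their own bucket 0."""
    key_hist, value_hist = Counter(), Counter()

    def bucket(n):
        return 0 if n == 0 else 1 << max(0, math.ceil(math.log2(n)))

    for k, v in kv_pairs:
        key_hist[bucket(len(k))] += 1
        value_hist[bucket(len(v))] += 1
    return key_hist, value_hist

# With a live cluster, the pairs could come from something like
# (hypothetical flow, untested here):
#   import fdb; fdb.api_version(510); db = fdb.open()
#   boundaries = list(fdb.locality.get_boundary_keys(db, b'', b'\xff'))
#   ...read a limited range from a random subset of boundary ranges...
sample = [(b"doc/0001", b"x" * 300), (b"doc/0002", b"x" * 90),
          (b"idx/a/0001", b"")]
keys, values = size_histogram(sample)
print(keys, values)
```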
ASCII Charts For The Win!
Got a Linux-friendly version? Cuz that looks like something I want!
.NET is mostly fully portable to Linux these days
For transaction-level tracing, in tests or in production, I also implemented a “transaction logger” filter that wraps a transaction and, for each method called, traces and measures the arguments, results, and duration. When the transaction completes (successfully or not), it can then produce a complete log of every operation in the transaction. This log can be dumped to disk, serialized to some database, and of course displayed as nice ASCII tables in your debug output:
Transactions also have an .Annotate(string message) extension method that can insert comments into the log itself, which is very useful for debugging complex algorithms. The log above was produced by a test of the Directory Layer.
It can be a bit difficult when trying to troubleshoot concurrent transactions, because you still need to put the logs of both transactions side by side, and the timings do not align well. This could be the job of some tool that ingests the logs and merges them somehow.
The implementation itself could probably be replicated to other languages:
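As a rough illustration of what such a replication might look like, here is a generic Python sketch of the same idea: wrap an object, record every method call with its arguments and duration, and keep an `annotate` hook analogous to the `.Annotate(string)` extension method. It is demonstrated on a fake transaction object, not the real fdb binding.

```python
import time

class LoggedTransaction:
    """Sketch of the 'transaction logger' idea: wrap an object and record
    every method call (name, args, duration) so a complete per-transaction
    log can be produced when it completes. Not tied to any real binding."""
    def __init__(self, inner):
        self._inner = inner
        self.log = []

    def annotate(self, message):
        # Rough equivalent of the .Annotate(string) extension method.
        self.log.append(("annotate", message, 0.0))

    def __getattr__(self, name):
        target = getattr(self._inner, name)
        if not callable(target):
            return target
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = target(*args, **kwargs)
            self.log.append((name, args, time.perf_counter() - start))
            return result
        return wrapper

class FakeTransaction:  # stand-in for a real transaction object
    def get(self, key): return b"value-for-" + key
    def set(self, key, value): pass

tr = LoggedTransaction(FakeTransaction())
tr.annotate("reading one key")
tr.get(b"hello")
tr.set(b"foo", b"bar")
for entry in tr.log:
    print(entry[0], entry[1])
```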
There is now also a transaction option that can produce some of this data, called transaction_logging_enable. It logs some details about the transaction and its operations to the client trace log. It doesn’t have much tooling for analysis at the moment, though. In Java, for example, see https://apple.github.io/foundationdb/javadoc/com/apple/foundationdb/TransactionOptions.html#setTransactionLoggingEnable-java.lang.String-
Is the byte format of this log documented somewhere? I could maybe try to implement at least enough to be able to view them in the ASCII format shown above.
These get written to the XML trace files our clients produce, along with a variety of other events. I’m not sure we have much documentation on them, but the format is pretty straightforward. There are a bunch of “Event” elements, each with a set of attributes that contain the details of the event. Some of the attributes are always present, like Type, Severity, ID, Machine, and Time, while others are event specific. Here’s an example event:
<Event Severity="10" Time="1522251262.482962" Type="ProcessTimeOffset" Machine="0.0.0.0:0" ID="0000000000000000" ProcessTime="1522251262.486520" SystemTime="1522251262.486623" OffsetFromSystemTime="-0.000103" logGroup=""/>
We don’t have the details of the specific events that get logged documented either, though each event will include the identifier you provide to the option.
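Given that structure (a set of common attributes plus event-specific ones), a minimal reader is straightforward; the sketch below parses the example event above with the standard library’s XML module and splits the attributes into the always-present fields and the extras.

```python
import xml.etree.ElementTree as ET

COMMON = ("Type", "Severity", "ID", "Machine", "Time")

def parse_event(line):
    """Parse one <Event .../> element from an fdb XML trace file into
    (common_fields, event_specific_fields)."""
    attrs = dict(ET.fromstring(line).attrib)
    common = {k: attrs.pop(k) for k in COMMON if k in attrs}
    return common, attrs  # whatever remains is event-specific

# The example event from the post above:
event = ('<Event Severity="10" Time="1522251262.482962" '
         'Type="ProcessTimeOffset" Machine="0.0.0.0:0" '
         'ID="0000000000000000" ProcessTime="1522251262.486520" '
         'SystemTime="1522251262.486623" '
         'OffsetFromSystemTime="-0.000103" logGroup=""/>')
common, extras = parse_event(event)
print(common["Type"], extras["OffsetFromSystemTime"])
```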
Update: Found these in the .NET client repo.
Hi Christophe, Are these tools available (open-sourced) for others to use? They look very useful!
Very late to the party (missed this thread somehow). We (Snowflake) are probably one of the larger FDB users, and FDB is operationally very critical for us (here’s a blog post that talks a bit about how we use FDB: https://www.snowflake.com/how-foundationdb-powers-snowflake-metadata-forward/).
We collect many metrics. The default wavefront dashboard has close to 70 metrics that are updated in real time. We get these from the logs and from fdbcli. The most important ones are (all per process):
- Storage Queue
- Non-Durable Versions
- TLog queue
- CPU utilization
- Disk utilization
- Number of IOPS per disk (each process has its own disk)
- Moving data
- The role(s) that each process is executing
- FDB reads, writes, transactions started, and transactions committed
We also have many alerts. The most important ones are:
- we alert whenever a process logs a Sev 40. Usually this only fires when a disk fails
- High Storage queue
- High value for Non-Durable version (this might be a bit surprising, but we have learned that high NDV without high SQ usually means that there is a read hotspot that we need to fix).
- we have a canary - basically several processes that check that they can communicate with the cluster. If they can’t for some amount of time, they fire an alert.
- We fire if the CPU on any process is saturated for a long time (a few minutes IIRC).
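The storage-queue and non-durable-versions conditions from the alert list above can be sketched as a predicate over the per-process storage role stats. The field names below mirror the storage role section of `status json` (input/durable byte counters and data/durable versions); the thresholds are illustrative placeholders, not Snowflake’s actual values.

```python
# Hypothetical alert thresholds, for illustration only:
SQ_LIMIT = 750 * 1024 * 1024   # storage-queue bytes
NDV_LIMIT = 5_000_000          # non-durable versions

def storage_alerts(role):
    """Evaluate the two queue-based alert conditions for one storage role.
    storage queue ~= input bytes not yet durable; non-durable versions =
    data_version - durable_version."""
    sq = role["input_bytes"]["counter"] - role["durable_bytes"]["counter"]
    ndv = role["data_version"] - role["durable_version"]
    alerts = []
    if sq > SQ_LIMIT:
        alerts.append("high_storage_queue")
    if ndv > NDV_LIMIT and sq <= SQ_LIMIT:
        # High NDV without high SQ: likely a read hotspot (see above).
        alerts.append("possible_read_hotspot")
    return alerts

role = {"input_bytes": {"counter": 10_000},
        "durable_bytes": {"counter": 4_000},
        "data_version": 20_000_000, "durable_version": 1_000_000}
print(storage_alerts(role))
```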
Some of these metrics are propagated to the clients, so if they detect a degradation, they will back off their load. This has proved to be the most effective way of keeping FDB happy. It is also something we constantly try to improve, and it has often saved us. Basically, we never want ratekeeper to kick in; instead, we try to identify processes that are unhappy and change the load on the fly.
Additionally, we load all fdb log files into a private Snowflake account so that we can query them with SQL. We also load a lot of client-side data into this account, which allows us to analyze how the system is used. This is crucial for hotspot detection. Having all this data in a central database has proved very useful for root cause detection and for analyzing overall trends over time (we try to be as data-driven as possible). In addition to the normal logs, we also load a mapping from partition to process, which allows us to correlate workload with problems observed on a single process.
We (Wavefront) will definitely be submitting an abstract for how we manage and monitor 70+ FDB clusters in the December summit. Hoping to open-source all the tools we have as well!
Also, we actually use CUBIC (http://www4.ncsu.edu/~rhee/export/bitcp/cubic-paper.pdf) when writing to FDB. We monitor the time it takes for us to get a read version (good is <1s) while checking the fdb status JSON (\xff\xff/status/json) every 5 seconds, and we only consider the cluster “uncongested” when the storage queue and log queue both have 250MB of headroom (worst storage queue of 750MB or less, and worst log queue of 1.75GB or less).
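The headroom check described above is easy to express as a predicate. The `cluster.qos` field names (`worst_queue_bytes_storage_server` / `worst_queue_bytes_log_server`) are taken from `status json`; in production the status dict would be re-fetched every 5 seconds as described, which this sketch omits.

```python
MB = 1024 * 1024

def queues_from_status(status):
    """Extract worst queue sizes from the cluster.qos section of
    `status json`."""
    qos = status["cluster"]["qos"]
    return (qos["worst_queue_bytes_storage_server"],
            qos["worst_queue_bytes_log_server"])

def uncongested(worst_storage_queue, worst_log_queue):
    # 250MB of headroom on both queues, per the thresholds quoted above:
    # worst storage queue <= 750MB and worst log queue <= 1.75GB.
    return worst_storage_queue <= 750 * MB and worst_log_queue <= 1792 * MB

status = {"cluster": {"qos": {
    "worst_queue_bytes_storage_server": 400 * MB,
    "worst_queue_bytes_log_server": 900 * MB}}}
print(uncongested(*queues_from_status(status)))  # True for this sample
```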
I have come across “storage queue” and “log queue” in multiple threads. Could someone explain what these are precisely and how to monitor these? If there is an overview doc that explains these terms in context, it will be very helpful !