How can I reduce FoundationDB's trace log spam?

I’m talking about the output FoundationDB writes to files named trace-<IP, port, timestamp, random chars>.json (or .xml, presumably, if you haven’t configured trace-format = json).

We initially had an issue where our clients were turning on the server-side request tracing flag, which caused a huge amount of log spam (and some sort of memory leak on the client side…).

We’ve now disabled that, but the logs from FDB are still around 40x what we had previously across all our environments. Occasionally there’s a warning or above that we’d like to keep, but most of it is metric information from events like MemoryMetrics, ProcessMetrics, MachineMetrics, FastAllocMemoryUsage, Role/Refresh, etc. We already graph the relevant metrics from other sources; we don’t want them in the log output.

I am aware you can limit the size of the log files on disk, but we’re in the cloud, so the hosts are ephemeral: each one grabs a data volume and an associated IP on startup before running FoundationDB, and hosts can also become unreachable. As such, we export all our logs to an external system (DataDog, in this case) so that we can refer to them even if the host is gone.

FoundationDB’s log output has caused a pretty massive spike in our log-ingestion costs, which I’m under pressure to reduce. But I don’t want to stop exporting these files completely, as that would severely limit our ability to troubleshoot if something goes wrong with a host.

I can’t find anything in the docs about even something as broad as setting a minimum log level for output to these files. How do I reduce the log spam here?

Can you create a top-k list of the event types you’re seeing? Metrics are logged once every 5 seconds, and a role assignment is traced once per recruitment plus once whenever the trace file rolls over.
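A quick sketch of how one might build that list from the JSON trace files (the glob path here is just an example; point it at wherever your trace-*.json files live):

```python
# Hypothetical helper to build a top-k table of trace event types.
# Assumes trace-format = json, i.e. one JSON object per line.
import glob
import json
from collections import Counter

def top_event_types(paths, k=10):
    """Count the "Type" field across trace files; return the k most common."""
    counts = Counter()
    for path in paths:
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                try:
                    event = json.loads(line)
                except json.JSONDecodeError:
                    continue  # skip a partial line from a file mid-write
                counts[event.get("Type", "<unknown>")] += 1
    return counts.most_common(k)

if __name__ == "__main__":
    # Example path; adjust for your deployment
    for event_type, count in top_event_types(glob.glob("/var/log/foundationdb/trace-*.json")):
        print(f"{event_type:<28} {count}")
```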

One of the (IMHO) funniest FDB problems is that while you can set a higher log level, doing so will break FDB. This is a bug (we talked about fixing it, but I’m not sure whether we did).

Top-10 over a 1 day period, across all our envs and FDB clusters:

| Event type                | Count |
| ------------------------- | ----- |
| Role                      | 2.92M |
| PingLatency               | 2.75M |
| TransactionMetrics        | 2.60M (is this something else I can reduce by turning off a client option?) |
| MachineMetrics            | 1.75M |
| ProcessMetrics            | 1.75M |
| NetworkMetrics            | 1.75M |
| FastAllocMemoryUsage      | 1.74M |
| MemoryMetrics             | 1.74M |
| MachineLoadDetail         | 1.74M |
| LocalConfigurationMetrics | 1.45M |

When you say Role is traced when the file rolls, plus once per recruitment, does that mean the volume of them I’m seeing could indicate a problem? We’re not seeing any issues with the FDB clusters themselves. We’re not explicitly configuring the max size of those trace log files, so it should just be the default of 10MB per file, rolling when the combined size of all files hits 100MB.

Having said that, I’ve just checked one of the storage nodes and it has 15 files at 15MB apiece, so maybe the docs are wrong about the defaults?
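For reference, my understanding is that these limits can also be pinned explicitly in foundationdb.conf rather than relying on the defaults (parameter names per the administration docs; the values shown are just the documented defaults):

```ini
[fdbserver]
# Roll to a new trace file once the current one reaches this size
logsize = 10MiB
# Delete the oldest trace files once their combined size exceeds this
maxlogssize = 100MiB
```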

New files look to be created at roughly 70-minute intervals on storage nodes, give or take, and about every 150 minutes on controllers. We’re running in three-data-hall mode, so we’ve got 9 controllers across 3 AWS AZs and each one is very lightly loaded :wink:

Possibly. I would try to track down where these are coming from. Either your trace files are tiny and fill up very quickly, or some roles are constantly dying and being resurrected.

This being said: 15MB per 70 minutes per process doesn’t sound like a lot of logging to me…

Is there any documentation beyond Understanding FoundationDB Log Files · apple/foundationdb Wiki · GitHub that explains what this actually means, plus potential causes/solutions? DataDog says that, across a cluster, we’re getting multiple Role/Refresh events per second. If I log onto a specific node and look at the logs directly, I see entries like this:

{  "Severity": "10", "Time": "1668505429.019050", "DateTime": "2022-11-15T09:43:49Z", "Type": "Role", "ID": "e9206ca654644f89", "Transition": "Refresh", "As": "Worker", "ThreadID": "12609016624796202891", "Machine": "", "LogGroup": "main-dev", "Roles": "SS" }
{  "Severity": "10", "Time": "1668505429.061061", "DateTime": "2022-11-15T09:43:49Z", "Type": "Role", "ID": "d1e4dd72dd7fc6d4", "Transition": "Refresh", "As": "StorageServer", "ThreadID": "12609016624796202891", "Machine": "", "LogGroup": "main-dev", "Roles": "SS" }
{  "Severity": "10", "Time": "1668505434.019109", "DateTime": "2022-11-15T09:43:54Z", "Type": "Role", "ID": "e9206ca654644f89", "Transition": "Refresh", "As": "Worker", "ThreadID": "12609016624796202891", "Machine": "", "LogGroup": "main-dev", "Roles": "SS" }
{  "Severity": "10", "Time": "1668505434.061108", "DateTime": "2022-11-15T09:43:54Z", "Type": "Role", "ID": "d1e4dd72dd7fc6d4", "Transition": "Refresh", "As": "StorageServer", "ThreadID": "12609016624796202891", "Machine": "", "LogGroup": "main-dev", "Roles": "SS" }
{  "Severity": "10", "Time": "1668505439.019120", "DateTime": "2022-11-15T09:43:59Z", "Type": "Role", "ID": "e9206ca654644f89", "Transition": "Refresh", "As": "Worker", "ThreadID": "12609016624796202891", "Machine": "", "LogGroup": "main-dev", "Roles": "SS" }
{  "Severity": "10", "Time": "1668505439.061167", "DateTime": "2022-11-15T09:43:59Z", "Type": "Role", "ID": "d1e4dd72dd7fc6d4", "Transition": "Refresh", "As": "StorageServer", "ThreadID": "12609016624796202891", "Machine": "", "LogGroup": "main-dev", "Roles": "SS" }

So from that, we appear to be flip-flopping in the logs between two different IDs, one As StorageServer and the other As Worker. But the Roles field stays consistent (SS) for every entry, and the ThreadID is consistent. And, according to Cluster Coordination · apple/foundationdb Wiki · GitHub, it’s normal for a server process to run a Worker role alongside any ‘assigned’ roles. So at the moment this just looks to me like excessive logging for normal operation.

Notably, these arrive every 5 seconds, so this appears to be traced at the same frequency as the metrics. And, like them, I don’t feel it’s useful to us.

I would say 15MB per 70 minutes per process would be fine if it indicated a problem, but that level of “everything is fine” information is currently costing us around $3.5k/month to ingest and process across all our FoundationDB clusters…

Ha, I forgot about Role/Refresh (even though I was the one who implemented it)… It was introduced because a lot of time-series tools have trouble keeping state around, so refreshing this information periodically makes life easier for them.

In my experience, most tools produce far more logging than this. It’s a fine line between how much information you have during a production incident and how expensive the logs are. I think what you want to do here is preprocess the logs and only send the data you actually need to DataDog. The rest you could ingest into something cheaper (like S3, or some other database with cheap storage and analytical capabilities). In our setup we filter the logs to send only a small subset to our observability tool, but we then load everything into a Snowflake database. This data has proven very valuable many times when investigating issues, doing cost analysis, or looking at SLOs.
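A minimal sketch of that kind of pre-filter (field names as in the JSON trace lines above; the threshold and the severity mapping in the comment are assumptions to adjust for your setup):

```python
import json

def split_by_severity(lines, threshold=20):
    """Partition JSON trace events: Severity >= threshold goes to the
    observability tool, the rest to cheap archival storage (e.g. S3)."""
    keep, archive = [], []
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partial/garbled lines
        # Roughly: 10 = info-level metrics, 20 = warning, 40 = error
        if int(event.get("Severity", 0)) >= threshold:
            keep.append(event)
        else:
            archive.append(event)
    return keep, archive
```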

Now, if you want to reduce the logging interval for these events, you can set the knob WORKER_LOGGING_INTERVAL to something larger than 5.0 (seconds). For example, you can add this to your foundationdb.conf:

knob_worker_logging_interval = 30.0

We usually don’t recommend blindly changing knobs (they’re not exposed as options because we don’t test them as thoroughly). So you should probably try this first in a QA deployment or staging cluster.

Thank you for your help! I’ve dug more into the DataDog agent’s filtering, and it looks like we can use that to limit what’s sent up rather than telling FDB to limit what it emits, so we’re going with that option, since the FDB config knobs are… not so heavily tested :wink:
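For anyone finding this later, here’s a sketch of the kind of filter we ended up with, using the Datadog Agent’s log_processing_rules. The file path, service/source names, and the exact event list are ours/illustrative; check the Agent’s log-collection docs for your setup:

```yaml
logs:
  - type: file
    path: /var/log/foundationdb/trace-*.json
    service: foundationdb
    source: foundationdb
    log_processing_rules:
      # Drop the periodic metric events before ingestion
      - type: exclude_at_match
        name: drop_fdb_metric_events
        pattern: '"Type": "(MemoryMetrics|ProcessMetrics|MachineMetrics|NetworkMetrics|FastAllocMemoryUsage|MachineLoadDetail|LocalConfigurationMetrics|PingLatency|TransactionMetrics)"'
      # Drop Role/Refresh churn, but keep genuine role transitions
      - type: exclude_at_match
        name: drop_fdb_role_refresh
        pattern: '"Transition": "Refresh"'
```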

To add to the discussion: there were/are bugs in FDB that log excessively. For instance, FKReenableLB was very spammy and was fixed recently in 7.1.20. In the FDB logs, you can search for TraceEventThrottle_* events, which mark events that were throttled for logging too much in a short time. In 7.1, TraceEventThrottle_BusiestWriteTag occurs a lot, meaning BusiestWriteTag is very spammy. Unfortunately, the fix for it is protocol-incompatible and couldn’t be applied to 7.1.
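A sketch of scanning for those throttle markers in JSON trace files, following the naming convention described above (the parsing assumptions match the JSON trace lines shown earlier in the thread):

```python
import json
from collections import Counter

PREFIX = "TraceEventThrottle_"

def throttled_event_types(lines):
    """Count throttle markers, keyed by the name of the suppressed event."""
    counts = Counter()
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue
        event_type = event.get("Type", "")
        if event_type.startswith(PREFIX):
            # e.g. "TraceEventThrottle_BusiestWriteTag" -> "BusiestWriteTag"
            counts[event_type[len(PREFIX):]] += 1
    return counts
```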

@markus.pilman Do we still need Role Refresh events even though every TraceEvent contains the current Roles as a detail? Did Role Refresh predate that feature?

We (as in Snowflake) don’t need it, but there are tools that do. I think we should make this behavior optional, though, since it seems most people don’t need it.

@danm Have you considered using QuickWit? This is something I’ve on my roadmap as we move towards production. I would also like to replace QuickWit’s metastore with my Rust RecordLayer-ish crate.

@rajivr Hello, I’m the CEO of Quickwit :slight_smile: Quickwit 0.5 should be released in a few weeks and will offer native support for tracing if you are still interested in it. You will be able to send OTEL traces to quickwit and connect jaeger to quickwit, without extra configuration


The feature is now released, in case you missed it: Distributed Tracing with Quickwit | Quickwit