How can I reduce FoundationDB's trace log spam?

I’m talking about the output FoundationDB writes to files named trace-<IP, port, timestamp, random chars>.json (or .xml, presumably, if you haven’t configured trace-format = json).

We initially had an issue where our clients were turning on the server-side request tracing flag, which caused a huge amount of log spam (and some sort of memory leak on the client side…).

We’ve now disabled that, but the logs from FDB are still around 40x what we had previously across all our environments. Occasionally there’s a warning or above that we’d like to keep, but most of it is metric information from events like MemoryMetrics, ProcessMetrics, MachineMetrics, FastAllocMemoryUsage, Role/Refresh, etc. We already graph the relevant metrics from other sources; we don’t want them in the log output.

I am aware you can limit the size of the log files on disk, but we’re in the cloud, so the hosts are ephemeral: each one grabs a data volume and an associated IP on startup before running FoundationDB, and hosts can also become unreachable. As such, we export all our logs to an external system (DataDog, in this case) so that we can refer to them even if the host is gone.

FoundationDB’s log output has caused a pretty massive spike in our log-ingestion costs, which I’m under pressure to reduce. But I don’t want to stop exporting these files completely, as that would severely limit our ability to troubleshoot if something goes wrong with a host.

I can’t find anything in the docs about even something as broad as setting a minimum log level for output to these files. How do I reduce the log spam here?

Can you create a top-k list of the event types you’re seeing? Metrics are logged once every 5 seconds, and a role assignment is traced once per recruitment plus once whenever the trace file rolls over.
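A quick sketch of how one might build that list from the JSON trace files (the glob path here is just an example; point it at wherever your trace-*.json files live):

```python
# Hypothetical helper to build a top-k table of trace event types.
# Assumes trace-format = json, i.e. one JSON object per line.
import glob
import json
from collections import Counter

def top_event_types(paths, k=10):
    """Count the "Type" field across trace files; return the k most common."""
    counts = Counter()
    for path in paths:
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                try:
                    event = json.loads(line)
                except json.JSONDecodeError:
                    continue  # skip a partial line from a file mid-write
                counts[event.get("Type", "<unknown>")] += 1
    return counts.most_common(k)

if __name__ == "__main__":
    # Example path; adjust for your deployment
    for event_type, count in top_event_types(glob.glob("/var/log/foundationdb/trace-*.json")):
        print(f"{event_type:<28} {count}")
```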

One of the (IMHO) funniest FDB problems is that while you can set a higher log level, doing so will break FDB. This is a bug (we talked about fixing it, but I’m not sure whether we did).

Top-10 over a 1 day period, across all our envs and FDB clusters:

| Event type                | Count |
| ------------------------- | ----- |
| Role                      | 2.92M |
| PingLatency               | 2.75M |
| TransactionMetrics        | 2.60M (is this something else I can reduce by turning off a client option?) |
| MachineMetrics            | 1.75M |
| ProcessMetrics            | 1.75M |
| NetworkMetrics            | 1.75M |
| FastAllocMemoryUsage      | 1.74M |
| MemoryMetrics             | 1.74M |
| MachineLoadDetail         | 1.74M |
| LocalConfigurationMetrics | 1.45M |

When you say Role is traced when the file rolls, plus once per recruitment, does that mean the volume of them I’m seeing could indicate a problem? We’re not seeing any issues with the FDB clusters themselves. We’re not explicitly configuring the max size of those trace log files, so it should just be the default of 10MB per file, rolling when the combined size of all files hits 100MB.

Having said that, I’ve just checked one of the storage nodes and it has 15 files at 15MB apiece, so maybe the docs are wrong about the defaults?
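For reference, my understanding is that these limits can also be pinned explicitly in foundationdb.conf rather than relying on the defaults (parameter names per the administration docs; the values shown are just the documented defaults):

```ini
[fdbserver]
# Roll to a new trace file once the current one reaches this size
logsize = 10MiB
# Delete the oldest trace files once their combined size exceeds this
maxlogssize = 100MiB
```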

New files look to be created at roughly 70-minute intervals on storage nodes, give or take, and about every 150 minutes on controllers. We’re running in three-data-hall mode, so we’ve got 9 controllers across 3 AWS AZs and each one is very lightly loaded :wink:

Possibly. I would try to track down where these are coming from. Either your trace files are tiny and fill up very quickly, or some roles are constantly dying and being resurrected.

This being said: 15MB per 70 minutes per process doesn’t sound like a lot of logging to me…

Is there any documentation beyond Understanding FoundationDB Log Files · apple/foundationdb Wiki · GitHub that explains what this actually means, plus potential causes/solutions? DataDog says that, across a cluster, we’re getting multiple Role/Refresh events per second. If I log onto a specific node and look at the logs directly, I see entries like this:

{  "Severity": "10", "Time": "1668505429.019050", "DateTime": "2022-11-15T09:43:49Z", "Type": "Role", "ID": "e9206ca654644f89", "Transition": "Refresh", "As": "Worker", "ThreadID": "12609016624796202891", "Machine": "", "LogGroup": "main-dev", "Roles": "SS" }
{  "Severity": "10", "Time": "1668505429.061061", "DateTime": "2022-11-15T09:43:49Z", "Type": "Role", "ID": "d1e4dd72dd7fc6d4", "Transition": "Refresh", "As": "StorageServer", "ThreadID": "12609016624796202891", "Machine": "", "LogGroup": "main-dev", "Roles": "SS" }
{  "Severity": "10", "Time": "1668505434.019109", "DateTime": "2022-11-15T09:43:54Z", "Type": "Role", "ID": "e9206ca654644f89", "Transition": "Refresh", "As": "Worker", "ThreadID": "12609016624796202891", "Machine": "", "LogGroup": "main-dev", "Roles": "SS" }
{  "Severity": "10", "Time": "1668505434.061108", "DateTime": "2022-11-15T09:43:54Z", "Type": "Role", "ID": "d1e4dd72dd7fc6d4", "Transition": "Refresh", "As": "StorageServer", "ThreadID": "12609016624796202891", "Machine": "", "LogGroup": "main-dev", "Roles": "SS" }
{  "Severity": "10", "Time": "1668505439.019120", "DateTime": "2022-11-15T09:43:59Z", "Type": "Role", "ID": "e9206ca654644f89", "Transition": "Refresh", "As": "Worker", "ThreadID": "12609016624796202891", "Machine": "", "LogGroup": "main-dev", "Roles": "SS" }
{  "Severity": "10", "Time": "1668505439.061167", "DateTime": "2022-11-15T09:43:59Z", "Type": "Role", "ID": "d1e4dd72dd7fc6d4", "Transition": "Refresh", "As": "StorageServer", "ThreadID": "12609016624796202891", "Machine": "", "LogGroup": "main-dev", "Roles": "SS" }

So from that, we appear to be flip-flopping in the logs between two different IDs, one As StorageServer and the other As Worker. But the Roles field stays consistent (SS) for every entry, and the ThreadID is consistent. And, according to Cluster Coordination · apple/foundationdb Wiki · GitHub, it’s normal for a server process to run a Worker role alongside any ‘assigned’ roles. So at the moment this just looks to me like excessive logging for normal operation.

Notably, these arrive every 5 seconds, so this appears to be traced at the same frequency as the metrics. And, like them, I don’t feel it’s useful to us.

I would say 15MB per 70 minutes per process would be fine if it indicated a problem, but that level of “everything is fine” information is currently costing us around $3.5k/month to ingest and process across all our FoundationDB clusters…

Ha, I forgot about Role/Refresh (even though I was the one who implemented it)… It was introduced because a lot of time-series tools have trouble keeping state around, so refreshing this information periodically makes life easier for them.

In my experience, most tools produce far more logging than this. It’s a fine line between how much information you have during a production incident and how expensive the logs are. I think what you want to do here is preprocess the logs and only send the data you actually need to DataDog. The rest you could ingest into something cheaper (like S3, or some other database with cheap storage and analytical capabilities). In our setup we filter the logs to send only a small subset to our observability tool, but we then load everything into a Snowflake database. This data has proven very valuable many times when investigating issues, doing cost analysis, or looking at SLOs.
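A minimal sketch of that kind of pre-filter (field names as in the JSON trace lines above; the threshold and the severity mapping in the comment are assumptions to adjust for your setup):

```python
import json

def split_by_severity(lines, threshold=20):
    """Partition JSON trace events: Severity >= threshold goes to the
    observability tool, the rest to cheap archival storage (e.g. S3)."""
    keep, archive = [], []
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partial/garbled lines
        # Roughly: 10 = info-level metrics, 20 = warning, 40 = error
        if int(event.get("Severity", 0)) >= threshold:
            keep.append(event)
        else:
            archive.append(event)
    return keep, archive
```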

Now, if you want to reduce the logging interval for these events, you can set the knob WORKER_LOGGING_INTERVAL to something larger than 5.0 (seconds). For example, you can add this to your foundationdb.conf:

knob_worker_logging_interval = 30.0

We usually don’t recommend blindly changing knobs (they’re not exposed as options because we don’t test them as thoroughly). So you should probably try this first in a QA deployment or staging cluster.

Thank you for your help! I’ve dug more into the DataDog agent’s filtering, and it looks like we can use that to limit what’s sent up rather than telling FDB to limit what it emits, so we’re going with that option, since the FDB config knobs are… not so heavily tested :wink:
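For anyone finding this later, here’s a sketch of the kind of filter we ended up with, using the Datadog Agent’s log_processing_rules. The file path, service/source names, and the exact event list are ours/illustrative; check the Agent’s log-collection docs for your setup:

```yaml
logs:
  - type: file
    path: /var/log/foundationdb/trace-*.json
    service: foundationdb
    source: foundationdb
    log_processing_rules:
      # Drop the periodic metric events before ingestion
      - type: exclude_at_match
        name: drop_fdb_metric_events
        pattern: '"Type": "(MemoryMetrics|ProcessMetrics|MachineMetrics|NetworkMetrics|FastAllocMemoryUsage|MachineLoadDetail|LocalConfigurationMetrics|PingLatency|TransactionMetrics)"'
      # Drop Role/Refresh churn, but keep genuine role transitions
      - type: exclude_at_match
        name: drop_fdb_role_refresh
        pattern: '"Transition": "Refresh"'
```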

To add to the discussion: there were/are bugs in FDB that log excessively. For instance, FKReenableLB was very spammy and was fixed recently in 7.1.20. In the FDB logs, you can search for TraceEventThrottle_* events, which mark events that were throttled for logging too much in a short time. In 7.1, TraceEventThrottle_BusiestWriteTag occurs a lot, meaning BusiestWriteTag is very spammy. Unfortunately, the fix for it is protocol-incompatible and couldn’t be applied to 7.1.
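A sketch of scanning for those throttle markers in JSON trace files, following the naming convention described above (the parsing assumptions match the JSON trace lines shown earlier in the thread):

```python
import json
from collections import Counter

PREFIX = "TraceEventThrottle_"

def throttled_event_types(lines):
    """Count throttle markers, keyed by the name of the suppressed event."""
    counts = Counter()
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue
        event_type = event.get("Type", "")
        if event_type.startswith(PREFIX):
            # e.g. "TraceEventThrottle_BusiestWriteTag" -> "BusiestWriteTag"
            counts[event_type[len(PREFIX):]] += 1
    return counts
```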

@markus.pilman Do we still need Role Refresh events even though every TraceEvent contains the current Roles as a detail? Did Role Refresh predate that feature?

We (as in Snowflake) don’t need it, but there are tools that do. I think we should make this behavior optional, though, since it seems most people don’t need it.

@danm Have you considered using QuickWit? This is something I’ve on my roadmap as we move towards production. I would also like to replace QuickWit’s metastore with my Rust RecordLayer-ish crate.

@rajivr Hello, I’m the CEO of Quickwit :slight_smile: Quickwit 0.5 should be released in a few weeks and will offer native support for tracing if you are still interested in it. You will be able to send OTEL traces to quickwit and connect jaeger to quickwit, without extra configuration


The feature is now released, in case you missed it: Distributed Tracing with Quickwit | Quickwit