Show the forums: Ingest FDB logs into ClickHouse

Hello *,

Recently, for my own uses, I’ve been experimenting with (and somewhat standardizing on) ClickHouse for a lot of my time series/OLAP/logging data – it’s a screaming-fast column store for OLAP workloads. Then I found a question on the forums about ingesting/reading trace files, which made it apparent that a lot of people are using one-off tools for logging/analytics. I thought it might be a good use of ClickHouse to try ingesting and querying the FDB trace logs: they’re “wide” event logs with potentially many columns per event, which is a decent fit for columnar designs.

Here’s the result of this experiment: a tool, contained in a Docker container, that you can launch to watch for log files being rotated, and ingest them whenever fdbserver syncs/flushes them and creates a new file. It depends on the (relatively new) --trace_format json directive in fdbserver.

Additions to the column set, better CODEC choices, etc. would be appreciated.

In the future, ClickHouse will also (ideally) support reading/scanning/ingesting data from S3 sources directly, without tools like Kafka, which would open up another mechanism for log ingestion (this tool would then only have to write cleaned-up logs to S3, and schema management, query construction, etc. could be done elsewhere). This would have the obvious value-add of re-using any existing S3 endpoints in your infrastructure, which are already necessary for robust and scalable FoundationDB backups.

There are several more details about how things work inside the README. I’d appreciate any feedback, but probably the most important question is:

  • Is waiting for close(2) events on trace files through inotify a valid and legitimate way of tracking log file rotations? The whole assumption here is that a single fdbserver writes to a single trace file and, once it’s closed, it’s rotated and never touched again. Provided this guarantee is met, I think everything else is relatively sound, right?

Very fast TL;DR: you can run these two docker images to bring up a demo ClickHouse server and a copy of this tool, which will ingest any .json trace logs in /var/log/foundationdb into ClickHouse:

$ docker run -d --rm \
    --ulimit nofile=262144:262144 \
    -p 8123:8123 \
    --name clickhouse-server \
    yandex/clickhouse-server

$ docker run -d --rm \
    --link clickhouse-server \
    -e CLICKHOUSE_ADDR=http://clickhouse-server:8123 \
    -e CLICKHOUSE_DB=testing \
    -e CLICKHOUSE_TABLE=cluster01 \
    -v /var/log/foundationdb:/data/logs \
    thoughtpolice/fdblog2clickhouse:latest

After a while, you can run a clickhouse-client and check out the data:

$ docker run -it --rm \
    --link clickhouse-server \
    yandex/clickhouse-client \
    --host clickhouse-server

ClickHouse client version 19.5.2.6 (official build).
Connecting to clickhouse-server:9000 as user default.
Connected to ClickHouse server version 19.5.2 revision 54417.

e328593055b3 :) describe table testing.cluster01;

DESCRIBE TABLE testing.cluster01

┌─name───────┬─type─────────────┬─default_type─┬─default_expression─┬─comment─────────┬─codec_expression─┐
│ As         │ Nullable(String) │              │                    │ Lorem ipsum     │ NONE             │
│ ID         │ String           │              │                    │ Lorem ipsum     │ NONE             │
│ Locality   │ Nullable(String) │              │                    │ Lorem ipsum     │ NONE             │
│ Machine    │ String           │              │                    │ Lorem ipsum     │ NONE             │
│ Severity   │ UInt32           │              │                    │ Event severity  │ NONE             │
│ Transition │ Nullable(String) │              │                    │ Lorem ipsum     │ NONE             │
│ Time       │ DateTime         │              │                    │ Event timestamp │ NONE             │
│ Type       │ String           │              │                    │ Event type      │ NONE             │
└────────────┴──────────────────┴──────────────┴────────────────────┴─────────────────┴──────────────────┘

8 rows in set. Elapsed: 0.001 sec.

e328593055b3 :) select count(*) from testing.cluster01;

SELECT count(*)
FROM testing.cluster01

┌─count()─┐
│   25831 │
└─────────┘

1 rows in set. Elapsed: 0.002 sec. Processed 25.83 thousand rows, 103.32 KB (10.34 million rows/s., 41.38 MB/s.)

e328593055b3 :)

This is great! Does ClickHouse support querying the json data directly?

We (Snowflake) ingest all of our FDB traces into Snowflake through our ingestion service. The table we use has the following schema:

| name     | type              | kind   |
|----------+-------------------+--------|
| TIME     | TIMESTAMP_NTZ(9)  | COLUMN |
| SEVERITY | NUMBER(38,0)      | COLUMN |
| MACHINE  | VARCHAR(16777216) | COLUMN |
| TYPE     | VARCHAR(16777216) | COLUMN |
| FILE     | VARCHAR(16777216) | COLUMN |
| V        | VARIANT           | COLUMN |

The table currently has around 20 billion rows. Us ingesting into Snowflake is actually the reason --trace_format json exists :wink:

The gist of it is that I strongly agree that having the traces in a data warehouse is a great thing to have. This has helped us quite often to analyze production issues with FDB.

Additionally, we also store trace files from all simulation tests in a database (a separate table, but the same schema).

Once you have this data ingested and your SQL dialect supports JSON, you can also create views for the most common types you query, to bring them into a relational form. We never bothered to do that, as nobody seems to mind querying the JSON directly, but if you were to connect the database to some third-party application that doesn’t understand JSON, this would be useful.

ClickHouse does have some JSON functions, but they aren’t as sophisticated as Snowflake’s VARIANT type. It also doesn’t have a more efficient encoding for JSON than just compressing the string.

This is great! Does ClickHouse support querying the json data directly?

You mean doing things like raw scans over the actual documents? No, I’m afraid not. Though maybe you could add it…

ClickHouse is an OLAP database, but it is firmly a SQL one, so while you don’t necessarily want a fully normalized schema (just a wide one with everything), you still need a schema. This is one of the reasons the Python tool snips out the columns it knows about from the underlying JSON; another is that these columns can be enriched with metadata/column type information for ClickHouse (like the compression codec). When you want to write rows into the database, you can use JSON (in a very particular format), but the JSON “schema” must match the table schema directly.

ClickHouse errors when the submitted JSON contains columns that aren’t in the table, so everything must match exactly. It also errors when a non-Nullable column has NULL inserted into it or is missing (Nullable columns require more overhead in its design and degrade performance), hence why some columns are optional in my current schema.
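
To make that concrete, here’s a rough sketch of what the table and an insert look like. The columns mirror the DESCRIBE output above, but the engine/partitioning choices here are just illustrative, not necessarily exactly what the tool creates:

CREATE TABLE testing.cluster01
(
    -- Fields assumed present on every trace event stay non-Nullable...
    Time       DateTime,
    Severity   UInt32,
    Type       String,
    Machine    String,
    ID         String,
    -- ...while event-specific fields are Nullable, since most events omit them.
    `As`       Nullable(String),
    Locality   Nullable(String),
    Transition Nullable(String)
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(Time)
ORDER BY (Type, Time);

-- Rows go in as one JSON object per line; keys must line up with the column
-- names, and every non-Nullable column must be present. Values are made up.
INSERT INTO testing.cluster01 FORMAT JSONEachRow
{"Time":"2019-05-01 12:00:00","Severity":10,"Type":"Role","Machine":"10.0.0.1:4500","ID":"0000000000000001","As":"StorageServer","Transition":"Begin"}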

ClickHouse does support a limited subset of functionality on JSON strings. So what you could do is keep the relevant data in actual columns after cleaning up the DataFrame, then have a ‘raw’ column that contains the original JSON row in its entirety as a String. CODEC(ZSTD) would substantially reduce the space usage of this column, and it would let you do some limited JSON stuff in the meantime. ClickHouse will probably expand some of its JSON support in the future, but I don’t know by how much.
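
Something along these lines, for example (the extracted field name ‘Roles’ is just illustrative, and exactly which JSON functions you have depends on your ClickHouse version):

-- Keep the full original event next to the extracted columns, heavily compressed.
ALTER TABLE testing.cluster01 ADD COLUMN Raw String CODEC(ZSTD);

-- Later, pull an un-columnized field straight out of the raw document.
SELECT visitParamExtractString(Raw, 'Roles') AS roles, count()
FROM testing.cluster01
WHERE Type = 'ProcessMetrics'
GROUP BY roles;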


One thing I experimented with was slurping up all my logs into a single DataFrame and then dumping out a schema based on every trace that was seen. This resulted in a 500+ column-wide schema, which ClickHouse could handle, but the problem is it’s incomplete: you don’t know what columns are “possible” until you see them in a trace log, meaning you can get stuck on invalid log files in the future if they include columns you didn’t previously account for. I imagine I am missing quite a few possible columns in my logs, making it difficult to be sure the schema is correct. Otherwise ClickHouse could ingest every column of every trace file with little fuss.

FoundationDB, as it stands, doesn’t support any kind of exportable schema for all possible trace events – events are simply added to the source code as freeform strings. There would need to be a big refactoring so that e.g. all trace events were put into a separate file with metadata about the format of each trace event and what columns/types are present, which then got used/compiled into the source code somehow (e.g. an X-Macro or preprocessor hack). This would be a good change and would make it possible to build reliable versions of this tool that work for any trace, but it would be a significant amount of work, I think.

I see… IMHO it is a bad idea to force a schema on traces. We want to be able to add more stuff on the fly, and having to change a schema whenever you add tracing will result in people not tracing useful stuff anymore (as some people - like myself - are lazy).

I think any tool that we want to use to analyze traces has to support semi-structured data. Snowflake optimizes JSON really well (internally we use a PAX layout and automatically create columns for common JSON keys) and the query language is easy to use. The main drawback is that it is not an open source tool… But it is one option for people who are willing to pay.

Did you take a look at Presto for this? Presto should be able to read data directly from S3, and I think it has some JSON support (not sure how good it is).

Semi-structured querying is probably quite useful and makes the pipeline simpler; ClickHouse might get better support for it, but I think in practice the trace structure for many events is probably already pretty stable (at least for the common fields). After all, even if you can store and query semi-structured data, making sweeping changes to the trace format will probably break tons of stuff (reporting dashboards, etc.) that you have to account for anyway. You’ll do canary rollouts, tests, schema updates, and so on. A flexible format like the one in use now gives you a bit more agility (which is good), but it’s not like the trace format will (or should) change all the time, or like you run major version upgrades every day, either.

In return for having a schema, though, ClickHouse has absolutely unrivaled performance and space usage compared to almost anything else that is readily available. (I don’t know how much space 20 billion records uses for you, but it’d be interesting to compare: I’d ballpark that ClickHouse would get 80%+ space savings and easily do multi-hundred-million rows/s on a single high-end node.) I don’t know of any open source semi-structured tool that comes close to ClickHouse’s performance, or even of equivalent commercial OLAP SQL offerings – it’s nearly as fast as (or faster than) BigQuery in many cases, but quirkier and not as polished.

I have also looked at Presto quite a bit (I recently wanted to use it on the multi-PB Common Crawl Index, available on S3, actually), and it could almost certainly fit this use case, but it’s more of a “data warehouse frontend” than an actual storage system. As a result there are some up-front differences:

  • S3 support requires running a Hive metastore service to keep schema metadata, and That’s An Annoying Amount Of Work™. Presto itself is easy to run, but Hive is… surprisingly annoying to run. You’d think this very common data-warehouse use case would have a native storage connector, making it easy, but the Hive connector is the only supported way of doing this for now.
  • Even if the data is on S3, you will lose substantial performance/space savings vs. ClickHouse unless you process records into an on-disk columnar format such as Parquet, so you still need some kind of batch-processing pipeline. Again, this works with Presto (Parquet on S3 is probably the closest thing to a columnar OLAP storage layer you can get there), but it still requires Hive.
  • Presto still models all foreign connectors with “ahead-of-time” schemas, so you would still need to write out the schema for your Parquet data on S3, etc. Though, it has more traditional/expansive JSON support that isn’t as limited as ClickHouse’s.

These drawbacks (especially the first two, which bring a much more complex set of failure/operation modes) mean that ClickHouse is ultimately a much better choice for most of my uses, since it’s much more effective as far as ROI/commodity hardware usage goes and very easy to run.

Ultimately, it would be better to build something like a ClickHouse connector for Presto. Then you could load trace files into ClickHouse and query ClickHouse schemas from Presto easily, without the need for Hive or anything else, while retaining the other benefits.

EDIT: I should say, though, that one of my eventual goals is to pipe full trace logs into ‘cold storage’ on my own S3 endpoint, and given that – if you can figure out the 30 magical commands needed to get Hive working – Presto would still be a good choice to consider, since you need a little data loader for ClickHouse (resp. a Parquet converter) anyway.

This table uses ~1.5TB of space (or around $30/month in money - which is the important number for us). A simple full-table scan on one row takes 30 seconds on a large warehouse. But I don’t think you can make an easy apples-to-apples comparison between Snowflake and ClickHouse.

But I don’t think this should become a Snowflake vs ClickHouse thread, and I am sorry if I steered it in that direction :wink: If Snowflake is slower than ClickHouse I would be very much interested in some numbers.

But please don’t get me wrong: I didn’t say that you shouldn’t use ClickHouse, I just said that without support for semi-structured data in your OLAP system you will probably run into problems, and it might be worth exploring other options.

You are right that existing traces typically don’t change that often (though they do sometimes change quite drastically). But you will probably see new traces and/or new keys in existing traces whenever you change the FDB version, and this will probably be true even for patch releases. Changing your database schema whenever you switch to a different version of FDB won’t be fun. Furthermore, your tables will be very sparse, which means you will need quite a lot of projection to make results readable (you can improve this by creating many views - or maybe even a view per trace type - but it is still not optimal).

If you can still make this work well with ClickHouse, I think there would be a lot of value in this work. Did you take a look at the Wavefront tailer? Depending on how cheap inserts are for ClickHouse, you could adapt it from there. This would allow you to ingest data more quickly than FDB rotates files.

Also, if you don’t mind me asking: what is your ultimate goal here? Do you simply want to have OLAP over your own files, or do you want a general solution for FoundationDB? If it is the first, Snowflake will probably be cheaper for you (as the storage pricing is the same as S3 and you only pay for compute if you actually run queries).

No worries, and that’s an interesting ballpark number for the size/cost. It’d probably be very small with ClickHouse, but comparing CH to a system like Snowflake with disaggregated storage, as you say, makes cost comparisons very wonky here (even if ClickHouse compresses that to 1/10th the size, the cost of an included cloud SSD for that storage, and the CPUs, is probably a bit more variable/costly than the relative warehousing cost, etc.)

Right, the important thing is trying to isolate what are probably the most common, stable trace columns, and using those for the main table. After all, even being able to aggregate error severity by machine across some time window is still very useful, even if many other columns are difficult.
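
For example, even with just the stable columns, a query like this already tells you a lot (the severity cutoff here is just an illustrative guess, not an official FDB convention):

-- High-severity events per machine over the last hour; adjust the Severity
-- threshold to match whatever FDB uses for warnings/errors.
SELECT Machine, Type, count() AS events
FROM testing.cluster01
WHERE Severity >= 30 AND Time > now() - INTERVAL 1 HOUR
GROUP BY Machine, Type
ORDER BY events DESC
LIMIT 20;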

It would still be excellent if there were some schema describing all possible trace events in the source code. At least then there would be some hope of comparing/generating schemas, writing migrations, etc.

The sparsity thing is true and common, unfortunately. :frowning: The ultra-wide “has everything” table is the easy part. The best thing is to use a (MATERIALIZED) VIEW into other storage engines for these cases, with more refined schemas, which requires a bit of thought as you say.
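
As a rough sketch of what a per-trace-type view might look like (assuming ‘Role’ events carrying the As/Transition fields from the schema above; the engine/partitioning choices are just an example, not something the tool sets up):

-- Route a single well-known event type into its own narrow, denser table.
-- New rows flow in automatically on insert; add POPULATE to backfill.
CREATE MATERIALIZED VIEW testing.cluster01_roles
ENGINE = MergeTree()
PARTITION BY toYYYYMM(Time)
ORDER BY (Machine, Time)
AS SELECT Time, Machine, ID, `As`, Transition
FROM testing.cluster01
WHERE Type = 'Role';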

I didn’t look at the Wavefront tailer, but I know of it. ClickHouse doesn’t really like small writes at all, unfortunately, and they’re an easy way to annihilate your performance – it wants very large batched writes (tens or hundreds of thousands of entries per insert). It can do many large batched writes concurrently at very high rates, but they must be large. Therefore, I don’t think the tailing approach is very appropriate.

The default FDB policy of rotating logs at 10MB is actually a pretty good setting for this case, and it’s why I use the “wait for a log rotation to occur, then slurp that up” strategy, since it works nicely and is extremely easy to implement.

A more generalized tool, one that tailed logs and did continuous/streaming inserts, or that could wait for log rotation to occur and do big “batch” inserts of a whole log, would be very useful.

Basically: I need a better way of aggregating/searching logs for performance tuning, debugging, etc., and I already use ClickHouse for other use cases like this (logging). So it was an easy fit, and I get to re-use what I already have available (my hardware, my server, my SQL skills).

Generally, this tool should already be usable for deployment elsewhere if you have ClickHouse on hand; there’s nothing specific to my deployment. One goal is to eventually also use it for the simulation tests of my NixOS packages (as I mentioned elsewhere), but it should be usable by other people today, too.

Also, regarding pricing: I mostly have decent hardware already running ClickHouse (my workstation, an alright remote VPS) for my low-volume uses, so there’s a bit of a sunk cost here where I might as well use the resources I have at hand rather than pay more for S3/compute. Since ClickHouse also prioritizes performance above all else, I get maximum return on my hardware investment too – although I probably won’t get that far, I could easily go to multi-billion datapoints on either machine without using much space/compute, and it’s already available.