精读预计 7 分钟

RawTree

摘要

针对传统分析数据库需预设 Schema 和索引的痛点，RawTree 实现了无模式（Schemaless）的高速数据摄取。其核心特性包括：系统能根据实际查询模式自动生成主键和投影（Projections）以优化性能；具备支持动态类型的 SQL 引擎，可自动处理混合类型数据的排序与聚合；提供简洁的 API 供 AI Agent 调用。该项目由前 ClickHouse 研究员开发，旨在兼顾半结构化数据的灵活性与高性能分析。

荐读理由

用 RawMergeTree 引擎建表，直接 ingest JSON 日志或 webhook 数据，自动生成主键与 projections，SQL 查询动态列无需提前建 schema 或 index；这套可直接抄到 AI 工程项目里，替代 ClickHouse 的 schema-first 模式

原文

Introducing RawTree

Jun 9, 2026 by Maksim Kita

Most analytics databases today assume you know your table schemas upfront. You define tables, set up indexes, create materialized views, and only then can you start querying your data.

Without indexes or materialized views the database is slow. With them, optimizing queries requires a lot of work, usually involving iterating the tables and materialized view's schemas, and they are not effective at dealing with unstructured/semi-structured data. All of this makes their APIs and interaction complex, which is not ideal for coding agents, because they have to make decisions about data structure/schema/indexes before they have any data. The core issue is "schema before data".

Imagine a team building an AI support agent. Every run produces different data: prompts, tool calls, retrieval results, latencies, error payloads, user metadata, feedback, sometimes extra fields that did not exist the day before. At the beginning they do not know which of those fields will matter most. They just want to keep everything and start asking questions immediately.

Ideally, the agent should be able to send requests like:

curl -X https://api.rawtree.com/v1/tables/agent_runs \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '[
    {"run_id":"r1","model":"gpt-5","latency_ms":812,"tool":"search","status":"ok"},
    {"run_id":"r2","model":"gpt-5","latency_ms":"1540","tool":"browser","status":"error","error_code":"timeout"}
  ]'

So that after a few runs, it could start running queries like these:

curl -X "https://api.rawtree.com/v1/query" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"sql":"SELECT tool, status, count(), avg(latency_ms) \
    FROM agent_runs \
    GROUP BY tool, status \
    ORDER BY count() DESC;"}'

We built RawTree to solve this problem. Our main motivation and goal was to create an ideal analytical database for agents, and thus for humans too. The main feature is its simplicity: you just ingest any data, query with SQL, fetch results via API. No schema upfront, and no need to design indexes, primary keys, or materialized views to start working. RawTree is designed to be useful on raw workloads immediately, then improve recurring workloads over time by adapting to actual query access patterns.

Before jumping into details of each feature and history of RawTree I wanted to highlight some of the main features that make RawTree a unique database:

Schemaless fast and predictable ingestion. You can connect RawTree to log exporter/webhook/custom built API/ingestion pipeline in minutes.
Reliable type system and SQL for dynamic columns. Custom and fast (almost zero overhead) operators for GROUP BY/JOINs/ORDER BY/DISTINCT/Window functions.
Fast query execution on raw untyped data. For even faster execution client can add types directly in SQL without necessity to change schema.
Database optimizes itself based on client query access patterns. There are no indexes/primary keys/materialized views to define; the primary key indexes/JIT projections are created automatically under the hood based on query access patterns without having to define them manually.
Clean API for agents. A coding agent should be able to learn to use it successfully with a few lines of text.

History of RawTree

RawTree grew out of my previous work on Ursa - ClickHouse research fork. Ursa started as a performance project: I wanted to see how fast I could make an analytical engine if I did not have to preserve backward compatibility. That work gave me a serious engine foundation, but it also made another problem more obvious.

Performance is only part of the difficulty. A lot of analytical systems still assume that before you can do useful work, someone has to decide the schema, the sort order, the indexes, the projections, and how the data model will evolve. That can work when the workload is stable and the team already knows what matters. It doesn't works as well when the data is messy, the shape keeps evolving, and queries appear before the model is settled.

I wanted a different operating model: ingest first, query immediately, and let the physical optimization follow real usage instead of forcing those decisions upfront.

That is not just a product or API problem. It creates a few hard database problems.

Type system for dynamic data. Data cannot stay as raw JSON internally if you want serious analytical performance. It has to be stored and processed in typed form (String, Integer, etc.). But once a column can contain different types, you need clear and predictable SQL semantics for filtering, grouping, ordering, joins, aggregates, and functions.
Performance on raw workloads. Accepting semi-structured data is not useful if it makes the engine much slower. The goal is not convenience at any cost. The goal is to keep the engine fast while still allowing flexible ingest and query patterns.
Optimization without upfront modeling. If the user does not define primary keys, indexes, or projections in advance, the engine needs to infer useful physical structures from actual workload patterns. That means collecting the right statistics and making good decisions automatically, without turning the system into a black box.

RawTree is the result of trying to solve those problems together in one database.

The rest of this post shows some of the main pieces: schemaless ingestion, SQL over dynamic data, and automatic physical optimization based on real query access patterns.

Schemaless ingestion and APIs

RawTree is built around RawMergeTree table engine. This engine allows you to ingest any data and primary key/runtime indexes/projections will be created automatically.

 test_table ENGINE=RawMergeTree;
 test_table
    ('{"id": 0, "value": "Value_0"}'),
    ('{"id": 1, "value": 1}'),
    ('{"id": 2, "value": 2.0}');

 id, value, dynamicType(id)  id_type, dynamicType(value)  value_type
 test_table;

┌─id─┬─value───┬─id_type─┬─value_type─┐
│ 0  │ Value_0 │ Int64   │     │
│ 1  │ 1       │ Int64   │ Int64      │
│ 2  │ 2.0     │ Int64   │ Float64    │
└────┴─────────┴─────────┴────────────┘

This makes it possible to build clean and simple APIs for agents. You just insert your data, and tables and columns are created automatically.

curl -X https://api.rawtree.com/v1/tables/events \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '[
    {"action":"click","user":"alice","value":42},
    {"action":"view","user":"bob","value":10,"info":"some additional info"}
   ]'

Returns: {"inserted":2}

Autogenerated projections

Autogenerated projections are created automatically based on actual query access patterns. The system will automatically identify repeatedly executed slow queries and will create the necessary projections. For example:

    region,
    sum(amount)  sum
 test_auto_projection
 region
 region

5 rows in set. Elapsed: 2.125 sec. Processed 100.00 million rows, 3.50 GB (47.06 million rows/s., 1.65/s.)
Peak memory usage: 25.60

If you run this relatively slow query multiple times, then projection advisor will decide to automatically create projection and then:

    region,
    sum(amount)  sum
 test_auto_projection
 region
 region

┌─region─┬─────────sum─┐
│ 0      │  9945969605 │
│ 1      │  9966469035 │
│ 2      │  9997709681 │
│ 3      │ 10009853640 │
│ 4      │ 10028792257 │
└────────┴─────────────┘

5 rows in set. Elapsed: 0.005 sec.

The projection advisor will favour lightweight projections for low-cardinality aggregation scenarios, where the number of groups after aggregation is small. This lets the engine maintain many lightweight projections with low storage overhead to accommodate client access patterns.

Autogenerated PRIMARY KEY

In RawMergeTree, data is stored in parts similar to ordinary MergeTree. But each part has its own primary key. The engine automatically creates a primary key when data is inserted, based on column types and cardinality. Later during actual query access pattern statistics are collected and primary key is rebuilt during merges based on those statistics. For example:

 test_table= RawMergeTree;

 test_table
 concat('{"id":', toString(number), ',"value":"Value_', toString(number), '"}')
 numbers(10_000_000);

 id, value test_table  id = 5;

┌─id─┬─value───┐
│ 5  │ Value_5 │
└────┴─────────┘

1 row in set. Elapsed: 0.005 sec. Processed 13.95 thousand rows, 237.33 KB (2.74 million rows/s., 46.64/s.)
Peak memory usage: 187.55

If we run EXPLAIN we can see amount of data that was read and that index was actually used:

= 1 * test_table  id = 5;

┌─explain────────────────────────────────────────────────────────────┐
│ Expression ((Project names + Projection))                          │
│   Expression (( +to column identifiers)) │
│     ReadFromMergeTree (default.test_table)                         │
│     Indexes:                                                       │
│       PrimaryKey                                                   │
│         Condition: true                                            │
│         Parts: 1/1                                                 │
│         Granules: 2/1221                                           │
│       Ranges: 2                                                    │
└────────────────────────────────────────────────────────────────────┘

I will write more about autogenerated PRIMARY KEY features in next blog posts and cover more details like how engine handles complex scenarios like dynamic primary key columns, primary key analysis for dynamic columns, actual filter statistics collection and why there are 2 granules in EXPLAIN output instead of 1.

Type System

All SQL operators DISTINCT, GROUP BY, JOINS, ORDER BY and both ordinary and aggregate functions need to support working natively with dynamic types and provide good and predictable default behavior.

To show why this matters, consider a few places where ordinary SQL semantics become tricky for dynamic columns. In systems without dynamic-type-aware operators, even basic operations such as ORDER BY, GROUP BY, and aggregates either fail or require the user to opt into behavior that may not match what they expect.

ORDER BY operator in ClickHouse with a Dynamic type:

 test_dynamic (value);
 test_dynamic  (1), (2.0), (3);
 * test_dynamic  value;

from server (version 26.5.1):
Code: 44. DB::Exception: Received from localhost:9000. DB::Exception:  types Variant/ are not allowed in keys,
because it can lead to unexpected results. Consider using a subcolumn with a specific data type instead
(for example 'column.Int64' or 'json.some.path.:Int64' if its a  path subcolumn) or casting this column to a specific data type.
 setting allow_suspicious_types_in_order_by = 1 in order to allow it. (ILLEGAL_COLUMN)

Even when allowing this operation explicitly, the behavior may not be what a client expects:

 allow_suspicious_types_in_order_by = 1

 value test_dynamic  value;

┌─value─┐
│ 2     │
│ 1     │
│ 3     │
└───────┘

This sorting behavior also needs to be consistent with all functions that compare values internally like arraySort:

 arraySort([1, 2.0, 3]::())

┌─arraySort(CA⋯)'))─┐
│ [2,1,3]                  │
└──────────────────────────┘

In RawTree ORDER BY is supported for dynamic columns and all functions that compare values by default, and it is possible to set legacy behavior if needed with dynamic_columns_compare_by_value setting:

 test_dynamic (value);
 test_dynamic  (1), (2.0), (3);

 * test_dynamic  value;

┌─value─┐
│ 1     │
│ 2     │
│ 3     │
└───────┘

 arraySort([1, 2.0, 3]::())

┌─arraySort(CA⋯)'))─┐
│ [1,2,3]                  │
└──────────────────────────┘

Similar issues show up with GROUP BY. Without dynamic-type-aware grouping semantics, compatible values with different concrete integer types can be treated as separate groups:

 test_dynamic (value);
 test_dynamic  toUInt8(1);
 test_dynamic  toUInt64(1);

 value, ()  test_dynamic  value;

from server (version 26.5.1):
Code: 44. DB::Exception: Received from localhost:9000. DB::Exception:  types Variant/ are not allowed in keys,
because it can lead to unexpected results. Consider using a subcolumn with a specific data type instead (for example 'column.Int64'
or 'json.some.path.:Int64' if its a  path subcolumn) or casting this column to a specific data type.
 setting allow_suspicious_types_in_group_by = 1 in order to allow it. (ILLEGAL_COLUMN)

For example:

 value, ()  test_dynamic  value;

┌─value─┬─()─┐
│ 1     │       1 │
│ 1     │       1 │
└───────┴─────────┘

In RawTree for GROUP BY there is special operator implementations where all integer types are treated as the same virtual Integer type:

 test_dynamic (value);
 test_dynamic  toUInt8(1);
 test_dynamic  toUInt64(1);

 value, ()  test_dynamic  value;

┌─value─┬─()─┐
│ 1     │       2 │
└───────┴─────────┘

Aggregates have the same problem. A function like sum needs to understand which concrete dynamic types it can process:

 test_dynamic (value);
 test_dynamic  toUInt8(1);

 sum(value)  test_dynamic;

from server (version 26.3.1):
Code: 43. DB::Exception: Received from localhost:9000. DB::Exception: Illegal type of argument
for aggregate function sum. (ILLEGAL_TYPE_OF_ARGUMENT)

In RawTree it works without any issues:

 test_dynamic (value);
 test_dynamic  toUInt8(1);

 sum(value)  test_dynamic;

┌─sum(value)─┐
│ 1          │
└────────────┘

A curious reader could ask what will happen if sum function is executed on column that contains String and Integer types? It is a very interesting system design question. In RawTree we ignore values that function cannot process, but we also provide a hint to the client for visibility and awareness:

 test_dynamic (value);

 test_dynamic  (1), (2), ('');

 sum(value)  test_dynamic;

┌─sum(value)─┐
│ 3          │
└────────────┘

Hint:  function 'sum' skipped rows where column arguments have concrete types (String).
with the function's signature. For function 'sum(value)'
in scope ' sum(value)  test_dynamic'

In RawTree similar hint logic is also implemented for ordinary functions:

 value test_dynamic  value >= 0;

┌─value─┐
│ 1     │
│ 2     │
└───────┘

2 rows in set. Elapsed: 0.002 sec.

Hint:  'greaterOrEquals' skipped rows with column argument at index 0 for column 'value' with concrete type
is incompatible with other function arguments, function was called with types (String, UInt8).
 function 'value >= 0' in scope 'SELECT value FROM test_dynamic WHERE value >= 0'

In future blog posts I will deep dive into RawTree type system and explain more details about it.

Performance

As part of the RawTree launch we are also working on RawBench, a benchmark for schemaless JSON analytics. The benchmark ingests and queries raw GitHub JSON data from https://www.gharchive.org/ and is similar to other benchmarks like ClickBench and JSONBench in methodology, but focused on semi-structured data.

The early results look promising. This is a preview of the combined score, where lower is better and bars use a compressed scale so the slower systems remain readable:

RawTree       ×2.42   █
StarRocks     ×4.50   ██
Apache Doris  ×5.85   ██▌
ClickHouse    ×6.38   ██▊
Elasticsearch ×44.78  ████████
DuckDB        ×85.46  ███████████
PostgreSQL    ×101.01 ████████████
MongoDB       ×392.21 ████████████████████

We will publish RawBench separately with the repository, exact scope, methodology, setup, reproduction steps, and final results soon.

Conclusion

Too much analytical work still begins with modeling decisions that are hard to make well before real data and real queries exist.

We think that assumption is wrong for a growing class of workloads, especially semi-structured and agent-driven ones. But removing schema upfront is only useful if the database still behaves like a serious analytical system: types have to make sense, SQL has to stay predictable, and performance has to remain strong.

That is the problem RawTree is trying to solve. If this sounds interesting, you should join RawTree's private beta

Hacker News · 1 赞 · 0 评讨论 → 阅读原文 →

这条对你有帮助吗？