RawTree
摘要
针对传统分析数据库需预设Schema和索引的痛点,RawTree实现了无模式(Schemaless)的高速数据摄取。其核心特性包括:系统能根据实际查询模式自动生成主键和投影(Projections)以优化性能;具备支持动态类型的SQL引擎,可自动处理混合类型数据的排序与聚合;提供简洁的API供AI Agent调用。该项目由前ClickHouse研究员开发,旨在兼顾半结构化数据的灵活性与高性能分析。
荐读理由
针对 AI Agent 产生的不确定性、高频变动数据,该数据库方案通过自动推断主键与投影(Projections)实现了“先入库后建模”,能让你在无需手动维护 Schema 和索引的情况下,直接对原始 JSON 执行高性能 SQL 分析。
原文
Introducing RawTree
Jun 9, 2026 by Maksim Kita
Most analytics databases today assume you know your table schemas upfront. You define tables, set up indexes, create materialized views, and only then can you start querying your data.
Without indexes or materialized views the database is slow. With them, optimizing queries requires a lot of work, usually involving iterating the tables and materialized view's schemas, and they are not effective at dealing with unstructured/semi-structured data. All of this makes their APIs and interaction complex, which is not ideal for coding agents, because they have to make decisions about data structure/schema/indexes before they have any data. The core issue is "schema before data".
Imagine a team building an AI support agent. Every run produces different data: prompts, tool calls, retrieval results, latencies, error payloads, user metadata, feedback, sometimes extra fields that did not exist the day before. At the beginning they do not know which of those fields will matter most. They just want to keep everything and start asking questions immediately.
Ideally, the agent should be able to send requests like:
curl -X https://api.rawtree.com/v1/tables/agent_runs \
-H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-d '[
{"run_id":"r1","model":"gpt-5","latency_ms":812,"tool":"search","status":"ok"},
{"run_id":"r2","model":"gpt-5","latency_ms":"1540","tool":"browser","status":"error","error_code":"timeout"}
]'
So that after a few runs, it could start running queries like these:
curl -X "https://api.rawtree.com/v1/query" \
-H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-d '{"sql":"SELECT tool, status, count(), avg(latency_ms) \
FROM agent_runs \
GROUP BY tool, status \
ORDER BY count() DESC;"}'
We built RawTree to solve this problem. Our main motivation and goal was to create an ideal analytical database for agents, and thus for humans too. The main feature is its simplicity: you just ingest any data, query with SQL, fetch results via API. No schema upfront, and no need to design indexes, primary keys, or materialized views to start working. RawTree is designed to be useful on raw workloads immediately, then improve recurring workloads over time by adapting to actual query access patterns.
Before jumping into details of each feature and history of RawTree I wanted to highlight some of the main features that make RawTree a unique database:
Schemaless fast and predictable ingestion. You can connect RawTree to log exporter/webhook/custom built API/ingestion pipeline in minutes.
Reliable type system and SQL for dynamic columns. Custom and fast (almost zero overhead) operators for GROUP BY/JOINs/ORDER BY/DISTINCT/Window functions.
Fast query execution on raw untyped data. For even faster execution client can add types directly in SQL without necessity to change schema.
Database optimizes itself based on client query access patterns. There are no indexes/primary keys/materialized views to define; the primary key indexes/JIT projections are created automatically under the hood based on query access patterns without having to define them manually.
Clean API for agents. A coding agent should be able to learn to use it successfully with a few lines of text.
History of RawTree
RawTree grew out of my previous work on Ursa - ClickHouse research fork. Ursa started as a performance project: I wanted to see how fast I could make an analytical engine if I did not have to preserve backward compatibility. That work gave me a serious engine foundation, but it also made another problem more obvious.
Performance is only part of the difficulty. A lot of analytical systems still assume that before you can do useful work, someone has to decide the schema, the sort order, the indexes, the projections, and how the data model will evolve. That can work when the workload is stable and the team already knows what matters. It doesn't works as well when the data is messy, the shape keeps evolving, and queries appear before the model is settled.
I wanted a different operating model: ingest first, query immediately, and let the physical optimization follow real usage instead of forcing those decisions upfront.
That is not just a product or API problem. It creates a few hard database problems.
Type system for dynamic data. Data cannot stay as raw JSON internally if you want serious analytical performance. It has to be stored and processed in typed form (String, Integer, etc.). But once a column can contain different types, you need clear and predictable SQL semantics for filtering, grouping, ordering, joins, aggregates, and functions.
Performance on raw workloads. Accepting semi-structured data is not useful if it makes the engine much slower. The goal is not convenience at any cost. The goal is to keep the engine fast while still allowing flexible ingest and query patterns.
Optimization without upfront modeling. If the user does not define primary keys, indexes, or projections in advance, the engine needs to infer useful physical structures from actual workload patterns. That means collecting the right statistics and making good decisions automatically, without turning the system into a black box.
RawTree is the result of trying to solve those problems together in one database.
The rest of this post shows some of the main pieces: schemaless ingestion, SQL over dynamic data, and automatic physical optimization based on real query access patterns.
Schemaless ingestion and APIs
RawTree is built around RawMergeTree table engine. This engine allows you to ingest any data and primary key/runtime indexes/projections will be created automatically.
test_table ENGINE=RawMergeTree;
test_table
('{"id": 0, "value": "Value_0"}'),
('{"id": 1, "value": 1}'),
('{"id": 2, "value": 2.0}');
id, value, dynamicType(id) id_type, dynamicType(value) value_type
test_table;
┌─id─┬─value───┬─id_type─┬─value_type─┐
│ 0 │ Value_0 │ Int64 │ │
│ 1 │ 1 │ Int64 │ Int64 │
│ 2 │ 2.0 │ Int64 │ Float64 │
└────┴─────────┴─────────┴────────────┘
This makes it possible to build clean and simple APIs for agents. You just insert your data, and tables and columns are created automatically.
curl -X https://api.rawtree.com/v1/tables/events \
-H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-d '[
{"action":"click","user":"alice","value":42},
{"action":"view","user":"bob","value":10,"info":"some additional info"}
]'
Returns: {"inserted":2}
Autogenerated projections
Autogenerated projections are created automatically based on actual query access patterns. The system will automatically identify repeatedly executed slow queries and will create the necessary projections. For example:
region,
sum(amount) sum
test_auto_projection
region
region
5 rows in set. Elapsed: 2.125 sec. Processed 100.00 million rows, 3.50 GB (47.06 million rows/s., 1.65/s.)
Peak memory usage: 25.60
If you run this relatively slow query multiple times, then projection advisor will decide to automatically create projection and then:
region,
sum(amount) sum
test_auto_projection
region
region
┌─region─┬─────────sum─┐
│ 0 │ 9945969605 │
│ 1 │ 9966469035 │
│ 2 │ 9997709681 │
│ 3 │ 10009853640 │
│ 4 │ 10028792257 │
└────────┴─────────────┘
5 rows in set. Elapsed: 0.005 sec.
The projection advisor will favour lightweight projections for low-cardinality aggregation scenarios, where the number of groups after aggregation is small. This lets the engine maintain many lightweight projections with low storage overhead to accommodate client access patterns.
Autogenerated PRIMARY KEY
In RawMergeTree, data is stored in parts similar to ordinary MergeTree. But each part has its own primary key. The engine automatically creates a primary key when data is inserted, based on column types and cardinality. Later during actual query access pattern statistics are collected and primary key is rebuilt during merges based on those statistics. For example:
test_table= RawMergeTree;
test_table
concat('{"id":', toString(number), ',"value":"Value_', toString(number), '"}')
numbers(10_000_000);
id, value test_table id = 5;
┌─id─┬─value───┐
│ 5 │ Value_5 │
└────┴─────────┘
1 row in set. Elapsed: 0.005 sec. Processed 13.95 thousand rows, 237.33 KB (2.74 million rows/s., 46.64/s.)
Peak memory usage: 187.55
If we run EXPLAIN we can see amount of data that was read and that index was actually used:
= 1 * test_table id = 5;
┌─explain────────────────────────────────────────────────────────────┐
│ Expression ((Project names + Projection)) │
│ Expression (( +to column identifiers)) │
│ ReadFromMergeTree (default.test_table) │
│ Indexes: │
│ PrimaryKey │
│ Condition: true │
│ Parts: 1/1 │
│ Granules: 2/1221 │
│ Ranges: 2 │
└────────────────────────────────────────────────────────────────────┘
I will write more about autogenerated PRIMARY KEY features in next blog posts and cover more details like how engine handles complex scenarios like dynamic primary key columns, primary key analysis for dynamic columns, actual filter statistics collection and why there are 2 granules in EXPLAIN output instead of 1.
Type System
All SQL operators DISTINCT, GROUP BY, JOINS, ORDER BY and both ordinary and aggregate functions need to support working natively with dynamic types and provide good and predictable default behavior.
To show why this matters, consider a few places where ordinary SQL semantics become tricky for dynamic columns. In systems without dynamic-type-aware operators, even basic operations such as ORDER BY, GROUP BY, and aggregates either fail or require the user to opt into behavior that may not match what they expect.
ORDER BY operator in ClickHouse with a Dynamic type:
test_dynamic (value);
test_dynamic (1), (2.0), (3);
* test_dynamic value;
from server (version 26.5.1):
Code: 44. DB::Exception: Received from localhost:9000. DB::Exception: types Variant/ are not allowed in keys,
because it can lead to unexpected results. Consider using a subcolumn with a specific data type instead
(for example 'column.Int64' or 'json.some.path.:Int64' if its a path subcolumn) or casting this column to a specific data type.
setting allow_suspicious_types_in_order_by = 1 in order to allow it. (ILLEGAL_COLUMN)
Even when allowing this operation explicitly, the behavior may not be what a client expects:
allow_suspicious_types_in_order_by = 1
value test_dynamic value;
┌─value─┐
│ 2 │
│ 1 │
│ 3 │
└───────┘
This sorting behavior also needs to be consistent with all functions that compare values internally like arraySort:
arraySort([1, 2.0, 3]::())
┌─arraySort(CA⋯)'))─┐
│ [2,1,3] │
└──────────────────────────┘
In RawTree ORDER BY is supported for dynamic columns and all functions that compare values by default, and it is possible to set legacy behavior if needed with dynamic_columns_compare_by_value setting:
test_dynamic (value);
test_dynamic (1), (2.0), (3);
* test_dynamic value;
┌─value─┐
│ 1 │
│ 2 │
│ 3 │
└───────┘
arraySort([1, 2.0, 3]::())
┌─arraySort(CA⋯)'))─┐
│ [1,2,3] │
└──────────────────────────┘
Similar issues show up with GROUP BY. Without dynamic-type-aware grouping semantics, compatible values with different concrete integer types can be treated as separate groups:
test_dynamic (value);
test_dynamic toUInt8(1);
test_dynamic toUInt64(1);
value, () test_dynamic value;
from server (version 26.5.1):
Code: 44. DB::Exception: Received from localhost:9000. DB::Exception: types Variant/ are not allowed in keys,
because it can lead to unexpected results. Consider using a subcolumn with a specific data type instead (for example 'column.Int64'
or 'json.some.path.:Int64' if its a path subcolumn) or casting this column to a specific data type.
setting allow_suspicious_types_in_group_by = 1 in order to allow it. (ILLEGAL_COLUMN)
For example:
value, () test_dynamic value;
┌─value─┬─()─┐
│ 1 │ 1 │
│ 1 │ 1 │
└───────┴─────────┘
In RawTree for GROUP BY there is special operator implementations where all integer types are treated as the same virtual Integer type:
test_dynamic (value);
test_dynamic toUInt8(1);
test_dynamic toUInt64(1);
value, () test_dynamic value;
┌─value─┬─()─┐
│ 1 │ 2 │
└───────┴─────────┘
Aggregates have the same problem. A function like sum needs to understand which concrete dynamic types it can process:
test_dynamic (value);
test_dynamic toUInt8(1);
sum(value) test_dynamic;
from server (version 26.3.1):
Code: 43. DB::Exception: Received from localhost:9000. DB::Exception: Illegal type of argument
for aggregate function sum. (ILLEGAL_TYPE_OF_ARGUMENT)
In RawTree it works without any issues:
test_dynamic (value);
test_dynamic toUInt8(1);
sum(value) test_dynamic;
┌─sum(value)─┐
│ 1 │
└────────────┘
A curious reader could ask what will happen if sum function is executed on column that contains String and Integer types? It is a very interesting system design question. In RawTree we ignore values that function cannot process, but we also provide a hint to the client for visibility and awareness:
test_dynamic (value);
test_dynamic (1), (2), ('');
sum(value) test_dynamic;
┌─sum(value)─┐
│ 3 │
└────────────┘
Hint: function 'sum' skipped rows where column arguments have concrete types (String).
with the function's signature. For function 'sum(value)'
in scope ' sum(value) test_dynamic'
In RawTree similar hint logic is also implemented for ordinary functions:
value test_dynamic value >= 0;
┌─value─┐
│ 1 │
│ 2 │
└───────┘
2 rows in set. Elapsed: 0.002 sec.
Hint: 'greaterOrEquals' skipped rows with column argument at index 0 for column 'value' with concrete type
is incompatible with other function arguments, function was called with types (String, UInt8).
function 'value >= 0' in scope 'SELECT value FROM test_dynamic WHERE value >= 0'
In future blog posts I will deep dive into RawTree type system and explain more details about it.
Performance
As part of the RawTree launch we are also working on RawBench, a benchmark for schemaless JSON analytics. The benchmark ingests and queries raw GitHub JSON data from https://www.gharchive.org/ and is similar to other benchmarks like ClickBench and JSONBench in methodology, but focused on semi-structured data.
The early results look promising. This is a preview of the combined score, where lower is better and bars use a compressed scale so the slower systems remain readable:
RawTree ×2.42 █
StarRocks ×4.50 ██
Apache Doris ×5.85 ██▌
ClickHouse ×6.38 ██▊
Elasticsearch ×44.78 ████████
DuckDB ×85.46 ███████████
PostgreSQL ×101.01 ████████████
MongoDB ×392.21 ████████████████████
We will publish RawBench separately with the repository, exact scope, methodology, setup, reproduction steps, and final results soon.
Conclusion
Too much analytical work still begins with modeling decisions that are hard to make well before real data and real queries exist.
We think that assumption is wrong for a growing class of workloads, especially semi-structured and agent-driven ones. But removing schema upfront is only useful if the database still behaves like a serious analytical system: types have to make sense, SQL has to stay predictable, and performance has to remain strong.
That is the problem RawTree is trying to solve. If this sounds interesting, you should join RawTree's private beta
这条对你有帮助吗?