ClickHouse and PromQL
Querying metrics stored in ClickHouse with PromQL.
See the end result at https://github.com/maratoid/tsdb-aggregate-proxy.
Background
ClickHouse makes an excellent metrics backend at scale: great compression, fast queries over large data sets, and a single store for metrics, logs, and traces.
Querying requires SQL, so with Grafana you’d typically use the ClickHouse data source plugin, or skip Grafana entirely with ClickStack.
But what if you already have many Grafana dashboards built for a Prometheus data source and still want to consolidate all observability signals in ClickHouse?
ClickHouse TimeSeries table engine
TimeSeries is an experimental feature and is not available in ClickHouse Cloud.
ClickHouse has an experimental TimeSeries table engine that speaks Prometheus remote write and remote read natively.
Creating one is straightforward:
CREATE TABLE timeseries_table ENGINE=TimeSeries
The table itself stores nothing directly — all data lives in its target tables. With the table in place, you can expose Prometheus remote read/write endpoints on your ClickHouse server:
<clickhouse>
<prometheus>
<port>9363</port>
<handlers>
<prometheus_write_rule>
<url>/remote-write</url>
<handler>
<type>remote_write</type>
<database>timeseries_db</database>
<table>timeseries_table</table>
</handler>
</prometheus_write_rule>
<prometheus_read_rule>
<url>/remote-read</url>
<handler>
<type>remote_read</type>
<database>timeseries_db</database>
<table>timeseries_table</table>
</handler>
</prometheus_read_rule>
</handlers>
</prometheus>
</clickhouse>
Now <clickhouse ip>:9363/remote-write and <clickhouse ip>:9363/remote-read are valid Prometheus endpoints, targetable from Prometheus, Thanos, or an OpenTelemetry exporter.
For example:
architecture-beta
group cluster(cloud)[Cluster]
service ch(database)[ClickHouse] in cluster
service chr(internet)['ClickHouse /remote-read'] in cluster
service chw(internet)['ClickHouse /remote-write'] in cluster
service prom(server)[Prometheus] in cluster
service otel(server)[Otel Metrics Gateway] in cluster
service grafana(server)[Grafana] in cluster
prom:T --> B:chr
grafana:T <--> B:prom
otel:T --> B:chw
chr:R <-- L:ch
chw:L --> R:ch
- ClickHouse exposes `/remote-read` and `/remote-write`
- An OpenTelemetry collector uses the `prometheusremotewrite` exporter to push metrics to `/remote-write`
- A lightweight, zero-retention Prometheus instance reads from `/remote-read`
- Grafana queries that Prometheus instance as its data source
Problems
This works for smaller data sets, but PromQL queries computing rates or averages over high-cardinality metrics across 24+ hour windows will quickly start timing out.
Schema tuning helps at the margins, but the core issue is that the TimeSeries engine does no downsampling. Unlike Thanos, every query returns raw, full-resolution samples regardless of the time range — too many for Grafana to handle.
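To put rough numbers on it, consider the raw sample volume behind a single 24-hour panel. The 15 s scrape interval matches the Prometheus config later in this post; the 10,000-series cardinality is an illustrative assumption:

```python
# Rough sample volume for a 24 h remote read at full resolution.
# 15 s scrape interval as in the Prometheus config below; the
# 10,000-series cardinality is an assumed, illustrative figure.
scrape_interval_s = 15
series = 10_000

samples_per_series_per_day = 24 * 3600 // scrape_interval_s
raw_samples_24h = samples_per_series_per_day * series

print(samples_per_series_per_day)  # 5760 samples per series per day
print(raw_samples_24h)             # 57600000 raw points for one 24 h query
```

At 1-minute resolution the same window needs only 1,440 samples per series (a 4x reduction), and at 1-hour resolution just 24.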
The fix: cascade incoming data into three downsampled bucket tables on write:
- 1-minute resolution (high)
- 5-minute resolution (medium)
- 1-hour resolution (low)
Then serve data from the appropriate table based on the query’s time range:
| Query range | 0–24 h | 24–48 h | 48–120 h | 120 h+ |
|---|---|---|---|---|
| Served from | raw data table | high res table | medium res table | low res table |
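The selection rule amounts to a threshold lookup on how far back a query reaches. This is a hypothetical sketch, not the proxy's actual code; the endpoint paths match the ClickHouse Prometheus handler config shown later:

```python
from datetime import timedelta

# Hypothetical routing table mirroring the thresholds above. Endpoint
# paths follow the ClickHouse handler config in this post; the real
# tsdb-aggregate-proxy may implement the decision differently.
ROUTES = [
    (timedelta(hours=24), "/remote-read/raw"),
    (timedelta(hours=48), "/prom-read/1m"),
    (timedelta(hours=120), "/prom-read/5m"),
]
LOW_RES = "/prom-read/1h"

def route(lookback: timedelta) -> str:
    """Pick the endpoint for a remote read reaching `lookback` into the past."""
    for threshold, endpoint in ROUTES:
        if lookback <= threshold:
            return endpoint
    return LOW_RES

print(route(timedelta(hours=6)))   # /remote-read/raw
print(route(timedelta(hours=36)))  # /prom-read/1m
print(route(timedelta(days=10)))   # /prom-read/1h
```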
Prometheus remote read configuration has no routing — multiple endpoints just merge results. So we need a proxy that inspects the time range and forwards each remote read request to the right ClickHouse endpoint:
architecture-beta
group cluster(cloud)[Cluster]
service ch(database)[ClickHouse] in cluster
service chr(internet)['ClickHouse /rm-read/raw'] in cluster
service chr1m(internet)['ClickHouse /rm-read/1m'] in cluster
service chr5m(internet)['ClickHouse /rm-read/5m'] in cluster
service chr1h(internet)['ClickHouse /rm-read/1h'] in cluster
service chw(internet)['ClickHouse /rm-write'] in cluster
service prom(server)[Prometheus] in cluster
service otel(server)[Otel Metrics Gateway] in cluster
service grafana(server)[Grafana] in cluster
service proxy(server)[Routing proxy] in cluster
junction junctionChCenter in cluster
junction junctionChLeft in cluster
junction junctionChLeftMost in cluster
junction junctionChRight in cluster
junction junctionChRightMost in cluster
junction junctionProxyCenter in cluster
junction junctionProxyLeft in cluster
junction junctionProxyLeftMost in cluster
junction junctionProxyRight in cluster
junction junctionProxyRightMost in cluster
ch:B <-- T:junctionChCenter
junctionChCenter:L -- R:junctionChLeft
junctionChLeftMost:R -- L:junctionChLeft
junctionChCenter:R -- L:junctionChRight
junctionChRightMost:L -- R:junctionChRight
junctionChCenter:B --> T:chr1m
junctionChLeftMost:B --> T:chw
junctionChLeft:B --> T:chr
junctionChRight:B --> T:chr5m
junctionChRightMost:B --> T:chr1h
proxy:T -- B:junctionProxyCenter
junctionProxyLeftMost:R -- L:junctionProxyLeft
junctionProxyLeft:R -- L:junctionProxyCenter
junctionProxyRight:L -- R:junctionProxyCenter
junctionProxyRightMost:L -- R:junctionProxyRight
junctionProxyLeftMost:T --> B:chr
junctionProxyLeft:T --> B:chr1m
junctionProxyRight:T --> B:chr5m
junctionProxyRightMost:T --> B:chr1h
prom:T --> B:proxy
grafana:T --> B:prom
otel:T --> B:chw
Implementation
Database schema
The schema defines four TimeSeries engine tables:
- `timeseries_raw` — raw metric data
- `timeseries_1m` — downsampled to 1 min
- `timeseries_5m` — downsampled to 5 min
- `timeseries_1h` — downsampled to 1 hour
TimeSeries tables don’t store data themselves — samples and labels live in target tables. These are created automatically by default, but here we define them explicitly:
- `timeseries_data_table` — target data table for `timeseries_raw`
- `timeseries_1m_table` — target data table for `timeseries_1m`
- `timeseries_5m_table` — target data table for `timeseries_5m`
- `timeseries_1h_table` — target data table for `timeseries_1h`
When samples arrive at timeseries_raw via /remote-write, materialized views cascade the data through progressively coarser buckets:
- `timeseries_data_table` → `timeseries_1m_table` (1 min aggregation)
- `timeseries_1m_table` → `timeseries_5m_table` (5 min aggregation)
- `timeseries_5m_table` → `timeseries_1h_table` (1 hour aggregation)
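Storing sums and counts (rather than pre-computed averages) is what makes the cascade lossless: a coarser bucket's average is still sum-of-sums divided by sum-of-counts, which is exactly what the `value` ALIAS in the schema exposes. A small sketch of the invariant, using illustrative sample values:

```python
# Each downsampling stage keeps (sum, count, min, max) per bucket, so merging
# 1-minute buckets into a 5-minute bucket loses nothing: the true average is
# still sum-of-sums / sum-of-counts, matching the `value` ALIAS in the schema.
raw = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]  # illustrative raw samples

# Stage 1: two 1-minute buckets of three samples each.
one_min = [
    {"sum": sum(chunk), "cnt": len(chunk), "min": min(chunk), "max": max(chunk)}
    for chunk in (raw[:3], raw[3:])
]

# Stage 2: cascade the 1-minute buckets into one 5-minute bucket.
five_min = {
    "sum": sum(b["sum"] for b in one_min),
    "cnt": sum(b["cnt"] for b in one_min),
    "min": min(b["min"] for b in one_min),
    "max": max(b["max"] for b in one_min),
}

avg = five_min["sum"] / five_min["cnt"]
print(avg)  # 7.0, identical to averaging the raw samples directly
assert avg == sum(raw) / len(raw)
```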
-- ============================================================================
-- ClickHouse TimeSeries Schema for Prometheus
-- ============================================================================
CREATE DATABASE IF NOT EXISTS timeseries_db;
-- ============================================================================
-- RAW DATA TABLE
-- ============================================================================
CREATE TABLE IF NOT EXISTS timeseries_db.timeseries_data_table
(
`id` UUID,
`timestamp` DateTime64(3) CODEC(Delta, ZSTD(1)),
`value` Float64 CODEC(Gorilla, ZSTD(1))
)
ENGINE = ReplacingMergeTree(timestamp)
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (id, timestamp)
TTL timestamp + INTERVAL 14 DAY DELETE
SETTINGS
index_granularity = 8192,
compress_marks = true,
compress_primary_key = true,
-- Optimized compression for time series data
min_compress_block_size = 65536,
max_compress_block_size = 1048576;
-- ============================================================================
-- 1-MINUTE AGGREGATION TABLE
-- ============================================================================
CREATE TABLE IF NOT EXISTS timeseries_db.timeseries_1m_table
(
`id` UUID,
`timestamp` DateTime64(3) CODEC(Delta, ZSTD(1)),
`sum_val` Float64 CODEC(Gorilla, ZSTD(1)),
`cnt` UInt64 CODEC(T64, ZSTD(1)),
`min_val` Float64 CODEC(Gorilla, ZSTD(1)),
`max_val` Float64 CODEC(Gorilla, ZSTD(1)),
`value` Float64 ALIAS (sum_val / NULLIF(cnt, 0))
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(timestamp)
ORDER BY (id, timestamp)
TTL timestamp + INTERVAL 90 DAY DELETE
SETTINGS
index_granularity = 8192,
compress_marks = true,
compress_primary_key = true,
min_compress_block_size = 65536,
max_compress_block_size = 1048576;
-- Materialized view: Downsample from raw data to 1-minute buckets
CREATE MATERIALIZED VIEW IF NOT EXISTS timeseries_db.mv_downsample_1m
TO timeseries_db.timeseries_1m_table
AS
SELECT
id,
toStartOfMinute(timestamp) AS timestamp,
sum(value) AS sum_val,
count() AS cnt,
min(value) AS min_val,
max(value) AS max_val
FROM timeseries_db.timeseries_data_table
GROUP BY id, timestamp;
-- ============================================================================
-- 5-MINUTE AGGREGATION TABLE
-- ============================================================================
CREATE TABLE IF NOT EXISTS timeseries_db.timeseries_5m_table
(
`id` UUID,
`timestamp` DateTime64(3) CODEC(Delta, ZSTD(1)),
`sum_val` Float64 CODEC(Gorilla, ZSTD(1)),
`cnt` UInt64 CODEC(T64, ZSTD(1)),
`min_val` Float64 CODEC(Gorilla, ZSTD(1)),
`max_val` Float64 CODEC(Gorilla, ZSTD(1)),
`value` Float64 ALIAS (sum_val / NULLIF(cnt, 0))
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(timestamp)
ORDER BY (id, timestamp)
TTL timestamp + INTERVAL 1 YEAR DELETE
SETTINGS
index_granularity = 8192,
compress_marks = true,
compress_primary_key = true,
min_compress_block_size = 65536,
max_compress_block_size = 1048576;
-- Materialized view: CASCADE from 1m table
CREATE MATERIALIZED VIEW IF NOT EXISTS timeseries_db.mv_downsample_5m
TO timeseries_db.timeseries_5m_table
AS
SELECT
id,
toStartOfFiveMinutes(timestamp) AS timestamp,
sum(sum_val) AS sum_val,
sum(cnt) AS cnt,
min(min_val) AS min_val,
max(max_val) AS max_val
FROM timeseries_db.timeseries_1m_table
GROUP BY id, timestamp;
-- ============================================================================
-- 1-HOUR AGGREGATION TABLE
-- ============================================================================
CREATE TABLE IF NOT EXISTS timeseries_db.timeseries_1h_table
(
`id` UUID,
`timestamp` DateTime64(3) CODEC(Delta, ZSTD(1)),
`sum_val` Float64 CODEC(Gorilla, ZSTD(1)),
`cnt` UInt64 CODEC(T64, ZSTD(1)),
`min_val` Float64 CODEC(Gorilla, ZSTD(1)),
`max_val` Float64 CODEC(Gorilla, ZSTD(1)),
`value` Float64 ALIAS (sum_val / NULLIF(cnt, 0))
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(timestamp)
ORDER BY (id, timestamp)
TTL timestamp + INTERVAL 5 YEAR DELETE
SETTINGS
index_granularity = 8192,
compress_marks = true,
compress_primary_key = true,
min_compress_block_size = 65536,
max_compress_block_size = 1048576;
-- Materialized view: CASCADE from 5m table
CREATE MATERIALIZED VIEW IF NOT EXISTS timeseries_db.mv_downsample_1h
TO timeseries_db.timeseries_1h_table
AS
SELECT
id,
toStartOfHour(timestamp) AS timestamp,
sum(sum_val) AS sum_val,
sum(cnt) AS cnt,
min(min_val) AS min_val,
max(max_val) AS max_val
FROM timeseries_db.timeseries_5m_table
GROUP BY id, timestamp;
-- ============================================================================
-- TAGS/LABELS TABLE
-- ============================================================================
-- Engine: AggregatingMergeTree is required because:
-- 1. min_time and max_time use SimpleAggregateFunction type
-- 2. This is required by aggregate_min_time_and_max_time = true (default)
-- 3. TimeSeries engine will write aggregating inserts to update these columns
CREATE TABLE IF NOT EXISTS timeseries_db.timeseries_tags_table
(
`id` UUID DEFAULT reinterpretAsUUID(sipHash128(metric_name, all_tags)),
`metric_name` LowCardinality(String),
-- tags Map
`tags` Map(LowCardinality(String), String),
-- all_tags Map WILL contain all labels (EPHEMERAL - only for id calculation)
`all_tags` Map(String, String) EPHEMERAL,
`labels_hash` UInt64 MATERIALIZED cityHash64(
arrayStringConcat(
arraySort(arrayMap((k,v) -> concat(k, '=', v), mapKeys(tags), mapValues(tags))),
','
)
),
-- Time range columns - automatically maintained by TimeSeries engine
-- Using SimpleAggregateFunction as required by aggregate_min_time_and_max_time setting
`min_time` SimpleAggregateFunction(min, Nullable(DateTime64(3))),
`max_time` SimpleAggregateFunction(max, Nullable(DateTime64(3)))
)
ENGINE = AggregatingMergeTree
ORDER BY (metric_name, id)
SETTINGS
index_granularity = 8192,
allow_nullable_key = 1;
-- Indexes
ALTER TABLE timeseries_db.timeseries_tags_table
ADD INDEX IF NOT EXISTS idx_id id TYPE bloom_filter(0.01) GRANULARITY 4;
ALTER TABLE timeseries_db.timeseries_tags_table
ADD INDEX IF NOT EXISTS idx_labels_hash labels_hash TYPE set(1000000) GRANULARITY 4;
ALTER TABLE timeseries_db.timeseries_tags_table
ADD INDEX IF NOT EXISTS idx_metric_name metric_name TYPE bloom_filter(0.001) GRANULARITY 1;
ALTER TABLE timeseries_db.timeseries_tags_table
ADD INDEX IF NOT EXISTS idx_tags_keys mapKeys(tags) TYPE bloom_filter(0.01) GRANULARITY 1;
ALTER TABLE timeseries_db.timeseries_tags_table
ADD INDEX IF NOT EXISTS idx_tags_values mapValues(tags) TYPE bloom_filter(0.01) GRANULARITY 1;
-- ============================================================================
-- Create additional indexes on frequently queried labels if needed, like:
-- ALTER TABLE timeseries_db.timeseries_tags_table
-- ADD INDEX IF NOT EXISTS idx_job tags['job'] TYPE bloom_filter(0.01) GRANULARITY 1;
-- ============================================================================
-- ============================================================================
-- METRICS METADATA TABLE
-- ============================================================================
CREATE TABLE IF NOT EXISTS timeseries_db.timeseries_metrics_table
(
`metric_family_name` String,
`type` String,
`unit` String,
`help` String
)
ENGINE = ReplacingMergeTree
ORDER BY metric_family_name;
-- ============================================================================
-- TIMESERIES ENGINE TABLES
-- ============================================================================
-- Using built-in min_time/max_time tracking with aggregation
CREATE TABLE IF NOT EXISTS timeseries_db.timeseries_raw
ENGINE = TimeSeries
SETTINGS
store_min_time_and_max_time = true, -- Store min/max time per series
filter_by_min_time_and_max_time = true, -- Use for query optimization
aggregate_min_time_and_max_time = true, -- Use SimpleAggregateFunction (default)
tags_to_columns = {}
DATA timeseries_db.timeseries_data_table
TAGS timeseries_db.timeseries_tags_table
METRICS timeseries_db.timeseries_metrics_table;
CREATE TABLE IF NOT EXISTS timeseries_db.timeseries_1m
ENGINE = TimeSeries
SETTINGS
store_min_time_and_max_time = false,
filter_by_min_time_and_max_time = false,
aggregate_min_time_and_max_time = true, -- Use SimpleAggregateFunction (default)
tags_to_columns = {}
DATA timeseries_db.timeseries_1m_table
TAGS timeseries_db.timeseries_tags_table
METRICS timeseries_db.timeseries_metrics_table;
CREATE TABLE IF NOT EXISTS timeseries_db.timeseries_5m
ENGINE = TimeSeries
SETTINGS
store_min_time_and_max_time = false,
filter_by_min_time_and_max_time = false,
aggregate_min_time_and_max_time = true, -- Use SimpleAggregateFunction (default)
tags_to_columns = {}
DATA timeseries_db.timeseries_5m_table
TAGS timeseries_db.timeseries_tags_table
METRICS timeseries_db.timeseries_metrics_table;
CREATE TABLE IF NOT EXISTS timeseries_db.timeseries_1h
ENGINE = TimeSeries
SETTINGS
store_min_time_and_max_time = false,
filter_by_min_time_and_max_time = false,
aggregate_min_time_and_max_time = true, -- Use SimpleAggregateFunction (default)
tags_to_columns = {}
DATA timeseries_db.timeseries_1h_table
TAGS timeseries_db.timeseries_tags_table
METRICS timeseries_db.timeseries_metrics_table;
-- ============================================================================
-- CARDINALITY TRACKING
-- ============================================================================
CREATE MATERIALIZED VIEW IF NOT EXISTS timeseries_db.mv_cardinality_stats
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(hour)
ORDER BY (metric_name, hour)
AS
SELECT
t.metric_name,
toStartOfHour(d.timestamp) AS hour,
uniqExact(d.id) AS unique_series,
count() AS data_points
FROM timeseries_db.timeseries_data_table d
INNER JOIN timeseries_db.timeseries_tags_table t ON d.id = t.id
GROUP BY t.metric_name, hour;
-- ============================================================================
-- HELPER VIEWS
-- ============================================================================
CREATE VIEW IF NOT EXISTS timeseries_db.v_series_list AS
SELECT
id,
metric_name,
tags
FROM timeseries_db.timeseries_tags_table;
ClickHouse protocol config
Next, add Prometheus protocol handlers for remote write, raw remote read, and each downsampled table:
<clickhouse>
<logger>
<level>debug</level>
<console>true</console>
</logger>
<compression>
<case>
<method>zstd</method>
</case>
</compression>
<disable_internal_dns_cache>true</disable_internal_dns_cache>
<prometheus>
<port>9363</port>
<handlers>
<prometheus_write_rule>
<url>/remote-write</url>
<handler>
<type>remote_write</type>
<database>timeseries_db</database>
<table>timeseries_raw</table>
</handler>
</prometheus_write_rule>
<prometheus_read_rule_raw>
<url>/remote-read/raw</url>
<handler>
<type>remote_read</type>
<database>timeseries_db</database>
<table>timeseries_raw</table>
</handler>
</prometheus_read_rule_raw>
<prometheus_read_rule_1m>
<url>/prom-read/1m</url>
<handler>
<type>remote_read</type>
<database>timeseries_db</database>
<table>timeseries_1m</table>
</handler>
</prometheus_read_rule_1m>
<prometheus_read_rule_5m>
<url>/prom-read/5m</url>
<handler>
<type>remote_read</type>
<database>timeseries_db</database>
<table>timeseries_5m</table>
</handler>
</prometheus_read_rule_5m>
<prometheus_read_rule_1h>
<url>/prom-read/1h</url>
<handler>
<type>remote_read</type>
<database>timeseries_db</database>
<table>timeseries_1h</table>
</handler>
</prometheus_read_rule_1h>
</handlers>
</prometheus>
</clickhouse>
Proxy implementation
The complete proxy lives at https://github.com/maratoid/tsdb-aggregate-proxy.
make build
bin/tsdb-aggregate-proxy --help
By default the proxy connects to ClickHouse via the native protocol and queries target data tables directly, translating remote read requests into SQL. Pass --no-query-bypass to use the ClickHouse Prometheus protocol endpoints instead.
With query bypass (the default), using the default address, table names, and thresholds:
bin/tsdb-aggregate-proxy --chi-password=p@ssw0rd
Without query bypass:
bin/tsdb-aggregate-proxy --no-query-bypass
Prometheus configuration
Configure Prometheus with zero local retention, remote read pointed at the proxy, and remote write pointed at ClickHouse:
global:
scrape_interval: 15s
# query_log_file: /prometheus/query.log
# scrape_failure_log_file: /prometheus/scrape-failures.log
scrape_configs:
- job_name: prom-self-monitor
static_configs:
- labels:
job: prom-self-monitor
targets:
- localhost:9090
remote_read:
- url: http://proxy:9091/api/v1/read
name: clickhouse-remote-read
remote_timeout: 10m
read_recent: true
filter_external_labels: false
basic_auth:
username: ingest
password: p@ssw0rd
enable_http2: true
remote_write:
- url: http://clickhouse:9363/remote-write
name: clickhouse-remote-write
remote_timeout: 10m
basic_auth:
username: ingest
password: p@ssw0rd
enable_http2: true
Example OpenTelemetry collector configuration
If you’re using an OpenTelemetry collector, the prometheusremotewrite exporter can push scraped metrics to the ClickHouse remote write endpoint:
...
exporters:
debug: {}
prometheusremotewrite/clickhouse:
endpoint: http://clickhouse:9363/remote-write
auth:
authenticator: basicauth/clickhouse
resource_to_telemetry_conversion:
enabled: true
send_metadata: true
timeout: 60s
tls:
insecure_skip_verify: true
...
extensions:
...
basicauth/clickhouse:
client_auth:
username: ingest
password: p@ssw0rd
...
service:
extensions:
...
- basicauth/clickhouse
pipelines:
metrics:
exporters:
- debug
- prometheusremotewrite/clickhouse
...
Test docker compose setup
The repository includes a docker compose stack that brings up:
- ClickHouse with the TimeSeries schema and Prometheus protocol handlers
- The routing proxy in `--query-bypass` mode
- Prometheus reading from the proxy
- Grafana provisioned with Prometheus as the default data source
- Avalanche generating continuous test metrics
- An OpenTelemetry collector scraping Avalanche, the proxy, and Prometheus, then writing to ClickHouse