ClickHouse and PromQL

TLDR

See the end result here

Background

ClickHouse makes an excellent metrics backend at scale: great compression, fast queries over large data sets, and a single store for metrics, logs, and traces.

Querying requires SQL, so with Grafana you’d typically use the ClickHouse data source plugin, or skip Grafana entirely with ClickStack.

But what if you already have many Grafana dashboards built for a Prometheus data source and still want to consolidate all observability signals in ClickHouse?

ClickHouse TimeSeries table engine

ClickHouse experimental feature

TimeSeries is an experimental feature and is not available in ClickHouse Cloud.

ClickHouse has an experimental TimeSeries table engine that speaks Prometheus remote write and remote read natively.

Creating one is straightforward:

CREATE TABLE timeseries_table ENGINE=TimeSeries
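Because the engine is experimental, recent ClickHouse versions reject this statement unless the experimental flag is enabled first (verify the setting name against your server version's documentation):

```sql
SET allow_experimental_time_series_table = 1;
```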

The table itself stores nothing directly — all data lives in its target tables. With the table in place, you can expose Prometheus remote read/write endpoints on your ClickHouse server:

<clickhouse>
  <prometheus>
    <port>9363</port>
    <handlers>
      <prometheus_write_rule>
        <url>/remote-write</url>
        <handler>
          <type>remote_write</type>
          <database>timeseries_db</database>
          <table>timeseries_table</table>
        </handler>
      </prometheus_write_rule>
      <prometheus_read_rule>
        <url>/remote-read</url>
        <handler>
          <type>remote_read</type>
          <database>timeseries_db</database>
          <table>timeseries_table</table>
        </handler>
      </prometheus_read_rule>
    </handlers>
  </prometheus>
</clickhouse>

Now <clickhouse ip>:9363/remote-write and <clickhouse ip>:9363/remote-read are valid Prometheus endpoints, targetable from Prometheus, Thanos, or an OpenTelemetry exporter.

For example:

    architecture-beta
    group cluster(cloud)[Cluster]

    service ch(database)[ClickHouse] in cluster
    service chr(internet)['ClickHouse /remote-read'] in cluster
    service chw(internet)['ClickHouse /remote-write'] in cluster
    service prom(server)[Prometheus] in cluster
    service otel(server)[Otel Metrics Gateway] in cluster
    service grafana(server)[Grafana] in cluster

    prom:T --> B:chr
    grafana:T <--> B:prom
    otel:T --> B:chw
    chr:R <-- L:ch 
    chw:L --> R:ch
  • ClickHouse exposes /remote-read and /remote-write
  • An OpenTelemetry collector uses the prometheusremotewrite exporter to push metrics to /remote-write
  • A lightweight, zero-retention Prometheus instance reads from /remote-read
  • Grafana queries that Prometheus instance as its data source

Problems

This works for smaller data sets, but PromQL queries computing rates or averages over high-cardinality metrics across 24+ hour windows will quickly start timing out.

Schema tuning helps at the margins, but the core issue is that the TimeSeries engine does no downsampling. Unlike Thanos, it returns raw, full-resolution samples for every query regardless of the time range, far too many points for Grafana to handle.

The fix: cascade incoming data into three downsampled bucket tables on write:

  • 1-minute resolution (high)
  • 5-minute resolution (medium)
  • 1-hour resolution (low)

Then serve data from the appropriate table based on the query’s time range:

  • 0–24 h → raw data table
  • 24–48 h → high res table (1 min)
  • 48–120 h → medium res table (5 min)
  • 120 h+ → low res table (1 hour)
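The routing rule is simple age-based bucketing on how far back the query reaches. A minimal sketch in Python (the thresholds mirror the table above; the function and constant names are illustrative, not the proxy's actual code):

```python
from datetime import datetime, timedelta

# Hypothetical thresholds matching the routing table above.
ROUTES = [
    (timedelta(hours=24), "raw"),   # 0-24 h: raw data table
    (timedelta(hours=48), "1m"),    # 24-48 h: high res table
    (timedelta(hours=120), "5m"),   # 48-120 h: medium res table
]

def pick_table(start: datetime, now: datetime) -> str:
    """Pick the resolution table based on the query's oldest timestamp."""
    age = now - start
    for threshold, table in ROUTES:
        if age <= threshold:
            return table
    return "1h"  # anything reaching past 120 h gets the low-res table
```

A query whose window starts 30 hours ago would therefore be served from the 1-minute table, and a week-long window from the 1-hour table.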

Prometheus remote read configuration has no routing — multiple endpoints just merge results. So we need a proxy that inspects the time range and forwards each remote read request to the right ClickHouse endpoint:

    architecture-beta
    group cluster(cloud)[Cluster]

    service ch(database)[ClickHouse] in cluster
    service chr(internet)['ClickHouse /rm-read/raw'] in cluster
    service chr1m(internet)['ClickHouse /rm-read/1m'] in cluster
    service chr5m(internet)['ClickHouse /rm-read/5m'] in cluster
    service chr1h(internet)['ClickHouse /rm-read/1h'] in cluster
    service chw(internet)['ClickHouse /rm-write'] in cluster
    service prom(server)[Prometheus] in cluster
    service otel(server)[Otel Metrics Gateway] in cluster
    service grafana(server)[Grafana] in cluster
    service proxy(server)[Routing proxy] in cluster
    junction junctionChCenter in cluster
    junction junctionChLeft in cluster
    junction junctionChLeftMost in cluster
    junction junctionChRight in cluster
    junction junctionChRightMost in cluster
    junction junctionProxyCenter in cluster
    junction junctionProxyLeft in cluster
    junction junctionProxyLeftMost in cluster
    junction junctionProxyRight in cluster
    junction junctionProxyRightMost in cluster
    ch:B <-- T:junctionChCenter
    junctionChCenter:L -- R:junctionChLeft
    junctionChLeftMost:R -- L:junctionChLeft
    junctionChCenter:R -- L:junctionChRight
    junctionChRightMost:L -- R:junctionChRight
    junctionChCenter:B --> T:chr1m
    junctionChLeftMost:B --> T:chw
    junctionChLeft:B --> T:chr
    junctionChRight:B --> T:chr5m
    junctionChRightMost:B --> T:chr1h
    proxy:T -- B:junctionProxyCenter
    junctionProxyLeftMost:R -- L:junctionProxyLeft
    junctionProxyLeft:R -- L:junctionProxyCenter
    junctionProxyRight:L -- R:junctionProxyCenter
    junctionProxyRightMost:L -- R:junctionProxyRight
    junctionProxyLeftMost:T --> B:chr
    junctionProxyLeft:T --> B:chr1m
    junctionProxyRight:T --> B:chr5m
    junctionProxyRightMost:T --> B:chr1h
    prom:T --> B:proxy
    grafana:T --> B:prom
    otel:T --> B:chw

Implementation

Database schema

The schema defines four TimeSeries engine tables:

  • timeseries_raw — raw metric data
  • timeseries_1m — downsampled to 1 min
  • timeseries_5m — downsampled to 5 min
  • timeseries_1h — downsampled to 1 hour

TimeSeries tables don’t store data themselves — samples and labels live in target tables. These are created automatically by default, but here we define them explicitly:

  • timeseries_data_table — target data table for timeseries_raw
  • timeseries_1m_table — target data table for timeseries_1m
  • timeseries_5m_table — target data table for timeseries_5m
  • timeseries_1h_table — target data table for timeseries_1h

When samples arrive at timeseries_raw via /remote-write, materialized views cascade the data through progressively coarser buckets:

  1. timeseries_data_table → timeseries_1m_table (1 min aggregation)
  2. timeseries_1m_table → timeseries_5m_table (5 min aggregation)
  3. timeseries_5m_table → timeseries_1h_table (1 hour aggregation)
-- ============================================================================
-- ClickHouse TimeSeries Schema for Prometheus
-- ============================================================================
CREATE DATABASE IF NOT EXISTS timeseries_db;

-- ============================================================================
-- RAW DATA TABLE
-- ============================================================================
CREATE TABLE IF NOT EXISTS timeseries_db.timeseries_data_table
(
    `id` UUID,
    `timestamp` DateTime64(3) CODEC(Delta, ZSTD(1)),
    `value` Float64 CODEC(Gorilla, ZSTD(1))
)
ENGINE = ReplacingMergeTree(timestamp)
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (id, timestamp)
TTL timestamp + INTERVAL 14 DAY DELETE
SETTINGS 
    index_granularity = 8192,
    compress_marks = true,
    compress_primary_key = true,
    -- Optimized compression for time series data
    min_compress_block_size = 65536,
    max_compress_block_size = 1048576;

-- ============================================================================
-- 1-MINUTE AGGREGATION TABLE
-- ============================================================================
CREATE TABLE IF NOT EXISTS timeseries_db.timeseries_1m_table
(
    `id` UUID,
    `timestamp` DateTime64(3) CODEC(Delta, ZSTD(1)),
    `sum_val` Float64 CODEC(Gorilla, ZSTD(1)),
    `cnt` UInt64 CODEC(T64, ZSTD(1)),
    `min_val` Float64 CODEC(Gorilla, ZSTD(1)),
    `max_val` Float64 CODEC(Gorilla, ZSTD(1)),
    `value` Float64 ALIAS (sum_val / NULLIF(cnt, 0))
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(timestamp)
ORDER BY (id, timestamp)
TTL timestamp + INTERVAL 90 DAY DELETE
SETTINGS 
    index_granularity = 8192,
    compress_marks = true,
    compress_primary_key = true,
    min_compress_block_size = 65536,
    max_compress_block_size = 1048576;

-- Materialized view: Downsample from raw data to 1-minute buckets
CREATE MATERIALIZED VIEW IF NOT EXISTS timeseries_db.mv_downsample_1m
TO timeseries_db.timeseries_1m_table
AS
SELECT
    id,
    toStartOfMinute(timestamp) AS timestamp,
    sum(value) AS sum_val,
    count() AS cnt,
    min(value) AS min_val,
    max(value) AS max_val
FROM timeseries_db.timeseries_data_table
GROUP BY id, timestamp;

-- ============================================================================
-- 5-MINUTE AGGREGATION TABLE
-- ============================================================================
CREATE TABLE IF NOT EXISTS timeseries_db.timeseries_5m_table
(
    `id` UUID,
    `timestamp` DateTime64(3) CODEC(Delta, ZSTD(1)),
    `sum_val` Float64 CODEC(Gorilla, ZSTD(1)),
    `cnt` UInt64 CODEC(T64, ZSTD(1)),
    `min_val` Float64 CODEC(Gorilla, ZSTD(1)),
    `max_val` Float64 CODEC(Gorilla, ZSTD(1)),
    `value` Float64 ALIAS (sum_val / NULLIF(cnt, 0))
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(timestamp)
ORDER BY (id, timestamp)
TTL timestamp + INTERVAL 1 YEAR DELETE
SETTINGS 
    index_granularity = 8192,
    compress_marks = true,
    compress_primary_key = true,
    min_compress_block_size = 65536,
    max_compress_block_size = 1048576;

-- Materialized view: CASCADE from 1m table
CREATE MATERIALIZED VIEW IF NOT EXISTS timeseries_db.mv_downsample_5m
TO timeseries_db.timeseries_5m_table
AS
SELECT
    id,
    toStartOfFiveMinutes(timestamp) AS timestamp,
    sum(sum_val) AS sum_val,
    sum(cnt) AS cnt,
    min(min_val) AS min_val,
    max(max_val) AS max_val
FROM timeseries_db.timeseries_1m_table
GROUP BY id, timestamp;

-- ============================================================================
-- 1-HOUR AGGREGATION TABLE
-- ============================================================================
CREATE TABLE IF NOT EXISTS timeseries_db.timeseries_1h_table
(
    `id` UUID,
    `timestamp` DateTime64(3) CODEC(Delta, ZSTD(1)),
    `sum_val` Float64 CODEC(Gorilla, ZSTD(1)),
    `cnt` UInt64 CODEC(T64, ZSTD(1)),
    `min_val` Float64 CODEC(Gorilla, ZSTD(1)),
    `max_val` Float64 CODEC(Gorilla, ZSTD(1)),
    `value` Float64 ALIAS (sum_val / NULLIF(cnt, 0))
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(timestamp)
ORDER BY (id, timestamp)
TTL timestamp + INTERVAL 5 YEAR DELETE
SETTINGS 
    index_granularity = 8192,
    compress_marks = true,
    compress_primary_key = true,
    min_compress_block_size = 65536,
    max_compress_block_size = 1048576;

-- Materialized view: CASCADE from 5m table
CREATE MATERIALIZED VIEW IF NOT EXISTS timeseries_db.mv_downsample_1h
TO timeseries_db.timeseries_1h_table
AS
SELECT
    id,
    toStartOfHour(timestamp) AS timestamp,
    sum(sum_val) AS sum_val,
    sum(cnt) AS cnt,
    min(min_val) AS min_val,
    max(max_val) AS max_val
FROM timeseries_db.timeseries_5m_table
GROUP BY id, timestamp;

-- ============================================================================
-- TAGS/LABELS TABLE
-- ============================================================================
-- Engine: AggregatingMergeTree is required because:
-- 1. min_time and max_time use SimpleAggregateFunction type
-- 2. This is required by aggregate_min_time_and_max_time = true (default)
-- 3. TimeSeries engine will write aggregating inserts to update these columns
CREATE TABLE IF NOT EXISTS timeseries_db.timeseries_tags_table
(
    `id` UUID DEFAULT reinterpretAsUUID(sipHash128(metric_name, all_tags)),
    `metric_name` LowCardinality(String),
    -- tags Map
    `tags` Map(LowCardinality(String), String),
    -- all_tags Map WILL contain all labels (EPHEMERAL - only for id calculation)
    `all_tags` Map(String, String) EPHEMERAL,
    `labels_hash` UInt64 MATERIALIZED cityHash64(
        arrayStringConcat(
            arraySort(arrayMap((k,v) -> concat(k, '=', v), mapKeys(tags), mapValues(tags))),
            ','
        )
    ),
    
    -- Time range columns - automatically maintained by TimeSeries engine
    -- Using SimpleAggregateFunction as required by aggregate_min_time_and_max_time setting
    `min_time` SimpleAggregateFunction(min, Nullable(DateTime64(3))),
    `max_time` SimpleAggregateFunction(max, Nullable(DateTime64(3)))
)
ENGINE = AggregatingMergeTree
ORDER BY (metric_name, id)
SETTINGS 
    index_granularity = 8192,
    allow_nullable_key = 1;

-- Indexes
ALTER TABLE timeseries_db.timeseries_tags_table
    ADD INDEX IF NOT EXISTS idx_id id TYPE bloom_filter(0.01) GRANULARITY 4;

ALTER TABLE timeseries_db.timeseries_tags_table
    ADD INDEX IF NOT EXISTS idx_labels_hash labels_hash TYPE set(1000000) GRANULARITY 4;

ALTER TABLE timeseries_db.timeseries_tags_table
    ADD INDEX IF NOT EXISTS idx_metric_name metric_name TYPE bloom_filter(0.001) GRANULARITY 1;

ALTER TABLE timeseries_db.timeseries_tags_table
    ADD INDEX IF NOT EXISTS idx_tags_keys mapKeys(tags) TYPE bloom_filter(0.01) GRANULARITY 1;

ALTER TABLE timeseries_db.timeseries_tags_table
    ADD INDEX IF NOT EXISTS idx_tags_values mapValues(tags) TYPE bloom_filter(0.01) GRANULARITY 1;

-- ============================================================================
-- Create additional indexes on frequently queried labels if needed, like:
-- ALTER TABLE timeseries_db.timeseries_tags_table
--    ADD INDEX IF NOT EXISTS idx_job tags['job'] TYPE bloom_filter(0.01) GRANULARITY 1;
-- ============================================================================

-- ============================================================================
-- METRICS METADATA TABLE
-- ============================================================================
CREATE TABLE IF NOT EXISTS timeseries_db.timeseries_metrics_table
(
    `metric_family_name` String,
    `type` String,
    `unit` String,
    `help` String
)
ENGINE = ReplacingMergeTree
ORDER BY metric_family_name;

-- ============================================================================
-- TIMESERIES ENGINE TABLES
-- ============================================================================
-- Using built-in min_time/max_time tracking with aggregation
CREATE TABLE IF NOT EXISTS timeseries_db.timeseries_raw
ENGINE = TimeSeries
SETTINGS 
    store_min_time_and_max_time = true,           -- Store min/max time per series
    filter_by_min_time_and_max_time = true,       -- Use for query optimization
    aggregate_min_time_and_max_time = true,       -- Use SimpleAggregateFunction (default)
    tags_to_columns = {}
DATA timeseries_db.timeseries_data_table
TAGS timeseries_db.timeseries_tags_table
METRICS timeseries_db.timeseries_metrics_table;

CREATE TABLE IF NOT EXISTS timeseries_db.timeseries_1m
ENGINE = TimeSeries
SETTINGS 
    store_min_time_and_max_time = false,
    filter_by_min_time_and_max_time = false,
    aggregate_min_time_and_max_time = true,       -- Use SimpleAggregateFunction (default)
    tags_to_columns = {}
DATA timeseries_db.timeseries_1m_table
TAGS timeseries_db.timeseries_tags_table
METRICS timeseries_db.timeseries_metrics_table;

CREATE TABLE IF NOT EXISTS timeseries_db.timeseries_5m
ENGINE = TimeSeries
SETTINGS 
    store_min_time_and_max_time = false,
    filter_by_min_time_and_max_time = false,
    aggregate_min_time_and_max_time = true,       -- Use SimpleAggregateFunction (default)
    tags_to_columns = {}
DATA timeseries_db.timeseries_5m_table
TAGS timeseries_db.timeseries_tags_table
METRICS timeseries_db.timeseries_metrics_table;

CREATE TABLE IF NOT EXISTS timeseries_db.timeseries_1h
ENGINE = TimeSeries
SETTINGS 
    store_min_time_and_max_time = false,
    filter_by_min_time_and_max_time = false,
    aggregate_min_time_and_max_time = true,       -- Use SimpleAggregateFunction (default)
    tags_to_columns = {}
DATA timeseries_db.timeseries_1h_table
TAGS timeseries_db.timeseries_tags_table
METRICS timeseries_db.timeseries_metrics_table;

-- ============================================================================
-- CARDINALITY TRACKING
-- ============================================================================
CREATE MATERIALIZED VIEW IF NOT EXISTS timeseries_db.mv_cardinality_stats
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(hour)
ORDER BY (metric_name, hour)
AS
SELECT
    t.metric_name,
    toStartOfHour(d.timestamp) AS hour,
    uniqExact(d.id) AS unique_series,
    count() AS data_points
FROM timeseries_db.timeseries_data_table d
INNER JOIN timeseries_db.timeseries_tags_table t ON d.id = t.id
GROUP BY t.metric_name, hour;

-- ============================================================================
-- HELPER VIEWS
-- ============================================================================
CREATE VIEW IF NOT EXISTS timeseries_db.v_series_list AS
SELECT
    id, 
    metric_name, 
    tags
FROM timeseries_db.timeseries_tags_table;
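
One detail worth calling out in the schema above: each downsampled table stores `sum_val` and `cnt` and derives `value` as their ratio, rather than storing a precomputed average. That is what keeps the cascade exact: summing sums and counts yields the true mean at every resolution, while averaging per-bucket averages would not. A quick Python illustration with made-up samples:

```python
# Raw samples falling into one 5-minute window, split unevenly
# across five 1-minute buckets.
minute_buckets = [[1.0, 2.0, 3.0], [10.0], [4.0, 4.0], [7.0, 8.0, 9.0], [5.0]]

# Cascade as the materialized views do: carry (sum, count) upward.
sums = [sum(b) for b in minute_buckets]
cnts = [len(b) for b in minute_buckets]
cascaded_mean = sum(sums) / sum(cnts)

# Naive alternative: average the per-minute averages.
naive_mean = sum(s / c for s, c in zip(sums, cnts)) / len(minute_buckets)

all_samples = [x for b in minute_buckets for x in b]
true_mean = sum(all_samples) / len(all_samples)

assert cascaded_mean == true_mean   # exact at every resolution
assert naive_mean != true_mean      # biased when buckets are uneven
```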

ClickHouse protocol config

Next, add Prometheus protocol handlers for remote write, raw remote read, and each downsampled table:

<clickhouse>
    <logger>
        <level>debug</level>
        <console>true</console>
    </logger>
    <compression>
        <case>
            <method>zstd</method>
        </case>
    </compression>
    <disable_internal_dns_cache>true</disable_internal_dns_cache>
    
    <prometheus>
        <port>9363</port>
        <handlers>
            <prometheus_write_rule>
                <url>/remote-write</url>
                <handler>
                    <type>remote_write</type>
                    <database>timeseries_db</database>
                    <table>timeseries_raw</table>
                </handler>
            </prometheus_write_rule>
            <prometheus_read_rule_raw>
                <url>/remote-read/raw</url>
                <handler>
                    <type>remote_read</type>
                    <database>timeseries_db</database>
                    <table>timeseries_raw</table>
                </handler>
            </prometheus_read_rule_raw>
            <prometheus_read_rule_1m>
                <url>/prom-read/1m</url>
                <handler>
                    <type>remote_read</type>
                    <database>timeseries_db</database>
                    <table>timeseries_1m</table>
                </handler>
            </prometheus_read_rule_1m>
            <prometheus_read_rule_5m>
                <url>/prom-read/5m</url>
                <handler>
                    <type>remote_read</type>
                    <database>timeseries_db</database>
                    <table>timeseries_5m</table>
                </handler>
            </prometheus_read_rule_5m>
            <prometheus_read_rule_1h>
                <url>/prom-read/1h</url>
                <handler>
                    <type>remote_read</type>
                    <database>timeseries_db</database>
                    <table>timeseries_1h</table>
                </handler>
            </prometheus_read_rule_1h>
        </handlers>
    </prometheus>
</clickhouse>

Proxy implementation

The complete proxy lives at https://github.com/maratoid/tsdb-aggregate-proxy.

make build
bin/tsdb-aggregate-proxy --help

By default the proxy connects to ClickHouse via the native protocol and queries target data tables directly, translating remote read requests into SQL. Pass --no-query-bypass to use the ClickHouse Prometheus protocol endpoints instead.
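In bypass mode, a remote read request boils down to a SELECT against the chosen data table joined with the tags table. A hypothetical sketch of that SQL shape (the real proxy's queries are its own; this only illustrates the idea):

```python
# Hypothetical sketch of the SQL a remote read could translate to in
# bypass mode -- not the proxy's actual query.
def remote_read_sql(table: str, metric: str, start_ms: int, end_ms: int) -> str:
    # The `value` ALIAS on the downsampled tables (sum_val / cnt) means
    # the same SELECT shape works against raw and aggregated tables alike.
    return (
        f"SELECT d.id, d.timestamp, d.value "
        f"FROM timeseries_db.{table} AS d "
        f"INNER JOIN timeseries_db.timeseries_tags_table AS t ON d.id = t.id "
        f"WHERE t.metric_name = '{metric}' "
        f"AND d.timestamp >= fromUnixTimestamp64Milli({start_ms}) "
        f"AND d.timestamp <= fromUnixTimestamp64Milli({end_ms})"
    )
```

The `value` ALIAS in the schema is what makes this uniform: one query shape covers all four resolutions, with only the table name swapped by the routing decision.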

With query bypass (default), using default address, table names, and thresholds:

bin/tsdb-aggregate-proxy --chi-password=p@ssw0rd

Without query bypass:

bin/tsdb-aggregate-proxy --no-query-bypass

Prometheus configuration

Configure Prometheus with zero local retention, remote read pointed at the proxy, and remote write pointed at ClickHouse:

global:
  scrape_interval: 15s
  # query_log_file: /prometheus/query.log
  # scrape_failure_log_file: /prometheus/scrape-failures.log
scrape_configs:
  - job_name: prom-self-monitor
    static_configs:
      - labels: 
          job: prom-self-monitor
        targets:
          - localhost:9090
remote_read:
  - url: http://proxy:9091/api/v1/read
    name: clickhouse-remote-read
    remote_timeout: 10m
    read_recent: true
    filter_external_labels: false
    basic_auth:
      username: ingest
      password: p@ssw0rd
    enable_http2: true
remote_write:
  - url: http://clickhouse:9363/remote-write
    name: clickhouse-remote-write
    remote_timeout: 10m
    basic_auth:
      username: ingest
      password: p@ssw0rd
    enable_http2: true

Example OpenTelemetry collector configuration

If you’re using an OpenTelemetry collector, the prometheusremotewrite exporter can push scraped metrics to the ClickHouse remote write endpoint:

...
exporters:
  debug: {}
  prometheusremotewrite/clickhouse:
    endpoint: http://clickhouse:9363/remote-write
    auth:
      authenticator: basicauth/clickhouse
    resource_to_telemetry_conversion:
      enabled: true
    send_metadata: true
    timeout: 60s
    tls:
      insecure_skip_verify: true
...
extensions:
  ...
  basicauth/clickhouse:
    client_auth:
      username: ingest
      password: p@ssw0rd
...
service:
  extensions:
    ...
    - basicauth/clickhouse
  pipelines:
    metrics:
      exporters:
        - debug
        - prometheusremotewrite/clickhouse
...

Test docker compose setup

The repository includes a docker compose stack that brings up:

  • ClickHouse with the TimeSeries schema and Prometheus protocol handlers
  • The routing proxy in its default query-bypass mode
  • Prometheus reading from the proxy
  • Grafana provisioned with Prometheus as the default data source
  • Avalanche generating continuous test metrics
  • An OpenTelemetry collector scraping Avalanche, the proxy, and Prometheus, then writing to ClickHouse