Designing a storage system for parsed data (TimescaleDB, ClickHouse)
Raw data from blockchains or exchanges accumulates fast: tens of gigabytes per day for actively parsed sources. Storing it in a single table in vanilla PostgreSQL means query degradation within months. Choosing between TimescaleDB and ClickHouse is choosing between two fundamentally different storage models and query patterns; both technologies are correct in their own context.
TimescaleDB vs ClickHouse: when to use what
TimescaleDB is a PostgreSQL extension. It adds hypertables (automatic time-based partitioning), continuous aggregates (incrementally maintained materialized views), and compression. You stay in the PostgreSQL ecosystem: standard SQL, ACID transactions, JOINs with regular tables, familiar tooling.
ClickHouse is a columnar OLAP database. Data is stored by column, which gives a huge advantage for aggregations over a subset of columns: GROUP BY and SUM over billions of rows run an order of magnitude faster than in PostgreSQL. Weaknesses: no transactions, UPDATE/DELETE are expensive, and JOINs work differently.
| Criterion | TimescaleDB | ClickHouse |
|---|---|---|
| Query pattern | Complex JOINs, mixed OLTP+OLAP | Analytics, aggregations over large ranges |
| Writes | INSERTs in transactions, UPSERT | Batch inserts, eventual deduplication |
| Point reads | Fast (B-tree indexes) | Slower (no efficient point indexes) |
| Analytics | Good | Much faster |
| Updates | Standard UPDATE | Expensive (ReplacingMergeTree) |
| Operational complexity | Moderate | Higher |
| Data volume | Efficient up to ~1 TB | Efficient from 100 GB+ |
Recommendation for on-chain data parsing:
- TimescaleDB: choose it when the data feeds product logic (balances, positions, accounts), joins against relational data, and needs ACID guarantees
- ClickHouse: choose it for an analytics pipeline (trading signals, aggregated stats, historical analysis) where queries scan large date ranges
Production setups often combine both: TimescaleDB for hot, operational data plus ClickHouse as the analytics warehouse.
TimescaleDB architecture
Hypertables
Basic concept: a regular PostgreSQL table becomes a hypertable; under the hood it is split into chunks (partitions) along the time dimension. Each chunk is a separate internal table, and old chunks can be compressed or archived.
```sql
-- Create table
CREATE TABLE trades (
    time     TIMESTAMPTZ NOT NULL,
    exchange TEXT NOT NULL,
    symbol   TEXT NOT NULL,
    price    NUMERIC(20, 8) NOT NULL,
    volume   NUMERIC(20, 8) NOT NULL,
    side     CHAR(4) NOT NULL  -- 'buy' | 'sell'
);

-- Convert to a hypertable with 1-day chunks
SELECT create_hypertable('trades', 'time', chunk_time_interval => INTERVAL '1 day');

-- Indexes are per chunk (TimescaleDB creates a time index automatically)
CREATE INDEX ON trades (symbol, time DESC);
```
Chunk size is a key parameter. Rule of thumb: the most recent chunk, including its indexes, should fit in memory for fast bulk inserts (usually 1-7 days of exchange data).
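A back-of-envelope way to pick `chunk_time_interval`, sketched in Python. The ingest rate, row size, and memory budget below are illustrative assumptions, not measurements:

```python
# Back-of-envelope chunk sizing: pick a chunk_time_interval so that the
# most recent chunk (data + indexes) stays within the memory budget.
# All figures here are illustrative assumptions, not measurements.

def suggest_chunk_days(rows_per_day: int, bytes_per_row: int,
                       memory_budget_bytes: int,
                       index_overhead: float = 0.5) -> int:
    """Largest whole number of days whose chunk fits the budget (min 1 day)."""
    chunk_bytes_per_day = rows_per_day * bytes_per_row * (1 + index_overhead)
    return max(1, int(memory_budget_bytes // chunk_bytes_per_day))

# Example: 50M trades/day, ~100 bytes/row, 16 GB budget for the hot chunk
days = suggest_chunk_days(50_000_000, 100, 16 * 1024**3)
print(days)  # -> 2
```

If the suggested interval comes out below one day, that is a hint to shard by an additional dimension or shrink the per-row footprint rather than fight oversized chunks.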
Continuous Aggregates
Replace expensive real-time GROUP BY queries with incrementally maintained materialized views:
```sql
-- OHLCV aggregate in 1-minute buckets
CREATE MATERIALIZED VIEW trades_1m
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('1 minute', time) AS bucket,
    symbol,
    exchange,
    first(price, time) AS open,
    max(price)         AS high,
    min(price)         AS low,
    last(price, time)  AS close,
    sum(volume)        AS volume,
    count(*)           AS trade_count
FROM trades
GROUP BY bucket, symbol, exchange
WITH NO DATA;

-- Refresh policy: every minute, re-materialize the window from
-- 1 hour ago up to 1 minute ago (the refresh window must span
-- at least two buckets)
SELECT add_continuous_aggregate_policy('trades_1m',
    start_offset      => INTERVAL '1 hour',
    end_offset        => INTERVAL '1 minute',
    schedule_interval => INTERVAL '1 minute'
);
```
Now `SELECT * FROM trades_1m WHERE bucket > NOW() - INTERVAL '1 day'` reads from the materialized view instead of aggregating raw data.
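What the continuous aggregate computes can be mirrored in plain Python. This is a minimal sketch of the bucketing and first/last semantics, not how TimescaleDB implements it:

```python
# Mirror of the continuous aggregate's logic: bucket trades by minute
# and derive OHLCV per bucket. Illustrative only; the sample trades
# below are made up.
from collections import defaultdict

def ohlcv_1m(trades):
    """trades: list of (epoch_seconds, price, volume) tuples."""
    buckets = defaultdict(list)
    for ts, price, vol in trades:
        buckets[ts - ts % 60].append((ts, price, vol))  # time_bucket('1 minute')
    out = {}
    for bucket, rows in buckets.items():
        rows.sort()  # order by time within the bucket
        prices = [p for _, p, _ in rows]
        out[bucket] = {
            "open": rows[0][1], "close": rows[-1][1],   # first/last by time
            "high": max(prices), "low": min(prices),
            "volume": sum(v for _, _, v in rows),
            "trade_count": len(rows),
        }
    return out

bars = ohlcv_1m([(60, 10.0, 1.0), (75, 12.0, 2.0), (119, 11.0, 0.5)])
print(bars[60])  # one 1-minute bar covering ts 60..119
```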
Compression
Old data can be compressed with minimal loss of functionality (UPDATE/DELETE on compressed chunks are restricted):
```sql
-- Enable compression, segmented by symbol and ordered by time
-- (optimal for our query patterns)
ALTER TABLE trades SET (
    timescaledb.compress,
    timescaledb.compress_orderby   = 'time DESC',
    timescaledb.compress_segmentby = 'symbol'
);

-- Auto-compress chunks older than 7 days
SELECT add_compression_policy('trades', INTERVAL '7 days');
```
Typical compression ratio for exchange data is 10-20x: 100 GB raw becomes 5-10 GB compressed.
Retention policy
```sql
-- Auto-delete data older than 2 years
SELECT add_retention_policy('trades', INTERVAL '2 years');

-- Or tier old chunks to cheap object storage (Timescale Cloud);
-- self-hosted setups can move chunks to an HDD tablespace instead
SELECT add_tiering_policy('trades', INTERVAL '30 days');
```
ClickHouse architecture
MergeTree engines
Engine choice is critical. For parsed data the usual candidates are:
MergeTree is the base engine, with no special merge-time behavior:
```sql
CREATE TABLE trades
(
    time     DateTime64(3),
    exchange LowCardinality(String),
    symbol   LowCardinality(String),
    price    Decimal(20, 8),
    volume   Decimal(20, 8),
    side     Enum8('buy' = 1, 'sell' = 2)
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(time)
ORDER BY (symbol, exchange, time);
```
In ClickHouse, ORDER BY is simultaneously the primary key (a sparse index) and the physical sort order on disk. Choose it from your query patterns: if most queries filter by (symbol, time), that is exactly the ORDER BY to use.
ReplacingMergeTree deduplicates rows at merge time, which is what you need for reliable re-processing:
```sql
CREATE TABLE balances
(
    time         DateTime64(3),
    address      String,
    token        LowCardinality(String),
    balance      Decimal(38, 18),
    block_number UInt64,
    _version     UInt64  -- usually block_number or timestamp, used for dedup
)
ENGINE = ReplacingMergeTree(_version)
PARTITION BY toYYYYMM(time)
ORDER BY (address, token, time);
```
On duplicate inserts (same sorting key), the row with the maximum `_version` survives. Deduplication happens during background merges, not immediately; for exact results before a merge, query with the `FINAL` modifier.
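The merge semantics can be sketched in plain Python: for each sorting-key tuple, only the row with the highest `_version` survives. This models the observable behavior, not ClickHouse internals:

```python
# Model of ReplacingMergeTree(_version) merge behavior: for every
# (address, token, time) key, keep only the row with the max _version.
# Sample rows are made up for illustration.
def replacing_merge(rows):
    """rows: list of dicts with 'address', 'token', 'time', '_version'."""
    latest = {}
    for r in rows:
        key = (r["address"], r["token"], r["time"])
        if key not in latest or r["_version"] > latest[key]["_version"]:
            latest[key] = r
    return list(latest.values())

rows = [
    {"address": "0xabc", "token": "USDC", "time": 1700000000, "balance": "10", "_version": 100},
    {"address": "0xabc", "token": "USDC", "time": 1700000000, "balance": "12", "_version": 105},
]
merged = replacing_merge(rows)
print(merged)  # only the _version=105 row remains
```

Until a merge actually runs, both rows exist on disk; `SELECT ... FINAL` applies this same rule at query time, at a cost.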
SummingMergeTree collapses rows into additive aggregates (volumes, counts):
```sql
CREATE TABLE volume_by_hour
(
    hour   DateTime,
    symbol LowCardinality(String),
    volume Decimal(20, 8),
    count  UInt64
)
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(hour)
ORDER BY (symbol, hour);
```
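The summing behavior at merge time can be modeled the same way: rows sharing an ORDER BY key have their numeric columns summed. A sketch with made-up data:

```python
# Model of SummingMergeTree merge behavior: rows with the same
# (symbol, hour) key are collapsed, numeric columns summed.
def summing_merge(rows):
    """rows: list of (symbol, hour, volume, count) tuples."""
    acc = {}
    for symbol, hour, volume, count in rows:
        v, c = acc.get((symbol, hour), (0.0, 0))
        acc[(symbol, hour)] = (v + volume, c + count)
    return [(s, h, v, c) for (s, h), (v, c) in sorted(acc.items())]

hourly = summing_merge([
    ("BTCUSDT", 100, 1.5, 10),
    ("BTCUSDT", 100, 0.5, 4),
    ("ETHUSDT", 100, 2.0, 7),
])
print(hourly)  # -> [('BTCUSDT', 100, 2.0, 14), ('ETHUSDT', 100, 2.0, 7)]
```

As with ReplacingMergeTree, the collapse is eventual: queries that need exact totals before merges complete should re-aggregate with GROUP BY.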
Batch insert
ClickHouse is optimized for batch inserts; frequent small inserts are an anti-pattern:
```python
import clickhouse_connect

client = clickhouse_connect.get_client(host='localhost')

COLUMNS = ['time', 'exchange', 'symbol', 'price', 'volume', 'side']

def flush_buffer(rows: list[dict]) -> None:
    """Insert accumulated rows in one batch; skip batches that are too small."""
    if len(rows) < 1000:
        return  # keep accumulating
    client.insert(
        'trades',
        [[r[c] for c in COLUMNS] for r in rows],
        column_names=COLUMNS,
    )
    rows.clear()
```
Minimum recommended batch: 1000 rows. Optimal: 10,000-100,000 rows. Frequency: no more than 1-2 inserts per second per table.
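A common pattern on top of this is a buffer that flushes on either a size or a time threshold, whichever fires first. A minimal sketch, with `insert_fn` standing in for a real client call such as `client.insert`:

```python
# Hedged sketch: flush when the buffer reaches batch_size rows OR when
# flush_interval seconds have elapsed since the last flush. `insert_fn`
# is a placeholder for the actual ClickHouse insert call.
import time

class BatchBuffer:
    def __init__(self, insert_fn, batch_size=10_000, flush_interval=1.0):
        self.insert_fn = insert_fn
        self.batch_size = batch_size
        self.flush_interval = flush_interval
        self.rows = []
        self.last_flush = time.monotonic()

    def add(self, row):
        self.rows.append(row)
        if (len(self.rows) >= self.batch_size
                or time.monotonic() - self.last_flush >= self.flush_interval):
            self.flush()

    def flush(self):
        if self.rows:
            self.insert_fn(self.rows)  # hand off the whole batch at once
            self.rows = []
        self.last_flush = time.monotonic()

sent = []
buf = BatchBuffer(sent.append, batch_size=3, flush_interval=60)
for i in range(7):
    buf.add((i, "binance", "BTCUSDT"))
buf.flush()  # drain the tail on shutdown
print([len(batch) for batch in sent])  # -> [3, 3, 1]
```

The explicit final `flush()` matters: without it, rows below the size threshold sit in memory until the next time-based flush or are lost on shutdown.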
Materialized Views in ClickHouse
ClickHouse materialized views are trigger-based: they fire on every insert into the source table (not on a schedule, as in TimescaleDB):
```sql
-- Target table the view writes into
CREATE TABLE trades_1h
(
    hour     DateTime,
    symbol   LowCardinality(String),
    exchange LowCardinality(String),
    open     Decimal(20, 8),
    high     Decimal(20, 8),
    low      Decimal(20, 8),
    close    Decimal(20, 8),
    volume   Decimal(20, 8)
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(hour)
ORDER BY (symbol, exchange, hour);

-- Materialized view for OHLCV aggregation
CREATE MATERIALIZED VIEW trades_1h_mv
TO trades_1h
AS
SELECT
    toStartOfHour(time) AS hour,
    symbol,
    exchange,
    argMin(price, time) AS open,
    max(price)          AS high,
    min(price)          AS low,
    argMax(price, time) AS close,
    sum(volume)         AS volume
FROM trades
GROUP BY hour, symbol, exchange;
```
Note: the view aggregates each inserted block independently, so a plain MergeTree target can accumulate several partial rows per hour; either re-aggregate at query time or use an AggregatingMergeTree target.
ASOF JOIN for temporal data
A ClickHouse feature especially useful for temporal data: joining each row to the nearest matching value.
```sql
-- Attach an asset price to each liquidation (nearest price before the event)
SELECT
    l.time,
    l.symbol,
    l.quantity_usd,
    p.price AS price_at_liquidation
FROM liquidations l
ASOF LEFT JOIN prices p
    ON l.symbol = p.symbol AND l.time >= p.time;
```
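The ASOF semantics (for each left row, take the latest right row with the same key and a time at or before the event) can be sketched in Python with `bisect`. The price series below is made up:

```python
# Model of ASOF LEFT JOIN: for every (symbol, time) event, find the
# last price with the same symbol and time <= event time, else None.
import bisect

def asof_join(events, prices):
    """events: [(symbol, time)]; prices: {symbol: time-sorted [(time, price)]}."""
    out = []
    for symbol, t in events:
        series = prices.get(symbol, [])
        times = [pt for pt, _ in series]
        i = bisect.bisect_right(times, t) - 1  # last price at or before t
        out.append((symbol, t, series[i][1] if i >= 0 else None))
    return out

prices = {"BTCUSDT": [(100, 42000.0), (160, 42100.0), (220, 41950.0)]}
result = asof_join([("BTCUSDT", 200), ("BTCUSDT", 50)], prices)
print(result)
# -> [('BTCUSDT', 200, 42100.0), ('BTCUSDT', 50, None)]
```

This is exactly why the join saves a self-join plus window-function dance in row stores: the nearest-match lookup is a first-class operator.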
Common practices
Data types. Use LowCardinality(String) for low-cardinality fields (exchange, symbol, side): 2-10x size savings and faster filtering. Use Decimal instead of Float for financial values to avoid precision issues.
Partitioning. Monthly partitions (toYYYYMM) are the standard for most financial data; they let you drop old partitions without running DELETE.
Monitoring. Key metrics:
- Table and partition sizes
- Number of parts (in ClickHouse: many small parts = problem)
- Slow query execution time
- Insert latency at peak load
Stack
| Component | For TimescaleDB | For ClickHouse |
|---|---|---|
| Version | TimescaleDB 2.x + PostgreSQL 16 | ClickHouse 24+ |
| Client | psycopg3, asyncpg, sqlalchemy | clickhouse-connect, clickhouse-driver |
| Migrations | Flyway / Liquibase | custom SQL scripts |
| Monitoring | pg_stat_statements + Grafana | system.query_log + Grafana |
Schema design plus TimescaleDB or ClickHouse setup for a specific data volume and query pattern: 1-2 weeks. Migrating existing data adds roughly another week.