Scraping Data Storage System Development (TimescaleDB, ClickHouse)

Developing a parsing data storage system (TimescaleDB, ClickHouse)

Raw data from blockchains or exchanges accumulates fast: tens of gigabytes per day for actively parsed sources. Storing it all in a single table in vanilla PostgreSQL means query performance degrades within months. Choosing between TimescaleDB and ClickHouse is choosing between two fundamentally different storage models and query patterns. Both technologies are correct in their own context.

TimescaleDB vs ClickHouse: when to use what

TimescaleDB is a PostgreSQL extension. It adds hypertables (automatic time-based partitioning), continuous aggregates (incrementally maintained materialized views), and compression. You stay in the PostgreSQL ecosystem: standard SQL, ACID transactions, JOINs with regular tables, familiar tooling.

ClickHouse is a columnar OLAP database. Data is stored by column, which gives a huge advantage in aggregations over subsets of columns: GROUP BY and SUM over billions of rows run an order of magnitude faster than in PostgreSQL. Weaknesses: no transactions, expensive UPDATE/DELETE, and JOINs that work differently.

| Criteria | TimescaleDB | ClickHouse |
|---|---|---|
| Query pattern | Complex JOINs, OLTP+OLAP mix | Analytics, aggregations over large ranges |
| Writes | INSERTs in transactions, UPSERT | Batch inserts, eventual deduplication |
| Point reads | Fast (B-tree indexes) | Slower (no efficient point indexes) |
| Analytics | Good | Much faster |
| Updates | Standard UPDATE | Expensive (ReplacingMergeTree) |
| Operational complexity | Moderate | Higher |
| Data volume | Efficient up to ~1 TB | Efficient from ~100 GB upward |

Recommendation for on-chain data parsing:

  • TimescaleDB — if the data feeds product logic (balances, positions, accounts), joins against relational tables, or needs ACID guarantees
  • ClickHouse — if this is an analytics pipeline (trading signals, aggregated stats, historical analysis) whose queries span large date ranges

Production setups often combine both: TimescaleDB for hot operational data plus ClickHouse as the analytics warehouse.

TimescaleDB architecture

Hypertables

The basic concept: a regular PostgreSQL table becomes a hypertable, and under the hood TimescaleDB creates chunks (partitions) along the time dimension. Each chunk is a separate child table; old chunks can be compressed or archived.

-- Create table
CREATE TABLE trades (
  time        TIMESTAMPTZ NOT NULL,
  exchange    TEXT NOT NULL,
  symbol      TEXT NOT NULL,
  price       NUMERIC(20, 8) NOT NULL,
  volume      NUMERIC(20, 8) NOT NULL,
  side        CHAR(4) NOT NULL  -- 'buy' | 'sell'
);

-- Convert to hypertable with 1-day chunks
SELECT create_hypertable('trades', 'time', chunk_time_interval => INTERVAL '1 day');

-- Indexes per chunk (TimescaleDB creates automatically on time)
CREATE INDEX ON trades (symbol, time DESC);

Chunk size is the key tuning parameter. Rule of thumb: a chunk (data plus indexes) should fit in memory so bulk inserts stay fast; for exchange data that usually means chunks of 1-7 days.
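
As a back-of-the-envelope check, the rule above can be turned into arithmetic. A minimal sketch; the ingest rate, row size, index overhead, and memory budget are illustrative assumptions, not measurements:

```python
# Rough sizing sketch: pick a chunk_time_interval so that one chunk
# (data + index overhead) fits in the memory budget. Numbers below are
# illustrative assumptions.

def suggest_chunk_interval_days(rows_per_day: int,
                                bytes_per_row: int,
                                memory_budget_bytes: int,
                                index_overhead: float = 0.5) -> int:
    """Largest whole number of days whose chunk fits in the budget,
    clamped to the 1-7 day range mentioned in the text."""
    bytes_per_day = rows_per_day * bytes_per_row * (1 + index_overhead)
    days = int(memory_budget_bytes // bytes_per_day)
    return max(1, min(days, 7))

# Example: 50M trades/day, ~100 bytes/row, 16 GiB budget for hot chunks
interval = suggest_chunk_interval_days(50_000_000, 100, 16 * 1024**3)
```

With these assumed numbers the helper suggests 2-day chunks; with a much lower ingest rate it clamps to the 7-day ceiling.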

Continuous Aggregates

Continuous aggregates replace expensive realtime GROUP BY queries with incrementally maintained materialized views:

-- OHLCV aggregate by minutes
CREATE MATERIALIZED VIEW trades_1m
WITH (timescaledb.continuous) AS
SELECT
  time_bucket('1 minute', time) AS bucket,
  symbol,
  exchange,
  first(price, time)  AS open,
  max(price)          AS high,
  min(price)          AS low,
  last(price, time)   AS close,
  sum(volume)         AS volume,
  count(*)            AS trade_count
FROM trades
GROUP BY bucket, symbol, exchange
WITH NO DATA;

-- Update policy: refresh aggregate every 1 minute, starting 2 minutes ago
SELECT add_continuous_aggregate_policy('trades_1m',
  start_offset => INTERVAL '2 minutes',
  end_offset   => INTERVAL '1 minute',
  schedule_interval => INTERVAL '1 minute'
);

A query like SELECT * FROM trades_1m WHERE bucket > NOW() - INTERVAL '1 day' now reads from the materialized view instead of aggregating raw rows on the fly.
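
What the aggregate computes can be sketched in plain Python: bucket trades into one-minute windows and derive OHLCV per bucket and symbol. This is a simplified model of time_bucket with first/last, not TimescaleDB's actual incremental machinery:

```python
from collections import defaultdict

def ohlcv_1m(trades):
    """trades: iterable of (ts_seconds, symbol, price, volume), any order.
    Returns {(bucket_start, symbol): OHLCV dict}."""
    buckets = defaultdict(list)
    for ts, symbol, price, volume in trades:
        buckets[(ts - ts % 60, symbol)].append((ts, price, volume))
    out = {}
    for key, rows in buckets.items():
        rows.sort()  # order by time within the bucket
        prices = [p for _, p, _ in rows]
        out[key] = {
            'open': rows[0][1], 'high': max(prices),
            'low': min(prices), 'close': rows[-1][1],
            'volume': sum(v for _, _, v in rows), 'trade_count': len(rows),
        }
    return out
```

The continuous aggregate does the same work incrementally, touching only buckets whose underlying rows changed since the last refresh.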

Compression

Old data can be compressed with little loss of functionality (the main restriction is that UPDATE/DELETE on compressed chunks are limited):

-- Enable compression sorted by symbol+time (optimal for our queries)
ALTER TABLE trades SET (
  timescaledb.compress,
  timescaledb.compress_orderby = 'time DESC',
  timescaledb.compress_segmentby = 'symbol'
);

-- Auto-compress chunks older than 7 days
SELECT add_compression_policy('trades', INTERVAL '7 days');

Typical compression ratio for exchange data: 10-20x, so 100 GB raw becomes 5-10 GB compressed.

Retention policy

-- Auto-delete data older than 2 years
SELECT add_retention_policy('trades', INTERVAL '2 years');

-- Or (Timescale Cloud): tier chunks older than 30 days to cheaper object storage
SELECT add_tiering_policy('trades', INTERVAL '30 days');

ClickHouse architecture

MergeTree engines

The choice of table engine is critical. For parsed data, the usual options are:

MergeTree — the basic engine, with no special merge-time behavior:

CREATE TABLE trades
(
    time      DateTime64(3),
    exchange  LowCardinality(String),
    symbol    LowCardinality(String),
    price     Decimal(20, 8),
    volume    Decimal(20, 8),
    side      Enum8('buy' = 1, 'sell' = 2)
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(time)
ORDER BY (symbol, exchange, time);

In ClickHouse, ORDER BY is simultaneously the primary key (a sparse index) and the physical sort order on disk. Choose it from your query patterns: if most queries filter by (symbol, time), use exactly that ORDER BY.
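
The sparse-index idea can be sketched as follows: ClickHouse keeps one index entry ("mark") per granule of rows (8192 rows by default) and binary-searches the marks to decide which granules to read. A simplified single-column model:

```python
from bisect import bisect_right

# Sketch of a sparse primary index: `marks` holds the first sorting-key
# value of each granule, in ascending order. A range filter only needs
# the granules whose key range can intersect [lo, hi].

def granules_to_read(marks, lo, hi):
    """Half-open range of granule indices that may contain keys in [lo, hi]."""
    first = max(0, bisect_right(marks, lo) - 1)  # last granule starting <= lo
    last = bisect_right(marks, hi)               # first granule starting > hi
    return first, last
```

With marks [0, 100, 200, 300], a filter on keys 150-250 touches only granules 1 and 2; everything else is skipped without reading data.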

ReplacingMergeTree — for deduplication (needed so the same data can be re-processed safely):

CREATE TABLE balances
(
    time      DateTime64(3),
    address   String,
    token     LowCardinality(String),
    balance   Decimal(38, 18),
    block_number UInt64,
    _version  UInt64  -- usually block_number or timestamp for dedup
)
ENGINE = ReplacingMergeTree(_version)
PARTITION BY toYYYYMM(time)
ORDER BY (address, token, time);

When duplicate rows are inserted (same sorting key), the row with the highest _version survives. Deduplication happens during background merges, not immediately; for exact results, query with the FINAL modifier.
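
The merge-time semantics can be modeled in a few lines of Python: for each sorting key, only the row with the highest _version survives. This mirrors what SELECT ... FINAL returns; it is a model, not the actual merge algorithm:

```python
# Model of ReplacingMergeTree(_version) deduplication over the
# (address, token, time) sorting key from the table above.

def replacing_merge(rows):
    """rows: list of dicts with the sorting-key fields and '_version'.
    Returns one surviving row per key: the one with max _version."""
    latest = {}
    for r in rows:
        key = (r['address'], r['token'], r['time'])
        if key not in latest or r['_version'] > latest[key]['_version']:
            latest[key] = r
    return list(latest.values())
```

Re-processing a block range then simply re-inserts rows with a higher _version (e.g. a newer block_number) and the stale copies disappear at the next merge.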

SummingMergeTree — for additive aggregates (volumes, sums, counts):

CREATE TABLE volume_by_hour
(
    hour     DateTime,
    symbol   LowCardinality(String),
    volume   Decimal(20, 8),
    count    UInt64
)
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(hour)
ORDER BY (symbol, hour);

Batch insert

ClickHouse is optimized for batch inserts; frequent small inserts are an anti-pattern:

import clickhouse_connect

client = clickhouse_connect.get_client(host='localhost')

COLUMNS = ['time', 'exchange', 'symbol', 'price', 'volume', 'side']

# Correct: batch at least 1000-10000 rows per insert
def flush_buffer(rows: list[dict]) -> None:
    if len(rows) < 1000:
        return  # accumulate further

    client.insert(
        'trades',
        [[r[c] for c in COLUMNS] for r in rows],
        column_names=COLUMNS
    )
    rows.clear()

Minimum recommended batch: 1000 rows. Optimal: 10,000-100,000 rows. Frequency: no more than 1-2 times per second per table.
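
One way to satisfy both the size and the frequency limits is a buffer that flushes on whichever threshold is hit first. A minimal sketch; the sink callable is an assumption standing in for a wrapper around client.insert, and a real pipeline would also flush from a timer so a quiet stream doesn't strand rows:

```python
import time

# Size- and age-bounded insert buffer: flush when the batch reaches
# max_rows, or when the oldest buffered row is max_age seconds old.

class InsertBuffer:
    def __init__(self, sink, max_rows=10_000, max_age=1.0):
        self.sink, self.max_rows, self.max_age = sink, max_rows, max_age
        self.rows = []
        self.first_ts = None  # monotonic time of the oldest buffered row

    def add(self, row):
        if self.first_ts is None:
            self.first_ts = time.monotonic()
        self.rows.append(row)
        if (len(self.rows) >= self.max_rows
                or time.monotonic() - self.first_ts >= self.max_age):
            self.flush()

    def flush(self):
        if self.rows:
            self.sink(self.rows)       # one batch insert
            self.rows, self.first_ts = [], None
```

With max_rows=10_000 and max_age=1.0 this stays within the "at most 1-2 inserts per second per table" guideline under load while keeping latency bounded when traffic is light.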

Materialized Views in ClickHouse

ClickHouse materialized views are trigger-based: they fire on every insert rather than on a schedule as in TimescaleDB. Each inserted block is aggregated independently, so the target table usually needs an engine that merges partial aggregates (SummingMergeTree or AggregatingMergeTree):

-- Materialized view for OHLCV aggregation
CREATE MATERIALIZED VIEW trades_1h_mv
TO trades_1h  -- target table
AS
SELECT
    toStartOfHour(time) AS hour,
    symbol,
    exchange,
    argMin(price, time)  AS open,
    max(price)           AS high,
    min(price)           AS low,
    argMax(price, time)  AS close,
    sum(volume)          AS volume
FROM trades
GROUP BY hour, symbol, exchange;

ASOF JOIN for temporal data

ASOF JOIN is a ClickHouse feature well suited to temporal data: a join on the nearest matching value:

-- Attach asset price to each liquidation (nearest price before event)
SELECT 
    l.time,
    l.symbol,
    l.quantity_usd,
    p.price AS price_at_liquidation
FROM liquidations l
ASOF LEFT JOIN prices p
    ON l.symbol = p.symbol AND l.time >= p.time;
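
The semantics of the >= condition above can be modeled per symbol in plain Python: for each event, take the latest price whose timestamp is at or before the event:

```python
from bisect import bisect_right

# Model of ASOF JOIN ... ON l.time >= p.time for one symbol:
# pick the latest price at or before the event time.

def asof_price(price_times, prices, event_time):
    """price_times must be sorted ascending, parallel to prices.
    Returns None when no price exists at or before event_time
    (the LEFT JOIN case)."""
    i = bisect_right(price_times, event_time) - 1
    return prices[i] if i >= 0 else None
```

ClickHouse does effectively this inside the join, using the sorted right-hand side, which is why ASOF JOIN requires the inequality column to be orderable.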

Common practices

Data types. Use LowCardinality(String) for low-cardinality fields (exchange, symbol, side): it gives 2-10x size savings and faster filtering. Use Decimal instead of Float for financial values to avoid precision issues.
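
The Decimal point is easy to demonstrate: binary floats cannot represent most decimal fractions exactly, so rounding errors accumulate when summing trade amounts:

```python
from decimal import Decimal

# Summing ten 0.1-sized amounts: binary float drifts, Decimal is exact.
float_sum = sum([0.1] * 10)             # accumulates binary rounding error
exact_sum = sum([Decimal('0.1')] * 10)  # exact decimal arithmetic
```

Here float_sum is not equal to 1.0 while exact_sum equals Decimal('1.0'); over millions of rows of prices and volumes that drift becomes visible in reconciliation.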

Partitioning. Monthly partitions (toYYYYMM) are the standard for most financial data and allow dropping old partitions without DELETE.

Monitoring. Key metrics:

  • Table and partition sizes
  • Number of parts (in ClickHouse, many small parts indicate an insert-pattern problem)
  • Slow query execution time
  • Insert latency at peak load

Stack

| Component | For TimescaleDB | For ClickHouse |
|---|---|---|
| Version | TimescaleDB 2.x + PostgreSQL 16 | ClickHouse 24+ |
| Client | psycopg3, asyncpg, SQLAlchemy | clickhouse-connect, clickhouse-driver |
| Migrations | Flyway / Liquibase | custom SQL scripts |
| Monitoring | pg_stat_statements + Grafana | system.query_log + Grafana |

Schema design plus TimescaleDB or ClickHouse setup for a specific data volume and set of query patterns takes 1-2 weeks. If the work includes migrating existing data, add about one more week.