OHLCV Historical Candle Data Scraping
Historical OHLCV data is needed for backtesting strategies, building charts, calculating technical indicators, and training ML models. The problem: every exchange has its own depth limits, rate limits, and formats, and often different data for the same instrument. Pulling data manually through a browser works for one-off tasks; systematic collection requires infrastructure.
CEX vs DEX: different sources, different logic
Centralized exchanges (CEX): Binance, OKX, Bybit, Kraken — provide REST APIs for historical OHLCV. The data is clean, aggregated, with minimal gaps. Constraints: rate limits (Binance: on the order of 1200 request weight per minute per IP) and maximum history depth (Binance returns up to 1000 candles per request, with history back to 2017).
Decentralized exchanges (DEX): Uniswap, Curve, dYdX — no native OHLCV API. Data must be computed from swap events on-chain or obtained via The Graph / DeFiLlama.
Binance parsing: proper implementation
Binance is the benchmark for history depth and API quality. The /api/v3/klines endpoint returns each candle as a positional array; a typed view of one candle:
interface BinanceKline {
  openTime: number // ms
  open: string
  high: string
  low: string
  close: string
  volume: string
  closeTime: number
  quoteAssetVolume: string
  numberOfTrades: number
  takerBuyBaseVolume: string
  takerBuyQuoteVolume: string
}
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms))

async function fetchBinanceHistory(
  symbol: string,
  interval: string, // '1m', '5m', '1h', '1d'
  startTime: number, // ms
  endTime: number
): Promise<BinanceKline[]> {
  const allKlines: BinanceKline[] = []
  let currentStart = startTime
  while (currentStart < endTime) {
    const params = new URLSearchParams({
      symbol,
      interval,
      startTime: currentStart.toString(),
      endTime: endTime.toString(),
      limit: '1000', // max per request
    })
    // /api/v3/klines is a public endpoint — no API key required; limits are per IP
    const response = await fetch(`https://api.binance.com/api/v3/klines?${params}`)
    if (response.status === 429) {
      // Rate limited: wait per Retry-After header
      const retryAfter = parseInt(response.headers.get('Retry-After') || '60', 10)
      await sleep(retryAfter * 1000)
      continue
    }
    const klines: any[][] = await response.json()
    if (klines.length === 0) break
    const parsed: BinanceKline[] = klines.map(k => ({
      openTime: k[0],
      open: k[1], high: k[2], low: k[3], close: k[4],
      volume: k[5],
      closeTime: k[6],
      quoteAssetVolume: k[7],
      numberOfTrades: k[8],
      takerBuyBaseVolume: k[9],
      takerBuyQuoteVolume: k[10],
    }))
    allKlines.push(...parsed)
    currentStart = klines[klines.length - 1][6] + 1 // last candle closeTime + 1ms
    await sleep(100) // crude pacing between requests
  }
  return allKlines
}
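Even with correct pagination, the resulting series can have holes: exchange downtime, delistings, thinly traded pairs. A small gap-detection helper — a sketch assuming candles are sorted ascending with a fixed interval (findGaps is a hypothetical name, not part of any exchange API):

```typescript
// Detect missing candles in a sorted series of open times.
// `intervalMs` is the candle duration (e.g. 60_000 for '1m').
// Returns the openTime of every candle that should exist but doesn't.
function findGaps(openTimes: number[], intervalMs: number): number[] {
  const missing: number[] = []
  for (let i = 1; i < openTimes.length; i++) {
    let expected = openTimes[i - 1] + intervalMs
    while (expected < openTimes[i]) {
      missing.push(expected)
      expected += intervalMs
    }
  }
  return missing
}
```

Running this after each collection pass and re-requesting the missing ranges is usually enough to close gaps caused by transient API errors.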
Rate limiting: proper implementation
A naive sleep(100) is fragile. You need a token bucket or sliding-window rate limiter, especially when collecting multiple instruments in parallel:
class RateLimiter {
  private tokens: number
  private lastRefill: number

  constructor(
    private maxTokens: number,
    private refillRate: number, // tokens per second
  ) {
    this.tokens = maxTokens
    this.lastRefill = Date.now()
  }

  async acquire(count = 1): Promise<void> {
    this.refill()
    if (this.tokens >= count) {
      this.tokens -= count
      return
    }
    // Wait until enough tokens accumulate, then consume them
    const waitMs = ((count - this.tokens) / this.refillRate) * 1000
    await sleep(waitMs)
    this.tokens = 0
    // Reset the clock, otherwise the next refill() double-counts the wait
    this.lastRefill = Date.now()
  }

  private refill(): void {
    const now = Date.now()
    const elapsed = (now - this.lastRefill) / 1000
    this.tokens = Math.min(this.maxTokens, this.tokens + elapsed * this.refillRate)
    this.lastRefill = now
  }
}
// Binance: 1200 requests/minute = 20 requests/second
const binanceLimiter = new RateLimiter(1200, 20)
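For the sliding-window variant mentioned above, a minimal sketch. The injectable now clock is my addition for testability, not part of any exchange SDK; SlidingWindowLimiter is a hypothetical name:

```typescript
// Sliding-window rate limiter: allows at most `maxRequests` within any
// rolling `windowMs` window. The clock is injectable so the logic can be
// tested without real timers.
class SlidingWindowLimiter {
  private timestamps: number[] = []

  constructor(
    private maxRequests: number,
    private windowMs: number,
    private now: () => number = Date.now,
  ) {}

  // Non-blocking check: records the request and returns true if allowed.
  tryAcquire(): boolean {
    const cutoff = this.now() - this.windowMs
    // Drop timestamps that have fallen out of the window
    while (this.timestamps.length > 0 && this.timestamps[0] <= cutoff) {
      this.timestamps.shift()
    }
    if (this.timestamps.length >= this.maxRequests) return false
    this.timestamps.push(this.now())
    return true
  }
}
```

Compared to a token bucket, the sliding window never allows more than maxRequests in any window, at the cost of keeping one timestamp per recent request in memory.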
Multi-exchange collector
For strategies that need data from several exchanges at once, the ccxt library standardizes access to 100+ exchanges through a unified interface:
import ccxt from 'ccxt'

interface ExchangeConfig {
  id: string
  apiKey?: string
  secret?: string
}

interface OHLCV {
  exchange: string
  symbol: string
  timeframe: string
  timestamp: Date
  open: number
  high: number
  low: number
  close: number
  volume: number
}

class MultiExchangeOHLCVCollector {
  private exchanges: Map<string, ccxt.Exchange> = new Map()

  constructor(configs: ExchangeConfig[]) {
    for (const config of configs) {
      const ExchangeClass = ccxt[config.id as keyof typeof ccxt] as any
      this.exchanges.set(config.id, new ExchangeClass({
        apiKey: config.apiKey,
        secret: config.secret,
        enableRateLimit: true, // ccxt's built-in rate limiter
      }))
    }
  }

  async fetchOHLCV(
    exchangeId: string,
    symbol: string,
    timeframe: string,
    since?: number,
    limit = 500
  ): Promise<OHLCV[]> {
    const exchange = this.exchanges.get(exchangeId)!
    // ccxt returns: [timestamp, open, high, low, close, volume]
    const rawData = await exchange.fetchOHLCV(symbol, timeframe, since, limit)
    return rawData.map(([ts, o, h, l, c, v]) => ({
      exchange: exchangeId,
      symbol,
      timeframe,
      timestamp: new Date(ts),
      open: o, high: h, low: l, close: c, volume: v,
    }))
  }

  async fetchFullHistory(
    exchangeId: string,
    symbol: string,
    timeframe: string,
    fromDate: Date
  ): Promise<OHLCV[]> {
    const exchange = this.exchanges.get(exchangeId)!
    const all: OHLCV[] = []
    let since = fromDate.getTime()
    while (true) {
      const batch = await this.fetchOHLCV(exchangeId, symbol, timeframe, since)
      if (batch.length === 0) break
      all.push(...batch)
      since = batch[batch.length - 1].timestamp.getTime() + 1
      await exchange.sleep(exchange.rateLimit)
    }
    return all
  }
}
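One practical wrinkle with paginated fetches: the last candle of a batch is often still forming and gets re-fetched in the next batch with updated values. A sketch of merging batches with last-write-wins de-duplication; the Candle shape is a local minimal stand-in, and mergeBatches is a hypothetical helper:

```typescript
// Minimal candle shape for this sketch.
interface Candle {
  timestamp: number // ms
  open: number
  high: number
  low: number
  close: number
  volume: number
}

// Merge paginated batches: on duplicate timestamps the later batch wins,
// so a re-fetched (previously still-forming) candle carries updated values.
function mergeBatches(batches: Candle[][]): Candle[] {
  const byTs = new Map<number, Candle>()
  for (const batch of batches) {
    for (const candle of batch) byTs.set(candle.timestamp, candle)
  }
  return Array.from(byTs.values()).sort((a, b) => a.timestamp - b.timestamp)
}
```

The same last-write-wins semantics is what the ON CONFLICT upsert provides at the storage layer.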
Data storage and processing
TimescaleDB is the de facto standard for storing OHLCV data. A hypertable partitioned by timestamp provides automatic chunking and efficient range queries:
CREATE TABLE ohlcv (
time TIMESTAMPTZ NOT NULL,
exchange VARCHAR(50) NOT NULL,
symbol VARCHAR(20) NOT NULL,
timeframe VARCHAR(10) NOT NULL,
open NUMERIC(20, 8) NOT NULL,
high NUMERIC(20, 8) NOT NULL,
low NUMERIC(20, 8) NOT NULL,
close NUMERIC(20, 8) NOT NULL,
volume NUMERIC(30, 8) NOT NULL,
PRIMARY KEY (time, exchange, symbol, timeframe)
);
SELECT create_hypertable('ohlcv', 'time');
-- Compress data older than 30 days
ALTER TABLE ohlcv SET (
timescaledb.compress,
timescaledb.compress_segmentby = 'exchange, symbol, timeframe'
);
SELECT add_compression_policy('ohlcv', INTERVAL '30 days');
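Incremental updates will re-insert candles that overlap the previous run, so a plain INSERT conflicts with the primary key above. A sketch of a batched upsert-statement builder; the column names match the schema above, while buildUpsert is a hypothetical helper and the $n placeholders assume a node-postgres-style driver:

```typescript
interface Row {
  time: string
  exchange: string
  symbol: string
  timeframe: string
  open: number
  high: number
  low: number
  close: number
  volume: number
}

// Build one multi-row INSERT ... ON CONFLICT DO UPDATE statement with
// $1..$n placeholders, so overlapping candles overwrite stale values.
function buildUpsert(rows: Row[]): { text: string; values: unknown[] } {
  const cols = ['time', 'exchange', 'symbol', 'timeframe',
                'open', 'high', 'low', 'close', 'volume']
  const values: unknown[] = []
  const tuples = rows.map((row, i) => {
    values.push(row.time, row.exchange, row.symbol, row.timeframe,
                row.open, row.high, row.low, row.close, row.volume)
    const base = i * cols.length
    return `(${cols.map((_, j) => `$${base + j + 1}`).join(', ')})`
  })
  const text =
    `INSERT INTO ohlcv (${cols.join(', ')}) VALUES ${tuples.join(', ')} ` +
    `ON CONFLICT (time, exchange, symbol, timeframe) DO UPDATE SET ` +
    `open = EXCLUDED.open, high = EXCLUDED.high, low = EXCLUDED.low, ` +
    `close = EXCLUDED.close, volume = EXCLUDED.volume`
  return { text, values }
}
```

Batching many rows per statement matters here: single-row inserts are the usual bottleneck when backfilling years of 1m candles.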
For timeframe conversion without re-fetching — Continuous Aggregates:
-- Auto aggregate 1m → 1h
CREATE MATERIALIZED VIEW ohlcv_1h
WITH (timescaledb.continuous) AS
SELECT
time_bucket('1 hour', time) AS time,
exchange, symbol,
'1h' AS timeframe,
first(open, time) AS open,
max(high) AS high,
min(low) AS low,
last(close, time) AS close,
sum(volume) AS volume
FROM ohlcv
WHERE timeframe = '1m'
GROUP BY 1, 2, 3;
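The same 1m → 1h rollup can be mirrored in application code, e.g. to cross-check the continuous aggregate or to resample candles that never touched the database. A sketch (resample is a hypothetical helper; input must be sorted ascending by timestamp):

```typescript
interface Bar {
  timestamp: number // ms, candle open time
  open: number
  high: number
  low: number
  close: number
  volume: number
}

// Resample candles into larger buckets (e.g. 1m → 1h with bucketMs = 3_600_000),
// mirroring the SQL: first(open), max(high), min(low), last(close), sum(volume).
function resample(bars: Bar[], bucketMs: number): Bar[] {
  const out: Bar[] = []
  for (const bar of bars) {
    const bucket = Math.floor(bar.timestamp / bucketMs) * bucketMs
    const last = out[out.length - 1]
    if (last && last.timestamp === bucket) {
      last.high = Math.max(last.high, bar.high)
      last.low = Math.min(last.low, bar.low)
      last.close = bar.close // last candle in the bucket wins
      last.volume += bar.volume
    } else {
      out.push({ ...bar, timestamp: bucket }) // first candle sets the open
    }
  }
  return out
}
```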
DEX data via on-chain events
For DEX OHLCV (Uniswap V3 example) — compute from Swap events:
-- From indexed Uniswap V3 Swap events
-- sqrtPriceX96 → price (token1 per token0): price = (sqrtPriceX96 / 2^96)^2 * 10^(decimals0 - decimals1)
SELECT
time_bucket('1 hour', block_timestamp) AS time,
'uniswap-v3' AS exchange,
pair_symbol AS symbol,
first(price, block_timestamp) AS open,
max(price) AS high,
min(price) AS low,
last(price, block_timestamp) AS close,
sum(volume_usd) AS volume
FROM uniswap_swaps
GROUP BY 1, 2, 3;
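The sqrtPriceX96 conversion from the comment above, as a function. Note the decimals adjustment: for the token1-per-token0 price the exponent is decimals0 - decimals1. A sketch using Number arithmetic (sqrtPriceX96ToPrice is a hypothetical name) — adequate for charting, not for accounting, where fixed-point math is needed:

```typescript
// Convert Uniswap V3 sqrtPriceX96 to a human-readable price of token0
// denominated in token1 (i.e. the token1/token0 ratio), adjusting for
// the two tokens' decimal places.
function sqrtPriceX96ToPrice(
  sqrtPriceX96: bigint,
  decimals0: number,
  decimals1: number,
): number {
  const ratio = Number(sqrtPriceX96) / 2 ** 96
  return ratio * ratio * 10 ** (decimals0 - decimals1)
}
```

For a USDC/WETH pool (token0 = USDC with 6 decimals, token1 = WETH with 18), this yields the WETH price of one USDC; invert it to quote ETH in USDC.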
A full collector with multi-exchange support, TimescaleDB storage, incremental updates, and gap alerting is roughly 1-2 weeks of development.