Scraping Cryptocurrency Exchange Data
Exchanges don't publish historical data in a single convenient format. Each provides a REST API with its own limits and response formats, WebSocket streams for real-time data, and almost all impose rate limits that must be respected to avoid IP or API-key bans. Scraping data from multiple exchanges therefore comes down to normalizing heterogeneous APIs into a single schema while respecting each exchange's rate-limit policy.
Structure of Exchange Data
The main data types an exchange provides:
Order book (orderbook): a snapshot of current bid/ask orders with volumes. Useful for liquidity and spread analysis. REST gives snapshots; WebSocket gives incremental (diff) updates.
Trades (trades): the history of executed trades. Each trade carries a timestamp, price, amount, and side (buy/sell). This is the basis for building OHLCV candles.
OHLCV (candles): data aggregated per period. Most exchanges provide candles directly via REST.
Funding rates (for perpetual contracts): the financing rate that determines the cost of holding a position; essential for arbitrage strategies between spot and futures.
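Before touching any exchange API, it helps to pin down the target schema these types normalize into. A minimal sketch for trades and candles (the field names here are an illustration, not a standard):

```python
from dataclasses import dataclass

@dataclass
class Trade:
    exchange: str
    symbol: str        # normalized form, e.g. "BTC/USDT"
    timestamp_ms: int  # exchange-reported time, Unix milliseconds
    price: float
    amount: float      # quantity in base currency
    side: str          # "buy" or "sell" (taker side)
    trade_id: str

@dataclass
class Candle:
    exchange: str
    symbol: str
    timestamp_ms: int  # open time of the period
    open: float
    high: float
    low: float
    close: float
    volume: float
```

Every exchange-specific payload gets converted into these records at the ingestion boundary, so everything downstream (storage, analytics) sees one shape.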
CCXT: Unified Access
CCXT (CryptoCurrency eXchange Trading) is a Python/JavaScript/PHP library with a unified API covering 100+ exchanges. It is the right solution for most tasks: don't write exchange connectors by hand.
import asyncio
import logging

import ccxt.async_support as ccxt  # the async variant: fetch_* methods are awaitable

logger = logging.getLogger(__name__)

async def fetch_ohlcv_all_exchanges(symbol: str, timeframe: str) -> dict:
    exchanges = [
        ccxt.binance({'enableRateLimit': True}),
        ccxt.coinbase({'enableRateLimit': True}),
        ccxt.kraken({'enableRateLimit': True}),
        ccxt.bybit({'enableRateLimit': True}),
    ]
    results = {}
    for exchange in exchanges:
        try:
            ohlcv = await exchange.fetch_ohlcv(symbol, timeframe, limit=500)
            results[exchange.id] = [
                {
                    'timestamp': candle[0],
                    'open': candle[1],
                    'high': candle[2],
                    'low': candle[3],
                    'close': candle[4],
                    'volume': candle[5],
                }
                for candle in ohlcv
            ]
        except ccxt.RateLimitExceeded:
            await asyncio.sleep(exchange.rateLimit / 1000)
        except ccxt.NetworkError as e:
            logger.error(f"{exchange.id}: network error: {e}")
        finally:
            await exchange.close()  # release the underlying HTTP session
    return results
Setting 'enableRateLimit': True activates CCXT's built-in rate limiter, which automatically throttles requests according to each exchange's documented limits.
Historical Data: Bulk Download
Most exchanges return at most 500–1000 candles per request. Downloading a long history requires iterative requests with pagination by time:
async def download_full_history(
    exchange: ccxt.Exchange,
    symbol: str,
    timeframe: str,
    since_ms: int
) -> list:
    all_candles = []
    current_since = since_ms
    while True:
        candles = await exchange.fetch_ohlcv(
            symbol, timeframe,
            since=current_since,
            limit=1000
        )
        if not candles:
            break
        all_candles.extend(candles)
        last_timestamp = candles[-1][0]
        if last_timestamp <= current_since:
            break  # infinite loop protection
        # advance past the last candle (parse_timeframe returns seconds)
        current_since = last_timestamp + exchange.parse_timeframe(timeframe) * 1000
        # respect the rate limit between pages
        await asyncio.sleep(exchange.rateLimit / 1000)
    return all_candles
For Binance, historical data is also available via Binance Data Vision (data.binance.vision): bulk CSV archives per day/month with no rate limits, orders of magnitude faster than the API. Alternatives: Bybit Public Data, OKX Market Data.
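Fetching from Data Vision amounts to constructing the archive URL and parsing the CSV inside. The URL layout below follows Binance's public-data repository as I understand it; treat the exact path pattern as an assumption to verify before relying on it. The first six CSV columns mirror the REST klines payload:

```python
def binance_vision_url(symbol: str, interval: str, date: str) -> str:
    """Build the daily klines archive URL on data.binance.vision.

    Path layout (assumed, verify against the public-data repo):
    data/spot/daily/klines/{SYMBOL}/{interval}/{SYMBOL}-{interval}-{YYYY-MM-DD}.zip
    """
    return (
        "https://data.binance.vision/data/spot/daily/klines/"
        f"{symbol}/{interval}/{symbol}-{interval}-{date}.zip"
    )

def parse_kline_row(row: list) -> dict:
    # First six columns: open time (ms), open, high, low, close, volume
    return {
        'timestamp': int(row[0]),
        'open': float(row[1]),
        'high': float(row[2]),
        'low': float(row[3]),
        'close': float(row[4]),
        'volume': float(row[5]),
    }
```

Download the zip with any HTTP client, unzip, and feed each CSV row through parse_kline_row to get the same dict shape the REST code above produces.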
WebSocket for Real-Time Data
import asyncio
import json

import websockets

async def stream_binance_trades(symbol: str, callback):
    uri = f"wss://stream.binance.com:9443/ws/{symbol.lower()}@trade"
    async with websockets.connect(uri) as ws:
        while True:
            try:
                msg = await asyncio.wait_for(ws.recv(), timeout=30)
                data = json.loads(msg)
                await callback({
                    'exchange': 'binance',
                    'symbol': symbol,
                    'timestamp': data['T'],
                    'price': float(data['p']),
                    'amount': float(data['q']),
                    'side': 'sell' if data['m'] else 'buy',  # m = buyer is maker
                    'trade_id': data['t'],
                })
            except asyncio.TimeoutError:
                await ws.ping()  # keepalive
For production use, add automatic reconnection on connection failure and sequence-number tracking (Binance uses the u and U fields for gap detection in order book updates).
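The reconnection part can be factored into a generic wrapper around any streaming coroutine, such as stream_binance_trades above. A sketch with exponential backoff and jitter (the function name and parameters are illustrative):

```python
import asyncio
import random

async def run_with_reconnect(stream_factory, initial_backoff: float = 1.0,
                             max_backoff: float = 60.0):
    """Run a streaming coroutine forever, restarting it whenever it raises.

    The delay doubles on consecutive failures (capped at max_backoff),
    with jitter to avoid reconnect stampedes, and resets after a clean run.
    """
    backoff = initial_backoff
    while True:
        try:
            await stream_factory()
            backoff = initial_backoff  # stream ended cleanly: reset backoff
        except asyncio.CancelledError:
            raise  # let the task be cancelled from outside
        except Exception:
            delay = backoff * (0.5 + random.random())  # jitter: 0.5x..1.5x
            await asyncio.sleep(delay)
            backoff = min(backoff * 2, max_backoff)
```

Usage would look like asyncio.run(run_with_reconnect(lambda: stream_binance_trades('BTCUSDT', handle_trade))). Gap detection via sequence numbers still has to live inside the stream coroutine itself.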
Storage and Normalization
For time series, use TimescaleDB (a PostgreSQL extension) or ClickHouse for analytical queries. TimescaleDB automatically partitions tables by time (hypertables) and supports compression and materialized views for aggregations:
CREATE TABLE trades (
    time      TIMESTAMPTZ NOT NULL,
    exchange  TEXT NOT NULL,
    symbol    TEXT NOT NULL,
    price     DOUBLE PRECISION NOT NULL,
    amount    DOUBLE PRECISION NOT NULL,
    side      TEXT NOT NULL,
    trade_id  TEXT
);
SELECT create_hypertable('trades', 'time');
CREATE INDEX ON trades (exchange, symbol, time DESC);
Symbol normalization across exchanges is a typical problem: BTC/USDT, BTCUSDT, BTC-USDT, and XBT/USD are the same pair spelled differently. CCXT solves this with its unified market mapping (exchange.load_markets() returns unified symbols).
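When CCXT is not in the loop (e.g. raw WebSocket feeds), a small normalizer has to do the same job. A heuristic sketch, not a substitute for CCXT's market metadata; the quote-currency list and alias table are assumptions to extend per exchange:

```python
# Known base-currency aliases (Kraken spells Bitcoin as XBT)
ALIASES = {'XBT': 'BTC'}

def normalize_symbol(raw: str,
                     quotes=('USDT', 'USDC', 'USD', 'BTC', 'ETH')) -> str:
    """Normalize exchange-specific pair spellings to 'BASE/QUOTE'."""
    s = raw.upper().replace('-', '/').replace('_', '/')
    if '/' not in s:
        # No separator (e.g. BTCUSDT): split on a known quote-currency suffix
        for q in quotes:
            if s.endswith(q) and len(s) > len(q):
                s = s[:-len(q)] + '/' + q
                break
    base, _, quote = s.partition('/')
    return f"{ALIASES.get(base, base)}/{ALIASES.get(quote, quote)}"
```

So normalize_symbol('BTCUSDT'), normalize_symbol('BTC-USDT'), and normalize_symbol('XBT/USDT') all map to 'BTC/USDT'. The suffix heuristic is ambiguous for exotic tickers, which is exactly why CCXT's per-exchange market tables are preferable when available.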
Rate Limits and Anti-Ban
Exchanges ban IPs for exceeding rate limits. Mitigation strategies:
- Rotate API keys (multiple keys on one account, or multiple accounts)
- Cache responses: don't request the same data more than once
- Prefer WebSocket over REST for real-time data
- Use exponential backoff on HTTP 429 responses
For large-scale collection from dozens of exchanges simultaneously, residential proxy rotation is sometimes used, but this is a gray area under most exchanges' ToS. Prefer official bulk-data endpoints where they exist.
Developing a scraper supporting 5–10 exchanges, with normalization to a single schema, real-time WebSocket streams, and TimescaleDB storage, takes roughly 3–4 weeks.