Parsing Data from Blockchain Explorers
Etherscan, BscScan, and Polygonscan are convenient interfaces on top of a node, but their APIs have strict limits: 5 requests per second on the free plan, no streaming, and pagination capped at 10,000 records. If you need to download the history of 500K transactions for a specific contract, or track all interactions with an address, you need to know the workarounds and alternatives.
Etherscan API: What It Can and Can't Do
What works well:
- Address transaction history: `?module=account&action=txlist&address=0x...`
- ERC-20 transfers: `?module=account&action=tokentx&address=0x...`
- Contract verification and ABI retrieval: `?module=contract&action=getabi`
- Contract source code: `?module=contract&action=getsourcecode`
Hard limits:
- Maximum 10,000 records per request (bypass via block pagination)
- `startblock`/`endblock` parameters are the only pagination method
- Rate limit: 5 req/sec (free), 10-20 req/sec (paid plans)
- No WebSocket/streaming
- Internal transactions (`action=txlistinternal`) are not always complete
```python
import httpx
import asyncio
from typing import AsyncGenerator

async def get_all_transactions(
    address: str,
    api_key: str,
    start_block: int = 0,
) -> AsyncGenerator[dict, None]:
    """Download ALL address transactions via block pagination."""
    base_url = "https://api.etherscan.io/api"
    current_block = start_block
    async with httpx.AsyncClient() as client:
        while True:
            resp = await client.get(base_url, params={
                "module": "account",
                "action": "txlist",
                "address": address,
                "startblock": current_block,
                "endblock": 99999999,
                "sort": "asc",
                "apikey": api_key,
                "offset": 10000,
                "page": 1,
            })
            data = resp.json()
            if data["status"] != "1" or not data["result"]:
                break
            txs = data["result"]
            for tx in txs:
                yield tx
            if len(txs) < 10000:
                break  # last page
            # Restart from the last received block, NOT +1: the 10,000-record
            # cut can split a block, and skipping ahead would silently drop the
            # rest of its transactions. The boundary block repeats, so filter
            # duplicates downstream (see the deduplication section).
            current_block = int(txs[-1]["blockNumber"])
            await asyncio.sleep(0.2)  # stay under the 5 req/sec limit
```
Important nuance: if a single block contains more than 10,000 transactions for the address (theoretically possible for contracts like USDT), the loop will hang on that block. The solution is additional pagination inside the block via the `page` parameter.
Etherscan API Alternatives
Alchemy / Infura / QuickNode Enhanced APIs allow requests like "all transactions for address X" without the 10,000-record limit:
```typescript
import { Alchemy, Network } from 'alchemy-sdk';

const alchemy = new Alchemy({
  apiKey: process.env.ALCHEMY_KEY,
  network: Network.ETH_MAINNET,
});

// Get all asset transfers (ETH, ERC-20, ERC-721, ERC-1155)
const transfers = await alchemy.core.getAssetTransfers({
  fromAddress: '0x...',
  category: ['external', 'erc20', 'erc721', 'erc1155'],
  withMetadata: true,
  maxCount: 1000,
});

// Continue via pageKey
if (transfers.pageKey) {
  const more = await alchemy.core.getAssetTransfers({
    pageKey: transfers.pageKey,
    // ... same parameters
  });
}
```
Alchemy's Asset Transfers API is much more powerful than Etherscan's: no 10K limit, pagination via `pageKey`, and ETH plus all token transfers in one request.
Moralis Web3 API:
```typescript
import Moralis from 'moralis';

await Moralis.start({ apiKey: process.env.MORALIS_KEY });

const response = await Moralis.EvmApi.transaction.getWalletTransactions({
  chain: '0x1',
  address: '0x...',
  limit: 100,
  cursor: undefined, // for pagination
});

const { result, cursor } = response.toJSON();
```

Moralis also offers getWalletTokenTransfers, getNFTTransfers, and cross-chain queries.
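Whatever the SDK, the cursor loop is the same pattern; a generic, self-contained sketch where `fetch_page` is a hypothetical callable returning one page of results:

```python
from typing import Callable, Iterator, Optional

def paginate_by_cursor(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Drain a cursor-paginated endpoint.

    `fetch_page(cursor)` is expected to return a page shaped like
    {"result": [...], "cursor": "<next>" | None}; iteration stops when
    the cursor comes back empty.
    """
    cursor: Optional[str] = None
    while True:
        page = fetch_page(cursor)
        yield from page.get("result", [])
        cursor = page.get("cursor")
        if not cursor:  # empty/missing cursor means the last page
            break
```

Wrapping the loop as a generator lets the caller stream records into storage without buffering the whole history in memory.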
Parsing Explorer HTML (When API Isn't Enough)
Sometimes data exists only in the web interface, not in the API: the token holders list on Etherscan, lists of verified contracts, some internal calls. In that case, you scrape the HTML.
```python
import httpx
from bs4 import BeautifulSoup
import asyncio

async def get_token_holders(token_address: str, pages: int = 10) -> list[dict]:
    """Parse top token holders from Etherscan."""
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Accept": "text/html",
    }
    holders = []
    async with httpx.AsyncClient(headers=headers) as client:
        for page in range(1, pages + 1):
            resp = await client.get(
                f"https://etherscan.io/token/{token_address}",
                params={"a": "#holders", "p": page},
            )
            soup = BeautifulSoup(resp.text, "html.parser")
            table = soup.find("table", {"id": "transfersTable"})
            if not table:
                break
            for row in table.find_all("tr")[1:]:  # skip header row
                cols = row.find_all("td")
                if len(cols) >= 3:
                    holders.append({
                        "rank": cols[0].text.strip(),
                        "address": cols[1].find("a")["href"].split("/")[-1],
                        "quantity": cols[2].text.strip(),
                        "percentage": cols[3].text.strip() if len(cols) > 3 else None,
                    })
            await asyncio.sleep(2)  # be polite to the server
    return holders
```
Etherscan sits behind Cloudflare bot protection, and aggressive scraping can earn a temporary IP ban. Better to use residential proxies or, wherever possible, the official API.
Direct Node Work
For maximum data completeness: your own node (Erigon with the trace API enabled, which recovers internal transactions) or your own Etherscan-like indexer. More expensive, but it gives you:
- Internal transactions without limits
- Storage-slot-level data
- Call traces for MEV analysis
```bash
# eth_getBlockReceipts: all receipts of a block in one request
curl -X POST $ETH_RPC_URL \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"eth_getBlockReceipts","params":["0x1234567"],"id":1}'
```
`eth_getBlockReceipts` (originally a non-standard JSON-RPC extension, supported by Erigon and providers such as Alchemy and Infura) returns all receipts of a block in one request, which is significantly more efficient than N separate `eth_getTransactionReceipt` calls.
Storage and Deduplication
Parallel collection from multiple sources creates duplicates. The strategy: `INSERT ... ON CONFLICT (tx_hash) DO NOTHING` for transactions, and a unique constraint on `(tx_hash, log_index)` for events.
```sql
CREATE TABLE eth_transactions (
    tx_hash      CHAR(66) PRIMARY KEY,
    block_number BIGINT NOT NULL,
    from_address CHAR(42) NOT NULL,
    to_address   CHAR(42),
    value        NUMERIC(38) DEFAULT 0,
    gas_used     BIGINT,
    status       SMALLINT,
    ts           TIMESTAMPTZ
);

-- Safe upsert: duplicates are silently skipped
INSERT INTO eth_transactions VALUES (...)
ON CONFLICT (tx_hash) DO NOTHING;
```
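Filtering duplicates in memory before the batch insert also saves database round trips; a sketch (lowercasing hashes for a stable key is an assumption here, adapt it to your schema):

```python
def dedup_transactions(txs: list[dict]) -> list[dict]:
    """Keep the first occurrence of each tx_hash, normalizing case for stable keys."""
    seen: set[str] = set()
    out: list[dict] = []
    for tx in txs:
        key = tx["tx_hash"].lower()
        if key in seen:
            continue  # duplicate from another source, skip
        seen.add(key)
        out.append({**tx, "tx_hash": key})
    return out
```

The database-level `ON CONFLICT` clause stays as the safety net for anything that slips through, e.g. duplicates arriving in different batches.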