Multi-blockchain Data Aggregation System Development
The task sounds simple: "collect data from multiple blockchains." In practice, it is one of the most technically complex tasks in Web3 infrastructure. Each network has its own data model, its own finalization logic, its own RPC API, its own rate limits, and its own error semantics. Ethereum produces a block roughly every 12 seconds, Solana delivers ~400 ms slots and counts confirmations differently, and TON has a sharded architecture in which "block" is a loose concept. Collecting all of this into a unified API with consistent data is a nontrivial engineering problem.
The problem of heterogeneity: why you can't just "query all RPC"
Different data models
EVM networks (Ethereum, Arbitrum, Polygon, BSC) share a common model: blocks, transactions, receipts with logs. But there are differences even here:
- Arbitrum adds `l1BlockNumber` and specific system transactions (sequencer batch submissions)
- Optimism/Base have a `depositedTx` type for L1→L2 transactions, which lacks a standard `from`
- zkSync Era uses native account abstraction: there is no distinction between EOAs and contracts, all accounts are contracts
Solana is a completely different paradigm: there is no "transaction invoked a contract method"; instead, a transaction carries instructions that are passed to programs. To decode them you need the ABI's analog, the IDL (Interface Definition Language, in the Anchor format).
UTXO models (Bitcoin, Litecoin) are fundamentally different: there are no account balances, only unspent transaction outputs. An "address balance" is the sum of all unspent outputs locked to that address.
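A minimal sketch of that derived-balance idea, with a simplified `Utxo` shape invented for illustration (a real indexer would track spends via transaction inputs rather than a `spent` flag):

```typescript
// Minimal UTXO model: a balance is derived, not stored.
interface Utxo {
  txid: string;
  vout: number;     // output index within the transaction
  address: string;
  value: bigint;    // satoshis
  spent: boolean;   // simplification: real indexers derive this from tx inputs
}

// "Address balance" = sum of all unspent outputs locked to the address.
function utxoBalance(utxos: Utxo[], address: string): bigint {
  return utxos
    .filter((u) => u.address === address && !u.spent)
    .reduce((acc, u) => acc + u.value, 0n);
}
```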
Different finalization semantics
| Network | Mechanism | Finality |
|---|---|---|
| Ethereum | PoS + Casper FFG | ~15 min (finalized checkpoint) |
| Arbitrum One | Optimistic Rollup | ~7 days (fraud proof window) for L1 finality |
| Polygon PoS | Heimdall checkpoints | ~30 min for Ethereum finality |
| Solana | Tower BFT | ~12-32 slots (~6–16 sec) |
| Bitcoin | PoW | 6 confirmations (~60 min) — conventional standard |
If the system doesn't account for this, its data will be wrong: a transaction will look "final" by confirmation count and then get reorganized away.
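One way to encode the table above is a per-chain finality rule. The thresholds below are illustrative only; a production system would load them from config and, where the chain exposes its own finality signal (e.g. Ethereum's `finalized` block tag), prefer that over raw confirmation counts:

```typescript
type Finality = "unconfirmed" | "safe" | "finalized";

// Illustrative thresholds; real values belong in per-chain config.
const FINALITY_RULES: Record<string, { safe: number; finalized: number }> = {
  ethereum: { safe: 6, finalized: 75 },  // ~75 blocks ≈ 2 epochs ≈ 15 min
  bitcoin: { safe: 1, finalized: 6 },    // 6-confirmation convention
  solana: { safe: 1, finalized: 32 },    // ~32 slots
};

function classifyFinality(chain: string, confirmations: number): Finality {
  const rule = FINALITY_RULES[chain];
  if (!rule || confirmations < rule.safe) return "unconfirmed";
  return confirmations >= rule.finalized ? "finalized" : "safe";
}
```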
Aggregation system architecture
Collector layer (Chain Collectors)
Each collector is an isolated service responsible for one network. Common interface:
```typescript
type Unsubscribe = () => void;

interface ChainCollector {
  getLatestBlock(): Promise<UnifiedBlock>;
  getBlockRange(from: bigint, to: bigint): Promise<UnifiedBlock[]>;
  getTransactionsByAddress(address: string, fromBlock: bigint): Promise<UnifiedTx[]>;
  subscribeNewBlocks(callback: (block: UnifiedBlock) => void): Unsubscribe;
}
```
Unified types normalize each network's specifics:
```typescript
interface UnifiedTx {
  chain: ChainId;
  hash: string;
  blockNumber: bigint;
  timestamp: number;      // unix
  from: string;           // normalized: lowercase hex for EVM, base58 for Solana
  to: string | null;
  value: bigint;          // in smallest units of the native token
  status: 'success' | 'failed' | 'pending';
  finality: 'unconfirmed' | 'safe' | 'finalized';
  raw: unknown;           // original network data
}
```
Node and provider management
Problem: public RPC is unreliable, rate limits are unpredictable, Alchemy/Infura get expensive at scale.
Strategy: tiered provider pool
Primary: Own nodes (Geth+Lighthouse, Reth for archive)
↓ failover
Secondary: Alchemy / QuickNode (premium tier)
↓ failover
Tertiary: Infura / public RPC (only for non-critical requests)
Circuit breaker on each provider: if the error rate exceeds 5% over 60 seconds, or latency exceeds 2× the p99 baseline, remove the provider from rotation and probe it with a health check every 30 seconds.
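A minimal in-process sketch of such a breaker, with an injectable clock for testability. The error threshold and windows come from the numbers above; the minimum sample count is an added assumption (to avoid tripping on one error out of two calls), and the latency check is omitted for brevity:

```typescript
// Trip when the error rate over a sliding 60 s window exceeds 5%,
// then allow a probe request after a 30 s cooldown (half-open state).
class CircuitBreaker {
  private samples: { ts: number; ok: boolean }[] = [];
  private openedAt: number | null = null;

  constructor(
    private readonly now: () => number = Date.now,
    private readonly windowMs = 60_000,
    private readonly errorRateLimit = 0.05,
    private readonly minSamples = 20,
    private readonly cooldownMs = 30_000,
  ) {}

  record(ok: boolean): void {
    const t = this.now();
    this.samples.push({ ts: t, ok });
    this.samples = this.samples.filter((s) => t - s.ts <= this.windowMs);
    if (this.samples.length >= this.minSamples) {
      const errors = this.samples.filter((s) => !s.ok).length;
      if (errors / this.samples.length > this.errorRateLimit) this.openedAt = t;
    }
  }

  // Provider stays in rotation only while the breaker is closed.
  allowRequest(): boolean {
    if (this.openedAt === null) return true;
    if (this.now() - this.openedAt >= this.cooldownMs) {
      this.openedAt = null; // half-open: let the next request probe the provider
      return true;
    }
    return false;
  }
}
```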
For archive data (on Ethereum, historical state older than the last 128 blocks) you need an archive node, which is a separate story. Erigon takes ~3 TB for a full Ethereum archive, Reth slightly less. For most projects it's cheaper to use Alchemy Archive or QuickNode Archive than to maintain their own node.
Normalization and transformation layer
Raw blockchain data is rarely needed as-is. Typical transformations:
Decoding ERC-20 Transfer events
```typescript
const ERC20_TRANSFER_TOPIC =
  "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef";

function decodeTransfer(log: Log): TokenTransfer | null {
  if (log.topics[0] !== ERC20_TRANSFER_TOPIC) return null;
  // ERC-721 Transfer shares the same topic0 but indexes tokenId as a third
  // argument; an ERC-20 Transfer has exactly 3 topics and the amount in `data`.
  if (log.topics.length !== 3 || log.data === "0x") return null;
  return {
    token: log.address,
    from: `0x${log.topics[1].slice(26)}`, // last 20 bytes of the 32-byte topic
    to: `0x${log.topics[2].slice(26)}`,
    amount: BigInt(log.data),
  };
}
```
Token data enrichment: for each log.address you need to know symbol, decimals, USD price. Cache token metadata in Redis with TTL 24h, update prices every 30 sec from CoinGecko/CoinMarketCap.
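An in-memory sketch of that cache with an injectable clock; in production the store would be Redis with a 24h TTL for `symbol`/`decimals` and a short TTL for prices. The `TokenMeta` shape and key format are assumptions for illustration:

```typescript
interface TokenMeta { symbol: string; decimals: number }

// Generic TTL cache with lazy expiry on read.
class TtlCache<V> {
  private store = new Map<string, { value: V; expiresAt: number }>();
  constructor(private readonly now: () => number = Date.now) {}

  set(key: string, value: V, ttlMs: number): void {
    this.store.set(key, { value, expiresAt: this.now() + ttlMs });
  }

  get(key: string): V | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.store.delete(key); // expired: drop and report a miss
      return undefined;
    }
    return entry.value;
  }
}
```

Keys would combine chain and contract, e.g. `${chainId}:${log.address}`, so the same contract address on different networks never collides.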
Cross-chain aggregation: if you need to show "total address balance across all networks in USD", you need to normalize different decimals, convert through price feeds, handle wrapped versions of same token (USDC on Ethereum ≠ USDC.e on Arbitrum).
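A sketch of the decimals-normalization step, assuming prices are already fetched and wrapped variants already mapped to one canonical asset. Bigint keeps on-chain amounts exact; only the final USD figure is floating point:

```typescript
// Convert a raw token amount with its decimals into a USD value.
// Number() is fine for display; accounting-grade code needs a decimal library,
// since amounts above 2^53 smallest units lose precision here.
function toUsd(rawAmount: bigint, decimals: number, priceUsd: number): number {
  return (Number(rawAmount) / 10 ** decimals) * priceUsd;
}

// Total address balance across networks. Callers must first map wrapped
// variants (USDC vs USDC.e) onto one canonical asset before summing.
function totalUsd(
  positions: { amount: bigint; decimals: number; priceUsd: number }[],
): number {
  return positions.reduce((acc, p) => acc + toUsd(p.amount, p.decimals, p.priceUsd), 0);
}
```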
Storage layer
For hot data (last 7–30 days): PostgreSQL partitioned by chain_id + date. Indexes on (chain_id, address, block_number) and (chain_id, tx_hash). TimescaleDB hypertables for high-volume tables, with automatic compression of old partitions.
For cold data (archive): ClickHouse — columnar database, order of magnitude more efficient than PostgreSQL for analytical queries over large periods. Query "all USDC transactions > $10k during 2023 across all EVM networks" on 100M+ rows — ClickHouse gives result in seconds, PostgreSQL in minutes.
For address/hash search: ElasticSearch, or plain PostgreSQL. A hash index is sufficient for exact-match lookups; prefix search with LIKE additionally needs a B-tree index with text_pattern_ops.
Reorg handling
This is the trickiest part of the system. Algorithm:
1. Save each block with an `is_canonical = true` flag and its `parent_hash`.
2. A new block with the same `block_number` but a different `hash` signals a potential reorg.
3. Walk back via `parent_hash` until a common ancestor is found.
4. Mark all blocks on the "old" branch as `is_canonical = false` and insert the blocks from the "new" branch.
5. The output API always filters by `is_canonical = true`.
6. Webhooks and downstream systems receive `tx.orphaned` events for revoked transactions.
On Ethereum, reorg depth has rarely exceeded 2 blocks post-Merge. On Polygon PoS, reorgs of 30+ blocks have been observed. Observation buffer: 128 blocks for EVM networks.
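The walk-back step of the algorithm above can be sketched as a pure function over an in-memory block index; the `StoredBlock` shape and hash strings are simplified stand-ins for real 32-byte hashes:

```typescript
interface StoredBlock { hash: string; parentHash: string; number: number }

// Given our stored blocks, the tip of the chain we believed canonical, and an
// incoming conflicting branch (ordered oldest -> newest), return the blocks
// that must be flagged is_canonical = false.
function findOrphanedBlocks(
  byHash: Map<string, StoredBlock>,
  canonicalTipHash: string,
  newBranch: StoredBlock[],
): StoredBlock[] {
  const newBranchHashes = new Set(newBranch.map((b) => b.hash));
  // Common ancestor: the parent of the oldest block on the new branch.
  const forkParent = newBranch[0].parentHash;
  const orphaned: StoredBlock[] = [];
  let cursor = byHash.get(canonicalTipHash);
  while (cursor && cursor.hash !== forkParent && !newBranchHashes.has(cursor.hash)) {
    orphaned.push(cursor);
    cursor = byHash.get(cursor.parentHash);
  }
  return orphaned; // newest -> oldest; each one triggers tx.orphaned events
}
```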
API layer
REST + WebSocket for real-time:
GET /v1/address/{address}/transactions?chains=eth,arb,polygon&limit=50
GET /v1/tx/{chain}/{hash}
GET /v1/address/{address}/token-balances?chains=eth,bsc
WS /v1/subscribe?address={addr}&chains=eth,arb&events=transfer,swap
GraphQL is convenient if clients need query flexibility: one request fetches transactions + balances + token metadata. But it adds backend complexity, namely N+1 query problems that call for batching via the DataLoader pattern.
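The batching idea behind DataLoader fits in a few lines: loads requested within one tick are coalesced into a single batch call. This is a sketch; the real `dataloader` package also handles per-key caching and error propagation:

```typescript
// Coalesce same-tick load() calls into one batchFn invocation.
class Batcher<K, V> {
  private queue: { key: K; resolve: (v: V) => void }[] = [];
  constructor(private readonly batchFn: (keys: K[]) => Promise<V[]>) {}

  load(key: K): Promise<V> {
    return new Promise((resolve) => {
      if (this.queue.length === 0) {
        // Flush on the next microtask so synchronous loads share one batch.
        queueMicrotask(() => this.flush());
      }
      this.queue.push({ key, resolve });
    });
  }

  private async flush(): Promise<void> {
    const batch = this.queue;
    this.queue = [];
    const values = await this.batchFn(batch.map((b) => b.key));
    batch.forEach((b, i) => b.resolve(values[i]));
  }
}
```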
Rate limiting: per-API-key, sliding window, separate limits for REST and WebSocket (WebSocket connections are more expensive). Redis + Lua script for atomic increments.
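A single-process sketch of the sliding window; the production version keeps the per-key timestamps in Redis (e.g. a sorted set) and runs the trim + count + insert steps inside one Lua script so concurrent API nodes stay atomic:

```typescript
// In-memory sliding-window limiter keyed by API key.
class SlidingWindowLimiter {
  private hits = new Map<string, number[]>(); // apiKey -> request timestamps

  constructor(
    private readonly limit: number,
    private readonly windowMs: number,
    private readonly now: () => number = Date.now,
  ) {}

  allow(apiKey: string): boolean {
    const t = this.now();
    const windowStart = t - this.windowMs;
    // Drop timestamps that have slid out of the window.
    const recent = (this.hits.get(apiKey) ?? []).filter((ts) => ts > windowStart);
    if (recent.length >= this.limit) {
      this.hits.set(apiKey, recent);
      return false;
    }
    recent.push(t);
    this.hits.set(apiKey, recent);
    return true;
  }
}
```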
Monitoring and operations
Critical metrics:
- Collector lag: the difference between the latest block timestamp on the network and its processing time in our system. Alert when lag > 2 min.
- Reorg depth: maximum reorg depth over the last 24h. Alert when depth > 10.
- RPC error rate: per provider and per method. Alert when > 1%.
- Queue depth: if the processor can't keep up with the collector, the queue grows. Alert when depth > 10k messages.
Grafana dashboard with per-chain panels: current block, lag, TPS, error rate.
Stack
| Component | Technology |
|---|---|
| Collectors | Node.js (viem/ethers) + Go for high-load networks |
| Queue | Apache Kafka (high throughput) or RabbitMQ (moderate) |
| Hot storage | PostgreSQL 15 + TimescaleDB |
| Cold storage | ClickHouse |
| Cache | Redis Cluster |
| API | Node.js (Fastify) or Go (Fiber) |
| Monitoring | Prometheus + Grafana + PagerDuty |
| Orchestration | Kubernetes with HPA on collectors |
A realistic MVP timeline (3–4 EVM networks, no archive, REST API): 8–12 weeks. A complete system with 10+ networks, ClickHouse, WebSocket, and monitoring: 5–7 months.