Blockchain Infrastructure Scaling
Infrastructure that works fine with 100 users starts crumbling at 10,000. What makes the blockchain stack peculiar is that the bottleneck is often not where you expect: not the database, not the CPU, but the RPC node that can't keep up with eth_getLogs, the indexer lagging 50 blocks behind the chain head, or the WebSocket handler dropping connections under load. Scaling blockchain infrastructure is a discipline of its own, with patterns unfamiliar from classic web scaling.
Diagnosis: Where the Real Bottleneck Is
Before scaling anything, measure. Typical bottlenecks:
// Instrumentation of RPC calls
class InstrumentedProvider {
  private metrics: Map<string, number[]> = new Map();
  private errors: Map<string, number> = new Map();

  constructor(private provider: { send(method: string, params: any[]): Promise<any> }) {}

  async call(method: string, params: any[]): Promise<any> {
    const start = performance.now();
    try {
      const result = await this.provider.send(method, params);
      this.record(method, performance.now() - start);
      return result;
    } catch (err) {
      this.recordError(method);
      throw err;
    }
  }

  private record(method: string, ms: number) {
    if (!this.metrics.has(method)) this.metrics.set(method, []);
    this.metrics.get(method)!.push(ms);
  }

  private recordError(method: string) {
    this.errors.set(method, (this.errors.get(method) ?? 0) + 1);
  }

  getPercentiles(method: string) {
    const samples = (this.metrics.get(method) || []).sort((a, b) => a - b);
    return {
      p50: samples[Math.floor(samples.length * 0.5)],
      p95: samples[Math.floor(samples.length * 0.95)],
      p99: samples[Math.floor(samples.length * 0.99)],
      count: samples.length,
    };
  }
}
What to measure: latency per RPC method, queue depth in the indexer, the lag between the node's head block and the head recorded in your DB, and WebSocket connection count and message throughput.
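The lag metric in particular is cheap to probe. A minimal sketch, where the two head fetchers (`getChainHead`, `getIndexedHead`) are assumed names for whatever your codebase exposes; they are injected here so the probe stays testable:

```typescript
// Hypothetical indexer-lag probe: compares the chain head reported by the
// node with the last block written to our own DB.
type HeadFetcher = () => Promise<number>;

export class LagProbe {
  constructor(
    private getChainHead: HeadFetcher,   // e.g. provider.getBlockNumber()
    private getIndexedHead: HeadFetcher, // e.g. SELECT max(block) FROM blocks
  ) {}

  // Returns how many blocks the indexer trails the chain head.
  async measure(): Promise<number> {
    const [chain, indexed] = await Promise.all([
      this.getChainHead(),
      this.getIndexedHead(),
    ]);
    // Clamp: a "negative" lag just means the DB saw a block first (reorg/race)
    return Math.max(0, chain - indexed);
  }
}
```

Feed the result into a gauge (see the Prometheus section below) and alert on it.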
Scaling RPC Layer
Node Pool with Load Balancing
A single node is both a single point of failure and a bottleneck. A minimal production configuration:
interface RpcNode {
  url: string;
  send(method: string, params: any[]): Promise<any>;
}

class NodePool {
  private currentIndex = 0;
  private healthStatus: Map<string, boolean> = new Map();

  constructor(private nodes: RpcNode[]) {
    // Every node starts healthy; otherwise the round-robin below skips all of them
    for (const node of nodes) this.healthStatus.set(node.url, true);
  }

  async sendRequest(method: string, params: any[]): Promise<any> {
    // Round-robin, skipping unhealthy nodes
    for (let i = 0; i < this.nodes.length; i++) {
      const node = this.nodes[this.currentIndex % this.nodes.length];
      this.currentIndex++;
      if (!this.healthStatus.get(node.url)) continue;
      try {
        return await node.send(method, params);
      } catch (err) {
        // Mark the node unhealthy and let it back in after a cooldown
        this.healthStatus.set(node.url, false);
        setTimeout(() => this.healthStatus.set(node.url, true), 30_000);
      }
    }
    throw new Error('All nodes unhealthy');
  }
}
For stateful operations (subscriptions, pending transactions) use sticky routing: a given client always talks to the same node.
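A minimal sticky-routing sketch: hash the client id to a stable node index, so the same client keeps landing on the same node. FNV-1a is an illustrative choice here, not a requirement; any stable hash works:

```typescript
// FNV-1a: a small, stable string hash (illustrative; any stable hash works)
export function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

// Same clientId always maps to the same node (as long as the pool is stable)
export function pickNode<T>(clientId: string, nodes: T[]): T {
  return nodes[fnv1a(clientId) % nodes.length];
}
```

Note the caveat: plain modulo remaps most clients when the pool changes size; consistent hashing fixes that if nodes come and go often.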
Caching RPC Responses
Many requests are identical and cacheable:
const CACHEABLE_METHODS: Record<string, number> = {
  'eth_chainId': 86400,              // 24 hours — doesn't change
  'eth_getCode': 3600,               // 1 hour — contract code is stable
  'eth_getBlockByNumber': 60,        // 1 min for non-latest
  'eth_getTransactionReceipt': 300,  // 5 min — doesn't change after finalization
};

class CachingRpcProxy {
  constructor(
    private upstream: { send(method: string, params: any[]): Promise<any> },
    private redis: {
      get(key: string): Promise<string | null>;
      setex(key: string, ttl: number, value: string): Promise<unknown>;
    },
  ) {}

  async send(method: string, params: any[]): Promise<any> {
    const ttl = CACHEABLE_METHODS[method];
    if (!ttl) return this.upstream.send(method, params);
    // Don't cache 'latest'/'pending' block tags
    if (params.includes('latest') || params.includes('pending')) {
      return this.upstream.send(method, params);
    }
    const cacheKey = `rpc:${method}:${JSON.stringify(params)}`;
    const cached = await this.redis.get(cacheKey);
    if (cached) return JSON.parse(cached);
    const result = await this.upstream.send(method, params);
    // Don't cache nulls, e.g. a receipt for a not-yet-mined transaction
    if (result !== null) {
      await this.redis.setex(cacheKey, ttl, JSON.stringify(result));
    }
    return result;
  }
}
eth_getCode is especially cacheable: the code of a deployed contract is effectively immutable (barring selfdestruct-and-redeploy edge cases), so one request can serve the contract's entire lifetime.
Indexing: From Polling to Event-Driven
Problem with Polling
// Bad: polling every 5 seconds
let lastBalance = 0n;
setInterval(async () => {
  const balance = await provider.getBalance(address);
  if (balance !== lastBalance) {
    lastBalance = balance;
    notifyUser(balance);
  }
}, 5000);
With 10,000 addresses that's 2,000 requests per second just for balance monitoring. The node chokes.
Event-Driven via Logs
EVM events are the right tool:
// Precomputed topic0 hashes (a literal like id('Transfer(...)') would never match)
const TRANSFER_TOPIC = ethers.id('Transfer(address,address,uint256)');
const APPROVAL_TOPIC = ethers.id('Approval(address,address,uint256)');
const DEPOSIT_TOPIC = ethers.id('Deposit(address,uint256)');

class EventIndexer {
  constructor(
    private provider: ethers.Provider,
    private lastProcessedBlock: number,
  ) {}

  async start() {
    // Subscribe to new blocks
    this.provider.on('block', async (blockNumber) => {
      await this.processRange(this.lastProcessedBlock + 1, blockNumber);
      this.lastProcessedBlock = blockNumber;
    });
  }

  private async processRange(from: number, to: number) {
    // One request for all interesting events across all blocks in the range
    const logs = await this.provider.getLogs({
      fromBlock: from,
      toBlock: to,
      topics: [
        // OR-logic: any of these event selectors in topic0
        [TRANSFER_TOPIC, APPROVAL_TOPIC, DEPOSIT_TOPIC],
      ],
    });
    // Group by event type and process in batches (Map.groupBy: Node 21+)
    const grouped = Map.groupBy(logs, (log) => log.topics[0]);
    await Promise.all([
      this.processTransfers(grouped.get(TRANSFER_TOPIC) ?? []),
      this.processApprovals(grouped.get(APPROVAL_TOPIC) ?? []),
    ]);
  }
}
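The subscription above covers the live head of the chain. For historical backfill, providers cap eth_getLogs ranges (commonly a few thousand blocks), so a backfill loop should chunk the range and halve the window whenever the provider rejects it. A sketch, with `getLogs` injected as a stand-in for provider.getLogs:

```typescript
// Chunked eth_getLogs backfill with adaptive window halving.
type LogFetcher = (fromBlock: number, toBlock: number) => Promise<unknown[]>;

export async function backfill(
  getLogs: LogFetcher,
  from: number,
  to: number,
  chunk = 2000, // starting window; provider limits vary, this is an assumption
): Promise<unknown[]> {
  const out: unknown[] = [];
  let start = from;
  let size = chunk;
  while (start <= to) {
    const end = Math.min(start + size - 1, to);
    try {
      out.push(...(await getLogs(start, end)));
      start = end + 1;
    } catch {
      // Window too wide (or too many results) for this provider: halve and retry
      if (size === 1) throw new Error(`getLogs failed for block ${start}`);
      size = Math.ceil(size / 2);
    }
  }
  return out;
}
```

A production version would also persist `start` after each chunk so a crash resumes where it left off.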
The Graph for Complex Indexing
For complex queries (aggregations, per-user historical data), use a The Graph subgraph:
# schema.graphql
type User @entity {
  id: Bytes!
  totalDeposited: BigInt!
  transactions: [Transaction!]! @derivedFrom(field: "user")
}

type Transaction @entity {
  id: Bytes!
  user: User!
  amount: BigInt!
  blockNumber: BigInt!
  timestamp: BigInt!
}
// AssemblyScript handler in the subgraph
export function handleDeposit(event: DepositEvent): void {
  let user = User.load(event.params.user);
  if (user == null) {
    user = new User(event.params.user);
    user.totalDeposited = BigInt.zero();
  }
  user.totalDeposited = user.totalDeposited.plus(event.params.amount);
  user.save();

  // tx hash + log index: stays unique even with several deposits in one transaction
  let tx = new Transaction(event.transaction.hash.concatI32(event.logIndex.toI32()));
  tx.user = event.params.user;
  tx.amount = event.params.amount;
  tx.blockNumber = event.block.number;
  tx.timestamp = event.block.timestamp;
  tx.save();
}
Self-hosting Graph Node in production is non-trivial: it needs PostgreSQL 14+ with sufficient I/O, and indexing a large protocol from genesis takes hours. Start with the managed Graph Studio and migrate to self-hosted only when you have to.
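Querying the subgraph above is a plain GraphQL POST. A sketch with a placeholder endpoint URL; the field names match the schema above:

```typescript
// Build the GraphQL request body for one user from the subgraph schema above
export function buildUserQuery(id: string) {
  return {
    query: `{
      user(id: "${id}") {
        totalDeposited
        transactions(first: 10, orderBy: timestamp, orderDirection: desc) {
          amount
          blockNumber
        }
      }
    }`,
  };
}

// POST it to the subgraph's HTTP endpoint (placeholder URL passed in by the caller)
export async function fetchUser(endpoint: string, id: string) {
  const res = await fetch(endpoint, {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify(buildUserQuery(id)),
  });
  const { data } = await res.json();
  return data.user;
}
```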
WebSocket: Scaling Subscriptions
WebSocket connections are stateful, so you can't just add an nginx upstream and balance round-robin. You need a pub/sub layer:
// Redis Pub/Sub as the backbone for WebSocket events
import Redis from 'ioredis';

class WebSocketGateway {
  // Dedicated subscriber connection: in subscriber mode it can't run other commands
  private redisSub: Redis;
  private clients: Map<string, Set<WebSocket>> = new Map();

  constructor() {
    this.redisSub = new Redis();
    // One subscriber per server; events are fanned out to all local clients
    this.redisSub.psubscribe('blockchain:*');
    this.redisSub.on('pmessage', this.broadcastToClients.bind(this));
  }

  subscribeClient(ws: WebSocket, topic: string) {
    if (!this.clients.has(topic)) this.clients.set(topic, new Set());
    this.clients.get(topic)!.add(ws);
  }

  private broadcastToClients(pattern: string, channel: string, message: string) {
    const clients = this.clients.get(channel);
    if (!clients) return;
    const dead: WebSocket[] = [];
    for (const ws of clients) {
      if (ws.readyState !== WebSocket.OPEN) { dead.push(ws); continue; }
      ws.send(message);
    }
    dead.forEach((ws) => clients.delete(ws));
  }
}
A separate service publishes events to Redis on every new block or transaction. WS servers subscribe to Redis and fan events out to their own clients. Horizontal scaling then becomes trivial: add more WS servers, all reading the same Redis channels.
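The publisher side can stay tiny. A sketch with `publish` injected (in production, ioredis's `redis.publish`); the `blockchain:<network>:blocks` channel name is an assumption chosen to match the `blockchain:*` pattern subscription above:

```typescript
// Publish a compact new-block event that every WS gateway will receive
type Publish = (channel: string, message: string) => Promise<unknown>;

export async function publishBlock(
  publish: Publish,
  network: string,
  blockNumber: number,
  txCount: number,
): Promise<void> {
  // Channel naming is a convention, not a Redis requirement
  const channel = `blockchain:${network}:blocks`;
  await publish(channel, JSON.stringify({ blockNumber, txCount }));
}
```

Keep the payload small: gateways only need enough to notify clients, who can fetch details over RPC if they want them.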
Managing Node Load
Request Coalescing
Multiple simultaneous requests for the same resource should collapse into one:
class RequestCoalescer {
  private pending: Map<string, Promise<any>> = new Map();

  async get<T>(key: string, fetcher: () => Promise<T>): Promise<T> {
    // If the same request is already in flight, wait for its result
    if (this.pending.has(key)) {
      return this.pending.get(key)!;
    }
    const promise = fetcher().finally(() => this.pending.delete(key));
    this.pending.set(key, promise);
    return promise;
  }
}

// Usage: 100 concurrent requests for eth_getBalance of one address
// become 1 RPC request
const coalescer = new RequestCoalescer();
const balance = await coalescer.get(
  `balance:${address}:latest`,
  () => provider.getBalance(address)
);
Multicall for Batch Requests
// Instead of 100 separate balanceOf calls: one multicall.
// Multicall3 is deployed at the same address on most EVM chains.
const MULTICALL3 = '0xcA11bde05977b3631167028862bE2a173976CA11';
const multicall = new ethers.Contract(
  MULTICALL3,
  ['function aggregate3((address target, bool allowFailure, bytes callData)[] calls) view returns ((bool success, bytes returnData)[] returnData)'],
  provider,
);

const calls = addresses.map((address) => ({
  target: tokenAddress,
  allowFailure: false,
  callData: erc20Interface.encodeFunctionData('balanceOf', [address]),
}));

const results = await multicall.aggregate3(calls);
const balances = results.map((r: any) =>
  erc20Interface.decodeFunctionResult('balanceOf', r.returnData)[0]
);
One HTTP request instead of 100. For read-heavy apps touching many addresses, this pattern is mandatory.
Monitoring and Observability
Prometheus metrics for blockchain infrastructure:
import { Counter, Histogram, Gauge } from 'prom-client';

const rpcLatency = new Histogram({
  name: 'rpc_request_duration_ms',
  help: 'RPC request latency in milliseconds',
  labelNames: ['method', 'node', 'status'],
  buckets: [10, 50, 100, 250, 500, 1000, 2500, 5000],
});

const rpcErrors = new Counter({
  name: 'rpc_errors_total',
  help: 'RPC errors by method and node',
  labelNames: ['method', 'node'],
});

const indexerLag = new Gauge({
  name: 'indexer_block_lag',
  help: 'Blocks behind chain head',
  labelNames: ['network'],
});

const wsConnections = new Gauge({
  name: 'websocket_connections_total',
  help: 'Active WebSocket connections',
});
Alerts in Grafana:
- `indexer_block_lag > 10`: the indexer is falling behind, something is slow
- `histogram_quantile(0.95, rate(rpc_request_duration_ms_bucket[5m])) > 2000`: the node is degrading
- `rate(rpc_errors_total[5m]) > 10`: the node is returning errors
- `websocket_connections_total > 8000`: approaching the file-descriptor limit
Architectural Solutions: When to Apply What
| Problem | Solution | Complexity |
|---|---|---|
| RPC node is the bottleneck | Node pool + load balancing | Low |
| Repeated identical requests | Request coalescing + Redis cache | Low |
| Monitoring balances of 100+ addresses | Multicall + event indexing | Medium |
| WS drops connections under load | Redis pub/sub backbone | Medium |
| Slow historical queries | Erigon/Reth archive node + query optimization | Medium |
| Complex on-chain data analytics | The Graph subgraph | High |
| Multi-chain, thousands of events/sec | Kafka + event streaming | High |
Scaling always starts with measurement. Add complexity only where profiling has shown a real bottleneck.