Scraping DeFi Protocol Data (TVL, APY, Pools)
The task looks simple at first: collect TVL and APY across protocols, store them in a database, serve them via an API. In practice, each protocol has its own calculation logic, some data lives only on-chain, some arrives via subgraphs with a delay, and APY changes every block. On top of that, many protocols deploy different versions on different chains with incompatible ABIs.
Data sources and their specifics
The Graph: primary aggregated data source
Most major protocols have official subgraphs: Uniswap, Curve, Aave, Compound, Balancer, Yearn. The Graph Studio allows querying historical and current data via GraphQL.
Problems we face:
Latency. Subgraphs update with a 1-10 minute delay after on-chain events. That rules them out for real-time monitoring; for historical data and dashboards it's fine.
Outdated subgraphs. The Uniswap V2 subgraph has long been unmaintained by the team, so its data may be incomplete. The official Uniswap V3 subgraph periodically lags during high-volume periods.
Pagination. The Graph returns at most 1000 records per query, and skip is capped at 5000, so fetching all Uniswap V3 pools (>50,000) requires cursor pagination on id (the id_gt pattern):
query GetPools($lastId: String) {
  pools(first: 1000, where: { id_gt: $lastId }, orderBy: id) {
    id
    token0 { symbol, decimals }
    token1 { symbol, decimals }
    totalValueLockedUSD
    volumeUSD
    feeTier
  }
}
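The pagination loop around that query can be sketched as follows. The fetchPage callback is an assumption standing in for a graphql-request call with the GetPools query; abstracting it keeps the loop itself testable without network access.

```typescript
// Minimal row shape for the fields this sketch needs.
interface PoolRow {
  id: string;
  totalValueLockedUSD: string;
}

// Generic id_gt cursor pagination: request pages of up to `pageSize`
// rows until a short page signals the end of the data set.
async function fetchAllPools(
  fetchPage: (lastId: string) => Promise<PoolRow[]>,
  pageSize = 1000,
): Promise<PoolRow[]> {
  const all: PoolRow[] = [];
  let lastId = ""; // every pool id sorts after the empty string
  for (;;) {
    const page = await fetchPage(lastId);
    all.push(...page);
    if (page.length < pageSize) break; // short page => no more data
    lastId = page[page.length - 1].id; // next query: id_gt last seen id
  }
  return all;
}
```

Unlike skip-based paging, this pattern has no offset cap and stays O(1) per page on the indexer side.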
A note on TVL in The Graph: the Uniswap V3 subgraph calculates TVL as the sum of token values in USD via its internal price feed. That feed sometimes produces wrong values for illiquid tokens: a pool with $500k of real TVL can show as $50M because of a manipulated price on a single token. Sanity-check against an external source.
On-chain queries for accurate data
For data that must be accurate and fresh, make direct eth_call queries to the contracts:
Aave v3 TVL: per asset, aToken.totalSupply() already reflects accrued interest (scaled supply times the liquidityIndex from Pool.getReserveData(asset)). Repeat for each asset in each market.
Curve APY: gauge.inflation_rate() gives the current CRV emission rate; Minter.minted(gauge, user) tracks what a user has already claimed. Real reward APY = (crv_per_year * crv_price) / gauge_tvl_usd.
Uniswap V3 fee APY: positions' tokensOwed0/1 hold a position's accumulated fees. For a pool-wide estimate, take the delta of pool.feeGrowthGlobal0X128 over a period and divide by active liquidity.
Multicall3 (0xcA11bde05977b3631167028862bE2a173976CA11) is deployed at the same address on all major chains and batches hundreds of eth_call reads into a single RPC request. Instead of 100 separate round trips, one batch. Critical for scraping performance.
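The batching can be sketched as follows. Call3 mirrors the struct taken by Multicall3's aggregate3; the 500-calls-per-batch default is an assumption (one oversized batch can exceed the node's gas cap on eth_call), and the ABI encoding of callData via an ethers.js Interface is left out.

```typescript
// Shape of one entry in Multicall3.aggregate3(Call3[] calldata calls).
interface Call3 {
  target: string;        // contract address to call
  allowFailure: boolean; // true: one reverting call doesn't revert the batch
  callData: string;      // ABI-encoded calldata, e.g. iface.encodeFunctionData(...)
}

// Same address on every major chain.
const MULTICALL3 = "0xcA11bde05977b3631167028862bE2a173976CA11";

// Split an arbitrary list of calls into batches so a single aggregate3
// stays within the node's eth_call gas limit. 500 is a conservative guess.
function batchCalls(calls: Call3[], batchSize = 500): Call3[][] {
  const batches: Call3[][] = [];
  for (let i = 0; i < calls.length; i += batchSize) {
    batches.push(calls.slice(i, i + batchSize));
  }
  return batches;
}
```

Each batch then goes out as one eth_call to the MULTICALL3 address; with allowFailure: true, a single broken pool contract doesn't poison the whole scrape.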
DeFi Llama API
https://api.llama.fi is a public API, no key required, with TVL data across most protocols. Endpoints:
GET /tvl/{protocol} → current TVL
GET /protocol/{protocol} → historical TVL + breakdown
GET https://yields.llama.fi/pools → APY across all pools (~10k records; the yields API lives on a separate host)
/pools is a gold mine: APY already calculated for thousands of pools across chains. But DeFi Llama refreshes only every few minutes; for real-time tasks you still need your own calculation.
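A sketch of post-processing that response. The field names (chain, project, tvlUsd, apy) match what the yields API is assumed to return; verify against live output. The TVL floor is an arbitrary example threshold.

```typescript
// Assumed shape of one entry in the yields /pools response.
interface LlamaPool {
  chain: string;
  project: string;
  symbol: string;
  tvlUsd: number;
  apy: number | null;
}

// Keep only pools worth tracking: a minimum-TVL floor drops the dust
// pools that dominate the ~10k-row response, and null/NaN APYs go too.
function filterPools(pools: LlamaPool[], minTvlUsd = 100_000): LlamaPool[] {
  return pools.filter(
    (p) => p.tvlUsd >= minTvlUsd && p.apy !== null && Number.isFinite(p.apy),
  );
}

// Usage (Node 18+ global fetch):
// const res = await fetch("https://yields.llama.fi/pools");
// const { data } = (await res.json()) as { data: LlamaPool[] };
// const tracked = filterPools(data);
```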
Data collection architecture
Collection layers
Scheduler (cron / event-driven)
├── GraphQL Fetcher (The Graph subgraphs)
├── On-chain Fetcher (Multicall3 + ethers.js)
├── HTTP Fetcher (DeFi Llama, CoinGecko)
└── WebSocket Listener (real-time events)
↓
Normalizer (single format)
↓
TimescaleDB / PostgreSQL
↓
API (REST/GraphQL)
The normalizer is the key component. Each protocol returns data in its own format; normalization maps everything to { protocolId, chainId, poolAddress, tvlUsd, apy, timestamp }. A single schema enables cross-protocol comparison.
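A minimal sketch of that schema plus one adapter. The Uniswap V3 subgraph returns numbers as strings and carries no APY field, so its adapter maps apy to null (fee APY is computed separately); the type and function names here are illustrative, not from any library.

```typescript
// Single normalized record: one schema for every source, so Uniswap,
// Aave and DeFi Llama rows land in the same TimescaleDB hypertable.
interface PoolSnapshot {
  protocolId: string;
  chainId: number;
  poolAddress: string;
  tvlUsd: number;
  apy: number | null; // null when the source provides no yield figure
  timestamp: number;  // unix seconds
}

// Example adapter for one raw Uniswap V3 subgraph row.
function fromUniswapV3Subgraph(
  raw: { id: string; totalValueLockedUSD: string },
  chainId: number,
  timestamp: number,
): PoolSnapshot {
  return {
    protocolId: "uniswap-v3",
    chainId,
    poolAddress: raw.id.toLowerCase(), // canonical casing for joins
    tvlUsd: Number(raw.totalValueLockedUSD),
    apy: null,
    timestamp,
  };
}
```

One adapter per (protocol, source) pair keeps the quirks contained; everything downstream of the normalizer sees only PoolSnapshot.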
APY calculation
APY is Annual Percentage Yield, i.e. with compounding. Most DeFi protocols report APR (no compounding), which needs conversion:
APY = (1 + APR/n)^n - 1, where n is the number of compounding periods per year.
For lending protocols the rate effectively compounds (Aave v3 accrues interest every second via liquidityRate, so per-second compounding applies). For LP positions it doesn't: fees accrue without reinvestment.
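The conversion in code, with the Aave-style per-second case as a special instance (per-second compounding is numerically indistinguishable from continuous compounding, e^APR - 1):

```typescript
// APR -> APY with n compounding periods per year: (1 + APR/n)^n - 1.
function aprToApy(apr: number, periodsPerYear: number): number {
  return (1 + apr / periodsPerYear) ** periodsPerYear - 1;
}

// Aave-style per-second accrual: n = seconds in a year.
const SECONDS_PER_YEAR = 31_536_000;
const aaveApy = (liquidityRate: number) =>
  aprToApy(liquidityRate, SECONDS_PER_YEAR);
```

For example, 10% APR compounded monthly gives roughly 10.47% APY; the gap grows with the rate, which is why skipping this conversion visibly misstates high-yield pools.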
Real APY components for a Uniswap V3 LP position:
- Trading fees APR (depends on volume and position range)
- Liquidity mining rewards (if any incentives)
- Minus IL (historical estimate)
An APY figure that ignores IL misleads users. Show both numbers.
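A rough pool-level estimate of the first component, trading-fee APR, from fields the subgraph query above already returns. This is a sketch: it assumes the Uniswap V3 feeTier convention (3000 = 0.30%, i.e. hundredths of a basis point) and ignores concentration, so an in-range concentrated position earns a multiple of this figure.

```typescript
// Fee APR ≈ (24h volume * fee fraction * 365) / TVL.
// feeTier is in hundredths of a bip: 3000 => 0.003 (0.30%).
function poolFeeApr(
  volume24hUsd: number,
  feeTier: number,
  tvlUsd: number,
): number {
  if (tvlUsd <= 0) return 0; // avoid division by zero on empty pools
  const dailyFeesUsd = volume24hUsd * (feeTier / 1_000_000);
  return (dailyFeesUsd * 365) / tvlUsd;
}
```

A 0.30% pool doing $1M daily volume on $10M TVL comes out around 11% fee APR before IL.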
Error handling and rate limiting
RPC providers enforce rate limits. Alchemy's free tier gives 300 CUPS (compute units per second). One eth_call costs 10-40 CU; a Multicall3 batch costs ~20 CU regardless of how many calls it wraps. Batch as aggressively as possible.
The Graph: the free plan is limited to about 1000 requests per day. Cache with a TTL: most of this data doesn't need refreshing more than every 5 minutes.
Retry with exponential backoff on all HTTP requests, and route failed fetches to a dead letter queue so data isn't lost on transient RPC failures.
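The retry wrapper is small enough to sketch in full; the defaults (5 retries, 200 ms base delay, full jitter) are assumptions to tune per provider.

```typescript
// Retry with exponential backoff and jitter. Wrap every HTTP/RPC fetch
// in this; after maxRetries the error propagates so the caller can push
// the job to a dead letter queue instead of silently dropping it.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries = 5,
  baseDelayMs = 200,
): Promise<T> {
  let lastErr: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      if (attempt === maxRetries) break;
      // 2^attempt growth, scaled by random jitter in [0.5, 1.0) so
      // concurrent workers don't retry in lockstep after an outage.
      const delay = baseDelayMs * 2 ** attempt * (0.5 + Math.random() / 2);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastErr;
}
```

Combined with p-limit for concurrency, this keeps a flaky RPC endpoint from either dropping data or getting hammered harder while it recovers.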
Development stack
TypeScript + Node.js for the scrapers. PostgreSQL + TimescaleDB for time-series storage. Redis for caching intermediate data. Docker Compose for local development.
ethers.js v6 for on-chain interaction, graphql-request for The Graph queries, p-limit for concurrency control (don't overwhelm the RPC).
Timeline estimates
A scraper for 2-3 protocols on one chain with a basic API: 2-3 days. A multi-protocol, multi-chain system with a historical database and normalization: 1-2 weeks, depending on the number of sources and the APY accuracy requirements.