Data Normalization System Development
When data arrives from five exchanges, three blockchains, and two social platforms, each source sends it in its own format. Binance returns timestamps in milliseconds, OKX in seconds, Telegram as UTC datetimes, and on-chain data as Unix seconds taken from the block. Amounts are just as inconsistent: wei, Gwei, strings with a decimal point. Normalization is not a boring chore; it is what makes the data usable instead of painful.
Problems of heterogeneous data
Real-world discrepancies:
Timestamps:
- Unix milliseconds (Binance, most CEX)
- Unix seconds (Ethereum blocks, Chainlink)
- ISO 8601 strings (some REST APIs)
- Relative ("2 hours ago") — social data scraping
- Timezone-aware vs naive datetimes
Amounts and prices:
- Wei (10^-18 ETH) — on-chain Ethereum
- Lamports (10^-9 SOL) — on-chain Solana
- String with decimals ("1234.567890") — Binance REST
- Integer with fixed decimals (100000000 = 1 BTC on some exchanges)
- Float64 — precision loss on big numbers
Asset identifiers:
- BTCUSDT (Binance), BTC-USDT (OKX), BTC/USDT (ccxt standard), tBTCUST (Bitfinex)
- ERC-20 address (0x2260fac...) vs ticker (WBTC)
- CoinGecko ID ("bitcoin") vs CMC ID (1)
Number formats:
- null vs "0" vs 0 vs a missing field, all meaning zero volume
- -0.0 is a valid Python/JS float with unintuitive comparison behavior
- NaN sometimes appears in third-party APIs
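The float64 pitfall from the last bullet is easy to demonstrate: a large on-chain amount silently loses its low-order digits in a float, while Decimal keeps every one (the value below is illustrative).

```python
from decimal import Decimal

# An illustrative large on-chain amount: one wei short of 100,000 ETH, in wei
raw_wei = 10**23 - 1

as_float = float(raw_wei)      # float64 keeps only ~15-17 significant digits
as_decimal = Decimal(raw_wei)  # Decimal keeps every digit
```

Round-tripping `as_float` back to an integer no longer equals `raw_wei`, which is exactly the kind of silent corruption that breaks balance reconciliation.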
Normalization Architecture
The pipeline has three layers:
Raw Data (from scrapers)
↓
[Validation Layer] — reject invalid records, log errors
↓
[Transformation Layer] — convert to unified format
↓
[Enrichment Layer] — add derived fields (USD value, normalized ticker)
↓
Normalized Storage
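The three layers amount to a single pass over each batch. A minimal sketch, where `validate`, `transform`, and `enrich` are hypothetical stand-ins for the layers described above:

```python
from dataclasses import dataclass

@dataclass
class PipelineResult:
    normalized: list
    rejected: list

def run_pipeline(raw_records, validate, transform, enrich) -> PipelineResult:
    """Push each record through validate -> transform -> enrich,
    collecting rejects instead of crashing on them."""
    normalized, rejected = [], []
    for record in raw_records:
        ok, payload = validate(record)
        if not ok:
            rejected.append({"raw": record, "error": payload})
            continue
        normalized.append(enrich(transform(payload)))
    return PipelineResult(normalized, rejected)

# Toy layers; real ones would be Pydantic validation, unit conversion, USD pricing
result = run_pipeline(
    [{"p": "50000"}, {"bad": True}],
    validate=lambda r: (("p" in r), r if "p" in r else "missing price"),
    transform=lambda r: {"price": float(r["p"])},
    enrich=lambda r: {**r, "source": "demo"},
)
```

The key design point is that rejection is a return value, not an exception: one malformed record never takes down the batch.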
Validation Layer
Before transformation comes explicit input validation, here with Pydantic v2 in Python:
from pydantic import BaseModel, ValidationError, field_validator
from decimal import Decimal
from datetime import datetime, timezone
class RawTradeEvent(BaseModel):
"""Schema for raw trade events from any exchange"""
exchange: str
raw_symbol: str
raw_price: str | float | int
raw_quantity: str | float | int
raw_timestamp: int | str | float
side: str # 'buy'/'sell' or 'BUY'/'SELL' or 1/2
raw_trade_id: str | int
@field_validator('raw_price', 'raw_quantity', mode='before')
@classmethod
def coerce_to_string(cls, v):
# Always convert to string before Decimal — avoid float precision loss
if isinstance(v, float):
return f"{v:.10f}"
return str(v)
@field_validator('side', mode='before')
@classmethod
def normalize_side(cls, v):
s = str(v).lower()
if s in ('buy', 'b', '1', 'true'):
return 'buy'
if s in ('sell', 's', '2', 'false'):
return 'sell'
raise ValueError(f"Unknown side value: {v}")
Invalid records don't crash the pipeline; they are logged to a validation_errors table for review:
async def process_batch(raw_records: list[dict]) -> tuple[list, list]:
valid, errors = [], []
for record in raw_records:
try:
validated = RawTradeEvent(**record)
valid.append(validated)
except ValidationError as e:
errors.append({
"raw": record,
"error": e.errors(),
"timestamp": datetime.utcnow(),
"source": record.get("exchange", "unknown"),
})
return valid, errors
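The same salvage-don't-crash pattern, shown Pydantic-free as a minimal self-contained sketch; the `validate_side` helper mirrors the `normalize_side` validator above:

```python
def validate_side(record: dict) -> dict:
    """Coerce exchange-specific side encodings ('BUY', 1, 'b', ...)
    into the canonical 'buy'/'sell'."""
    s = str(record.get("side", "")).lower()
    if s in ("buy", "b", "1", "true"):
        return {**record, "side": "buy"}
    if s in ("sell", "s", "2", "false"):
        return {**record, "side": "sell"}
    raise ValueError(f"Unknown side value: {record.get('side')}")

def process_batch_simple(raw_records: list[dict]) -> tuple[list, list]:
    valid, errors = [], []
    for record in raw_records:
        try:
            valid.append(validate_side(record))
        except ValueError as e:
            errors.append({"raw": record, "error": str(e)})
    return valid, errors

valid, errors = process_batch_simple([{"side": "BUY"}, {"side": "hold"}])
```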
Transformation Layer
Convert to canonical format:
@dataclass
class NormalizedTrade:
exchange: str
symbol: str # canonical: "BTC/USDT"
price: Decimal # always Decimal, no float
quantity: Decimal
quote_quantity: Decimal # price * quantity
side: str # 'buy' or 'sell'
timestamp: datetime # UTC timezone-aware
trade_id: str # unique per exchange
def normalize_timestamp(raw: int | str | float) -> datetime:
    """Convert any timestamp to a UTC timezone-aware datetime"""
    if isinstance(raw, str):
        dt = datetime.fromisoformat(raw.replace('Z', '+00:00'))
        if dt.tzinfo is None:
            # Treat naive ISO strings as UTC, never as local time
            dt = dt.replace(tzinfo=timezone.utc)
        return dt.astimezone(timezone.utc)
    ts = float(raw)
    # Auto-detect ms vs seconds: ms epochs exceed 1e12,
    # while the seconds epoch stays near 2e9 until 2033
    if ts > 1e12:
        ts = ts / 1000
    return datetime.fromtimestamp(ts, tz=timezone.utc)
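A few spot checks of the auto-detection, with the function restated compactly so the snippet is self-contained: the same instant expressed in milliseconds, seconds, and ISO 8601 should normalize identically.

```python
from datetime import datetime, timezone

def normalize_timestamp(raw):
    """Convert Unix seconds/milliseconds or ISO 8601 to a UTC datetime."""
    if isinstance(raw, str):
        dt = datetime.fromisoformat(raw.replace('Z', '+00:00'))
        return dt.astimezone(timezone.utc)
    ts = float(raw)
    if ts > 1e12:  # millisecond epochs exceed 1e12; seconds stay ~2e9 until 2033
        ts = ts / 1000
    return datetime.fromtimestamp(ts, tz=timezone.utc)

# One instant, three source formats
ms = normalize_timestamp(1700000000000)
s = normalize_timestamp(1700000000)
iso = normalize_timestamp("2023-11-14T22:13:20Z")
```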
def parse_decimal(value: str) -> Decimal:
    """Safe conversion to Decimal; rejects NaN and Infinity"""
    try:
        d = Decimal(str(value))
    except Exception as e:
        raise ValueError(f"Cannot parse decimal from '{value}': {e}")
    # Decimal("NaN") parses without error, so check explicitly
    if d.is_nan() or d.is_infinite():
        raise ValueError(f"Non-finite decimal: {value}")
    return d
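The explicit is_nan/is_infinite guard exists because `Decimal("NaN")` parses silently, and NaN then misbehaves in comparisons:

```python
from decimal import Decimal

# Decimal accepts the string "NaN" without raising; only an explicit
# check keeps such values out of the pipeline
quiet = Decimal("NaN")
parsed_ok = not (quiet.is_nan() or quiet.is_infinite())

# NaN also breaks the usual expectation that a value equals itself
nan_equals_itself = quiet == quiet

good = Decimal("1234.567890")
```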
Symbol normalization
To map tickers between exchanges, use the ccxt-compatible BASE/QUOTE convention:
SYMBOL_MAPPINGS = {
"binance": {
"BTCUSDT": "BTC/USDT",
"ETHUSDT": "ETH/USDT",
"BTCBUSD": "BTC/BUSD",
},
"okx": {
"BTC-USDT": "BTC/USDT",
"BTC-USDT-SWAP": "BTC/USDT:USDT", # perpetual
},
}
def normalize_symbol(raw_symbol: str, exchange: str) -> str:
exchange_map = SYMBOL_MAPPINGS.get(exchange, {})
if raw_symbol in exchange_map:
return exchange_map[raw_symbol]
    # Fallback: strip a known quote-asset suffix; the length guard keeps
    # a bare ticker like "USDT" from producing an empty base
    for quote in ['USDT', 'USDC', 'BTC', 'ETH']:
        if raw_symbol.endswith(quote) and len(raw_symbol) > len(quote):
            base = raw_symbol[:-len(quote)]
            return f"{base}/{quote}"
raise ValueError(f"Cannot normalize symbol '{raw_symbol}' for '{exchange}'")
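A compact, self-contained version of the same mapping-plus-fallback logic; the mapping table is trimmed for illustration, and the length guard keeps a bare quote ticker from mapping to an empty base:

```python
SYMBOL_MAP = {"binance": {"BTCUSDT": "BTC/USDT"}}  # trimmed for illustration

def normalize_symbol(raw_symbol: str, exchange: str) -> str:
    mapped = SYMBOL_MAP.get(exchange, {}).get(raw_symbol)
    if mapped:
        return mapped
    # Fallback: strip a known quote-asset suffix
    for quote in ("USDT", "USDC", "BTC", "ETH"):
        if raw_symbol.endswith(quote) and len(raw_symbol) > len(quote):
            return f"{raw_symbol[:-len(quote)]}/{quote}"
    raise ValueError(f"Cannot normalize symbol '{raw_symbol}' for '{exchange}'")
```

Note the quote list is ordered: suffixes are tried in priority order, so ambiguous tickers resolve deterministically.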
On-chain specifics: Decimal normalization
Tokens declare different numbers of decimals (ETH and most ERC-20s: 18, USDC: 6), so raw integer amounts must be scaled:
async def normalize_token_amount(
raw_amount: int, # wei or equivalent
token_address: str,
network: str,
) -> Decimal:
decimals = await token_decimals_cache.get(token_address, network)
# Use Decimal for exact division
divisor = Decimal(10) ** decimals
return Decimal(raw_amount) / divisor
# Token decimals are immutable, so cache them aggressively
class TokenDecimalsCache:
    def __init__(self):
        self._cache: dict[tuple[str, str], int] = {}

    async def get(self, address: str, network: str) -> int:
        key = (address.lower(), network)
        if key not in self._cache:
            # _fetch_decimals: an on-chain decimals() call (implementation omitted)
            self._cache[key] = await self._fetch_decimals(address, network)
        return self._cache[key]
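The cache ultimately feeds a pure conversion, which can be exercised standalone (18 decimals for ETH and 6 for USDC are the standard values):

```python
from decimal import Decimal

def to_token_units(raw_amount: int, decimals: int) -> Decimal:
    """Scale an integer on-chain amount down to human-readable token units."""
    return Decimal(raw_amount) / (Decimal(10) ** decimals)

one_eth = to_token_units(10**18, 18)     # 10^18 wei -> 1 ETH
half_usdc = to_token_units(500_000, 6)   # USDC uses 6 decimals
```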
Schema Registry: Format versioning
Data sources change: a Binance API update adds a field or changes the timestamp format. Without schema versioning, the entire normalization layer breaks at once.
Solution: schema registry (like Confluent Schema Registry for Kafka):
SCHEMA_VERSIONS = {
"binance_trade": {
"v1": BinanceTradeV1Schema, # before 2023-01
"v2": BinanceTradeV2Schema, # after 2023-01: added quoteQty
}
}
# Each record stores source schema version
# raw_data: {"_schema": "binance_trade:v2", "T": 1234567890123, "p": "50000.00", ...}
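Dispatching on the stored schema tag can be sketched as follows; the parser names and the raw field layout (`T`, `p`, `q`) are illustrative assumptions, not Binance's actual v1/v2 contract:

```python
def parse_binance_trade_v1(raw: dict) -> dict:
    return {"price": raw["p"], "ts_ms": raw["T"]}

def parse_binance_trade_v2(raw: dict) -> dict:
    # v2 additionally carries the quote quantity
    return {"price": raw["p"], "ts_ms": raw["T"], "quote_qty": raw["q"]}

PARSERS = {
    "binance_trade:v1": parse_binance_trade_v1,
    "binance_trade:v2": parse_binance_trade_v2,
}

def parse_record(raw: dict) -> dict:
    schema = raw.get("_schema", "binance_trade:v1")  # assume v1 for untagged legacy rows
    parser = PARSERS.get(schema)
    if parser is None:
        raise ValueError(f"No parser registered for schema '{schema}'")
    return parser(raw)

row = parse_record({"_schema": "binance_trade:v2",
                    "T": 1234567890123, "p": "50000.00", "q": "0.5"})
```

An unknown tag fails loudly rather than being parsed with the wrong schema, which is the whole point of the registry.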
Data Quality Monitoring
Normalization without monitoring is an illusion of quality. Key metrics:
-- Error rate by source last hour
SELECT
source,
COUNT(*) FILTER (WHERE status = 'error') AS errors,
COUNT(*) AS total,
ROUND(100.0 * COUNT(*) FILTER (WHERE status = 'error') / COUNT(*), 2) AS error_rate_pct
FROM normalization_log
WHERE created_at > NOW() - INTERVAL '1 hour'
GROUP BY source
ORDER BY error_rate_pct DESC;
Alert when the error rate exceeds 5%: it almost always means the source format changed and the schema needs updating.
Cross-source consistency: the same BTC price at the same moment shouldn't diverge by more than 0.5% between exchanges. If it does, it is either a normalization error or a real arbitrage opportunity.
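The cross-source check can be computed per symbol and timestamp bucket. A minimal sketch with illustrative prices and the 0.5% threshold from above:

```python
from decimal import Decimal

def max_divergence_pct(prices: dict[str, Decimal]) -> Decimal:
    """Spread between the highest and lowest source price, as % of the lowest."""
    lo, hi = min(prices.values()), max(prices.values())
    return (hi - lo) / lo * Decimal(100)

# Illustrative snapshot for one symbol at one timestamp bucket
snapshot = {
    "binance": Decimal("50000"),
    "okx": Decimal("50100"),
    "onchain": Decimal("50050"),
}
spread = max_divergence_pct(snapshot)
alert = spread > Decimal("0.5")  # the 0.5% consistency threshold
```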
Tech Stack
| Component | Choice |
|---|---|
| Schema validation | Pydantic v2 (Python) or Zod (TypeScript) |
| Numeric handling | Python decimal.Decimal, PostgreSQL numeric |
| Queue | Redis Streams or Kafka |
| Storage | PostgreSQL (normalized) + S3 (raw backup) |
| Schema registry | Custom or Confluent |
| Quality monitoring | dbt tests + Prometheus |
Always save raw data to S3 before normalization. If a normalization error is discovered later, you can re-run against the source data without re-collecting it.
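The raw-first ingestion order can be sketched as below; `raw_store` stands in for S3 and `normalize` for the pipeline above, both hypothetical here:

```python
import hashlib
import json

def ingest(record: dict, raw_store: dict, normalized_store: list, normalize) -> str:
    """Persist the raw payload first, then attempt normalization.
    A failed normalize() leaves the raw copy intact for replay."""
    payload = json.dumps(record, sort_keys=True)
    key = hashlib.sha256(payload.encode()).hexdigest()
    raw_store[key] = payload          # raw backup written before any parsing
    try:
        normalized_store.append(normalize(record))
    except Exception:
        pass                          # raw data is safe; re-run after fixing the bug
    return key

raw, norm = {}, []
ingest({"p": "oops"}, raw, norm, normalize=lambda r: float(r["p"]))  # normalization fails
ingest({"p": "1.5"}, raw, norm, normalize=lambda r: float(r["p"]))
```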
Developing a normalization layer for 5-7 sources takes 1-2 weeks. Complexity grows non-linearly with source count: each new source brings its own edge cases.