Scraper Monitoring and Alert Setup
The parser crashed at 3 AM, the data stopped updating, and nobody knew until morning — a classic story. Scraper monitoring isn't "set up Prometheus and forget it"; it's a deliberate signal system: what exactly broke, how critical it is, and who gets notified, through which channel.
What to monitor
Three classes of problems with different criticality:
Complete failure — the parser crashed and no data is coming in. Detected via the last_successful_run timestamp.
Partial failure — the parser runs, but the data is incomplete or wrong. Harder to detect and the most dangerous class: a silent error is worse than an obvious one.
Degradation — the parser runs slower, data arrives with a delay, rate-limit errors accumulate.
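For illustration, the three classes can be told apart mechanically from run statistics. A minimal sketch, where every name (RunStats, classifyRun, the threshold fields) is hypothetical:

```typescript
type FailureClass = 'complete_failure' | 'partial_failure' | 'degradation' | 'healthy'

interface RunStats {
  minutesSinceLastSuccess: number
  expectedIntervalMinutes: number
  recordsFetched: number
  minExpectedRecords: number
  durationMs: number
  maxDurationMs: number
}

function classifyRun(s: RunStats): FailureClass {
  // No fresh data at all: the parser is down or stuck
  if (s.minutesSinceLastSuccess > s.expectedIntervalMinutes * 1.5) return 'complete_failure'
  // Data arrives but is suspiciously thin: the silent, most dangerous case
  if (s.recordsFetched < s.minExpectedRecords) return 'partial_failure'
  // Everything arrives, just slower than the allowed budget
  if (s.durationMs > s.maxDurationMs) return 'degradation'
  return 'healthy'
}
```

The order of the checks matters: staleness trumps everything else, because a stale scraper's other numbers are meaningless.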
Heartbeat metric: monitoring foundation
Each parser run should record its result:
class ScraperMonitor {
  constructor(private db: Database, private alerter: AlertService) {}

  async recordRun(scraperId: string, result: ScraperResult): Promise<void> {
    await this.db('scraper_runs').insert({
      scraper_id: scraperId,
      started_at: result.startedAt,
      finished_at: result.finishedAt,
      duration_ms: result.finishedAt.getTime() - result.startedAt.getTime(),
      records_fetched: result.recordsFetched,
      records_saved: result.recordsSaved,
      errors_count: result.errors.length,
      status: result.errors.length === 0 ? 'success' : 'partial_failure',
      error_details: result.errors.length > 0 ? JSON.stringify(result.errors) : null,
    })
    await this.checkThresholds(scraperId, result)
  }

  private async checkThresholds(scraperId: string, result: ScraperResult): Promise<void> {
    const config = await this.getScraperConfig(scraperId)
    const durationMs = result.finishedAt.getTime() - result.startedAt.getTime()

    // Too few records — possible empty API response or changed structure
    if (result.recordsFetched < config.minExpectedRecords) {
      await this.alerter.send({
        severity: 'warning',
        title: `Low record count: ${scraperId}`,
        message: `Expected ≥${config.minExpectedRecords}, got ${result.recordsFetched}`,
      })
    }

    // Too slow
    if (durationMs > config.maxDurationMs) {
      await this.alerter.send({
        severity: 'warning',
        title: `Slow scraper: ${scraperId}`,
        message: `Took ${durationMs}ms, threshold ${config.maxDurationMs}ms`,
      })
    }
  }
}
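The monitor assumes roughly these shapes for ScraperResult and the per-scraper config. The field names are inferred from the usage above; everything else (the error-entry shape in particular) is guesswork:

```typescript
interface ScraperResult {
  startedAt: Date
  finishedAt: Date
  recordsFetched: number
  recordsSaved: number
  // Shape of error entries is an assumption; anything serializable works
  errors: Array<{ message: string; url?: string }>
}

interface ScraperConfig {
  minExpectedRecords: number     // below this, suspect an empty response or changed markup
  maxDurationMs: number          // above this, the run counts as degraded
  expectedIntervalMinutes: number // how often the scraper is supposed to run
}

// A successful run with no errors maps to status 'success'
const example: ScraperResult = {
  startedAt: new Date(0),
  finishedAt: new Date(1500),
  recordsFetched: 4521,
  recordsSaved: 4521,
  errors: [],
}
```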
Staleness detection: data is outdated
Primary check — when data last successfully updated:
-- Scrapers not updated beyond expected interval
SELECT
  sc.id,
  sc.name,
  sc.expected_interval_minutes,
  MAX(sr.finished_at) AS last_success,
  EXTRACT(EPOCH FROM (NOW() - MAX(sr.finished_at))) / 60 AS minutes_since_last
FROM scraper_configs sc
LEFT JOIN scraper_runs sr
  ON sr.scraper_id = sc.id AND sr.status = 'success'
GROUP BY sc.id, sc.name, sc.expected_interval_minutes
HAVING EXTRACT(EPOCH FROM (NOW() - MAX(sr.finished_at))) / 60 > sc.expected_interval_minutes * 1.5
ORDER BY minutes_since_last DESC;
This query runs every 5 minutes in a separate watchdog process. If the worker itself breaks, it can't report its own failure — so the watchdog must be independent of it.
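The watchdog loop itself can be very small. A sketch, assuming queryStaleScrapers runs the SQL above and returns the matching rows (both function names and the row shape are illustrative):

```typescript
interface StaleRow { id: string; name: string; minutes_since_last: number }
interface WatchdogAlert { severity: string; title: string; message: string }

// One tick of the watchdog: find stale scrapers, fire one alert per scraper.
// Dependencies are injected so the tick is trivially testable.
async function watchdogTick(
  queryStaleScrapers: () => Promise<StaleRow[]>,
  sendAlert: (a: WatchdogAlert) => Promise<void>,
): Promise<number> {
  const stale = await queryStaleScrapers()
  for (const s of stale) {
    await sendAlert({
      severity: 'critical',
      title: `Stale scraper: ${s.name}`,
      message: `No successful run for ${Math.round(s.minutes_since_last)} minutes`,
    })
  }
  return stale.length
}

// In the watchdog process, something like:
// setInterval(() => watchdogTick(query, alert).catch(console.error), 5 * 60_000)
```

Keeping the tick pure in its dependencies also makes it easy to run the same check from a cron job instead of setInterval.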
Alerting: channels and priorities
class AlertService {
  async send(alert: Alert): Promise<void> {
    const handlers = this.getHandlersForSeverity(alert.severity)
    await Promise.all(handlers.map(h => h.send(alert)))
  }

  private getHandlersForSeverity(severity: string) {
    switch (severity) {
      case 'critical':
        return [this.telegram, this.pagerDuty] // wakes people
      case 'warning':
        return [this.telegram] // during work hours
      case 'info':
        return [this.slackChannel] // for logs
      default:
        return [] // unknown severity must not crash the caller
    }
  }
}
class TelegramAlerter {
  constructor(private token: string, private chatId: string) {}

  async send(alert: Alert): Promise<void> {
    const emoji = alert.severity === 'critical' ? '🔴' : '🟡'
    const text = `${emoji} *${alert.title}*\n\n${alert.message}\n\n_${new Date().toISOString()}_`
    await fetch(`https://api.telegram.org/bot${this.token}/sendMessage`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        chat_id: this.chatId,
        text,
        parse_mode: 'Markdown',
      }),
    })
  }
}
Grafana dashboard for visual monitoring
Key panels:
Success rate per scraper — percentage of successful runs over the last 24h. Alert if it drops below 95%.
Records per run — a time series of collected record counts. An abnormal drop is immediately visible.
Duration heatmap — distribution of execution times. Slow outliers signal problems.
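For the success-rate panel, a PromQL sketch. It assumes runs are counted with a scraper_runs_total counter labeled by status, which the metrics below don't include — adjust to whatever your scraper actually exports:

```promql
# Fraction of successful runs per scraper over the last 24h
sum by (scraper) (increase(scraper_runs_total{status="success"}[24h]))
/
sum by (scraper) (increase(scraper_runs_total[24h]))
```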
# Prometheus metrics example from scraper
scraper_run_duration_seconds{scraper="coingecko"} 1.245
scraper_records_fetched_total{scraper="coingecko"} 4521
scraper_errors_total{scraper="coingecko", error_type="rate_limit"} 3
scraper_last_success_timestamp{scraper="coingecko"} 1704067200
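In production you would export these via a client library (prom-client is the usual choice for Node). Purely to show what the scrape endpoint returns, here is a hand-rolled sketch of the exposition text format — all names besides the metric names above are illustrative:

```typescript
type Labels = Record<string, string>

// Render one sample in Prometheus text exposition format
function formatMetric(name: string, labels: Labels, value: number): string {
  const pairs = Object.entries(labels).map(([k, v]) => `${k}="${v}"`).join(', ')
  return `${name}{${pairs}} ${value}`
}

// Body served at GET /metrics for a single scraper's latest run
function scrapeEndpointBody(
  scraper: string,
  run: { durationSeconds: number; recordsFetched: number; lastSuccessUnix: number },
): string {
  return [
    formatMetric('scraper_run_duration_seconds', { scraper }, run.durationSeconds),
    formatMetric('scraper_records_fetched_total', { scraper }, run.recordsFetched),
    formatMetric('scraper_last_success_timestamp', { scraper }, run.lastSuccessUnix),
  ].join('\n')
}
```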
Prometheus/Grafana alert rule:
groups:
  - name: scraper_alerts
    rules:
      - alert: ScraperDown
        expr: time() - scraper_last_success_timestamp > 600  # 10 minutes
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Scraper {{ $labels.scraper }} has not run successfully for 10+ minutes"

      - alert: ScraperLowRecords
        # _total is a counter, so compare its increase, not the raw value;
        # the window should cover at least one full run interval
        expr: increase(scraper_records_fetched_total[30m]) < 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Scraper {{ $labels.scraper }} fetching unusually few records"
A basic monitoring setup — Prometheus + Grafana + Telegram alerting — takes about a day. A full system with per-scraper thresholds, a dashboard, and PagerDuty integration takes 2-3 days.