Scraper Monitoring and Alert Setup
The parser crashed at 3 AM, the data stopped updating, and nobody knew until morning — a classic story. Scraper monitoring isn't "set up Prometheus and forget it"; it's a deliberate signal system: what exactly broke, how critical it is, and who gets notified, through which channel.
What to monitor
Three classes of problems with different criticality:
Complete failure — the parser crashed and no data is coming in. Detected via the last_successful_run timestamp.
Partial failure — the parser runs, but the data is incomplete or wrong. Harder to detect and the most dangerous class: a silent error is worse than an obvious one.
Degradation — the parser runs slower, data arrives with a delay, rate-limit errors accumulate.
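For illustration, the three classes can be told apart mechanically from run statistics. A minimal sketch, where every name (RunStats, classifyRun, the threshold fields) is hypothetical:

```typescript
type FailureClass = 'complete_failure' | 'partial_failure' | 'degradation' | 'healthy'

interface RunStats {
  minutesSinceLastSuccess: number
  expectedIntervalMinutes: number
  recordsFetched: number
  minExpectedRecords: number
  durationMs: number
  maxDurationMs: number
}

function classifyRun(s: RunStats): FailureClass {
  // No fresh data at all: the parser is down or stuck
  if (s.minutesSinceLastSuccess > s.expectedIntervalMinutes * 1.5) return 'complete_failure'
  // Data arrives but is suspiciously thin: the silent, most dangerous case
  if (s.recordsFetched < s.minExpectedRecords) return 'partial_failure'
  // Everything arrives, just slower than the allowed budget
  if (s.durationMs > s.maxDurationMs) return 'degradation'
  return 'healthy'
}
```

The order of the checks matters: staleness trumps everything else, because a stale scraper's other numbers are meaningless.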
Heartbeat metric: monitoring foundation
Each parser run should record its result:
class ScraperMonitor {
  constructor(private db: Database, private alerter: AlertService) {}

  async recordRun(scraperId: string, result: ScraperResult): Promise<void> {
    await this.db('scraper_runs').insert({
      scraper_id: scraperId,
      started_at: result.startedAt,
      finished_at: result.finishedAt,
      duration_ms: result.finishedAt.getTime() - result.startedAt.getTime(),
      records_fetched: result.recordsFetched,
      records_saved: result.recordsSaved,
      errors_count: result.errors.length,
      status: result.errors.length === 0 ? 'success' : 'partial_failure',
      error_details: result.errors.length > 0 ? JSON.stringify(result.errors) : null,
    })
    await this.checkThresholds(scraperId, result)
  }

  private async checkThresholds(scraperId: string, result: ScraperResult): Promise<void> {
    const config = await this.getScraperConfig(scraperId)
    const durationMs = result.finishedAt.getTime() - result.startedAt.getTime()

    // Too few records — possible empty API response or changed structure
    if (result.recordsFetched < config.minExpectedRecords) {
      await this.alerter.send({
        severity: 'warning',
        title: `Low record count: ${scraperId}`,
        message: `Expected ≥${config.minExpectedRecords}, got ${result.recordsFetched}`,
      })
    }

    // Too slow
    if (durationMs > config.maxDurationMs) {
      await this.alerter.send({
        severity: 'warning',
        title: `Slow scraper: ${scraperId}`,
        message: `Took ${durationMs}ms, threshold ${config.maxDurationMs}ms`,
      })
    }
  }
}
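The monitor assumes roughly these shapes for ScraperResult and the per-scraper config. The field names are inferred from the usage above; everything else (the error-entry shape in particular) is guesswork:

```typescript
interface ScraperResult {
  startedAt: Date
  finishedAt: Date
  recordsFetched: number
  recordsSaved: number
  // Shape of error entries is an assumption; anything serializable works
  errors: Array<{ message: string; url?: string }>
}

interface ScraperConfig {
  minExpectedRecords: number     // below this, suspect an empty response or changed markup
  maxDurationMs: number          // above this, the run counts as degraded
  expectedIntervalMinutes: number // how often the scraper is supposed to run
}

// A successful run with no errors maps to status 'success'
const example: ScraperResult = {
  startedAt: new Date(0),
  finishedAt: new Date(1500),
  recordsFetched: 4521,
  recordsSaved: 4521,
  errors: [],
}
```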
Staleness detection: data is outdated
Primary check — when data last successfully updated:
-- Scrapers not updated beyond expected interval
SELECT
  sc.id,
  sc.name,
  sc.expected_interval_minutes,
  MAX(sr.finished_at) AS last_success,
  EXTRACT(EPOCH FROM (NOW() - MAX(sr.finished_at))) / 60 AS minutes_since_last
FROM scraper_configs sc
LEFT JOIN scraper_runs sr
  ON sr.scraper_id = sc.id AND sr.status = 'success'
GROUP BY sc.id, sc.name, sc.expected_interval_minutes
HAVING EXTRACT(EPOCH FROM (NOW() - MAX(sr.finished_at))) / 60 > sc.expected_interval_minutes * 1.5
ORDER BY minutes_since_last DESC;
This query runs every 5 minutes in a separate watchdog process. If the worker itself breaks, it can't report its own failure — so the watchdog must be independent of it.
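The watchdog loop itself can be very small. A sketch, assuming queryStaleScrapers runs the SQL above and returns the matching rows (both function names and the row shape are illustrative):

```typescript
interface StaleRow { id: string; name: string; minutes_since_last: number }
interface WatchdogAlert { severity: string; title: string; message: string }

// One tick of the watchdog: find stale scrapers, fire one alert per scraper.
// Dependencies are injected so the tick is trivially testable.
async function watchdogTick(
  queryStaleScrapers: () => Promise<StaleRow[]>,
  sendAlert: (a: WatchdogAlert) => Promise<void>,
): Promise<number> {
  const stale = await queryStaleScrapers()
  for (const s of stale) {
    await sendAlert({
      severity: 'critical',
      title: `Stale scraper: ${s.name}`,
      message: `No successful run for ${Math.round(s.minutes_since_last)} minutes`,
    })
  }
  return stale.length
}

// In the watchdog process, something like:
// setInterval(() => watchdogTick(query, alert).catch(console.error), 5 * 60_000)
```

Keeping the tick pure in its dependencies also makes it easy to run the same check from a cron job instead of setInterval.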
Alerting: channels and priorities
class AlertService {
  async send(alert: Alert): Promise<void> {
    const handlers = this.getHandlersForSeverity(alert.severity)
    await Promise.all(handlers.map(h => h.send(alert)))
  }

  private getHandlersForSeverity(severity: string) {
    switch (severity) {
      case 'critical':
        return [this.telegram, this.pagerDuty] // wakes people
      case 'warning':
        return [this.telegram] // during work hours
      case 'info':
        return [this.slackChannel] // for logs
      default:
        return [] // unknown severity must not crash the caller
    }
  }
}
class TelegramAlerter {
  constructor(private token: string, private chatId: string) {}

  async send(alert: Alert): Promise<void> {
    const emoji = alert.severity === 'critical' ? '🔴' : '🟡'
    const text = `${emoji} *${alert.title}*\n\n${alert.message}\n\n_${new Date().toISOString()}_`
    await fetch(`https://api.telegram.org/bot${this.token}/sendMessage`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        chat_id: this.chatId,
        text,
        parse_mode: 'Markdown',
      }),
    })
  }
}
Grafana dashboard for visual monitoring
Key panels:
Success rate per scraper — percentage of successful runs over the last 24h. Alert if it drops below 95%.
Records per run — a time series of collected record counts. An abnormal drop is immediately visible.
Duration heatmap — distribution of execution times. Slow outliers signal problems.
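For the success-rate panel, a PromQL sketch. It assumes runs are counted with a scraper_runs_total counter labeled by status, which the metrics below don't include — adjust to whatever your scraper actually exports:

```promql
# Fraction of successful runs per scraper over the last 24h
sum by (scraper) (increase(scraper_runs_total{status="success"}[24h]))
/
sum by (scraper) (increase(scraper_runs_total[24h]))
```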
# Prometheus metrics example from scraper
scraper_run_duration_seconds{scraper="coingecko"} 1.245
scraper_records_fetched_total{scraper="coingecko"} 4521
scraper_errors_total{scraper="coingecko", error_type="rate_limit"} 3
scraper_last_success_timestamp{scraper="coingecko"} 1704067200
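In production you would export these via a client library (prom-client is the usual choice for Node). Purely to show what the scrape endpoint returns, here is a hand-rolled sketch of the exposition text format — all names besides the metric names above are illustrative:

```typescript
type Labels = Record<string, string>

// Render one sample in Prometheus text exposition format
function formatMetric(name: string, labels: Labels, value: number): string {
  const pairs = Object.entries(labels).map(([k, v]) => `${k}="${v}"`).join(', ')
  return `${name}{${pairs}} ${value}`
}

// Body served at GET /metrics for a single scraper's latest run
function scrapeEndpointBody(
  scraper: string,
  run: { durationSeconds: number; recordsFetched: number; lastSuccessUnix: number },
): string {
  return [
    formatMetric('scraper_run_duration_seconds', { scraper }, run.durationSeconds),
    formatMetric('scraper_records_fetched_total', { scraper }, run.recordsFetched),
    formatMetric('scraper_last_success_timestamp', { scraper }, run.lastSuccessUnix),
  ].join('\n')
}
```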
Prometheus/Grafana alert rule:
groups:
  - name: scraper_alerts
    rules:
      - alert: ScraperDown
        expr: time() - scraper_last_success_timestamp > 600  # 10 minutes
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Scraper {{ $labels.scraper }} has not run successfully for 10+ minutes"

      - alert: ScraperLowRecords
        # _total is a counter, so compare its increase, not the raw value;
        # the window should cover at least one full run interval
        expr: increase(scraper_records_fetched_total[30m]) < 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Scraper {{ $labels.scraper }} fetching unusually few records"
A basic monitoring setup — Prometheus + Grafana + Telegram alerting — takes about a day. A full system with per-scraper thresholds, a dashboard, and PagerDuty integration takes 2-3 days.