Scraping Data from Crypto News Aggregators
The crypto market reacts to news faster than traditional finance: the lag between publication of an article about a regulatory action and the resulting price move can be seconds. Trading systems, risk monitoring, and sentiment analysis all need a structured news feed with minimal latency.
Sources and data retrieval methods
RSS/Atom feeds (most reliable)
CoinDesk, Cointelegraph, The Block, and Decrypt all publish RSS feeds. This is the official, stable channel:
import Parser from "rss-parser";

interface NewsItem {
  source: string;
  title: string;
  url: string;
  publishedAt: Date;
  summary: string;
  guid: string;
}

const parser = new Parser({
  customFields: {
    item: [["media:content", "media", { keepArray: false }]],
  },
});

const feeds: Record<string, string> = {
  coindesk: "https://www.coindesk.com/arc/outboundfeeds/rss/",
  cointelegraph: "https://cointelegraph.com/rss",
  theblock: "https://www.theblock.co/rss.xml",
  decrypt: "https://decrypt.co/feed",
};

async function fetchFeed(source: string, url: string): Promise<NewsItem[]> {
  const feed = await parser.parseURL(url);
  return feed.items.map((item) => ({
    source,
    title: item.title ?? "",
    url: item.link ?? "",
    publishedAt: new Date(item.pubDate ?? ""),
    summary: item.contentSnippet ?? "",
    guid: item.guid ?? item.link ?? "",
  }));
}
Polling every 5 minutes is a reasonable balance between freshness and load on the source. Deduplicate by guid.
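A minimal in-memory sketch of guid-based deduplication across polls (the `FeedItem` shape and `dedupe` helper are illustrative, not from any library); in production, the database's unique constraint is the backstop:

```typescript
interface FeedItem {
  guid: string;
  title: string;
}

// Keep only items whose guid has not been seen before; mutates `seen`.
function dedupe(items: FeedItem[], seen: Set<string>): FeedItem[] {
  const fresh: FeedItem[] = [];
  for (const item of items) {
    if (!seen.has(item.guid)) {
      seen.add(item.guid);
      fresh.push(item);
    }
  }
  return fresh;
}
```

Call this on every poll cycle with a shared `seen` set; items returned by a previous cycle are silently dropped.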
Official APIs
CryptoPanic API — news aggregator with sentiment scoring:
GET https://cryptopanic.com/api/v1/posts/?auth_token={key}&currencies=BTC,ETH&kind=news
Returns structured data with bullish/bearish community votes.
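The community votes can be collapsed into a single signed sentiment score. The `Votes` field names below are assumptions for illustration, not the official schema:

```typescript
// Hypothetical vote shape; adjust to the actual API response.
interface Votes {
  positive: number;
  negative: number;
}

// Map votes to a score in [-1, 1]: -1 fully bearish, +1 fully bullish.
function sentimentScore(v: Votes): number {
  const total = v.positive + v.negative;
  if (total === 0) return 0;
  return (v.positive - v.negative) / total;
}
```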
Messari API — quality news with asset tags:
GET https://data.messari.io/api/v1/news?page=1&limit=50
Santiment — news + on-chain data + social metrics in one API.
HTML scraping (when no API)
For sources without RSS or an API, fall back to HTML scraping with cheerio (Node.js) or BeautifulSoup (Python). This is the most fragile approach: any markup change breaks the parser. For critical sources, monitor the parsing success rate and alert quickly on a drop.
import * as cheerio from "cheerio";

// Selectors are site-specific and will need updating whenever the markup changes.
async function scrapeBlockworks(html: string): Promise<NewsItem[]> {
  const $ = cheerio.load(html);
  return $("article.post-card")
    .map((_, el) => ({
      source: "blockworks",
      title: $(el).find("h2.post-title").text().trim(),
      url: $(el).find("a").attr("href") ?? "",
      publishedAt: new Date($(el).find("time").attr("datetime") ?? ""),
      summary: $(el).find("p.excerpt").text().trim(),
      guid: $(el).find("a").attr("href") ?? "",
    }))
    .get();
}
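One way to implement the success-rate monitoring mentioned above is a sliding window per source. `ScrapeMonitor` is a hypothetical helper sketched here, not a library class:

```typescript
// Tracks the last N scrape attempts per source and flags when the
// success rate drops below a threshold.
class ScrapeMonitor {
  private results: boolean[] = [];

  constructor(private windowSize = 20, private threshold = 0.5) {}

  record(ok: boolean): void {
    this.results.push(ok);
    if (this.results.length > this.windowSize) this.results.shift();
  }

  healthy(): boolean {
    // Too little data to judge yet.
    if (this.results.length < this.windowSize) return true;
    const okCount = this.results.filter(Boolean).length;
    return okCount / this.results.length >= this.threshold;
  }
}
```

Record `scrapeBlockworks` returning zero items as a failure; a sudden run of empty results usually means the selectors broke.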
Processing and storage
Deduplication is critical: the same story can appear in multiple sources. Detect duplicates by normalized URL (strip UTM parameters) combined with title similarity (cosine similarity or Levenshtein distance).
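A sketch of URL normalization using the standard `URL` API; the exact list of tracking parameters to strip is an illustrative assumption:

```typescript
// Strip tracking params, fragments, and trailing slashes so the same
// article linked from different places normalizes to one key.
function normalizeUrl(raw: string): string {
  const u = new URL(raw);
  for (const key of [...u.searchParams.keys()]) {
    if (key.startsWith("utm_") || key === "ref" || key === "fbclid") {
      u.searchParams.delete(key);
    }
  }
  u.hash = "";
  return u.toString().replace(/\/$/, "");
}
```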
CREATE TABLE news_items (
  id BIGSERIAL PRIMARY KEY,
  source VARCHAR(50) NOT NULL,
  external_id VARCHAR(255) NOT NULL, -- guid from RSS
  title TEXT NOT NULL,
  url TEXT NOT NULL,
  published_at TIMESTAMPTZ NOT NULL,
  summary TEXT,
  raw_content TEXT,
  tags TEXT[], -- ['bitcoin', 'regulation', 'SEC']
  UNIQUE (source, external_id)
);

CREATE INDEX idx_news_published ON news_items (published_at DESC);
CREATE INDEX idx_news_tags ON news_items USING GIN (tags);
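Inserts can then lean on the unique constraint so same-source duplicates are dropped at the database level (Postgres placeholder style):

```sql
INSERT INTO news_items (source, external_id, title, url, published_at, summary, tags)
VALUES ($1, $2, $3, $4, $5, $6, $7)
ON CONFLICT (source, external_id) DO NOTHING;
```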
Asset tagging: determine which crypto assets a story mentions by matching against a list of tickers and names. A simple regex-based approach reaches roughly 80–90% accuracy for major assets.
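The regex approach can be sketched as follows; the asset-to-alias map is a tiny illustrative subset that you would replace with a full ticker list:

```typescript
// Illustrative subset; a real deployment would load the full asset list.
const ASSETS: Record<string, string[]> = {
  bitcoin: ["BTC", "Bitcoin"],
  ethereum: ["ETH", "Ethereum", "Ether"],
  solana: ["SOL", "Solana"],
};

function tagAssets(text: string): string[] {
  const tags: string[] = [];
  for (const [slug, aliases] of Object.entries(ASSETS)) {
    // Word boundaries avoid matching "SOL" inside "RESOLUTION".
    const re = new RegExp(`\\b(${aliases.join("|")})\\b`, "i");
    if (re.test(text)) tags.push(slug);
  }
  return tags;
}
```

Case-insensitive matching of short tickers produces occasional false positives ("sol" as a standalone word); tightening to case-sensitive ticker matches is the usual next step.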
What matters in production
Robots.txt and rate limiting: respect source rules and don't generate excessive load. Add jitter between requests; evenly spaced requests look like a bot.
User-Agent: identify yourself correctly. Some sources block headless browser user-agents.
Freshness monitoring: if the latest item from a source is older than 30 minutes, alert. An RSS feed can go stale without any explicit error.
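The jitter advice amounts to randomizing the delay around the base interval; helper names here are illustrative:

```typescript
// Base interval plus a random offset, so polls never land on an exact schedule.
function jitteredDelay(baseMs: number, jitterMs: number): number {
  return baseMs + Math.floor(Math.random() * jitterMs);
}

// Poll forever with a jittered pause between cycles: 5 min base, up to 60 s extra.
async function pollLoop(fn: () => Promise<void>): Promise<void> {
  for (;;) {
    await fn();
    await new Promise((r) => setTimeout(r, jitteredDelay(5 * 60_000, 60_000)));
  }
}
```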
A realistic timeline for an aggregator covering 10–15 sources, including the API and storage layer: 2–3 weeks.