Training NLP Model for Crypto News Analysis
The news stream is one of the most informative sources for understanding market movements. Major regulatory events, hacks, partnership announcements, and technology updates all surface in the news minutes before they are reflected in price. An NLP model capable of processing the news stream in real time therefore provides a temporal advantage.
News Data Collection
Sources and APIs:
- CryptoPanic API: crypto news aggregator with a free API tier (rate-limited). JSON feed with title, source, currencies, and date.
- NewsAPI: broad crypto topic coverage; 100 requests/day on the free tier.
- CoinDesk / Cointelegraph RSS: direct feeds from key publications.
- Bloomberg Crypto (paid): institutional-level coverage.
- Custom scraper: BeautifulSoup + Playwright for sites without an API.
import httpx
import feedparser
from datetime import datetime

async def fetch_cryptopanic_news(api_key, currencies=('BTC', 'ETH'), limit=50):
    url = f"https://cryptopanic.com/api/v1/posts/?auth_token={api_key}"
    url += f"&currencies={','.join(currencies)}&kind=news&limit={limit}"
    async with httpx.AsyncClient() as client:
        response = await client.get(url)
        data = response.json()
    articles = []
    for post in data.get('results', []):
        articles.append({
            'title': post['title'],
            'source': post['source']['title'],
            'published_at': post['published_at'],
            'url': post['url'],
            'currencies': [c['code'] for c in post.get('currencies', [])],
            'votes': post.get('votes', {})
        })
    return articles
News Classification Model
Task: classify each news item along multiple dimensions:
- Sentiment: positive/negative/neutral (for price)
- Category: regulation, technology, security, partnership, market, macro
- Impact score: how significant the event is (low/medium/high)
- Affected assets: which tokens are affected
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

class NewsClassifier:
    def __init__(self):
        # Fine-tuned FinBERT checkpoints on crypto news
        self.sentiment_model = AutoModelForSequenceClassification.from_pretrained(
            'crypto_finbert_sentiment'
        )
        self.category_model = AutoModelForSequenceClassification.from_pretrained(
            'crypto_news_category'
        )
        self.tokenizer = AutoTokenizer.from_pretrained('ProsusAI/finbert')
        # Must match the label order the category model was trained with
        self.category_labels = ['regulation', 'technology', 'security',
                                'partnership', 'market', 'macro']

    def classify(self, title, body=''):
        # Use title + first 200 chars of body
        text = title + ' ' + body[:200]
        inputs = self.tokenizer(text, return_tensors='pt',
                                max_length=256, truncation=True, padding=True)
        with torch.no_grad():
            sentiment_logits = self.sentiment_model(**inputs).logits
            category_logits = self.category_model(**inputs).logits
        sentiment = torch.softmax(sentiment_logits, -1)
        category = torch.softmax(category_logits, -1)
        return {
            'sentiment': {
                'positive': sentiment[0][0].item(),
                'negative': sentiment[0][1].item(),
                'neutral': sentiment[0][2].item()
            },
            'category': self.category_labels[category.argmax().item()],
            'sentiment_score': sentiment[0][0].item() - sentiment[0][1].item()
        }
Fine-tuning on Crypto News
Creating a labeled training dataset:
Automatic labeling (weak supervision):
- Regulatory decisions against crypto (SEC lawsuit, China ban) → negative
- Institutional adoption (Tesla, MicroStrategy bought BTC) → positive
- Technology upgrades (Ethereum Merge, Lightning Network) → positive
- Security incidents (exchange hack, smart contract exploit) → negative
- Market data (new price ATH, large inflows) → positive/negative depending on context
Manual labeling: selectively label 2,000–3,000 examples for quality fine-tuning.
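The weak-supervision rules above can be sketched as simple keyword heuristics. The keyword lists below are illustrative stand-ins, not a production lexicon:

```python
# Weak-supervision labeler mapping the heuristic rules to coarse labels.
# Keyword lists are illustrative; a real lexicon would be much larger.
WEAK_RULES = [
    ('negative', ['sec lawsuit', 'ban', 'hack', 'exploit', 'stolen']),
    ('positive', ['bought btc', 'adoption', 'upgrade', 'merge', 'etf approved']),
]

LABEL_IDS = {'positive': 0, 'negative': 1, 'neutral': 2}

def weak_label(title):
    """Return (label_name, label_id); falls back to 'neutral' when no rule fires."""
    text = title.lower()
    for label, keywords in WEAK_RULES:
        if any(kw in text for kw in keywords):
            return label, LABEL_IDS[label]
    return 'neutral', LABEL_IDS['neutral']
```

Weak labels bootstrap a large noisy training set; the 2,000–3,000 manually labeled examples then correct the noise during fine-tuning.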
from datasets import Dataset
from transformers import Trainer, TrainingArguments

def create_news_dataset(articles_with_labels):
    """
    articles_with_labels: list of {'text': str, 'label': int}
    """
    return Dataset.from_list(articles_with_labels)

training_args = TrainingArguments(
    output_dir='./crypto_news_model',
    num_train_epochs=5,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    weight_decay=0.01,
    evaluation_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True,
    metric_for_best_model='f1'
)
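Since metric_for_best_model='f1' only works if evaluation actually reports an 'f1' key, the Trainer needs a compute_metrics callback. A sketch with scikit-learn (macro averaging assumed, since the classes are imbalanced):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    """Trainer callback: receives (logits, labels), returns metric dict."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        'accuracy': accuracy_score(labels, preds),
        'f1': f1_score(labels, preds, average='macro'),
    }
```

It is then passed as `compute_metrics=compute_metrics` when constructing the Trainer, alongside the model, training_args, and the train/eval datasets.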
Named Entity Recognition (NER) for Crypto
Extracting mentioned tokens, companies, amounts from news:
# Custom NER model for crypto context
# Entities: COIN (Bitcoin, ETH), EXCHANGE (Binance, FTX),
#           AMOUNT ($1B, 100,000 BTC), PROTOCOL (Uniswap, Aave)
from transformers import pipeline

ner_pipeline = pipeline('ner', model='crypto_ner_model', aggregation_strategy='simple')

def extract_crypto_entities(text):
    entities = ner_pipeline(text)
    coins = [e['word'] for e in entities if e['entity_group'] == 'COIN']
    amounts = [e['word'] for e in entities if e['entity_group'] == 'AMOUNT']
    return coins, amounts
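Before a custom NER model is trained, a regex baseline already recovers AMOUNT-style entities fairly reliably. A rough sketch, not a substitute for the model (the pattern and ticker list here are illustrative):

```python
import re

# Matches '$1B', '$2.5 million', '100,000 BTC' style amounts (rough baseline)
AMOUNT_RE = re.compile(
    r'\$\s?\d[\d,.]*\s?(?:[KMBT]|thousand|million|billion|trillion)?'
    r'|\d[\d,.]*\s?(?:BTC|ETH)',
    re.IGNORECASE
)

def extract_amounts(text):
    """Return all amount-like spans found in the text."""
    return [m.group(0).strip() for m in AMOUNT_RE.finditer(text)]
```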
Event Detection
Identifying specific high-impact events:
HIGH_IMPACT_PATTERNS = {
    'hack': ['hack', 'exploit', 'stolen', 'drained', 'attacked', 'vulnerability'],
    'regulation': ['sec', 'banned', 'illegal', 'regulatory', 'compliance', 'lawsuit'],
    'adoption': ['buys', 'acquired', 'invested', 'custody', 'etf approved'],
    'insolvency': ['bankrupt', 'insolvent', 'withdrawal halt', 'bankruptcy']
}

def detect_high_impact_event(text):
    # Keywords must be lowercase since the text is lowercased before matching.
    # Note: plain substring matching can false-positive (e.g. 'sec' in 'insecure').
    text_lower = text.lower()
    for event_type, keywords in HIGH_IMPACT_PATTERNS.items():
        if any(kw in text_lower for kw in keywords):
            return event_type
    return None
When a high-impact event is detected, an immediate alert fires regardless of the scheduled batch processing.
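Plain substring matching false-positives on words like "insecure" (which contains "sec"). A word-boundary variant with compiled regexes is safer; the pattern table below is a self-contained illustrative subset:

```python
import re

# Word-boundary keyword matcher: 'SEC' no longer fires on 'insecure',
# and multi-word phrases still match. Patterns are an illustrative subset.
HIGH_IMPACT_RE = {
    event: re.compile(r'\b(?:' + '|'.join(re.escape(kw) for kw in kws) + r')\b',
                      re.IGNORECASE)
    for event, kws in {
        'hack': ['hack', 'exploit', 'stolen', 'drained'],
        'regulation': ['SEC', 'banned', 'lawsuit'],
        'insolvency': ['bankrupt', 'withdrawal halt'],
    }.items()
}

def detect_event_strict(text):
    """Return the first matching event type, or None."""
    for event_type, pattern in HIGH_IMPACT_RE.items():
        if pattern.search(text):
            return event_type
    return None
```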
Realtime Processing Pipeline
News Feed (CryptoPanic, RSS)
→ Kafka topic: raw_news
→ Spark Streaming / Faust consumer
→ NLP classification (batch GPU inference)
→ PostgreSQL: classified_news
→ Redis: latest_sentiment_scores
→ WebSocket: realtime updates to dashboard
→ Alert system: high-impact events → Telegram
For production, batch NLP model requests (8–32 articles at a time); a single T4 GPU handles roughly 500 articles/second with batched inference.
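The micro-batching itself can be sketched independently of the model: accumulate articles into fixed-size chunks, then run one forward pass per chunk. Here classify_batch is a stand-in for the GPU call:

```python
def iter_batches(items, batch_size=32):
    """Yield successive fixed-size chunks; the last chunk may be smaller."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def process_stream(articles, classify_batch, batch_size=32):
    """classify_batch: callable taking a list of texts -> list of results."""
    results = []
    for batch in iter_batches(articles, batch_size):
        results.extend(classify_batch(batch))
    return results
```

In a streaming consumer the same idea is usually combined with a timeout, so a half-full batch is still flushed after, say, 500 ms rather than waiting to fill.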
Backtesting News Signal
Verify: did news classification actually precede price movements?
import pandas as pd

def backtest_news_signal(classified_news, price_data, lookback_hours=24):
    results = []
    for news in classified_news:
        if news['sentiment_score'] > 0.5:  # positive signal
            # Price 1, 4, 24 hours after the news
            t = news['published_at']
            for h in [1, 4, 24]:
                # get_return: helper returning the h-hour return starting at t
                future_return = get_return(price_data, t, h)
                results.append({
                    'signal': 'positive',
                    'horizon': h,
                    'actual_return': future_return
                })
    return pd.DataFrame(results)
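The backtest output is most readable aggregated per horizon: mean return and hit rate show whether the positive signal had any edge at 1, 4, or 24 hours. A sketch over the DataFrame returned above:

```python
import pandas as pd

def summarize_backtest(results_df):
    """Per-horizon mean return and hit rate (share of positive returns)."""
    return results_df.groupby('horizon')['actual_return'].agg(
        mean_return='mean',
        hit_rate=lambda r: (r > 0).mean(),
        n='count',
    )
```

If mean_return and hit_rate at the short horizons are indistinguishable from the unconditional baseline, the classification is labeling news correctly but too late to trade on.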
The result is an NLP system for crypto news: a model fine-tuned on crypto content, NER for entity extraction, detection of high-impact events, a real-time processing pipeline, and backtesting of the news signal.