Developing Crypto Market Sentiment Analysis System
The crypto market is uniquely sensitive to sentiment: single tweets from Elon Musk have reportedly moved prices by 20–30%. A sentiment analysis system aggregates signals from social networks, news, and on-chain data into a single quantitative market-sentiment metric.
Data Sources
Twitter/X: real-time, with high crypto-community activity. Hashtags: #BTC, #Bitcoin, #Crypto, #Ethereum. Twitter API v2 access caps vary by tier and have changed repeatedly (the legacy Essential tier allowed 500k tweets/month). Filtering by engagement (retweets > 10, likes > 50) reduces noise.
Reddit: r/CryptoCurrency (3M+ members), r/Bitcoin, r/ethfinance. The official Reddit API, or Pushshift (access restricted since 2023), for collection. High-upvote comments are particularly informative.
Telegram: major crypto channels (often closed). Telethon (Python) for public channel parsing. Requires account and careful ToS compliance.
News sources: CoinDesk, Cointelegraph, Decrypt, Bloomberg Crypto. RSS feeds + scraping. NewsAPI for aggregation.
On-chain sentiment: SOPR > 1 (profit-taking), whale movements, exchange flows — objective data, not manipulation-prone.
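The engagement filter mentioned for Twitter (retweets > 10, likes > 50) is a one-liner worth pinning down; the tweet-dict shape below is a hypothetical simplification of the API payload, not the exact field names:

```python
def filter_by_engagement(tweets, min_retweets=10, min_likes=50):
    """Keep only tweets whose engagement clears both thresholds."""
    return [
        t for t in tweets
        if t.get('retweet_count', 0) > min_retweets
        and t.get('like_count', 0) > min_likes
    ]

tweets = [
    {'text': 'BTC to the moon', 'retweet_count': 120, 'like_count': 900},
    {'text': 'gm', 'retweet_count': 2, 'like_count': 5},
]
filter_by_engagement(tweets)  # keeps only the first tweet
```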
NLP Pipeline
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification

class CryptoSentimentAnalyzer:
    def __init__(self, model_name='ProsusAI/finbert'):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
        self.pipeline = pipeline(
            'sentiment-analysis',
            model=self.model,
            tokenizer=self.tokenizer,
            device=0 if torch.cuda.is_available() else -1  # GPU if available
        )

    def analyze_batch(self, texts, batch_size=32):
        results = []
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            # Let the tokenizer truncate to the model's 512-token limit
            # (character slicing like text[:512] would cut mid-token)
            batch_results = self.pipeline(batch, truncation=True)
            results.extend(batch_results)
        return results

    def get_sentiment_score(self, text):
        result = self.pipeline(text, truncation=True)[0]
        # Convert the (label, score) pair to a scalar in [-1, 1]
        label = result['label']
        score = result['score']
        if label == 'positive':
            return score
        elif label == 'negative':
            return -score
        return 0.0  # neutral
Models for financial sentiment:
- FinBERT (ProsusAI): trained on financial texts. Best general-purpose baseline.
- CryptoBERT: fine-tuned specifically on crypto content. Understands "hodl", "wen lambo", "rekt".
- RoBERTa-large: more powerful base model, requires fine-tuning.
Fine-tuning on Crypto Data
Labeling strategy: take historical tweets from day t; if the price rose more than 1% the next day → positive, fell more than 1% → negative, otherwise neutral. This is weak supervision: labels are noisy, but they scale cheaply.
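The weak-labeling rule can be sketched directly (the 1% threshold is the one from the text; the function name is illustrative):

```python
def label_by_next_day_return(price_today, price_next_day, threshold=0.01):
    """Map next-day return to a weak sentiment label."""
    ret = (price_next_day - price_today) / price_today
    if ret > threshold:
        return 'positive'
    if ret < -threshold:
        return 'negative'
    return 'neutral'

label_by_next_day_return(100.0, 102.0)  # 2% rise -> 'positive'
```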
from transformers import TrainingArguments, Trainer
from datasets import Dataset

def fine_tune_crypto_sentiment(base_model, tokenizer, train_texts, train_labels,
                               eval_texts, eval_labels):
    def tokenize(batch):
        return tokenizer(batch['text'], truncation=True, padding='max_length')

    train_dataset = Dataset.from_dict(
        {'text': train_texts, 'label': train_labels}).map(tokenize, batched=True)
    eval_dataset = Dataset.from_dict(
        {'text': eval_texts, 'label': eval_labels}).map(tokenize, batched=True)

    training_args = TrainingArguments(
        output_dir='./crypto_sentiment_model',
        num_train_epochs=3,
        per_device_train_batch_size=32,
        warmup_steps=200,
        weight_decay=0.01,
        learning_rate=2e-5,
        eval_strategy='steps',
        eval_steps=500,
        save_strategy='steps',  # must match eval_strategy for best-model loading
        save_steps=500,
        load_best_model_at_end=True
    )
    trainer = Trainer(
        model=base_model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset
    )
    trainer.train()
    return trainer
Sentiment Signal Aggregation
A single tweet is a noisy signal; time-based aggregation yields a more reliable metric:
import numpy as np
import pandas as pd

def aggregate_sentiment(sentiment_scores, weights, window='1h'):
    """
    sentiment_scores: DataFrame with columns (timestamp, score, source, engagement)
    weights: {source: weight}; different sources carry different weights
    """
    df = sentiment_scores.copy()
    # Per-item weight: source weight scaled by log-damped engagement
    df['weight'] = df['source'].map(weights).fillna(1.0) * np.log1p(df['engagement'])
    df['weighted_score'] = df['score'] * df['weight']

    # Rolling aggregation: weighted mean per window (divide by the same
    # weights used in the numerator, not by raw engagement)
    grouped = df.set_index('timestamp').resample(window)
    aggregated = grouped['weighted_score'].sum() / grouped['weight'].sum()

    # Normalize to [-1, 1] via rolling z-score
    rolling_mean = aggregated.rolling(168).mean()  # 7 days of hourly bars
    rolling_std = aggregated.rolling(168).std()
    normalized = (aggregated - rolling_mean) / (rolling_std + 1e-8)
    return normalized.clip(-3, 3) / 3  # [-1, 1]
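The rolling z-score normalization at the end can be checked in isolation on a synthetic series (a flat signal with one spike; pandas and numpy assumed available):

```python
import pandas as pd

# Synthetic hourly sentiment series: flat at 0 with one spike at hour 180.
s = pd.Series([0.0] * 200)
s.iloc[180] = 1.0

rolling_mean = s.rolling(168).mean()  # 7-day window, as in the text
rolling_std = s.rolling(168).std()
normalized = ((s - rolling_mean) / (rolling_std + 1e-8)).clip(-3, 3) / 3

# The spike saturates at the +1 bound; quiet hours sit near 0.
```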
Composite Sentiment Index
Final index combines multiple sources:
SENTIMENT_WEIGHTS = {
    'twitter': 0.25,
    'reddit': 0.20,
    'news': 0.20,
    'on_chain_sopr': 0.15,
    'funding_rate': 0.10,
    'fear_greed': 0.10
}

def compute_composite_index(signals):
    total_weight = sum(SENTIMENT_WEIGHTS[s] for s in signals if s in SENTIMENT_WEIGHTS)
    composite = sum(
        signals[s] * SENTIMENT_WEIGHTS[s]
        for s in signals
        if s in SENTIMENT_WEIGHTS
    ) / total_weight
    return composite
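When some sources are missing, the index renormalizes over the weights that are present. A self-contained check (restating the weights and function from above; the signal values are made up):

```python
SENTIMENT_WEIGHTS = {
    'twitter': 0.25, 'reddit': 0.20, 'news': 0.20,
    'on_chain_sopr': 0.15, 'funding_rate': 0.10, 'fear_greed': 0.10,
}

def compute_composite_index(signals):
    # Renormalize over the sources actually present in `signals`
    total_weight = sum(SENTIMENT_WEIGHTS[s] for s in signals if s in SENTIMENT_WEIGHTS)
    return sum(
        signals[s] * SENTIMENT_WEIGHTS[s] for s in signals if s in SENTIMENT_WEIGHTS
    ) / total_weight

# Only twitter, reddit, and news available this hour (illustrative values):
signals = {'twitter': 0.4, 'reddit': -0.1, 'news': 0.2}
compute_composite_index(signals)  # (0.10 - 0.02 + 0.04) / 0.65
```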
Correlation Analysis
Historical studies generally find that sentiment leads price with a lag of roughly 0–24 hours. Cross-correlation locates the optimal lag:
import numpy as np
from scipy.signal import correlate

def cross_correlation_lag(sentiment, price_returns, max_lag_hours=48):
    # Demean both series so the cross-correlation is centered
    s = sentiment - np.mean(sentiment)
    r = price_returns - np.mean(price_returns)
    correlation = correlate(r, s, mode='full')
    # Zero lag sits at index len(s) - 1 of the 'full' output;
    # restrict the search to +/- max_lag_hours around it
    lags = np.arange(-max_lag_hours, max_lag_hours + 1)
    window = correlation[len(s) - 1 - max_lag_hours : len(s) + max_lag_hours]
    best = window.argmax()
    return lags[best], window[best]
Dashboard and Alerts
Realtime Sentiment Dashboard:
- Current composite sentiment score (0–100 gauge)
- Breakdown by sources
- Trend last 24h/7d
- Top-10 trending tokens by sentiment
Anomaly Alerts:
- Sentiment > 2σ from average (very positive or negative)
- Sharp change > 0.5 in 1 hour
- Divergence: sentiment rising, price falling (or vice versa)
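The three alert rules above can be sketched as pure functions over the sentiment series (the 2σ and 0.5 thresholds come from the list; everything else is illustrative):

```python
import statistics

def sentiment_anomaly(history, current, sigma=2.0):
    """True if the current score is more than `sigma` std devs from the mean."""
    mean = statistics.fmean(history)
    std = statistics.pstdev(history)
    return std > 0 and abs(current - mean) > sigma * std

def sharp_change(prev_hour, current, threshold=0.5):
    """True if sentiment moved more than `threshold` within one hour."""
    return abs(current - prev_hour) > threshold

def divergence(sentiment_change, price_change):
    """True when sentiment and price move in opposite directions."""
    return sentiment_change * price_change < 0
```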
Tech stack: Python (transformers, torch), PostgreSQL for score storage, Redis for caching recent values, Celery for scheduled data collection tasks, React dashboard, Grafana for system metrics.
Developing a full sentiment analysis system comes down to: data collection pipelines for multiple sources, FinBERT plus crypto-specific fine-tuning, aggregation into a composite index, a realtime dashboard, and historical correlation analysis against price movements.