Crypto Data Scraping from Social Media (Twitter/X, Telegram, Discord)
The crypto community lives on Twitter/X, Telegram, and Discord. That is where signals appear first: insider leaks hours before an announcement, panic building during a depeg, pump coordination, early vulnerability discussions. Trading signals, sentiment analysis, and security monitoring all require a reliable data pipeline, and each platform has its own access specifics.
Twitter/X: API and workarounds
Official API
Twitter API v2 is the only legal path. After the X Corp restructuring, pricing became aggressive:
- Free tier — write-only, with minimal read access. Useless for parsing.
- Basic ($100/mo) — 10,000 posts/month. Barely covers one account.
- Pro ($5,000/mo) — 1M tweets/month and Filtered Stream access. Realistic for serious analytics.
- Enterprise — Full Archive Search, Firehose. Price on request, tens of thousands monthly.
For crypto sentiment monitoring on the Pro tier:
import asyncio

import tweepy

client = tweepy.Client(bearer_token=BEARER_TOKEN)

# Filtered Stream for real-time monitoring
class CryptoStreamListener(tweepy.StreamingClient):
    def __init__(self, bearer_token: str, queue: asyncio.Queue,
                 loop: asyncio.AbstractEventLoop):
        super().__init__(bearer_token)
        self.queue = queue
        self.loop = loop  # event loop owned by the main thread

    def on_tweet(self, tweet):
        # Stream callbacks run in tweepy's own thread, so hand the
        # coroutine to the main loop instead of asyncio.create_task()
        asyncio.run_coroutine_threadsafe(self.process_tweet(tweet), self.loop)

    async def process_tweet(self, tweet):
        await self.queue.put({
            "id": tweet.id,
            "text": tweet.text,
            "author_id": tweet.author_id,
            "created_at": tweet.created_at,
            "source": "twitter",
        })

stream = CryptoStreamListener(BEARER_TOKEN, event_queue, asyncio.get_event_loop())

# Filter rules (AND/OR/NOT operators)
stream.add_rules(tweepy.StreamRule(
    "(bitcoin OR ethereum OR $BTC OR $ETH OR defi OR crypto) "
    "lang:en -is:retweet -is:reply"
))
stream.filter(tweet_fields=["created_at", "author_id", "public_metrics"])
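The stream callback only enqueues; a separate consumer drains the queue. A minimal sketch, assuming `event_queue` is an `asyncio.Queue` and `handle_batch` is your own coroutine (e.g. a batched DB insert) — both names are illustrative:

```python
import asyncio

async def consume(event_queue: asyncio.Queue, handle_batch,
                  batch_size: int = 50, flush_seconds: float = 2.0):
    # Collect messages into batches; flush on size or timeout so downstream
    # writes are amortized without starving during quiet periods.
    batch = []
    while True:
        timed_out = False
        try:
            item = await asyncio.wait_for(event_queue.get(),
                                          timeout=flush_seconds)
            if item is None:  # sentinel: flush remainder and stop
                break
            batch.append(item)
        except asyncio.TimeoutError:
            timed_out = True
        if batch and (len(batch) >= batch_size or timed_out):
            await handle_batch(batch)
            batch = []
    if batch:
        await handle_batch(batch)
```

Batching matters here because the Filtered Stream can burst to hundreds of tweets per second during market moves, and row-at-a-time inserts won't keep up.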
Recent Search for historical data (up to 7 days on Pro):
# Paginator handles next_token pagination internally
paginator = tweepy.Paginator(
    client.search_recent_tweets,
    query="($BTC OR bitcoin) lang:en -is:retweet",
    tweet_fields=["created_at", "public_metrics", "author_id"],
    max_results=100,  # per page
)

# Paginator is synchronous; flatten() yields individual tweets across pages
tweets = list(paginator.flatten(limit=1000))  # 10 pages = 1000 tweets
Alternatives and limitations
On a limited budget, consider third-party Twitter data providers: Brandwatch, Sprinklr, Tweetbinder. They resell access to historical and stream data at better prices.
Scraping via the unofficial API (without keys) violates the ToS and is legally risky for commercial projects. It is technically possible via session cookies and reverse-engineered endpoints, but X Corp actively blocks it.
Telegram: MTProto API
Telegram is the main platform for crypto announcements; most projects run official Telegram channels.
Telethon: User account API
Telegram provides two APIs: the Bot API (limited) and the MTProto API (full access via a user account). Channel parsing requires MTProto, e.g. via the Telethon library:
import os

from telethon import TelegramClient, events

API_ID = int(os.getenv("TELEGRAM_API_ID"))
API_HASH = os.getenv("TELEGRAM_API_HASH")

async def monitor_channels(channel_usernames: list[str]):
    async with TelegramClient("session", API_ID, API_HASH) as client:
        # Subscribe to new messages
        @client.on(events.NewMessage(chats=channel_usernames))
        async def handler(event):
            msg = event.message
            await process_message({
                "channel": event.chat.username,
                "message_id": msg.id,
                "text": msg.text or "",
                "date": msg.date,
                "views": msg.views,
                "forwards": msg.forwards,
                "has_media": bool(msg.media),
            })

        # Get channel history
        async def fetch_history(channel: str, limit: int = 1000):
            messages = []
            async for msg in client.iter_messages(channel, limit=limit):
                messages.append({
                    "id": msg.id,
                    "text": msg.text or "",
                    "date": msg.date,
                    "views": msg.views,
                })
            return messages

        await client.run_until_disconnected()
Important: Telethon operates through a real user account, and Telegram blocks accounts for suspicious activity (too many requests, parsing too many channels at once). Use a dedicated account, respect rate limits, and don't parse private groups without permission.
The Bot API works only if the bot has been added to the group or channel; public channels are inaccessible to a bot that hasn't joined them.
Telegram anomaly detection
An activity spike in a channel is itself a signal:
from datetime import datetime, timedelta

async def detect_activity_spike(channel: str, window_minutes: int = 60):
    # Compare message count for the last hour vs the previous hour
    now = datetime.utcnow()
    hour_ago = now - timedelta(hours=1)
    two_hours_ago = now - timedelta(hours=2)
    recent_count = await db.count_messages(channel, hour_ago, now)
    prev_count = await db.count_messages(channel, two_hours_ago, hour_ago)
    if prev_count > 0:
        spike_ratio = recent_count / prev_count
        if spike_ratio > 3:  # 3x normal
            await alert(f"Activity spike in {channel}: {spike_ratio:.1f}x")
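A fixed hour-over-hour ratio is noisy on small channels. One common refinement, sketched below as pure functions with no DB dependency (names and thresholds are illustrative), keeps an exponentially weighted baseline and adds an absolute floor:

```python
def ewma_update(baseline: float, observed: float, alpha: float = 0.2) -> float:
    """Exponentially weighted moving average of hourly message counts."""
    return alpha * observed + (1 - alpha) * baseline

def is_spike(baseline: float, observed: float, ratio: float = 3.0,
             min_messages: int = 10) -> bool:
    # Absolute floor prevents a jump from 1 to 4 messages from alerting
    return observed >= min_messages and observed > ratio * max(baseline, 1.0)
```

The EWMA smooths over a single quiet hour, so a busy channel that briefly goes silent doesn't trigger a false alert the moment normal traffic resumes.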
Discord: Bot API
Most DeFi projects use Discord for their communities: technical discussions, early announcements, sometimes even attack coordination.
Discord Bot
You need a bot token from the Discord Developer Portal, and the bot has to be added to the server:
import discord
from discord.ext import commands

intents = discord.Intents.default()
intents.message_content = True  # Privileged intent — needs Discord approval

bot = commands.Bot(command_prefix="!", intents=intents)

TARGET_SERVERS = {
    "1234567890": ["general", "announcements", "alpha-calls"],
}

@bot.event
async def on_message(message: discord.Message):
    if message.author.bot:
        return
    guild_id = str(message.guild.id) if message.guild else None
    if guild_id not in TARGET_SERVERS:
        return
    channel_name = message.channel.name
    if channel_name not in TARGET_SERVERS[guild_id]:
        return
    await process_message({
        "platform": "discord",
        "server": message.guild.name,
        "channel": channel_name,
        "author": str(message.author),
        "content": message.content,
        "timestamp": message.created_at,
        "attachments": [a.url for a in message.attachments],
    })
Limitation: message_content is a privileged intent. Once a bot reaches 100+ servers, Discord requires verification to keep it; on smaller servers it works without verification after being enabled in the Developer Portal.
Message history is available via channel.history(), but only for servers where the bot is already present: there is no way to retroactively get history from servers it has never joined.
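Backfilling what is reachable can look like this (a sketch; `channel` is any TextChannel the bot can read, and the record fields mirror the storage schema used for the other platforms):

```python
async def fetch_recent(channel, limit: int = 200) -> list[dict]:
    # discord.py's channel.history() yields messages newest-first by default
    records = []
    async for message in channel.history(limit=limit):
        records.append({
            "platform": "discord",
            "source_id": str(message.id),
            "author": str(message.author),
            "content": message.content,
            "published_at": message.created_at,
        })
    return records
```

Storing source_id as a string keeps the schema uniform across platforms, since Discord snowflake IDs exceed what some drivers handle comfortably as integers.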
Storage and Processing
Unified schema for messages from all platforms:
CREATE TABLE social_messages (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    platform TEXT NOT NULL,   -- 'twitter', 'telegram', 'discord'
    source_id TEXT NOT NULL,  -- original message ID
    channel TEXT,             -- @username, channel_name, server/channel
    author TEXT,
    content TEXT NOT NULL,
    metadata JSONB,           -- platform-specific: views, likes, reactions
    captured_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    published_at TIMESTAMPTZ,
    UNIQUE (platform, source_id)
);

CREATE INDEX idx_social_platform_channel
    ON social_messages (platform, channel, published_at DESC);
CREATE INDEX idx_social_content_fts
    ON social_messages USING gin(to_tsvector('english', content));
The GIN index enables full-text search, which is needed for finding token mentions, keywords, and contract addresses.
Sentiment Analysis
For crypto-specific sentiment, raw text needs processing:
Keyword extraction: ticker mentions ($BTC, $ETH, $PEPE), contract addresses (0x...), protocol names.
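The extraction step can be sketched with two regexes (an illustration; in practice the ticker pattern needs a whitelist, since it also matches noise like $USD or $LOL):

```python
import re

# Cashtag-style tickers: $ followed by 2-10 letters
TICKER_RE = re.compile(r"\$([A-Za-z]{2,10})\b")
# EVM contract addresses: 0x followed by exactly 40 hex characters
EVM_ADDRESS_RE = re.compile(r"\b0x[a-fA-F0-9]{40}\b")

def extract_mentions(text: str) -> dict:
    return {
        "tickers": sorted({m.upper() for m in TICKER_RE.findall(text)}),
        "addresses": EVM_ADDRESS_RE.findall(text),
    }
```

Normalizing tickers to uppercase deduplicates $pepe and $PEPE, which matters when counting mention frequency.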
Sentiment: specialized models beat general-purpose ones. FinBERT and CryptoBERT are BERT variants fine-tuned on financial and crypto content, available via HuggingFace:
from transformers import pipeline

sentiment = pipeline(
    "sentiment-analysis",
    model="ElKulako/cryptobert",
    device=0,  # GPU
)

def analyze_sentiment(text: str) -> dict:
    # Crude character truncation; BERT's hard limit is 512 tokens
    result = sentiment(text[:512])[0]
    return {
        "label": result["label"],  # Bullish / Bearish / Neutral
        "score": result["score"],
    }
Volume-weighted sentiment: weight each message's sentiment by its reach. A tweet with 100k impressions should count for more than one with 100; for Telegram, weight by message views.
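Weighting can be as simple as a reach-weighted average of signed scores (a sketch; mapping Bullish/Bearish to +1/-1 is an assumption about the model's label set, and "reach" stands in for impressions or views):

```python
def weighted_sentiment(items: list[dict]) -> float:
    """items: [{"label": "Bullish"|"Neutral"|"Bearish",
                "score": float, "reach": int}]
    Returns a value in [-1, 1]; 0.0 when there is nothing to weigh."""
    sign = {"Bullish": 1.0, "Neutral": 0.0, "Bearish": -1.0}
    # Floor reach at 1 so zero-view messages still contribute slightly
    total = sum(max(it["reach"], 1) for it in items)
    if total == 0:
        return 0.0
    return sum(sign[it["label"]] * it["score"] * max(it["reach"], 1)
               for it in items) / total
```

With this weighting, a single viral bearish tweet can flip the aggregate even against dozens of low-reach bullish messages, which is usually the desired behavior for trading signals.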
Operational Limitations
Social media monitoring is legally sensitive: the ToS of most platforms forbid commercial scraping outside the official API. Practical limits:
- Twitter: the official API is mandatory for any commercial use
- Telegram: parsing public channels via MTProto is a gray area, not explicitly forbidden
- Discord: only via the official Bot API, no web scraping
Developing a pipeline for two platforms (Twitter + Telegram) with sentiment analysis and storage takes 2-3 weeks. Adding Discord and custom ML models takes another 1-2 weeks.