Development of Fake News Classification Model for Crypto

The crypto space is an ideal environment for disinformation. High volatility means that a single fake tweet about a Binance listing or a protocol hack can move a price by tens of percent. Pump-and-dump schemes start with information manipulation, and scam projects live on fabricated partnership news.

Developing a fake news classifier for crypto is an NLP task with domain-specific complexities: narrow domain terminology, a fast information lifecycle (news becomes obsolete within hours), multilinguality, and deliberate text obfuscation by the people producing fakes.

Problem definition and fake typology

Before building the model, we need to pin down what exactly we are classifying: "fake news" is too broad a label.

Categories of misinformation in crypto:

  • Fake listings: false claims that a token is being listed (on a CEX, or on a DEX via a fake token pool)
  • Fake partnerships: announcements of non-existent partnerships ("protocol X integrates Y")
  • Fabricated exploits: false breach announcements designed to trigger panic selling
  • Shill content: manipulative promotion without disclosure of financial interest
  • Price manipulation narratives: coordinated narratives that set up a pump or dump
  • Impersonation: content from accounts imitating official ones

Each category has specific text patterns, sources, and verification methods.
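One possible label schema for this taxonomy, with a flag for whether each claim type can later be cross-checked against on-chain data (the field names are illustrative, not a fixed standard):

```python
# Label schema for the misinformation taxonomy; 'onchain_verifiable'
# marks categories whose claims leave an on-chain trace.
LABELS = {
    'real':             {'id': 0, 'onchain_verifiable': False},
    'fake_listing':     {'id': 1, 'onchain_verifiable': True},   # pool existence
    'fake_partnership': {'id': 2, 'onchain_verifiable': True},   # contract interactions
    'fake_exploit':     {'id': 3, 'onchain_verifiable': True},   # TVL change
    'shill':            {'id': 4, 'onchain_verifiable': False},
    'pump_narrative':   {'id': 5, 'onchain_verifiable': False},
    'impersonation':    {'id': 6, 'onchain_verifiable': False},
}
```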

Data collection and labeling

The main problem is that no ready-made dataset exists. General fake news datasets (LIAR, FakeNewsNet) don't cover crypto specifics.

Data sources

Twitter/X API: the main platform for crypto news and manipulation. The Academic Research API gives access to historical data. Key filters: accounts with more than 1,000 followers in the crypto niche, hashtags (#bitcoin, #defi), and protocol keywords.

Telegram channels: Telethon for parsing public channels. An important source is pump-and-dump channels, many of which are public.

Reddit: r/CryptoCurrency, r/Bitcoin, r/CryptoMoonShots. Pushshift API or Reddit API for historical data.

Crypto news aggregators: CoinDesk, Cointelegraph, Decrypt provide verified news (the positive class) to contrast with Twitter rumors.

Labeling strategy

Automatic labeling via cross-verification: if a news item appears on Twitter but is not confirmed by the project's official channels within 24 hours, it is a potential fake. If it contradicts on-chain data (a listing is announced but no pool exists), it is a fake with high probability.
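The cross-verification rules above can be sketched as a weak-labeling function. The field names (`official_confirmation`, `onchain_contradiction`, `hours_since_post`) are hypothetical upstream signals, not a fixed schema:

```python
from typing import Optional

def weak_label(post: dict) -> Optional[str]:
    """Heuristic weak labeling via cross-verification signals.

    Assumes the post dict carries two upstream flags (hypothetical names):
    'onchain_contradiction' - on-chain checks disprove the claim (e.g. an
    announced listing with no liquidity pool) - and 'official_confirmation' -
    an official project channel repeated the claim within 24 hours.
    """
    if post.get('onchain_contradiction'):
        return 'fake'            # contradicted by chain data: high-confidence fake
    if post.get('official_confirmation'):
        return 'real'            # independently confirmed
    if post.get('hours_since_post', 0) >= 24:
        return 'fake_candidate'  # unconfirmed after 24h: route to human review
    return None                  # too early to decide
```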

Human labeling via crowdsourcing (Scale AI, Appen), with domain experts for final verification. Use a minimum of three annotators per example, with inter-annotator agreement (Cohen's kappa > 0.7) as the quality threshold.
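Inter-annotator agreement can be computed with scikit-learn's `cohen_kappa_score`; the six labels below are purely illustrative:

```python
from sklearn.metrics import cohen_kappa_score

# Two annotators' labels for six posts (1 = fake, 0 = real); toy data.
annotator_a = [1, 1, 0, 1, 0, 1]
annotator_b = [1, 1, 0, 0, 0, 1]

# Kappa corrects raw agreement for agreement expected by chance
kappa = cohen_kappa_score(annotator_a, annotator_b)
```

Here the annotators agree on five of six posts, yet kappa comes out to about 0.67, just under the 0.7 threshold, so this batch would go back for re-labeling.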

Model architecture

Feature Engineering

Text signals of fakes:

  • Excessive hype without specifics ("100x guaranteed", "next bitcoin")
  • Urgency ("buy NOW", "last chance")
  • Known project/person names without context
  • Grammar errors (impersonation accounts often careless)
  • Source mismatch: the claimed origin of the news does not match the account actually posting it
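A minimal sketch of turning these lexical signals into features; the keyword lists are illustrative, and a real system would learn such patterns rather than hand-code them:

```python
import re

# Illustrative keyword lists distilled from the signals above
HYPE = re.compile(r"\b(100x|1000x|guaranteed|next bitcoin|to the moon)\b", re.I)
URGENCY = re.compile(r"\b(buy now|last chance|hurry|don't miss)\b", re.I)

def text_signal_features(text: str) -> dict:
    """Count simple lexical fake signals in a post."""
    return {
        'hype_hits': len(HYPE.findall(text)),
        'urgency_hits': len(URGENCY.findall(text)),
        'exclamations': text.count('!'),
        'all_caps_words': sum(w.isupper() and len(w) > 2 for w in text.split()),
    }
```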

Metadata signals:

  • Account age and posting history
  • Follower/following ratio (a ratio around 0.01 is suspicious)
  • Spread velocity (going viral within the first hour is suspicious)
  • Posting time patterns (e.g. activity spikes at 3:00 UTC)
  • Clusters of similar posts pushing an identical narrative

Model stack

Baseline: TF-IDF + Logistic Regression / XGBoost. Fast to train, well interpretable, and establishes a baseline F1.
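A minimal baseline sketch with scikit-learn; the four in-line posts and labels are toy data for illustration only:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training set: 1 = fake, 0 = real
texts = [
    "Binance listing confirmed, 100x guaranteed, buy NOW",
    "Protocol X integrates with Y, last chance to ape in",
    "Quarterly report: TVL grew 4% according to DeFiLlama",
    "Core team publishes audit results from Trail of Bits",
]
labels = [1, 1, 0, 0]

# TF-IDF over unigrams and bigrams feeding a logistic regression
baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),
    LogisticRegression(max_iter=1000),
)
baseline.fit(texts, labels)
```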

Transformer-based: FinBERT (finance-oriented) or a crypto-fine-tuned BERT. Understands context and language nuances.

Ensemble: transformer features combined with metadata via gradient boosting. Typically beats any single model.

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

class CryptoFakeNewsClassifier:
    def __init__(self, model_name: str = 'ProsusAI/finbert'):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.text_model = AutoModelForSequenceClassification.from_pretrained(
            model_name,
            num_labels=7  # real + 6 fake categories
        )
        self.meta_classifier = GradientBoostingClassifier(
            n_estimators=300,
            max_depth=6,
            learning_rate=0.05
        )

    def extract_text_features(self, texts: list[str]) -> np.ndarray:
        """Extract [CLS] embeddings from the transformer"""
        self.text_model.eval()
        all_embeddings = []

        batch_size = 32
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            inputs = self.tokenizer(
                batch,
                max_length=512,
                truncation=True,
                padding=True,
                return_tensors='pt'
            )

            with torch.no_grad():
                outputs = self.text_model(**inputs, output_hidden_states=True)
                # [CLS] token embedding from the last hidden layer
                cls_embeddings = outputs.hidden_states[-1][:, 0, :]
                all_embeddings.append(cls_embeddings.cpu().numpy())

        return np.vstack(all_embeddings)

    def extract_meta_features(self, posts: list[dict]) -> np.ndarray:
        """Build the metadata feature matrix from post dicts.

        Expects the fields from the "Metadata signals" section
        (hypothetical field names).
        """
        return np.array([
            [
                p.get('account_age_days', 0),
                p.get('followers', 0) / max(p.get('following', 1), 1),
                p.get('spread_velocity', 0.0),  # reposts in the first hour
            ]
            for p in posts
        ])

    def predict(self, posts: list[dict]) -> dict:
        texts = [p['text'] for p in posts]

        text_features = self.extract_text_features(texts)
        meta_features = self.extract_meta_features(posts)

        # Concatenate transformer and metadata features
        combined = np.hstack([text_features, meta_features])

        # Final classification with the gradient boosting ensemble
        probabilities = self.meta_classifier.predict_proba(combined)
        predictions = self.meta_classifier.predict(combined)

        return {
            'predictions': predictions,
            'probabilities': probabilities,
            'labels': ['real', 'fake_listing', 'fake_partnership',
                       'fake_exploit', 'shill', 'pump_narrative',
                       'impersonation']
        }

Fine-tuning on crypto domain

FinBERT is trained on financial news but is not crypto-specialized. Fine-tuning on a crypto corpus significantly improves quality: the vocabulary (tickers, protocol names, DeFi jargon) and the register of crypto Twitter differ markedly from traditional financial reporting.

Verification via on-chain data

A unique advantage of crypto: many claims are verifiable on-chain.

An announced listing on Uniswap V3: check via the Uniswap Subgraph whether a pool exists. An announced exploit: check the TVL change in the DeFiLlama API over the stated period. An announced partnership: search for on-chain interactions between the contracts.

import aiohttp

async def verify_listing_claim(token_address: str, dex: str = 'uniswap_v3') -> dict:
    """Verify a listing claim via on-chain data"""

    if dex != 'uniswap_v3':
        raise NotImplementedError(f'Unsupported DEX: {dex}')

    # Find any pool that contains the token on either side
    query = """
    query PoolsForToken($token: String!) {
        pools(where: {
            or: [
                { token0: $token },
                { token1: $token }
            ]
        }, first: 5) {
            id
            liquidity
            totalValueLockedUSD
            createdAtTimestamp
        }
    }
    """

    async with aiohttp.ClientSession() as session:
        async with session.post(
            'https://api.thegraph.com/subgraphs/name/uniswap/uniswap-v3',
            json={'query': query, 'variables': {'token': token_address.lower()}}
        ) as response:
            data = await response.json()
            pools = data.get('data', {}).get('pools', [])

            return {
                'listing_exists': len(pools) > 0,
                'pools': pools,
                'total_tvl': sum(float(p['totalValueLockedUSD']) for p in pools)
            }
On-chain verification converts part of the task from NLP into a deterministic check, which significantly improves accuracy for categories that leave an on-chain trace.

Model evaluation

For fake detection, accuracy is the wrong metric: if 90% of examples are real, a model that always predicts "real" scores 90% accuracy.

Precision, recall, and F1 per class are the main metrics. Pay special attention to fake recall: a false negative (a missed fake) is worse than a false positive (a false alarm).

Area Under ROC Curve (AUC-ROC): threshold-independent assessment.

Temporal stability: the model must maintain quality on new data. Crypto narratives change fast, so regular retraining is necessary.

Target metrics for production: fake-detection recall > 0.85, real-class precision > 0.90, macro F1 > 0.82.
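The accuracy trap and the per-class metrics can be illustrated on a toy imbalanced batch:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Illustrative predictions on an imbalanced batch: 0 = real, 1 = fake
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

accuracy = accuracy_score(y_true, y_pred)  # 0.9, looks great on its own
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=[0, 1], zero_division=0
)
fake_recall = rec[1]      # both fakes caught
real_precision = prec[0]  # nothing mislabeled as real
fake_precision = prec[1]  # one false alarm out of three flags
```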

Deployment and real use

The production model handles streaming data. Architecture: Kafka for Twitter/Telegram ingestion → an inference service (FastAPI) → PostgreSQL + Elasticsearch for storage → alerting on detection.

Concept drift: crypto narratives change fast, so retrain monthly on new data and monitor distribution shift (a significant change in the input distribution is a signal to retrain).
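One common way to monitor distribution shift is the population stability index (PSI); this is a minimal sketch, and the thresholds in the docstring are a rule of thumb to tune per feature:

```python
import numpy as np

def population_stability_index(expected: np.ndarray, actual: np.ndarray,
                               bins: int = 10) -> float:
    """PSI between a reference feature distribution and live traffic.

    Rule of thumb (an assumption, not a standard): PSI < 0.1 stable,
    0.1-0.25 moderate shift, > 0.25 retraining signal.
    """
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor the bin shares to avoid log(0)
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))
```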

Human verification: alerts for high-confidence fakes (score > 0.95) are published automatically, while borderline cases (0.6–0.95) go to human review. This reduces the impact of model errors.
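The routing logic can be sketched directly from these thresholds (the action names are illustrative):

```python
def route_prediction(fake_probability: float,
                     auto_threshold: float = 0.95,
                     review_threshold: float = 0.6) -> str:
    """Route a model score to an action using the thresholds above."""
    if fake_probability > auto_threshold:
        return 'auto_flag'      # publish the alert automatically
    if fake_probability >= review_threshold:
        return 'human_review'   # borderline: queue for an annotator
    return 'pass'               # treat as real, no action
```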

Development timeline

Data infrastructure (6–8 weeks): collect and label about 50,000 examples. ML models (4–6 weeks): baseline, transformer training, and evaluation. Deployment (3–4 weeks): FastAPI, Docker, monitoring.

Total: 3–4 months to a production-ready system.