Developing a Pump-and-Dump Detection Model
Pump-and-dump schemes in crypto move faster than in traditional markets: from the start of coordinated buying to the dump takes hours or even minutes. On-chain data is completely public, which creates a unique detection opportunity: wallet movements, volume concentration, and transaction synchronization are all visible in real time.
The task is to build a system that detects a P&D scheme during the pump phase, before the dump, so it can warn users or automatically protect the protocol.
Anatomy of pump-and-dump scheme
Understanding the mechanics is critical for building the right features.
Accumulation phase: organizers gradually buy the token with small orders, trying not to move the price. Signs: the number of unique holder addresses grows while the price stagnates, unusual buy volume at off-peak hours, and synchronized wallets (e.g., wallets receiving ETH from a single source).
Pump phase: coordinated buying, usually organized in Telegram/Discord. The price rises 200-2000% within hours. Volume spikes to 10-100x the average. Social media activity spikes with templated messages.
Dump phase: organizers sell at the peak. Retail buyers attracted by the rise enter and are left holding the bags. The price crashes to the pre-pump level or lower.
Features for model
On-chain metrics
Volume anomaly score:
VAS = current_volume / rolling_avg_volume_30d
Values above 10 without fundamental news are a strong signal.
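The VAS formula above can be sketched directly over a daily volume series. This is a minimal illustration assuming daily bars and a 30-day baseline that excludes the current day; the function name is hypothetical.

```python
import pandas as pd

def volume_anomaly_score(daily_volume: pd.Series) -> float:
    """VAS = today's volume / 30-day rolling average (excluding today)."""
    baseline = daily_volume.iloc[:-1].tail(30).mean()
    if baseline == 0:
        return 0.0
    return daily_volume.iloc[-1] / baseline

# Example: 30 quiet days, then a 20x volume spike
history = pd.Series([100.0] * 30 + [2000.0])
print(volume_anomaly_score(history))  # 20.0
```

In production the baseline would come from the feature store rather than an in-memory series, but the arithmetic is the same.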
Holder concentration delta: the change in the HHI (Herfindahl-Hirschman Index):
HHI = Σ (balance_i / total_supply)²
A rising HHI means the token is concentrating in fewer addresses, i.e., accumulation.
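The HHI formula translates directly to code. A sketch, with the delta computed as the difference between two snapshots (function names are illustrative):

```python
def hhi(balances: list[float]) -> float:
    """Herfindahl-Hirschman Index over holder balances: sum of squared supply shares."""
    total = sum(balances)
    if total == 0:
        return 0.0
    return sum((b / total) ** 2 for b in balances)

# One holder owns everything -> maximal concentration
print(hhi([1000.0]))       # 1.0
# Supply spread evenly across 10 holders
print(hhi([100.0] * 10))   # 0.1

def hhi_delta(balances_now: list[float], balances_before: list[float]) -> float:
    """Positive delta = concentration rising, a possible accumulation signal."""
    return hhi(balances_now) - hhi(balances_before)
```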
Transaction synchronization: a coefficient measuring how synchronized independent addresses' buys are within the same time window (±5 minutes). Organic growth has a roughly uniform distribution over time; a P&D shows a spike.
Wallet clustering: build a graph of address relationships. Addresses that receive ETH from the same source, buy via the same EOA, or share similar transaction patterns are probably controlled by one entity. If 60%+ of volume comes from a single cluster, that is a signal.
Price-volume divergence: in healthy growth, volume rises gradually along with price. In a P&D, volume arrives first and the price then jumps sharply, or the two move in lockstep without any ramp-up.
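A minimal sketch of the clustering idea, using only the shared-funder heuristic (real systems combine several heuristics: common EOA, timing, gas patterns). All function names and addresses are illustrative:

```python
from collections import defaultdict

def cluster_by_funder(funding_edges: list[tuple[str, str]]) -> dict[str, set[str]]:
    """Group buyer addresses that received ETH from the same funding source.

    funding_edges: (funder, buyer) pairs extracted from transfer history.
    """
    clusters: dict[str, set[str]] = defaultdict(set)
    for funder, buyer in funding_edges:
        clusters[funder].add(buyer)
    return dict(clusters)

def max_cluster_volume_share(clusters: dict[str, set[str]],
                             buy_volumes: dict[str, float]) -> float:
    """Largest single cluster's share of total buy volume; >0.6 is a red flag."""
    total = sum(buy_volumes.values())
    if total == 0:
        return 0.0
    best = max(sum(buy_volumes.get(a, 0.0) for a in members)
               for members in clusters.values())
    return best / total

edges = [("0xFunder1", "0xA"), ("0xFunder1", "0xB"), ("0xFunder2", "0xC")]
clusters = cluster_by_funder(edges)
print(max_cluster_volume_share(clusters, {"0xA": 40.0, "0xB": 30.0, "0xC": 30.0}))  # 0.7
```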
Cross-market metrics
DEX vs CEX price discrepancy: if the DEX price is significantly above the CEX price, the DEX price may be intentionally manipulated.
Liquidity depth change: sharp LP removal before a pump reduces resistance to price movement; a classic preparation pattern.
New wallet ratio: the percentage of transactions from wallets created less than 7 days ago. A high ratio points to fresh addresses created for the scheme.
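The new-wallet ratio is a one-liner once wallet ages are known (age would come from the first-seen block of each sender; the function name is illustrative):

```python
def new_wallet_ratio(tx_wallet_ages_days: list[float],
                     threshold_days: float = 7.0) -> float:
    """Share of transactions sent from wallets younger than the threshold."""
    if not tx_wallet_ages_days:
        return 0.0
    fresh = sum(1 for age in tx_wallet_ages_days if age < threshold_days)
    return fresh / len(tx_wallet_ages_days)

# Three of five transactions come from wallets under a week old
print(new_wallet_ratio([1, 2, 3, 100, 365]))  # 0.6
```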
Social signals (optional)
Monitor Telegram/Discord for ticker mentions. A sudden mention spike plus positive sentiment plus templated buy calls indicates a coordinated pump.
Detection system architecture
Data pipeline
Blockchain RPC (geth/erigon)
→ Event streaming (WebSocket)
→ Kafka / RabbitMQ
→ Feature extractor (Python)
→ Feature store (Redis realtime, PostgreSQL historical)
→ ML model inference
→ Alert engine
Real-time blockchain connection via WebSocket:
from web3 import Web3, AsyncWeb3, WebSocketProvider
import asyncio

async def stream_swaps(token_address: str, callback):
    # WebSocket provider per the web3.py v7 API; older versions use a different class name
    async with AsyncWeb3(WebSocketProvider('wss://mainnet.infura.io/ws/v3/KEY')) as w3:
        # Subscribe to Transfer events of the token
        transfer_filter = await w3.eth.filter({
            'address': token_address,
            'topics': [Web3.keccak(text='Transfer(address,address,uint256)').hex()],
        })
        while True:
            for event in await transfer_filter.get_new_entries():
                await callback(event)
            await asyncio.sleep(0.1)
Feature extraction
from dataclasses import dataclass

@dataclass
class TokenFeatures:
    token_address: str
    timestamp: float
    volume_anomaly_score: float
    new_wallet_ratio: float
    transaction_sync_score: float
    holder_hhi_delta: float
    liquidity_depth_change: float
    price_velocity: float
import numpy as np
import pandas as pd

def compute_sync_score(
    transactions: pd.DataFrame,
    window_seconds: int = 300
) -> float:
    """How synchronized independent addresses' buys are (0 = organic, 1 = clustered)."""
    tx_times = transactions['timestamp'].values
    unique_senders = transactions['from'].nunique()
    if unique_senders < 2:
        return 0.0
    # Histogram of transactions by time window
    bins = np.arange(tx_times.min(), tx_times.max() + window_seconds, window_seconds)
    hist, _ = np.histogram(tx_times, bins=bins)
    if hist.mean() == 0:
        return 0.0
    # Coefficient of variation: a P&D concentrates transactions in a few
    # windows (high CV), while organic flow is roughly uniform (low CV)
    cv = hist.std() / hist.mean()
    return min(1.0, cv / 2)
ML model
For P&D detection, XGBoost or LightGBM on tabular features works well: interpretable (SHAP values), fast inference, robust to missing data.
import xgboost as xgb
from sklearn.model_selection import TimeSeriesSplit

# Split by time: never use future data to predict the past
tscv = TimeSeriesSplit(n_splits=5)

model = xgb.XGBClassifier(
    n_estimators=500,
    max_depth=6,
    learning_rate=0.01,
    subsample=0.8,
    colsample_bytree=0.8,
    scale_pos_weight=neg_count / pos_count,  # compensate for class imbalance
    eval_metric='aucpr',                     # PR-AUC matters under imbalance
    early_stopping_rounds=50,
)
Evaluation metrics: precision and recall matter more than accuracy because of the strong class imbalance. Target: precision > 0.7 at recall > 0.6. False positives (false alarms) annoy users; false negatives (missed P&Ds) cause reputational damage.
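A sketch of checking the precision/recall target with scikit-learn, using synthetic scores and labels purely for illustration (real evaluation would use held-out, time-split predictions):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

# Synthetic data: ~5% positives to mimic heavy class imbalance
rng = np.random.default_rng(0)
y_true = rng.binomial(1, 0.05, size=2000)
y_score = np.where(y_true == 1,
                   rng.beta(5, 2, size=2000),   # positives tend to score higher
                   rng.beta(2, 5, size=2000))

ap = average_precision_score(y_true, y_score)   # area under the PR curve
precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# Do any thresholds meet the target: precision > 0.7 at recall > 0.6?
ok = (precision[:-1] > 0.7) & (recall[:-1] > 0.6)
print(f"PR-AUC: {ap:.3f}, qualifying thresholds: {ok.sum()}")
```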
Alerting implementation
Thresholds and confidence levels
The output is not a binary "P&D / not P&D" label but a probability mapped to thresholds:
- > 0.8: high confidence, immediate alert
- 0.6 - 0.8: medium confidence, warning
- < 0.6: monitoring, no alert
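The threshold table above maps to a trivial routing function in the alert engine (tier names are illustrative):

```python
def route_alert(probability: float) -> str:
    """Map the model's P&D probability to an alert tier."""
    if probability > 0.8:
        return "alert"      # high confidence: immediate alert
    if probability >= 0.6:
        return "warning"    # medium confidence: warning
    return "monitor"        # below threshold: keep monitoring, no alert

print(route_alert(0.85))  # alert
```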
Integration with protocol
For protocols that need protection, the trading contract can read the risk score via an oracle. If the risk is high, the protocol can apply stricter slippage limits or pause the specific pool.
Limitations and disclaimers
A detection system doesn't eliminate P&D; it warns about it. Organizers adapt to detection algorithms (adversarial behavior), so model quality degrades over time and requires retraining.
Legal side: automatically blocking trades based on ML predictions carries legal risk depending on the jurisdiction. It is safer to warn users than to automatically restrict trading.
Development timeline
Data collection and labeling: 3-4 weeks; modeling: 2-3 weeks; infrastructure and alerting: 3-5 weeks; testing: 2 weeks.
Total: 8-14 weeks.