Recommendation Systems: From Collaborative Filtering to Real-Time Serving
An e-commerce project with a 300k-SKU catalog. CTR on recommendations: 1.8%. After replacing the "popular in the last 7 days" rule with collaborative filtering: 3.1%. After adding content features and re-ranking: 4.4%. Real numbers from a real project. The difference between "show popular" and "show personalized" is measurable and substantial.
Collaborative Filtering: Matrix Factorization and Neural Approaches
Matrix Factorization. ALS (Alternating Least Squares) is the classic algorithm for implicit feedback (clicks, views, purchases without explicit ratings). The implicit library implements ALS with GPU acceleration and processes user×item matrices with hundreds of millions of non-zero values in minutes. Latent factors of 64-256 and regularization λ = 0.01-0.1 are reasonable starting parameters.
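The alternating update can be sketched in plain numpy. This is a toy dense version of the confidence-weighted ALS from Hu, Koren & Volinsky that the implicit library runs on sparse matrices; the function name, hyperparameters, and toy matrix are illustrative, not the library's API.

```python
import numpy as np

def als_implicit(R, factors=2, reg=0.1, alpha=40.0, iters=10, seed=0):
    """Minimal ALS for implicit feedback.
    R: dense user x item matrix of raw interaction counts."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    X = rng.normal(scale=0.1, size=(n_users, factors))  # user factors
    Y = rng.normal(scale=0.1, size=(n_items, factors))  # item factors
    P = (R > 0).astype(float)   # binary preference: interacted or not
    C = 1.0 + alpha * R         # confidence grows with interaction count
    I = np.eye(factors)
    for _ in range(iters):
        # Fix Y, solve per user: (Y^T C_u Y + reg*I) x_u = Y^T C_u p_u
        for u in range(n_users):
            Cu = np.diag(C[u])
            X[u] = np.linalg.solve(Y.T @ Cu @ Y + reg * I, Y.T @ Cu @ P[u])
        # Fix X, solve per item symmetrically
        for i in range(n_items):
            Ci = np.diag(C[:, i])
            Y[i] = np.linalg.solve(X.T @ Ci @ X + reg * I, X.T @ Ci @ P[:, i])
    return X, Y

# Toy matrix: users 0-1 interact with items 0-1, users 2-3 with items 2-3.
R = np.array([[3, 1, 0, 0],
              [2, 4, 0, 0],
              [0, 0, 5, 2],
              [0, 0, 1, 3]], dtype=float)
X, Y = als_implicit(R)
scores = X @ Y.T  # predicted preference for every user-item pair
```

In production the per-user solve is vectorized over a sparse CSR matrix; the dense loops above only show the math.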
Cold start problem: a new user has no interaction history, a new item has no interactions. Classic CF is helpless here; you need content features or a hybrid approach.
Neural Collaborative Filtering. NCF replaces the linear dot product in MF with a neural network. In practice the gains over well-tuned ALS are moderate, but NCF is easier to extend with additional features (user age, product category, time of day).
Sequence-aware models. When interaction order matters (the user watched A → B → C, what to show next), use SASRec or BERT4Rec. A transformer over the interaction sequence is state-of-the-art for session recommendations: it trains on sequences and predicts the next item.
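The "trains on sequences, predicts next item" setup can be made concrete with the data shaping alone (the model itself is out of scope). A minimal sketch; the helper name and window size are illustrative:

```python
def next_item_examples(session, max_len=5):
    """Turn one session [A, B, C, D] into (prefix, target) training
    examples for a next-item model: ([A], B), ([A, B], C), ([A, B, C], D).
    Prefixes are truncated to the last max_len items, since SASRec-style
    models use a fixed-length context window."""
    examples = []
    for t in range(1, len(session)):
        prefix = session[max(0, t - max_len):t]
        examples.append((prefix, session[t]))
    return examples

print(next_item_examples(["A", "B", "C", "D"]))
# [(['A'], 'B'), (['A', 'B'], 'C'), (['A', 'B', 'C'], 'D')]
```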
Content-Based Filtering: When Interaction History Is Small
Content-based filtering recommends based on item characteristics, not user behavior. It solves cold start for items: a new product with a description and category can be recommended immediately.
Text embeddings. Product descriptions → embeddings via sentence-transformers (multilingual-e5-base or BGE-M3 for a multilingual catalog) → similarity search via cosine similarity. For 100k products, FAISS IndexFlatIP answers a query in <5ms.
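What IndexFlatIP computes is an exact top-k by inner product, which on L2-normalized vectors equals cosine similarity. A numpy sketch of that search (the random embeddings stand in for sentence-transformer outputs, and the function name is illustrative):

```python
import numpy as np

def top_k_inner_product(index_vecs, query, k=5):
    """Exact top-k by inner product: with L2-normalized vectors this is
    cosine similarity, i.e. the same result FAISS IndexFlatIP returns."""
    scores = index_vecs @ query
    top = np.argpartition(-scores, k)[:k]   # unordered top-k, O(n)
    return top[np.argsort(-scores[top])]    # sort only the k winners

rng = np.random.default_rng(0)
embs = rng.normal(size=(1000, 64)).astype("float32")
embs /= np.linalg.norm(embs, axis=1, keepdims=True)  # normalize -> cosine
query = embs[42]            # a product is its own nearest neighbor
ids = top_k_inner_product(embs, query)
```

With FAISS the same search is the index build plus `index.search`; the brute-force version above is plenty for 100k vectors and shows exactly what the index accelerates.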
Structured features. Category, brand, price, specs: fed through embedding layers in a neural network or as categorical features in gradient boosting. CatBoost handles categorical features well without manual encoding.
Item2Vec. Train Word2Vec on interaction sequences: item_id instead of words, sessions instead of sentences. Fast, interpretable, and works well for "similar products."
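The signal Item2Vec learns is which items co-occur within a session (Word2Vec with negative sampling is known to factorize shifted co-occurrence/PMI statistics). A crude but runnable stand-in using raw co-occurrence counts; the function name and toy sessions are illustrative, and in practice you would call gensim's Word2Vec on the session lists instead:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_similar(sessions, item, top_n=3):
    """Count how often items appear in the same session; a rough proxy
    for the similarity an Item2Vec embedding would learn."""
    co = Counter()
    for s in sessions:
        for a, b in combinations(set(s), 2):
            co[(a, b)] += 1
            co[(b, a)] += 1
    scores = Counter({b: c for (a, b), c in co.items() if a == item})
    return [b for b, _ in scores.most_common(top_n)]

sessions = [["phone", "case", "charger"],
            ["phone", "case"],
            ["laptop", "mouse"],
            ["laptop", "mouse", "bag"]]
print(cooccurrence_similar(sessions, "phone"))  # ['case', 'charger']
```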
Hybrid Approaches: Two-Stage Retrieval + Ranking
Production recommendation systems are almost always two-level.
Stage 1: Retrieval (candidate generation). From 300k products, quickly select 100-500 candidates. Tools: ALS or a Two-Tower model (separate encoders for user and item, dot product for scoring). Vector search via FAISS or Qdrant. The requirement is speed: <20ms.
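The serving-side split of a Two-Tower model looks like this: item embeddings are computed offline by the item tower and indexed, the user embedding arrives at request time, and retrieval is a dot-product top-k. A numpy sketch under those assumptions (the encoders themselves are out of scope; random vectors stand in for tower outputs, and the catalog is scaled down):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for the item tower's output: one embedding per catalog item,
# computed offline and indexed (scaled down from 300k for the sketch).
item_embs = rng.normal(size=(50_000, 64)).astype("float32")
item_embs /= np.linalg.norm(item_embs, axis=1, keepdims=True)

def retrieve(user_emb, k=500):
    """Candidate generation: dot-product scores against every item,
    return the top-k item ids. In production this full matrix product
    is replaced by an ANN index (FAISS/Qdrant) to hit the <20ms budget."""
    scores = item_embs @ user_emb
    top = np.argpartition(-scores, k)[:k]
    return top[np.argsort(-scores[top])]

user_emb = item_embs[7]  # stand-in for the user tower's request-time output
candidates = retrieve(user_emb)
```

The point of the architecture is exactly this decoupling: the expensive towers never run jointly per user-item pair at serving time.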
Stage 2: Ranking. From the 100-500 candidates, rank the final list (top 10-20). A heavy model with rich features: gradient boosting (LightGBM, CatBoost) or a neural network with cross-features. This stage accounts for context: device, time, previous session actions. Requirement: <50-100ms.
LightFM is a library implementing hybrid factorization models that support item and user features. A good starting point at mid scale without heavy infrastructure.
Real-Time Serving: Architecture Under Load
A recommendation system on the homepage runs under a 50-100ms latency SLA at thousands of requests per second. Serving architecture matters.
Precomputation vs. real-time. For most users, recommendations are precomputed and cached. A batch job runs hourly/nightly → top-100 recommendations stored in Redis by user_id → read from cache on request. Latency <5ms. Downside: doesn't account for events from the last few hours.
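The read path is two functions: the batch job writes a serialized top-N per user, the request handler reads it and falls back to popular items on a miss. A minimal sketch with a dict standing in for Redis (in production the same get/set calls go through a redis-py client, with an expiry on the write; all names here are illustrative):

```python
import json

# Stand-in for a Redis client; a dict exposes the same get() shape.
cache = {}

def store_recommendations(user_id, item_ids):
    """Batch-job side: write the precomputed top-N for one user.
    With real Redis you would also set a TTL on the key so stale
    recommendations expire between batch runs."""
    cache[f"recs:{user_id}"] = json.dumps(item_ids)

def get_recommendations(user_id, fallback):
    """Request side: cache hit is the <5ms path; a miss (new or
    inactive user) falls back to the popularity list."""
    raw = cache.get(f"recs:{user_id}")
    return json.loads(raw) if raw is not None else fallback

store_recommendations(17, [101, 205, 333])
print(get_recommendations(17, fallback=[1, 2, 3]))  # [101, 205, 333]
print(get_recommendations(99, fallback=[1, 2, 3]))  # [1, 2, 3]
```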
Real-time context update. A hybrid: base recommendations from cache plus real-time re-ranking with recent session actions. Kafka event stream (clicks, cart adds) → feature computation → context feature update → fast re-ranking.
Feature serving. Redis for user features with TTL (view count over the last 24h, last clicked item). Read latency <1ms. At 10k req/s, a Redis Cluster with replication.
A/B testing. Recommendation systems can't be evaluated on offline metrics (NDCG, MAP) alone. Offline metrics correlate with online CTR, but not always. An A/B test with 5-10% of traffic on the new model, monitoring CTR, conversion, and revenue per session, is the only reliable way.
Metrics: Offline and Online
Offline metrics:
- NDCG@k (Normalized Discounted Cumulative Gain): accounts for position in the list
- MAP@k (Mean Average Precision): for binary-relevance tasks
- Recall@k: coverage of the ground truth, what share of relevant items made it into the top-k
- Coverage: what share of the catalog is actually recommended (fights popularity bias)
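The rank-sensitive metrics above are short functions. Sketches of binary-relevance NDCG@k and Recall@k (MAP@k is analogous and omitted for brevity; function names are illustrative):

```python
import numpy as np

def ndcg_at_k(recommended, relevant, k=10):
    """Binary-relevance NDCG@k: each hit at rank r (1-based) contributes
    1/log2(r+1); the sum is normalized by the best achievable DCG."""
    rel = set(relevant)
    dcg = sum(1.0 / np.log2(i + 2)
              for i, item in enumerate(recommended[:k]) if item in rel)
    ideal = sum(1.0 / np.log2(i + 2) for i in range(min(len(rel), k)))
    return dcg / ideal if ideal > 0 else 0.0

def recall_at_k(recommended, relevant, k=10):
    """Share of relevant items that made it into the top-k."""
    rel = set(relevant)
    return len(rel & set(recommended[:k])) / len(rel) if rel else 0.0

# The same hit at rank 1 scores 1.0; pushed to rank 3 it scores 0.5.
print(ndcg_at_k([5, 1, 2], [5], k=3))  # 1.0
print(ndcg_at_k([1, 2, 5], [5], k=3))  # 0.5
print(recall_at_k([1, 2, 3, 4], [2, 9], k=3))  # 0.5
```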
Online metrics:
- CTR (Click-Through Rate): basic engagement
- Conversion Rate: from recommendation to purchase/target action
- Revenue per user
- Diversity: variety of recommendations (don't show 10 near-identical products)
Popularity bias is a chronic CF problem. Popular items get more interactions → the model recommends them more → they get even more. The long tail (80% of the catalog) is poorly recommended. Solutions: diversity-aware re-ranking, debiasing in the loss, popularity normalization of implicit feedback.
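The simplest of these corrections, a popularity penalty at re-ranking time, fits in a few lines. A sketch under illustrative names and weights (the log penalty and λ are one common choice, not the only one):

```python
import numpy as np

def popularity_penalized_rerank(candidates, scores, popularity, lam=0.5):
    """Re-rank candidates by model score minus a log-popularity penalty,
    pushing long-tail items up. lam trades accuracy against coverage."""
    pop = np.log1p(np.asarray([popularity[c] for c in candidates]))
    adjusted = np.asarray(scores) - lam * pop
    order = np.argsort(-adjusted)
    return [candidates[i] for i in order]

popularity = {"hit": 10_000, "mid": 500, "tail": 20}
# The blockbuster has the highest raw score, but after the penalty
# the less popular items overtake it.
print(popularity_penalized_rerank(
    ["hit", "mid", "tail"], [4.0, 3.5, 1.0], popularity))
# ['mid', 'tail', 'hit']
```

In practice λ is tuned by watching the coverage and diversity metrics above against CTR in an A/B test.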
Project Stages
Data audit. Examine the interaction history: user×item matrix density (<0.1% is typical), activity distribution (20% of users generate 80% of interactions), temporal patterns, cold start statistics.
Baseline. Popular items as recommendations: a simple baseline that is often hard to beat significantly. Fix the offline-metric baseline before iterating.
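The baseline itself is a one-liner worth writing down, since every model is compared against it. A sketch; the (user_id, item_id) tuple format and function name are illustrative:

```python
from collections import Counter

def popularity_baseline(interactions, k=10):
    """Most-interacted items over the window: the non-personalized
    list every personalized model must beat on offline metrics."""
    counts = Counter(item for _, item in interactions)
    return [item for item, _ in counts.most_common(k)]

# (user_id, item_id) interaction log
interactions = [(1, "a"), (2, "a"), (3, "b"), (1, "b"), (2, "a"), (4, "c")]
print(popularity_baseline(interactions, k=2))  # ['a', 'b']
```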
Iterative improvement. ALS → add content features → two-stage system → sequence-aware models. Measure each step offline and verify with an A/B test.
Serving infrastructure. Batch precomputation, Redis caching, real-time re-ranking, monitoring.
A prototype on existing data with offline validation: 2-3 weeks. A production system with two-stage ranking, A/B testing, and monitoring: 2-3 months.







