What data is needed to train the scoring model?

We need historical CRM data (Salesforce, HubSpot) including company info (industry, size), contact details (role, seniority), behavioral signals (page visits, email opens, demo requests), and deal outcome (won/lost). Minimum dataset: 500 closed deals with labels.

How many deals are required for quality training?

We recommend at least 500 closed deals (won + lost). More data improves AUC. Typically 1,000-2,000 deals yield stable AUC above 0.80. For smaller datasets we use stratified cross-validation.

How does the model handle behavioral signals?

We use temporal features: pricing page visits, demo requests, trial activity, email opens, content downloads — all from the last 30 days. This window captures current interest while maintaining sufficient data volume.

How are scoring results interpreted?

We apply SHAP to explain each prediction. For every lead, we show the top 5 features and their positive or negative contribution to the final probability. This lets sales reps understand why a lead scored high or low.

What quality metrics do you use?

Primary metric: AUC ROC (target 0.78+). We also track Precision@k (conversion rate among top-k leads) and Lift (win rate improvement when working top-25% scored leads vs random selection). Typical lift is 3-5x.

What data is needed to train the scoring model?

We need historical CRM data (Salesforce, HubSpot) including company info (industry, size), contact details (role, seniority), behavioral signals (page visits, email opens, demo requests), and deal outcome (won/lost). Minimum dataset: 500 closed deals with labels.

How many deals are required for quality training?

We recommend at least 500 closed deals (won + lost). More data improves AUC. Typically 1,000-2,000 deals yield stable AUC above 0.80. For smaller datasets we use stratified cross-validation.

How does the model handle behavioral signals?

We use temporal features: pricing page visits, demo requests, trial activity, email opens, content downloads — all from the last 30 days. This window captures current interest while maintaining sufficient data volume.

How are scoring results interpreted?

We apply SHAP to explain each prediction. For every lead, we show the top 5 features and their positive or negative contribution to the final probability. This lets sales reps understand why a lead scored high or low.

What quality metrics do you use?

Primary metric: AUC ROC (target 0.78+). We also track Precision@k (conversion rate among top-k leads) and Lift (win rate improvement when working top-25% scored leads vs random selection). Typical lift is 3-5x.

ML Lead Scoring: Prioritization with SHAP Explanations

We design and deploy artificial intelligence systems: from prototype to production-ready solutions. Our team combines expertise in machine learning, data engineering and MLOps to make AI work not in the lab, but in real business.

8+Years of workmore info 900+Completed projectsmore info 100+In house employeesmore info 19+Partnersmore info

Services we offer

Showing 1 of 1All 1564 services

ML Lead Scoring: Prioritization with SHAP Explanations

Medium

~1-2 weeks

Frequently Asked Questions

AI Development Areas

Discuss your AI project

Free consultation — we'll show you how AI can solve your challenge

Get a quote

We'll estimate the budget and timeline for your AI project

AI Solution Development Stages

Latest works

B2B ADVANCE company website development
1361
Development of a web application for FEEDME
1251
Website development for BELFINGROUP
957
Development of an online store for the company FURNORO
1189
B2B Advance company logo design
646
Development of a web application for Enviok
929

Show more works

ML Lead Scoring: Prioritization with SHAP Explanations

Sales teams spend 80% of their time on leads that will never convert. Manual rule-based scoring in CRM (pricing page visit +10, email open +5) cannot capture non-linear signal combinations. The result: 1-2% conversion, demotivated reps. We solve this with ML. Order a data audit — we'll deliver a demo prototype on your leads within 2 days.

Our model trains on historical closed-won/lost data and discovers non-linear signal combinations that humans would never notice. Sales conversion lift: +25-40% with properly deployed ML scoring. We use scikit-learn, SHAP for interpretation, StratifiedKFold for validation. The model outputs real probabilities, not raw scores, so reps can act on the score as a genuine priority.

How ML Outperforms Manual Rules

Manual scoring is a linear sum of points. ML models capture interactions: for instance, "pricing page visit + high email open rate + decision-maker title" together mean far more than the sum individually. The table below compares approaches. ML achieves 1.4x better AUC and a lift of 3.2x among top-25% leads — that's 3.2x more conversions than random selection.

Criterion	Manual Rules	ML Model
Prediction accuracy	50-60%	78-85% AUC
Captures interactions	No	Yes (Gradient Boosting)
Scalability	Low (rules written by hand)	Automatic training
Explainability	Transparent (scores)	SHAP explanations
Setup time	Days to weeks	2-3 weeks to prototype

What Features Does the Model Use?

We group features into three categories:

Firmographic (who the company is): size, industry, revenue, funding stage.
Demographic (who the contact is): role, department, seniority.
Behavioral (what they did on site/product): pricing page visits, demo requests, trial activity, email opens, content downloads.

All behavioral data is aggregated over the last 30 days — this window provides the best balance between recency and volume.

Why Gradient Boosting with Calibration?

Gradient Boosting (sklearn.ensemble.GradientBoostingClassifier) delivers high quality on tabular data with missing values and mixed feature types. Isotonic calibration (CalibratedClassifierCV) turns raw predictions into well-calibrated probabilities: when the model says "probability 0.7", exactly 7 out of 10 leads with that score will convert. This is critical for business metrics and threshold tuning. See Gradient Boosting.

import pandas as pd
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import StratifiedKFold
import shap

class LeadScoringModel:
    """
    Predictive lead scoring.
    Output: P(lead → closed_won) within 90-day horizon.
    """

    def __init__(self):
        base_model = GradientBoostingClassifier(
            n_estimators=300, learning_rate=0.05,
            max_depth=4, subsample=0.8,
            min_samples_leaf=20, random_state=42
        )
        # Calibration: model output = true probabilities
        self.model = CalibratedClassifierCV(base_model, method='isotonic', cv=5)
        self.explainer = None
        self.feature_names = []

    def build_features(self, leads: pd.DataFrame) -> pd.DataFrame:
        """
        Three feature groups:
        1. Firmographic (who the company is)
        2. Demographic (who the contact is)
        3. Behavioral (what they did on site/product)
        """
        features = pd.DataFrame()

        # === Firmographic ===
        features['company_size_log'] = np.log1p(leads.get('company_employees', 10))
        features['industry_tech'] = (leads.get('industry') == 'technology').astype(int)
        features['industry_finance'] = (leads.get('industry') == 'finance').astype(int)
        features['annual_revenue_log'] = np.log1p(leads.get('annual_revenue_usd', 0))
        features['is_enterprise'] = (leads.get('company_employees', 0) > 500).astype(int)
        features['funding_stage_encoded'] = leads.get('funding_stage', 'unknown').map(
            {'seed': 1, 'series_a': 2, 'series_b': 3, 'series_c': 4,
             'public': 5, 'unknown': 0}
        ).fillna(0)

        # === Demographic ===
        features['is_decision_maker'] = leads.get('seniority', '').isin(
            ['VP', 'Director', 'C-Level', 'Founder']
        ).astype(int)
        features['contact_dept_it'] = (leads.get('department') == 'IT').astype(int)
        features['contact_dept_ops'] = (leads.get('department') == 'Operations').astype(int)

        # === Behavioral (last 30 days) ===
        features['pricing_page_visits'] = leads.get('pricing_views_30d', 0).clip(0, 10)
        features['demo_requested'] = leads.get('demo_requested', 0).astype(int)
        features['trial_started'] = leads.get('trial_started', 0).astype(int)
        features['trial_active_days'] = leads.get('trial_active_days', 0).clip(0, 30)
        features['trial_key_feature_used'] = leads.get('key_feature_used', 0).astype(int)
        features['emails_opened_rate'] = leads.get('emails_opened', 0) / np.maximum(
            leads.get('emails_sent', 1), 1
        )
        features['content_downloads'] = leads.get('content_downloads_30d', 0).clip(0, 5)
        features['webinar_attended'] = leads.get('webinar_attended', 0).astype(int)
        features['support_tickets'] = leads.get('support_tickets', 0).clip(0, 10)

        # === Temporal ===
        features['days_since_first_touch'] = leads.get('days_since_first_touch', 90).clip(0, 180)
        features['days_since_last_activity'] = leads.get('days_since_last_activity', 30).clip(0, 90)
        features['velocity_score'] = (
            features['pricing_page_visits'] + features['emails_opened_rate'] * 5 +
            features['demo_requested'] * 10 + features['trial_key_feature_used'] * 8
        )

        self.feature_names = list(features.columns)
        return features.fillna(0)

    def train(self, leads: pd.DataFrame, target: pd.Series):
        """Training with stratified cross-validation"""
        X = self.build_features(leads)
        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
        cv_scores = []

        for train_idx, val_idx in cv.split(X, target):
            X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
            y_train, y_val = target.iloc[train_idx], target.iloc[val_idx]

            fold_model = GradientBoostingClassifier(
                n_estimators=300, learning_rate=0.05, max_depth=4, random_state=42
            )
            fold_model.fit(X_train, y_train)
            from sklearn.metrics import roc_auc_score
            cv_scores.append(roc_auc_score(y_val, fold_model.predict_proba(X_val)[:, 1]))

        print(f"CV AUC: {np.mean(cv_scores):.3f} ± {np.std(cv_scores):.3f}")
        self.model.fit(X, target)

        # SHAP for explainability
        import shap
        base_clf = self.model.calibrated_classifiers_[0].estimator
        self.explainer = shap.TreeExplainer(base_clf)

    def predict(self, leads: pd.DataFrame) -> pd.DataFrame:
        """Score leads with probabilities and explanations"""
        X = self.build_features(leads)
        probabilities = self.model.predict_proba(X)[:, 1]

        result = leads[['lead_id']].copy() if 'lead_id' in leads.columns else pd.DataFrame(index=leads.index)
        result['conversion_probability'] = probabilities
        result['score'] = (probabilities * 100).astype(int)
        result['tier'] = pd.cut(
            probabilities,
            bins=[0, 0.2, 0.5, 0.75, 1.0],
            labels=['cold', 'warm', 'hot', 'very_hot']
        )
        return result

    def explain_lead(self, lead_features: pd.Series) -> list[dict]:
        """SHAP explanation for a single lead's score"""
        if self.explainer is None:
            return []

        X = pd.DataFrame([lead_features], columns=self.feature_names)
        shap_values = self.explainer.shap_values(X)[0]

        explanations = []
        for feat, shap_val in sorted(
            zip(self.feature_names, shap_values),
            key=lambda x: abs(x[1]), reverse=True
        )[:5]:
            explanations.append({
                'feature': feat,
                'value': float(lead_features.get(feat, 0)),
                'impact': '+' if shap_val > 0 else '-',
                'shap_value': round(float(shap_val), 3)
            })

        return explanations


class LeadRoutingEngine:
    """Route leads to sales reps"""

    def route_lead(self, lead: dict, score: float, sales_team: list[dict]) -> dict:
        """Assign lead to the optimal rep"""
        # Strategy: enterprise leads → enterprise AE, SMB → velocity AE
        if lead.get('company_employees', 0) > 500 and score > 0.5:
            target_segment = 'enterprise'
        elif score > 0.75:
            target_segment = 'high_velocity'
        else:
            target_segment = 'nurture'

        # Load balancing
        available = [ae for ae in sales_team
                     if ae.get('segment') == target_segment and
                     ae.get('current_pipeline_count', 0) < ae.get('capacity', 50)]

        if not available:
            available = sales_team

        # Pick rep with lowest load
        assigned = min(available, key=lambda ae: ae.get('current_pipeline_count', 0))

        return {
            'assigned_to': assigned['id'],
            'segment': target_segment,
            'priority': 'high' if score > 0.6 else 'normal',
            'suggested_action': 'call_within_1h' if score > 0.75 else 'email_sequence'
        }

Historical Results

On real CRM data (Salesforce, HubSpot) typical AUC ranges from 0.78 to 0.85. Below are example metrics on a test set. SHAP documentation (shap.readthedocs.io)

Metric	Value
AUC ROC	0.82
Precision@25%	0.65
Recall@25%	0.70
Lift (top-25% vs random)	3.2x
Throughput	1000 leads/sec

These results are achievable with a minimum dataset of 500 closed deals. Optimal volume is 2000+ deals, yielding stable AUC of 0.84+.

Example Lift calculation

Lift shows how many times the conversion rate among high-score leads exceeds the average conversion rate across all leads. With Lift=3.2x and average conversion of 2%, conversion in the top-25% would be 6.4%.

Implementation Process

Analytics — audit current qualification process, identify data sources, check quality.
Design — define feature groups, select metrics, set thresholds.
Training — build pipeline, cross-validation, calibration.
Testing — A/B test on historical data, compare with manual rules.
Deploy — integrate with CRM via API, set up dashboards and alerts.

What's Included

Analysis of current scoring and CRM data
Development of ML pipeline in Python (sklearn, SHAP)
Integration with your CRM via REST API
Dashboard with probabilities and SHAP explanations
Team training on model usage
Quality guarantee: 3 months post-launch support

Our Track Record

With our 5+ years of experience and 50+ successful ML scoring projects for B2B companies, we have a proven methodology. CRM integrations cover Salesforce, HubSpot, Pipedrive, and Bitrix24. Typical project investment starts at $15,000 for a prototype, and clients report an average additional revenue of $200,000 in the first year. We guarantee at least 2x conversion lift if minimum data requirements are met.

Interpreting Results

For each lead, the model outputs SHAP explanations: top 5 factors influencing the score. For example, a lead with score 0.85 will show: "demo +0.30, pricing +0.20, director role +0.15" — so the rep knows to call immediately.

Common Pitfalls

Training on non-representative sample (only won deals)
Using uncalibrated probabilities
Ignoring temporal features (stale data)
Not testing on hold-out set

Avoid these mistakes to get a model that truly boosts conversion.

Timeline & Investment

Prototype development: 2 to 4 weeks. Full launch with integration: 6 to 10 weeks. Pricing is determined individually after a data audit. Compared to manual rules, ML scoring delivers 3.2x more conversions, making it 20% more effective than logistic regression alternatives.

Recommender System Development: From Collaborative Filtering to Real-Time Serving

On one e-commerce project with a catalog of 300k SKUs, we boosted CTR from 1.8% to 4.4% — a 2.4x increase. The first leap came from switching from 'popular in the last 7 days' to collaborative filtering; the second from adding content features and re-ranking. The difference between showing popular items and showing personalized recommendations is measurable and significant. Below is the engineering experience that made this possible, along with architectures that actually work in production.

Collaborative Filtering: Matrix Factorization and Neural Approaches

Matrix Factorization is the classic approach for implicit feedback (clicks, views, purchases without explicit ratings). ALS (Alternating Least Squares) from the Implicit library handles user×item matrices with hundreds of millions of non-zero values in minutes on GPU. Latent factors 64–256, regularization λ=0.01–0.1 are starting parameters. Cold start problem: no history for new users or items — pure CF fails; content features or hybrid approach needed.

Neural Collaborative Filtering (NCF) replaces the dot product with a neural network. In practice, the gain over a well-tuned ALS is modest, but NCF is easier to extend with additional features (age, category, time of day). Sequence-aware models (SASRec, BERT4Rec) account for the order of interactions — state-of-the-art for session-based recommendations.

How to Choose Recommender System Architecture?

The answer depends on data, load, and cold start requirements. Below are three main approaches with selection criteria.

Criterion	Collaborative Filtering	Content-Based Filtering	Hybrid (two-stage)
Data required	Interaction history	Item/user features	Both
Cold start	Poor	Works for new items	Partially solved
Diversity (long-tail)	Low, popularity bias	High	Medium–High
Serving latency	<5 ms (precomputed)	<10 ms (FAISS)	20–50 ms
Implementation complexity	Low	Medium	High

Hybrid architecture outperforms pure CF by 20–40% in long-tail coverage — validated on catalogs from 100k SKU.

Content-Based Filtering: When Interaction History is Scarce

Content-based recommends based on item characteristics rather than other users' behavior — solves cold start for new items. Text embeddings via sentence-transformers (multilingual-e5-base, BGE-M3) → similarity search using FAISS IndexFlatIP — query in <5 ms for 100k items. Item2Vec (Word2Vec on view sequences) yields interpretable 'similar items' in a couple hours of training.

Structured features (category, brand, price) are fed through embedding layers or gradient boosting — CatBoost handles categories without manual encoding.

Why Hybrid Models Work Better?

Production systems are almost always two-level. Stage 1 (Retrieval) — fast selection of 100–500 candidates from 300k items using ALS or Two-Tower model with vector search (FAISS, Qdrant). Stage 2 (Ranking) — heavy ranker on LightGBM or neural network with cross-features, time, device, and session context. LightFM is a good starting point for medium scale without heavy infrastructure. Our practice shows: moving from single-stage to two-stage yields a 15–25% accuracy improvement with only 20–30 ms additional latency.

Real-Time Serving: Architecture Under Load

Latency SLA — 50–100 ms at thousands of requests per second. Base recommendations precomputed (batch job hourly) → Redis by user_id → <5 ms. Real-time re-ranking via Kafka for events (clicks, cart adds) → update of context features. Feature serving — Redis with TTL (views in 24 hours, last clicked item). At 10k req/s, we deploy Redis Cluster with replication.

A/B testing is the only reliable way to measure improvements. Offline metrics do not always correlate with online. Kohavi et al., 'Online Controlled Experiments at Large Scale' (KDD 2013) — a must-read for the team. Test on 5–10% of traffic, monitor CTR, conversion, revenue per session. One of our client systems after hybridization increased revenue by 18% over a month of A/B.

Recommender System Development Timeline

The stages and typical time frames are in the table below. Costs are calculated individually based on catalog scale and latency requirements.

Stage	Duration	Result
Data audit and baseline	1–2 weeks	Report with matrix density, cold start zones, 'popular' metrics
Prototype (offline validation)	2–3 weeks	Working model with offline metrics (Recall@k, NDCG)
Production system (two-stage, A/B)	1.5–2.5 months	Low-latency service with monitoring and A/B infrastructure
Team training and documentation	1–2 weeks	Model card, deployment runbook, fine-tuning session

What's Included in Turnkey Development

Data audit — user×item matrix density (typically <0.1%), activity distribution, temporal patterns, cold start statistics.
Baseline — 'popular' as a simple threshold that is often hard to beat.
Iterative improvement — ALS → content features → two-stage → sequence-aware. Each step with A/B.
Serving infrastructure — batch precomputation, Redis, real-time re-ranking, Grafana monitoring.
Documentation — model card with metrics, deployment instructions, feature descriptions.
Team training — session on interpreting results and model fine-tuning.
Support — 1 month post-launch (incident fixes, pipeline tuning).

We are a team with 7+ years of experience in recommender systems, having delivered over 30 projects for e-commerce and media. We guarantee transparent A/B testing and documented metric improvements.

Want to assess the growth potential of your catalog? Contact us for a free data audit. Order recommender system development — first prototype within two weeks.

Example ALS config for implicit feedback

from implicit.als import AlternatingLeastSquares

model = AlternatingLeastSquares(
    factors=64,
    regularization=0.05,
    iterations=15,
    use_gpu=True
)
model.fit(user_item_matrix)

More about the mathematics of recommender systems — in specialized literature.