AI CTR/CVR Prediction System for Ad Campaigns


Predicting CTR and CVR in advertising systems

CTR (Click-Through Rate) and CVR (Conversion Rate) are the fundamental signals for programmatic pricing. A 20% error in predicted CTR translates directly into overpaying for impressions or losing auctions that were worth winning. At the scale of hundreds of millions of impressions per day, even an AUC improvement from 0.76 to 0.78 can translate into millions of dollars in saved or earned budget.
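To make the link between prediction error and spend concrete, here is a minimal sketch. It assumes a simple expected-value bidding rule (bid per 1000 impressions = predicted CTR × value per click × 1000). The rule, the function name `cpm_bid` and all numbers are illustrative assumptions, not the logic of any particular bidder.

```python
def cpm_bid(p_ctr: float, value_per_click: float) -> float:
    """Expected-value bid per 1000 impressions (assumed bidding rule)."""
    return p_ctr * value_per_click * 1000.0


true_ctr = 0.005         # hypothetical true click probability
value_per_click = 0.40   # hypothetical value of one click, in dollars

fair_bid = cpm_bid(true_ctr, value_per_click)
over_bid = cpm_bid(true_ctr * 1.2, value_per_click)    # CTR overestimated by 20%
under_bid = cpm_bid(true_ctr * 0.8, value_per_click)   # CTR underestimated by 20%

print(f"fair={fair_bid:.2f}  over={over_bid:.2f}  under={under_bid:.2f}")
```

Under this linear rule, a 20% error in predicted CTR becomes a 20% error in the bid: an overestimate pays more than the impression is worth, and an underestimate loses auctions that would have been profitable.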

Features of the CTR/CVR prediction task

CTR prediction is a binary classification task with three key challenges: extreme class imbalance (CTR of 0.1-2%), massive volume (billions of examples per day), and partially observed conversions (CVR is only observed for users who clicked, which creates selection bias).
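The selection effect can be seen on a toy impression log. The sketch below uses made-up data and reuses the `clicked` / `converted` column names from this article. It shows that CVR labels exist only for the clicked rows, while the per-impression conversion rate factors into CTR × CVR.

```python
import pandas as pd

# Toy impression log (made-up data): 10 impressions, 3 clicks, 1 conversion.
logs = pd.DataFrame({
    "clicked":   [0, 0, 1, 0, 1, 0, 0, 1, 0, 0],
    "converted": [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
})

ctr = logs["clicked"].mean()               # clicks per impression

# CVR can only be measured on the subset of rows that were clicked.
clicked_only = logs[logs["clicked"] == 1]
cvr = clicked_only["converted"].mean()     # conversions per click

# Conversions per impression, measured over the whole log.
ctcvr = logs["converted"].mean()

print(f"CTR={ctr:.3f}  CVR={cvr:.3f}  CTCVR={ctcvr:.3f}  CTR*CVR={ctr * cvr:.3f}")
```

A CVR model trained only on `clicked_only` sees a population that differs from the full impression stream it is later asked to score, which is the source of the selection bias.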

import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import roc_auc_score, log_loss

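class CTRFeatureEngineer:
    """Features for a CTR model in display advertising."""

    def build_features(self, bid_logs: pd.DataFrame) -> pd.DataFrame:
        """
        bid_logs: historical impression logs with clicked/converted flags.
        """
        df = bid_logs.copy()

        # === Per-user statistical features ===
        # NOTE: in production, compute these aggregates on an earlier time window
        # than the rows being scored; otherwise the label leaks into the features.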
        user_stats = df.groupby('user_id').agg(
            user_historical_ctr=('clicked', 'mean'),
            user_impression_count=('clicked', 'count'),
            user_conversion_rate=('converted', 'mean'),
        ).reset_index()

        # === Per-site statistical features ===
        site_stats = df.groupby('site_domain').agg(
            site_ctr=('clicked', 'mean'),
            site_conversion_rate=('converted', 'mean'),
            site_volume=('clicked', 'count'),
        ).reset_index()

        # === Time features ===
        ts = pd.to_datetime(df['timestamp'])
        df['hour'] = ts.dt.hour
        df['is_weekend'] = ts.dt.dayofweek >= 5
        df['is_prime_time'] = df['hour'].between(18, 22)

        # Ad position: above the fold = 1, below the fold = 0, unknown = 0.5
        df['ad_position_encoded'] = df['ad_position'].map({'atf': 1, 'btf': 0}).fillna(0.5)

        df = df.merge(user_stats, on='user_id', how='left')
        df = df.merge(site_stats, on='site_domain', how='left')

        # Smoothed CTR to handle sparse users: shrink the observed CTR toward
        # the global CTR (additive / Bayesian smoothing)
        alpha = 100  # Prior strength, in pseudo-impressions
        global_ctr = df['clicked'].mean()
        df['user_smooth_ctr'] = (
            df['user_historical_ctr'].fillna(global_ctr) * df['user_impression_count'].fillna(0) +
            global_ctr * alpha
        ) / (df['user_impression_count'].fillna(0) + alpha)

        feature_cols = [
            'user_smooth_ctr', 'user_impression_count',
            'site_ctr', 'site_volume',
            'hour', 'is_weekend', 'is_prime_time',
            'ad_position_encoded', 'banner_width', 'banner_height',
            'floor_price',
        ]

        return df[feature_cols].fillna(0)


class CTRModel:
    """LightGBM for CTR with proper calibration."""

    def __init__(self):
        self.model = lgb.LGBMClassifier(
            n_estimators=1000,
            learning_rate=0.03,
            num_leaves=255,
            min_child_samples=200,
            subsample=0.8,
            colsample_bytree=0.7,
            scale_pos_weight=50,  # Imbalance correction: about 1 click per 50 impressions
            random_state=42,
            n_jobs=-1,
        )
        self.calibrator = None
        self._is_calibrated = False

    def train(self, X_train: np.ndarray, y_train: np.ndarray,
              X_val: np.ndarray, y_val: np.ndarray):
        """Training with early stopping."""
        self.model.fit(
            X_train, y_train,
            eval_set=[(X_val, y_val)],
            eval_metric='auc',
            callbacks=[
                lgb.early_stopping(100, verbose=False),
                lgb.log_evaluation(200)
            ]
        )

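        # Calibration is REQUIRED before the output is used in bid price calculation:
        # raw LightGBM gives good ranking but poorly calibrated probabilities.
        # Ideally, calibrate on a holdout separate from the early-stopping set.
        # Note: newer scikit-learn releases deprecate cv='prefit'.
        self.calibrator = CalibratedClassifierCV(self.model, cv='prefit', method='isotonic')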
        self.calibrator.fit(X_val, y_val)
        self._is_calibrated = True

    def predict_ctr(self, X: np.ndarray) -> np.ndarray:
        """Calibrated click probabilities."""
        if self._is_calibrated:
            return self.calibrator.predict_proba(X)[:, 1]
        return self.model.predict_proba(X)[:, 1]

    def evaluate(self, X_test: np.ndarray, y_test: np.ndarray) -> dict:
        raw_probs = self.model.predict_proba(X_test)[:, 1]
        cal_probs = self.predict_ctr(X_test) if self._is_calibrated else raw_probs

        return {
            'auc_raw': round(roc_auc_score(y_test, raw_probs), 4),
            'auc_calibrated': round(roc_auc_score(y_test, cal_probs), 4),
            'logloss_raw': round(log_loss(y_test, raw_probs), 4),
            'logloss_calibrated': round(log_loss(y_test, cal_probs), 4),
            'mean_predicted_ctr': round(float(cal_probs.mean()), 5),
            'actual_ctr': round(float(y_test.mean()), 5),
        }


class DelayedConversionCorrector:
    """
    Correction for delayed conversions in a CVR model.
    Conversions can occur hours or days after the click.
    Cutting off the training set by time therefore introduces bias.
    """

    def adjust_for_delayed_conversions(self, clicks: pd.DataFrame,
                                       observation_window_hours: int = 24) -> pd.DataFrame:
        """
        Drop recent clicks whose conversion window has not yet elapsed.
        Otherwise CVR would be underestimated for the most recent examples.
        """
        cutoff = pd.Timestamp.now() - pd.Timedelta(hours=observation_window_hours)
        return clicks[clicks['click_time'] < cutoff]

    def estimate_conversion_delay_distribution(self,
                                               conversions: pd.DataFrame) -> dict:
        """Distribution of click-to-conversion delays, in hours."""
        delays = (conversions['conversion_time'] - conversions['click_time']).dt.total_seconds() / 3600

        return {
            'p50_hours': round(float(delays.quantile(0.50)), 1),
            'p90_hours': round(float(delays.quantile(0.90)), 1),
            'p99_hours': round(float(delays.quantile(0.99)), 1),
            # Round up so the window covers at least 95% of observed delays
            'recommended_window': f"{int(np.ceil(delays.quantile(0.95)))} hours",
        }
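
The two steps above (estimate the delay distribution, then drop clicks that are still inside the conversion window) can be tried end to end on toy data. This is a minimal, self-contained sketch: the helper names `conversion_window_hours` and `mature_clicks`, the explicit `now` argument and the timestamps are assumptions made for illustration. Passing `now` explicitly, rather than calling `pd.Timestamp.now()`, keeps the result reproducible.

```python
import numpy as np
import pandas as pd


def conversion_window_hours(conversions: pd.DataFrame, quantile: float = 0.95) -> int:
    """Smallest whole number of hours covering `quantile` of observed delays."""
    delays_h = (conversions["conversion_time"] - conversions["click_time"]).dt.total_seconds() / 3600
    return int(np.ceil(delays_h.quantile(quantile)))


def mature_clicks(clicks: pd.DataFrame, window_hours: int, now: pd.Timestamp) -> pd.DataFrame:
    """Keep only clicks old enough for their conversion window to have elapsed."""
    cutoff = now - pd.Timedelta(hours=window_hours)
    return clicks[clicks["click_time"] < cutoff]


# Toy history: four conversions with delays of 1, 2, 5 and 10 hours.
conversions = pd.DataFrame({
    "click_time": pd.to_datetime(["2024-01-01 00:00"] * 4),
    "conversion_time": pd.to_datetime([
        "2024-01-01 01:00", "2024-01-01 02:00",
        "2024-01-01 05:00", "2024-01-01 10:00",
    ]),
})

# Toy click log to be filtered before CVR training.
clicks = pd.DataFrame({
    "click_time": pd.to_datetime([
        "2024-01-08 00:00", "2024-01-09 06:00", "2024-01-09 20:00",
    ]),
})

now = pd.Timestamp("2024-01-10 00:00")
window = conversion_window_hours(conversions)
train_clicks = mature_clicks(clicks, window, now)
print(window, len(train_clicks))
```

The most recent click is excluded because it could still convert, so labelling it as a non-conversion would bias CVR downward.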

CTR/CVR Model Quality Metrics

Metric                         Good value        Purpose
AUC-ROC                        > 0.75 for CTR    Ranking power
Log Loss                       < 0.10            Probability quality
Calibration Error              < 0.005           Accuracy of CTR estimates
NDCG@1000                      > 0.85            Top auctions
Delta AUC from a new feature   > 0.001           Engineering ROI

For a CTR model, an AUC of 0.76 versus 0.74 is a significant difference at scale. Calibration is essential: bids driven by an uncalibrated model can systematically overpay by 15-30%. Weekly model updates are recommended, because the advertising landscape changes rapidly.
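
Calibration error can be measured in several ways. One common choice is expected calibration error (ECE): the average gap between predicted and observed CTR across bins of predictions. The sketch below is one possible implementation, not necessarily the definition behind the threshold quoted above. The quantile binning, the function name and the synthetic data are all assumptions.

```python
import numpy as np


def expected_calibration_error(y_true, y_prob, n_bins: int = 10) -> float:
    """Weighted mean |predicted CTR - observed CTR| over quantile bins."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    # Quantile bins: CTR predictions cluster near zero, so equal-width
    # bins on [0, 1] would leave almost every bin empty.
    edges = np.quantile(y_prob, np.linspace(0.0, 1.0, n_bins + 1))
    bin_ids = np.clip(np.searchsorted(edges, y_prob, side="right") - 1, 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(y_prob[mask].mean() - y_true[mask].mean())
            ece += mask.mean() * gap  # weight by the share of samples in the bin
    return float(ece)


# Synthetic check: clicks drawn from known probabilities in the 0.1-2% range.
rng = np.random.default_rng(0)
p_true = rng.uniform(0.001, 0.02, size=200_000)
clicks = (rng.random(p_true.size) < p_true).astype(int)

ece_calibrated = expected_calibration_error(clicks, p_true)
ece_inflated = expected_calibration_error(clicks, p_true * 1.3)  # 30% overprediction
print(f"calibrated: {ece_calibrated:.5f}  inflated: {ece_inflated:.5f}")
```

Comparing `mean_predicted_ctr` with `actual_ctr`, as the `evaluate` method does, catches a global offset. A binned measure like this one also catches miscalibration that is confined to part of the score range.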