A/B Testing Setup for ML Models in Production


Setting up A/B testing of ML models in production

A/B testing is the most reliable way to measure the business impact of a new model version. Offline metrics on a test dataset can show that the model has become more accurate, but they don't answer the question that matters: will this result in more revenue or a better user experience? A properly configured A/B test gives a statistically valid answer.

Differences between ML A/B and classic A/B

In classic A/B, users are randomly assigned to groups once. In ML A/B, there are additional complications:

  • Novelty effect: users react to the change itself, regardless of the quality of the model
  • Long-term effects: recommender systems influence long-term behavior that is not visible in a short-term test
  • Carryover effects: predictions a user received earlier influence their current behavior
  • Network effects: in collaborative systems, the behavior of one user influences others

A/B architecture for ML

Traffic distribution levels:

  1. User-level split: each user always receives the same version of the model. Suitable for personalization and recommendations.

  2. Request-level split: each request is randomly routed to one of the versions. Suitable for stateless services (search, pricing).

  3. Cohort-based split: users are first divided into segments. Important for keeping demographic characteristics balanced between groups.

Traffic routing:

import hashlib

def get_model_version(user_id: str, experiment_id: str) -> str:
    # Deterministic hashing gives each user a stable assignment
    hash_key = f"{experiment_id}:{user_id}"
    hash_value = int(hashlib.md5(hash_key.encode()).hexdigest(), 16)
    bucket = hash_value % 100  # 0-99

    if bucket < 50:  # 50% of traffic
        return "model_v2"
    else:
        return "model_v1_control"
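Before relying on hash-based routing, it is worth checking that the assignment is stable and that the observed split is close to the configured one. The sketch below is one possible check, not part of the original setup: `assign_variant`, its `treatment_percent` parameter, the `exp-001` experiment ID, and the synthetic user IDs are all hypothetical names introduced for illustration.

```python
import hashlib
from collections import Counter


def assign_variant(user_id: str, experiment_id: str, treatment_percent: int = 50) -> str:
    """Hypothetical generalization of the routing function with a configurable share."""
    hash_key = f"{experiment_id}:{user_id}"
    bucket = int(hashlib.md5(hash_key.encode()).hexdigest(), 16) % 100  # 0-99
    return "model_v2" if bucket < treatment_percent else "model_v1_control"


def treatment_share(user_ids, experiment_id: str, treatment_percent: int = 50) -> float:
    """Fraction of the given users that land in the treatment group."""
    counts = Counter(assign_variant(u, experiment_id, treatment_percent) for u in user_ids)
    return counts["model_v2"] / len(user_ids)


# Synthetic user IDs, used only to exercise the function
user_ids = [f"user-{i}" for i in range(10_000)]

# Stability: the same user and experiment always map to the same version
stable = all(
    assign_variant(u, "exp-001") == assign_variant(u, "exp-001") for u in user_ids[:100]
)

print(f"Assignment stable: {stable}")
print(f"Observed treatment share: {treatment_share(user_ids, 'exp-001'):.3f}")
```

Because the experiment ID is part of the hash key, a user's bucket in one experiment does not determine their bucket in another, which helps keep concurrent experiments from sharing the same treatment group.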

Tools

Nginx / Envoy: infrastructure-level routing based on headers or weights.

Seldon Core / KServe: Kubernetes-native inference with built-in A/B traffic splitting. An abbreviated Seldon Core manifest:

apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
spec:
  predictors:
    - name: control
      traffic: 70
      graph:
        name: model-v1
    - name: treatment
      traffic: 30
      graph:
        name: model-v2

Feature flags (LaunchDarkly, Unleash): flexible management of experiments without redeployment.

Statistical methodology

Metrics for ML A/B:

  • Primary metric: business metric (conversion, ARPU, retention)
  • Guardrail metrics: latency and error rate, which must not degrade
  • Secondary metrics: proxy indicators (CTR, engagement)
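
One way to combine these metric types into a launch decision is sketched below. It is only an illustration under stated assumptions: the `GuardrailResult` class, the `decide` function, the tolerance values, and the returned labels are hypothetical, and every guardrail is assumed to be a "lower is better" metric such as latency or error rate.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GuardrailResult:
    """Hypothetical container for one guardrail metric (lower is better)."""
    name: str
    control_value: float
    treatment_value: float
    max_relative_increase: float  # tolerated degradation, e.g. 0.05 means 5%

    def passed(self) -> bool:
        if self.control_value == 0:
            return self.treatment_value == 0
        increase = (self.treatment_value - self.control_value) / self.control_value
        return increase <= self.max_relative_increase


def decide(primary_lift: float, primary_p_value: float,
           guardrails: List[GuardrailResult], alpha: float = 0.05) -> str:
    """Guardrails are checked first; the primary metric decides only if all of them pass."""
    failed = [g.name for g in guardrails if not g.passed()]
    if failed:
        return "reject: guardrail degraded (" + ", ".join(failed) + ")"
    if primary_p_value < alpha and primary_lift > 0:
        return "ship"
    return "keep control"


# Illustrative values only, not measurements from a real experiment
guardrails = [
    GuardrailResult("p95_latency_ms", 120.0, 123.0, max_relative_increase=0.05),
    GuardrailResult("error_rate", 0.010, 0.010, max_relative_increase=0.10),
]
print(decide(primary_lift=0.03, primary_p_value=0.01, guardrails=guardrails))
```

Secondary metrics are deliberately left out of the decision rule here: they help explain a result but, in this sketch, do not gate the launch.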

Sample size and test power:

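The required sample size depends heavily on whether the target effect is absolute or relative. With a 5% baseline conversion rate, a significance level of α=0.05, and 80% power, detecting a 2 percentage point absolute lift (5% to 7%) takes roughly 2,200 users per group under the standard two-proportion normal approximation, while detecting a 2% relative lift (5% to 5.1%) takes roughly 750,000 users per group. Use a power calculator (for example, one built on scipy.stats.norm, or an online tool) before launching.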
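
A power calculation of this kind can be sketched with scipy.stats.norm. The function below uses the common normal-approximation formula for comparing two proportions with a two-sided test; `sample_size_per_group` is a hypothetical helper name, and other calculators may use slightly different variance assumptions and return slightly different numbers.

```python
import math

from scipy.stats import norm


def sample_size_per_group(p_control: float, p_treatment: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """Users needed per group to detect p_control -> p_treatment (two-sided test)."""
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value for the significance level
    z_beta = norm.ppf(power)           # quantile corresponding to the desired power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    delta = p_treatment - p_control
    return math.ceil((z_alpha + z_beta) ** 2 * variance / delta ** 2)


baseline = 0.05
# Reading "2%" as an absolute lift: 5% -> 7%
print("Absolute 2 pp lift:", sample_size_per_group(baseline, baseline + 0.02))
# Reading "2%" as a relative lift: 5% -> 5.1%
print("Relative 2% lift:", sample_size_per_group(baseline, baseline * 1.02))
```

The two readings of the same "2%" differ by orders of magnitude in required traffic, so it is worth fixing the definition of the minimum detectable effect before the test starts.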

Stopping the test:

  • Do not stop the test ahead of schedule because of early results (the peeking problem)
  • Minimum duration: 1-2 weeks, to account for daily and weekly patterns
  • Use sequential testing (e.g., e-values) if you need to make decisions earlier
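
The cost of peeking can be illustrated with a small A/A simulation, in which both groups share the same true conversion rate, so every "significant" result is a false positive. This is a sketch under assumptions: it uses a pooled two-proportion z-test rather than any specific production method, and the function names, the 5% rate, the ten interim looks, and the simulation sizes are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import norm


def two_proportion_p_value(success_a, n_a, success_b, n_b) -> float:
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (success_a + success_b) / (n_a + n_b)
    se = np.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (success_b / n_b - success_a / n_a) / se
    return float(2 * norm.sf(abs(z)))


def simulate_false_positive_rates(n_sims=1000, n_per_group=5000, n_looks=10,
                                  rate=0.05, alpha=0.05, seed=0):
    """Return (false positive rate with peeking, false positive rate at the planned end)."""
    rng = np.random.default_rng(seed)
    look_sizes = np.arange(1, n_looks + 1) * (n_per_group // n_looks)
    peeking_hits = 0
    final_hits = 0
    for _ in range(n_sims):
        # A/A test: both groups are drawn from the same conversion rate
        cum_a = np.cumsum(rng.random(n_per_group) < rate)
        cum_b = np.cumsum(rng.random(n_per_group) < rate)
        p_values = [two_proportion_p_value(cum_a[n - 1], n, cum_b[n - 1], n)
                    for n in look_sizes]
        peeking_hits += int(min(p_values) < alpha)   # stop at the first "significant" look
        final_hits += int(p_values[-1] < alpha)      # look only once, at the planned end
    return peeking_hits / n_sims, final_hits / n_sims


peeking_rate, final_rate = simulate_false_positive_rates()
print(f"False positive rate with peeking: {peeking_rate:.3f}")
print(f"False positive rate at planned end: {final_rate:.3f}")
```

Stopping at the first interim look that crosses the threshold can only raise the false positive rate relative to a single look at the planned end, which is why early stopping calls for a method designed for repeated looks.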

Analysis of results

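import numpy as np
from scipy import stats

# Placeholder 0/1 outcomes, one per user; replace with real experiment logs
control_conversions = np.array([0, 1, 0, 1, 0, 0, 0, 1, 0, 0])
treatment_conversions = np.array([0, 1, 1, 0, 1, 0, 1, 0, 0, 1])

# Welch's t-test is meant for continuous metrics (e.g. revenue per user);
# it is applied to the 0/1 arrays here only to show the call
t_stat, t_p_value = stats.ttest_ind(control_conversions, treatment_conversions,
                                    equal_var=False)

# Chi-squared test for binary metrics, built from success/failure counts
control_success = int(control_conversions.sum())
control_fail = len(control_conversions) - control_success
treatment_success = int(treatment_conversions.sum())
treatment_fail = len(treatment_conversions) - treatment_success
contingency = [[control_success, control_fail],
               [treatment_success, treatment_fail]]
chi2, p_value, dof, expected = stats.chi2_contingency(contingency)

control_rate = control_conversions.mean()
treatment_rate = treatment_conversions.mean()

print(f"Relative lift: {(treatment_rate - control_rate) / control_rate:.2%}")
print(f"P-value: {p_value:.4f}")
print(f"Statistically significant: {p_value < 0.05}")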
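
A p-value says whether a difference is detectable, not how large it plausibly is, so it can help to report an interval alongside it. The sketch below computes a Wald (normal-approximation) confidence interval for the absolute difference in conversion rates; `diff_confidence_interval` is a hypothetical helper, the counts are illustrative rather than real results, and the approximation is less reliable for small samples or rates near 0 or 1.

```python
import math

from scipy.stats import norm


def diff_confidence_interval(success_c: int, n_c: int, success_t: int, n_t: int,
                             confidence: float = 0.95):
    """Wald interval for (treatment rate - control rate)."""
    p_c = success_c / n_c
    p_t = success_t / n_t
    # Unpooled standard error of the difference between two proportions
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z = norm.ppf(1 - (1 - confidence) / 2)
    diff = p_t - p_c
    return diff - z * se, diff + z * se


# Illustrative counts only
low, high = diff_confidence_interval(success_c=500, n_c=10_000,
                                     success_t=560, n_t=10_000)
print(f"95% CI for absolute lift: [{low:.4f}, {high:.4f}]")
```

If the interval includes zero, the data are consistent with no effect at the chosen confidence level; if it lies entirely above zero, its lower bound gives a conservative estimate of the lift.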

A properly configured A/B test lets you base model deployment decisions on data with a measurable level of confidence rather than on intuition.