Setting up A/B testing of ML models in production
A/B testing is the most direct way to measure the business impact of a new model version. Offline metrics on a test dataset can show that the model has become more accurate, but they do not answer the question that matters: will this translate into more revenue or a better user experience? A properly configured A/B test answers that question with a stated level of statistical confidence.
Differences between ML A/B and classic A/B
In classic A/B, users are randomly assigned to groups once. In ML A/B, there are additional complications:
- Novelty effect: users react to the change itself rather than to the quality of the model
- Long-term effects: recommender systems influence long-term behavior that is not visible in a short test
- Carryover effects: predictions a user saw earlier influence their current behavior
- Network effects: in collaborative systems, the behavior of one user influences others
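One rough way to spot a novelty effect is to look at the lift per day instead of a single aggregate number: a lift that starts high and decays toward zero suggests users were reacting to the change rather than to model quality. The sketch below assumes per-day `(conversions, users)` counts are already available for both groups; the function name and the sample numbers are hypothetical.

```python
def daily_relative_lift(control_days, treatment_days):
    """Return the relative lift per day.

    Each argument is a list of (conversions, users) tuples, one per day,
    in the same day order for both groups.
    """
    lifts = []
    for (c_conv, c_users), (t_conv, t_users) in zip(control_days, treatment_days):
        control_rate = c_conv / c_users
        treatment_rate = t_conv / t_users
        lifts.append((treatment_rate - control_rate) / control_rate)
    return lifts


# Illustrative numbers only: the lift shrinks from day to day
control_days = [(50, 1000), (50, 1000), (50, 1000)]
treatment_days = [(60, 1000), (55, 1000), (50, 1000)]
print([f"{lift:.0%}" for lift in daily_relative_lift(control_days, treatment_days)])
```

Daily lifts are noisy, so a trend like this is a hint to run the test longer, not a conclusion on its own.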
A/B architecture for ML
Traffic distribution levels:
- User-level split: each user always receives the same model version. Suitable for personalization and recommendations.
- Request-level split: each request is randomly routed to one of the versions. Suitable for stateless services (search, pricing).
- Cohort-based split: users are divided into segments and split within each segment. Important for keeping demographic characteristics balanced between groups.
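A cohort-based split can be sketched as stratified assignment: users are grouped by segment, shuffled within each segment, and alternated between groups so that every segment is split almost exactly in half. This is a minimal sketch that assumes the full user list and a segment label per user are known up front; the function name and segment labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_assignment(user_segments: dict, seed: int = 42) -> dict:
    """Map user_id -> 'treatment' or 'control', balanced within each segment."""
    rng = random.Random(seed)
    by_segment = defaultdict(list)
    for user_id, segment in user_segments.items():
        by_segment[segment].append(user_id)

    assignment = {}
    for segment, user_ids in by_segment.items():
        user_ids = sorted(user_ids)  # fixed starting order, so the seed alone decides the result
        rng.shuffle(user_ids)
        for index, user_id in enumerate(user_ids):
            # Alternating after the shuffle keeps group sizes within one user of each other
            assignment[user_id] = "treatment" if index % 2 == 0 else "control"
    return assignment


users = {f"user-{i}": ("mobile" if i % 3 == 0 else "desktop") for i in range(100)}
groups = stratified_assignment(users)
```

For streaming traffic where users are not known in advance, hashing the user ID is the more common choice, with segment balance checked after the fact.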
Traffic routing:
import hashlib

def get_model_version(user_id: str, experiment_id: str) -> str:
    # Deterministic hashing gives each user a stable assignment
    hash_key = f"{experiment_id}:{user_id}"
    hash_value = int(hashlib.md5(hash_key.encode()).hexdigest(), 16)
    bucket = hash_value % 100  # 0-99
    if bucket < 50:  # 50% of traffic
        return "model_v2"
    else:
        return "model_v1_control"
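Hash-based routing is worth verifying once traffic flows: if the observed group sizes differ from the configured split by more than chance allows (a sample ratio mismatch), the assignment or logging is probably broken and the results should not be trusted. The sketch below is one possible check using a chi-squared goodness-of-fit test. The helper names, the configurable `treatment_share`, and the 0.001 threshold are assumptions made for illustration.

```python
import hashlib

from scipy.stats import chisquare


def assign_bucket(user_id: str, experiment_id: str, treatment_share: float = 0.5) -> str:
    """Same hashing idea as above, with a configurable treatment share."""
    digest = hashlib.md5(f"{experiment_id}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # 0-99
    return "treatment" if bucket < treatment_share * 100 else "control"


def has_sample_ratio_mismatch(n_control: int, n_treatment: int,
                              treatment_share: float = 0.5,
                              threshold: float = 0.001) -> bool:
    """Return True if observed group sizes are implausible under the configured split."""
    total = n_control + n_treatment
    expected = [total * (1 - treatment_share), total * treatment_share]
    _, p_value = chisquare([n_control, n_treatment], f_exp=expected)
    return bool(p_value < threshold)


assignments = [assign_bucket(f"user-{i}", "exp-42") for i in range(10_000)]
n_treatment = assignments.count("treatment")
n_control = len(assignments) - n_treatment
print(n_control, n_treatment, has_sample_ratio_mismatch(n_control, n_treatment))
```

A strict threshold is used because this check is typically run repeatedly, and a false alarm halts the experiment.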
Tools
Nginx / Envoy - infrastructure-level routing based on headers or weights.
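Seldon Core / KServe - Kubernetes-native inference serving with built-in traffic splitting. A Seldon Core example:
# Abbreviated manifest: metadata and each graph's model implementation/URI are omitted
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
spec:
  predictors:
    - name: control
      traffic: 70
      graph:
        name: model-v1
    - name: treatment
      traffic: 30
      graph:
        name: model-v2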
Feature flags (LaunchDarkly, Unleash) - change experiment configuration without a redeploy.
Statistical methodology
Metrics for ML A/B:
- Primary metric: business metric (conversion, ARPU, retention)
- Guardrail metrics: latency, error rate; these must not degrade
- Secondary metrics: proxy indicators (CTR, engagement)
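Guardrail metrics can be encoded as explicit limits that are checked before a winner is declared. The sketch below is a simple threshold check, not a statistical test: it flags any guardrail whose relative increase exceeds a configured limit. The metric names, limits, and sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Guardrail:
    metric: str
    max_relative_increase: float  # 0.10 means "may grow by at most 10%"


def guardrail_violations(control: dict, treatment: dict, guardrails: list) -> list:
    """Return the names of guardrail metrics that degraded beyond their limit."""
    violations = []
    for guardrail in guardrails:
        baseline = control[guardrail.metric]
        relative_change = (treatment[guardrail.metric] - baseline) / baseline
        if relative_change > guardrail.max_relative_increase:
            violations.append(guardrail.metric)
    return violations


guardrails = [Guardrail("p95_latency_ms", 0.10), Guardrail("error_rate", 0.10)]
control = {"p95_latency_ms": 120.0, "error_rate": 0.010}
treatment = {"p95_latency_ms": 150.0, "error_rate": 0.0102}
print(guardrail_violations(control, treatment, guardrails))  # ['p95_latency_ms']
```

In practice the comparison would also account for noise in the guardrail metrics, for example by testing them the same way as the primary metric.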
Sample size and test power:
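The required sample size depends heavily on whether the effect is stated in absolute or relative terms. With a 5% baseline conversion rate, a two-sided significance level of α=0.05, and 80% power, detecting a 2 percentage point absolute lift (5% to 7%) takes roughly 2,200 users per group, whereas detecting a 2% relative lift (5% to 5.1%) takes roughly 750,000 users per group. Run a power calculation (for example, with the normal quantiles from scipy.stats.norm, or an online calculator) before launching.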
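The power calculation can be sketched with the standard normal-approximation formula for comparing two proportions, n = (z_{1-α/2} + z_{power})² · [p₁(1-p₁) + p₂(1-p₂)] / (p₂ - p₁)². The function name and defaults below are assumptions for illustration; `norm.ppf` supplies the normal quantiles.

```python
import math

from scipy.stats import norm


def sample_size_per_group(baseline: float, target: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """Users needed per group to detect baseline -> target with a two-sided z-test."""
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value for the significance level
    z_power = norm.ppf(power)          # quantile for the desired power
    variance = baseline * (1 - baseline) + target * (1 - target)
    effect = target - baseline
    return math.ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Absolute lift of 2 percentage points vs. relative lift of 2% on a 5% baseline
print(sample_size_per_group(0.05, 0.07))
print(sample_size_per_group(0.05, 0.05 * 1.02))
```

Running both calls side by side shows how sharply the requirement grows as the detectable effect shrinks, since the sample size scales with the inverse square of the effect.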
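Stopping the test:
- Do not stop the test ahead of schedule because of early results (the peeking problem): repeatedly checking for significance inflates the false positive rate
- Minimum duration: 1-2 weeks, to cover daily and weekly patterns
- If you need to make decisions earlier, use a sequential testing method (for example, one based on e-values) that stays valid under repeated looks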
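The cost of peeking can be illustrated with a small A/A simulation: both groups share the same true conversion rate, so every "significant" result is a false positive. The sketch below compares stopping at the first significant interim look with testing once at the end. All parameters (number of looks, users per look, conversion rate) are arbitrary assumptions, and the test is a plain two-proportion z-test rather than a sequential method.

```python
import numpy as np
from scipy.stats import norm


def simulate_peeking(n_sims=2000, looks=10, users_per_look=500,
                     rate=0.10, alpha=0.05, seed=0):
    """Return (false positive rate with peeking, false positive rate at the final look)."""
    rng = np.random.default_rng(seed)
    # Cumulative conversions per group at each interim look (A/A: identical true rates)
    a = rng.binomial(users_per_look, rate, size=(n_sims, looks)).cumsum(axis=1)
    b = rng.binomial(users_per_look, rate, size=(n_sims, looks)).cumsum(axis=1)
    n = users_per_look * np.arange(1, looks + 1)  # cumulative users per group

    pooled = (a + b) / (2 * n)
    std_err = np.sqrt(pooled * (1 - pooled) * 2 / n)
    z = (b - a) / n / std_err
    significant = np.abs(z) > norm.ppf(1 - alpha / 2)

    peeking_rate = significant.any(axis=1).mean()  # stop at the first significant look
    final_rate = significant[:, -1].mean()         # test only once, at the end
    return peeking_rate, final_rate


peeking_rate, final_rate = simulate_peeking()
print(f"False positives with peeking: {peeking_rate:.1%}")
print(f"False positives at the final look only: {final_rate:.1%}")
```

The single final test should stay close to the nominal α, while stopping at the first significant look produces a clearly higher false positive rate.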
Analysis of results
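from scipy import stats
from scipy.stats import chi2_contingency

# Per-user outcomes, 0/1 for each user (truncated here; load the full lists from your logs)
control_conversions = [0, 1, 0, 1, ...]
treatment_conversions = [0, 1, 1, 0, ...]

# t-test for continuous metrics (for example, revenue per user): pass per-user values
# instead of the 0/1 lists. equal_var=False selects Welch's variant.
t_stat, t_p_value = stats.ttest_ind(control_conversions, treatment_conversions, equal_var=False)

# Chi-squared test for binary metrics
control_success = sum(control_conversions)
control_fail = len(control_conversions) - control_success
treatment_success = sum(treatment_conversions)
treatment_fail = len(treatment_conversions) - treatment_success
contingency = [[control_success, control_fail],
               [treatment_success, treatment_fail]]
chi2, p_value, dof, expected = chi2_contingency(contingency)

control_rate = control_success / len(control_conversions)
treatment_rate = treatment_success / len(treatment_conversions)
print(f"Relative lift: {(treatment_rate - control_rate) / control_rate:.2%}")
print(f"P-value: {p_value:.4f}")
print(f"Statistically significant: {p_value < 0.05}")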
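A p-value says whether an effect is distinguishable from zero, but not how large it plausibly is. A confidence interval for the difference in conversion rates complements it. The sketch below uses the simple Wald (normal-approximation) interval, which is one of several possible choices. The function name and the sample counts are hypothetical.

```python
import math

from scipy.stats import norm


def conversion_diff_interval(control_success: int, control_total: int,
                             treatment_success: int, treatment_total: int,
                             confidence: float = 0.95):
    """Wald confidence interval for (treatment rate - control rate)."""
    control_rate = control_success / control_total
    treatment_rate = treatment_success / treatment_total
    std_err = math.sqrt(control_rate * (1 - control_rate) / control_total
                        + treatment_rate * (1 - treatment_rate) / treatment_total)
    z = norm.ppf(1 - (1 - confidence) / 2)
    diff = treatment_rate - control_rate
    return diff - z * std_err, diff + z * std_err


low, high = conversion_diff_interval(500, 10_000, 600, 10_000)
print(f"95% CI for the absolute lift: [{low:.4f}, {high:.4f}]")
```

If the whole interval lies above zero, the lift is significant at the corresponding level, and the lower bound gives a conservative estimate of the gain to expect after rollout.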
A properly configured A/B test lets you base model deployment decisions on data with a measurable level of confidence rather than on intuition.







