# A/B testing fine-tuned models
A/B testing for LLMs compares two model versions (baseline vs. candidate) on live traffic or a representative sample of requests. Without it, you cannot reliably confirm that a fine-tuned model beats the previous one under production conditions: offline metrics (ROUGE, F1) do not always correlate with real user value.
## A/B test structure for LLMs
Typical comparisons:
- Base model (GPT-4o / Llama base) vs fine-tuned version
- Fine-tuned v1 vs Fine-tuned v2 (dataset iteration)
- Model A (Llama 3.1 8B) vs Model B (Mistral 7B), both fine-tuned
- Prompt v1 vs Prompt v2 on single model
A/B test metrics:
- User preference (explicit: likes/dislikes, implicit: repeat usage)
- Task completion rate (user got needed answer on first try)
- Time-to-value (how many messages to solve task)
- Escalation rate (share of requests passed to human)
- Latency (P50, P95, P99)
## Traffic routing implementation
```python
import hashlib
from typing import Literal

class ABRouter:
    """Deterministic A/B routing by user_id."""

    def __init__(self, experiment_name: str, split: float = 0.5):
        self.experiment_name = experiment_name
        self.split = split  # Share of traffic for variant B

    def assign(self, user_id: str) -> Literal["control", "treatment"]:
        """A given user always lands in the same group."""
        hash_input = f"{self.experiment_name}:{user_id}"
        hash_value = int(hashlib.md5(hash_input.encode()).hexdigest(), 16)
        normalized = (hash_value % 10000) / 10000
        return "treatment" if normalized < self.split else "control"

router = ABRouter("fine-tuned-v2-test", split=0.2)  # 20% of traffic to the new model

# In the request handler (baseline_model, finetuned_model and log_event
# are assumed to be defined elsewhere in the service)
def handle_request(user_id: str, prompt: str) -> str:
    variant = router.assign(user_id)
    if variant == "control":
        response = baseline_model.generate(prompt)
        model_version = "baseline"
    else:
        response = finetuned_model.generate(prompt)
        model_version = "v2-finetuned"
    log_event(user_id, variant, prompt, response, model_version)
    return response
```
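Two properties of this router are worth sanity-checking before launch: assignment is deterministic per user, and the realized split is close to the configured one. A standalone sketch that reproduces the same hashing scheme:

```python
import hashlib

def assign(experiment: str, user_id: str, split: float) -> str:
    # Same MD5-based bucketing as ABRouter.assign above, reproduced
    # standalone so its properties can be checked in isolation.
    h = int(hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest(), 16)
    return "treatment" if (h % 10000) / 10000 < split else "control"

# Determinism: the same user always lands in the same group.
assert assign("exp", "user-1", 0.2) == assign("exp", "user-1", 0.2)

# Split accuracy: over many users, roughly `split` go to treatment.
users = [f"user-{i}" for i in range(20_000)]
share = sum(assign("exp", u, 0.2) == "treatment" for u in users) / len(users)
print(f"treatment share: {share:.3f}")  # close to 0.20
```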
## Statistical significance
Comparing average metrics is not enough: you also need to verify that the observed difference is statistically significant.
```python
from scipy import stats
import numpy as np

def ab_significance_test(
    control_outcomes: list[float],
    treatment_outcomes: list[float],
    alpha: float = 0.05,
) -> dict:
    """
    Two-sided t-test for the significance of the metric difference.

    control_outcomes: group A outcomes (e.g., task_completion = [1, 0, 1, 1, ...])
    treatment_outcomes: group B outcomes, same metric
    """
    t_stat, p_value = stats.ttest_ind(control_outcomes, treatment_outcomes)
    control_mean = np.mean(control_outcomes)
    treatment_mean = np.mean(treatment_outcomes)
    relative_lift = (treatment_mean - control_mean) / control_mean * 100
    return {
        "control_mean": control_mean,
        "treatment_mean": treatment_mean,
        "relative_lift_pct": relative_lift,
        "p_value": p_value,
        "significant": p_value < alpha,
        "sample_sizes": {"control": len(control_outcomes), "treatment": len(treatment_outcomes)},
    }
```
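Since task completion is a binary outcome, a two-proportion z-test is an equally common choice. A minimal sketch with statsmodels; the counts below are illustrative:

```python
from statsmodels.stats.proportion import proportions_ztest

# Two-proportion z-test on binary outcomes such as task_completion
# (1 = resolved, 0 = not). Counts are made-up example numbers.
successes = [532, 612]   # completions in control / treatment
trials = [750, 760]      # total requests per group
z_stat, p_value = proportions_ztest(successes, trials)
print(f"z = {z_stat:.2f}, p = {p_value:.5f}")
```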
```python
import numpy as np

# Calculate the required sample size per group
def required_sample_size(
    baseline_rate: float,          # Current metric value (e.g., 0.75)
    min_detectable_effect: float,  # Smallest improvement worth detecting (e.g., 0.05)
    alpha: float = 0.05,
    power: float = 0.80,
) -> int:
    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()
    effect_size = min_detectable_effect / (baseline_rate * (1 - baseline_rate)) ** 0.5
    n = analysis.solve_power(effect_size=effect_size, alpha=alpha, power=power)
    return int(np.ceil(n))

# Example: baseline task_completion = 75%, detect a 5 p.p. improvement
n = required_sample_size(0.75, 0.05)
print(f"Need {n} requests per group")  # ≈ 1200
```
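One consequence worth internalizing: the required sample size scales with 1/MDE². A quick sketch using the same power calculation shows that halving the detectable effect roughly quadruples the traffic you need per group:

```python
import numpy as np
from statsmodels.stats.power import TTestIndPower

def required_sample_size(baseline_rate: float, mde: float,
                         alpha: float = 0.05, power: float = 0.80) -> int:
    # Same formula as above: standardized effect size, then solve for n.
    effect_size = mde / (baseline_rate * (1 - baseline_rate)) ** 0.5
    n = TTestIndPower().solve_power(effect_size=effect_size, alpha=alpha, power=power)
    return int(np.ceil(n))

# Sample size grows quadratically as the detectable effect shrinks.
for mde in (0.10, 0.05, 0.025):
    print(f"MDE {mde:.3f}: {required_sample_size(0.75, mde)} per group")
```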
## Practical case study: support bot A/B test
Context: a customer support bot built on a fine-tuned Llama 3.1 8B (v1). Version v2 was trained on an improved dataset (+800 examples covering v1's known failure cases).
Experiment:
- Control (80% traffic): v1
- Test (20% traffic): v2
- Duration: 14 days
- Sample size: 6200 dialogs for control group, 1550 for test group
Primary metric: task_completion_rate (user solved issue without escalation). Secondary: CSAT, escalation_rate, avg_turns_to_resolution.
Results:
| Metric | v1 (control) | v2 (treatment) | p-value | Significant? |
|---|---|---|---|---|
| Task completion | 71.3% | 78.9% | 0.0012 | Yes |
| CSAT | 3.8 | 4.1 | 0.034 | Yes |
| Escalation rate | 28.7% | 21.1% | 0.0008 | Yes |
| Avg turns | 3.2 | 2.9 | 0.18 | No |
| Latency P95 | 2.1 s | 2.3 s | — | n/a (+10%) |
v2 is statistically significantly better on three of the four quality metrics; the P95 latency regression (+10%) was judged acceptable. The decision was a full rollout.
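A ship decision like this one can be encoded as an explicit gate, so rollouts don't depend on ad-hoc judgment. A minimal sketch; the function name and thresholds are illustrative:

```python
def rollout_decision(primary_p: float, primary_lift_pct: float,
                     latency_increase_pct: float,
                     alpha: float = 0.05,
                     max_latency_regression_pct: float = 15.0) -> str:
    # Ship only if the primary metric improved significantly AND the
    # latency regression stays within budget. Thresholds are examples.
    if primary_p < alpha and primary_lift_pct > 0:
        if latency_increase_pct <= max_latency_regression_pct:
            return "full rollout"
        return "hold: latency over budget"
    return "keep control"

# Numbers from the case study: p = 0.0012, +10.7% relative lift, +10% P95 latency.
print(rollout_decision(0.0012, 10.7, 10.0))  # → full rollout
```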
## Tools for LLM A/B testing
- LangSmith (LangChain): integrated experiment tracking, comparison view
- Phoenix (Arize): OpenTelemetry-based observability for LLM apps
- MLflow: general-purpose experiment tracking
- Weights & Biases: tables, histograms, LLM eval pipelines
## Timeline
- A/B infrastructure setup: 3–7 days
- Experiment (reaching required n): 1–4 weeks
- Results analysis and decision: 2–3 days
- Total: 2–5 weeks