A/B Testing of Fine-Tuned Models


A/B testing for LLMs is the process of comparing two model versions (baseline vs candidate) on real traffic or representative request samples. Without A/B testing, it's impossible to reliably confirm that a fine-tuned model is better than the previous one in production conditions — lab metrics (ROUGE, F1) don't always correlate with real value.

A/B test structure for LLMs

Typical comparisons:

  • Base model (GPT-4o / Llama base) vs fine-tuned version
  • Fine-tuned v1 vs Fine-tuned v2 (dataset iteration)
  • Model A (Llama 3.1 8B) vs Model B (Mistral 7B), both fine-tuned
  • Prompt v1 vs Prompt v2 on a single model

A/B test metrics:

  • User preference (explicit: likes/dislikes; implicit: repeat usage)
  • Task completion rate (the user got the needed answer on the first try)
  • Time-to-value (how many messages it takes to solve the task)
  • Escalation rate (share of requests handed off to a human)
  • Latency (P50, P95, P99)
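
These metrics can be aggregated per variant from the experiment's event log. A minimal sketch, assuming each logged event carries the variant plus per-dialog outcome fields (the field names here are illustrative, not a fixed schema):

```python
from collections import defaultdict

def summarize(events: list[dict]) -> dict:
    """Aggregate per-variant A/B metrics from logged dialog events."""
    by_variant = defaultdict(lambda: {"dialogs": 0, "completed": 0, "escalated": 0, "latencies": []})
    for e in events:
        m = by_variant[e["variant"]]
        m["dialogs"] += 1
        m["completed"] += e["task_completed"]   # 0/1 per dialog
        m["escalated"] += e["escalated"]        # 0/1 per dialog
        m["latencies"].append(e["latency_ms"])
    out = {}
    for variant, m in by_variant.items():
        lat = sorted(m["latencies"])
        out[variant] = {
            "task_completion_rate": m["completed"] / m["dialogs"],
            "escalation_rate": m["escalated"] / m["dialogs"],
            "latency_p95_ms": lat[int(0.95 * (len(lat) - 1))],  # nearest-rank P95
        }
    return out

# Toy event log (illustrative values)
events = [
    {"variant": "control", "task_completed": 1, "escalated": 0, "latency_ms": 900},
    {"variant": "control", "task_completed": 0, "escalated": 1, "latency_ms": 1400},
    {"variant": "treatment", "task_completed": 1, "escalated": 0, "latency_ms": 1100},
]
print(summarize(events))
```

The per-variant outcome lists produced this way feed directly into the significance test below.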

Traffic routing implementation

import hashlib
from typing import Literal

class ABRouter:
    """Deterministic A/B routing by user_id"""

    def __init__(self, experiment_name: str, split: float = 0.5):
        self.experiment_name = experiment_name
        self.split = split  # Share of traffic for variant B

    def assign(self, user_id: str) -> Literal["control", "treatment"]:
        """Single user always goes to same group"""
        hash_input = f"{self.experiment_name}:{user_id}"
        hash_value = int(hashlib.md5(hash_input.encode()).hexdigest(), 16)
        normalized = (hash_value % 10000) / 10000
        return "treatment" if normalized < self.split else "control"

router = ABRouter("fine-tuned-v2-test", split=0.2)  # 20% traffic to new model

# In request handler
def handle_request(user_id: str, prompt: str) -> str:
    variant = router.assign(user_id)

    if variant == "control":
        response = baseline_model.generate(prompt)
        model_version = "baseline"
    else:
        response = finetuned_model.generate(prompt)
        model_version = "v2-finetuned"

    log_event(user_id, variant, prompt, response, model_version)
    return response
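
A quick sanity check of this routing scheme (illustrative, not part of the production handler): assignments should be stable per user, and the treatment share should land near the configured split. The standalone `assign` helper below inlines the same hashing logic as `ABRouter.assign`:

```python
import hashlib

def assign(experiment: str, user_id: str, split: float) -> str:
    # Same hashing scheme as ABRouter.assign above
    h = int(hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest(), 16)
    return "treatment" if (h % 10000) / 10000 < split else "control"

users = [f"user-{i}" for i in range(10_000)]
first = [assign("fine-tuned-v2-test", u, 0.2) for u in users]
second = [assign("fine-tuned-v2-test", u, 0.2) for u in users]

assert first == second  # stable: the same user always gets the same arm
share = first.count("treatment") / len(first)
print(f"treatment share: {share:.3f}")  # close to the configured 0.20
```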

Statistical significance

Comparing average metrics alone is not enough; the difference between groups must also be statistically significant.

from scipy import stats
import numpy as np

def ab_significance_test(
    control_outcomes: list[float],
    treatment_outcomes: list[float],
    alpha: float = 0.05
) -> dict:
    """
    Two-sided t-test to check metric difference significance
    control_outcomes: group A metrics (e.g., task_completion = [1,0,1,1,...])
    """
    t_stat, p_value = stats.ttest_ind(control_outcomes, treatment_outcomes)

    control_mean = np.mean(control_outcomes)
    treatment_mean = np.mean(treatment_outcomes)
    relative_lift = (treatment_mean - control_mean) / control_mean * 100

    return {
        "control_mean": control_mean,
        "treatment_mean": treatment_mean,
        "relative_lift_pct": relative_lift,
        "p_value": p_value,
        "significant": p_value < alpha,
        "sample_sizes": {"control": len(control_outcomes), "treatment": len(treatment_outcomes)}
    }

# Calculate required sample size
def required_sample_size(
    baseline_rate: float,   # Current metric (e.g., 0.75)
    min_detectable_effect: float,  # Minimum significant improvement (e.g., 0.05)
    alpha: float = 0.05,
    power: float = 0.80
) -> int:
    from statsmodels.stats.power import TTestIndPower
    analysis = TTestIndPower()
    # Approximate standardized effect size for a proportion metric
    effect_size = min_detectable_effect / (baseline_rate * (1 - baseline_rate)) ** 0.5
    n = analysis.solve_power(effect_size=effect_size, alpha=alpha, power=power)
    return int(np.ceil(n))

# Example: baseline task_completion = 75%, want to detect 5% improvement
n = required_sample_size(0.75, 0.05)
print(f"Need {n} requests per group")  # ~500–1000

Practical case study: support bot A/B test

Context: a customer support bot built on Llama 3.1 8B, fine-tuned (v1). Version v2 was prepared with an improved dataset (+800 examples covering v1 failure cases).

Experiment:

  • Control (80% traffic): v1
  • Test (20% traffic): v2
  • Duration: 14 days
  • Sample size: 6200 dialogs in the control group, 1550 in the test group

Primary metric: task_completion_rate (user solved issue without escalation). Secondary: CSAT, escalation_rate, avg_turns_to_resolution.

Results:

Metric            v1 (control)   v2 (treatment)   p-value   Significant?
Task completion   71.3%          78.9%            0.0012    Yes
CSAT              3.8            4.1              0.034     Yes
Escalation rate   28.7%          21.1%            0.0008    Yes
Avg turns         3.2            2.9              0.18      No
Latency P95       2.1 s          2.3 s (+10%)     n/a       n/a

v2 is statistically significantly better on three of the four quality metrics. The P95 latency increase (+10%) is acceptable. The decision was made to roll v2 out fully.
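
A convenient property of the threshold-on-hash routing scheme shown earlier: raising the split only moves the cutoff up, so users already in treatment stay there during a gradual rollout. A sketch with an illustrative ramp schedule (the `assign` helper repeats `ABRouter.assign`'s logic):

```python
import hashlib

def assign(experiment: str, user_id: str, split: float) -> str:
    # Same hashing scheme as ABRouter.assign
    h = int(hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest(), 16)
    return "treatment" if (h % 10000) / 10000 < split else "control"

users = [f"user-{i}" for i in range(5000)]
ramp = (0.2, 0.5, 1.0)  # e.g., 20% -> 50% -> full rollout (illustrative)

previous: set[str] = set()
for split in ramp:
    in_treatment = {u for u in users if assign("fine-tuned-v2-test", u, split) == "treatment"}
    assert previous <= in_treatment  # no user ever flips back to control
    previous = in_treatment
    print(f"split={split}: {len(in_treatment) / len(users):.2f} in treatment")
```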

Tools for LLM A/B testing

  • LangSmith (LangChain): integrated experiment tracking and comparison views
  • Phoenix (Arize): OpenTelemetry-based observability for LLM applications
  • MLflow: general-purpose experiment tracking
  • Weights & Biases: tables, histograms, and LLM evaluation pipelines

Timeline

  • A/B infrastructure setup: 3–7 days
  • Experiment (reaching required n): 1–4 weeks
  • Results analysis and decision: 2–3 days
  • Total: 2–5 weeks