Fine-tuning LLMs with DPO (Direct Preference Optimization)

DPO is an alignment method that trains a model to generate preferred responses without training an explicit reward model or running RLHF's reinforcement-learning loop. Proposed by Rafailov et al. (Stanford, 2023), DPO recasts the RL objective as supervised learning on preference datasets (chosen/rejected pairs), significantly simplifying the alignment pipeline.

DPO vs RLHF: fundamental difference

RLHF (classical):

  1. Reward Model training on preference pairs
  2. LLM training via PPO using Reward Model
  3. KL-divergence from reference policy as regularizer

Drawbacks: PPO instability, need to keep 4 models in memory (actor, critic, reward, reference), complex tuning.

DPO:

  1. Direct optimization on pairs (chosen, rejected) without Reward Model
  2. Implicit reward defined by the log-ratio of probabilities under the trained and reference models
  3. Stable training like regular SFT

Mathematically DPO minimizes:

L_DPO = -E[log σ(β * (log(π_θ(y_w|x) / π_ref(y_w|x)) - log(π_θ(y_l|x) / π_ref(y_l|x))))]

where y_w is the preferred (chosen) response, y_l is the rejected one, and β is a temperature controlling the strength of the implicit KL regularization toward the reference policy.
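The loss above can be sketched in a few lines of PyTorch. This is a minimal illustration, not TRL's actual implementation; per-sequence log-probabilities are assumed to be precomputed:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Sigmoid DPO loss over per-sequence log-probabilities."""
    # log π_θ(y_w|x)/π_ref(y_w|x) and log π_θ(y_l|x)/π_ref(y_l|x)
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # -log σ(β * (difference of log-ratios)), averaged over the batch
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```

Note how the reference model enters only through the log-ratios: it anchors the policy so that increasing the margin between chosen and rejected does not drift arbitrarily far from the SFT distribution.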

DPO dataset format

# Example preference dataset record
{
    "prompt": "Explain the difference between TCP and UDP",
    "chosen": "TCP (Transmission Control Protocol) ensures reliable data delivery with acknowledgment, flow control, and error checking. UDP (User Datagram Protocol) establishes no connection, provides no delivery guarantees, but offers minimal latency. TCP is used for HTTP, FTP, SMTP; UDP for DNS, video streaming, real-time games.",
    "rejected": "TCP is reliable, UDP is fast. TCP is slower because it checks each packet. Both are internet protocols."
}
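Before training, it is worth validating records like the one above programmatically. A small sketch (the field names follow the format shown; the helper itself is our own, not part of any library):

```python
REQUIRED_KEYS = {"prompt", "chosen", "rejected"}

def validate_record(record: dict) -> list[str]:
    """Return a list of problems with a DPO preference record (empty list = valid)."""
    problems = []
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    for key in REQUIRED_KEYS & record.keys():
        # every field must be a non-empty string
        if not isinstance(record[key], str) or not record[key].strip():
            problems.append(f"empty or non-string field: {key}")
    if "chosen" in record and record.get("chosen") == record.get("rejected"):
        problems.append("chosen and rejected are identical")
    return problems
```

Identical chosen/rejected texts are worth rejecting outright: such a pair contributes zero preference signal and only adds noise to the gradient.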

DPO implementation via TRL

from trl import DPOTrainer, DPOConfig
from peft import LoraConfig

# The reference model is a frozen copy of the SFT-trained model.
# With ref_model=None below, TRL creates it automatically (with PEFT,
# it instead disables the adapters to recover the reference policy)

dpo_config = DPOConfig(
    output_dir="./dpo-model",
    num_train_epochs=1,              # DPO typically 1-3 epochs
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-7,              # Significantly lower than SFT
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    beta=0.1,                        # KL temperature
    loss_type="sigmoid",             # "sigmoid", "hinge", "ipo", "kto_pair"
    max_length=2048,
    max_prompt_length=512,
    bf16=True,
    logging_steps=10,
)

trainer = DPOTrainer(
    model=model,             # SFT fine-tuned model
    ref_model=None,          # None = reference copy created automatically
    args=dpo_config,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    processing_class=tokenizer,  # `tokenizer=` in older TRL versions
    peft_config=LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"]),
)

trainer.train()

DPO loss_type variants

  • sigmoid: original DPO loss
  • hinge: SLiC-HF, less sensitive to outliers
  • ipo: IPO (Identity Preference Optimization), more stable version
  • kto_pair: paired variant of KTO (Kahneman-Tversky Optimization); the full KTO method also handles unpaired data via a dedicated trainer
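The variants differ mainly in how they penalize the margin between the chosen and rejected log-ratios. An illustrative sketch (simplified relative to TRL's actual implementation, which applies these per-example with additional bookkeeping):

```python
import torch
import torch.nn.functional as F

def preference_loss(margin: torch.Tensor, beta: float = 0.1,
                    loss_type: str = "sigmoid") -> torch.Tensor:
    """margin = log π_θ(y_w)/π_ref(y_w) - log π_θ(y_l)/π_ref(y_l), per example."""
    if loss_type == "sigmoid":   # original DPO: smooth logistic penalty
        return -F.logsigmoid(beta * margin).mean()
    if loss_type == "hinge":     # SLiC-HF style: linear penalty, flat past the margin
        return torch.relu(1 - beta * margin).mean()
    if loss_type == "ipo":       # IPO: regress the margin toward 1/(2*beta)
        return ((margin - 1 / (2 * beta)) ** 2).mean()
    raise ValueError(f"unknown loss_type: {loss_type}")
```

The hinge loss stops pushing once the margin is large enough, which is why it is less sensitive to outlier pairs; IPO's quadratic target bounds the implicit reward and tends to resist the overfitting that sigmoid DPO can show on small datasets.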

Creating preference datasets: practical methods

Method 1: Human annotation. Highest quality but expensive. Annotators compare two responses and select the better one; at least 2-3 annotators per pair for reliability.

Method 2: AI-generation + human verification. GPT-4o generates chosen (high quality) and rejected (intentionally degraded). Humans verify 20-30% of the dataset.

Method 3: Production data. User interaction logs: likes/dislikes, ratings, operator corrections.

from openai import OpenAI

def generate_preference_pair(prompt: str, client: OpenAI) -> dict:
    """Generates chosen/rejected pair for DPO dataset"""

    # Good response
    chosen_response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Provide a detailed, accurate, well-structured response."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.3
    ).choices[0].message.content

    # Poor response — intentionally degrade quality
    rejected_response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Provide a brief, superficial response without details."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.9
    ).choices[0].message.content

    return {"prompt": prompt, "chosen": chosen_response, "rejected": rejected_response}
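A generator like the one above can be wrapped in a small driver that filters degenerate pairs and writes JSONL. This is a sketch: `build_dataset` and its filtering rule are our own, and the pair function is injected so any generator (or cached responses) can be used:

```python
import json
from typing import Callable, Iterable

def build_dataset(prompts: Iterable[str],
                  pair_fn: Callable[[str], dict],
                  out_path: str = "dpo_pairs.jsonl") -> int:
    """Run pair_fn over prompts, drop degenerate pairs, write JSONL; return count kept."""
    kept = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for prompt in prompts:
            pair = pair_fn(prompt)
            # identical chosen/rejected carries no preference signal for DPO
            if pair["chosen"].strip() == pair["rejected"].strip():
                continue
            f.write(json.dumps(pair, ensure_ascii=False) + "\n")
            kept += 1
    return kept
```

JSONL (one record per line) is convenient here because Hugging Face `datasets` can load it directly for the DPOTrainer.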

Practical case study: improving customer service quality

Task: a customer-support language model answered correctly but in a rigid, impersonal tone. SFT fine-tuning on fresh data partially solved the problem but required collecting new data each time.

Solution: DPO on preference pairs. Chosen — operator responses with high CSAT. Rejected — responses with low CSAT. Volume: 2100 pairs.

Base model for DPO: SFT fine-tuned Mistral 7B.

Results:

  • Bot CSAT: 3.4 → 4.2 (out of 5)
  • Empathy score (LLM-as-judge): 2.8 → 4.1
  • Factual accuracy: unchanged (0.91 → 0.91)
  • Refusal rate: 12% → 4% (model became less overly cautious)
  • β=0.1 proved optimal: at β=0.5 accuracy dropped, at β=0.01 instability occurred
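The empathy score above came from an LLM-as-judge setup. A minimal sketch of the scoring side (the judge call is injected as a plain function so any model API can back it; the prompt wording is illustrative, not the one used in the case study):

```python
from typing import Callable

def judge_empathy(response: str, judge_fn: Callable[[str], str]) -> float:
    """Score a support reply's empathy on a 1-5 scale via an injected judge-model call."""
    prompt = (
        "Rate the empathy of the following customer-support reply "
        "on a scale of 1 to 5. Answer with a single number only.\n\n"
        f"Reply:\n{response}"
    )
    raw = judge_fn(prompt).strip()
    return float(raw.split()[0])  # tolerate trailing text after the number
```

In practice such scores are averaged over a held-out prompt set before and after DPO, and spot-checked against human ratings to make sure the judge itself is not drifting.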

Typical pipeline: SFT → DPO

DPO is applied on top of SFT, not instead of it:

  1. SFT (Supervised Fine-Tuning): teach the model the output format and relevant domain responses
  2. DPO: align response quality with user preferences

Skipping SFT and applying DPO directly to a base model is technically possible but noticeably less stable.

Timeline

  • Preference dataset collection and annotation: 3-6 weeks
  • SFT (if not conducted): 2-3 weeks
  • DPO training and iterations: 1-2 weeks
  • Quality evaluation (LLM-as-judge + human): 1 week
  • Total: 7-12 weeks