Fine-tuning LLMs with DPO (Direct Preference Optimization)
DPO is an alignment method that trains a model to generate preferred responses without explicit reward model training and RLHF cycles. Proposed by Rafailov et al. (Stanford, 2023), DPO transforms the RL task into supervised learning on preference datasets (chosen/rejected pairs), significantly simplifying the alignment pipeline.
DPO vs RLHF: fundamental difference
RLHF (classical):
- Reward Model training on preference pairs
- LLM training via PPO using Reward Model
- KL-divergence from reference policy as regularizer
Drawbacks: PPO instability, need to keep 4 models in memory (actor, critic, reward, reference), complex tuning.
DPO:
- Direct optimization on pairs (chosen, rejected) without Reward Model
- Implicit reward determined through log-ratio of probabilities from trained/reference models
- Stable training like regular SFT
Mathematically DPO minimizes:
L_DPO = -E[ log σ( β * ( log(π_θ(y_w|x) / π_ref(y_w|x)) - log(π_θ(y_l|x) / π_ref(y_l|x)) ) ) ]
where y_w is the preferred response, y_l is rejected, β is the KL regularization temperature.
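The loss above can be sketched in plain Python for a single preference pair. The log-probabilities are assumed to be already summed over the response tokens; the function name and signature are illustrative, not part of any library:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-example DPO loss from summed token log-probs (hypothetical helper)."""
    # Implicit rewards: log-ratios between the trained policy and the reference
    r_w = logp_chosen - ref_logp_chosen      # log π_θ(y_w|x) / π_ref(y_w|x)
    r_l = logp_rejected - ref_logp_rejected  # log π_θ(y_l|x) / π_ref(y_l|x)
    margin = beta * (r_w - r_l)
    # -log σ(margin): minimized by widening the margin between chosen and rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the trained policy matches the reference, both log-ratios are zero, the margin is zero, and the loss equals log 2 ≈ 0.693; training pushes it below that by raising the chosen response's probability relative to the rejected one.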
DPO dataset format
# Example preference dataset record
{
  "prompt": "Explain the difference between TCP and UDP",
  "chosen": "TCP (Transmission Control Protocol) ensures reliable data delivery with acknowledgment, flow control, and error checking. UDP (User Datagram Protocol) establishes no connection, provides no delivery guarantees, but offers minimal latency. TCP is used for HTTP, FTP, SMTP; UDP for DNS, video streaming, real-time games.",
  "rejected": "TCP is reliable, UDP is fast. TCP is slower because it checks each packet. Both are internet protocols."
}
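Before training, it is worth validating records against this schema; empty fields or identical chosen/rejected texts silently degrade DPO. A minimal sketch (the validate_record helper and its rules are assumptions, not part of any library):

```python
REQUIRED_KEYS = {"prompt", "chosen", "rejected"}

def validate_record(record: dict) -> list[str]:
    """Return a list of problems with a preference record (empty list = valid)."""
    # Every required field must be present and non-blank
    problems = [k for k in REQUIRED_KEYS if not record.get(k, "").strip()]
    # A pair with identical texts carries no preference signal
    if not problems and record["chosen"].strip() == record["rejected"].strip():
        problems.append("chosen == rejected")
    return problems
```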
DPO implementation via TRL
from trl import DPOTrainer, DPOConfig
from peft import LoraConfig

# Reference model: a frozen copy of the SFT-trained model.
# With ref_model=None, TRL creates it automatically (and with a PEFT
# adapter, it compares against the model with the adapter disabled).

dpo_config = DPOConfig(
    output_dir="./dpo-model",
    num_train_epochs=1,            # DPO typically needs 1-3 epochs
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-7,            # significantly lower than for SFT
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    beta=0.1,                      # KL temperature
    loss_type="sigmoid",           # "sigmoid", "hinge", "ipo", "kto_pair"
    max_length=2048,
    max_prompt_length=512,
    bf16=True,
    logging_steps=10,
)

trainer = DPOTrainer(
    model=model,                   # SFT fine-tuned model
    ref_model=None,                # None = created automatically from model
    args=dpo_config,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32,
                           target_modules=["q_proj", "v_proj"]),
)

trainer.train()
DPO loss_type variants
- sigmoid: original DPO loss
- hinge: SLiC-HF, less sensitive to outliers
- ipo: IPO (Identity Preference Optimization), more stable version
- kto_pair: paired variant of KTO (Kahneman-Tversky Optimization); the full KTO method can also learn from unpaired data via TRL's separate KTOTrainer
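The paired variants differ mainly in how they penalize the log-ratio gap h = log(π_θ(y_w|x)/π_ref(y_w|x)) - log(π_θ(y_l|x)/π_ref(y_l|x)). A rough sketch of the three loss shapes; the formulas follow the commonly used definitions, but treat the exact scaling as an assumption:

```python
import math

def sigmoid_loss(h, beta=0.1):
    """Original DPO: -log σ(β·h); decays smoothly, never exactly zero."""
    return math.log1p(math.exp(-beta * h))

def hinge_loss(h, beta=0.1):
    """SLiC-HF style hinge: flat zero once β·h exceeds 1, hence less
    pull from already-confident (or outlier) pairs."""
    return max(0.0, 1.0 - beta * h)

def ipo_loss(h, beta=0.1):
    """IPO: quadratic pull of h toward the target gap 1/(2β); unlike
    sigmoid/hinge it also penalizes an excessively large gap."""
    return (h - 1.0 / (2.0 * beta)) ** 2
```

The hinge saturating at exactly zero is what makes it less sensitive to outliers, while IPO's symmetric quadratic keeps the policy from over-separating pairs, which is why it is often described as the more stable choice.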
Creating preference datasets: practical methods
Method 1: Human annotation. Highest quality but expensive. Annotators view two responses and select the better one. Minimum 2-3 annotators per pair for reliability.
Method 2: AI-generation + human verification. GPT-4o generates chosen (high quality) and rejected (intentionally degraded). Humans verify 20-30% of the dataset.
Method 3: Production data. User interaction logs: likes/dislikes, ratings, operator corrections.
from openai import OpenAI

def generate_preference_pair(prompt: str, client: OpenAI) -> dict:
    """Generates a chosen/rejected pair for a DPO dataset."""
    # Good response
    chosen_response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Provide a detailed, accurate, well-structured response."},
            {"role": "user", "content": prompt},
        ],
        temperature=0.3,
    ).choices[0].message.content
    # Poor response: intentionally degraded quality
    rejected_response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Provide a brief, superficial response without details."},
            {"role": "user", "content": prompt},
        ],
        temperature=0.9,
    ).choices[0].message.content
    return {"prompt": prompt, "chosen": chosen_response, "rejected": rejected_response}
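For Method 3, production ratings can be turned into preference pairs by grouping rated responses per prompt and pairing the best against the worst. A sketch assuming a hypothetical log schema with prompt/response/rating fields:

```python
def pairs_from_ratings(logs: list[dict]) -> list[dict]:
    """Build DPO pairs from rating logs (hypothetical schema:
    each entry has "prompt", "response", "rating")."""
    by_prompt: dict[str, list[dict]] = {}
    for entry in logs:
        by_prompt.setdefault(entry["prompt"], []).append(entry)
    pairs = []
    for prompt, entries in by_prompt.items():
        entries.sort(key=lambda e: e["rating"], reverse=True)
        best, worst = entries[0], entries[-1]
        # Skip prompts without a clear preference (single or tied responses)
        if best["rating"] > worst["rating"]:
            pairs.append({"prompt": prompt,
                          "chosen": best["response"],
                          "rejected": worst["response"]})
    return pairs
```

In practice one would also filter out near-ties and very short responses before training, since weak preference signals dilute the DPO margin.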
Practical case study: improving customer service quality
Task: a customer-support language model answered correctly but in a rigid, impersonal tone. SFT fine-tuning on new data partially solved the problem, but every adjustment required collecting data all over again.
Solution: DPO on preference pairs. Chosen — operator responses with high CSAT. Rejected — responses with low CSAT. Volume: 2100 pairs.
Base model for DPO: SFT fine-tuned Mistral 7B.
Results:
- Bot CSAT: 3.4 → 4.2 (out of 5)
- Empathy score (LLM-as-judge): 2.8 → 4.1
- Factual accuracy: unchanged (0.91 → 0.91)
- Refusal rate: 12% → 4% (model became less overly cautious)
- β=0.1 proved optimal: at β=0.5 accuracy dropped, at β=0.01 instability occurred
Typical pipeline: SFT → DPO
DPO is applied on top of SFT, not instead of it:
- SFT (Supervised Fine-Tuning): train model to format and deliver relevant domain responses
- DPO: align response quality to user preferences
Skipping SFT and running DPO directly on a base model is technically possible but less stable.
Timeline
- Preference dataset collection and annotation: 3-6 weeks
- SFT (if not conducted): 2-3 weeks
- DPO training and iterations: 1-2 weeks
- Quality evaluation (LLM-as-judge + human): 1 week
- Total: 7-12 weeks







