Fine-tuning LLMs with ORPO
ORPO (Odds Ratio Preference Optimization) is a preference-based fine-tuning method proposed by Hong et al. (2024). Key difference from DPO: ORPO combines SFT and preference optimization in one step, does not require separate reference model, and uses odds ratio instead of log-probabilities for penalizing undesirable responses.
ORPO vs DPO: technical differences
DPO:
- Requires SFT-trained reference model
- Keeps two models in memory (trained + reference) or uses PEFT tricks
- Optimizes: log-ratio of probabilities
- Hyperparameter β determines KL penalty strength
ORPO:
- Reference model not needed
- One model in memory
- Optimizes SFT loss + OR-weighted rejection loss simultaneously
- Hyperparameter λ (lambda) — odds ratio loss weight
L_ORPO = L_SFT + λ * L_OR
L_SFT = -log P(y_w | x) # standard SFT loss on chosen responses
L_OR = -log(sigmoid(log(odds_ratio(y_w, x) / odds_ratio(y_l, x))))
where odds_ratio(y, x) = P(y|x) / (1 - P(y|x))
ORPO implementation via TRL
from trl import ORPOTrainer, ORPOConfig
from peft import LoraConfig
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Meta-Llama-3.1-8B-Instruct",
torch_dtype=torch.bfloat16,
device_map="auto"
)
orpo_config = ORPOConfig(
output_dir="./orpo-model",
num_train_epochs=3,
per_device_train_batch_size=2,
gradient_accumulation_steps=8,
learning_rate=8e-6, # ORPO typically requires lower lr than SFT
lr_scheduler_type="linear",
warmup_ratio=0.1,
beta=0.1, # λ in ORPO — OR loss weight (called beta in TRL)
max_length=2048,
max_prompt_length=512,
bf16=True,
remove_unused_columns=False,
logging_steps=10,
)
trainer = ORPOTrainer(
model=model,
args=orpo_config,
train_dataset=train_dataset, # Format: prompt, chosen, rejected
eval_dataset=eval_dataset,
peft_config=LoraConfig(
r=16,
lora_alpha=32,
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
task_type="CAUSAL_LM",
),
)
trainer.train()
ORPO dataset format
Identical to DPO — preference pairs:
dataset = {
"prompt": "How to write technical specifications correctly?",
"chosen": "Technical specification includes several mandatory sections: project goal, functional requirements (with MoSCoW priorities), non-functional requirements (performance, security), constraints, acceptance criteria...",
"rejected": "Write what you want so developers understand the task"
}
Efficiency comparison: ORPO vs DPO vs SimPO in practice
Independent benchmarks on AlpacaEval 2.0 (Win Rate vs GPT-4 Turbo):
| Method | Win Rate | Memory (7B) | Training time |
|---|---|---|---|
| SFT only | ~5% | 1× | 1× |
| DPO | ~15–20% | 2× (ref model) | 1.3× |
| ORPO | ~18–22% | 1× | 1× |
| SimPO | ~20–25% | 1× | 1× |
ORPO outperforms DPO in memory efficiency with comparable quality. SimPO (Simple Preference Optimization) is a more recent method, often showing slightly better results.
Practical case study: aligning code to team standards
Task: fine-tune model for code review under specific company code standards — naming rules, mandatory security patterns, prohibited practices.
Problem with pure SFT: model reproduces "correct" reviews well, but doesn't penalize ignoring violations. Need penalty component.
ORPO dataset: 1800 pairs. Chosen — review identifying all standard violations. Rejected — review missing critical violations or generating false comments.
Base model: Qwen2.5-Coder-7B-Instruct.
Configuration: ORPO, β=0.1, lr=5e-6, 2 epochs.
Results:
- Standard violation recall: 0.67 → 0.91
- Comment precision (no false positives): 0.71 → 0.88
- False negative rate (missing critical violations): 28% → 7%
- Training time: 3.5h on 1×A100 40GB (no reference model overhead)
ORPO vs DPO: when to choose
Choose ORPO:
- Limited GPU resources (one model instead of two)
- No good SFT-trained reference model available
- Medium-complexity alignment task
Choose DPO:
- Already have high-quality SFT reference model
- Precise KL-divergence tuning required
- Experience with DPO pipeline
Choose SimPO:
- Maximum benchmark win rate needed
- Resources available for γ and β parameter tuning
Timeline
- Preference dataset collection: 3–5 weeks
- ORPO training (7B, LoRA, A100): 3–8 hours
- λ/β iterations: 3–5 days
- Evaluation (LLM-as-judge + human): 1 week
- Total: 5–8 weeks







