LLM ORPO Fine-Tuning

We design and deploy artificial intelligence systems, from prototypes to production-ready solutions. Our team combines expertise in machine learning, data engineering, and MLOps to make AI work not just in the lab, but in real business.

Fine-tuning LLMs with ORPO

ORPO (Odds Ratio Preference Optimization) is a preference-based fine-tuning method proposed by Hong et al. (2024). Its key difference from DPO: ORPO combines SFT and preference optimization in a single step, does not require a separate reference model, and uses the odds ratio rather than a log-probability ratio to penalize undesirable responses.

ORPO vs DPO: technical differences

DPO:

  • Requires SFT-trained reference model
  • Keeps two models in memory (trained + reference) or uses PEFT tricks
  • Optimizes: log-ratio of probabilities
  • Hyperparameter β determines KL penalty strength

ORPO:

  • Reference model not needed
  • One model in memory
  • Optimizes SFT loss + OR-weighted rejection loss simultaneously
  • Hyperparameter λ (lambda) — odds ratio loss weight

L_ORPO = L_SFT + λ * L_OR

L_SFT = -log P(y_w | x)   # standard SFT loss on the chosen response

L_OR = -log(sigmoid(log(odds(y_w|x) / odds(y_l|x))))
where odds(y|x) = P(y|x) / (1 - P(y|x))
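The loss above can be sketched numerically. This is a minimal illustration with scalar probabilities; in the actual method, P(y|x) is a sequence-level probability aggregated from per-token log-probabilities (length-averaged in the ORPO paper):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def odds(p: float) -> float:
    # odds(y|x) = P(y|x) / (1 - P(y|x))
    return p / (1.0 - p)

def orpo_loss(p_chosen: float, p_rejected: float, lam: float = 0.1) -> float:
    # L_SFT: negative log-likelihood of the chosen response
    l_sft = -math.log(p_chosen)
    # L_OR: penalizes the rejected response via the log odds ratio
    log_odds_ratio = math.log(odds(p_chosen) / odds(p_rejected))
    l_or = -math.log(sigmoid(log_odds_ratio))
    return l_sft + lam * l_or
```

When the model already prefers the chosen response (p_chosen > p_rejected), the odds-ratio term is small and the loss is dominated by L_SFT; when the preference is inverted, L_OR grows and pushes probability mass away from the rejected response.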

ORPO implementation via TRL

import torch
from trl import ORPOTrainer, ORPOConfig
from peft import LoraConfig
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

orpo_config = ORPOConfig(
    output_dir="./orpo-model",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=8e-6,           # ORPO typically requires lower lr than SFT
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    beta=0.1,                     # λ in ORPO — OR loss weight (called beta in TRL)
    max_length=2048,
    max_prompt_length=512,
    bf16=True,
    remove_unused_columns=False,
    logging_steps=10,
)

trainer = ORPOTrainer(
    model=model,
    args=orpo_config,
    train_dataset=train_dataset,  # Format: prompt, chosen, rejected
    eval_dataset=eval_dataset,
    peft_config=LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
        task_type="CAUSAL_LM",
    ),
)

trainer.train()

ORPO dataset format

Identical to DPO — preference pairs:

dataset = {
    "prompt": "How to write technical specifications correctly?",
    "chosen": "Technical specification includes several mandatory sections: project goal, functional requirements (with MoSCoW priorities), non-functional requirements (performance, security), constraints, acceptance criteria...",
    "rejected": "Write what you want so developers understand the task"
}
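Before training, it is worth filtering degenerate pairs (empty fields, or chosen identical to rejected, which carries no preference signal). A minimal sketch with a hypothetical `validate_pair` helper; field names match the TRL preference format:

```python
# Hypothetical pre-training check on raw preference pairs.
REQUIRED_FIELDS = ("prompt", "chosen", "rejected")

def validate_pair(record: dict) -> bool:
    # Reject records with missing or blank fields
    if any(not record.get(field, "").strip() for field in REQUIRED_FIELDS):
        return False
    # Identical chosen/rejected responses provide no preference signal
    return record["chosen"] != record["rejected"]

pairs = [
    {"prompt": "How to write technical specifications correctly?",
     "chosen": "A technical specification includes several mandatory sections...",
     "rejected": "Write what you want so developers understand the task"},
    {"prompt": "q", "chosen": "same text", "rejected": "same text"},  # filtered out
]
clean = [p for p in pairs if validate_pair(p)]
```

The filtered list can then be wrapped in a `datasets.Dataset` and passed to `ORPOTrainer` as `train_dataset`.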

Efficiency comparison: ORPO vs DPO vs SimPO in practice

Independent benchmarks on AlpacaEval 2.0 (Win Rate vs GPT-4 Turbo):

Method     Win Rate    Memory (7B)       Training time
SFT only   ~5%         1× (baseline)     1× (baseline)
DPO        ~15–20%     2× (ref model)    1.3×
ORPO       ~18–22%     1× (no ref)       —
SimPO      ~20–25%     1× (no ref)       —
ORPO outperforms DPO in memory efficiency with comparable quality. SimPO (Simple Preference Optimization) is a more recent method, often showing slightly better results.

Practical case study: aligning code to team standards

Task: fine-tune model for code review under specific company code standards — naming rules, mandatory security patterns, prohibited practices.

Problem with pure SFT: the model reproduces "correct" reviews well, but nothing penalizes it for missing violations. A penalty component is needed.

ORPO dataset: 1,800 pairs. Chosen: a review identifying all standard violations. Rejected: a review missing critical violations or raising false comments.

Base model: Qwen2.5-Coder-7B-Instruct.

Configuration: ORPO, β=0.1, lr=5e-6, 2 epochs.

Results:

  • Standard violation recall: 0.67 → 0.91
  • Comment precision (no false positives): 0.71 → 0.88
  • False negative rate (missing critical violations): 28% → 7%
  • Training time: 3.5h on 1×A100 40GB (no reference model overhead)
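The reported metrics follow the standard definitions. A hypothetical evaluation helper, assuming each review is scored against a ground-truth set of violations:

```python
# Hypothetical helper: recall, precision, and false negative rate
# for violation-finding reviews, given sets of violation identifiers.
def review_metrics(found: set, ground_truth: set) -> dict:
    tp = len(found & ground_truth)   # correctly flagged violations
    fp = len(found - ground_truth)   # false comments
    fn = len(ground_truth - found)   # missed violations
    return {
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "false_negative_rate": fn / (tp + fn) if (tp + fn) else 0.0,
    }
```

Note that recall and the false negative rate sum to 1 over the same denominator, which is why the recall gain (0.67 → 0.91) mirrors the FNR drop.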

ORPO vs DPO: when to choose

Choose ORPO:

  • Limited GPU resources (one model instead of two)
  • No good SFT-trained reference model available
  • Medium-complexity alignment task

Choose DPO:

  • Already have high-quality SFT reference model
  • Precise KL-divergence tuning required
  • Experience with DPO pipeline

Choose SimPO:

  • Maximum benchmark win rate needed
  • Resources available for γ and β parameter tuning
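For reference, SimPO (Meng et al., 2024) replaces the odds ratio with length-normalized log-probabilities and a target margin γ. A hedged scalar sketch of its loss:

```python
import math

def simpo_loss(logp_chosen: float, len_chosen: int,
               logp_rejected: float, len_rejected: int,
               beta: float = 2.0, gamma: float = 0.5) -> float:
    # Length-normalized implicit rewards: beta * avg log-prob per token
    r_chosen = beta * logp_chosen / len_chosen
    r_rejected = beta * logp_rejected / len_rejected
    # gamma is the target margin the chosen reward must exceed
    margin = r_chosen - r_rejected - gamma
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Length normalization removes the bias toward longer responses; β scales the reward and γ sets the required preference gap, which is why both need tuning.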

Timeline

  • Preference dataset collection: 3–5 weeks
  • ORPO training (7B, LoRA, A100): 3–8 hours
  • λ/β iterations: 3–5 days
  • Evaluation (LLM-as-judge + human): 1 week
  • Total: 5–8 weeks