What is ORPO and how does it differ from DPO?

ORPO (Odds Ratio Preference Optimization) is a fine-tuning method that combines supervised fine-tuning and preference optimization in one step. Unlike DPO, it does not require a separate reference model, saving memory and simplifying the pipeline. It uses odds ratios instead of log probabilities to penalize undesired responses.

How much data is needed for ORPO fine-tuning?

For quality alignment, 1,000–3,000 chosen/rejected pairs are typically sufficient. More variety improves results. In a real project, 1,800 pairs increased recall of violations from 0.67 to 0.91. Larger datasets (10k+) can improve generalization but require more GPU hours.

What learning rate should be used for ORPO?

ORPO usually requires a lower learning rate than SFT. For 7B models, start at 5e-6 to 8e-6. Too high a rate leads to overfitting on the preference loss. Use a linear scheduler with warmup of 0.1 and gradual reduction. For best results, consider pairing with LoRA.

Can ORPO be used with LoRA?

Yes, ORPO works well with LoRA. In TRL, pass the peft_config to ORPOTrainer . LoRA cuts memory consumption by 2–3 times while maintaining alignment quality. Recommended target modules: q_proj , v_proj , k_proj , o_proj with rank 16.

Which benchmarks show ORPO's advantage?

On AlpacaEval 2.0, ORPO achieves a win rate of 18–22% vs GPT-4 Turbo, about 20% higher than DPO (15–20%) while using half the memory. SimPO yields 20–25% but needs tuning two hyperparameters. In a real code review task, ORPO reduced the false negative rate from 28% to 7%. Overall, ORPO is 2 times more memory efficient than DPO and trains 1.3 times faster.

What is ORPO and how does it differ from DPO?

ORPO (Odds Ratio Preference Optimization) is a fine-tuning method that combines supervised fine-tuning and preference optimization in one step. Unlike DPO, it does not require a separate reference model, saving memory and simplifying the pipeline. It uses odds ratios instead of log probabilities to penalize undesired responses.

How much data is needed for ORPO fine-tuning?

For quality alignment, 1,000–3,000 chosen/rejected pairs are typically sufficient. More variety improves results. In a real project, 1,800 pairs increased recall of violations from 0.67 to 0.91. Larger datasets (10k+) can improve generalization but require more GPU hours.

What learning rate should be used for ORPO?

ORPO usually requires a lower learning rate than SFT. For 7B models, start at 5e-6 to 8e-6. Too high a rate leads to overfitting on the preference loss. Use a linear scheduler with warmup of 0.1 and gradual reduction. For best results, consider pairing with LoRA.

Can ORPO be used with LoRA?

Yes, ORPO works well with LoRA. In TRL, pass the peft_config to ORPOTrainer . LoRA cuts memory consumption by 2–3 times while maintaining alignment quality. Recommended target modules: q_proj , v_proj , k_proj , o_proj with rank 16.

Which benchmarks show ORPO's advantage?

On AlpacaEval 2.0, ORPO achieves a win rate of 18–22% vs GPT-4 Turbo, about 20% higher than DPO (15–20%) while using half the memory. SimPO yields 20–25% but needs tuning two hyperparameters. In a real code review task, ORPO reduced the false negative rate from 28% to 7%. Overall, ORPO is 2 times more memory efficient than DPO and trains 1.3 times faster.

LLM Alignment: Combining SFT and Preference Optimization with ORPO

We design and deploy artificial intelligence systems: from prototype to production-ready solutions. Our team combines expertise in machine learning, data engineering and MLOps to make AI work not in the lab, but in real business.

8+Years of workmore info 900+Completed projectsmore info 100+In house employeesmore info 19+Partnersmore info

Services we offer

Showing 1 of 1All 1564 services

LLM Alignment: Combining SFT and Preference Optimization with ORPO

Complex

from 1 week to 3 months

Frequently Asked Questions

AI Development Areas

Discuss your AI project

Free consultation — we'll show you how AI can solve your challenge

Get a quote

We'll estimate the budget and timeline for your AI project

AI Solution Development Stages

Latest works

B2B ADVANCE company website development
1347
Development of a web application for FEEDME
1247
Website development for BELFINGROUP
948
Development of an online store for the company FURNORO
1183
B2B Advance company logo design
642
Development of a web application for Enviok
921

Show more works

LLM Alignment: Combining SFT and Preference Optimization with ORPO

Imagine you need to align a language model to specific preferences — for example, to strictly follow your company's coding style. The classic DPO approach requires two models in memory, while SFT cannot penalize bad responses. We encounter this often in practice. The solution is Odds Ratio Preference Optimization (ORPO), which combines SFT and preference optimization in a single step without a separate reference model.

Why ORPO Saves Memory and Time

ORPO was proposed by Hong et al. It unifies Instruction Tuning and Preference Optimization without needing a separate reference model. By using an odds ratio to penalize undesired responses, alignment can be done on a single GPU instead of two.

Method	Win Rate (AlpacaEval 2.0)	Memory (7B)	Training Time
SFT only	~5%	1×	1×
DPO	~15–20%	2× (ref model)	1.3×
ORPO	~18–22%	1×	1×
SimPO	~20–25%	1×	1×

ORPO uses 2× less memory than DPO while delivering comparable or better quality. This translates to significant cost savings: for a typical fine-tuning project, GPU costs are halved compared to DPO. SimPO (Simple Preference Optimization) is a newer method that often shows slightly better results but requires tuning two hyperparameters.

How ORPO Works: Math and Practice

The loss function combines SFT and odds ratio loss:

L_ORPO = L_SFT + λ * L_OR

L_SFT = -log P(y_w | x)  # standard SFT loss on chosen responses

L_OR = -log(sigmoid(log(odds_ratio(y_w, x) / odds_ratio(y_l, x))))
where odds_ratio(y, x) = P(y|x) / (1 - P(y|x))

The hyperparameter λ (called beta in TRL) determines the weight of the preference loss. Start with beta=0.1 and increase it if the model does not penalize bad responses enough.

How to Build a Quality Preference Dataset?

The dataset format is identical to DPO — pairs of prompt, chosen, rejected:

dataset = {
    "prompt": "How to write a technical specification properly?",
    "chosen": "A technical specification includes several mandatory sections: project goal, functional requirements (with MoSCoW priorities), non-functional requirements (performance, security), constraints, acceptance criteria...",
    "rejected": "Write whatever you want to make developers understand the task"
}

Important: the rejected responses should not just be poor but typically undesirable — so the model learns boundaries. Collect pairs using expert evaluations or LLM-as-Judge.

Implementing ORPO with TRL (Step-by-Step)

Load the base model and tokenizer.
Prepare the dataset with chosen and rejected responses.
Configure LoRA with target modules and rank 16.
Set ORPO hyperparameters: beta=0.1, learning_rate=8e-6, linear scheduler with 10% warmup.
Train using ORPOTrainer.

from trl import ORPOTrainer, ORPOConfig
from peft import LoraConfig
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

orpo_config = ORPOConfig(
    output_dir="./orpo-model",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=8e-6,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    beta=0.1,
    max_length=2048,
    max_prompt_length=512,
    bf16=True,
    remove_unused_columns=False,
    logging_steps=10,
)

trainer = ORPOTrainer(
    model=model,
    args=orpo_config,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    peft_config=LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
        task_type="CAUSAL_LM",
    ),
)

trainer.train()

When to Choose ORPO Over DPO?

Choose ORPO when GPU resources are limited, a good SFT reference model is unavailable, or the alignment task is moderately complex. DPO is better if you already have a high-quality SFT reference model and need fine-grained control over KL divergence. SimPO is worth using when maximum benchmark win rate outweighs implementation simplicity. Note that ORPO is two times more memory efficient than DPO and trains 1.3 times faster.

Hyperparameter	ORPO	DPO	SimPO
λ (beta)	0.1–0.5	0.1–0.5	γ: 0.5–1.5, β: 0.1–0.5
Learning rate	5e-6 – 8e-6	1e-6 – 5e-6	1e-6 – 5e-6
Reference model	Not required	Required	Not required
Sensitivity to rejection quality	Medium	High	High

Real Case: Aligning a Code-Review Model to Fintech Standards

From our practice. A client — a fintech company with strict code security standards. Task: fine-tune Qwen2.5-Coder-7B-Instruct for automated code review, catching all violations. Problem with pure SFT: the model reproduces "correct" reviews well but does not penalize ignoring violations.

ORPO dataset: 1,800 pairs. Chosen — reviews that identified all standard violations. Rejected — reviews that missed critical violations or generated false positives.

Configuration: ORPO, β=0.1, lr=5e-6, 2 epochs, LoRA rank 16.

Results:

Recall of standard violations: 0.67 → 0.91
Precision of comments (no false positives): 0.71 → 0.88
False negative rate (missed critical violations): 28% → 7%
Training time: 3.5 hours on 1×A100 40GB (no reference model overhead)
Estimated annual savings: $40,000 by reducing manual review effort and avoiding security breaches.

What Our ORPO Fine-Tuning Service Includes (Deliverables)

Detailed analysis of your task and requirements
Preference dataset preparation (curation or generation of chosen/rejected pairs)
Selection of base model and PEFT scheme (typically LoRA)
ORPO training with hyperparameter tuning for λ/β
Evaluation using LLM-as-Judge and expert review
Complete pipeline documentation and usage recommendations
Team training on model deployment
Two weeks of post-delivery support

Company Experience and Trust Metrics

Our team has 5+ years of experience in NLP and AI alignment. We have completed over 50 fine-tuning projects for clients in finance, healthcare, and tech. Our certified engineers follow strict security protocols. We guarantee a measurable improvement in model alignment metrics or a free reassessment.

Timeline and How to Start

Estimated timelines:

Preference dataset collection: from 3 weeks (turnkey)
ORPO training (7B, LoRA, A100): 3–8 hours
λ/β iterations: 3–5 days
Evaluation (LLM-as-judge + human): 1 week
Total: 5–8 weeks depending on complexity

Contact us for a free assessment. We will select the optimal alignment method and propose a plan. Request a consultation — our engineers will analyze your task and provide recommendations.

LLM Development: Fine-Tuning, RAG, Agents, and Production Deployment

Using GPT‑4 or Claude 3.5 Sonnet through a public API is not a solution — it's just a tool. When the requirement is to "make it like ChatGPT, but on our data," there is a real engineering challenge behind it: from prompt engineering to training a 70B model on your own infrastructure. End-to-end LLM solution development is a complex stack, and we have been doing it for over 5 years. During this time, we have completed over 20 projects in generative AI: from RAG systems for legal departments to custom support agents. Where exactly your task falls depends on data, latency requirements, budget, and how critical confidentiality is.

A typical situation: the client has already tried ChatGPT, but results are unstable — sometimes accurate, sometimes hallucinating. Or they need integration into a corporate portal while complying with security policies. Let's break down each layer of the stack in detail — from RAG to production deployment.

Why Do RAG Systems Break and How to Fix It?

RAG (Retrieval-Augmented Generation) looks simple: find relevant documents, put them in context, get an answer. In practice, it fails in several places.

Chunking without overlap. Classic mistake: chunk_size=512, overlap=0. If the answer lies across two chunks, retrieval won't find either with sufficient confidence. Solution: overlap 15–25% of chunk_size, or better yet, sentence-aware splitting with spaCy or NLTK instead of naive character splitting.

Poor embedder. text-embedding-ada-002 is good for general use, but on legal or medical texts, specialized models like E5-large-v2, BGE-M3, or fine-tuned sentence-transformers on domain data outperform it. Recall@5 differences can be 15–25%.

No re-ranking. Vector search optimizes for speed, not relevance. A cross-encoder re-ranker (ms-marco-MiniLM-L-6-v2, bge-reranker-large) after initial retrieval improves top-3 accuracy with acceptable latency (+50–150ms). This is often more impactful than improving the embedding model.

Hybrid search. Dense vectors alone work poorly on exact queries: names, SKUs, codes. BM25 (sparse) finds exact matches but misses semantics. Hybrid via RRF (Reciprocal Rank Fusion) is the optimal compromise. Qdrant, Weaviate, and pgvector 0.7+ support hybrid search natively.

Typical production architecture for a corporate knowledge base

Documents → preprocessing (PyMuPDF, Unstructured)
Chunking → embedding (BGE-M3)
Qdrant (hybrid dense+sparse)
Cross-encoder re-ranking
Context → LLM (vLLM or OpenAI API)
Answer with sources (RAGAS for quality evaluation)

When to Fine-Tune Instead of Prompt Engineering?

Prompt engineering solves ~70% of LLM adaptation tasks for a domain. The remaining 30% require fine-tuning. Three indicators: the model ignores a specific output format even with detailed prompting; the task requires deep knowledge of specialized vocabulary (medicine, law); you need to significantly reduce token costs by replacing a large model with a smaller specialized one.

LoRA and QLoRA are the standard for SFT. LoRA adds trainable low-rank matrices to attention layers. A typical configuration for Llama-3 8B: r=64, lora_alpha=128, target_modules=["q_proj","v_proj","k_proj","o_proj"] yields ~0.8% trainable parameters, training on one A100 40GB. QLoRA adds 4-bit quantization (NF4) and allows fine-tuning 70B models on two A100 40GB, though speed drops by half compared to bf16.

DPO instead of RLHF. Direct Preference Optimization requires only (chosen, rejected) pairs, not scalar reward signals. DPOTrainer from the trl library (Hugging Face) implements it in a few dozen lines.

Common mistake. A dataset of 500 examples, 5 epochs, validation loss 0.8 — seems fine. But on test, the model degrades on general instructions. Cause: catastrophic forgetting. Solution: add 10–20% general instruction-following examples (Alpaca, FLAN) to the training set to preserve original capabilities.

How to Choose a Base Model: 8B or 70B?

Model	Parameters	Strengths	Context
Llama-3.1 8B	8B	Quality/speed balance	128k
Llama-3.1 70B	70B	Complex reasoning	128k
Mistral 7B / Mixtral 8x7B	7B / 47B	Efficiency for size	32k
Qwen2.5 72B	72B	Code, multilingual	128k
Gemma 2 27B	27B	Open license	8k

For most tasks, fine-tuning an 8B model is sufficient. 70B is needed when deep reasoning is required or the 8B baseline does not reach the required quality even after fine-tuning. Inference cost for Llama-3 8B via vLLM on A100 is efficient; the exact cost depends on volume.

What Does PagedAttention Bring to Production?

vLLM is the first choice for serving open-source models. PagedAttention is the key technical innovation: KV-cache is managed like virtual memory in an OS, without fragmentation. This yields 2–4x higher throughput compared to naive HuggingFace Transformers inference. The vLLM documentation confirms that continuous batching and PagedAttention are the standard for high-load LLM services.

Typical numbers on A100 80GB for Llama-3 8B (bf16): 400–600 req/s, P50 latency 200–400ms, P99 latency 600–900ms at concurrency 64. For 70B on two A100 with tensor parallelism: 80–120 req/s, P99 latency 1.5–2.5s. AWQ or GPTQ quantization reduces memory consumption by 2x with quality loss within 1–3%.

Multi-Agent Systems

Agents are LLMs with access to tools: search, code execution, API calls, database interaction. Common patterns:

ReAct (Reason + Act): the model reasons → chooses a tool → observes the result → reasons again. LangChain and LlamaIndex implement it out of the box.
Multi-agent orchestration: multiple specialized agents with a coordinator on top. Example: coordinator → researcher (search + summarization) → coder (code generation and execution) → critic (verification). Tools: AutoGen (Microsoft), CrewAI, custom implementation on LangGraph.

In production, agent systems are non-deterministic. Essential: guardrails, step limits, logging of each step, human-in-the-loop for critical actions.

How We Work: Stages, Timeline, Deliverables

Stage	Duration	What You Get
Audit and data collection	1–2 weeks	Eval dataset of 100+ examples, task formalization
Baseline (prompt + RAG)	1–2 weeks	Working prototype, quality metrics
Fine-tuning (if needed)	2–4 weeks	Trained model, LoRA weights, model card
Deployment and monitoring	1–2 weeks	vLLM server, Grafana + Prometheus
Documentation and training	1 week	API documentation, team training

What Is Included

We deliver:

Technical documentation (model card, configs, deployment instructions)
Access to infrastructure (code repository, trained weights)
1 month of post-deployment support (consultations, bug fixes)
Customer team training (2–3 sessions on system operation)

Timeline: basic RAG prototype — 1–2 weeks. Fine-tuning with customer data — 3–6 weeks (including data preparation). Production system with monitoring and retraining — 2–4 months. Cost is calculated individually based on data volume, model complexity, and infrastructure requirements.

We guarantee the quality of the final model with performance benchmarks and ongoing monitoring. Our engineers have hands‑on experience with dozens of production LLM systems.

Want to evaluate your project? Leave a request — we will prepare a preliminary summary within 1–2 business days. Or get a consultation on choosing the approach: RAG, fine-tuning, or hybrid — we will tell you what works best for you. Contact us to discuss your LLM development needs. Schedule a free consultation today.