LLM Fine-Tuning via Full Fine-Tuning Method
Full Fine-Tuning is a training method in which all parameters of the language model are updated, rather than only adapter layers (as in LoRA). It is the most powerful specialization method and typically yields the highest quality, but it requires significant computational resources and careful training management.
When Full Fine-Tuning is Justified
Full FT should not be the default choice; it is justified when specific conditions hold:
Insufficient LoRA/QLoRA quality: if a substantial gap from the target baseline remains even after tuning LoRA, Full FT can add another 3–8% on quality metrics.
Fundamentally new domain: when the model must learn notation or a language that differs significantly from the pretraining distribution (special symbols, formal grammars, unique terminology).
Continual Pre-Training: injecting new knowledge through continued pretraining (CPT), typically followed by Instruction Tuning.
Architectural parameter changes: extending the tokenizer vocabulary, or changing the context length via RoPE scaling.
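The vocabulary-extension case can be illustrated with a framework-free sketch (the function name and toy data below are hypothetical): embeddings for newly added tokens are commonly initialized to the mean of the existing rows, which keeps the model's output distribution close to the pretrained one.

```python
# Sketch: mean-initialization of new token embeddings (illustrative only;
# a real embedding matrix is a tensor, not a list of lists).

def extend_embeddings(embeddings: list[list[float]], n_new: int) -> list[list[float]]:
    """Append n_new rows, each set to the column-wise mean of the existing rows."""
    dim = len(embeddings[0])
    vocab = len(embeddings)
    mean_row = [sum(row[j] for row in embeddings) / vocab for j in range(dim)]
    return embeddings + [mean_row[:] for _ in range(n_new)]

if __name__ == "__main__":
    emb = [[1.0, 0.0], [3.0, 2.0]]       # toy 2-token vocabulary, dim=2
    extended = extend_embeddings(emb, n_new=2)
    print(len(extended), extended[2])    # 4 [2.0, 1.0]
```

With Hugging Face models, the equivalent workflow is `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`.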
Technical Aspects of Full Fine-Tuning
Memory Requirements
For Full FT of N-parameter model in bf16:
- Model parameters: 2N bytes
- Gradients: 2N bytes (bf16) or 4N bytes (fp32)
- Optimizer (AdamW): 8N bytes (fp32 moments)
- Activations: depend on batch size and sequence length
Total: at least 12N bytes before activations. For 7B this is ~84 GB; for 70B, ~840 GB.
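The estimate above can be checked with a quick back-of-the-envelope calculation (a sketch; activations are excluded, and the helper name is hypothetical):

```python
# Sketch: minimal Full FT memory estimate, matching the breakdown above —
# bf16 weights (2N) + bf16 gradients (2N) + fp32 AdamW moments (8N).

def full_ft_memory_gb(n_params: float) -> float:
    bytes_per_param = 2 + 2 + 8  # weights + gradients + optimizer states
    return n_params * bytes_per_param / 1e9

print(round(full_ft_memory_gb(7e9)))   # 84  (GB, 7B model)
print(round(full_ft_memory_gb(70e9)))  # 840 (GB, 70B model)
```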
DeepSpeed ZeRO for Distributed Training
ZeRO (Zero Redundancy Optimizer) distributes parameters, gradients, and optimizer states across GPUs:
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {"device": "cpu"},
    "offload_param": {"device": "cpu"},
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto"
  },
  "bf16": {"enabled": true},
  "gradient_accumulation_steps": 8,
  "gradient_clipping": 1.0,
  "train_micro_batch_size_per_gpu": 2
}
ZeRO Stage 3 with CPU offloading makes it possible to train a 7B model on 4×A100 40GB instead of 8 GPUs.
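How the stages reduce the per-GPU footprint can be sketched with a simple model (illustrative only; it ignores activations, communication buffers, and CPU offloading): stage 1 shards optimizer states across GPUs, stage 2 additionally shards gradients, stage 3 additionally shards parameters.

```python
# Sketch: per-GPU memory under ZeRO stages (bf16 weights/grads, fp32 AdamW states).

def zero_per_gpu_gb(n_params: float, n_gpus: int, stage: int) -> float:
    opt = 8 * n_params / n_gpus                           # stages 1-3 shard optimizer states
    grads = 2 * n_params / (n_gpus if stage >= 2 else 1)  # stage 2+ shards gradients
    params = 2 * n_params / (n_gpus if stage >= 3 else 1) # stage 3 shards parameters
    return (opt + grads + params) / 1e9

for stage in (1, 2, 3):
    print(stage, zero_per_gpu_gb(7e9, 4, stage))  # 1 42.0 / 2 31.5 / 3 21.0
```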
FSDP as Alternative to DeepSpeed
PyTorch Fully Sharded Data Parallel (FSDP) is PyTorch's native alternative to DeepSpeed, better integrated with the PyTorch ecosystem:
# These keys map to Hugging Face TrainingArguments fields; the decoder layer
# class is referenced by name, so no direct import of LlamaDecoderLayer is needed.
fsdp_config = {
    "fsdp": "full_shard auto_wrap",
    "fsdp_config": {
        "fsdp_auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
        "fsdp_transformer_layer_cls_to_wrap": "LlamaDecoderLayer",
        "fsdp_state_dict_type": "FULL_STATE_DICT",
        "fsdp_offload_params": False,
    },
}
# Passed to the Trainer, e.g.: TrainingArguments(output_dir="out", bf16=True, **fsdp_config)
Gradient Checkpointing
Reduces activation memory by recomputing parts of the forward pass during backward:
model.gradient_checkpointing_enable()
model.config.use_cache = False  # the KV cache is incompatible with checkpointing during training
# Memory reduction ~4× at the cost of ~20% slower training
Managing Learning Rate in Full Fine-Tuning
For Full FT, the learning rate schedule is critical:
Warmup: during the first 5–10% of steps, lr ramps linearly from 0 to the target value. This prevents gradient explosions early in training.
Cosine decay: smooth reduction of lr to 10% of its peak by the end of training.
Target values: for Full FT on a specialized dataset — 1e-5 to 5e-5; for CPT — 1e-5 or lower.
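The schedule described above can be sketched as a pure-Python helper (`lr_at` is a hypothetical name; in practice, trainers use built-in schedulers):

```python
import math

# Sketch: linear warmup over the first 5% of steps, then cosine decay
# from the peak lr down to a floor of 10% of peak.

def lr_at(step: int, total_steps: int, peak_lr: float = 2e-5,
          warmup_ratio: float = 0.05, floor_ratio: float = 0.10) -> float:
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)   # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))  # 1 -> 0 over training
    floor = peak_lr * floor_ratio
    return floor + (peak_lr - floor) * cosine

total = 1000
print(lr_at(0, total))     # 0.0 at the start of warmup
print(lr_at(50, total))    # peak (~2e-05) right after warmup
print(lr_at(1000, total))  # ~2e-06, i.e. 10% of peak, at the end
```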
Catastrophic forgetting: updating all weights can destroy the model's general knowledge. Mitigations: a low lr, a replay buffer (mixing in general-domain data), and EWC (Elastic Weight Consolidation).
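The replay-buffer mitigation can be sketched as follows (`mix_with_replay` is a hypothetical helper; real pipelines usually do this mixing at the dataset or dataloader level):

```python
import random

# Sketch: mix a fraction of general-purpose samples into the domain dataset
# so that Full FT keeps rehearsing pretraining-like data.

def mix_with_replay(domain: list, general: list, replay_ratio: float = 0.2,
                    seed: int = 0) -> list:
    """Return domain data plus replay_ratio * len(domain) general samples, shuffled."""
    rng = random.Random(seed)
    n_replay = int(len(domain) * replay_ratio)
    mixed = domain + rng.sample(general, n_replay)
    rng.shuffle(mixed)
    return mixed

domain = [f"fin_{i}" for i in range(100)]
general = [f"gen_{i}" for i in range(1000)]
batch = mix_with_replay(domain, general, replay_ratio=0.2)
print(len(batch))                                # 120
print(sum(x.startswith("gen_") for x in batch))  # 20
```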
Practical Case: Full FT for Financial Regulator
Task: specialized model for Central Bank analytics — analyze bank reports in XBRL formats, detect prudential regulation violations, generate directives.
Why Full FT rather than LoRA: the specific language of regulatory directives (legal constructs, references to regulations) and new symbol patterns (form codes, regulatory formulas). LoRA with r=64 reached F1=0.79; Full FT reached F1=0.91.
Infrastructure: 8×A100 80GB, DeepSpeed ZeRO Stage 2, bf16.
Dataset: 6800 examples (report form → analysis + directive).
Training params: lr=2e-5, warmup_ratio=0.05, cosine decay, 3 epochs, effective batch size=64.
Results:
- F1 violation detection: 0.79 (LoRA r=64) → 0.91 (Full FT)
- ROUGE-L for directives: 0.61 → 0.74
- Training time: 14 hours on 8×A100
Full Fine-Tuning Infrastructure Requirements
| Model | GPU (no offload) | GPU (ZeRO Stage 3 + CPU) | Time (3 epochs, 5K examples) |
|---|---|---|---|
| 7B | 4×A100 40GB | 2×A100 40GB | 4–8h |
| 13B | 8×A100 40GB | 4×A100 40GB | 8–16h |
| 70B | 8×A100 80GB | 4×A100 80GB | 24–48h |
| 70B | 16×H100 80GB | 8×H100 80GB | 12–24h |
Project Timeline
- Audit and planning: 1–2 weeks
- Infrastructure preparation (cluster, DDP/FSDP/DeepSpeed): 1 week
- Data preparation: 2–6 weeks
- Training and iterations: 2–4 weeks
- Evaluation, A/B, deployment: 1–2 weeks
- Total: 7–15 weeks







