LLM Full Fine-Tuning

We design and deploy artificial intelligence systems, from prototype to production-ready solution. Our team combines expertise in machine learning, data engineering, and MLOps to make AI work not just in the lab, but in real business settings.

LLM Fine-Tuning via Full Fine-Tuning Method

Full Fine-Tuning is a training method in which all parameters of the language model are updated, not just adapter layers (as in LoRA). It is the most powerful specialization tool and delivers the highest quality, but it requires significant computational resources and careful training management.
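To make the contrast concrete, here is a rough trainable-parameter comparison. The model dimensions (hidden size 4096, 32 layers, 7B total parameters) and the choice of LoRA r=8 on the q/v projections are illustrative assumptions, not figures from this article:

```python
# Full FT updates every weight; LoRA trains only small low-rank adapters.
# Numbers below assume a Llama-7B-like model: hidden size 4096, 32 layers,
# LoRA rank r=8 applied to the q and v projections (2 matrices per layer).
def lora_params(r: int, d_in: int, d_out: int, n_layers: int, n_matrices: int) -> int:
    # Each adapted weight gets two low-rank factors: A (r x d_in) and B (d_out x r)
    return n_matrices * n_layers * r * (d_in + d_out)

full_ft = 7_000_000_000
lora = lora_params(r=8, d_in=4096, d_out=4096, n_layers=32, n_matrices=2)
print(lora)                     # 4_194_304 adapter parameters
print(f"{lora / full_ft:.4%}")  # well under 0.1% of what Full FT updates
```

This is why Full FT needs a different class of infrastructure: the optimizer state and gradients must cover all 7B weights, not a few million adapter weights.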

When Full Fine-Tuning is Justified

Full FT should not be the default choice; it is justified when specific conditions hold:

Insufficient LoRA/QLoRA quality: if a substantial quality gap remains after LoRA has been tuned, Full FT can yield an additional 3–8% improvement in metrics.

Fundamentally new domain: when the model must be trained on notation or language that differs significantly from the pretraining distribution (special symbols, formal grammars, unique terminology).

Continual Pre-Training: adding new knowledge to the model through continued pretraining (CPT), followed by instruction tuning.

Architectural changes: extending the vocabulary (tokenizer), changing the context length via RoPE scaling.

Technical Aspects of Full Fine-Tuning

Memory Requirements

For Full FT of an N-parameter model in bf16:

  • Model parameters: 2N bytes
  • Gradients: 2N bytes (bf16) or 4N bytes (fp32)
  • Optimizer states (AdamW): 8N bytes (two fp32 moments, 4N each)
  • Activations: depend on batch size and sequence length

In total: a minimum of 12N bytes, excluding activations (mixed-precision setups that also keep an fp32 master copy of the weights add another 4N). For 7B: ~84 GB; for 70B: ~840 GB.
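The arithmetic above can be sketched as a small helper (a rough estimate only; activations, CUDA context, and fragmentation overhead are excluded):

```python
# Rough Full FT memory estimate following the 12N-bytes rule of thumb:
# bf16 weights (2N) + gradients (2N or 4N) + AdamW fp32 moments (8N).
def full_ft_memory_gb(n_params_billion: float,
                      grad_bytes: int = 2,       # 2 for bf16 grads, 4 for fp32
                      optim_bytes: int = 8) -> float:
    param_bytes = 2                              # bf16 weights
    total_bytes = n_params_billion * 1e9 * (param_bytes + grad_bytes + optim_bytes)
    return total_bytes / 1e9                     # GB (decimal)

print(full_ft_memory_gb(7))    # -> 84.0
print(full_ft_memory_gb(70))   # -> 840.0
```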

DeepSpeed ZeRO for Distributed Training

ZeRO (Zero Redundancy Optimizer) distributes parameters, gradients, and optimizer states across GPUs:

{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {"device": "cpu"},
    "offload_param": {"device": "cpu"},
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto"
  },
  "bf16": {"enabled": true},
  "gradient_accumulation_steps": 8,
  "gradient_clipping": 1.0,
  "train_micro_batch_size_per_gpu": 2
}

ZeRO Stage 3 with CPU offloading makes it possible to train a 7B model on 4×A100 40GB instead of 8 GPUs.

FSDP as Alternative to DeepSpeed

PyTorch Fully Sharded Data Parallel (FSDP) is a native PyTorch alternative to DeepSpeed, better integrated with the PyTorch ecosystem:

# This dict is consumed by transformers.TrainingArguments
# (TrainingArguments(**fsdp_config, ...)); the decoder-layer class to wrap
# is referenced by name, so no direct import of LlamaDecoderLayer is needed.
fsdp_config = {
    "fsdp": "full_shard auto_wrap",
    "fsdp_config": {
        "fsdp_auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
        "fsdp_transformer_layer_cls_to_wrap": "LlamaDecoderLayer",
        "fsdp_state_dict_type": "FULL_STATE_DICT",
        "fsdp_offload_params": False,
    },
}

Gradient Checkpointing

Reduces activation memory by recomputing part of the forward pass during the backward pass:

model.config.use_cache = False   # the KV cache is incompatible with checkpointing
model.gradient_checkpointing_enable()
# Memory reduction up to ~4x at the cost of ~20% slower training

Managing Learning Rate in Full Fine-Tuning

For Full FT, the learning rate schedule is critical:

Warmup: over the first 5–10% of steps, the lr grows linearly from 0 to the target value, preventing gradient explosions early in training.

Cosine decay: smooth reduction of the lr to ~10% of its peak by the end of training.

Typical values: 1e-5 to 5e-5 for Full FT on a specialized dataset; 1e-5 or lower for CPT.
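The schedule described above (linear warmup, then cosine decay to a 10% floor) can be written as a framework-agnostic function; a `LambdaLR` can wrap it as a multiplicative factor. The 5% warmup ratio here is just a default matching the range above:

```python
import math

# Linear warmup from 0 to peak_lr, then cosine decay to floor_ratio * peak_lr.
def lr_at_step(step: int, total_steps: int, peak_lr: float,
               warmup_ratio: float = 0.05, floor_ratio: float = 0.1) -> float:
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)     # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))    # goes 1 -> 0
    return peak_lr * (floor_ratio + (1 - floor_ratio) * cosine)

print(lr_at_step(0, 1000, 2e-5))      # 0.0 at the first step
print(lr_at_step(50, 1000, 2e-5))     # peak lr at the end of warmup
print(lr_at_step(1000, 1000, 2e-5))   # roughly 2e-6, the 10% floor
```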

Catastrophic forgetting: a full weight update can destroy the model's general knowledge. Mitigations: a low lr, a replay buffer (mixing in general-domain data), and EWC (Elastic Weight Consolidation).
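For reference, EWC adds a quadratic penalty to the task loss that anchors each weight to its pre-fine-tuning value, weighted by an estimate of the diagonal Fisher information (the standard formulation from Kirkpatrick et al.):

```latex
% theta_i^* are the original (pretrained) weights, F_i the diagonal Fisher
% estimates of each weight's importance, lambda the penalty strength
\mathcal{L}(\theta) = \mathcal{L}_{\text{task}}(\theta)
  + \frac{\lambda}{2} \sum_i F_i \, \left(\theta_i - \theta_i^{*}\right)^2
```

Weights important for general capabilities (large F_i) are held close to their pretrained values, while unimportant ones are free to adapt to the new task.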

Practical Case: Full FT for Financial Regulator

Task: a specialized model for Central Bank analytics: analyzing bank reports in XBRL format, detecting violations of prudential regulations, and generating directives.

Why Full FT rather than LoRA: the specific language of regulatory directives (legal constructs, references to regulations) and new symbol patterns (form codes, regulatory formulas). LoRA with r=64 reached F1=0.79; Full FT reached F1=0.91.

Infrastructure: 8×A100 80GB, DeepSpeed ZeRO Stage 2, bf16.

Dataset: 6,800 examples (report form → analysis + directive).

Training params: lr=2e-5, warmup_ratio=0.05, cosine decay, 3 epochs, effective batch size=64.

Results:

  • F1 violation detection: 0.79 (LoRA r=64) → 0.91 (Full FT)
  • ROUGE-L for directives: 0.61 → 0.74
  • Training time: 14 hours on 8×A100

Full Fine-Tuning Infrastructure Requirements

Model | GPU (no offload) | GPU (ZeRO Stage 3 + CPU offload) | Time (3 epochs, 5K examples)
----- | ---------------- | -------------------------------- | ----------------------------
7B    | 4×A100 40GB      | 2×A100 40GB                      | 4–8 h
13B   | 8×A100 40GB      | 4×A100 40GB                      | 8–16 h
70B   | 8×A100 80GB      | 4×A100 80GB                      | 24–48 h
70B   | 16×H100 80GB     | 8×H100 80GB                      | 12–24 h

Project Timeline

  • Audit and planning: 1–2 weeks
  • Infrastructure preparation (cluster, DDP/FSDP/DeepSpeed): 1 week
  • Data preparation: 2–6 weeks
  • Training and iterations: 2–4 weeks
  • Evaluation, A/B, deployment: 1–2 weeks
  • Total: 7–15 weeks