LLM Fine-Tuning via Full Fine-Tuning Method
Full Fine-Tuning is a training method in which all parameters of the language model are updated, rather than only adapter layers (as in LoRA). It is the most powerful specialization method and typically yields the highest quality, but it requires significant computational resources and careful training management.
When Full Fine-Tuning is Justified
Full FT should not be the default choice; it is justified when specific conditions hold:
Insufficient LoRA/QLoRA quality: if a substantial gap from the target baseline remains even after tuning LoRA, Full FT can add another 3–8% on quality metrics.
Fundamentally new domain: when the model must learn notation or a language that differs significantly from the pretraining distribution (special symbols, formal grammars, unique terminology).
Continual Pre-Training: injecting new knowledge through continued pretraining (CPT), typically followed by Instruction Tuning.
Architectural parameter changes: extending the tokenizer vocabulary, or changing the context length via RoPE scaling.
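The vocabulary-extension case can be illustrated with a framework-free sketch (the function name and toy data below are hypothetical): embeddings for newly added tokens are commonly initialized to the mean of the existing rows, which keeps the model's output distribution close to the pretrained one.

```python
# Sketch: mean-initialization of new token embeddings (illustrative only;
# a real embedding matrix is a tensor, not a list of lists).

def extend_embeddings(embeddings: list[list[float]], n_new: int) -> list[list[float]]:
    """Append n_new rows, each set to the column-wise mean of the existing rows."""
    dim = len(embeddings[0])
    vocab = len(embeddings)
    mean_row = [sum(row[j] for row in embeddings) / vocab for j in range(dim)]
    return embeddings + [mean_row[:] for _ in range(n_new)]

if __name__ == "__main__":
    emb = [[1.0, 0.0], [3.0, 2.0]]       # toy 2-token vocabulary, dim=2
    extended = extend_embeddings(emb, n_new=2)
    print(len(extended), extended[2])    # 4 [2.0, 1.0]
```

With Hugging Face models, the equivalent workflow is `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`.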
Technical Aspects of Full Fine-Tuning
Memory Requirements
For Full FT of N-parameter model in bf16:
- Model parameters: 2N bytes
- Gradients: 2N bytes (bf16) or 4N bytes (fp32)
- Optimizer (AdamW): 8N bytes (fp32 moments)
- Activations: depend on batch size and sequence length
Total: at least 12N bytes before activations. For 7B this is ~84 GB; for 70B, ~840 GB.
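The estimate above can be checked with a quick back-of-the-envelope calculation (a sketch; activations are excluded, and the helper name is hypothetical):

```python
# Sketch: minimal Full FT memory estimate, matching the breakdown above —
# bf16 weights (2N) + bf16 gradients (2N) + fp32 AdamW moments (8N).

def full_ft_memory_gb(n_params: float) -> float:
    bytes_per_param = 2 + 2 + 8  # weights + gradients + optimizer states
    return n_params * bytes_per_param / 1e9

print(round(full_ft_memory_gb(7e9)))   # 84  (GB, 7B model)
print(round(full_ft_memory_gb(70e9)))  # 840 (GB, 70B model)
```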
DeepSpeed ZeRO for Distributed Training
ZeRO (Zero Redundancy Optimizer) distributes parameters, gradients, and optimizer states across GPUs:
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {"device": "cpu"},
    "offload_param": {"device": "cpu"},
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto"
  },
  "bf16": {"enabled": true},
  "gradient_accumulation_steps": 8,
  "gradient_clipping": 1.0,
  "train_micro_batch_size_per_gpu": 2
}
ZeRO Stage 3 with CPU offloading makes it possible to train a 7B model on 4×A100 40GB instead of 8 GPUs.
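How the stages reduce the per-GPU footprint can be sketched with a simple model (illustrative only; it ignores activations, communication buffers, and CPU offloading): stage 1 shards optimizer states across GPUs, stage 2 additionally shards gradients, stage 3 additionally shards parameters.

```python
# Sketch: per-GPU memory under ZeRO stages (bf16 weights/grads, fp32 AdamW states).

def zero_per_gpu_gb(n_params: float, n_gpus: int, stage: int) -> float:
    opt = 8 * n_params / n_gpus                           # stages 1-3 shard optimizer states
    grads = 2 * n_params / (n_gpus if stage >= 2 else 1)  # stage 2+ shards gradients
    params = 2 * n_params / (n_gpus if stage >= 3 else 1) # stage 3 shards parameters
    return (opt + grads + params) / 1e9

for stage in (1, 2, 3):
    print(stage, zero_per_gpu_gb(7e9, 4, stage))  # 1 42.0 / 2 31.5 / 3 21.0
```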
FSDP as Alternative to DeepSpeed
PyTorch Fully Sharded Data Parallel (FSDP) is PyTorch's native alternative to DeepSpeed, better integrated with the PyTorch ecosystem:
# These keys map to Hugging Face TrainingArguments fields; the decoder layer
# class is referenced by name, so no direct import of LlamaDecoderLayer is needed.
fsdp_config = {
    "fsdp": "full_shard auto_wrap",
    "fsdp_config": {
        "fsdp_auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
        "fsdp_transformer_layer_cls_to_wrap": "LlamaDecoderLayer",
        "fsdp_state_dict_type": "FULL_STATE_DICT",
        "fsdp_offload_params": False,
    },
}
# Passed to the Trainer, e.g.: TrainingArguments(output_dir="out", bf16=True, **fsdp_config)
Gradient Checkpointing
Reduces activation memory by recomputing parts of the forward pass during backward:
model.gradient_checkpointing_enable()
model.config.use_cache = False  # the KV cache is incompatible with checkpointing during training
# Memory reduction ~4× at the cost of ~20% slower training
Managing Learning Rate in Full Fine-Tuning
For Full FT, the learning rate schedule is critical:
Warmup: during the first 5–10% of steps, lr ramps linearly from 0 to the target value. This prevents gradient explosions early in training.
Cosine decay: smooth reduction of lr to 10% of its peak by the end of training.
Target values: for Full FT on a specialized dataset — 1e-5 to 5e-5; for CPT — 1e-5 or lower.
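The schedule described above can be sketched as a pure-Python helper (`lr_at` is a hypothetical name; in practice, trainers use built-in schedulers):

```python
import math

# Sketch: linear warmup over the first 5% of steps, then cosine decay
# from the peak lr down to a floor of 10% of peak.

def lr_at(step: int, total_steps: int, peak_lr: float = 2e-5,
          warmup_ratio: float = 0.05, floor_ratio: float = 0.10) -> float:
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)   # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))  # 1 -> 0 over training
    floor = peak_lr * floor_ratio
    return floor + (peak_lr - floor) * cosine

total = 1000
print(lr_at(0, total))     # 0.0 at the start of warmup
print(lr_at(50, total))    # peak (~2e-05) right after warmup
print(lr_at(1000, total))  # ~2e-06, i.e. 10% of peak, at the end
```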
Catastrophic forgetting: updating all weights can destroy the model's general knowledge. Mitigations: a low lr, a replay buffer (mixing in general-domain data), and EWC (Elastic Weight Consolidation).
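The replay-buffer mitigation can be sketched as follows (`mix_with_replay` is a hypothetical helper; real pipelines usually do this mixing at the dataset or dataloader level):

```python
import random

# Sketch: mix a fraction of general-purpose samples into the domain dataset
# so that Full FT keeps rehearsing pretraining-like data.

def mix_with_replay(domain: list, general: list, replay_ratio: float = 0.2,
                    seed: int = 0) -> list:
    """Return domain data plus replay_ratio * len(domain) general samples, shuffled."""
    rng = random.Random(seed)
    n_replay = int(len(domain) * replay_ratio)
    mixed = domain + rng.sample(general, n_replay)
    rng.shuffle(mixed)
    return mixed

domain = [f"fin_{i}" for i in range(100)]
general = [f"gen_{i}" for i in range(1000)]
batch = mix_with_replay(domain, general, replay_ratio=0.2)
print(len(batch))                                # 120
print(sum(x.startswith("gen_") for x in batch))  # 20
```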
Practical Case: Full FT for Financial Regulator
Task: specialized model for Central Bank analytics — analyze bank reports in XBRL formats, detect prudential regulation violations, generate directives.
Why Full FT rather than LoRA: the specific language of regulatory directives (legal constructs, references to regulations) and new symbol patterns (form codes, regulatory formulas). LoRA with r=64 reached F1=0.79; Full FT reached F1=0.91.
Infrastructure: 8×A100 80GB, DeepSpeed ZeRO Stage 2, bf16.
Dataset: 6800 examples (report form → analysis + directive).
Training params: lr=2e-5, warmup_ratio=0.05, cosine decay, 3 epochs, effective batch size=64.
Results:
- F1 violation detection: 0.79 (LoRA r=64) → 0.91 (Full FT)
- ROUGE-L for directives: 0.61 → 0.74
- Training time: 14 hours on 8×A100
Full Fine-Tuning Infrastructure Requirements
| Model | GPU (no offload) | GPU (ZeRO Stage 3 + CPU) | Time (3 epochs, 5K examples) |
|---|---|---|---|
| 7B | 4×A100 40GB | 2×A100 40GB | 4–8h |
| 13B | 8×A100 40GB | 4×A100 40GB | 8–16h |
| 70B | 8×A100 80GB | 4×A100 80GB | 24–48h |
| 70B | 16×H100 80GB | 8×H100 80GB | 12–24h |
Project Timeline
- Audit and planning: 1–2 weeks
- Infrastructure preparation (cluster, DDP/FSDP/DeepSpeed): 1 week
- Data preparation: 2–6 weeks
- Training and iterations: 2–4 weeks
- Evaluation, A/B, deployment: 1–2 weeks
- Total: 7–15 weeks







