Fine-Tuning LLM with PEFT (Parameter-Efficient Fine-Tuning)
PEFT is not a single method but a family of parameter-efficient fine-tuning approaches, unified by Hugging Face's peft library. LoRA and QLoRA are the most popular representatives, but PEFT also includes other techniques: Prefix Tuning, Prompt Tuning, IA³, and AdaLoRA. The choice of method depends on the task, data volume, available hardware, and inference requirements.
PEFT Methods: Comparison
| Method | Trainable Parameters | Inference Overhead | Application |
|---|---|---|---|
| LoRA | 0.1–5% | None (after merge) | Generation, classification |
| QLoRA | 0.1–5% | None (after merge) | Same, lower VRAM |
| DoRA | 0.1–5% | None (after merge) | Enhanced LoRA |
| AdaLoRA | 0.1–3% | None (after merge) | Adaptive rank |
| Prefix Tuning | <0.1% | Yes (prefix tokens) | Low data, NLU |
| Prompt Tuning | <0.01% | Yes | Minimal data |
| IA³ | <0.01% | None (vectors fold into weights) | Few-shot adaptation |
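The "None (after merge)" entries follow directly from the low-rank formulation used by LoRA and its variants: the trained update ΔW = (α/r)·BA can be added into the frozen weight, so the merged model has exactly the same shape and cost as the base. A minimal NumPy sketch of this identity (illustrative shapes, not the peft implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 8, 16              # hidden size, LoRA rank, scaling factor

W = rng.normal(size=(d, d))          # frozen base weight
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection (r x d)
B = rng.normal(size=(d, r)) * 0.01   # trainable up-projection (d x r); zero-initialized before training

x = rng.normal(size=(d,))

# During training: base path plus the low-rank bypass
y_adapter = W @ x + (alpha / r) * (B @ (A @ x))

# After merge: one dense matmul, identical output, zero inference overhead
W_merged = W + (alpha / r) * (B @ A)
y_merged = W_merged @ x

print(np.allclose(y_adapter, y_merged))  # True
```

The trainable parameter counts in the table come from the same picture: the adapter stores 2·d·r values per matrix instead of d², which at r = 8–16 lands in the 0.1–5% range.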
AdaLoRA: Adaptive Rank Selection
AdaLoRA automatically distributes the parameter "budget" across layers, allocating larger ranks to important layers:
```python
from peft import AdaLoraConfig, get_peft_model

config = AdaLoraConfig(
    init_r=12,       # Initial rank for every adapted matrix
    target_r=8,      # Target average rank after budget reallocation
    beta1=0.85,      # EMA coefficient for sensitivity smoothing
    beta2=0.85,      # EMA coefficient for uncertainty smoothing
    deltaT=10,       # Interval (in steps) between rank-budget updates
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)  # model: an already loaded transformers model
```
AdaLoRA is useful when it's unknown in advance which layers are most important for adaptation.
Prefix Tuning: Soft Tokens for Task
Prefix Tuning prepends trainable "soft" (virtual) tokens to the key/value states of every attention layer. The base weights are completely frozen:
```python
from peft import PrefixTuningConfig

config = PrefixTuningConfig(
    task_type="CAUSAL_LM",
    num_virtual_tokens=20,    # Number of prefix tokens per layer
    prefix_projection=True,   # Reparameterize the prefix through an MLP
)
```
Advantage: extremely few parameters (<0.1%). Disadvantage: prefix tokens occupy part of the context window during each inference.
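The context-window cost becomes clear once the mechanism is written out: the learned prefix enters attention as extra key/value rows, so every query attends over num_virtual_tokens + seq_len positions. A NumPy sketch of a single attention head with a prefix (illustrative, not the peft internals):

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, n_prefix, d = 6, 20, 32     # real tokens, virtual tokens, head dim

# Frozen projections of the real tokens
K = rng.normal(size=(seq_len, d))
V = rng.normal(size=(seq_len, d))
q = rng.normal(size=(d,))

# Trainable per-layer prefix keys/values (the only learned parameters)
P_k = rng.normal(size=(n_prefix, d)) * 0.02
P_v = rng.normal(size=(n_prefix, d)) * 0.02

# Attention now runs over prefix + real tokens
K_full = np.concatenate([P_k, K], axis=0)   # (n_prefix + seq_len, d)
V_full = np.concatenate([P_v, V], axis=0)

scores = K_full @ q / np.sqrt(d)
weights = np.exp(scores - scores.max())
weights /= weights.sum()
out = weights @ V_full                       # context vector, shape (d,)
```

The 20 prefix positions here are permanently occupied at every layer and every call, which is exactly the overhead the table's "Yes (prefix tokens)" column refers to.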
IA³: Infused Adapter by Inhibiting and Amplifying Inner Activations
IA³ introduces trainable element-wise scaling vectors for keys, values, and FFN activations:
```python
from peft import IA3Config

config = IA3Config(
    target_modules=["k_proj", "v_proj", "down_proj"],   # Where scaling vectors are inserted
    feedforward_modules=["down_proj"],                  # Modules treated as feed-forward
    task_type="CAUSAL_LM",
)
```
IA³ gives impressive results in few-shot scenarios with minimal data (50–200 examples), but underperforms LoRA with larger datasets.
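The reason IA³ adds no inference overhead is that an element-wise scaling of an activation is equivalent to rescaling the rows of the preceding weight matrix, so the learned vector can be folded in at deployment. A NumPy sketch of one scaled projection (illustrative, not the peft implementation):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 32

W_v = rng.normal(size=(d, d))                     # frozen value projection
l_v = np.ones(d) + rng.normal(size=(d,)) * 0.05   # trainable scaling vector, initialized at 1

x = rng.normal(size=(d,))

# Training-time view: rescale the activation element-wise
y_scaled = l_v * (W_v @ x)

# Deployment: fold the vector into the weight rows (no runtime overhead)
W_v_merged = l_v[:, None] * W_v
y_merged = W_v_merged @ x

print(np.allclose(y_scaled, y_merged))  # True
```

The parameter count also falls out of this picture: each adapted matrix adds only d trainable values against its d² frozen ones, which is where the <0.01% figure comes from.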
Practical Method Comparison on Single Dataset
Task: financial news sentiment classification (Positive/Negative/Neutral).
Dataset: 1200 examples, Llama 3.1 8B Instruct base model.
| Method | Parameters | VRAM (A100) | Accuracy | Training Time |
|---|---|---|---|---|
| 5-shot (no FT) | 0 | 16 GB | 0.74 | — |
| IA³ | ~0.01% | 16 GB | 0.81 | 15 min |
| Prefix Tuning (20 tokens) | ~0.05% | 16 GB | 0.83 | 25 min |
| LoRA r=8 | ~0.2% | 18 GB | 0.89 | 45 min |
| LoRA r=16 | ~0.4% | 19 GB | 0.91 | 55 min |
| QLoRA r=16 (4-bit base) | ~0.4% | 9 GB | 0.90 | 70 min |
| Full FT | 100% | 4×A100 | 0.93 | 8 h |
Conclusion: LoRA r=16 is the optimal choice for most tasks. IA³ is justified only with critical resource constraints or very small datasets.
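The QLoRA row in the table corresponds to loading the base model in 4-bit NF4 and training a LoRA adapter on top of the quantized weights. A typical setup with transformers and peft (a sketch; the rank and target modules mirror the experiment above, and bitsandbytes is assumed to be installed):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization of the frozen base weights
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
)
model = prepare_model_for_kbit_training(model)

# The LoRA adapter itself is trained in full precision
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```

This is where the VRAM drop in the table comes from: the 8B base shrinks to roughly a quarter of its fp16 size, at the cost of somewhat slower training steps.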
Managing Multiple Adapters through PEFT
PEFT allows loading and switching multiple adapters in a single model:
```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load base model
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

# Load multiple adapters on top of the same base weights
model = PeftModel.from_pretrained(base_model, "./adapter-legal", adapter_name="legal")
model.load_adapter("./adapter-finance", adapter_name="finance")
model.load_adapter("./adapter-medical", adapter_name="medical")

# Dynamic switching
model.set_adapter("legal")
output_legal = model.generate(...)
model.set_adapter("finance")
output_finance = model.generate(...)
```
This is the "one base instance — multiple specializations" architectural pattern, which reduces memory overhead when serving multiple domains.
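The memory argument behind this pattern can be made concrete: the base weights are stored once, each adapter contributes only its low-rank factors, and switching domains just selects a different factor pair. A toy NumPy model of the pattern (illustrative, not the peft internals):

```python
import numpy as np

rng = np.random.default_rng(3)
d, r = 1024, 8

W_base = rng.normal(size=(d, d))     # shared base weight, stored once

# Each domain adapter is only a pair of low-rank factors
adapters = {
    name: (rng.normal(size=(d, r)) * 0.01, rng.normal(size=(r, d)) * 0.01)
    for name in ["legal", "finance", "medical"]
}

def forward(x, active):
    """Base path plus the currently selected adapter's low-rank bypass."""
    B, A = adapters[active]
    return W_base @ x + B @ (A @ x)

x = rng.normal(size=(d,))
y_legal = forward(x, "legal")        # switching = picking another key
y_finance = forward(x, "finance")

base_params = d * d
per_adapter = 2 * d * r
print(per_adapter / base_params)     # 0.015625: each adapter adds ~1.6% of the base size
```

Serving three domains this way costs one base copy plus three small adapters, instead of three full fine-tuned models.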
Timeline
- PEFT method selection and experiments: 3–7 days
- Data preparation: 2–4 weeks
- Training and method comparison: 1–2 weeks
- Total: 3–6 weeks