Fine-Tuning LLM with PEFT (Parameter-Efficient Fine-Tuning)

PEFT is not a single method but a family of approaches to parameter-efficient fine-tuning, unified by the peft library from Hugging Face. LoRA and QLoRA are the most popular representatives, but PEFT includes other techniques: Prefix Tuning, Prompt Tuning, IA³, AdaLoRA. The choice of specific method depends on the task, data volume, available resources, and inference requirements.

PEFT Methods: Comparison

| Method | Trainable Parameters | Inference Overhead | Application |
|---|---|---|---|
| LoRA | 0.1–5% | None (after merge) | Generation, classification |
| QLoRA | 0.1–5% | None (after merge) | Same, lower VRAM |
| DoRA | 0.1–5% | None (after merge) | Enhanced LoRA |
| AdaLoRA | 0.1–3% | None (after merge) | Adaptive rank |
| Prefix Tuning | <0.1% | Yes (prefix tokens) | Low data, NLU |
| Prompt Tuning | <0.01% | Yes | Minimal data |
| IA³ | <0.01% | None (multiplication) | Few-shot adaptation |
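The "trainable parameters" column can be sanity-checked with simple arithmetic: a LoRA adapter for a weight matrix of shape (d_out, d_in) adds r·(d_out + d_in) parameters. A rough sketch for an 8B Llama-style model, assuming 32 layers, hidden size 4096, and square projection matrices for simplicity (real models use grouped-query attention, so actual counts are somewhat lower):

```python
def lora_param_count(d_out: int, d_in: int, r: int) -> int:
    # LoRA replaces the weight update dW with B @ A,
    # where B is (d_out, r) and A is (r, d_in)
    return r * (d_out + d_in)

n_layers, hidden = 32, 4096
target_modules = 4            # q_proj, k_proj, v_proj, o_proj
r = 16

adapter_params = n_layers * target_modules * lora_param_count(hidden, hidden, r)
fraction = adapter_params / 8e9   # relative to an 8B-parameter base model

print(f"{adapter_params:,} trainable params ({fraction:.2%} of the base model)")
```

For r=16 this gives roughly 16.8M trainable parameters, about 0.2% of the base model, which lands squarely in the table's 0.1–5% range.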

AdaLoRA: Adaptive Rank Selection

AdaLoRA automatically distributes the parameter "budget" across layers, allocating larger ranks to important layers:

from transformers import AutoModelForCausalLM
from peft import AdaLoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

config = AdaLoraConfig(
    init_r=12,          # Initial rank
    target_r=8,         # Target average rank
    beta1=0.85,         # EMA smoothing for importance scores
    beta2=0.85,
    deltaT=10,          # Interval (in steps) between rank updates
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)

AdaLoRA is useful when it's unknown in advance which layers are most important for adaptation.

Prefix Tuning: Soft Tokens for Task

Prefix Tuning adds trainable "soft tokens" (virtual tokens) at the beginning of each model layer. Base weights are completely frozen:

from peft import PrefixTuningConfig

config = PrefixTuningConfig(
    task_type="CAUSAL_LM",
    num_virtual_tokens=20,  # Number of prefix tokens
    prefix_projection=True, # MLP for projection
)

Advantage: extremely few parameters (<0.1%). Disadvantage: prefix tokens occupy part of the context window during each inference.
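Both sides of that trade-off are easy to quantify. Without the projection MLP, Prefix Tuning stores one key prefix and one value prefix per layer, so the trainable parameter count is roughly 2 · num_layers · num_virtual_tokens · hidden_size, while every request loses num_virtual_tokens positions of the context window. A back-of-the-envelope sketch for an 8B Llama-style model (32 layers and hidden size 4096 are assumptions here):

```python
n_layers, hidden = 32, 4096
num_virtual_tokens = 20
context_window = 8192

# One key prefix and one value prefix per layer (no projection MLP)
prefix_params = 2 * n_layers * num_virtual_tokens * hidden
fraction = prefix_params / 8e9

usable_context = context_window - num_virtual_tokens

print(f"{prefix_params:,} params ({fraction:.3%}), usable context: {usable_context}")
```

About 5.2M parameters (~0.07% of the base model), consistent with the <0.1% figure, at the cost of 20 context positions per request.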

IA³: Infused Adapter by Inhibiting and Amplifying Inner Activations

IA³ introduces scaling vectors in attention and FFN layers:

from peft import IA3Config

config = IA3Config(
    target_modules=["k_proj", "v_proj", "down_proj"],
    feedforward_modules=["down_proj"],
    task_type="CAUSAL_LM",
)

IA³ gives impressive results in few-shot scenarios with minimal data (50–200 examples), but underperforms LoRA with larger datasets.
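The <0.01% figure follows from the fact that IA³ learns only a single scaling vector per targeted module per layer. A rough estimate under assumed Llama-style dimensions (32 layers, hidden size 4096, FFN intermediate size 14336; real key/value dimensions are smaller under grouped-query attention, so this is an upper bound):

```python
n_layers = 32
hidden = 4096
ffn_intermediate = 14336

# One scaling vector per targeted module per layer:
# k_proj and v_proj scale attention activations (length `hidden`),
# down_proj scales the FFN intermediate activations
per_layer = hidden + hidden + ffn_intermediate
ia3_params = n_layers * per_layer
fraction = ia3_params / 8e9

print(f"{ia3_params:,} params ({fraction:.4%} of the base model)")
```

Roughly 0.7M parameters, under 0.01% of an 8B base model — which is exactly why IA³ trains so quickly but saturates on larger datasets.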

Practical Method Comparison on Single Dataset

Task: financial news sentiment classification (Positive/Negative/Neutral).

Dataset: 1200 examples, Llama 3.1 8B Instruct base model.

| Method | Parameters | VRAM (A100) | Accuracy | Training Time |
|---|---|---|---|---|
| 5-shot (no FT) | 0 | 16 GB | 0.74 | — |
| IA³ | ~0.01% | 16 GB | 0.81 | 15 min |
| Prefix Tuning (20 tokens) | ~0.05% | 16 GB | 0.83 | 25 min |
| LoRA r=8 | ~0.2% | 18 GB | 0.89 | 45 min |
| LoRA r=16 | ~0.4% | 19 GB | 0.91 | 55 min |
| QLoRA r=16 (4-bit base) | ~0.4% | 9 GB | 0.90 | 70 min |
| Full FT | 100% | 4×A100 | 0.93 | 8 h |

Conclusion: LoRA r=16 is the optimal choice for most tasks. IA³ is justified only with critical resource constraints or very small datasets.
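The QLoRA row from the table (4-bit base, LoRA r=16) maps to a configuration like the following. This is a sketch assuming the transformers, bitsandbytes, and peft packages; exact option names can shift between library versions:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,        # second-level quantization of constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

The 4-bit base weights stay frozen; only the LoRA matrices train in 16-bit precision, which is what brings the VRAM requirement down to the ~9 GB shown in the table.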

Managing Multiple Adapters through PEFT

PEFT allows loading and switching multiple adapters in a single model:

from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load base model
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

# Load multiple adapters
model = PeftModel.from_pretrained(base_model, "./adapter-legal", adapter_name="legal")
model.load_adapter("./adapter-finance", adapter_name="finance")
model.load_adapter("./adapter-medical", adapter_name="medical")

# Dynamic switching
model.set_adapter("legal")
output_legal = model.generate(...)

model.set_adapter("finance")
output_finance = model.generate(...)

This is the "one base instance — multiple specializations" architectural pattern, which reduces memory overhead when serving multiple domains.
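The memory saving of this pattern is easy to estimate: serving N domains as N separately fine-tuned 8B models means N full copies of the weights, while shared-base adapters cost one copy plus N small deltas. A sketch with assumed sizes (16 GB for 8B weights in bf16, ~0.1 GB per LoRA adapter):

```python
base_gb = 16.0      # one 8B model in bf16
adapter_gb = 0.1    # one LoRA adapter (assumed size)
n_domains = 3       # legal, finance, medical

separate_models = n_domains * base_gb
shared_base = base_gb + n_domains * adapter_gb

print(f"separate: {separate_models:.1f} GB, shared base: {shared_base:.1f} GB")
```

Three separate models would need 48 GB; the shared-base pattern fits in about 16.3 GB, and the gap widens linearly with every additional domain.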

Timeline

  • PEFT method selection and experiments: 3–7 days
  • Data preparation: 2–4 weeks
  • Training and method comparison: 1–2 weeks
  • Total: 3–6 weeks