Fine-Tuning Llama Language Models (Meta)
Llama 3.x is a family of open-weight language models from Meta, licensed for commercial use with full control over the weights. Unlike GPT-4o or Claude, you receive the weight files, can deploy the model on your own infrastructure, and fine-tune it without API restrictions. This makes Llama the default choice for tasks requiring data privacy, on-premise deployment, or high inference volume.
Llama 3.x Model Lineup
| Model | Parameters | VRAM (fp16) | Use Case |
|---|---|---|---|
| Llama 3.2 1B | 1B | 2 GB | Edge, embedded systems |
| Llama 3.2 3B | 3B | 6 GB | Mobile, lightweight agents |
| Llama 3.1 8B | 8B | 16 GB | General tasks, fine-tuning |
| Llama 3.1 70B | 70B | 140 GB | Complex tasks, competitive with GPT-4 |
| Llama 3.1 405B | 405B | 800+ GB | State-of-the-art, multi-GPU |
For most fine-tuning tasks, Llama 3.1 8B or 70B is optimal — the first trains on a single A100 80GB, the second requires 2–4 GPUs.
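The VRAM column above is essentially parameter count times bytes per parameter. A minimal rule-of-thumb sketch (the helper name is ours, not from any library):

```python
def weight_vram_gb(params_b: float, bytes_per_param: float) -> float:
    # Weights only. Inference adds KV cache on top of this; full fine-tuning
    # roughly quadruples it (gradients + Adam optimizer states).
    return params_b * bytes_per_param

print(weight_vram_gb(8, 2.0))   # fp16 8B: matches the 16 GB in the table
print(weight_vram_gb(70, 2.0))  # fp16 70B: 140 GB
print(weight_vram_gb(70, 0.5))  # 4-bit 70B base (QLoRA): 35 GB
```

The last line is why QLoRA fits a 70B model on two A100 40GB: the 4-bit base takes ~35 GB, leaving room for adapters and activations.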
Fine-Tuning Methods
Full Fine-Tuning: updates all weights. Maximum quality, but requires significant compute. For an 8B model — at minimum one A100 80GB; for 70B — 4×A100 or 8×A6000.
LoRA / QLoRA: updates only low-rank adapters added on top of frozen weights. QLoRA additionally quantizes the base model to 4-bit, allowing a 70B model to be trained on two A100 40GB. Quality approaches full fine-tuning on most tasks while updating well under 1% of the parameters.
Instruction Tuning: a specialized supervised fine-tuning variant that adapts the model to an instruction-following format. Important when starting from a base (non-instruct) checkpoint on domain-specific data.
Tech Stack: TRL + PEFT + Hugging Face
The main tooling is the trl library (Transformer Reinforcement Learning) paired with peft:
```python
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

# QLoRA configuration: 4-bit NF4 quantization of the frozen base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

dataset = load_dataset("json", data_files="train.jsonl")  # substitute your own dataset

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(
        output_dir="./llama3-finetuned",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        bf16=True,
        logging_steps=10,
    ),
    train_dataset=dataset["train"],
)
trainer.train()
```
Deep Dive: Choosing target_modules for LoRA
The `target_modules` parameter determines which layers receive LoRA adapters. Llama 3 is a transformer with GQA (Grouped Query Attention). Typical targets:

- `q_proj`, `k_proj`, `v_proj`, `o_proj` — attention projections (minimal set)
- `gate_proj`, `up_proj`, `down_proj` — MLP layers (adds expressiveness)
- all seven together — maximum quality, more adapter parameters
LoRA rank r determines adapter size: r=8 gives ~0.1% extra parameters, r=64 — ~0.8%. For style specialization r=8–16 is enough, for complex knowledge extraction r=32–64.
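The ~0.1% figure is easy to check: each adapted matrix of shape (d_in, d_out) gains two low-rank factors, A (d_in×r) and B (r×d_out). A sketch for the attention projections only, with dimensions taken from the Llama 3.1 8B architecture (the helper function is illustrative):

```python
def lora_param_count(r, shapes, n_layers):
    # Two low-rank factors per adapted matrix: A (d_in x r) + B (r x d_out)
    return n_layers * sum(r * (d_in + d_out) for d_in, d_out in shapes)

# Llama 3.1 8B: hidden=4096; GQA shrinks k/v projections to 4096 -> 1024 (8 KV heads x 128)
attn_shapes = [(4096, 4096), (4096, 1024), (4096, 1024), (4096, 4096)]  # q, k, v, o
for r in (8, 16, 64):
    extra = lora_param_count(r, attn_shapes, n_layers=32)
    print(f"r={r}: {extra / 1e6:.1f}M extra params (~{extra / 8.0e9:.2%} of 8B)")
```

Note the count is linear in r, so going from r=8 to r=64 multiplies adapter size by exactly 8; adding the MLP projections pushes the fraction toward the upper end of the range quoted above.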
Practical Example: Legal Assistant
Task: fine-tune Llama 3.1 8B for analyzing Russian arbitration decisions and extracting structured data (parties, dispute subject, court decision, amount).
Dataset: 3200 pairs (decision text → JSON). Data sourced from public kad.arbitr.ru database with manual annotation of 20% and synthetic labeling by GPT-4o for the rest (with sample manual verification).
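TRL's SFTTrainer accepts conversational datasets in the `messages` format, so each (decision text → JSON) pair can be serialized roughly like this. The system prompt and JSON field names here are illustrative assumptions, not the project's actual schema:

```python
import json

def to_sft_record(decision_text: str, extracted: dict) -> dict:
    # One training example in the chat format TRL's SFTTrainer understands
    return {
        "messages": [
            {"role": "system",
             "content": "Extract structured data from the arbitration decision. Answer with JSON only."},
            {"role": "user", "content": decision_text},
            {"role": "assistant", "content": json.dumps(extracted, ensure_ascii=False)},
        ]
    }

record = to_sft_record(
    "Decision of the Arbitration Court ... claim for 1,250,000 RUB ...",
    {"plaintiff": "OOO Alpha", "defendant": "OOO Beta",
     "subject": "debt recovery", "ruling": "claim granted", "amount": 1250000},
)
```

Storing the assistant turn as serialized JSON keeps the target compact and lets evaluation parse the output back for field-level F1 scoring.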
Infrastructure: one A100 80GB, training 4 hours (3 epochs).
Results:
- F1 for claim amount extraction: 0.58 → 0.91
- Accuracy in determining initiator (plaintiff/defendant): 82% → 97%
- Token generation speed: 47 tok/s (vLLM, A100)
- Inference cost vs GPT-4o API: 12× lower when self-hosted
Fine-Tuned Model Inference
After training, the LoRA adapter can be:

- used separately (PEFT inference): load the base model plus the adapter;
- merged into one model via `merge_and_unload()`: simplifies deployment, removes PEFT overhead;
- quantized after merging (GGUF via llama.cpp, AWQ via autoawq, GPTQ) to reduce VRAM requirements.
```python
# Merge the LoRA adapter into the base model weights
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./llama3-merged")
tokenizer.save_pretrained("./llama3-merged")
```
For production deployment use vLLM — it provides PagedAttention and continuous batching, increasing throughput 2–5× compared to naive transformers inference.
Timeline and Infrastructure
- Data preparation and annotation: 2–6 weeks
- Training (8B, LoRA, A100): 2–8 hours
- Training (70B, QLoRA, 2×A100): 12–48 hours
- Evaluation and iterations: 1–2 weeks
- Deployment with vLLM/TGI: 3–5 days
- Total from start to production: 4–10 weeks