Llama (Meta) Language Model Fine-Tuning

We design and deploy artificial intelligence systems: from prototype to production-ready solutions. Our team combines expertise in machine learning, data engineering and MLOps to make AI work not in the lab, but in real business.

Fine-Tuning Llama Language Models (Meta)

Llama 3.x is a family of open-weight language models from Meta, available for commercial use with full control over the weights. Unlike GPT-4o or Claude, you get the weight files, can deploy the model on your own infrastructure, and fine-tune without API restrictions. This makes Llama a natural choice for tasks requiring data privacy, on-premise deployment, and high inference volume.

Llama 3.x Model Lineup

Model            Parameters  VRAM (fp16)  Use Case
Llama 3.2 1B     1B          2 GB         Edge, embedded systems
Llama 3.2 3B     3B          6 GB         Mobile, lightweight agents
Llama 3.1 8B     8B          16 GB        General tasks, fine-tuning
Llama 3.1 70B    70B         140 GB       Complex tasks, competitive with GPT-4
Llama 3.1 405B   405B        800+ GB      State-of-the-art, multi-GPU

For most fine-tuning tasks, Llama 3.1 8B or 70B is optimal — the first trains on a single A100 80GB, the second requires 2–4 GPUs.
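The VRAM figures in the table follow from simple arithmetic on bytes per parameter, which is worth sketching (decimal GB, weights only; activations and KV cache add overhead on top):

```python
def gb(params_b: float, bytes_per_param: float) -> float:
    """Memory in GB (decimal, as in the table) for params_b billion parameters."""
    return params_b * bytes_per_param

# Inference, fp16/bf16 weights only (2 bytes per parameter):
print(gb(8, 2))     # 16.0  -> the 16 GB row for Llama 3.1 8B
print(gb(70, 2))    # 140.0 -> the 140 GB row for Llama 3.1 70B

# 4-bit quantization (0.5 bytes per parameter) is what makes QLoRA feasible:
print(gb(70, 0.5))  # 35.0  -> a 70B base fits alongside adapters on 2xA100 40GB
```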

Fine-Tuning Methods

Full Fine-Tuning: updates all weights. Maximum quality, but requires significant compute. For 8B model — minimum one A100 80GB, for 70B — 4×A100 or 8×A6000.

LoRA / QLoRA: updates only low-rank adapters added on top of frozen weights. QLoRA additionally quantizes the base model to 4-bit, allowing a 70B model to be trained on two A100 40GB. On most tasks, quality approaches full fine-tuning at a fraction of the compute cost.

Instruction Tuning: a supervised fine-tuning variant in which training examples are formatted as instruction-response pairs. Important when adapting a base (non-instruct) model to follow instructions on domain-specific data.
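For Llama 3 instruct models, each example must be rendered in the model's chat format; in practice the tokenizer's apply_chat_template does this automatically, but the underlying layout can be sketched directly (special tokens per the published Llama 3 format; the helper name is ours):

```python
def to_llama3_chat(system: str, user: str, assistant: str) -> str:
    """Render one training example in the Llama 3 instruct chat format."""
    return (
        "<|begin_of_text|>"
        f"<|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|>"
        f"<|start_header_id|>user<|end_header_id|>\n\n{user}<|eot_id|>"
        f"<|start_header_id|>assistant<|end_header_id|>\n\n{assistant}<|eot_id|>"
    )

print(to_llama3_chat("You are a legal assistant.",
                     "Extract the parties.",
                     '{"plaintiff": "X"}'))
```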

Tech Stack: TRL + PEFT + Hugging Face

The main tooling is the trl library (Transformer Reinforcement Learning) paired with peft:

from datasets import load_dataset
from trl import SFTTrainer, SFTConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# QLoRA configuration: frozen base model quantized to 4-bit NF4
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

# Your SFT dataset; here loaded from a local JSONL file
dataset = load_dataset("json", data_files={"train": "train.jsonl"})

# Enable gradient checkpointing and related fixes for 4-bit training
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(
        output_dir="./llama3-finetuned",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        bf16=True,
        logging_steps=10,
    ),
    train_dataset=dataset["train"],
)

trainer.train()

Deep Dive: Choosing target_modules for LoRA

The target_modules parameter determines which layers receive LoRA adapters. Llama 3 is a decoder-only transformer with GQA (Grouped Query Attention). Typical targets:

  • q_proj, k_proj, v_proj, o_proj — attention layers (minimal set)
  • gate_proj, up_proj, down_proj — MLP layers (adds expressiveness)
  • All seven together — maximum quality, more adapter parameters

LoRA rank r determines adapter size: r=8 gives ~0.1% extra parameters, r=64 — ~0.8%. For style specialization r=8–16 is enough, for complex knowledge extraction r=32–64.
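The arithmetic behind those percentages can be checked directly. A sketch using the Llama 3.1 8B layer shapes (hidden size 4096, MLP size 14336, 32 layers; with GQA the k/v projections map to 8 KV heads x head_dim 128 = 1024):

```python
# Per-layer linear shapes (d_in, d_out) in Llama 3.1 8B
SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),   # GQA: 8 KV heads x head_dim 128
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}

def lora_params(r: int, targets: list[str], n_layers: int = 32) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r) per targeted layer: r * (d_in + d_out)."""
    return n_layers * sum(r * (d_in + d_out) for d_in, d_out in (SHAPES[t] for t in targets))

attention = ["q_proj", "k_proj", "v_proj", "o_proj"]
print(lora_params(8, attention))     # 6815744  (~0.08% of 8B)
print(lora_params(8, list(SHAPES)))  # 20971520 (~0.26% of 8B)
print(lora_params(64, attention))    # 54525952 (~0.7% of 8B)
```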

Practical Example: Legal Assistant

Task: fine-tune Llama 3.1 8B for analyzing Russian arbitration decisions and extracting structured data (parties, dispute subject, court decision, amount).

Dataset: 3,200 pairs (decision text → JSON). Data sourced from the public kad.arbitr.ru database, with manual annotation of 20% of the records and synthetic labeling of the rest by GPT-4o (spot-checked manually).
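A single training pair in such a setup might look like the sketch below; the prompt wording and JSON field names are illustrative, not the actual project schema:

```python
import json

# Hypothetical record layout for the decision-text -> JSON extraction task
record = {
    "prompt": "Extract the parties, dispute subject, ruling and claim amount "
              "from the arbitration decision below.\n\n<decision text>",
    "completion": json.dumps({
        "plaintiff": "OOO Example",
        "defendant": "AO Sample",
        "dispute_subject": "breach of a supply contract",
        "ruling": "claim granted in part",
        "claim_amount_rub": 1250000,
    }, ensure_ascii=False),
}
print(json.dumps(record, ensure_ascii=False))
```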

Infrastructure: one A100 80GB, training 4 hours (3 epochs).

Results:

  • F1 for claim amount extraction: 0.58 → 0.91
  • Accuracy in determining initiator (plaintiff/defendant): 82% → 97%
  • Token generation speed: 47 tok/s (vLLM, A100)
  • Inference cost vs GPT-4o API: 12× lower when self-hosted

Fine-Tuned Model Inference

After training, the LoRA adapter can be:

  1. Used separately (PEFT inference): load base model + adapter
  2. Merged into one model (merge_and_unload()): simplifies deployment, removes PEFT overhead
  3. Quantized after merge: GGUF via llama.cpp, AWQ via autoawq, GPTQ — to reduce VRAM requirements
# Merge the LoRA adapter into the base model weights
# (for a QLoRA-trained model, first reload the base in fp16/bf16 —
# merging into 4-bit quantized weights is generally not supported)
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./llama3-merged")
tokenizer.save_pretrained("./llama3-merged")
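For option 3 above, a GGUF conversion can be sketched with llama.cpp (script and binary names as shipped in the llama.cpp repo; output file names hypothetical):

```shell
# Convert the merged HF checkpoint to GGUF, then quantize to 4-bit
python convert_hf_to_gguf.py ./llama3-merged --outfile llama3-merged-f16.gguf
./llama-quantize llama3-merged-f16.gguf llama3-merged-q4_k_m.gguf Q4_K_M
```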

For production deployment use vLLM — it provides PagedAttention and continuous batching, increasing throughput 2–5× compared to naive transformers inference.
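A minimal deployment sketch with vLLM's OpenAI-compatible server, serving the merged checkpoint from the step above (flags assume a recent vLLM release):

```shell
# Launch an OpenAI-compatible server over the merged model (port 8000 by default)
vllm serve ./llama3-merged --dtype bfloat16

# Query it with the standard chat completions API:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "./llama3-merged", "messages": [{"role": "user", "content": "..."}]}'
```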

Timeline and Infrastructure

  • Data preparation and annotation: 2–6 weeks
  • Training (8B, LoRA, A100): 2–8 hours
  • Training (70B, QLoRA, 2×A100): 12–48 hours
  • Evaluation and iterations: 1–2 weeks
  • Deployment with vLLM/TGI: 3–5 days
  • Total from start to production: 4–10 weeks