LLM QLoRA Fine-Tuning

We design and deploy artificial intelligence systems: from prototype to production-ready solutions. Our team combines expertise in machine learning, data engineering and MLOps to make AI work not in the lab, but in real business.

LLM Fine-Tuning via QLoRA Method

QLoRA (Quantized Low-Rank Adaptation) combines 4-bit quantization of the base model with LoRA adapter training in bf16/fp32. The method was proposed by Dettmers et al. in 2023 ("QLoRA: Efficient Finetuning of Quantized LLMs"). Its key result: a 65B model can be fine-tuned on a single 48 GB GPU with minimal quality loss compared to full fine-tuning in bf16.
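
The memory saving is easy to estimate from first principles. A back-of-envelope sketch in plain Python; the ~4.5 bits/parameter figure for NF4 including quantization scales is an approximation, not an exact number:

```python
# Back-of-envelope GPU memory for model weights alone (illustrative).
def weight_gib(n_params_b: float, bits_per_param: float) -> float:
    """Approximate weight memory in GiB for n_params_b billion parameters."""
    return n_params_b * 1e9 * bits_per_param / 8 / 2**30

full_bf16 = weight_gib(70, 16)   # bf16: 16 bits per weight
nf4 = weight_gib(70, 4.5)        # NF4 + double quantization: ~4.5 bits incl. scales
print(f"bf16: {full_bf16:.0f} GiB, NF4: {nf4:.0f} GiB")
# prints: bf16: 130 GiB, NF4: 37 GiB
```

This is weights only; optimizer state for the LoRA adapters, activations, and CUDA overhead come on top, which is why a 70B model still needs an 80 GB card in practice.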

How QLoRA Works

Step 1: the base model is loaded in 4-bit NormalFloat (NF4), a quantization format designed to be optimal for the normally distributed weights typical of neural networks.

Step 2: each quantized block stores its own scaling coefficient; Double Quantization then quantizes those scales as well for additional savings.

Step 3: during the forward/backward pass, the weights are dequantized to bf16 on the fly for computation.

Step 4: the LoRA adapters are stored and updated in full precision (bf16/fp32); only they receive gradient updates.

Result: memory consumption drops roughly 4× compared to bf16 LoRA, while quality barely suffers thanks to NF4 quantization.
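
The four steps can be sketched in a few lines of plain Python. This is a simplified uniform 4-bit codebook, not the real NF4 level table, and double quantization of the scales is omitted; it only illustrates the quantize / store-scale / dequantize-on-the-fly cycle:

```python
# Simplified blockwise 4-bit quantization sketch (uniform codebook, not the
# exact NF4 table): quantize per block, store one scale per block,
# dequantize on the fly for computation.
def quantize_block(block):
    scale = max(abs(x) for x in block) or 1.0        # per-block absmax scale
    # 15 usable levels (~4 bit): nearest integer code in [-7, 7]
    codes = [round((x / scale) * 7) for x in block]
    return codes, scale

def dequantize_block(codes, scale):
    return [c / 7 * scale for c in codes]            # done "on the fly" at matmul time

weights = [0.12, -0.54, 0.33, 0.07]
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
# reconstruction error per weight is bounded by scale / 14 (half a step)
```

The real NF4 codebook spaces its 16 levels according to the quantiles of a normal distribution rather than uniformly, which is why it loses less accuracy on neural network weights.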

QLoRA Implementation

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch

# NF4 quantization with Double Quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # NF4 for weights
    bnb_4bit_compute_dtype=torch.bfloat16,  # bf16 for computations
    bnb_4bit_use_double_quant=True,     # Additional savings ~0.4 bits/param
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-70B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# Mandatory prep for k-bit training: casts norm layers to fp32,
# enables input gradients for gradient checkpointing compatibility
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
    bias="none",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable: 0.84% | all: 70B | memory: ~40 GB on 1×A100 80GB

Memory Requirements: QLoRA vs LoRA vs Full FT

Method         | 7B          | 13B         | 34B         | 70B
Full FT (bf16) | 4×A100 40GB | 8×A100 40GB | N/A         | 8×A100 80GB
LoRA (bf16)    | 1×A100 40GB | 2×A100 40GB | 4×A100 40GB | 4×A100 80GB
QLoRA (NF4)    | 1×A100 24GB | 1×A100 40GB | 2×A100 40GB | 1×A100 80GB

QLoRA makes it possible to work with a 70B model on a single A100 80GB, which dramatically reduces infrastructure requirements.

Gradient Checkpointing with QLoRA

With QLoRA, the quantized weights are small, so activations become the dominant memory consumer. Gradient checkpointing is therefore critical:

from trl import SFTTrainer, SFTConfig

training_args = SFTConfig(
    output_dir="./qlora-output",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,  # effective batch = 32
    gradient_checkpointing=True,      # mandatory
    gradient_checkpointing_kwargs={"use_reentrant": False},
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    bf16=True,
    max_seq_length=4096,
    logging_steps=25,
    save_steps=100,
    report_to="wandb",
)
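
A quick sanity check of the batch arithmetic implied by this config (values copied from the fields above):

```python
# Batch arithmetic for the SFTConfig above.
per_device = 2      # per_device_train_batch_size
accum = 16          # gradient_accumulation_steps
seq_len = 4096      # max_seq_length

effective_batch = per_device * accum         # sequences per optimizer step
tokens_per_step = effective_batch * seq_len  # upper bound: assumes full-length sequences

print(effective_batch, tokens_per_step)
# prints: 32 131072
```

Gradient accumulation lets the optimizer see an effective batch of 32 while only 2 sequences of activations live in GPU memory at any moment.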

Practical Case: 70B Model on Single A100 80GB

Task: specialize Llama 3.1 70B Instruct for analyzing legal contracts: risk classification, detection of non-standard clauses, and comparison against a template.

Why 70B and not 8B: we previously tested Llama 3.1 8B, but its quality on complex contracts was unacceptable (too many missed nuances). The 70B model delivers quality comparable to GPT-4o.

Infrastructure: 1×A100 80GB. QLoRA NF4, r=64, alpha=128.

Dataset: 1400 contracts annotated by practicing lawyers: each contract is mapped to a list of risks with category and severity.
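
One annotated contract can be turned into a chat-format SFT example along these lines. The field names, system prompt, and risk schema here are illustrative, not the actual schema used in the project:

```python
import json

# Hypothetical record schema: one annotated contract -> one chat-format
# SFT example (field names and prompt text are illustrative).
def to_sft_example(contract_text, risks):
    return {
        "messages": [
            {"role": "system",
             "content": "You are a legal assistant. List the contract's risks "
                        "with category and severity."},
            {"role": "user", "content": contract_text},
            {"role": "assistant", "content": json.dumps(risks, ensure_ascii=False)},
        ]
    }

example = to_sft_example(
    "Clause 7.2: the supplier may change prices unilaterally ...",
    [{"risk": "unilateral price change", "category": "pricing", "severity": "high"}],
)
```

Datasets in this "messages" format can be passed to SFTTrainer, which applies the model's chat template during tokenization.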

Training time: 3 epochs, 22 hours on single A100 80GB.

Results:

  • Risk recall (don't miss): 0.71 (8B fine-tuned) → 0.89 (70B QLoRA)
  • Risk precision: 0.79 → 0.87
  • Formulation quality (LLM-as-judge, 1–5): 3.6 → 4.5
  • Inference cost vs GPT-4o API: -71% (self-hosted vLLM)

QLoRA Limitations

Training speed: on-the-fly dequantization slows training by roughly 20% compared to bf16 LoRA.

Heat dissipation: during QLoRA training the A100 runs at sustained full utilization, so adequate cooling is required.

Reproducibility: results are slightly less reproducible because of quantization error.

Timeline

  • Data preparation: 2–5 weeks
  • Training (70B, QLoRA, 1×A100 80GB): 12–36 hours
  • Iterations: 1–2 weeks
  • Total: 4–8 weeks