LLM Fine-Tuning with QLoRA
QLoRA (Quantized Low-Rank Adaptation) combines 4-bit quantization of the base model with LoRA adapter training in bf16/fp32. It was proposed by Dettmers et al. in 2023 ("QLoRA: Efficient Finetuning of Quantized LLMs"). Key result: a 65B model can be fine-tuned on a single 48 GB GPU with minimal quality loss compared to full fine-tuning in bf16.
How QLoRA Works
Step 1: the base model is loaded in 4-bit NormalFloat (NF4), a quantization format whose levels are placed at quantiles of the normal distribution, making it well suited to normally distributed neural-network weights.
Step 2: each quantization block stores a separate scaling coefficient; Double Quantization additionally quantizes these scales themselves.
Step 3: during the forward/backward pass, weights are dequantized to bf16 "on the fly" for computation.
Step 4: LoRA adapters are stored and updated in full precision (bf16/fp32).
Result: memory consumption drops roughly 4× compared to bf16 LoRA, while quality barely suffers thanks to NF4 quantization.
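The quantize/dequantize mechanics in steps 1 and 3 can be sketched in toy form. This is illustrative only: a uniform 16-level codebook stands in for the real NF4 codebook (whose values sit at quantiles of the standard normal distribution), and the block size of 64 matches the bitsandbytes default:

```python
import numpy as np

def quantize_blockwise(w, block_size=64, levels=16):
    """Toy blockwise 4-bit quantization with absmax scaling.

    Real NF4 uses a fixed 16-value codebook placed at normal-distribution
    quantiles, not the uniform grid used here.
    """
    blocks = w.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True)  # one scale per block
    codebook = np.linspace(-1.0, 1.0, levels)           # uniform stand-in for NF4
    normed = blocks / scales                            # map each block into [-1, 1]
    idx = np.abs(normed[..., None] - codebook).argmin(axis=-1)  # nearest code, 4 bits
    return idx.astype(np.uint8), scales, codebook

def dequantize_blockwise(idx, scales, codebook):
    """Reconstruct bf16-like weights 'on the fly' from the 4-bit codes."""
    return codebook[idx] * scales

np.random.seed(0)
w = np.random.randn(4096).astype(np.float32)
idx, scales, codebook = quantize_blockwise(w)
w_hat = dequantize_blockwise(idx, scales, codebook).reshape(-1)
print(np.abs(w - w_hat).max())  # per-block error bounded by half a codebook step × scale
```

Double Quantization (step 2) would apply the same idea once more to the `scales` array, trading a second, coarser codebook for the memory the fp32 scales occupy.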
QLoRA Implementation
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch

# NF4 quantization with Double Quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # NF4 for weights
    bnb_4bit_compute_dtype=torch.bfloat16,   # bf16 for computations
    bnb_4bit_use_double_quant=True,          # additional savings of ~0.4 bits/param
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-70B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# Mandatory model preparation for k-bit training
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
    bias="none",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable: 0.84% | all: 70B | memory: ~40 GB on 1×A100 80GB
```
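The trainable-parameter count can be sanity-checked by hand. The sketch below uses assumed Llama 3.1 70B layer shapes (hidden size 8192, MLP intermediate 28672, grouped-query KV dim 1024, 80 layers); verify these against the actual model config before relying on the numbers:

```python
# Back-of-envelope LoRA parameter count for the config above.
R = 64
HIDDEN, INTER, KV_DIM, LAYERS = 8192, 28672, 1024, 80

# (in_features, out_features) of each targeted projection (assumed shapes)
modules = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTER),
    "up_proj": (HIDDEN, INTER),
    "down_proj": (INTER, HIDDEN),
}

# Each LoRA pair adds A (r x in_features) plus B (out_features x r) parameters
per_layer = sum(R * (d_in + d_out) for d_in, d_out in modules.values())
total = per_layer * LAYERS
print(f"{total / 1e6:.0f}M trainable LoRA parameters")  # ~828M
```

The exact percentage reported by `print_trainable_parameters()` depends on how the packed 4-bit base weights are counted in the denominator, so it may not match this ratio exactly.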
Memory Requirements: QLoRA vs LoRA vs Full FT
| Method | 7B | 13B | 34B | 70B |
|---|---|---|---|---|
| Full FT (bf16) | 4×A100 40GB | 8×A100 40GB | N/A | 8×A100 80GB |
| LoRA (bf16) | 1×A100 40GB | 2×A100 40GB | 4×A100 40GB | 4×A100 80GB |
| QLoRA (NF4) | 1×24GB GPU | 1×A100 40GB | 2×A100 40GB | 1×A100 80GB |
QLoRA makes it possible to work with a 70B model on a single A100 80GB, a dramatic reduction in infrastructure requirements.
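The table can be sanity-checked with simple weight-memory arithmetic. This counts weight storage only (activations, LoRA adapters, optimizer state, and KV cache come on top), and the 0.5-bit scale overhead assumes NF4's default block size of 64 with fp32 scales:

```python
def weights_gb(n_params_billion, bits_per_param):
    """Memory for the model weights alone, in GB."""
    return n_params_billion * bits_per_param / 8  # 1e9 params cancels against 1e9 bytes

# Per-parameter storage cost under the assumed settings
BF16 = 16
NF4 = 4 + 32 / 64       # 4 data bits + 0.5 bits of per-block fp32 scale overhead
NF4_DQ = NF4 - 0.4      # Double Quantization saves ~0.4 bits/param

for name, bits in [("bf16", BF16), ("NF4", NF4), ("NF4 + DQ", NF4_DQ)]:
    print(f"70B weights in {name}: ~{weights_gb(70, bits):.0f} GB")
# bf16 ≈ 140 GB vs NF4 ≈ 36-39 GB, which is why a single 80 GB card suffices
```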
Gradient Checkpointing with QLoRA
With the base weights quantized to 4-bit, activations become the dominant memory consumer, so gradient checkpointing is critical:
```python
from trl import SFTTrainer, SFTConfig

training_args = SFTConfig(
    output_dir="./qlora-output",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,   # effective batch = 32
    gradient_checkpointing=True,      # mandatory
    gradient_checkpointing_kwargs={"use_reentrant": False},
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    bf16=True,
    max_seq_length=4096,
    logging_steps=25,
    save_steps=100,
    report_to="wandb",
)
```
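A quick check of the schedule this config implies, using the 1,400-example dataset from the case study below as an example:

```python
per_device_bs, grad_accum, n_gpus = 2, 16, 1
dataset_size, epochs = 1400, 3

effective_batch = per_device_bs * grad_accum * n_gpus   # 32 sequences per optimizer step
steps_per_epoch = -(-dataset_size // effective_batch)   # ceil(1400 / 32) = 44
total_steps = steps_per_epoch * epochs                  # 132 optimizer steps over 3 epochs
warmup_steps = round(total_steps * 0.05)                # warmup_ratio=0.05 -> ~7 steps
print(effective_batch, steps_per_epoch, total_steps, warmup_steps)  # 32 44 132 7
```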
Practical Case: 70B Model on Single A100 80GB
Task: specialize Llama 3.1 70B Instruct for analyzing legal contracts: risk classification, detection of non-standard clauses, comparison against a template.
Why 70B rather than 8B: Llama 3.1 8B was tested first, but its quality on complex contracts was unacceptable (too many missed nuances). 70B delivers quality comparable to GPT-4o.
Infrastructure: 1×A100 80GB. QLoRA NF4, r=64, alpha=128.
Dataset: 1,400 contracts annotated by practicing lawyers: each contract mapped to a list of risks with category and severity.
Training time: 3 epochs, 22 hours on single A100 80GB.
Results:
- Risk recall (the critical metric: don't miss risks): 0.71 (8B fine-tuned) → 0.89 (70B QLoRA)
- Risk precision: 0.79 → 0.87
- Formulation quality (LLM-as-judge, 1–5): 3.6 → 4.5
- Inference cost vs GPT-4o API: -71% (self-hosted vLLM)
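For reference, the reported precision/recall pairs translate into F1 as follows (plain arithmetic on the numbers above, no new measurements):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(f"8B fine-tuned F1: {f1(0.79, 0.71):.2f}")  # 0.75
print(f"70B QLoRA F1:     {f1(0.87, 0.89):.2f}")  # 0.88
```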
QLoRA Limitations
Training speed: on-the-fly dequantization slows training by roughly 20% compared to bf16 LoRA.
Heat dissipation: an A100 running QLoRA operates near its power limit and needs adequate cooling.
Reproducibility: results are slightly less reproducible due to quantization error.
Timeline
- Data preparation: 2–5 weeks
- Training (70B, QLoRA, 1×A100 80GB): 12–36 hours
- Iterations: 1–2 weeks
- Total: 4–8 weeks