Fine-Tuning Qwen Language Models (Alibaba)
Qwen is a family of open-source language models from Alibaba Cloud. Most sizes are released under the Apache 2.0 license, while some variants (notably the 72B) ship under the Qwen (Tongyi Qianwen) license. The Qwen2.5 family spans 0.5B to 72B parameters, plus specialized versions: Qwen2.5-Coder (programming), Qwen2.5-Math (mathematics), and Qwen-VL (multimodal). On MMLU and HumanEval benchmarks, Qwen2.5-72B is competitive with Llama 3.1 70B.
Qwen2.5 Model Lineup for Fine-Tuning
| Model | Parameters | VRAM (bf16) | Feature |
|---|---|---|---|
| Qwen2.5-0.5B | 0.5B | 1 GB | Edge/IoT |
| Qwen2.5-1.5B | 1.5B | 3 GB | Mobile |
| Qwen2.5-7B | 7B | 14 GB | Main workhorse |
| Qwen2.5-14B | 14B | 28 GB | Quality/resource balance |
| Qwen2.5-32B | 32B | 64 GB | High quality |
| Qwen2.5-72B | 72B | 144 GB | State-of-the-art open |
| Qwen2.5-Coder-32B | 32B | 64 GB | Code, SQL, algorithms |
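The VRAM column follows the usual rule of thumb of ~2 bytes per parameter for bf16 weights. Note this covers the weights alone (inference or a frozen base for LoRA); full fine-tuning adds optimizer states, gradients, and activations on top. A quick sketch of the estimate:

```python
def bf16_weight_vram_gb(params_billion: float) -> float:
    """Approximate VRAM for model weights alone in bf16 (2 bytes per parameter)."""
    return params_billion * 1e9 * 2 / 1e9  # simplifies to 2 * params_billion

for name, billions in [("Qwen2.5-7B", 7), ("Qwen2.5-72B", 72)]:
    print(f"{name}: ~{bf16_weight_vram_gb(billions):.0f} GB")  # 14 GB, 144 GB
```

These figures match the table above; budget extra headroom for the KV cache when serving long contexts.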
Qwen Advantages for Specific Tasks
Multilingual support: Qwen is trained on a corpus with heavy Chinese and English coverage plus 27 other languages (29 in total). Russian is represented much better than in many Western models, which matters when working with Russian-language corpora.
Long context: Qwen2.5 supports up to 128K tokens context. For fine-tuning tasks with long documents (contracts, research papers, regulations) this is a critical advantage.
Qwen2.5-Coder: a specialized version that outperforms most open-source models of the same size on HumanEval. When fine-tuned on a corporate codebase, it provides a better starting point than fine-tuning a general-purpose model.
Fine-Tuning via LLaMA-Factory
LLaMA-Factory is the most convenient tool for Qwen fine-tuning, supporting the full spectrum of methods (Full, LoRA, QLoRA, DoRA) with a unified config format:
```yaml
# config.yaml
stage: sft
do_train: true
model_name_or_path: Qwen/Qwen2.5-7B-Instruct
dataset: my_dataset
template: qwen
finetuning_type: lora
lora_rank: 16
lora_alpha: 32
lora_target: q_proj,v_proj
output_dir: ./qwen25-7b-finetuned
num_train_epochs: 3
per_device_train_batch_size: 4
gradient_accumulation_steps: 4
learning_rate: 2.0e-4
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
```

```bash
llamafactory-cli train config.yaml
```
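The `my_dataset` name must be registered in LLaMA-Factory's `data/dataset_info.json`. A minimal sketch, assuming a ShareGPT-style file with a `conversations` column (adjust `file_name` and `columns` to your data):

```json
{
  "my_dataset": {
    "file_name": "my_dataset.json",
    "formatting": "sharegpt",
    "columns": {
      "messages": "conversations"
    }
  }
}
```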
Alternatively, use `swift` (ms-swift) from ModelScope (Alibaba):

```bash
swift sft \
    --model_type qwen2_5_7b_instruct \
    --dataset my_dataset \
    --train_type lora \
    --output_dir ./output
```
Data Format: Qwen Chat Template
Qwen2.5 uses a ChatML-style chat template with `<|im_start|>` and `<|im_end|>` tags:

```text
<|im_start|>system
You are an assistant for financial reporting analysis.<|im_end|>
<|im_start|>user
Calculate EBITDA from: revenue 850M, COGS 420M, OpEx 180M, DA 45M<|im_end|>
<|im_start|>assistant
EBITDA = Revenue - COGS - OpEx + DA = 850 - 420 - 180 + 45 = **295M**<|im_end|>
```
When using transformers directly, apply `tokenizer.apply_chat_template()` to produce this formatting correctly.
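For illustration, the template above can be reproduced with a small standalone helper (a hypothetical sketch; in practice prefer `tokenizer.apply_chat_template()`, which also handles the model's default system prompt and special-token details):

```python
def format_qwen_chat(messages, add_generation_prompt=True):
    """Render a list of {"role": ..., "content": ...} dicts in Qwen's
    ChatML-style format with <|im_start|>/<|im_end|> tags."""
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Leave the last turn open so the model generates the assistant reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = format_qwen_chat([
    {"role": "system", "content": "You are an assistant for financial reporting analysis."},
    {"role": "user", "content": "Calculate EBITDA from: revenue 850M, COGS 420M, OpEx 180M, DA 45M"},
])
print(prompt)
```

At inference time the generated text is cut at the first `<|im_end|>` (the model's EOS for chat).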
Practical Case: Financial Analysis on Qwen2.5-14B
Task: automatic analysis of quarterly company reports (IFRS): extraction of key metrics, calculation of financial ratios, and anomaly flagging.
Dataset: 1,800 examples mapping reporting-data inputs to structured analyses (JSON + text summary).
Training: Qwen2.5-14B-Instruct, QLoRA (r=32, alpha=64), 4 epochs, 2×A100 40GB, 6 hours.
Results:
- Financial ratio calculation correctness: 71% → 94%
- Anomaly flag accuracy (F1): 0.67 → 0.88
- Text summary quality (human eval, 1–5): 3.1 → 4.4
- Tokens per request (avg): unchanged (~1800)
Deploying Fine-Tuned Qwen via vLLM
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="./qwen25-14b-merged",   # merged LoRA weights
    dtype="bfloat16",
    tensor_parallel_size=2,        # 2 GPUs
    max_model_len=32768,
    gpu_memory_utilization=0.9,
)

sampling_params = SamplingParams(temperature=0.1, max_tokens=2048)
outputs = llm.generate(prompts, sampling_params)
```
vLLM provides continuous batching and PagedAttention, which at a batch size of 16 yields ~240 tok/s throughput on 2×A100.
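Combining that throughput with the case study's ~1,800 tokens per request gives a rough serving-capacity estimate (back-of-the-envelope only; real capacity depends on prompt/output split and concurrency):

```python
def requests_per_minute(throughput_tok_s: float, tokens_per_request: float) -> float:
    """Rough serving capacity from aggregate generation throughput."""
    return throughput_tok_s * 60 / tokens_per_request

print(requests_per_minute(240, 1800))  # 8.0 requests/min
```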
Timeline
- Dataset preparation: 2–5 weeks
- Training (7B, QLoRA): 3–8 hours
- Training (72B, QLoRA, 4×A100): 24–72 hours
- Iterations and evaluation: 1–2 weeks
- Total: 4–8 weeks