Fine-Tuning Qwen Language Models (Alibaba)
Qwen is a family of open-source language models from Alibaba Cloud. Most sizes are released under the Apache 2.0 license, while some variants (notably the 72B) ship under the Qwen (Tongyi Qianwen) license. The Qwen2.5 family spans 0.5B to 72B parameters, plus specialized versions: Qwen2.5-Coder (programming), Qwen2.5-Math (mathematics), and Qwen-VL (multimodal). On MMLU and HumanEval benchmarks, Qwen2.5-72B is competitive with Llama 3.1 70B.
Qwen2.5 Model Lineup for Fine-Tuning
| Model | Parameters | VRAM (bf16) | Feature |
|---|---|---|---|
| Qwen2.5-0.5B | 0.5B | 1 GB | Edge/IoT |
| Qwen2.5-1.5B | 1.5B | 3 GB | Mobile |
| Qwen2.5-7B | 7B | 14 GB | Main workhorse |
| Qwen2.5-14B | 14B | 28 GB | Quality/resource balance |
| Qwen2.5-32B | 32B | 64 GB | High quality |
| Qwen2.5-72B | 72B | 144 GB | State-of-the-art open |
| Qwen2.5-Coder-32B | 32B | 64 GB | Code, SQL, algorithms |
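The VRAM column follows the usual rule of thumb of ~2 bytes per parameter for bf16 weights. Note this covers the weights alone (inference or a frozen base for LoRA); full fine-tuning adds optimizer states, gradients, and activations on top. A quick sketch of the estimate:

```python
def bf16_weight_vram_gb(params_billion: float) -> float:
    """Approximate VRAM for model weights alone in bf16 (2 bytes per parameter)."""
    return params_billion * 1e9 * 2 / 1e9  # simplifies to 2 * params_billion

for name, billions in [("Qwen2.5-7B", 7), ("Qwen2.5-72B", 72)]:
    print(f"{name}: ~{bf16_weight_vram_gb(billions):.0f} GB")  # 14 GB, 144 GB
```

These figures match the table above; budget extra headroom for the KV cache when serving long contexts.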
Qwen Advantages for Specific Tasks
Multilingual support: Qwen is trained on a corpus with heavy Chinese and English coverage plus 27 other languages (29 in total). Russian is represented much better than in many Western models, which matters when working with Russian-language corpora.
Long context: Qwen2.5 supports up to 128K tokens context. For fine-tuning tasks with long documents (contracts, research papers, regulations) this is a critical advantage.
Qwen2.5-Coder: a specialized version that outperforms most open-source models of the same size on HumanEval. When fine-tuned on a corporate codebase, it provides a better starting point than fine-tuning a general-purpose model.
Fine-Tuning via LLaMA-Factory
LLaMA-Factory is the most convenient tool for Qwen fine-tuning, supporting the full spectrum of methods (Full, LoRA, QLoRA, DoRA) with a unified config format:
```yaml
# config.yaml
stage: sft
do_train: true
model_name_or_path: Qwen/Qwen2.5-7B-Instruct
dataset: my_dataset
template: qwen
finetuning_type: lora
lora_rank: 16
lora_alpha: 32
lora_target: q_proj,v_proj
output_dir: ./qwen25-7b-finetuned
num_train_epochs: 3
per_device_train_batch_size: 4
gradient_accumulation_steps: 4
learning_rate: 2.0e-4
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
```

```bash
llamafactory-cli train config.yaml
```
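The `my_dataset` name must be registered in LLaMA-Factory's `data/dataset_info.json`. A minimal sketch, assuming a ShareGPT-style file with a `conversations` column (adjust `file_name` and `columns` to your data):

```json
{
  "my_dataset": {
    "file_name": "my_dataset.json",
    "formatting": "sharegpt",
    "columns": {
      "messages": "conversations"
    }
  }
}
```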
Alternatively, use `swift` (ms-swift) from ModelScope (Alibaba):

```bash
swift sft \
    --model_type qwen2_5_7b_instruct \
    --dataset my_dataset \
    --train_type lora \
    --output_dir ./output
```
Data Format: Qwen Chat Template
Qwen2.5 uses a ChatML-style chat template with `<|im_start|>` and `<|im_end|>` tags:

```text
<|im_start|>system
You are an assistant for financial reporting analysis.<|im_end|>
<|im_start|>user
Calculate EBITDA from: revenue 850M, COGS 420M, OpEx 180M, DA 45M<|im_end|>
<|im_start|>assistant
EBITDA = Revenue - COGS - OpEx + DA = 850 - 420 - 180 + 45 = **295M**<|im_end|>
```
When using transformers directly, apply `tokenizer.apply_chat_template()` to produce this formatting correctly.
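For illustration, the template above can be reproduced with a small standalone helper (a hypothetical sketch; in practice prefer `tokenizer.apply_chat_template()`, which also handles the model's default system prompt and special-token details):

```python
def format_qwen_chat(messages, add_generation_prompt=True):
    """Render a list of {"role": ..., "content": ...} dicts in Qwen's
    ChatML-style format with <|im_start|>/<|im_end|> tags."""
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Leave the last turn open so the model generates the assistant reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = format_qwen_chat([
    {"role": "system", "content": "You are an assistant for financial reporting analysis."},
    {"role": "user", "content": "Calculate EBITDA from: revenue 850M, COGS 420M, OpEx 180M, DA 45M"},
])
print(prompt)
```

At inference time the generated text is cut at the first `<|im_end|>` (the model's EOS for chat).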
Practical Case: Financial Analysis on Qwen2.5-14B
Task: automatic analysis of quarterly company reports (IFRS): extraction of key metrics, calculation of financial ratios, and anomaly flagging.
Dataset: 1,800 examples mapping reporting-data inputs to structured analyses (JSON + text summary).
Training: Qwen2.5-14B-Instruct, QLoRA (r=32, alpha=64), 4 epochs, 2×A100 40GB, 6 hours.
Results:
- Financial ratio calculation correctness: 71% → 94%
- Anomaly flag accuracy (F1): 0.67 → 0.88
- Text summary quality (human eval, 1–5): 3.1 → 4.4
- Tokens per request (avg): unchanged (~1800)
Deploying Fine-Tuned Qwen via vLLM
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="./qwen25-14b-merged",   # merged LoRA weights
    dtype="bfloat16",
    tensor_parallel_size=2,        # 2 GPUs
    max_model_len=32768,
    gpu_memory_utilization=0.9,
)

sampling_params = SamplingParams(temperature=0.1, max_tokens=2048)
outputs = llm.generate(prompts, sampling_params)
```
vLLM provides continuous batching and PagedAttention, which at a batch size of 16 yields ~240 tok/s throughput on 2×A100.
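Combining that throughput with the case study's ~1,800 tokens per request gives a rough serving-capacity estimate (back-of-the-envelope only; real capacity depends on prompt/output split and concurrency):

```python
def requests_per_minute(throughput_tok_s: float, tokens_per_request: float) -> float:
    """Rough serving capacity from aggregate generation throughput."""
    return throughput_tok_s * 60 / tokens_per_request

print(requests_per_minute(240, 1800))  # 8.0 requests/min
```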
Timeline
- Dataset preparation: 2–5 weeks
- Training (7B, QLoRA): 3–8 hours
- Training (72B, QLoRA, 4×A100): 24–72 hours
- Iterations and evaluation: 1–2 weeks
- Total: 4–8 weeks