Fine-Tuning DeepSeek Language Models
DeepSeek is a family of open-source language models from the Chinese company DeepSeek AI, released under the MIT license. DeepSeek-V3 and DeepSeek-R1 are the current flagship models, competing with GPT-4o and Claude 3.5 Sonnet on most benchmarks at significantly lower inference cost. Open weights and high quality make DeepSeek attractive for enterprise fine-tuning scenarios.
DeepSeek Family: Model Navigation
| Model | Parameters | Architecture | Application |
|---|---|---|---|
| DeepSeek-V3 | 671B (MoE, ~37B active) | MoE | Flagship, general purpose |
| DeepSeek-R1 | 671B (MoE) | MoE + Chain-of-Thought | Reasoning, mathematics |
| DeepSeek-R1-Distill-Llama-70B | 70B | Dense | Reasoning, more accessible |
| DeepSeek-R1-Distill-Llama-8B | 8B | Dense | Lightweight reasoning |
| DeepSeek-R1-Distill-Qwen-32B | 32B | Dense | Quality/resource balance |
| DeepSeek-Coder-V2 | 236B (MoE) | MoE | Code generation |
For practical fine-tuning, the distilled versions (8B, 32B, 70B) are more commonly used — they train on ordinary GPU clusters and deliver good results on specialized tasks.
Architectural Feature: Multi-head Latent Attention (MLA)
DeepSeek-V3 uses MLA, an attention mechanism with KV-cache compression. Compared to GQA (Grouped Query Attention, used in Llama), MLA reduces the KV-cache by 5–13× at comparable quality. This is critical for long-context inference: DeepSeek supports 128K tokens with reasonable memory requirements.
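The memory impact can be ballparked with simple arithmetic. A minimal sketch, where kv_cache_gib is a hypothetical helper and the per-token cache widths (8 KV-heads of dim 128 for GQA; a ~512-dim compressed latent plus a small RoPE component for MLA) are illustrative assumptions rather than exact DeepSeek-V3 config values:

```python
def kv_cache_gib(seq_len: int, n_layers: int, width: int, bytes_per_elem: int = 2) -> float:
    """KV-cache size in GiB: one `width`-float vector cached per token per layer."""
    return seq_len * n_layers * width * bytes_per_elem / 2**30

# Illustrative widths (assumptions, not exact config values):
gqa_width = 2 * 8 * 128   # GQA: K and V for 8 KV-heads of dim 128
mla_width = 512 + 64      # MLA: compressed latent (~512) + RoPE component (~64)

seq_len, n_layers = 128_000, 61
print(kv_cache_gib(seq_len, n_layers, gqa_width))  # ≈ 29.8 GiB
print(kv_cache_gib(seq_len, n_layers, mla_width))  # ≈ 8.4 GiB
```

With these assumed widths the compression is roughly 3.6×; the larger 5–13× figures cited above compare against configurations that cache more per token.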
When fine-tuning, MLA layers are handled by peft like any other, but the choice of target_modules must account for the naming: in DeepSeek-V3, the attention projections are called q_proj, kv_a_proj_with_mqa, kv_b_proj, and o_proj.
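To pick target_modules for an unfamiliar checkpoint, it helps to list the leaf names of the model's nn.Linear layers — those names are exactly what LoraConfig accepts. A minimal sketch, where linear_leaf_names is a hypothetical helper and Block is a tiny stand-in for a model loaded via AutoModelForCausalLM:

```python
import torch.nn as nn

def linear_leaf_names(model: nn.Module) -> set[str]:
    """Leaf attribute names of every nn.Linear — candidates for LoraConfig target_modules."""
    return {name.split(".")[-1]
            for name, module in model.named_modules()
            if isinstance(module, nn.Linear)}

# Stand-in for a real model loaded with AutoModelForCausalLM.from_pretrained(...)
class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.q_proj = nn.Linear(16, 16)
        self.o_proj = nn.Linear(16, 16)

print(linear_leaf_names(Block()))  # {'q_proj', 'o_proj'} (set order may vary)
```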
LoRA Fine-Tuning DeepSeek-R1-Distill-Qwen-32B
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, TaskType
import torch

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,  # required for DeepSeek models with custom code
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 160,432,128 || all params: 32,783,822,848 || trainable%: 0.49
```
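The printed figures can be reproduced by hand: print_trainable_parameters essentially counts parameters by their requires_grad flag. A sketch with a tiny stand-in model (trainable_stats is a hypothetical helper; real LoRA wrapping freezes base weights the same way):

```python
import torch.nn as nn

def trainable_stats(model: nn.Module) -> tuple[int, int, float]:
    """(trainable params, all params, trainable %), like print_trainable_parameters."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total, 100.0 * trainable / total

# Tiny stand-in: a frozen "base" layer plus a small trainable "adapter"
base = nn.Linear(100, 100)               # 10,100 params (weights + bias)
for p in base.parameters():
    p.requires_grad = False
adapter = nn.Linear(100, 4, bias=False)  # 400 trainable params
model = nn.Sequential(base, adapter)

print(trainable_stats(model))  # (400, 10500, ~3.81)
```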
Working with Chain-of-Thought When Fine-Tuning R1
DeepSeek-R1 generates reasoning inside special <think>...</think> tags before the final answer. When fine-tuning on specialized data, it's important to:
- Preserve think-blocks in training examples — this maintains the model's reasoning capability
- Not suppress the <think> token in prompts during training
- Remove reasoning at inference if needed — use <think>\n\n</think> as a prefix or adjust the stopping criteria
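Stripping reasoning at inference can also be done post-hoc on the generated text. A minimal sketch, where strip_think is a hypothetical helper rather than part of any DeepSeek library:

```python
import re

def strip_think(text: str) -> str:
    """Remove a leading <think>...</think> block (and trailing whitespace) from a response."""
    return re.sub(r"<think>.*?</think>\s*", "", text, count=1, flags=re.DOTALL)

raw = "<think>\nStep 1... Step 2...\n</think>\n\nFinal answer: no contradiction."
print(strip_think(raw))  # Final answer: no contradiction.
```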
An example training record for R1:

```json
{
  "messages": [
    {"role": "user", "content": "Is there a contradiction between section 3.2 and section 7.1 of the contract?"},
    {"role": "assistant", "content": "<think>\nSection 3.2 sets the payment deadline — 30 days from delivery. Section 7.1 grants the supplier the right to suspend deliveries if payment is overdue by more than 15 days. No contradiction — section 7.1 creates a right that arises before the general payment deadline expires.\n</think>\n\nThere is no direct contradiction between the sections..."}
  ]
}
```
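Before training it's worth validating that every assistant turn keeps a well-formed think-block. A minimal sketch shaped to the record format above (valid_r1_record is a hypothetical check, not a standard API):

```python
def valid_r1_record(record: dict) -> bool:
    """Each assistant message must contain exactly one well-formed <think> block."""
    for msg in record["messages"]:
        if msg["role"] != "assistant":
            continue
        content = msg["content"]
        if content.count("<think>") != 1 or content.count("</think>") != 1:
            return False
        if content.index("<think>") > content.index("</think>"):
            return False
    return True

record = {"messages": [
    {"role": "user", "content": "Is there a contradiction?"},
    {"role": "assistant", "content": "<think>\nReasoning...\n</think>\n\nNo contradiction."},
]}
print(valid_r1_record(record))  # True
```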
Practical Case: Reasoning Model for Financial Audit
Task: fine-tune DeepSeek-R1-Distill-Qwen-32B for detecting violations in financial documents — inconsistencies between balance sheet items, double-entry violations, anomalous transactions.
Dataset: 2,100 examples, each a financial document fragment plus a think-block with step-by-step auditor reasoning and a final conclusion. The data was prepared together with practicing auditors.
Training: QLoRA (r=32), 3 epochs, 4×A100 40GB, 18 hours.
Results:
- Violation detection precision: 0.61 → 0.89
- Recall (doesn't miss violations): 0.54 → 0.84
- F1: 0.57 → 0.87
- Reasoning quality (auditor evaluation, 1–5): 2.8 → 4.3
Inference via vLLM with MoE Support
Full-size DeepSeek-V3/R1 requires a multi-GPU vLLM configuration:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",
    tensor_parallel_size=8,  # e.g. 8×H100 for the full model
    trust_remote_code=True,
    max_model_len=65536,
    dtype="auto",  # keep the native FP8 weights; forcing bf16 would not fit on 8 GPUs
)

sampling = SamplingParams(temperature=0.6, max_tokens=4096)
outputs = llm.generate(["Summarize the key risks in this contract: ..."], sampling)
print(outputs[0].outputs[0].text)
```
For the distilled models (8B, 32B), 1–4 GPUs are sufficient.
Project Timeline
- Dataset preparation with think-blocks: 3–8 weeks (significantly more complex than standard SFT)
- Training (32B, 4×A100): 12–24 hours
- Reasoning quality evaluation: 2 weeks (requires expert evaluation)
- Deployment and monitoring: 1–2 weeks
- Total: 7–14 weeks