Fine-Tuning DeepSeek Language Models
DeepSeek is a family of open-source language models from the Chinese company DeepSeek AI, released under the MIT license. DeepSeek-V3 and DeepSeek-R1 are the current flagship models, competing with GPT-4o and Claude 3.5 Sonnet on most benchmarks at significantly lower inference cost. Open weights and high quality make DeepSeek attractive for enterprise fine-tuning scenarios.
DeepSeek Family: Model Navigation
| Model | Parameters | Architecture | Application |
|---|---|---|---|
| DeepSeek-V3 | 671B (MoE, ~37B active) | MoE | Flagship, general purpose |
| DeepSeek-R1 | 671B (MoE) | MoE + Chain-of-Thought | Reasoning, mathematics |
| DeepSeek-R1-Distill-Llama-70B | 70B | Dense | Reasoning, more accessible |
| DeepSeek-R1-Distill-Llama-8B | 8B | Dense | Lightweight reasoning |
| DeepSeek-R1-Distill-Qwen-32B | 32B | Dense | Quality/resource balance |
| DeepSeek-Coder-V2 | 236B (MoE) | MoE | Code generation |
For practical fine-tuning, the distilled versions (8B, 32B, 70B) are more commonly used — they train on ordinary GPU clusters and deliver good results on specialized tasks.
Architectural Feature: Multi-head Latent Attention (MLA)
DeepSeek-V3 uses MLA, an attention mechanism with KV-cache compression. Compared to GQA (Grouped Query Attention, used in Llama), MLA reduces the KV-cache by 5–13× at comparable quality. This is critical for long-context inference: DeepSeek supports 128K tokens with reasonable memory requirements.
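The memory impact can be ballparked with simple arithmetic. A minimal sketch, where kv_cache_gib is a hypothetical helper and the per-token cache widths (8 KV-heads of dim 128 for GQA; a ~512-dim compressed latent plus a small RoPE component for MLA) are illustrative assumptions rather than exact DeepSeek-V3 config values:

```python
def kv_cache_gib(seq_len: int, n_layers: int, width: int, bytes_per_elem: int = 2) -> float:
    """KV-cache size in GiB: one `width`-float vector cached per token per layer."""
    return seq_len * n_layers * width * bytes_per_elem / 2**30

# Illustrative widths (assumptions, not exact config values):
gqa_width = 2 * 8 * 128   # GQA: K and V for 8 KV-heads of dim 128
mla_width = 512 + 64      # MLA: compressed latent (~512) + RoPE component (~64)

seq_len, n_layers = 128_000, 61
print(kv_cache_gib(seq_len, n_layers, gqa_width))  # ≈ 29.8 GiB
print(kv_cache_gib(seq_len, n_layers, mla_width))  # ≈ 8.4 GiB
```

With these assumed widths the compression is roughly 3.6×; the larger 5–13× figures cited above compare against configurations that cache more per token.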
When fine-tuning, MLA layers are handled by peft like any other, but the choice of target_modules must account for the naming: in DeepSeek-V3, the attention projections are called q_proj, kv_a_proj_with_mqa, kv_b_proj, and o_proj.
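To pick target_modules for an unfamiliar checkpoint, it helps to list the leaf names of the model's nn.Linear layers — those names are exactly what LoraConfig accepts. A minimal sketch, where linear_leaf_names is a hypothetical helper and Block is a tiny stand-in for a model loaded via AutoModelForCausalLM:

```python
import torch.nn as nn

def linear_leaf_names(model: nn.Module) -> set[str]:
    """Leaf attribute names of every nn.Linear — candidates for LoraConfig target_modules."""
    return {name.split(".")[-1]
            for name, module in model.named_modules()
            if isinstance(module, nn.Linear)}

# Stand-in for a real model loaded with AutoModelForCausalLM.from_pretrained(...)
class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.q_proj = nn.Linear(16, 16)
        self.o_proj = nn.Linear(16, 16)

print(linear_leaf_names(Block()))  # {'q_proj', 'o_proj'} (set order may vary)
```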
LoRA Fine-Tuning DeepSeek-R1-Distill-Qwen-32B
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, TaskType
import torch

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,  # required for DeepSeek models with custom code
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 160,432,128 || all params: 32,783,822,848 || trainable%: 0.49
```
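The printed figures can be reproduced by hand: print_trainable_parameters essentially counts parameters by their requires_grad flag. A sketch with a tiny stand-in model (trainable_stats is a hypothetical helper; real LoRA wrapping freezes base weights the same way):

```python
import torch.nn as nn

def trainable_stats(model: nn.Module) -> tuple[int, int, float]:
    """(trainable params, all params, trainable %), like print_trainable_parameters."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total, 100.0 * trainable / total

# Tiny stand-in: a frozen "base" layer plus a small trainable "adapter"
base = nn.Linear(100, 100)               # 10,100 params (weights + bias)
for p in base.parameters():
    p.requires_grad = False
adapter = nn.Linear(100, 4, bias=False)  # 400 trainable params
model = nn.Sequential(base, adapter)

print(trainable_stats(model))  # (400, 10500, ~3.81)
```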
Working with Chain-of-Thought When Fine-Tuning R1
DeepSeek-R1 generates reasoning inside special <think>...</think> tags before the final answer. When fine-tuning on specialized data, it's important to:
- Preserve think-blocks in training examples — this maintains the model's reasoning capability
- Not suppress the <think> token in prompts during training
- Remove reasoning at inference if needed — use <think>\n\n</think> as a prefix or adjust the stopping criteria
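Stripping reasoning at inference can also be done post-hoc on the generated text. A minimal sketch, where strip_think is a hypothetical helper rather than part of any DeepSeek library:

```python
import re

def strip_think(text: str) -> str:
    """Remove a leading <think>...</think> block (and trailing whitespace) from a response."""
    return re.sub(r"<think>.*?</think>\s*", "", text, count=1, flags=re.DOTALL)

raw = "<think>\nStep 1... Step 2...\n</think>\n\nFinal answer: no contradiction."
print(strip_think(raw))  # Final answer: no contradiction.
```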
An example training record for R1:

```json
{
  "messages": [
    {"role": "user", "content": "Is there a contradiction between section 3.2 and section 7.1 of the contract?"},
    {"role": "assistant", "content": "<think>\nSection 3.2 sets the payment deadline — 30 days from delivery. Section 7.1 grants the supplier the right to suspend deliveries if payment is overdue by more than 15 days. No contradiction — section 7.1 creates a right that arises before the general payment deadline expires.\n</think>\n\nThere is no direct contradiction between the sections..."}
  ]
}
```
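Before training it's worth validating that every assistant turn keeps a well-formed think-block. A minimal sketch shaped to the record format above (valid_r1_record is a hypothetical check, not a standard API):

```python
def valid_r1_record(record: dict) -> bool:
    """Each assistant message must contain exactly one well-formed <think> block."""
    for msg in record["messages"]:
        if msg["role"] != "assistant":
            continue
        content = msg["content"]
        if content.count("<think>") != 1 or content.count("</think>") != 1:
            return False
        if content.index("<think>") > content.index("</think>"):
            return False
    return True

record = {"messages": [
    {"role": "user", "content": "Is there a contradiction?"},
    {"role": "assistant", "content": "<think>\nReasoning...\n</think>\n\nNo contradiction."},
]}
print(valid_r1_record(record))  # True
```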
Practical Case: Reasoning Model for Financial Audit
Task: fine-tune DeepSeek-R1-Distill-Qwen-32B for detecting violations in financial documents — inconsistencies between balance sheet items, double-entry violations, anomalous transactions.
Dataset: 2,100 examples, each a financial document fragment plus a think-block with step-by-step auditor reasoning and a final conclusion. The data was prepared together with practicing auditors.
Training: QLoRA (r=32), 3 epochs, 4×A100 40GB, 18 hours.
Results:
- Violation detection precision: 0.61 → 0.89
- Recall (doesn't miss violations): 0.54 → 0.84
- F1: 0.57 → 0.87
- Reasoning quality (auditor evaluation, 1–5): 2.8 → 4.3
Inference via vLLM with MoE Support
Full-size DeepSeek-V3/R1 requires a multi-GPU vLLM configuration:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",
    tensor_parallel_size=8,  # e.g. 8×H100 for the full model
    trust_remote_code=True,
    max_model_len=65536,
    dtype="auto",  # keep the native FP8 weights; forcing bf16 would not fit on 8 GPUs
)

sampling = SamplingParams(temperature=0.6, max_tokens=4096)
outputs = llm.generate(["Summarize the key risks in this contract: ..."], sampling)
print(outputs[0].outputs[0].text)
```
For the distilled models (8B, 32B), 1–4 GPUs are sufficient.
Project Timeline
- Dataset preparation with think-blocks: 3–8 weeks (significantly more complex than standard SFT)
- Training (32B, 4×A100): 12–24 hours
- Reasoning quality evaluation: 2 weeks (requires expert evaluation)
- Deployment and monitoring: 1–2 weeks
- Total: 7–14 weeks