Knowledge Distillation from Large to Small Model


Knowledge Distillation (KD) is a technique for training a small model (the student) using the outputs of a large model (the teacher) as "soft labels". Instead of training only on correct answers (hard labels), the student learns to reproduce the teacher's probability distribution across the entire vocabulary, which carries significantly more information about the structure of the task.
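A toy illustration of the difference, using a hypothetical four-token vocabulary: the hard label marks a single correct token, while the teacher's soft label also reveals which alternatives are plausible and which are not.

```python
# Hypothetical 4-token vocabulary, for illustration only.
vocab = ["contract", "agreement", "invoice", "banana"]

# Hard label: a one-hot vector, only the single "correct" token counts.
hard_label = [1.0, 0.0, 0.0, 0.0]

# Teacher's soft label: also shows that "agreement" is a near-synonym
# and "banana" is implausible -- extra signal the student can learn from.
teacher_soft = [0.62, 0.30, 0.07, 0.01]
```

The soft distribution is what makes KD more informative than plain supervised training on hard labels.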

Types of distillation for LLMs

Black-box distillation (Response Distillation): uses only the teacher model's final answers. The teacher is a black box (e.g., the GPT-4o API). The student is trained on a dataset whose labels are the teacher's outputs.

# Collect data from teacher (GPT-4o)
def collect_teacher_outputs(prompts: list[str], client) -> list[dict]:
    dataset = []
    for prompt in prompts:
        teacher_response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.3
        ).choices[0].message.content

        dataset.append({"prompt": prompt, "response": teacher_response})
    return dataset

# The student (Llama 3.1 8B) is then fine-tuned (SFT) on the GPT-4o answers

White-box distillation (Feature/Logit Distillation): requires access to the teacher's logits (the full probability distribution). This allows training the student on soft labels, which is more informative than hard labels alone.

import torch
import torch.nn.functional as F

def distillation_loss(
    student_logits,    # [batch, seq_len, vocab_size]
    teacher_logits,    # [batch, seq_len, vocab_size]
    labels,            # [batch, seq_len]
    temperature: float = 4.0,
    alpha: float = 0.5  # balance KD and SFT loss
) -> torch.Tensor:
    """
    Combined loss: alpha*KD + (1-alpha)*SFT
    temperature smooths teacher distribution
    """
    # KD loss on soft labels
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (temperature ** 2)

    # SFT loss on hard labels
    sft_loss = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100
    )

    return alpha * kd_loss + (1 - alpha) * sft_loss

Sequence-level KD (SeqKD): instead of token-level logits, the student trains on the best sequences generated by the teacher (e.g., beam-search outputs). This is simpler to implement with black-box access.
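A minimal sketch of the SeqKD dataset-building step. The function name `build_seqkd_dataset` and the `teacher_candidates` structure (prompt mapped to a list of `(sequence, log_prob)` pairs, e.g. the beams returned by the teacher's beam search) are illustrative assumptions, not a fixed API.

```python
def build_seqkd_dataset(prompts, teacher_candidates):
    """Select the highest-scoring teacher sequence per prompt for SFT.

    teacher_candidates: dict mapping each prompt to a list of
    (sequence, log_prob) pairs from the teacher's beam search.
    """
    dataset = []
    for prompt in prompts:
        # Pick the beam with the highest log-probability.
        best_seq, _ = max(teacher_candidates[prompt], key=lambda pair: pair[1])
        dataset.append({"prompt": prompt, "response": best_seq})
    return dataset
```

The resulting records feed directly into a standard SFT pipeline, the same way as the black-box dataset above.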

DeepSeek-R1 Distill: example of industrial distillation

The best-known recent example is the distillation of DeepSeek-R1 (671B parameters, MoE) into a series of dense models:

  • DeepSeek-R1-Distill-Qwen-32B: 32B parameters, retains ~85% of R1 reasoning ability
  • DeepSeek-R1-Distill-Llama-70B: 70B parameters, ~92% of R1
  • DeepSeek-R1-Distill-Llama-8B: 8B parameters, ~70% of R1

Process: the teacher (R1) generates ~800K examples with CoT reasoning; the student is trained on them via standard SFT.
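A sketch of how one such CoT example could be packaged as a chat-format SFT record. The helper name and the exact `<think>` tag layout are illustrative assumptions about the R1-style output format.

```python
def format_cot_example(question: str, reasoning: str, answer: str) -> dict:
    # R1-style completion: reasoning wrapped in <think> tags, then the final answer.
    completion = f"<think>\n{reasoning}\n</think>\n{answer}"
    return {
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": completion},
        ]
    }
```

Training on the full reasoning trace, not just the final answer, is what lets the student absorb part of the teacher's reasoning ability.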

Practical case study: corporate assistant distillation

Task: the client runs GPT-4o fine-tuned for contract analysis (inference cost: $4000/month). The goal is to cut cost 10× while keeping quality at or above 90% of the GPT-4o level.

Strategy:

  1. Collect 12,000 requests from production logs
  2. Run through GPT-4o — get teacher responses (distillation dataset)
  3. Fine-tune Llama 3.1 8B on this data (SFT distillation)
  4. Additionally apply DPO with chosen = GPT-4o answers, rejected = baseline Llama answers
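Step 4 can be sketched as building standard DPO preference pairs from the collected responses. The record field names (`teacher_response`, `baseline_response`) are hypothetical; the output keys `prompt`/`chosen`/`rejected` follow the common DPO dataset convention.

```python
def build_dpo_pairs(records):
    """records: list of dicts with hypothetical keys 'prompt',
    'teacher_response' (GPT-4o) and 'baseline_response' (Llama baseline)."""
    pairs = []
    for r in records:
        # Skip degenerate pairs where both models answered identically:
        # they carry no preference signal.
        if r["teacher_response"] == r["baseline_response"]:
            continue
        pairs.append({
            "prompt": r["prompt"],
            "chosen": r["teacher_response"],
            "rejected": r["baseline_response"],
        })
    return pairs
```

The DPO stage pushes the SFT-distilled student further toward the teacher's style on exactly the cases where the baseline diverged.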

Infrastructure: data collection via the OpenAI API (~$180 for 12K requests); training on a single A100 40GB takes about 6 hours.

Results:

  • Quality retention vs GPT-4o (LLM-judge): 91%
  • Latency p95: 4.2s (GPT-4o API) → 0.9s (self-hosted vLLM)
  • Inference cost: $4000/month → $380/month (server + electricity)

Temperature selection in distillation

Temperature T in the KD loss controls how "soft" the teacher distribution is:

  • T=1: original probabilities (sharp)
  • T=2–4: smoothed; the student sees plausible runner-up ("silver medal") answers more clearly
  • T=5–10: very soft; for a small student with limited capacity

In practice, T=3–5 works for most tasks; the value is selected empirically.
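The smoothing effect is easy to see numerically. A pure-Python sketch with toy logits (the values are arbitrary, chosen only for illustration):

```python
import math

def softmax_t(logits, T):
    """Softmax with temperature T; higher T flattens the distribution."""
    scaled = [x / T for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 0.5]        # toy next-token logits from the teacher
sharp = softmax_t(logits, 1.0)  # T=1: the top token dominates
soft = softmax_t(logits, 4.0)   # T=4: runner-up tokens become visible
```

At T=1 almost all mass sits on the top token; at T=4 the runner-ups get enough probability for the student to learn from them, which is exactly why the KD loss above divides the logits by T before the softmax.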

Distillation limitations

  • Capacity bottleneck: the student cannot surpass the teacher; at best it approaches the teacher's level
  • Teacher dependency: if the teacher makes mistakes, the student inherits them
  • Narrow domain: black-box KD works well for specialization but poorly for broad general capability
  • Size gap: distilling 405B → 8B loses more than 70B → 8B

Timeline

  • Collecting data from teacher: 1–3 days
  • Distillation dataset preparation: 1–2 weeks
  • Student training (8B, SFT): 3–10 hours
  • Evaluation vs teacher: 3–5 days
  • Total: 3–6 weeks