Pruning neural network models for optimization
Pruning is the removal of insignificant parameters (weights, neurons, attention heads, layers) from a trained neural network. The goal is to reduce model size and speed up inference with minimal quality loss. For LLMs, pruning is often combined with quantization and distillation for maximum compression.
Types of pruning
Unstructured pruning: individual weights are zeroed anywhere in the weight matrices. Highest compression ratios, but the result requires sparse computation — standard GPUs don't accelerate unstructured sparse operations out of the box.
Structured pruning: entire structural elements are removed — neurons, attention heads, layers. The result is an actually smaller dense model that runs faster on standard hardware.
Semi-structured pruning (N:M sparsity): N weights are zeroed in every block of M consecutive weights. The 2:4 format is supported in hardware by NVIDIA Ampere and newer (up to 2× speedup on sparse Tensor Cores).
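To make the 2:4 pattern concrete, here is a small sketch (plain PyTorch, illustrative only — the helper name is ours, not a library function) that keeps the two largest-magnitude weights in every block of four and zeroes the rest:

```python
import torch

def apply_2_to_4_sparsity(weight: torch.Tensor) -> torch.Tensor:
    """Zero the 2 smallest-magnitude weights in every contiguous block of 4."""
    blocks = weight.reshape(-1, 4)
    # Indices of the 2 largest-|w| entries per block — these survive
    keep = blocks.abs().topk(2, dim=1).indices
    mask = torch.zeros_like(blocks, dtype=torch.bool)
    mask.scatter_(1, keep, True)
    return (blocks * mask).reshape(weight.shape)

w = torch.tensor([[0.9, -0.1, 0.05, -0.7],
                  [0.2, 0.3, -0.6, 0.01]])
print(apply_2_to_4_sparsity(w))
# Each block of 4 keeps exactly its 2 largest-magnitude entries:
# [[0.9, 0.0, 0.0, -0.7], [0.0, 0.3, -0.6, 0.0]]
```

Real methods choose which two weights to keep using importance scores (see SparseGPT and Wanda below) rather than raw magnitude alone, but the resulting layout is exactly this.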
LLM-Pruner: structured LLM pruning
# Illustrative example in the spirit of LLM-Pruner; the actual repository is
# script-driven, so the class and method names below are simplified assumptions
from LLMPruner.pruner import LlamaStructuredPruner
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")

pruner = LlamaStructuredPruner(
    model=model,
    tokenizer=tokenizer,
    pruning_ratio=0.25,  # remove 25% of parameters
)

# Estimate parameter importance on calibration data
calibration_data = ["Text for weight importance analysis...", ...]
pruner.get_mask(calibration_data, method="taylor")  # Taylor-expansion importance

# Apply the mask and obtain the pruned model
pruned_model = pruner.prune()
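Structured pruning physically shrinks the model, so a quick sanity check is to compare parameter counts before and after. A generic sketch (plain PyTorch, using small stand-in layers instead of a full LLM):

```python
import torch

def count_params(model: torch.nn.Module) -> int:
    """Total number of parameters in a model."""
    return sum(p.numel() for p in model.parameters())

# Toy stand-ins: structured pruning removes whole output neurons
full = torch.nn.Linear(100, 100)   # 100*100 weights + 100 biases = 10100
pruned = torch.nn.Linear(100, 75)  # 25% of output neurons removed

ratio = 1 - count_params(pruned) / count_params(full)
print(f"removed {ratio:.1%} of parameters")  # removed 25.0% of parameters
```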
SparseGPT: efficient unstructured pruning without retraining
SparseGPT is a one-shot method that can prune 50–60% of an LLM's weights in a few hours without any retraining:
# sparsegpt — reference implementation from the method's authors
# (conceptual code; the actual repo applies pruning layer by layer)
from sparsegpt import SparseGPT

sparsegpt = SparseGPT(model)
sparsegpt.fasterprune(
    sparsity=0.5,   # 50% sparsity
    prunen=2,       # N in N:M
    prunem=4,       # M in N:M (2:4 is hardware-supported)
    percdamp=0.01,  # Hessian dampening
    blocksize=128,  # columns processed per block
)
With 2:4 sparsity (50%) on NVIDIA A100/H100, sparse Tensor Cores give roughly 1.7–2× inference speedup on the affected matrix multiplications.
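Since the hardware speedup only applies when the 2:4 constraint actually holds, it's worth verifying the pruned weights. A small check (hypothetical helper, plain PyTorch):

```python
import torch

def satisfies_2_to_4(weight: torch.Tensor) -> bool:
    """True if every contiguous block of 4 weights has at most 2 nonzeros."""
    blocks = weight.reshape(-1, 4)
    return bool(((blocks != 0).sum(dim=1) <= 2).all())

valid = torch.tensor([[0.9, 0.0, 0.0, -0.7],
                      [0.0, 0.3, -0.6, 0.0]])
print(satisfies_2_to_4(valid))             # True — valid 2:4 pattern
print(satisfies_2_to_4(torch.ones(4, 4)))  # False — fully dense blocks
```

Recent PyTorch versions also expose a prototype `torch.sparse.to_sparse_semi_structured` (CUDA-only) to convert such matrices into a format that actually uses the sparse Tensor Cores.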
Wanda: simple and effective pruning
Wanda (Pruning by Weights and activations) is one of the simplest effective methods: it scores each weight by the product |W| × ||X|| of its magnitude and the norm of the corresponding input activation:
# Wanda is simpler than SparseGPT with comparable quality;
# runs in minutes on a 7B model
import torch

def wanda_pruning(model, calibration_loader, sparsity=0.5):
    """Simplified Wanda implementation.

    Assumes get_activation_norms() gathers per-input-channel activation
    norms ||X|| on the calibration data (e.g. via forward hooks) — it is
    a placeholder here, not a library function.
    """
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            # Accumulate activation statistics
            activation_norms = get_activation_norms(module, calibration_loader)
            # Importance score = |W| * ||X||
            importance = module.weight.abs() * activation_norms
            # Zero weights below the sparsity quantile
            # (the original method applies the threshold per output row)
            threshold = torch.quantile(importance.flatten(), sparsity)
            module.weight.data *= (importance > threshold)
    return model
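After any unstructured pruning it's worth measuring the sparsity actually achieved. A small helper (our sketch, not part of any library), demonstrated on a toy model:

```python
import torch

def model_sparsity(model: torch.nn.Module) -> float:
    """Fraction of exactly-zero weights across all Linear layers."""
    zeros, total = 0, 0
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            zeros += (module.weight == 0).sum().item()
            total += module.weight.numel()
    return zeros / total

# Toy check: zero half of the first layer's weight rows
m = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.ReLU(), torch.nn.Linear(8, 4))
with torch.no_grad():
    m[0].weight[:4].zero_()
print(model_sparsity(m))  # one third of all Linear weights (32 of 96) are zero
```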
Depth pruning: layer removal
For LLMs, the middle layers are often less critical than the first and last ones:
import torch

def depth_prune_llm(model, layers_to_remove: list[int]):
    """Remove the specified decoder layers (Llama-style architecture)."""
    remaining_layers = [
        layer for i, layer in enumerate(model.model.layers)
        if i not in layers_to_remove
    ]
    model.model.layers = torch.nn.ModuleList(remaining_layers)
    model.config.num_hidden_layers = len(remaining_layers)  # keep config consistent
    return model

# Example: remove 8 middle layers out of 32 (25% depth reduction)
pruned_model = depth_prune_llm(model, layers_to_remove=list(range(12, 20)))
# Result: a 24-layer model from a 32-layer one
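Which layers to drop can be decided empirically: a layer whose output is nearly identical to its input contributes little. A sketch of this similarity-based scoring (our helper; assumes you collected hidden states, e.g. via `output_hidden_states=True`):

```python
import torch

def layer_redundancy_scores(hidden_states: list[torch.Tensor]) -> list[float]:
    """Cosine similarity between each layer's input and output hidden states.

    hidden_states[i] is the activation after layer i (index 0 = embeddings).
    Higher similarity = the layer changes its input less = better removal
    candidate.
    """
    scores = []
    for h_in, h_out in zip(hidden_states[:-1], hidden_states[1:]):
        sim = torch.nn.functional.cosine_similarity(
            h_in.flatten(1), h_out.flatten(1), dim=1
        ).mean().item()
        scores.append(sim)
    return scores

# Toy check with synthetic (batch, seq, hidden) activations
h0 = torch.randn(2, 16, 64)
h1 = h0 + 0.01 * torch.randn_like(h0)  # near-identity layer: score near 1.0
h2 = torch.randn_like(h1)              # layer that rewrites everything: low score
print(layer_redundancy_scores([h0, h1, h2]))
```

This is the idea behind similarity-based depth pruning: rank layers by redundancy score and remove the top-ranked block, then validate quality.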
Practical case study: edge deployment optimization
Task: a fine-tuned Llama 3.1 8B for an industrial controller (ARM server, 16 GB RAM, no GPU). Requirement: inference under 2 s per request.
Optimization strategy:
- GGUF Q4_K_M quantization: 8B → 4.1 GB, 8 tok/s on CPU (insufficient)
- Depth pruning (removing 8 of 32 layers): −25% latency, −3% quality
- Width pruning (removing 20% of attention heads): −15% latency
- Re-quantization: GGUF Q4_K_M applied to the pruned model
Final pruned+quantized model characteristics:
- Size: 3.1 GB (vs 4.1 GB)
- Throughput: 14 tok/s on ARM (vs 8 tok/s)
- Latency for a 100-token answer: 7 s → 1.8 s (goal achieved)
- Quality loss (LLM-as-judge): 7%
Recovery fine-tuning after pruning
Pruning always causes some degradation. A brief recovery fine-tune restores part of the lost quality:
# After pruning — brief fine-tuning for quality recovery
from trl import SFTTrainer, SFTConfig

# Use the same dataset as the original fine-tuning, but a lower LR
recovery_config = SFTConfig(
    output_dir="recovery",        # any output path
    num_train_epochs=1,           # one epoch is usually enough for recovery
    learning_rate=5e-5,           # lower than in full fine-tuning
    gradient_checkpointing=True,
    bf16=True,
)

trainer = SFTTrainer(model=pruned_model, args=recovery_config, train_dataset=dataset)
trainer.train()
Recovery fine-tuning typically recovers 50–70% of the lost quality within a single training epoch.
Timeline
- Choosing pruning strategy: 3–5 days
- Calibration and pruning: 4–24 hours (depending on method and model size)
- Recovery fine-tuning: 2–8 hours
- Benchmarking and evaluation: 3–5 days
- Total: 2–4 weeks