Optimizing LLM Inference with vLLM
vLLM is one of the most popular open-source engines for high-performance LLM inference. Its key innovation is PagedAttention: the key-value (KV) cache is managed the way an OS manages virtual memory, which eliminates fragmentation and raises throughput by up to 24x over naive HuggingFace Transformers serving.
Why Standard HuggingFace Transformers Are Insufficient
HF Transformers is fine for experiments but falls short in production:
- Each request is processed independently; there is no batching across requests
- The full KV cache is allocated per sequence, so VRAM is used inefficiently
- No prefill/decode separation
- Throughput: ~10–50 tokens/sec per request
vLLM on the same GPUs reaches 500–2000 tokens/sec by batching concurrent requests.
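To see why batching matters, here is a toy throughput model. The numbers and the efficiency factor are illustrative assumptions, not measurements of vLLM's scheduler: batched decoding amortizes one weight load per step across all sequences, so aggregate throughput grows near-linearly with batch size until compute or KV-cache memory saturates.

```python
def aggregate_throughput(per_request_tps: float, batch_size: int,
                         batching_efficiency: float = 0.8) -> float:
    """Rough aggregate tokens/sec under concurrent batching.

    `batching_efficiency` crudely models the sublinear scaling
    as the batch approaches compute/memory saturation.
    """
    return per_request_tps * batch_size * batching_efficiency

# Illustrative numbers only (not a benchmark):
print(aggregate_throughput(25, 1, 1.0))    # one request: 25.0 tok/s
print(aggregate_throughput(25, 100, 0.8))  # 100-way batch: 2000.0 tok/s
```

Even with an 80% efficiency discount, 100-way batching turns ~25 tok/s per request into ~2000 tok/s aggregate, which is the gap the comparison above describes.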
Basic vLLM deployment
# Installation
pip install vllm

# Start the server (OpenAI-compatible API)
python -m vllm.entrypoints.openai.api_server \
--model mistralai/Mistral-7B-Instruct-v0.3 \
--tensor-parallel-size 1 \
--max-model-len 8192 \
--max-num-seqs 256 \
--gpu-memory-utilization 0.90 \
--host 0.0.0.0 \
--port 8000
# --tensor-parallel-size 1: single GPU (no sharding)
# --max-num-seqs 256: maximum number of concurrently batched sequences
# --gpu-memory-utilization 0.90: use 90% of VRAM for weights and KV cache
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
model="mistralai/Mistral-7B-Instruct-v0.3",
messages=[{"role": "user", "content": "Explain transformer attention"}],
max_tokens=500,
temperature=0.7
)
PagedAttention: How it works
Standard KV cache: each sequence reserves a contiguous block of VRAM sized for the maximum length. The resulting fragmentation can waste up to ~60% of KV-cache memory.
PagedAttention: the KV cache is split into fixed-size pages (typically 16 tokens). Pages are allocated on demand and need not be contiguous. Prefix sharing: when two sequences share a common prefix (e.g. the same system prompt), the pages holding that prefix are shared rather than duplicated, saving VRAM for every request that reuses it.
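To make the page math concrete, here is a back-of-the-envelope sketch. The architecture figures (32 layers, 8 grouped-query KV heads, head dimension 128, FP16) are taken from the Mistral-7B model card; the memory model is a simplification that ignores framework overhead.

```python
import math

def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    # 2x for keys and values, summed over all layers
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def pages_needed(seq_len: int, page_size: int = 16) -> int:
    # PagedAttention allocates fixed-size pages on demand
    return math.ceil(seq_len / page_size)

per_tok = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)
print(per_tok)             # 131072 bytes = 128 KiB of KV cache per token
print(pages_needed(8192))  # 512 pages for a full 8192-token sequence
print(pages_needed(1000))  # 63 pages if generation stops at 1000 tokens
```

A request that stops at 1,000 tokens holds 63 pages with at most one partially filled page, instead of reserving a contiguous 8192-token block up front; that unused tail is exactly the fragmentation PagedAttention eliminates.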
Tensor Parallelism for Large Models
# Llama-3-70B on 4x A100 80GB
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3-70b-instruct \
--tensor-parallel-size 4 \
--dtype bfloat16 \
--max-model-len 16384 \
--gpu-memory-utilization 0.95
# --tensor-parallel-size 4: shard the model across 4 GPUs
Tensor parallelism shards attention heads and FFN weight matrices across GPUs. A 70B model needs roughly 140 GB for BF16 weights alone, so 4x A100 80 GB (320 GB total) fits the weights with room left for the KV cache.
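The per-GPU budget can be checked with simple arithmetic. This is a rough estimate (2 bytes per BF16 parameter, 1 GB = 1e9 bytes) that ignores activations and framework overhead:

```python
def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    # Weight memory in GB, using 1 GB = 1e9 bytes for round numbers
    return params_billion * bytes_per_param

per_gpu = weight_gb(70, 2) / 4       # BF16 weights sharded over 4 GPUs
print(per_gpu)                       # 35.0 GB of weights per GPU

headroom = 80 * 0.95 - per_gpu       # with --gpu-memory-utilization 0.95
print(round(headroom, 1))            # ~41.0 GB per GPU left for KV cache
```

Those ~41 GB per GPU of KV-cache headroom are what allow the long `--max-model-len 16384` context with many concurrent sequences.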
Quantization to save VRAM
# AWQ quantization (among the strongest 4-bit methods)
python -m vllm.entrypoints.openai.api_server \
--model TheBloke/Mistral-7B-Instruct-v0.3-AWQ \
--quantization awq \
--dtype auto

# GPTQ
python -m vllm.entrypoints.openai.api_server \
--model TheBloke/Llama-2-13B-GPTQ \
--quantization gptq

# FP8 (on H100 / Hopper-class GPUs)
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3-8b \
--quantization fp8
Result: 7B model in AWQ 4-bit takes up ~4 GB of VRAM instead of ~14 GB in BF16.
Speculative Decoding
Decoding is accelerated with a draft model: a small model proposes several tokens ahead, and the large target model verifies all of them in a single forward pass. The longest prefix of draft tokens consistent with the target model's predictions is accepted; the first mismatch is replaced by the target model's own token.
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3-70b-instruct \
--speculative-model meta-llama/Llama-3-8b-instruct \
--num-speculative-tokens 5 \
--tensor-parallel-size 4
Gain: typically a 1.5–2.5x speedup on ordinary text; the accept/reject scheme preserves the target model's output distribution, so quality is unchanged.
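The expected gain can be estimated with the standard speculative-sampling formula: with per-token acceptance rate `alpha` and `k` speculative tokens, each target-model pass emits `(1 - alpha^(k+1)) / (1 - alpha)` tokens on average. The 80% acceptance rate below is an assumed illustrative value, and the model ignores the draft model's own cost:

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model forward pass,
    given per-token draft acceptance rate `alpha` and `k`
    speculative (draft) tokens per step."""
    if alpha == 1.0:
        return k + 1
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# With --num-speculative-tokens 5 and an assumed 80% acceptance rate:
print(round(expected_tokens_per_pass(0.8, 5), 2))  # ~3.69 tokens per pass
```

Roughly 3.7 tokens per expensive 70B forward pass instead of 1 is where the 1.5–2.5x end-to-end speedup comes from once the draft model's overhead is subtracted.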
Performance tuning
# Settings for maximum throughput (at the cost of latency)
VLLM_CONFIG = {
    "max_num_seqs": 512,              # more concurrent requests
    "max_num_batched_tokens": 32768,  # tokens per forward pass
    "block_size": 32,                 # KV-cache page size in tokens
    "swap_space": 4,                  # GB of CPU swap space under VRAM pressure
}

# Settings for minimum latency (at the cost of throughput)
VLLM_CONFIG_LATENCY = {
    "max_num_seqs": 32,
    "max_num_batched_tokens": 4096,
    "disable_async_output_proc": False,
}
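Config dicts like these can be turned into server flags mechanically. The helper below is a hypothetical sketch (not part of vLLM) assuming the usual convention that underscores map to hyphens and booleans become bare switches:

```python
def to_cli_flags(config: dict) -> str:
    """Render a config dict as CLI flags: underscores become
    hyphens, True booleans become bare switches, False is dropped."""
    parts = []
    for key, value in config.items():
        flag = "--" + key.replace("_", "-")
        if isinstance(value, bool):
            if value:
                parts.append(flag)
        else:
            parts.append(f"{flag} {value}")
    return " ".join(parts)

VLLM_CONFIG = {
    "max_num_seqs": 512,
    "max_num_batched_tokens": 32768,
    "block_size": 32,
    "swap_space": 4,
}
print(to_cli_flags(VLLM_CONFIG))
# --max-num-seqs 512 --max-num-batched-tokens 32768 --block-size 32 --swap-space 4
```

This keeps the throughput and latency profiles in version control as plain dicts while the server is still launched via `python -m vllm.entrypoints.openai.api_server`.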
Performance benchmark
On a single A100 80GB with Mistral-7B-Instruct, 500-token responses:
| Implementation | Throughput (req/s) | P99 Latency |
|---|---|---|
| HF transformers (batch=1) | 1.2 | 8.5s |
| HF transformers (batch=16) | 4.1 | 22s |
| vLLM (256 concurrent) | 28.5 | 12s |
| vLLM + AWQ 4-bit | 52.3 | 7s |