Optimizing LLM Inference with vLLM
vLLM is one of the most popular open-source engines for high-performance LLM inference. Its key innovation is PagedAttention: the key-value (KV) cache is managed the way an OS manages virtual memory, which eliminates fragmentation and raises throughput by up to 24x over naive HuggingFace Transformers serving.
Why Standard HuggingFace Transformers Are Insufficient
HF Transformers is fine for experiments but falls short in production:
- Each request is processed independently; there is no batching across requests
- The full KV cache is allocated per sequence, so VRAM is used inefficiently
- No prefill/decode separation
- Throughput: ~10–50 tokens/sec per request
vLLM on the same GPUs reaches 500–2000 tokens/sec by batching concurrent requests.
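To see why batching matters, here is a toy throughput model. The numbers and the efficiency factor are illustrative assumptions, not measurements of vLLM's scheduler: batched decoding amortizes one weight load per step across all sequences, so aggregate throughput grows near-linearly with batch size until compute or KV-cache memory saturates.

```python
def aggregate_throughput(per_request_tps: float, batch_size: int,
                         batching_efficiency: float = 0.8) -> float:
    """Rough aggregate tokens/sec under concurrent batching.

    `batching_efficiency` crudely models the sublinear scaling
    as the batch approaches compute/memory saturation.
    """
    return per_request_tps * batch_size * batching_efficiency

# Illustrative numbers only (not a benchmark):
print(aggregate_throughput(25, 1, 1.0))    # one request: 25.0 tok/s
print(aggregate_throughput(25, 100, 0.8))  # 100-way batch: 2000.0 tok/s
```

Even with an 80% efficiency discount, 100-way batching turns ~25 tok/s per request into ~2000 tok/s aggregate, which is the gap the comparison above describes.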
Basic vLLM deployment
# Installation
pip install vllm

# Start the server (OpenAI-compatible API)
python -m vllm.entrypoints.openai.api_server \
--model mistralai/Mistral-7B-Instruct-v0.3 \
--tensor-parallel-size 1 \
--max-model-len 8192 \
--max-num-seqs 256 \
--gpu-memory-utilization 0.90 \
--host 0.0.0.0 \
--port 8000
# --tensor-parallel-size 1: single GPU (no sharding)
# --max-num-seqs 256: maximum number of concurrently batched sequences
# --gpu-memory-utilization 0.90: use 90% of VRAM for weights and KV cache
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
model="mistralai/Mistral-7B-Instruct-v0.3",
messages=[{"role": "user", "content": "Explain transformer attention"}],
max_tokens=500,
temperature=0.7
)
PagedAttention: How it works
Standard KV cache: each sequence reserves a contiguous block of VRAM sized for the maximum length. The resulting fragmentation can waste up to ~60% of KV-cache memory.
PagedAttention: the KV cache is split into fixed-size pages (typically 16 tokens). Pages are allocated on demand and need not be contiguous. Prefix sharing: when two sequences share a common prefix (e.g. the same system prompt), the pages holding that prefix are shared rather than duplicated, saving VRAM for every request that reuses it.
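To make the page math concrete, here is a back-of-the-envelope sketch. The architecture figures (32 layers, 8 grouped-query KV heads, head dimension 128, FP16) are taken from the Mistral-7B model card; the memory model is a simplification that ignores framework overhead.

```python
import math

def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    # 2x for keys and values, summed over all layers
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def pages_needed(seq_len: int, page_size: int = 16) -> int:
    # PagedAttention allocates fixed-size pages on demand
    return math.ceil(seq_len / page_size)

per_tok = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)
print(per_tok)             # 131072 bytes = 128 KiB of KV cache per token
print(pages_needed(8192))  # 512 pages for a full 8192-token sequence
print(pages_needed(1000))  # 63 pages if generation stops at 1000 tokens
```

A request that stops at 1,000 tokens holds 63 pages with at most one partially filled page, instead of reserving a contiguous 8192-token block up front; that unused tail is exactly the fragmentation PagedAttention eliminates.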
Tensor Parallelism for Large Models
# Llama-3-70B on 4x A100 80GB
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3-70b-instruct \
--tensor-parallel-size 4 \
--dtype bfloat16 \
--max-model-len 16384 \
--gpu-memory-utilization 0.95
# --tensor-parallel-size 4: shard the model across 4 GPUs
Tensor parallelism shards attention heads and FFN weight matrices across GPUs. A 70B model needs roughly 140 GB for BF16 weights alone, so 4x A100 80 GB (320 GB total) fits the weights with room left for the KV cache.
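The per-GPU budget can be checked with simple arithmetic. This is a rough estimate (2 bytes per BF16 parameter, 1 GB = 1e9 bytes) that ignores activations and framework overhead:

```python
def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    # Weight memory in GB, using 1 GB = 1e9 bytes for round numbers
    return params_billion * bytes_per_param

per_gpu = weight_gb(70, 2) / 4       # BF16 weights sharded over 4 GPUs
print(per_gpu)                       # 35.0 GB of weights per GPU

headroom = 80 * 0.95 - per_gpu       # with --gpu-memory-utilization 0.95
print(round(headroom, 1))            # ~41.0 GB per GPU left for KV cache
```

Those ~41 GB per GPU of KV-cache headroom are what allow the long `--max-model-len 16384` context with many concurrent sequences.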
Quantization to save VRAM
# AWQ quantization (among the strongest 4-bit methods)
python -m vllm.entrypoints.openai.api_server \
--model TheBloke/Mistral-7B-Instruct-v0.3-AWQ \
--quantization awq \
--dtype auto

# GPTQ
python -m vllm.entrypoints.openai.api_server \
--model TheBloke/Llama-2-13B-GPTQ \
--quantization gptq

# FP8 (on H100 / Hopper-class GPUs)
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3-8b \
--quantization fp8
Result: 7B model in AWQ 4-bit takes up ~4 GB of VRAM instead of ~14 GB in BF16.
Speculative Decoding
Decoding is accelerated with a draft model: a small model proposes several tokens ahead, and the large target model verifies all of them in a single forward pass. The longest prefix of draft tokens consistent with the target model's predictions is accepted; the first mismatch is replaced by the target model's own token.
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3-70b-instruct \
--speculative-model meta-llama/Llama-3-8b-instruct \
--num-speculative-tokens 5 \
--tensor-parallel-size 4
Gain: typically a 1.5–2.5x speedup on ordinary text; the accept/reject scheme preserves the target model's output distribution, so quality is unchanged.
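The expected gain can be estimated with the standard speculative-sampling formula: with per-token acceptance rate `alpha` and `k` speculative tokens, each target-model pass emits `(1 - alpha^(k+1)) / (1 - alpha)` tokens on average. The 80% acceptance rate below is an assumed illustrative value, and the model ignores the draft model's own cost:

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model forward pass,
    given per-token draft acceptance rate `alpha` and `k`
    speculative (draft) tokens per step."""
    if alpha == 1.0:
        return k + 1
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# With --num-speculative-tokens 5 and an assumed 80% acceptance rate:
print(round(expected_tokens_per_pass(0.8, 5), 2))  # ~3.69 tokens per pass
```

Roughly 3.7 tokens per expensive 70B forward pass instead of 1 is where the 1.5–2.5x end-to-end speedup comes from once the draft model's overhead is subtracted.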
Performance tuning
# Settings for maximum throughput (at the cost of latency)
VLLM_CONFIG = {
    "max_num_seqs": 512,              # more concurrent requests
    "max_num_batched_tokens": 32768,  # tokens per forward pass
    "block_size": 32,                 # KV-cache page size in tokens
    "swap_space": 4,                  # GB of CPU swap space under VRAM pressure
}

# Settings for minimum latency (at the cost of throughput)
VLLM_CONFIG_LATENCY = {
    "max_num_seqs": 32,
    "max_num_batched_tokens": 4096,
    "disable_async_output_proc": False,
}
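Config dicts like these can be turned into server flags mechanically. The helper below is a hypothetical sketch (not part of vLLM) assuming the usual convention that underscores map to hyphens and booleans become bare switches:

```python
def to_cli_flags(config: dict) -> str:
    """Render a config dict as CLI flags: underscores become
    hyphens, True booleans become bare switches, False is dropped."""
    parts = []
    for key, value in config.items():
        flag = "--" + key.replace("_", "-")
        if isinstance(value, bool):
            if value:
                parts.append(flag)
        else:
            parts.append(f"{flag} {value}")
    return " ".join(parts)

VLLM_CONFIG = {
    "max_num_seqs": 512,
    "max_num_batched_tokens": 32768,
    "block_size": 32,
    "swap_space": 4,
}
print(to_cli_flags(VLLM_CONFIG))
# --max-num-seqs 512 --max-num-batched-tokens 32768 --block-size 32 --swap-space 4
```

This keeps the throughput and latency profiles in version control as plain dicts while the server is still launched via `python -m vllm.entrypoints.openai.api_server`.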
Performance benchmark
On a single A100 80GB with Mistral-7B-Instruct, 500-token responses:
| Implementation | Throughput (req/s) | P99 Latency |
|---|---|---|
| HF transformers (batch=1) | 1.2 | 8.5s |
| HF transformers (batch=16) | 4.1 | 22s |
| vLLM (256 concurrent) | 28.5 | 12s |
| vLLM + AWQ 4-bit | 52.3 | 7s |