Optimizing LLM Inference with vLLM

vLLM is the most popular open-source engine for high-performance LLM inference. Its key innovation is PagedAttention: the key-value (KV) cache is managed the way an operating system manages virtual memory, which eliminates fragmentation and raises throughput by 15–24x compared to a naive implementation.

Why Standard HuggingFace Transformers Are Insufficient

HuggingFace Transformers is fine for experiments but falls short in production:

  • Each request is processed independently - no batching across requests
  • The KV cache is stored contiguously for each sequence - VRAM is used inefficiently
  • No prefill/decode separation
  • Throughput: ~10–50 tokens/sec per request

vLLM on the same GPUs: 500–2000 tokens/sec via continuous batching.
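The gap comes from how decoding scales with batch size. A back-of-the-envelope model (the step times below are illustrative assumptions, not measured vLLM figures): decoding is memory-bandwidth-bound, so one forward step over a batch of B sequences costs only slightly more than a step over a single sequence, while emitting B tokens instead of one.

```python
def tokens_per_second(batch_size, step_ms_single=20.0, per_seq_overhead_ms=0.1):
    """Aggregate decode throughput: one token per sequence per step,
    with step time growing only mildly as the batch grows."""
    step_ms = step_ms_single + per_seq_overhead_ms * (batch_size - 1)
    return batch_size * 1000.0 / step_ms

sequential = tokens_per_second(1)    # one request at a time, ~50 tok/s
batched = tokens_per_second(256)     # 256 concurrent sequences
```

Under these assumptions the batched server delivers over 100x the aggregate token throughput of sequential decoding, which is the effect continuous batching exploits.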

Basic vLLM deployment

# Install
pip install vllm

# Start the server (OpenAI-compatible API).
# --tensor-parallel-size 1: single GPU
# --max-num-seqs 256: maximum concurrent batch
# --gpu-memory-utilization 0.90: 90% of VRAM for weights and KV cache
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.3 \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --max-num-seqs 256 \
  --gpu-memory-utilization 0.90 \
  --host 0.0.0.0 \
  --port 8000

Querying the server with the OpenAI SDK:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[{"role": "user", "content": "Explain transformer attention"}],
    max_tokens=500,
    temperature=0.7
)

PagedAttention: How it works

Standard KV cache: a contiguous block of VRAM, sized for the maximum sequence length, is allocated for each sequence. The result is fragmentation: up to 60% of memory is wasted between sequences.

PagedAttention: The KV cache is divided into pages of a fixed size (usually 16 tokens). Pages are allocated on demand and may be non-contiguous. Prefix sharing: if two sequences share a common prefix (for example, the same system prompt), the pages holding that prefix are stored once and shared rather than duplicated, saving significant VRAM when many requests reuse one system prompt.
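The page-table mechanics can be sketched in a few lines. This is a toy model of the idea, not vLLM's actual allocator: pages of a fixed size are handed out on demand, and full pages holding an identical prefix are shared via refcounts.

```python
PAGE_SIZE = 16  # tokens per KV-cache page

class PagedKVCache:
    def __init__(self):
        self.pages = {}     # (offset, contents) -> physical page id
        self.refcount = {}  # physical page id -> number of sequences using it
        self.next_page = 0

    def allocate(self, token_ids):
        """Return the page table (list of physical page ids) for a sequence,
        reusing full pages that hold an identical prefix."""
        table = []
        for i in range(0, len(token_ids), PAGE_SIZE):
            chunk = tuple(token_ids[i:i + PAGE_SIZE])
            key = (i, chunk)  # prefix position + contents identify a shareable page
            if len(chunk) == PAGE_SIZE and key in self.pages:
                page = self.pages[key]          # share the existing prefix page
            else:
                page = self.next_page           # allocate a fresh page on demand
                self.next_page += 1
                if len(chunk) == PAGE_SIZE:
                    self.pages[key] = page      # only full pages are shareable
            self.refcount[page] = self.refcount.get(page, 0) + 1
            table.append(page)
        return table

cache = PagedKVCache()
system_prompt = list(range(32))  # a 32-token shared prefix = 2 full pages
seq_a = cache.allocate(system_prompt + [100, 101])
seq_b = cache.allocate(system_prompt + [200])
# seq_a and seq_b share their first two pages and differ only in the tail page.
```

Only four physical pages are allocated for the two sequences instead of six, which is exactly the saving prefix sharing provides.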

Tensor Parallelism for Large Models

# Llama 3 70B on 4x A100 80GB
# --tensor-parallel-size 4: shard the model across 4 GPUs
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 4 \
  --dtype bfloat16 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.95

Tensor parallelism splits attention heads and FFN matrices across GPUs. For a 70B model, 4x A100 80GB is enough to hold the BF16 weights (~140 GB) with room left for the KV cache.
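The weight-memory arithmetic behind that sizing is simple (this sketch ignores KV cache, activations, and CUDA overhead, so real per-GPU requirements are higher):

```python
def weight_gb_per_gpu(params_billions, bytes_per_param, tp_size):
    """GB of model weights each GPU holds after tensor-parallel sharding."""
    return params_billions * 1e9 * bytes_per_param / tp_size / 1e9

# 70B parameters, BF16 (2 bytes/param), sharded 4 ways
per_gpu = weight_gb_per_gpu(70, 2, 4)  # 35.0 GB of weights per 80 GB A100
```

35 GB of weights per 80 GB GPU leaves roughly half of each card's VRAM for the KV cache, which is why `--gpu-memory-utilization 0.95` is viable here.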

Quantization to save VRAM

# AWQ quantization (best quality among 4-bit methods)
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Mistral-7B-Instruct-v0.3-AWQ \
  --quantization awq \
  --dtype auto

# GPTQ
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Llama-2-13B-GPTQ \
  --quantization gptq

# FP8 (for H100)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B \
  --quantization fp8

Result: 7B model in AWQ 4-bit takes up ~4 GB of VRAM instead of ~14 GB in BF16.
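The "~4 GB instead of ~14 GB" figure follows directly from bits per parameter (the small remainder is quantization scales and zero-points, which this sketch ignores):

```python
def model_weight_gb(params_billions, bits_per_param):
    """Approximate weight footprint: parameters x bits, converted to GB."""
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

bf16_gb = model_weight_gb(7, 16)  # 14.0 GB for a 7B model in BF16
awq_gb = model_weight_gb(7, 4)    # 3.5 GB in 4-bit AWQ, before metadata overhead
```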

Speculative Decoding

Accelerating decoding with a draft model: a small model (draft) generates several tokens ahead, and the large model (target) verifies them in parallel. All draft tokens that match the target's predictions are accepted from a single forward pass; on a mismatch, generation resumes from the first diverging token.

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-70B-Instruct \
  --speculative-model meta-llama/Meta-Llama-3-8B-Instruct \
  --num-speculative-tokens 5 \
  --tensor-parallel-size 4

Gain: a 1.5–2.5x speedup on typical text with under 1% change in output quality.
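Where that range comes from can be seen with the standard speculative-sampling expectation. With k draft tokens per verification pass and a per-token acceptance probability alpha (the alpha values below are illustrative assumptions), the expected number of tokens emitted per target-model forward pass is a geometric sum:

```python
def expected_tokens_per_pass(alpha, k):
    """E[tokens per target pass] = 1 + alpha + alpha^2 + ... + alpha^k:
    each pass always yields at least one token, plus every consecutively
    accepted draft token."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

good_draft = expected_tokens_per_pass(0.7, 5)  # ~2.94 tokens per pass
weak_draft = expected_tokens_per_pass(0.5, 5)  # ~1.97 tokens per pass
```

After subtracting the draft model's own cost, roughly 2-3 tokens per target pass lands in the 1.5-2.5x speedup the section quotes.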

Performance tuning

# Parameters for maximum throughput (at the cost of latency)
VLLM_CONFIG = {
    "max_num_seqs": 512,             # more concurrent requests
    "max_num_batched_tokens": 32768, # tokens per forward pass
    "block_size": 32,                # KV-cache page size
    "swap_space": 4,                 # GB of CPU offload when VRAM runs out
}

# Parameters for minimum latency (at the cost of throughput)
VLLM_CONFIG_LATENCY = {
    "max_num_seqs": 32,
    "max_num_batched_tokens": 4096,
    "disable_async_output_proc": False,
}
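The `block_size` setting interacts with how many KV-cache pages fit in the VRAM budget. A rough sizing sketch (vLLM profiles this automatically from `--gpu-memory-utilization`; the Mistral-7B-like dimensions below are assumptions for illustration):

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """K and V tensors per layer per token, in FP16/BF16 by default."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def num_kv_pages(free_vram_gb, block_size, **model_dims):
    """How many fixed-size KV-cache pages fit in the given VRAM budget."""
    per_token = kv_bytes_per_token(**model_dims)
    return int(free_vram_gb * 1e9 // (block_size * per_token))

# Mistral-7B-like dims: 32 layers, GQA with 8 KV heads, head_dim 128
mistral_like = dict(num_layers=32, num_kv_heads=8, head_dim=128)
pages = num_kv_pages(free_vram_gb=20, block_size=32, **mistral_like)
```

With 20 GB left for the cache this yields a few thousand 32-token pages, i.e. over 150k cached tokens shared across all concurrent sequences.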

Performance benchmark

On one A100 80GB, Mistral-7B-Instruct, 500-token responses:

Implementation               Throughput (req/s)   P99 latency
HF transformers (batch=1)    1.2                  8.5 s
HF transformers (batch=16)   4.1                  22 s
vLLM (256 concurrent)        28.5                 12 s
vLLM + AWQ 4-bit             52.3                 7 s