LLM Inference Optimization with TensorRT-LLM

TensorRT-LLM is NVIDIA's library for high-performance LLM inference on NVIDIA GPUs. Where vLLM is a convenient production server, TensorRT-LLM is a lower-level engine built for maximum performance on NVIDIA hardware. Typical speedup: 2-4x over vLLM on the same GPUs, depending on model and workload.

Architecture and operating principle

TensorRT-LLM compiles the model into an optimized TensorRT engine:

  1. Graph compilation: the model graph is compiled for the specific target GPU (architecture, VRAM size, tensor core generation)
  2. Kernel fusion: multiple operations are combined into one CUDA kernel (LayerNorm + Linear, Flash Attention)
  3. Quantization: FP8, INT8, INT4 with precise calibration methods
  4. In-flight batching: continuous batching in which requests join and leave the batch between decoding steps, rather than waiting for the whole batch to finish
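
The scheduling idea behind in-flight batching can be sketched in a few lines. This is a conceptual illustration only: the real scheduler lives inside the TRT-LLM runtime, not in user code.

```python
# Conceptual sketch of in-flight (continuous) batching. Short requests free
# their batch slot as soon as they finish, and waiting requests are admitted
# immediately instead of waiting for the whole batch to drain.
from collections import deque

def inflight_batching(requests, max_batch_size):
    """requests: list of (request_id, tokens_to_generate)."""
    waiting = deque(requests)
    active = {}            # request_id -> tokens remaining
    completed_order = []   # order in which requests finish

    while waiting or active:
        # Admit new requests the moment a slot frees up.
        while waiting and len(active) < max_batch_size:
            rid, n_tokens = waiting.popleft()
            active[rid] = n_tokens

        # One decoding step: every active request emits one token.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                completed_order.append(rid)
                del active[rid]

    return completed_order

# The short request "b" finishes first and frees its slot for "c":
print(inflight_batching([("a", 3), ("b", 1), ("c", 2), ("d", 2)], max_batch_size=2))
# -> ['b', 'a', 'c', 'd']
```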

Installing and compiling the model

# Install via Docker (recommended)
docker pull nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3

# Or via pip
pip install tensorrt-llm --extra-index-url https://pypi.nvidia.com

import tensorrt_llm
from tensorrt_llm import BuildConfig, build
from tensorrt_llm.models import LLaMAForCausalLM

# HuggingFace model to compile
hf_model_path = "meta-llama/Meta-Llama-3-8B-Instruct"

# Build (compilation) configuration
build_config = BuildConfig(
    max_batch_size=64,
    max_input_len=2048,
    max_output_len=512,
    max_beam_width=1,               # greedy decoding
    strongly_typed=True,
    plugin_config={
        "gpt_attention_plugin": "float16",
        "gemm_plugin": "float16",
        "rmsnorm_quantization_plugin": False,
        "use_paged_context_fmha": True,    # paged KV cache (PagedAttention)
        "use_fp8_context_fmha": False,
    }
)

# Compilation takes 5-30 minutes depending on the model and GPU
engine = build(
    LLaMAForCausalLM.from_hugging_face(hf_model_path),
    build_config
)
engine.save("./llama3-8b-engine/")
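
The `max_batch_size`/`max_input_len`/`max_output_len` choices above determine how much VRAM the engine must reserve for the KV cache, which is often the dominant memory cost. A back-of-envelope estimator, using the published Llama-3-8B dimensions (32 layers, 8 KV heads via GQA, head dim 128):

```python
# Rough KV-cache sizing: 2x (K and V) per layer, per KV head, per token.
# Model dimensions default to Llama-3-8B; dtype_bytes=2 means FP16/BF16.

def kv_cache_gib(batch, seq_len, n_layers=32, n_kv_heads=8,
                 head_dim=128, dtype_bytes=2):
    total = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes
    return total / 2**30

# max_batch_size=64 with max_input_len=2048 + max_output_len=512 -> 2560 tokens
print(f"{kv_cache_gib(64, 2048 + 512):.1f} GiB")  # -> 20.0 GiB
```

At 20 GiB of FP16 KV cache for this configuration, it is clear why paged KV (use_paged_context_fmha) and FP8 KV-cache quantization matter on an 80 GB H100.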

FP8 Quantization on H100

The H100 has hardware support for FP8, which delivers the biggest single performance boost:

from tensorrt_llm.quantization import QuantAlgo
from tensorrt_llm.models.modeling_utils import QuantConfig

build_config_fp8 = BuildConfig(
    max_batch_size=128,
    max_input_len=4096,
    max_output_len=1024,
    quant_config=QuantConfig(
        quant_algo=QuantAlgo.FP8,
        kv_cache_quant_algo=QuantAlgo.FP8,  # KV cache in FP8 as well
    ),
    plugin_config={
        "use_fp8_context_fmha": True,  # Flash Attention in FP8
        "gemm_plugin": "float16",
    }
)

FP8 on H100: ~2x throughput gain over BF16, <0.5% quality degradation on standard benchmarks.
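
To see why the quality loss is so small, it helps to look at what FP8 E4M3 rounding actually does. A minimal pure-Python sketch of round-to-nearest E4M3 (1 sign bit, 4 exponent bits, 3 mantissa bits, max normal value 448):

```python
# Round a float to the nearest FP8 E4M3 value. Illustrative sketch: real
# FP8 quantization in TensorRT-LLM is done in hardware with per-tensor scales.
import math

def to_fp8_e4m3(x):
    if x == 0.0:
        return 0.0
    sign = math.copysign(1.0, x)
    x = abs(x)
    x = min(x, 448.0)               # E4M3 max normal value
    e = max(math.floor(math.log2(x)), -6)  # exponent, clamped at subnormal range
    step = 2.0 ** (e - 3)           # 3 mantissa bits -> spacing of 2^(e-3)
    return sign * round(x / step) * step

print(to_fp8_e4m3(0.3))     # -> 0.3125 (about 4% rounding error)
print(to_fp8_e4m3(1000.0))  # -> 448.0 (clamped to max)
```

With 3 mantissa bits the worst-case relative rounding error is about 2^-4 (~6%) per element; combined with per-tensor calibration scales, this is small enough that end-to-end benchmark quality drops by well under a percent.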

Integration with Triton Inference Server

TensorRT-LLM integrates natively with NVIDIA Triton:

# Model repository layout for Triton
model_repository/
├── ensemble/
│   └── config.pbtxt
├── preprocessing/      # tokenization
│   ├── config.pbtxt
│   └── 1/model.py
├── tensorrt_llm/       # TRT-LLM engine
│   ├── config.pbtxt
│   └── 1/
│       ├── model.engine
│       └── config.json
└── postprocessing/     # detokenization
    ├── config.pbtxt
    └── 1/model.py
# tensorrt_llm/config.pbtxt
name: "tensorrt_llm"
backend: "tensorrtllm"
max_batch_size: 128

parameters {
  key: "max_beam_width"
  value: { string_value: "1" }
}
parameters {
  key: "executor_worker_path"
  value: { string_value: "/opt/tritonserver/backends/tensorrtllm/trtllmExecutorWorker" }
}
parameters {
  key: "decoding_mode"
  value: { string_value: "top_p_top_k" }
}
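
Once the ensemble is up, clients can hit Triton's HTTP generate endpoint. A minimal client-side sketch; the field names (`text_input`, `max_tokens`, `temperature`) follow the tensorrtllm_backend ensemble examples and should be checked against your own preprocessing config.pbtxt, since input names are deployment-specific:

```python
# Build a request for Triton's generate extension (POST /v2/models/{name}/generate).
import json
import urllib.request

def build_generate_request(server, model, prompt, max_tokens=128, temperature=0.7):
    url = f"http://{server}/v2/models/{model}/generate"
    payload = {
        "text_input": prompt,        # input names depend on your ensemble config
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_generate_request("localhost:8000", "ensemble", "What is TensorRT-LLM?")
# urllib.request.urlopen(req)  # requires a running Triton server
print(req.full_url)  # -> http://localhost:8000/v2/models/ensemble/generate
```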

Multi-GPU with Tensor Parallelism

# LLaMA-70B on 4x H100
from tensorrt_llm import Mapping

build_config_tp4 = BuildConfig(
    max_batch_size=64,
    max_input_len=8192,
    max_output_len=2048,
)

# Tensor parallelism: weights are sharded across 4 GPUs
mapping = Mapping(world_size=4, tp_size=4)
model = LLaMAForCausalLM.from_hugging_face(
    "meta-llama/Meta-Llama-3-70B-Instruct",
    mapping=mapping,
)

# Launch with mpirun for multi-GPU
# mpirun -n 4 python run_inference.py
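
Tensor parallelism splits attention heads evenly across GPUs, which is why both the query and KV head counts must be divisible by `tp_size`. The arithmetic, using Llama-70B's published dimensions (64 heads, 8 KV heads, hidden size 8192) as an example:

```python
# How a tensor-parallel rank's share of the attention weights is derived.
# Pure arithmetic sketch; TRT-LLM does this sharding internally.

def shard_attention(n_heads, n_kv_heads, hidden_size, tp_size):
    assert n_heads % tp_size == 0 and n_kv_heads % tp_size == 0, \
        "head counts must divide evenly across GPUs"
    head_dim = hidden_size // n_heads
    return {
        "heads_per_gpu": n_heads // tp_size,
        "kv_heads_per_gpu": n_kv_heads // tp_size,
        "q_proj_cols_per_gpu": (n_heads // tp_size) * head_dim,
    }

# Llama-70B on 4 GPUs: each rank owns 16 query heads and 2 KV heads,
# i.e. an 8192 x 2048 slice of the Q projection.
print(shard_attention(64, 8, 8192, tp_size=4))
```

This is also why tp_size choices are constrained: tp_size=3 would not divide 64 heads evenly, while 2, 4, and 8 all work for this model.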

Comparison with vLLM

Parameter               vLLM               TensorRT-LLM
Ease of deployment      High               Medium
Performance on NVIDIA   Good               Maximum
Non-NVIDIA support      Yes (ROCm, CPU)    No
Compilation time        None               5-30 min
OpenAI API              Built-in           Via Triton
Model updates           Fast               Requires recompilation

Recommendation: vLLM for most production use cases; TensorRT-LLM when you need to extract maximum performance from NVIDIA GPUs (high-load services, cost optimization on cloud GPUs).

Implementation timeframes

Day 1–3: Installing TRT-LLM, compiling the first model, measuring baseline metrics

Week 1–2: Selecting optimal compilation parameters, quantization, integration with Triton

Week 3–4: Load testing, monitoring, production deployment

Month 2: Optimization for specific use cases (latency vs. throughput), multi-model deployment