LLM Inference Optimization with Triton Inference Server


NVIDIA Triton Inference Server is a production-grade platform for serving many models simultaneously. Its key advantages over vLLM: models of different types (CV, NLP, tabular) served behind a single endpoint, GPU sharing between models, and ensemble pipelines.

Triton Architecture

Triton supports multiple backends simultaneously:

  • tensorrtllm: TRT-LLM engine for LLM
  • python: arbitrary Python code (preprocessing, postprocessing)
  • onnxruntime: ONNX models
  • pytorch: TorchScript models
  • tensorflow: SavedModel
  • openvino: Intel OpenVINO

An ensemble model chains several backends into a single pipeline: preprocessing (python) → LLM (tensorrtllm) → postprocessing (python).

Model Repository Configuration

model_repository/
├── llama3_8b/
│   ├── config.pbtxt
│   └── 1/
│       └── model.engine          # compiled TRT-LLM engine
├── sentence_encoder/
│   ├── config.pbtxt
│   └── 1/
│       └── model.onnx
├── rag_pipeline/                 # Ensemble
│   └── config.pbtxt
└── text_classifier/
    ├── config.pbtxt
    └── 1/
        └── model.plan            # TensorRT plan
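The layout above can be scaffolded with a few lines of Python. Note that this sketch only creates the directories and empty config files; the engine/ONNX/plan artifacts come out of the TRT-LLM build and model-export pipelines and are copied in afterwards.

```python
from pathlib import Path

# Version-1 artifact expected in each model directory; None means the model
# (the ensemble) ships only a config.pbtxt with no version directory.
LAYOUT = {
    "llama3_8b": "1/model.engine",
    "sentence_encoder": "1/model.onnx",
    "rag_pipeline": None,
    "text_classifier": "1/model.plan",
}

def scaffold(root: str = "model_repository") -> None:
    for model, artifact in LAYOUT.items():
        mdir = Path(root) / model
        # Create the version directory when the model has an artifact,
        # otherwise just the model directory itself.
        target = (mdir / artifact).parent if artifact else mdir
        target.mkdir(parents=True, exist_ok=True)
        (mdir / "config.pbtxt").touch()

scaffold()
# Triton is then started against the repository root:
#   tritonserver --model-repository=./model_repository
```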

LLM configuration via tensorrtllm backend

# llama3_8b/config.pbtxt
name: "llama3_8b"
backend: "tensorrtllm"
max_batch_size: 256

input [
  { name: "input_ids"       data_type: TYPE_INT32  dims: [-1] },
  { name: "input_lengths"   data_type: TYPE_INT32  dims: [1] },
  { name: "request_output_len" data_type: TYPE_INT32 dims: [1] },
  { name: "sampling_config" data_type: TYPE_BYTES  dims: [1] optional: true },
  { name: "streaming"       data_type: TYPE_BOOL   dims: [1] optional: true }
]

output [
  { name: "output_ids"      data_type: TYPE_INT32  dims: [-1] },
  { name: "sequence_length" data_type: TYPE_INT32  dims: [1] }
]

parameters {
  key: "executor_worker_path"
  value: { string_value: "/opt/tritonserver/backends/tensorrtllm/trtllmExecutorWorker" }
}
parameters {
  key: "max_beam_width" value: { string_value: "1" }
}
parameters {
  key: "decoding_mode" value: { string_value: "top_k_top_p" }
}
parameters {
  key: "max_tokens_in_paged_kv_cache" value: { string_value: "20000" }
}
parameters {
  key: "scheduler_policy" value: { string_value: "guaranteed_no_evict" }
}
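The max_tokens_in_paged_kv_cache budget can be sanity-checked against GPU memory with simple arithmetic. The architecture figures below are for Llama-3-8B (32 layers, 8 KV heads via GQA, head dim 128) and are assumptions for illustration, not values read from the config above:

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int) -> int:
    # 2x: one K and one V tensor are cached per layer
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

per_token = kv_cache_bytes_per_token(32, 8, 128, 2)  # FP16 cache
budget_gib = 20_000 * per_token / 2**30              # the configured 20k-token budget
print(f"{per_token} B/token, {budget_gib:.2f} GiB")  # 131072 B/token, 2.44 GiB
```

So the configured budget costs roughly 2.4 GiB of GPU memory on top of the weights, which is worth checking before raising the limit.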

Ensemble Pipeline for RAG

# rag_pipeline/config.pbtxt
name: "rag_pipeline"
platform: "ensemble"
max_batch_size: 32

input [
  { name: "query" data_type: TYPE_STRING dims: [1] }
]
output [
  { name: "response" data_type: TYPE_STRING dims: [1] }
]

ensemble_scheduling {
  step [
    {
      model_name: "query_encoder"
      model_version: 1
      input_map { key: "text" value: "query" }
      output_map { key: "embeddings" value: "query_embeddings" }
    },
    {
      model_name: "retriever"
      model_version: 1
      input_map { key: "query_embeddings" value: "query_embeddings" }
      output_map { key: "context" value: "retrieved_context" }
    },
    {
      model_name: "llama3_8b"
      model_version: 1
      input_map {
        key: "input_ids" value: "augmented_input_ids"  # after preprocessing
      }
      output_map { key: "output_ids" value: "response_ids" }
    }
  ]
}
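The ensemble scheduler is essentially name-based dataflow: each step's input_map pulls tensors from a shared pool, and its output_map publishes results back under new names for later steps. A minimal pure-Python model of that resolution (the lambdas are stand-ins for real models, not Triton API):

```python
def run_ensemble(steps, inputs):
    """Resolve input_map/output_map names through a shared tensor pool,
    mimicking how Triton's ensemble scheduler wires steps together."""
    pool = dict(inputs)
    for step in steps:
        # input_map: model input name -> pool name
        kwargs = {k: pool[v] for k, v in step["input_map"].items()}
        results = step["fn"](**kwargs)
        # output_map: model output name -> pool name
        for k, v in step["output_map"].items():
            pool[v] = results[k]
    return pool

steps = [
    {"fn": lambda text: {"embeddings": f"emb({text})"},
     "input_map": {"text": "query"},
     "output_map": {"embeddings": "query_embeddings"}},
    {"fn": lambda query_embeddings: {"context": f"ctx({query_embeddings})"},
     "input_map": {"query_embeddings": "query_embeddings"},
     "output_map": {"context": "retrieved_context"}},
]
pool = run_ensemble(steps, {"query": "q"})
print(pool["retrieved_context"])  # ctx(emb(q))
```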

Dynamic Batching

# Added to the config.pbtxt of any model
dynamic_batching {
  preferred_batch_size: [8, 16, 32]
  max_queue_delay_microseconds: 5000   # wait up to 5 ms to accumulate a batch
  priority_levels: 3
  default_priority_level: 2
  default_queue_policy {
    allow_timeout_override: true
    timeout_action: REJECT
    default_timeout_microseconds: 30000000  # 30 seconds
  }
}
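The delay/throughput trade-off here is easy to estimate: with roughly uniform arrivals, the batch that accumulates within max_queue_delay is about arrival_rate × delay, capped by max_batch_size. A back-of-envelope sketch (the function and its numbers are illustrative, not a Triton API):

```python
def expected_batch(arrival_rate_per_s: float, max_queue_delay_us: int,
                   max_batch_size: int) -> int:
    # Requests arriving while the scheduler waits, capped by max_batch_size;
    # at low traffic the batch degenerates to a single request.
    accumulated = int(arrival_rate_per_s * max_queue_delay_us / 1e6)
    return min(accumulated, max_batch_size) or 1

print(expected_batch(2000, 5000, 32))  # 2000 req/s, 5 ms delay -> batches of ~10
print(expected_batch(50, 5000, 32))    # low traffic -> effectively no batching
```

This is why max_queue_delay_microseconds only helps under load: at 50 req/s the 5 ms window almost never collects a second request, so the delay is pure added latency.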

Instance Groups: multiple copies of a model

# Two copies of the model on two GPUs
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [0]
  },
  {
    count: 1
    kind: KIND_GPU
    gpus: [1]
  }
]

# For CPU models: several workers
instance_group [
  {
    count: 4
    kind: KIND_CPU
  }
]

Client for Triton

import tritonclient.http as tritonhttpclient
import numpy as np
from transformers import AutoTokenizer

client = tritonhttpclient.InferenceServerClient("localhost:8000")

# Liveness and readiness checks
assert client.is_server_live()
assert client.is_model_ready("llama3_8b")

def generate_text(prompt: str, max_tokens: int = 512) -> str:
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
    # HF tokenizers return int64; the model declares TYPE_INT32, so cast explicitly
    input_ids = tokenizer.encode(prompt, return_tensors="np").astype(np.int32)

    inputs = [
        tritonhttpclient.InferInput("input_ids", input_ids.shape, "INT32"),
        tritonhttpclient.InferInput("input_lengths", [1, 1], "INT32"),
        tritonhttpclient.InferInput("request_output_len", [1, 1], "INT32"),
    ]
    inputs[0].set_data_from_numpy(input_ids)
    inputs[1].set_data_from_numpy(np.array([[input_ids.shape[1]]], dtype=np.int32))
    inputs[2].set_data_from_numpy(np.array([[max_tokens]], dtype=np.int32))

    outputs = [tritonhttpclient.InferRequestedOutput("output_ids")]
    response = client.infer("llama3_8b", inputs, outputs=outputs)
    output_ids = response.as_numpy("output_ids")

    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

Monitoring via Prometheus

Triton exports metrics automatically on port 8002:

# prometheus.yml
scrape_configs:
  - job_name: triton
    static_configs:
      - targets: ["triton:8002"]

Metrics: nv_inference_request_success, nv_inference_queue_duration_us, nv_gpu_utilization, nv_gpu_memory_used_bytes — everything you need for SLA monitoring out of the box.
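Since queue duration is exported as a cumulative counter of microseconds, a useful derived signal is the mean queue time per request. A sketch of the query (metric names are from the list above; exact label sets vary between Triton versions):

```promql
# Mean queue time per request, in ms, over the last 5 minutes
rate(nv_inference_queue_duration_us[5m])
  / rate(nv_inference_request_success[5m]) / 1000
```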

Implementation timeframes

Week 1: Triton installation, first model configuration, smoke test

Week 2: TRT-LLM compilation, ensemble pipeline, load testing

Week 3–4: Multi-GPU setup, monitoring, integration with production traffic

Month 2: Optimization of dynamic batching, autoscaling, multi-model management