Optimizing LLM Inference with Triton Inference Server
NVIDIA Triton Inference Server is a production-grade platform for serving many models simultaneously. Its key advantages over vLLM are serving models of different types (CV, NLP, tabular) behind a single endpoint, sharing GPUs between models, and composing ensemble pipelines.
Triton Architecture
Triton supports multiple backends simultaneously:
- tensorrtllm: compiled TensorRT-LLM engines for LLM inference
- python: arbitrary Python code (preprocessing, postprocessing)
- onnxruntime: ONNX models
- pytorch: TorchScript models
- tensorflow: SavedModel
- openvino: Intel OpenVINO
An ensemble model combines several backends into a pipeline: preprocessing (python) → LLM (tensorrtllm) → postprocessing (python).
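For the python backend, each model directory ships a model.py implementing a TritonPythonModel class; the file only runs inside Triton, which injects the triton_python_backend_utils module. A minimal preprocessing-style sketch (the tensor names raw_text and input_ids, and the length-based stand-in for tokenization, are placeholders):

```python
# model_repository/preprocessing/1/model.py
# Runs only inside Triton's python backend, which provides pb_utils.
import numpy as np
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def initialize(self, args):
        # Called once at model load; heavy state (e.g. a tokenizer)
        # belongs here, not in execute().
        pass

    def execute(self, requests):
        responses = []
        for request in requests:
            # "raw_text" / "input_ids" are placeholder tensor names
            text = pb_utils.get_input_tensor_by_name(request, "raw_text")
            batch = text.as_numpy()  # array of UTF-8 byte strings
            # Stand-in for real tokenization logic
            ids = np.array([[len(s)] for s in batch], dtype=np.int32)
            out = pb_utils.Tensor("input_ids", ids)
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```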
Model Repository Configuration
model_repository/
├── llama3_8b/
│   ├── config.pbtxt
│   └── 1/
│       └── model.engine        # compiled TRT-LLM engine
├── sentence_encoder/
│   ├── config.pbtxt
│   └── 1/
│       └── model.onnx
├── rag_pipeline/               # ensemble
│   └── config.pbtxt
└── text_classifier/
    ├── config.pbtxt
    └── 1/
        └── model.plan          # TensorRT plan
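A quick way to catch layout mistakes before starting the server is to walk the repository and check the invariants above: every model has a config.pbtxt, and every non-ensemble model has at least one numeric version directory. A hypothetical helper sketch (validate_model_repository is not part of Triton):

```python
from pathlib import Path

def validate_model_repository(root: str) -> list[str]:
    """Collect layout problems in a Triton model repository (hypothetical helper)."""
    problems = []
    for model_dir in sorted(Path(root).iterdir()):
        if not model_dir.is_dir():
            continue
        config = model_dir / "config.pbtxt"
        if not config.exists():
            problems.append(f"{model_dir.name}: missing config.pbtxt")
        cfg_text = config.read_text() if config.exists() else ""
        # Ensembles are pure dataflow definitions and need no version directory
        is_ensemble = 'platform: "ensemble"' in cfg_text
        has_version = any(d.is_dir() and d.name.isdigit() for d in model_dir.iterdir())
        if not has_version and not is_ensemble:
            problems.append(f"{model_dir.name}: no numeric version directory")
    return problems
```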
LLM configuration via tensorrtllm backend
# llama3_8b/config.pbtxt
name: "llama3_8b"
backend: "tensorrtllm"
max_batch_size: 256

input [
  { name: "input_ids" data_type: TYPE_INT32 dims: [-1] },
  { name: "input_lengths" data_type: TYPE_INT32 dims: [1] },
  { name: "request_output_len" data_type: TYPE_INT32 dims: [1] },
  { name: "sampling_config" data_type: TYPE_BYTES dims: [1] optional: true },
  { name: "streaming" data_type: TYPE_BOOL dims: [1] optional: true }
]
output [
  { name: "output_ids" data_type: TYPE_INT32 dims: [-1] },
  { name: "sequence_length" data_type: TYPE_INT32 dims: [1] }
]

parameters {
  key: "executor_worker_path"
  value: { string_value: "/opt/tritonserver/backends/tensorrtllm/trtllmExecutorWorker" }
}
parameters {
  key: "max_beam_width"
  value: { string_value: "1" }
}
parameters {
  key: "decoding_mode"
  value: { string_value: "top_k_top_p" }
}
parameters {
  key: "max_tokens_in_paged_kv_cache"
  value: { string_value: "20000" }
}
parameters {
  key: "scheduler_policy"
  value: { string_value: "guaranteed_no_evict" }
}
Ensemble Pipeline for RAG
# rag_pipeline/config.pbtxt
name: "rag_pipeline"
platform: "ensemble"
max_batch_size: 32

input [
  { name: "query" data_type: TYPE_STRING dims: [1] }
]
output [
  { name: "response" data_type: TYPE_STRING dims: [1] }
]

ensemble_scheduling {
  step [
    {
      model_name: "query_encoder"
      model_version: 1
      input_map { key: "text" value: "query" }
      output_map { key: "embeddings" value: "query_embeddings" }
    },
    {
      model_name: "retriever"
      model_version: 1
      input_map { key: "query_embeddings" value: "query_embeddings" }
      output_map { key: "context" value: "retrieved_context" }
    },
    {
      model_name: "llama3_8b"
      model_version: 1
      input_map {
        key: "input_ids" value: "augmented_input_ids"  # after preprocessing
      }
      output_map { key: "output_ids" value: "response_ids" }
    }
  ]
}
Dynamic Batching
# Added to any model's config.pbtxt
dynamic_batching {
  preferred_batch_size: [8, 16, 32]
  max_queue_delay_microseconds: 5000  # wait up to 5 ms to fill a batch
  priority_levels: 3
  default_priority_level: 2
  default_queue_policy {
    allow_timeout_override: true
    timeout_action: REJECT
    default_timeout_microseconds: 30000000  # 30 s
  }
}
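The interplay of preferred_batch_size and max_queue_delay_microseconds can be illustrated with a toy model of the batcher. Triton's real scheduler is event-driven and far more sophisticated; drain_batch and its parameters are invented for illustration only:

```python
import time
from collections import deque

def drain_batch(queue: deque, preferred_sizes=(8, 16, 32),
                max_queue_delay_s=0.005) -> list:
    """Toy dynamic batcher: wait up to max_queue_delay_s for the queue to
    reach the largest preferred size, then ship the best batch available."""
    deadline = time.monotonic() + max_queue_delay_s
    target = max(preferred_sizes)
    while len(queue) < target and time.monotonic() < deadline:
        time.sleep(0.0005)  # the real server wakes on request arrival instead
    # Largest preferred size we can fill; otherwise take everything queued
    size = max((s for s in preferred_sizes if s <= len(queue)),
               default=len(queue))
    return [queue.popleft() for _ in range(size)]
```

With 20 requests queued, the batcher ships 16 (the largest preferred size it can fill); with only 5 queued, it waits out the delay and ships all 5 rather than stall.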
Instance Groups - multiple copies of a model
# Two copies of the model, one per GPU
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [0]
  },
  {
    count: 1
    kind: KIND_GPU
    gpus: [1]
  }
]

# For CPU models: several workers
instance_group [
  {
    count: 4
    kind: KIND_CPU
  }
]
Client for Triton
import numpy as np
import tritonclient.http as tritonhttpclient
from transformers import AutoTokenizer

client = tritonhttpclient.InferenceServerClient("localhost:8000")

# Liveness / readiness checks
assert client.is_server_live()
assert client.is_model_ready("llama3_8b")

# Load the tokenizer once, not on every request
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

def generate_text(prompt: str, max_tokens: int = 512) -> str:
    # TRT-LLM expects INT32 token ids; HF tokenizers return int64
    input_ids = tokenizer.encode(prompt, return_tensors="np").astype(np.int32)
    inputs = [
        tritonhttpclient.InferInput("input_ids", input_ids.shape, "INT32"),
        tritonhttpclient.InferInput("input_lengths", [1, 1], "INT32"),
        tritonhttpclient.InferInput("request_output_len", [1, 1], "INT32"),
    ]
    inputs[0].set_data_from_numpy(input_ids)
    inputs[1].set_data_from_numpy(np.array([[input_ids.shape[1]]], dtype=np.int32))
    inputs[2].set_data_from_numpy(np.array([[max_tokens]], dtype=np.int32))
    outputs = [tritonhttpclient.InferRequestedOutput("output_ids")]
    response = client.infer("llama3_8b", inputs, outputs=outputs)
    output_ids = response.as_numpy("output_ids")
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
Monitoring via Prometheus
Triton exports metrics automatically on port 8002:
# prometheus.yml
scrape_configs:
  - job_name: triton
    static_configs:
      - targets: ["triton:8002"]
Metrics such as nv_inference_request_success, nv_inference_queue_duration_us, nv_gpu_utilization, and nv_gpu_memory_used_bytes give you SLA monitoring out of the box.
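The endpoint serves plain Prometheus text exposition, so it can also be inspected without a full Prometheus deployment. A minimal parser sketch (parse_triton_metrics is a hypothetical helper; it ignores HELP/TYPE comment lines and does no histogram handling):

```python
def parse_triton_metrics(text: str) -> dict:
    """Parse Prometheus text exposition into {'name{labels}': value}."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks and # HELP / # TYPE comment lines
        if not line or line.startswith("#"):
            continue
        # The value is everything after the last space; labels contain no spaces
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics
```

Feeding it the body of a GET to http://triton:8002/metrics yields per-model counters you can assert on in a smoke test.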
Implementation timeframes
Week 1: Triton installation, first model configuration, smoke test
Week 2: TRT-LLM compilation, ensemble pipeline, load testing
Week 3–4: Multi-GPU setup, monitoring, integration with production traffic
Month 2: Optimization of dynamic batching, autoscaling, multi-model management