Optimizing LLM inference with Text Generation Inference (TGI)
Text Generation Inference (TGI) is Hugging Face's production server for LLM inference. The server core is written in Rust, with model logic in Python. It is simpler to configure than vLLM, natively integrated with the Hugging Face Hub, and powers HuggingChat and many production deployments.
Quick start
# Docker (recommended); gated models need the token passed as an env var
docker run --gpus all \
    -p 8080:80 \
    -v /data/models:/data \
    -e HF_TOKEN=$HF_TOKEN \
    ghcr.io/huggingface/text-generation-inference:2.1 \
    --model-id meta-llama/Meta-Llama-3-8B-Instruct \
    --max-input-length 4096 \
    --max-total-tokens 8192 \
    --max-batch-prefill-tokens 32768 \
    --num-shard 1 \
    --dtype bfloat16
# Client via the official huggingface_hub package
from huggingface_hub import InferenceClient

client = InferenceClient(model="http://localhost:8080")

prompt = "Explain transformer attention in simple terms"
response = client.text_generation(
    prompt=prompt,
    max_new_tokens=512,
    temperature=0.7,
    repetition_penalty=1.1,
    stream=False,
)

# Streaming
for token in client.text_generation(prompt, stream=True):
    print(token, end="", flush=True)
Key Features of TGI
Continuous batching (in-flight batching): new requests join the running batch while earlier requests are still generating; the scheduling is similar to vLLM's.
Flash Attention 2: An efficient self-attention implementation with O(n) memory instead of O(n²). Automatically enabled for supported models.
Tensor Parallelism: Distribute the model across multiple GPUs via --num-shard.
Speculative Decoding: enabled via --speculate N — a draft model proposes N tokens per step, and the target model verifies them in a single forward pass.
Quantization: Support for GPTQ, AWQ, EETQ, BitsAndBytes out of the box.
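The speculative-decoding loop is easy to illustrate with greedy verification (the variant that applies at temperature 0; for sampled tokens TGI's acceptance test is probabilistic). Here `draft` and `target` are stand-in functions mapping a token context to the next token id, not real models:

```python
def speculative_step(draft, target, prefix, n=4):
    """One speculative step: draft proposes n tokens, target verifies."""
    # 1. Draft model proposes n tokens autoregressively (cheap).
    ctx = list(prefix)
    proposed = []
    for _ in range(n):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. Target model checks each proposal: keep the agreeing prefix,
    #    and emit the target's own token at the first disagreement.
    ctx = list(prefix)
    accepted = []
    for tok in proposed:
        expected = target(ctx)
        if expected != tok:
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. All n proposals accepted: the verification pass yields
    #    one extra "free" token from the target.
    accepted.append(target(ctx))
    return accepted

# Toy "models" over integer tokens: when draft agrees with target,
# each step yields n + 1 tokens per target forward pass.
target_model = lambda ctx: (ctx[-1] + 1) % 100
print(speculative_step(target_model, target_model, [0], n=4))  # [1, 2, 3, 4, 5]
```

When the draft disagrees immediately, the step degrades to one target token, which is why a well-matched draft model matters.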
Configuration for different scenarios
# Maximum throughput (batch processing): 2 GPUs for the MoE model,
# a large prefill batch, and a longer wait (--max-waiting-tokens)
# so batches fill up before running
docker run --gpus all ghcr.io/huggingface/text-generation-inference:2.1 \
    --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --num-shard 2 \
    --max-input-length 8192 \
    --max-total-tokens 16384 \
    --max-batch-prefill-tokens 131072 \
    --max-waiting-tokens 20 \
    --dtype bfloat16
# Minimum latency (interactive chat): cap the queue
# with --max-concurrent-requests
docker run --gpus all ghcr.io/huggingface/text-generation-inference:2.1 \
    --model-id meta-llama/Meta-Llama-3-8B-Instruct \
    --max-input-length 2048 \
    --max-total-tokens 4096 \
    --max-batch-prefill-tokens 4096 \
    --max-concurrent-requests 32 \
    --waiting-served-ratio 1.2
# Saving VRAM with quantization
docker run --gpus all ghcr.io/huggingface/text-generation-inference:2.1 \
    --model-id TheBloke/Llama-2-13B-AWQ \
    --quantize awq \
    --dtype float16
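A rough way to reason about these flags is a back-of-the-envelope VRAM budget: weights plus KV cache. The sketch below assumes Llama-3-8B-style geometry (32 layers, 8 KV heads via GQA, head dim 128) and 2-byte (fp16/bf16) KV entries; a real server adds activation and CUDA-graph overhead on top:

```python
def estimate_vram_gb(params_b: float, weight_bytes: int,
                     n_layers: int, n_kv_heads: int, head_dim: int,
                     tokens_in_flight: int, kv_bytes: int = 2) -> float:
    """Weights + KV cache in GB (1e9 bytes). Ignores activations/overhead."""
    weights = params_b * 1e9 * weight_bytes
    # K and V per layer: n_kv_heads * head_dim values per token
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * tokens_in_flight * kv_bytes
    return (weights + kv_cache) / 1e9

# 8B model in bf16, one sequence at --max-total-tokens 8192:
print(round(estimate_vram_gb(8, 2, 32, 8, 128, 8192), 2))  # 17.07
```

Growing `tokens_in_flight` (more concurrent sequences, longer contexts) only inflates the KV term, which is why quantizing the weights frees the most headroom for batching.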
Custom Handlers
A common pattern is to wrap TGI with a thin handler that preprocesses requests and postprocesses outputs:
# custom_handler.py
from transformers import AutoTokenizer

class CustomHandler:
    def __init__(self):
        self.tokenizer = AutoTokenizer.from_pretrained(...)

    def preprocess(self, inputs: dict) -> dict:
        """Transform the incoming request before inference."""
        prompt = inputs.get("inputs", "")
        # Prepend the system prompt
        full_prompt = (
            "<|system|>You are a helpful assistant.<|end|>\n"
            f"<|user|>{prompt}<|end|>\n<|assistant|>"
        )
        return {"inputs": full_prompt,
                **{k: v for k, v in inputs.items() if k != "inputs"}}

    def postprocess(self, model_output: dict) -> dict:
        """Postprocess the model output."""
        generated = model_output["generated_text"]
        # Strip the system/user turns from the output
        return {"generated_text": generated.split("<|assistant|>")[-1].strip()}
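The split-on-marker postprocessing can be sanity-checked in isolation; the marker string here matches the template used in the handler's preprocess step:

```python
def extract_assistant_reply(generated: str, marker: str = "<|assistant|>") -> str:
    # Keep only the text after the last assistant marker;
    # if the marker is absent, the full string passes through unchanged.
    return generated.split(marker)[-1].strip()

full = ("<|system|>You are a helpful assistant.<|end|>\n"
        "<|user|>Hi<|end|>\n<|assistant|>Hello! How can I help?")
print(extract_assistant_reply(full))  # Hello! How can I help?
```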
Monitoring and metrics
TGI exports Prometheus metrics at /metrics:

tgi_request_duration_seconds_bucket   # latency histogram
tgi_batch_inference_duration_seconds  # batch inference time
tgi_request_input_length              # input lengths
tgi_request_generated_tokens          # generated token counts
tgi_batch_current_size                # current batch size
tgi_queue_size                        # pending queue size
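The /metrics output is plain Prometheus text format, so a quick health check needs no Prometheus stack. A minimal parser for simple gauge/counter lines (labelled histogram series are kept verbatim as dict keys; fetching the text from your own host/port is left to the caller):

```python
def parse_prometheus(text: str) -> dict:
    """Parse 'name value' lines of Prometheus text format into a dict."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments
            continue
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics

sample = """\
# TYPE tgi_queue_size gauge
tgi_queue_size 3
tgi_batch_current_size 8
"""
m = parse_prometheus(sample)
print(m["tgi_queue_size"], m["tgi_batch_current_size"])  # 3.0 8.0
```

In practice you would feed it `requests.get("http://localhost:8080/metrics").text` and alert when tgi_queue_size grows while tgi_batch_current_size stays flat.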
TGI vs vLLM
| Parameter | TGI | vLLM |
|---|---|---|
| Hugging Face Hub integration | Native | Supported (loads models by Hub ID) |
| Performance | Comparable | Slightly higher on NVIDIA GPUs |
| Custom backends | Limited | More flexible |
| Docker image | Official, ready to run | Official (`vllm/vllm-openai`) |
| Streaming | SSE out of the box | SSE via OpenAI-compatible API |
| Documentation | Excellent | Good |
For most use cases, both options provide similar performance. TGI is more convenient when working in the HF ecosystem.