LLM Inference Optimization with Text Generation Inference (TGI)

Text Generation Inference (TGI) is Hugging Face's production server for LLM inference. The server core is written in Rust, with the model logic in Python. It is simpler to configure than vLLM and natively integrated with the Hugging Face Hub; it powers HuggingChat and many production deployments.

Quick start

# Docker (recommended)
docker run --gpus all \
  -p 8080:80 \
  -v /data/models:/data \
  -e HF_TOKEN=$HF_TOKEN \
  ghcr.io/huggingface/text-generation-inference:2.1 \
  --model-id meta-llama/Meta-Llama-3-8B-Instruct \
  --max-input-length 4096 \
  --max-total-tokens 8192 \
  --max-batch-prefill-tokens 32768 \
  --num-shard 1 \
  --dtype bfloat16
# Client via the official huggingface_hub package
from huggingface_hub import InferenceClient

client = InferenceClient(model="http://localhost:8080")

prompt = "Explain transformer attention in simple terms"

response = client.text_generation(
    prompt=prompt,
    max_new_tokens=512,
    temperature=0.7,
    repetition_penalty=1.1,
    stream=False
)

# Streaming
for token in client.text_generation(prompt, stream=True):
    print(token, end="", flush=True)
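Under the hood the client POSTs to TGI's `/generate` route. A minimal sketch of building that request body by hand (field names follow TGI's REST schema; send it with any HTTP client):

```python
import json

def build_generate_payload(prompt, max_new_tokens=512, temperature=0.7):
    """Build a request body for TGI's POST /generate endpoint."""
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
        },
    }

payload = build_generate_payload("Explain transformer attention in simple terms")
print(json.dumps(payload, indent=2))
# Send with any HTTP client, e.g.:
# requests.post("http://localhost:8080/generate", json=payload)
```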

Key Features of TGI

Continuous batching (in-flight batching): new requests are added to the batch while previous ones are being generated. The implementation is similar to vLLM.
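The scheduling idea can be sketched with a toy simulation (an illustration of in-flight admission, not TGI's actual scheduler):

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Toy simulation of in-flight batching: each request needs
    `tokens` decode steps; new requests join the batch as soon as a
    slot frees up. Returns the decode steps to finish everything."""
    queue = deque(requests)   # remaining token counts per waiting request
    active = []               # tokens left for requests already in the batch
    steps = 0
    while queue or active:
        # Admit waiting requests into free batch slots (in-flight batching)
        while queue and len(active) < max_batch:
            active.append(queue.popleft())
        # One decode step generates one token for every active request;
        # requests that just emitted their last token leave the batch
        active = [t - 1 for t in active if t > 1]
        steps += 1
    return steps

# One long request plus three short ones: the short ones finish and
# free their slots while the long one keeps decoding.
continuous_batching([5, 1, 1, 1], max_batch=2)   # 5 steps
```

With static batching the same workload would take 6 steps (5 for the first batch, then 1 more), since the whole batch must finish before new requests start.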

Flash Attention 2: An efficient self-attention implementation with O(n) memory instead of O(n²). Automatically enabled for supported models.
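A back-of-envelope comparison shows why this matters at long contexts (illustrative sizes for a Llama-class layer; `n_heads` and `head_dim` are assumed values, not measurements):

```python
def attention_memory_bytes(seq_len, n_heads=32, head_dim=128,
                           bytes_per_el=2, flash=False):
    """Rough per-layer activation memory for self-attention.
    Standard attention materializes an (n_heads, seq_len, seq_len)
    score matrix; FlashAttention tiles it and keeps only O(seq_len)
    state (Q, K, V, O blocks)."""
    if flash:
        return 4 * seq_len * n_heads * head_dim * bytes_per_el
    return n_heads * seq_len * seq_len * bytes_per_el

# At an 8k context the quadratic score matrix alone is ~4 GiB per
# layer in bf16, versus ~256 MiB of O(n) state with FlashAttention:
std = attention_memory_bytes(8192)               # 4 GiB
flash = attention_memory_bytes(8192, flash=True) # 256 MiB
```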

Tensor Parallelism: Distribute the model across multiple GPUs via --num-shard.
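The idea behind column-parallel sharding can be illustrated with a toy NumPy sketch (an illustration of the math, not TGI's implementation):

```python
import numpy as np

def shard_columns(weight, num_shard):
    """Column-parallel split of a linear layer's weight: each GPU
    holds out_features/num_shard columns, computes its slice of the
    output, and the slices are concatenated."""
    return np.split(weight, num_shard, axis=1)

w = np.arange(16.0).reshape(4, 4)   # (in_features=4, out_features=4)
shards = shard_columns(w, 2)        # two (4, 2) shards, one per GPU
x = np.ones((1, 4))

full = x @ w                                              # unsharded result
parallel = np.concatenate([x @ s for s in shards], axis=1)  # matches `full`
```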

Speculative Decoding: via --speculate N — the draft model generates N tokens, the target verifies.
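The accept/reject step can be illustrated with a toy greedy-matching sketch (simplified: real speculative sampling uses a probabilistic acceptance rule, not exact matching):

```python
def speculative_step(draft_tokens, target_tokens):
    """Toy acceptance rule: the target model verifies the draft's N
    tokens in one forward pass and keeps the longest matching prefix;
    at the first mismatch, the target's own token is used instead."""
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)   # target's correction replaces the mismatch
            return accepted
    return accepted

# Draft guessed 4 tokens; target agrees with the first 2, so 3 tokens
# land in one target forward pass instead of 1:
speculative_step(["the", "cat", "sat", "on"],
                 ["the", "cat", "ran", "fast"])   # ["the", "cat", "ran"]
```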

Quantization: Support for GPTQ, AWQ, EETQ, BitsAndBytes out of the box.
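As a rough sense of what quantization buys, weight memory scales linearly with bits per parameter. A back-of-envelope sketch (illustrative, ignoring scale/zero-point overhead):

```python
def model_weight_gb(n_params_b, bits):
    """Approximate weight memory in GB for a model with `n_params_b`
    billion parameters stored at `bits` bits per weight."""
    return n_params_b * bits / 8

fp16 = model_weight_gb(13, 16)   # 26.0 GB: needs a 40 GB-class GPU
awq4 = model_weight_gb(13, 4)    # 6.5 GB: fits a single 24 GB GPU
```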

Configuration for different scenarios

# Maximum throughput (batch processing):
# 2 GPUs for the MoE model, a large prefill batch, and a longer wait
# (--max-waiting-tokens) so bigger batches can form
docker run --gpus all ghcr.io/huggingface/text-generation-inference:2.1 \
  --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --num-shard 2 \
  --max-input-length 8192 \
  --max-total-tokens 16384 \
  --max-batch-prefill-tokens 131072 \
  --max-waiting-tokens 20 \
  --dtype bfloat16

# Minimum latency (interactive chat):
# small batches and a bounded queue (--max-concurrent-requests)
docker run --gpus all ghcr.io/huggingface/text-generation-inference:2.1 \
  --model-id meta-llama/Meta-Llama-3-8B-Instruct \
  --max-input-length 2048 \
  --max-total-tokens 4096 \
  --max-batch-prefill-tokens 4096 \
  --max-concurrent-requests 32 \
  --waiting-served-ratio 1.2

# Saving VRAM via quantization
docker run --gpus all ghcr.io/huggingface/text-generation-inference:2.1 \
  --model-id TheBloke/Llama-2-13B-AWQ \
  --quantize awq \
  --dtype float16

Custom Handlers

TGI allows you to add preprocessing/postprocessing via a custom handler:

# custom_handler.py
from transformers import AutoTokenizer

class CustomHandler:
    def __init__(self):
        self.tokenizer = AutoTokenizer.from_pretrained(...)

    def preprocess(self, inputs: dict) -> dict:
        """Transform the incoming request before inference."""
        prompt = inputs.get("inputs", "")

        # Prepend the system prompt
        full_prompt = f"<|system|>You are a helpful assistant.<|end|>\n<|user|>{prompt}<|end|>\n<|assistant|>"

        return {"inputs": full_prompt, **{k: v for k, v in inputs.items() if k != "inputs"}}

    def postprocess(self, model_output: dict) -> dict:
        """Post-process the model output."""
        generated = model_output["generated_text"]
        # Strip the prompt template from the output
        return {"generated_text": generated.split("<|assistant|>")[-1].strip()}
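The preprocess/postprocess pair is essentially prompt templating plus output stripping. A self-contained round trip of that logic (hypothetical helper names `wrap_prompt`/`strip_template`; same template tokens as the handler above):

```python
SYSTEM = "<|system|>You are a helpful assistant.<|end|>\n"

def wrap_prompt(user_prompt):
    """Same templating as the handler's preprocess step."""
    return f"{SYSTEM}<|user|>{user_prompt}<|end|>\n<|assistant|>"

def strip_template(generated_text):
    """Same cleanup as postprocess: keep only the assistant's reply."""
    return generated_text.split("<|assistant|>")[-1].strip()

full = wrap_prompt("What is TGI?")
# The model echoes the prompt and appends its answer:
reply = strip_template(full + " TGI is an inference server. ")
# reply == "TGI is an inference server."
```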

Monitoring and metrics

TGI exports Prometheus metrics to /metrics:

tgi_request_duration_seconds_bucket   # latency histogram
tgi_batch_inference_duration_seconds  # batch inference time
tgi_request_input_length              # input lengths
tgi_request_generated_tokens          # generated token counts
tgi_batch_current_size                # current batch size
tgi_queue_size                        # waiting queue size
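The latency histogram can be turned into percentiles the same way Prometheus's histogram_quantile() does. A simplified sketch with linear interpolation (the bucket numbers are made up for illustration):

```python
def histogram_quantile(q, buckets):
    """Estimate a quantile from Prometheus-style cumulative buckets:
    a list of (upper_bound, cumulative_count) pairs, as exported by
    tgi_request_duration_seconds_bucket. Interpolates linearly inside
    the bucket that crosses the target rank."""
    total = buckets[-1][1]
    target = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= target:
            frac = (target - prev_count) / (count - prev_count)
            return prev_bound + frac * (bound - prev_bound)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# 80 requests finished under 0.5s, 95 under 1s, 100 under 2.5s:
p95 = histogram_quantile(0.95, [(0.5, 80), (1.0, 95), (2.5, 100)])
# p95 == 1.0 (seconds)
```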

TGI vs vLLM

Parameter               | TGI                | vLLM
------------------------|--------------------|---------------------------
Integration with HF Hub | Native             | Via HF
Performance             | Comparable         | Slightly higher on NVIDIA
Custom backends         | Limited            | More flexible
Docker image            | Official, ready    | Build your own
Streaming               | SSE out of the box | Yes
Documentation           | Excellent          | Good

For most use cases, both options deliver similar performance; TGI is the more convenient choice when you already work within the Hugging Face ecosystem.