AI Inference Latency Monitoring Setup


LLM inference latency is a critical user-experience metric. It has two components: Time to First Token (TTFT), the delay before the response begins, and Time Per Output Token (TPOT), the speed of generation. With streaming responses, users perceive a high TTFT far more acutely than a high TPOT.
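
The two components combine into end-to-end latency roughly as total ≈ TTFT + (N − 1) · TPOT for N output tokens. A quick back-of-the-envelope check, with illustrative numbers:

```python
# Back-of-the-envelope latency model: for N output tokens,
# total ≈ TTFT + (N - 1) * TPOT. All numbers are illustrative.
ttft_s = 0.4       # time to first token, seconds
tpot_ms = 30.0     # time per output token, milliseconds
n_tokens = 500     # generated output tokens

total_s = ttft_s + (n_tokens - 1) * tpot_ms / 1000
print(f"estimated total latency: {total_s:.2f} s")  # 15.37 s
```

This is why long generations feel dominated by TPOT, while short chat replies are dominated by TTFT.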

Latency metrics

from prometheus_client import Histogram
import time

# Latency histograms
TTFT_HISTOGRAM = Histogram(
    "llm_time_to_first_token_seconds",
    "Time to first token",
    buckets=[0.1, 0.3, 0.5, 1.0, 2.0, 5.0, 10.0]
)

TOTAL_LATENCY = Histogram(
    "llm_total_latency_seconds",
    "Total request latency",
    labelnames=["model", "endpoint"],
    buckets=[0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0]
)

TPOT_HISTOGRAM = Histogram(
    "llm_time_per_output_token_ms",
    "Time per output token in milliseconds",
    buckets=[5, 10, 20, 50, 100, 200]
)

class LatencyTracker:
    def track_streaming_request(self, request_id: str, model: str):
        start = time.time()
        first_token_time = None

        def on_first_token():
            # Called once, when the first output token is emitted
            nonlocal first_token_time
            first_token_time = time.time()
            TTFT_HISTOGRAM.observe(first_token_time - start)

        def on_complete(total_tokens: int):
            end = time.time()
            total_latency = end - start
            TOTAL_LATENCY.labels(model=model, endpoint="/v1/chat").observe(total_latency)

            # TPOT covers the decode phase only, so the first token
            # (prefill) is excluded from the average
            if first_token_time is not None and total_tokens > 1:
                decode_time = end - first_token_time
                tpot_ms = (decode_time / (total_tokens - 1)) * 1000
                TPOT_HISTOGRAM.observe(tpot_ms)

        return on_first_token, on_complete
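
To show how the two callbacks are wired into a streaming loop, here is a dependency-free sketch of the same pattern: `SimpleLatencyTracker` is a hypothetical stand-in that records into plain lists instead of Prometheus histograms, so the example runs standalone.

```python
# Minimal sketch of the callback pattern above, with Prometheus
# observe() calls replaced by plain lists so it runs standalone.
import time

class SimpleLatencyTracker:
    def __init__(self):
        self.ttft_samples = []  # seconds to first token
        self.tpot_samples = []  # ms per output token (decode phase)

    def track(self):
        start = time.time()
        first_token_time = None

        def on_first_token():
            nonlocal first_token_time
            first_token_time = time.time()
            self.ttft_samples.append(first_token_time - start)

        def on_complete(total_tokens: int):
            end = time.time()
            if first_token_time is not None and total_tokens > 1:
                decode_time = end - first_token_time
                self.tpot_samples.append(decode_time / (total_tokens - 1) * 1000)

        return on_first_token, on_complete

# Simulated streaming loop: ~50 ms to the first token,
# then 4 more tokens at ~10 ms each.
tracker = SimpleLatencyTracker()
on_first, on_done = tracker.track()
time.sleep(0.05)
on_first()
for _ in range(4):
    time.sleep(0.01)
on_done(total_tokens=5)
# tracker.ttft_samples[0] is ~0.05 s, tracker.tpot_samples[0] ~10 ms
```

In a real server, `on_first_token` is invoked when the first chunk leaves the model and `on_complete` when the stream closes.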

Latency alerts in Grafana

# Alerting rules
- alert: LLMHighTTFT
  expr: histogram_quantile(0.95, rate(llm_time_to_first_token_seconds_bucket[5m])) > 3
  for: 5m
  annotations:
    summary: "TTFT p95 > 3 seconds"

- alert: LLMHighTotalLatency
  expr: histogram_quantile(0.99, rate(llm_total_latency_seconds_bucket[5m])) > 30
  for: 5m
  annotations:
    summary: "Total latency p99 > 30 seconds"
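
Because `TOTAL_LATENCY` carries `model` and `endpoint` labels, a dashboard panel can break the quantile down per model. A sketch of the PromQL (aggregating by `le` and `model` so `histogram_quantile` sees complete bucket sets):

```promql
histogram_quantile(
  0.95,
  sum(rate(llm_total_latency_seconds_bucket[5m])) by (le, model)
)
```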

Per-component diagnostics

When latency is high, you need to find out where the time is spent:

  • Queue time: how long the request waits in the vLLM queue → a sign of insufficient serving capacity
  • Prefill time: processing of the input context → grows with long system prompts
  • Decode time: token generation → determined by max_tokens and TPOT
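
Given per-request timestamps that a serving stack can record, this breakdown is simple arithmetic. A sketch (the timestamp names are illustrative, not a vLLM API):

```python
# Decomposing end-to-end latency from four per-request timestamps:
#   t_arrive - request accepted into the queue
#   t_start  - scheduler dispatches it to the engine (queue ends)
#   t_first  - first output token emitted (prefill ends)
#   t_done   - last token emitted (decode ends)
def latency_breakdown(t_arrive, t_start, t_first, t_done):
    return {
        "queue_s":   t_start - t_arrive,
        "prefill_s": t_first - t_start,
        "decode_s":  t_done - t_first,
        "total_s":   t_done - t_arrive,
    }

parts = latency_breakdown(0.0, 0.8, 1.3, 9.3)
# Here queue time (0.8 s) exceeds prefill (0.5 s): that points to
# insufficient capacity rather than long prompts.
```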

vLLM exposes its own diagnostic metrics for these phases: vllm:time_to_first_token_seconds, vllm:time_per_output_token_seconds, and vllm:e2e_request_latency_seconds, all published as histograms from which percentiles can be computed.