Setting up AI inference latency monitoring
LLM inference latency is a critical metric for user experience. It has two components: Time to First Token (TTFT), the delay before the response begins, and Time Per Output Token (TPOT), the speed of generation once it has started. With streaming responses, users perceive TTFT far more acutely than TPOT.
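As a back-of-envelope model, end-to-end latency decomposes as TTFT plus (output_tokens − 1) × TPOT. A quick sketch (the function name and numbers are illustrative):

```python
def estimated_latency(ttft_s: float, tpot_ms: float, output_tokens: int) -> float:
    """End-to-end latency estimate: TTFT plus per-token decode time.

    The first token is covered by TTFT, so only the remaining
    output_tokens - 1 tokens contribute decode time.
    """
    return ttft_s + (output_tokens - 1) * tpot_ms / 1000

# Illustrative numbers: 0.4 s TTFT, 30 ms/token, a 500-token answer
print(round(estimated_latency(0.4, 30, 500), 2))  # 15.37
```

This is why a long answer with a fast TTFT can still feel slow: decode time dominates once the token count grows.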
Latency metrics
from prometheus_client import Histogram
import time

# Latency histograms
TTFT_HISTOGRAM = Histogram(
    "llm_time_to_first_token_seconds",
    "Time to first token",
    buckets=[0.1, 0.3, 0.5, 1.0, 2.0, 5.0, 10.0],
)
TOTAL_LATENCY = Histogram(
    "llm_total_latency_seconds",
    "Total request latency",
    labelnames=["model", "endpoint"],
    buckets=[0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0],
)
TPOT_HISTOGRAM = Histogram(
    "llm_time_per_output_token_ms",
    "Time per output token in milliseconds",
    buckets=[5, 10, 20, 50, 100, 200],
)

class LatencyTracker:
    def track_streaming_request(self, request_id: str, model: str):
        """Return (on_first_token, on_complete) callbacks for one request."""
        start = time.time()
        first_token_time = None

        def on_first_token():
            nonlocal first_token_time
            first_token_time = time.time()
            TTFT_HISTOGRAM.observe(first_token_time - start)

        def on_complete(total_tokens: int):
            end = time.time()
            TOTAL_LATENCY.labels(model=model, endpoint="/v1/chat").observe(end - start)
            # TPOT is averaged over the decode phase only (tokens after the first)
            if first_token_time is not None and total_tokens > 1:
                decode_time = end - first_token_time
                TPOT_HISTOGRAM.observe(decode_time / (total_tokens - 1) * 1000)

        return on_first_token, on_complete
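The callback pattern can be exercised without a Prometheus server; in this self-contained simulation the histograms are replaced by simple stubs (everything here is a sketch, not the production wiring):

```python
import time

# Stand-ins for the Prometheus histograms so the sketch runs anywhere;
# in the real service these are the TTFT/TPOT/total-latency metrics above.
class _StubHistogram:
    def __init__(self):
        self.samples = []

    def observe(self, value):
        self.samples.append(value)

ttft_h, total_h, tpot_h = _StubHistogram(), _StubHistogram(), _StubHistogram()

def track_streaming_request():
    start = time.monotonic()
    first_token_time = None

    def on_first_token():
        nonlocal first_token_time
        first_token_time = time.monotonic()
        ttft_h.observe(first_token_time - start)

    def on_complete(total_tokens):
        end = time.monotonic()
        total_h.observe(end - start)
        if first_token_time is not None and total_tokens > 1:
            tpot_h.observe((end - first_token_time) / (total_tokens - 1) * 1000)

    return on_first_token, on_complete

# Simulated stream: first token after ~50 ms, then four more tokens
on_first, on_done = track_streaming_request()
time.sleep(0.05)
on_first()
for _ in range(4):
    time.sleep(0.01)
on_done(5)
```

After the run, `ttft_h.samples` holds one TTFT observation of at least 50 ms, and `tpot_h.samples` one per-token time of at least 10 ms.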
Grafana latency alerts
# Alerting rules
- alert: LLMHighTTFT
  expr: histogram_quantile(0.95, sum by (le) (rate(llm_time_to_first_token_seconds_bucket[5m]))) > 3
  for: 5m
  annotations:
    summary: "TTFT p95 > 3 seconds"
- alert: LLMHighTotalLatency
  expr: histogram_quantile(0.99, sum by (le) (rate(llm_total_latency_seconds_bucket[5m]))) > 30
  for: 5m
  annotations:
    summary: "Total latency p99 > 30 seconds"
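A TPOT alert can follow the same pattern; the 100 ms threshold below is an illustrative assumption, not a recommendation — tune it to your model and hardware:

```
- alert: LLMHighTPOT
  expr: histogram_quantile(0.95, sum by (le) (rate(llm_time_per_output_token_ms_bucket[5m]))) > 100
  for: 5m
  annotations:
    summary: "TPOT p95 > 100 ms"
```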
Diagnosing latency by component
When latency is high, you need to know where the time is spent:
- Queuing time: waiting in the vLLM scheduler queue → a sign of insufficient serving capacity
- Prefill time: processing the input context → dominated by long prompts, e.g. large system prompts
- Decode time: token-by-token generation → determined by max_tokens and TPOT
vLLM exposes diagnostic metrics for this: vllm:time_to_first_token_seconds, vllm:time_per_output_token_seconds, vllm:e2e_request_latency_seconds — all histograms, so percentiles can be derived with histogram_quantile.
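The component breakdown can be queried directly. The metric names below come from vLLM's /metrics endpoint, but they have changed between releases, so verify them against your deployed version:

```
# p95 of each latency component
histogram_quantile(0.95, sum by (le) (rate(vllm:request_queue_time_seconds_bucket[5m])))
histogram_quantile(0.95, sum by (le) (rate(vllm:request_prefill_time_seconds_bucket[5m])))
histogram_quantile(0.95, sum by (le) (rate(vllm:request_decode_time_seconds_bucket[5m])))
```

Comparing the three side by side in one Grafana panel usually makes the bottleneck obvious: a growing queue component points at capacity, a large prefill component at prompt length, and a large decode component at output length.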