Setting up LLM inference caching
Caching LLM responses reduces latency and GPU costs. It operates at two levels: a semantic cache (similar queries return the same stored response) and the KV-cache (prefix sharing inside the model itself).
Semantic Caching
Exact query matches are rare, so an exact-match cache alone has a low hit rate. A semantic cache instead finds semantically similar queries and returns a stored response without running inference:
import hashlib
import json

import redis
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct
from sentence_transformers import SentenceTransformer


class SemanticCache:
    def __init__(self, similarity_threshold: float = 0.92):
        self.encoder = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")
        self.redis = redis.Redis(host="localhost", port=6379, db=1)
        self.threshold = similarity_threshold
        # Similarity search backend: Qdrant (pgvector would also work);
        # the "llm_cache" collection must be created beforehand with cosine distance
        self.vector_db = QdrantClient("localhost", port=6333)

    def get(self, prompt: str, system_prompt: str = "") -> str | None:
        cache_key = self._make_key(prompt, system_prompt)
        # 1. Exact match (cheap)
        exact = self.redis.get(cache_key)
        if exact:
            return json.loads(exact)["response"]
        # 2. Semantic match (vector search)
        embedding = self.encoder.encode(prompt)
        results = self.vector_db.search(
            collection_name="llm_cache",
            query_vector=embedding.tolist(),
            limit=1,
            score_threshold=self.threshold,
        )
        if results:
            payload = results[0].payload
            # Refresh the TTL of the exact-match entry this hit maps to
            # (the payload stores the Redis key; the Qdrant point id would not work here)
            self.redis.expire(payload["cache_key"], 3600)
            return payload["response"]
        return None

    def set(self, prompt: str, response: str, system_prompt: str = "", ttl: int = 3600):
        embedding = self.encoder.encode(prompt)
        cache_key = self._make_key(prompt, system_prompt)
        # Store in the vector DB; derive a stable numeric point id from the key
        # (Python's hash() is salted per process and is not reproducible)
        self.vector_db.upsert(
            collection_name="llm_cache",
            points=[PointStruct(
                id=int(cache_key[:15], 16),
                vector=embedding.tolist(),
                payload={
                    "prompt": prompt,
                    "response": response,
                    "system_prompt": system_prompt,
                    "cache_key": cache_key,
                },
            )],
        )
        # Store in Redis for exact lookup
        self.redis.setex(cache_key, ttl, json.dumps({"response": response}))

    def _make_key(self, prompt: str, system_prompt: str) -> str:
        return hashlib.sha256(f"{system_prompt}||{prompt}".encode()).hexdigest()
GPTCache: a ready-made solution
GPTCache is a library specialized for LLM caching, with adapters for several vector stores:
from gptcache import cache
from gptcache.adapter import openai as cached_openai
from gptcache.embedding import Onnx
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

# Cache setup; the vector store dimension must match the embedding model
embedding_model = Onnx()
data_manager = get_data_manager(
    CacheBase("sqlite"),
    VectorBase("qdrant", host="localhost", port=6333,
               dimension=embedding_model.dimension),
)

cache.init(
    embedding_func=embedding_model.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(max_distance=0.3),
    cache_enable_func=lambda *args, **kwargs: True,
)

# Usage: a drop-in replacement for the openai client
response = cached_openai.ChatCompletion.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is Python?"}],
)
KV-Cache Prefix in vLLM
vLLM caches and reuses the KV values of shared prefixes (typically the system prompt):

# vLLM applies prefix caching automatically once it is enabled
# Important: the system prompt must be identical across requests,
# otherwise the prefixes do not match and nothing is reused

# Enable prefix caching explicitly at server start
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --enable-prefix-caching \
    --max-model-len 8192

# Cache efficiency metrics (exposed via Prometheus)
# vllm:cache_config_info{...} -> num_gpu_blocks
# vllm:gpu_cache_usage_perc   -> fraction of cache blocks in use
A prefix cache hit rate of 60–80% is typical when all requests share the same system prompt.
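To see why prefix reuse matters, it helps to estimate the KV-cache footprint: per token it is 2 (one K and one V tensor) × layers × KV heads × head_dim × bytes per element. The shapes below are the published Llama-3-8B config (32 layers, 8 KV heads via GQA, head_dim 128) in fp16; the helper function is just a sketch of this arithmetic:

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """KV-cache bytes per token: one K and one V tensor per layer."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Llama-3-8B (GQA): 32 layers, 8 KV heads, head_dim 128, fp16
per_token = kv_bytes_per_token(32, 8, 128)  # 131072 bytes = 128 KiB per token
prompt_mib = 500 * per_token / 2**20        # 62.5 MiB for a 500-token prefix
```

So a 500-token shared system prompt ties up roughly 62.5 MiB of GPU memory once; without prefix caching those blocks would be recomputed and re-stored for every request.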
When not to cache
- Personalized responses (different for every user)
- Queries that depend on the current time or date
- Financial data (rates, prices): it goes stale quickly
- Code generation (minor prompt variations → different code)
- Requests with temperature > 0.8: responses are stochastic
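Some of these exclusions can be enforced mechanically. GPTCache's `cache_enable_func` hook (left as a constant `True` in the setup above) accepts a predicate over the request; the sketch below assumes it receives the ChatCompletion keyword arguments, and the `should_cache` name and keyword list are illustrative:

```python
def should_cache(*args, **kwargs) -> bool:
    """Skip caching for stochastic, personalized, or time-sensitive requests."""
    if kwargs.get("temperature", 0.0) > 0.8:  # stochastic sampling
        return False
    if kwargs.get("user"):                    # personalized response
        return False
    messages = kwargs.get("messages", [])
    text = " ".join(m.get("content", "") for m in messages).lower()
    # time-sensitive queries are poor cache candidates
    if any(word in text for word in ("today", "right now", "current price")):
        return False
    return True
```

Pass it as `cache_enable_func=should_cache` in `cache.init(...)` so excluded requests bypass the cache entirely instead of polluting it.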
Caching is most effective for FAQ bots, RAG over a fixed document set, and classification tasks.
Caching metrics
Key KPIs:
- Cache hit rate (target > 30% for FAQ; expect < 5% for creative workloads)
- Latency reduction (p99 latency with vs. without the cache)
- Cost savings (share of requests never sent to inference)
- Staleness rate (share of cached responses that have gone stale)
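These KPIs fall out of simple counters. A sketch of the derivation (the `cache_kpis` name is illustrative, and the latency figure here is mean-based; a production dashboard would compare p99 percentiles instead):

```python
def cache_kpis(hits: int, misses: int,
               avg_cached_ms: float, avg_uncached_ms: float) -> dict:
    """Derive hit rate, cost savings, and latency reduction from raw counters."""
    total = hits + misses
    hit_rate = hits / total if total else 0.0
    # blended mean latency across the hit and miss paths
    avg_ms = hit_rate * avg_cached_ms + (1 - hit_rate) * avg_uncached_ms
    return {
        "hit_rate": hit_rate,
        "cost_savings": hit_rate,  # fraction of requests skipping inference
        "latency_reduction": 1 - avg_ms / avg_uncached_ms,
    }

# e.g. 300 hits / 700 misses, 15 ms cache path vs 900 ms inference path
kpis = cache_kpis(300, 700, 15.0, 900.0)  # hit_rate 0.3, latency cut ~29.5%
```

A 30% hit rate already removes ~30% of inference spend, while the latency win depends on how much faster the cache path is than a full forward pass.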