Inference Caching Setup (KV-cache, Semantic Cache)

Setting up LLM inference caching

Caching LLM responses reduces latency and GPU costs. It operates at two levels: a semantic cache (similar queries return the same stored response) and the KV-cache (prefix sharing inside the model).

Semantic Caching

Exact query matches are rare. A semantic cache instead finds semantically similar queries and returns the stored response without running inference:

from sentence_transformers import SentenceTransformer
import numpy as np
import redis
import json

class SemanticCache:
    def __init__(self, similarity_threshold: float = 0.92):
        self.encoder = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")
        self.redis = redis.Redis(host="localhost", port=6379, db=1)
        self.threshold = similarity_threshold
        # For similarity search: Qdrant or pgvector
        from qdrant_client import QdrantClient
        self.vector_db = QdrantClient("localhost", port=6333)

    def get(self, prompt: str, system_prompt: str = "") -> str | None:
        cache_key = self._make_key(prompt, system_prompt)

        # 1. Exact match (cheap)
        exact = self.redis.get(cache_key)
        if exact:
            return json.loads(exact)["response"]

        # 2. Semantic match (via vector search)
        embedding = self.encoder.encode(prompt)
        results = self.vector_db.search(
            collection_name="llm_cache",
            query_vector=embedding.tolist(),
            limit=1,
            score_threshold=self.threshold
        )

        if results:
            payload = results[0].payload
            # Refresh the TTL of the matching exact-lookup entry in Redis
            # (the Redis key is carried in the payload, not the Qdrant point id)
            self.redis.expire(payload["cache_key"], 3600)
            return payload["response"]

        return None

    def set(self, prompt: str, response: str, system_prompt: str = "", ttl: int = 3600):
        embedding = self.encoder.encode(prompt)
        cache_key = self._make_key(prompt, system_prompt)

        # Store in the vector DB; keep the Redis key in the payload so a
        # semantic hit can refresh the corresponding exact-match entry
        self.vector_db.upsert(
            collection_name="llm_cache",
            points=[{
                "id": abs(hash(cache_key)) % (2**31),
                "vector": embedding.tolist(),
                "payload": {"prompt": prompt, "response": response,
                            "system_prompt": system_prompt, "cache_key": cache_key}
            }]
        )

        # Store in Redis for exact lookup
        self.redis.setex(cache_key, ttl, json.dumps({"response": response}))

    def _make_key(self, prompt: str, system_prompt: str) -> str:
        import hashlib
        return hashlib.sha256(f"{system_prompt}||{prompt}".encode()).hexdigest()

GPTCache: a ready-made solution

GPTCache is a specialized library for LLM caching with support for various vector storage types:

from gptcache import cache
from gptcache.adapter import openai as cached_openai
from gptcache.embedding import Onnx
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

# Cache setup
embedding_model = Onnx()
data_manager = get_data_manager(
    CacheBase("sqlite"),
    VectorBase("qdrant", host="localhost", port=6333, dimension=512)
)

cache.init(
    embedding_func=embedding_model.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(max_distance=0.3),
    cache_enable_func=lambda *args, **kwargs: True
)

# Usage: a drop-in replacement for the openai client
response = cached_openai.ChatCompletion.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is Python?"}]
)

KV-Cache Prefix in vLLM

vLLM automatically caches KV values for common prefixes (system prompts):

# vLLM uses prefix caching automatically once it is enabled.
# Important: the system prompt must be identical across requests.

# Enable prefix caching at server startup
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3-8b-instruct \
  --enable-prefix-caching \
  --max-model-len 8192

# Cache efficiency metrics
# vllm:cache_config_info{...} -> num_gpu_blocks
# vllm:gpu_cache_usage_perc -> fraction of cache blocks in use

A prefix cache hit rate of 60–80% is typical with the same system prompt for all requests.
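To see why prefix reuse matters, the per-token KV-cache footprint can be estimated from the model config. The back-of-envelope sketch below assumes typical Llama-3-8B values (32 layers, 8 KV heads via GQA, head dim 128, fp16); the function name is illustrative:

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int = 2) -> int:
    # 2x for keys and values, per layer, per KV head
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Assumed Llama-3-8B config: 32 layers, 8 KV heads (GQA), head_dim 128, fp16
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
print(per_token)                 # 131072 bytes = 128 KiB per token
# A shared 1,000-token system prompt, cached once instead of recomputed per request:
print(1000 * per_token / 2**20)  # 125.0 MiB of KV reused per request
```

Every request that hits the prefix cache skips both the prefill compute for those tokens and the need to hold a private copy of their KV blocks.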

When not to cache

  • Personalized responses (different for each user)
  • Queries that depend on the current time or date
  • Financial data (rates, prices): it goes stale quickly
  • Code generation (minor prompt variations → different code)
  • temperature > 0.8: responses are intentionally stochastic

Caching is most effective for FAQ bots, RAGs with fixed documents, and classification tasks.
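A simple pre-check can keep volatile requests out of the cache. The sketch below is illustrative (the `is_cacheable` name and keyword list are assumptions, not part of any library):

```python
# Keyword list is illustrative; tune it for your domain
VOLATILE_KEYWORDS = ("today", "now", "current", "latest", "price", "rate")

def is_cacheable(prompt: str, temperature: float, personalized: bool = False) -> bool:
    # Skip personalized and high-temperature (stochastic) requests
    if personalized or temperature > 0.8:
        return False
    # Skip prompts that depend on fast-changing data
    lowered = prompt.lower()
    return not any(kw in lowered for kw in VOLATILE_KEYWORDS)

print(is_cacheable("What is Python?", temperature=0.2))      # True
print(is_cacheable("What is the current BTC price?", 0.2))   # False: volatile keywords
print(is_cacheable("Tell me a story", temperature=1.0))      # False: high temperature
```

In practice this guard sits in front of both `get` and `set`, so volatile responses are never written to the cache in the first place.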

Caching metrics

Key KPIs:

  • cache hit rate: target > 30% for FAQ bots; under 5% is expected for creative tasks
  • latency reduction: p99 latency with vs. without the cache
  • cost savings: share of requests that never reach inference
  • staleness rate: share of cached responses that have gone stale
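These KPIs are straightforward to derive from request counters. A minimal sketch (the counter and property names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class CacheStats:
    hits: int = 0
    misses: int = 0
    stale_served: int = 0  # hits later flagged as outdated

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

    @property
    def staleness_rate(self) -> float:
        return self.stale_served / self.hits if self.hits else 0.0

    def cost_savings(self, cost_per_inference: float) -> float:
        # Every hit is a request that never reached the GPU
        return self.hits * cost_per_inference

stats = CacheStats(hits=420, misses=580, stale_served=21)
print(f"{stats.hit_rate:.0%}")        # 42%
print(f"{stats.staleness_rate:.0%}")  # 5%
print(stats.cost_savings(0.002))      # roughly $0.84 saved
```

Tracking staleness alongside hit rate matters: a high hit rate is worthless if a growing share of those hits serve outdated answers.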