AI Inference Cost Optimization
Running AI model inference in production often costs significantly more than expected: spend on LLM APIs, GPU compute, and managed inference services grows nonlinearly with scale. Systematic optimization can cut costs by 40-70% without sacrificing quality.
Auditing current costs
The first step is to understand the cost structure:
- What percentage of queries use an expensive model even though the task could be handled by a cheaper one?
- What is the cache hit rate? Are duplicate requests being cached?
- What is the average context size? Are unnecessary tokens being sent?
- What is GPU utilization during batch inference?
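These questions can be answered directly from request logs. A minimal sketch, assuming a hypothetical log format with `model`, `cache_hit`, and `input_tokens` fields:

```python
from collections import Counter

def audit_costs(request_logs: list[dict]) -> dict:
    """Summarize model mix, cache hit rate, and context size from request logs.

    Each log entry is assumed to look like:
    {'model': 'gpt-4o', 'cache_hit': False, 'input_tokens': 1200}
    """
    n = len(request_logs)
    model_share = {m: c / n for m, c in Counter(r['model'] for r in request_logs).items()}
    cache_hit_rate = sum(r['cache_hit'] for r in request_logs) / n
    avg_input_tokens = sum(r['input_tokens'] for r in request_logs) / n
    return {
        'model_share': model_share,
        'cache_hit_rate': cache_hit_rate,
        'avg_input_tokens': avg_input_tokens,
    }

logs = [
    {'model': 'gpt-4-turbo', 'cache_hit': False, 'input_tokens': 3000},
    {'model': 'gpt-4-turbo', 'cache_hit': True,  'input_tokens': 3000},
    {'model': 'gpt-4o-mini', 'cache_hit': False, 'input_tokens': 400},
    {'model': 'gpt-4o-mini', 'cache_hit': True,  'input_tokens': 600},
]
report = audit_costs(logs)
```

A report like this quickly reveals the two biggest levers: an expensive-model share that is too high, and a cache hit rate that is too low.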
Optimization strategies
1. Model routing – the right model for each task
````python
import numpy as np

class IntelligentModelRouter:
    def route_request(self, request: dict) -> str:
        query = request['query']
        complexity = self.estimate_complexity(query)
        if complexity < 0.3:
            return "gpt-4o-mini"   # $0.15/1M input tokens
        elif complexity < 0.7:
            return "gpt-4o"        # $5.00/1M input tokens
        else:
            return "gpt-4-turbo"   # $10.00/1M input tokens

    def estimate_complexity(self, query: str) -> float:
        # Heuristics: length, presence of code, math, multi-step reasoning
        features = [
            len(query.split()) / 200,
            1.0 if any(kw in query for kw in ['calculate', 'code', 'step-by-step']) else 0,
            1.0 if '```' in query else 0,
        ]
        return min(np.mean(features) * 1.5, 1.0)
````
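To see why routing pays off, compare the blended price per million input tokens against sending everything to the top model. The traffic mix below is an illustrative assumption; the prices are the ones quoted in the router comments above:

```python
# Price per 1M input tokens (from the router comments above)
PRICES = {'gpt-4o-mini': 0.15, 'gpt-4o': 5.00, 'gpt-4-turbo': 10.00}

def blended_price(traffic_mix: dict[str, float]) -> float:
    """Weighted average $/1M input tokens for a given routing mix."""
    return sum(PRICES[m] * share for m, share in traffic_mix.items())

# Assumed mix: 60% simple, 30% medium, 10% complex queries
routed = blended_price({'gpt-4o-mini': 0.6, 'gpt-4o': 0.3, 'gpt-4-turbo': 0.1})
baseline = PRICES['gpt-4-turbo']   # everything on the top model
savings = 1 - routed / baseline    # fraction of cost eliminated by routing
```

Under this assumed mix, routing cuts the blended input-token price by roughly 74%, which is where the upper end of the savings estimates in this article comes from.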
2. Semantic caching
```python
import hashlib
import json

import numpy as np
from redis import Redis
from sentence_transformers import SentenceTransformer

class SemanticCache:
    def __init__(self, similarity_threshold: float = 0.95):
        self.redis = Redis()
        self.threshold = similarity_threshold
        self.embedder = SentenceTransformer('all-MiniLM-L6-v2')

    def get(self, query: str) -> str | None:
        query_emb = self.embedder.encode(query)
        # Linear scan over all cached entries; fine for small caches,
        # use a vector index (e.g. RediSearch) at scale
        cached_keys = self.redis.keys("cache:*")
        for key in cached_keys:
            cached_data = json.loads(self.redis.get(key))
            cached_emb = np.array(cached_data['embedding'])
            similarity = np.dot(query_emb, cached_emb) / (
                np.linalg.norm(query_emb) * np.linalg.norm(cached_emb)
            )
            if similarity > self.threshold:
                return cached_data['response']
        return None

    def set(self, query: str, response: str, ttl: int = 3600):
        key = f"cache:{hashlib.md5(query.encode()).hexdigest()}"
        self.redis.setex(key, ttl, json.dumps({
            'embedding': self.embedder.encode(query).tolist(),
            'response': response
        }))
```
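The heart of the cache is the cosine-similarity test against the threshold. A standalone sketch with toy 3-dimensional vectors standing in for real SentenceTransformer embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" standing in for SentenceTransformer output
stored = np.array([1.0, 0.2, 0.0])
near_duplicate = np.array([0.9, 0.25, 0.05])  # paraphrase of the stored query
unrelated = np.array([0.0, 0.1, 1.0])         # a different topic entirely

THRESHOLD = 0.95
hit = cosine_similarity(near_duplicate, stored) > THRESHOLD   # cache hit
miss = cosine_similarity(unrelated, stored) > THRESHOLD       # cache miss
```

The 0.95 threshold is the main tuning knob: too low and users get stale answers to genuinely different questions; too high and near-duplicates miss the cache.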
3. Prompt compression
```python
# LLMLingua for compressing long contexts
from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
    use_llmlingua2=True
)

compressed = compressor.compress_prompt(
    context,
    rate=0.5,                 # compress down to 50% of the tokens
    force_tokens=['\n', '?']  # never drop newlines and question marks
)
# Savings: 40-60% of tokens at <5% quality degradation
```
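A quick back-of-the-envelope check of what `rate=0.5` means in dollars. The daily token volume and the $5.00/1M price below are illustrative assumptions:

```python
# Illustrative: 10M context tokens/day at an assumed $5.00 per 1M input tokens
tokens_per_day = 10_000_000
price_per_1m = 5.00
rate = 0.5  # keep 50% of tokens after compression

cost_before = tokens_per_day / 1_000_000 * price_per_1m        # $/day uncompressed
cost_after = tokens_per_day * rate / 1_000_000 * price_per_1m  # $/day compressed
daily_savings = cost_before - cost_after
```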
4. Quantization for self-hosted inference
```python
# 4-bit NF4 quantization via bitsandbytes
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-chat-hf",
    quantization_config=quantization_config,
    device_map="auto"
)
# The 70B model in 4-bit: ~35GB VRAM instead of ~140GB in BF16
```
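The ~4x VRAM reduction follows directly from the bytes per parameter. A rough estimate for the weights alone, ignoring activations, the KV cache, and quantization metadata overhead:

```python
params = 70e9  # Llama-2-70B parameter count

def weight_memory_gb(bits_per_param: float) -> float:
    """Memory for model weights alone, in GB (using 1 GB = 1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9

bf16 = weight_memory_gb(16)  # ~140 GB
nf4 = weight_memory_gb(4)    # ~35 GB
```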
Expected savings
| Method | Cost reduction | Quality degradation |
|---|---|---|
| Model routing | 50-70% | <5% |
| Semantic caching | 20-40% | 0% |
| Prompt compression | 30-50% | 1-5% |
| 4-bit quantization | 40-60% (self-hosted) | 1-3% |
| Batch inference | 30-50% | 0% |
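Batch inference appears in the table but has no example above; its savings come from amortizing per-call overhead across many requests in one forward pass. A minimal sketch of the grouping logic, with the batch size limit as an assumption and the actual inference call left abstract:

```python
def make_batches(requests: list[str], max_batch_size: int = 8) -> list[list[str]]:
    """Group pending requests into fixed-size batches for a single forward pass.

    Production systems also add a max-wait timeout (continuous batching)
    so that small batches are not held back indefinitely.
    """
    return [requests[i:i + max_batch_size]
            for i in range(0, len(requests), max_batch_size)]

batches = make_batches([f"query {i}" for i in range(20)], max_batch_size=8)
```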
The combination of model routing and semantic caching provides the greatest effect without the risk of degradation for most production scenarios.