Developing an API wrapper for a Model-as-a-Service (MaaS) AI model
An API wrapper turns an ML model into a ready-to-use web service: it adds authentication, rate limiting, versioning, logging, and monitoring. It sits between the raw model and external consumers, whether a front-end application, a partner service, or an internal team.
MaaS API Architecture
[Client] → [API Gateway] → [Auth/Rate Limit] → [Request Validation]
              → [Model Router] → [Inference Service] → [Response Formatter]
                       ↕                   ↕
                [Usage Logger]       [Cache Layer]
Implementation with FastAPI
from fastapi import FastAPI, HTTPException, Depends, Header
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel, Field
import time
import hashlib

app = FastAPI(title="Model-as-a-Service API", version="1.0.0")
app.add_middleware(CORSMiddleware, allow_origins=["*"])  # tighten origins in production

# api_key_store, rate_limiter, cache, model_registry, logger, usage_logger
# and generate_request_id are assumed to be initialized elsewhere

class PredictionRequest(BaseModel):
    inputs: list[dict] = Field(..., description="List of feature dictionaries")
    model_version: str = Field(default="latest")
    options: dict = Field(default_factory=dict)

class PredictionResponse(BaseModel):
    predictions: list
    model_version: str
    request_id: str
    latency_ms: float

async def verify_api_key(x_api_key: str = Header(...)):
    if not await api_key_store.verify(x_api_key):
        raise HTTPException(status_code=401, detail="Invalid API key")
    return await api_key_store.get_client(x_api_key)

@app.post("/v1/predict", response_model=PredictionResponse)
async def predict(
    request: PredictionRequest,
    client = Depends(verify_api_key)
):
    # Rate limiting
    if not await rate_limiter.check(client.id, limit=100, window=60):
        raise HTTPException(status_code=429, detail="Rate limit exceeded")

    # Cache check
    cache_key = hashlib.md5(str(request.inputs).encode()).hexdigest()
    cached = await cache.get(cache_key)
    if cached:
        return cached

    # Inference
    start = time.perf_counter()
    try:
        model = model_registry.get(request.model_version)
        predictions = model.predict(request.inputs)
    except Exception as e:
        await logger.error(client.id, request, str(e))
        raise HTTPException(status_code=500, detail=str(e))
    latency = (time.perf_counter() - start) * 1000

    response = PredictionResponse(
        predictions=predictions,
        model_version=model.version,
        request_id=generate_request_id(),
        latency_ms=latency,
    )

    # Log usage
    await usage_logger.log(client.id, request, response, latency)
    await cache.set(cache_key, response, ttl=300)
    return response
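The helpers above (rate_limiter, cache, model_registry, and so on) are left undefined in the snippet. As one illustration, the rate_limiter.check(client_id, limit, window) call could be backed by a minimal in-memory sliding-window limiter; this is a sketch for a single process, not a production implementation (a shared store such as Redis is typical when running multiple workers):

```python
import asyncio
import time
from collections import defaultdict

class SlidingWindowRateLimiter:
    """In-memory stand-in for the rate_limiter object used above."""

    def __init__(self) -> None:
        self._hits: dict[str, list[float]] = defaultdict(list)
        self._lock = asyncio.Lock()

    async def check(self, client_id: str, limit: int, window: int) -> bool:
        now = time.monotonic()
        async with self._lock:
            # Keep only timestamps still inside the window
            hits = [t for t in self._hits[client_id] if now - t < window]
            if len(hits) >= limit:
                self._hits[client_id] = hits
                return False
            hits.append(now)
            self._hits[client_id] = hits
            return True

async def demo() -> list[bool]:
    limiter = SlidingWindowRateLimiter()
    # With limit=3, the first three calls pass and the rest are rejected
    return [await limiter.check("client-1", limit=3, window=60) for _ in range(5)]
```

The lock keeps concurrent requests from racing on the per-client timestamp list; the window check compares against time.monotonic() so wall-clock adjustments cannot skew the limit.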
API versioning
# v1: legacy format
@app.post("/v1/predict")
async def predict_v1(request: PredictionRequestV1):
    ...

# v2: new format with batch support
@app.post("/v2/predict")
async def predict_v2(request: PredictionRequestV2):
    ...

# Deprecation header for v1
@app.middleware("http")
async def add_deprecation_header(request, call_next):
    response = await call_next(request)
    if request.url.path.startswith("/v1/"):
        response.headers["Deprecation"] = "true"
        # Sunset takes an HTTP-date value (RFC 8594)
        response.headers["Sunset"] = "Wed, 31 Dec 2025 23:59:59 GMT"
    return response
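Keeping v1 alive usually means translating its payloads into the v2 shape rather than maintaining two inference paths. A sketch of such an upconverter; the v1 field names ("input", "model_version", "options") are assumptions, not part of the snippet above:

```python
def upconvert_v1_to_v2(v1_payload: dict) -> dict:
    """Map a hypothetical single-input v1 payload onto the batch-first v2 schema."""
    return {
        "inputs": [v1_payload["input"]],  # v2 batches even single requests
        "model_version": v1_payload.get("model_version", "latest"),
        "options": v1_payload.get("options", {}),
    }
```

With this in place, predict_v1 can simply upconvert and delegate to the v2 handler, so only one code path touches the model.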
Monitoring and SLA
Key metrics: p50/p95/p99 latency, error rate, request volume, cache hit rate, model version distribution. SLA goal: p95 < 200ms, error rate < 0.1%, uptime 99.9%.
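The pXX figures are order statistics over a window of recorded latencies. A nearest-rank sketch of how such a percentile can be computed (in practice a metrics library with histogram buckets, e.g. Prometheus, would do this):

```python
import math

def percentile(latencies_ms: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample >= p% of the window."""
    if not latencies_ms:
        raise ValueError("no samples recorded")
    ordered = sorted(latencies_ms)
    # Nearest-rank definition: the ceil(p% * n)-th smallest sample
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]

# Example: a window of latencies from 1 ms to 100 ms
window = [float(x) for x in range(1, 101)]
p50, p95, p99 = (percentile(window, p) for p in (50, 95, 99))
```

Checking p95 against the 200 ms SLA target is then a single comparison per reporting interval.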
Adding streaming for LLM models (SSE), webhook callbacks for long-running predictions, and Python/JavaScript SDK clients makes the API a full-fledged product, not just an HTTP endpoint.