ML Model Production Performance Monitoring Setup
Monitoring an ML model in production means tracking not only quality metrics (AUC, F1, RMSE) but also infrastructure metrics (latency, throughput, GPU utilization), business metrics, and operational indicators. Without comprehensive monitoring, degradation cannot be detected and handled quickly.
Monitoring Levels
Level 1 — Infrastructure:
- Latency: p50, p95, p99 of inference requests
- Throughput: requests per second
- Error rate: 5xx errors, timeouts
- Resource utilization: CPU/GPU/RAM, memory bandwidth
- Queue depth: when using batch inference
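To make the latency percentiles concrete, here is a minimal sketch that computes p50, p95, and p99 from a window of raw request latencies using the nearest-rank method. The function names and the sample window are hypothetical; in a Prometheus setup, percentiles would normally be estimated from histogram buckets rather than from raw samples.

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile: smallest sample with at least q% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # q is an integer percentage, so q * n / 100 is exact whenever it is a whole number.
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

def latency_summary(latencies_s):
    """Return the p50/p95/p99 latencies for one monitoring window."""
    return {f"p{q}": percentile(latencies_s, q) for q in (50, 95, 99)}

# Hypothetical window of 100 request latencies in seconds: 1 ms to 100 ms.
window = [i / 1000 for i in range(1, 101)]
summary = latency_summary(window)
```

Tail percentiles (p95, p99) are tracked alongside the median because a small share of slow requests can go unnoticed in an average.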
Level 2 — Data and Model:
- Feature statistics: mean, std, min, max, null rate for each input feature
- Prediction distribution: histogram of predictions
- Confidence distribution: for classifiers
- Data drift: KS-test, PSI (see drift monitoring details)
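As an illustration of the drift check, the sketch below computes the Population Stability Index (PSI) for one numeric feature using equal-width bins taken from the reference sample. The bin count, the `eps` smoothing for empty bins, and the function name are assumptions made for this example, not a prescribed implementation; both samples are assumed to be non-empty.

```python
import math

def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index between two samples of one numeric feature."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # guard against a constant reference feature

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range values into the first or last bin.
            idx = min(bins - 1, max(0, int((v - lo) / width)))
            counts[idx] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))

# Hypothetical data: a uniform reference sample and a copy shifted by 0.5.
reference = [i / 100 for i in range(100)]
shifted = [v + 0.5 for v in reference]
drift_score = psi(reference, shifted)
```

Identical samples give a PSI of zero, and the score grows as the current distribution moves away from the reference.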
Level 3 — Business Metrics:
- Proxy-metrics: CTR, conversion, and engagement, which are available without waiting for ground truth
- Downstream business KPIs: revenue impact, churn rate
- A/B metrics when testing versions in parallel
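A proxy metric such as CTR can be broken down by model version when two versions run in parallel. The sketch below assumes a hypothetical event format of (model_version, clicked) pairs; a real comparison would also need a significance test before drawing conclusions.

```python
from collections import defaultdict

def ctr_by_version(events):
    """events: iterable of (model_version, clicked) pairs; returns CTR per version."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for version, was_clicked in events:
        shown[version] += 1
        clicked[version] += int(was_clicked)
    return {version: clicked[version] / shown[version] for version in shown}

# Hypothetical impression log for two versions served in parallel.
events = [
    ("v1", True), ("v1", False), ("v1", False), ("v1", False),
    ("v2", True), ("v2", True), ("v2", False), ("v2", False),
]
rates = ctr_by_version(events)
```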
Monitoring Stack
Prometheus + Grafana is a standard choice for infrastructure metrics. ML-specific metrics are exported via prometheus_client:
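from prometheus_client import Histogram

# Latency buckets in seconds, from 10 ms to 2.5 s.
REQUEST_LATENCY = Histogram(
    'ml_inference_latency_seconds',
    'Inference request latency',
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5]
)

# Score buckets 0.0, 0.1, ..., 1.0; i / 10 avoids float drift such as 0.30000000000000004.
PREDICTION_DISTRIBUTION = Histogram(
    'ml_prediction_score',
    'Distribution of model prediction scores',
    buckets=[i / 10 for i in range(11)]
)

# `model` is assumed to be a fitted binary classifier loaded at service startup.
@REQUEST_LATENCY.time()
def predict(features):
    # Probability of the positive class for a single-row input.
    score = model.predict_proba(features)[0][1]
    PREDICTION_DISTRIBUTION.observe(score)
    return score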
Evidently + Grafana handles drift monitoring with visualization. Evidently computes drift metrics that can be exported to Prometheus and charted in Grafana.
OpenTelemetry is a standardized way to instrument services for tracing, metrics, and logs. It is especially useful in microservice architectures where inference is one of many services.
Logging Prediction Pairs
To compute quality metrics later, once ground truth arrives, log each (request, prediction) pair with a unique ID:
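import uuid
from datetime import datetime, timezone

# `model`, `prediction_store`, and MODEL_VERSION are assumed to be defined elsewhere.
def predict_and_log(request_features):
    prediction_id = str(uuid.uuid4())
    # Assumes request_features is a single-row input, so predict returns one value.
    prediction = model.predict(request_features)[0]

    # Log to ClickHouse/BigQuery/Kafka
    prediction_store.log({
        'prediction_id': prediction_id,
        # Timezone-aware UTC; datetime.utcnow() is deprecated since Python 3.12.
        'timestamp': datetime.now(timezone.utc),
        'features': request_features.to_dict(),
        'prediction': float(prediction),
        'model_version': MODEL_VERSION
    })
    return prediction, prediction_id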
When ground truth becomes known (e.g., the user did or did not make a purchase), it is recorded with the same prediction_id, and the system computes the actual quality metrics.
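The join between logged predictions and late-arriving labels can be sketched as follows. The in-memory dictionaries, the 0.5 decision threshold, and the function name are assumptions for illustration; in practice the join would typically run as a query in the prediction store.

```python
def quality_metrics(predictions, labels, threshold=0.5):
    """predictions: {prediction_id: score}; labels: {prediction_id: 0 or 1}."""
    tp = fp = fn = tn = 0
    for prediction_id, label in labels.items():
        if prediction_id not in predictions:
            continue  # label refers to a prediction that was never logged
        predicted = int(predictions[prediction_id] >= threshold)
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif not predicted and label:
            fn += 1
        else:
            tn += 1
    matched = tp + fp + fn + tn
    return {
        "matched": matched,
        "accuracy": (tp + tn) / matched if matched else None,
        "precision": tp / (tp + fp) if tp + fp else None,
        "recall": tp / (tp + fn) if tp + fn else None,
    }

# Hypothetical logged scores and ground-truth labels keyed by prediction_id.
predictions = {"a": 0.9, "b": 0.2, "c": 0.7, "d": 0.4}
labels = {"a": 1, "b": 0, "c": 0, "d": 1, "unknown": 1}
metrics = quality_metrics(predictions, labels)
```

Reporting the number of matched pairs next to the metrics helps show how much of the traffic has received ground truth so far.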
Dashboards
Recommended Grafana dashboard structure:
- Operational Overview — latency, throughput, error rate in real-time
- Model Health — prediction distribution, feature statistics, drift metrics
- Business Impact — proxy-metrics and downstream KPIs
- Model Comparison — compare current and previous version during canary deployment
Alerting
Alert levels and channels:
- Warning (Slack): drift PSI > 0.15, latency p99 > 500ms
- Critical (PagerDuty): error rate > 1%, latency p99 > 2s, share of positive predictions near 0% or 100%
- Fatal (page on-call): inference service unavailable
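The thresholds above can be expressed as a small rule function. This is a sketch only: the metric key names are hypothetical, and "near 0% or 100%" is interpreted here as at most 1% or at least 99%, which is an assumption. In a Prometheus-based setup these rules would more commonly be written as alerting rules and routed through Alertmanager.

```python
def alert_level(m):
    """Map a metrics snapshot to the highest matching alert level, or None."""
    if not m["service_up"]:
        return "fatal"
    if (
        m["error_rate"] > 0.01
        or m["latency_p99_s"] > 2.0
        # Assumed cut-offs for "near 0% or 100%" positive predictions.
        or m["positive_rate"] <= 0.01
        or m["positive_rate"] >= 0.99
    ):
        return "critical"
    if m["psi"] > 0.15 or m["latency_p99_s"] > 0.5:
        return "warning"
    return None

# Hypothetical snapshot of a healthy service.
healthy = {
    "service_up": True,
    "error_rate": 0.001,
    "latency_p99_s": 0.2,
    "psi": 0.05,
    "positive_rate": 0.3,
}
level = alert_level(healthy)
```

Checking the most severe conditions first ensures that a snapshot matching several rules is reported at the highest level only.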
With monitoring configured, the average time from problem detection to the start of investigation is 5-10 minutes, versus several hours without it.







