Setting up Shadow Deployment for ML models
Shadow deployment (mirror deployment) is a strategy in which a new version of a model receives the same requests as production, but its responses are not delivered to users. The goal is to test the new model's behavior on real traffic without any risk to users.
When to use shadow deployment
- A radical change in the model architecture (for example, a transition from gradient boosting to a neural network)
- A new version that has not been fully tested yet but needs to be observed on real data
- Checking latency and resource utilization under real load
- Validating the data processing pipeline before the new version is released
- Testing a smaller LLM before it replaces a larger one
Architecture
[User Request]
      |
      ├──→ [Production Model V1] ──→ [Response to User]
      |
      └──→ [Shadow Model V2] ──→ [Prediction logged, not returned]
                                          |
                                   [Comparison DB]
                                          |
                                  [Metrics Dashboard]
Implementation with Envoy/Istio
Istio mirror:
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: ml-inference
spec:
  hosts:
  - ml-inference
  http:
  - route:
    - destination:
        host: ml-inference
        subset: v1
      weight: 100
    mirror:
      host: ml-inference
      subset: v2-shadow
    mirrorPercentage:
      value: 100  # Mirror 100% of traffic
Nginx mirror:
location /predict {
    proxy_pass http://model-v1;
    mirror /shadow;
    mirror_request_body on;
}
location = /shadow {
    internal;
    proxy_pass http://model-v2-shadow/predict;
}
Application-level implementation
For more flexible logging and comparison, mirroring can be implemented in application code:
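import asyncio
import logging
from datetime import datetime, timezone

# production_model, shadow_model, comparison_store and hash_features
# are assumed to be defined elsewhere in the service.

# Keep references to pending tasks so they are not garbage-collected
_shadow_tasks = set()

async def predict_with_shadow(request_features):
    # Production model: synchronous, on the request path
    production_result = production_model.predict(request_features)
    # Shadow model: scheduled in the background, does not delay the response
    task = asyncio.create_task(
        run_shadow_prediction(request_features, production_result)
    )
    _shadow_tasks.add(task)
    task.add_done_callback(_shadow_tasks.discard)
    return production_result

async def run_shadow_prediction(features, production_result):
    try:
        # Run the blocking predict() in a worker thread (Python 3.9+)
        # so that it does not stall the event loop
        shadow_result = await asyncio.to_thread(shadow_model.predict, features)
        # Log both predictions for comparison
        comparison_store.log({
            'timestamp': datetime.now(timezone.utc),
            'production_score': float(production_result),
            'shadow_score': float(shadow_result),
            'agreement': abs(production_result - shadow_result) < 0.1,
            'features_hash': hash_features(features)
        })
    except Exception:
        # A shadow failure must not affect production
        logging.exception("Shadow prediction failed")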
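The key property of this pattern, that a failing shadow model never reaches the user, can be checked with stub models. The following is only a minimal sketch: `StubModel`, `serve` and `demo` are hypothetical names introduced for illustration, the in-memory list stands in for the comparison store, and `asyncio.to_thread` assumes Python 3.9 or later.

```python
import asyncio

class StubModel:
    """Stand-in for a real model: returns a fixed score or raises."""
    def __init__(self, score=None, error=None):
        self.score, self.error = score, error

    def predict(self, features):
        if self.error is not None:
            raise self.error
        return self.score

async def serve(features, production, shadow, log):
    # The production prediction is computed on the request path
    result = production.predict(features)

    async def shadow_call():
        try:
            # The blocking predict() runs in a worker thread
            score = await asyncio.to_thread(shadow.predict, features)
            log.append(("ok", result, score))
        except Exception as exc:
            # The failure is recorded and never propagated to the caller
            log.append(("error", result, repr(exc)))

    return result, asyncio.create_task(shadow_call())

async def demo():
    log = []
    production = StubModel(score=0.8)
    broken_shadow = StubModel(error=RuntimeError("model file missing"))
    result, task = await serve({"x": 1}, production, broken_shadow, log)
    await task  # only the demo waits; a real handler would return right away
    return result, log

result, log = asyncio.run(demo())
print(result, log)
```

The production result is returned unchanged even though the shadow model raised, and the error ends up only in the log.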
Comparison metrics
Agreement rate is the share of requests on which the two models' predictions match (within the specified tolerance):
threshold = 0.1  # tolerance; example value, matches the logging code
df['agreement'] = (df['production'] - df['shadow']).abs() < threshold
agreement_rate = df['agreement'].mean()
# Target: > 95% agreement for critical systems
Prediction distribution comparison:
from scipy.stats import ks_2samp
ks_stat, p_value = ks_2samp(df['production'], df['shadow'])
# If p_value < 0.05, the distributions differ significantly
Latency comparison: a slower shadow model does not affect users during shadow testing, but it signals the latency problems that will appear once it starts serving production traffic.
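Latency is usually compared at tail percentiles rather than by averages. A minimal sketch using only the standard library, assuming per-request latencies in milliseconds were logged for both models; the function names and the sample values are illustrative and not taken from a real system.

```python
from statistics import quantiles

def latency_percentiles(samples_ms):
    """Return p50, p95 and p99 for a list of latencies in milliseconds."""
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile
    cuts = quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

def compare_latency(production_ms, shadow_ms):
    """Ratio of shadow to production latency at each percentile."""
    prod = latency_percentiles(production_ms)
    shadow = latency_percentiles(shadow_ms)
    return {name: shadow[name] / prod[name] for name in prod}

# Hypothetical logged latencies
production_ms = list(range(10, 111))        # 10 .. 110 ms
shadow_ms = [2 * x for x in production_ms]  # exactly twice as slow
ratios = compare_latency(production_ms, shadow_ms)
print(ratios)
```

A ratio well above 1 at p95 or p99 is worth investigating before the transition, even if the median looks acceptable.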
When to switch from Shadow to Canary
Recommendations for the transition:
- Shadow testing has run for at least one week on real traffic
- Agreement rate > 95% (or an agreed business decision on the acceptable discrepancy)
- Shadow model latency is within the SLA (even though it does not affect users yet)
- Resource utilization is normal at peak load
- No unexpected errors in the shadow service logs
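The checklist above can be encoded as an automated gate. This is a sketch under stated assumptions: `ShadowReport` and `failed_criteria` are hypothetical names, the one-week and 95% thresholds come from the list, and the SLA latency (200 ms) and CPU limit (80%) are placeholder values to be replaced with the system's real limits.

```python
from dataclasses import dataclass

@dataclass
class ShadowReport:
    """Summary of a shadow run; all field names are illustrative."""
    days_on_real_traffic: float
    agreement_rate: float        # fraction in [0, 1]
    p99_latency_ms: float
    peak_cpu_utilization: float  # fraction in [0, 1]
    unexpected_errors: int

def failed_criteria(report, *, min_days=7, min_agreement=0.95,
                    sla_latency_ms=200.0, max_cpu=0.8):
    """Return the unmet criteria; an empty list means ready for canary."""
    failures = []
    if report.days_on_real_traffic < min_days:
        failures.append("shadow run shorter than required")
    if report.agreement_rate <= min_agreement:
        failures.append("agreement rate too low")
    if report.p99_latency_ms >= sla_latency_ms:
        failures.append("latency exceeds SLA")
    if report.peak_cpu_utilization > max_cpu:
        failures.append("resource utilization too high at peak")
    if report.unexpected_errors > 0:
        failures.append("unexpected errors in shadow logs")
    return failures

# Hypothetical report: everything passes except latency
report = ShadowReport(days_on_real_traffic=9, agreement_rate=0.97,
                      p99_latency_ms=250.0, peak_cpu_utilization=0.6,
                      unexpected_errors=0)
print(failed_criteria(report))
```

Returning the list of failures rather than a single boolean makes it clear which criterion blocks the transition.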
Shadow deployment is one of the safest testing strategies, because the new model's predictions never reach users. It is especially valuable for systems where the cost of error is high: financial decisions, medical diagnostics, security systems.







