Setting up CI/CD for ML models: automatic training and deployment
CI/CD for ML is fundamentally different from classic software CI/CD: it requires testing not only code but also data, model metrics, and inference performance. A full-fledged ML pipeline includes automatic retraining, model quality validation, promotion to staging/production, and rollback in case of degradation.
ML CI/CD Architecture
[Git Push] → [Data Validation] → [Model Training] → [Model Evaluation]
→ [Model Registry] → [Staging Deploy] → [Integration Tests]
→ [Canary Deploy] → [Production Promote] → [Monitoring]
Each stage is a separate task in the pipeline with clear success/failure criteria.
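The idea of stages with explicit success criteria can be illustrated with a minimal sketch. The `Stage` class, the `run_pipeline` helper, and the toy stage bodies below are hypothetical and stand in for real training and deployment steps; an actual orchestrator provides this logic itself.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class Stage:
    name: str
    run: Callable[[dict], dict]    # receives the shared context, returns updates to it
    check: Callable[[dict], bool]  # success criterion evaluated after run


def run_pipeline(stages: list[Stage], context: dict) -> tuple[bool, list[str]]:
    """Run stages in order and stop at the first one whose criterion fails."""
    completed = []
    for stage in stages:
        context.update(stage.run(context))
        if not stage.check(context):
            return False, completed
        completed.append(stage.name)
    return True, completed


# Toy stages (assumed values): evaluation fails because f1 is below 0.92
stages = [
    Stage("data_validation", lambda ctx: {"rows": 1000}, lambda ctx: ctx["rows"] > 0),
    Stage("model_training", lambda ctx: {"f1": 0.90}, lambda ctx: "f1" in ctx),
    Stage("model_evaluation", lambda ctx: {}, lambda ctx: ctx["f1"] >= 0.92),
    Stage("staging_deploy", lambda ctx: {"deployed": True}, lambda ctx: ctx["deployed"]),
]
ok, completed = run_pipeline(stages, {})
print(ok, completed)
```

Because the evaluation criterion fails, the staging deploy never runs, which is the behavior each gate in the diagram is meant to provide.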
Orchestration tools
GitHub Actions / GitLab CI: suitable for small teams. Running training on self-hosted runners with GPUs is often sufficient.
Kubeflow Pipelines: a Kubernetes-native orchestrator for ML. Each step runs in a separate container. It supports caching of intermediate results, a visual pipeline graph, and versioning.
MLflow Projects + Prefect/Airflow: a more modular approach. Prefect or Airflow handles orchestration, and MLflow handles tracking.
Vertex AI Pipelines / SageMaker Pipelines: managed options for Google Cloud and AWS, respectively.
Example pipeline on GitHub Actions
name: ML Training Pipeline
on:
  schedule:
    - cron: '0 2 * * 1' # Weekly on Mondays at 02:00 UTC
  push:
    paths:
      - 'src/train.py'
      - 'params.yaml'
jobs:
  validate-data:
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4
      - name: Run Great Expectations
        run: python validate_data.py --suite training_data
  train-model:
    needs: validate-data
    runs-on: [self-hosted, gpu]
    steps:
      - uses: actions/checkout@v4 # each job needs its own checkout
      - name: Train
        run: python train.py --config params.yaml
      - name: Evaluate
        run: python evaluate.py --threshold 0.92
  promote-to-staging:
    needs: train-model
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Register and promote
        run: |
          python scripts/promote_model.py \
            --stage staging \
            --min-f1 0.92
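A step such as `python evaluate.py --threshold 0.92` blocks the rest of the job only if the script exits with a non-zero code. The source does not show `evaluate.py`, so the following is a minimal sketch of what such a gate script might look like. The `--metrics` flag and the `metrics.json` file with an `f1` key are assumptions made for illustration.

```python
import argparse
import json
from pathlib import Path


def main(argv):
    """Return 0 if the stored F1 score meets the threshold, 1 otherwise."""
    parser = argparse.ArgumentParser(description="Fail the CI step if F1 is too low")
    parser.add_argument("--threshold", type=float, required=True)
    # Hypothetical: the training step is assumed to write its metrics to this file
    parser.add_argument("--metrics", type=Path, default=Path("metrics.json"))
    args = parser.parse_args(argv)

    f1 = json.loads(args.metrics.read_text())["f1"]
    if f1 < args.threshold:
        print(f"FAIL: f1={f1:.3f} is below threshold {args.threshold:.3f}")
        return 1
    print(f"OK: f1={f1:.3f}")
    return 0
```

In a real script the last line would be something like `raise SystemExit(main(sys.argv[1:]))`, so that the CI runner sees the exit code and skips the dependent jobs on failure.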
Model testing as part of CI
Data validation (Great Expectations, Pandera): check the schema, distributions, and outliers in the training data before training starts.
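To make the kind of check concrete, here is a minimal standard-library sketch covering schema and outlier checks only. It is a stand-in for Great Expectations or Pandera, not their API. The column names, the IQR factor, and the allowed outlier share are assumptions.

```python
import statistics

# Hypothetical schema: column name -> expected Python type
EXPECTED_SCHEMA = {"age": float, "income": float, "label": int}


def validate_rows(rows, schema=EXPECTED_SCHEMA, iqr_factor=3.0, max_outlier_share=0.01):
    """Return a list of problems; an empty list means the data passed."""
    errors = []
    # Schema check: every row must have exactly the expected columns and types
    for i, row in enumerate(rows):
        if set(row) != set(schema):
            errors.append(f"row {i}: expected columns {sorted(schema)}, got {sorted(row)}")
            continue
        for col, expected_type in schema.items():
            if not isinstance(row[col], expected_type):
                errors.append(f"row {i}: {col} is not {expected_type.__name__}")
    if errors:
        return errors  # outlier statistics are not meaningful on malformed rows

    # Outlier check: flag a float column if too many values fall far outside the IQR
    for col, expected_type in schema.items():
        if expected_type is not float:
            continue
        values = [row[col] for row in rows]
        q1, _, q3 = statistics.quantiles(values, n=4)
        spread = q3 - q1
        low, high = q1 - iqr_factor * spread, q3 + iqr_factor * spread
        share = sum(v < low or v > high for v in values) / len(values)
        if share > max_outlier_share:
            errors.append(f"{col}: {share:.1%} of values outside [{low:.1f}, {high:.1f}]")
    return errors
```

A validation script in CI would exit with a non-zero code whenever the returned list is non-empty, so training never starts on bad data.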
Model evaluation gates: the model advances to the next stage only if its metrics exceed the threshold values. It is important to compare the candidate not just with an absolute threshold but with the current production model: the new version should be at least 1-2% better than the existing one.
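A gate of this kind might look like the sketch below. It assumes that "1-2% better" means a relative improvement and falls back to an absolute floor when no production model exists yet. The function name, the metric dictionaries, and the default values are hypothetical.

```python
def passes_gate(candidate, production=None, metric="f1",
                min_relative_gain=0.01, absolute_floor=0.92):
    """Decide whether a candidate model may advance to the next stage."""
    if candidate[metric] < absolute_floor:
        return False  # never promote a model below the absolute minimum
    if production is None:
        return True  # first deployment: only the absolute floor applies
    # Require the candidate to beat production by the relative margin
    required = production[metric] * (1 + min_relative_gain)
    return candidate[metric] >= required
```

With a production F1 of 0.92 and a 1% margin, the candidate needs roughly 0.929 or higher to advance.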
Inference latency tests: automatic testing of p95 inference latency. If the model has become more accurate but is three times slower, that's not progress.
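A latency check can be sketched with the standard library alone. The `predict` callable, the warm-up count, and the 1.2x allowed slowdown are assumptions; a real test would call the deployed inference endpoint with representative inputs.

```python
import statistics
import time


def p95_latency_ms(predict, inputs, warmup=5):
    """Measure per-request latency and return the 95th percentile in milliseconds."""
    for x in inputs[:warmup]:
        predict(x)  # warm-up calls are not measured
    samples = []
    for x in inputs:
        start = time.perf_counter()
        predict(x)
        samples.append((time.perf_counter() - start) * 1000)
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile
    return statistics.quantiles(samples, n=100)[94]


def latency_regressed(candidate_p95, baseline_p95, max_slowdown=1.2):
    """True if the candidate is slower than the baseline by more than the allowed factor."""
    return candidate_p95 > baseline_p95 * max_slowdown
```

The CI step would measure both the candidate and the current production model on the same inputs and fail when `latency_regressed` returns `True`.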
Shadow testing: the new model receives a copy of production traffic in parallel with the current one, and the results are compared without affecting users.
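The mechanics of shadow testing can be sketched as a small router. The `ShadowRouter` class and its counters are hypothetical, and both models are represented as plain callables. In a real service the candidate would typically run asynchronously so it adds no latency for the user.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ShadowRouter:
    production: Callable[[Any], Any]
    candidate: Callable[[Any], Any]
    total: int = 0
    disagreements: int = 0
    candidate_errors: int = 0

    def handle(self, request):
        """Serve the production answer; run the candidate only for comparison."""
        response = self.production(request)
        self.total += 1
        try:
            if self.candidate(request) != response:
                self.disagreements += 1
        except Exception:
            # A failing candidate must never affect the user-facing response
            self.candidate_errors += 1
        return response
```

The ratio `disagreements / total` then gives a first signal of how differently the candidate behaves on live traffic before any user sees its output.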
Deployment strategies
| Strategy | Risk | Rollback | Use case |
|---|---|---|---|
| Blue-Green | Medium | Instant | Small models |
| Canary (5% → 25% → 100%) | Low | Fast | Critical services |
| Shadow | Minimal | Not needed | Risk-free testing |
| Rolling | Medium | Slow | Stateless inference |
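The canary row in the table (5% → 25% → 100%) can be sketched as a loop that raises the traffic share step by step and backs out on the first failed health check. The `set_traffic_share` and `is_healthy` callbacks are hypothetical hooks into the load balancer and monitoring system.

```python
def canary_rollout(set_traffic_share, is_healthy, steps=(5, 25, 100)):
    """Shift traffic to the new model in steps; revert to 0% if a step is unhealthy."""
    for percent in steps:
        set_traffic_share(percent)
        if not is_healthy(percent):
            set_traffic_share(0)  # send all traffic back to the old model
            return False
    return True
```

In practice each step would also wait for an observation window before the health check, so that enough traffic has reached the new model to judge it.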
Rollback mechanism
Automatic rollback should be triggered when:
- Business metrics (CTR, conversion) drop by more than X%.
- The error rate of the inference service exceeds the threshold.
- p99 latency exceeds the SLA.
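# Monitoring and auto-rollback
# model_registry and alert_team stand for the registry client and alerting hook in use
if current_model_metrics['f1'] < production_model_metrics['f1'] * 0.97:
    # F1 fell more than 3% below the production baseline: restore the previous version
    model_registry.transition_to_stage(current_version, 'Archived')
    model_registry.transition_to_stage(previous_version, 'Production')
    alert_team("Auto-rollback triggered")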
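The three trigger conditions listed above can be combined into a single check, sketched below. The `Thresholds` defaults (a 5% business-metric drop, a 1% error rate, a 200 ms p99 SLA) are placeholder assumptions, and CTR is used as the example business metric.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    max_business_drop: float = 0.05  # assumed value for "X%"
    max_error_rate: float = 0.01     # assumed error-rate threshold
    p99_sla_ms: float = 200.0        # assumed latency SLA


def rollback_reasons(baseline_ctr, current_ctr, error_rate, p99_ms, thresholds=None):
    """Return the list of triggered rollback conditions (empty if healthy)."""
    t = thresholds or Thresholds()
    reasons = []
    relative_drop = (baseline_ctr - current_ctr) / baseline_ctr
    if relative_drop > t.max_business_drop:
        reasons.append("business_metric_drop")
    if error_rate > t.max_error_rate:
        reasons.append("error_rate")
    if p99_ms > t.p99_sla_ms:
        reasons.append("latency_sla")
    return reasons
```

A monitoring job could call this on a schedule and trigger the registry transition whenever the returned list is non-empty, including the reasons in the alert.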
Setup times
Basic pipeline (training + deployment to staging): 1 week. Full pipeline with tests, canary deployment, and automatic rollback: 3-4 weeks. Enterprise version with Kubeflow and full integration into corporate CI/CD: 6-8 weeks.







