MLOps: Infrastructure for Training, Deploying, and Monitoring ML Models
A model is trained, the metrics are excellent. Three months later, production quality has dropped 12%. No one knows exactly when: there is no monitoring. Quick retraining is impossible: the training script lives in the notebook of a data scientist who has since quit. The retraining data has to be collected manually from three systems. This is not hypothetical; it happens in roughly half of the cases we see.
MLOps is the engineering discipline that makes ML systems reproducible, manageable, and maintainable in production.
Experiment Tracking and Reproducibility
Without experiment tracking, an ML project quickly becomes chaos: which checkpoint is best, what hyperparameters were used, which dataset it was trained on. Reproducing a result a month later turns into a quest.
MLflow is the open-source standard for tracking. It logs parameters, metrics, artifacts (models, plots), and code. MLflow Model Registry provides centralized model storage with versioning and lifecycle stages (Staging → Production → Archived). Deploy via MLflow Serving or integrate with external systems.
```python
import mlflow

mlflow.set_experiment("fraud-detection-v2")

with mlflow.start_run():
    mlflow.log_params({"learning_rate": 3e-4, "batch_size": 64, "epochs": 10})
    # inside the training loop:
    mlflow.log_metric("val_f1", val_f1, step=epoch)
    mlflow.pytorch.log_model(model, "model")
```
This is the minimum. Production setups also log system metrics (GPU utilization, memory), the dataset (hash, version), and the code (git commit hash).
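A sketch of those extra fields, using only the standard library (helper names are mine; assumes the training code runs from a git checkout and the dataset is a file on disk):

```python
import hashlib
import subprocess

def dataset_fingerprint(path: str) -> str:
    """SHA-256 of the dataset file: a cheap, exact version identifier."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def git_commit() -> str:
    """Current commit hash of the training code."""
    return subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()

# Inside the MLflow run, alongside the hyperparameters:
# mlflow.log_params({"dataset_sha256": dataset_fingerprint("data/train.parquet"),
#                    "git_commit": git_commit()})
```

With these two values in every run, "which data and which code produced this model" stops being a question.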
Weights & Biases offers a richer UI, collaboration features, and sweeps for hyperparameter optimization. It is preferable for teams with active experimentation; MLflow wins for on-premise deployments without external dependencies.
DVC (Data Version Control) versions data and models on top of git. The data itself is stored in S3/GCS/Azure Blob; only metadata (hashes) lives in git. dvc repro reproduces the entire pipeline from raw data to metrics. Without it, "dataset version 3 with augmentations" is just a hope that colleagues remember.
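What dvc repro actually replays is the stage graph in dvc.yaml; a minimal sketch (stage names, scripts, and paths are illustrative):

```yaml
stages:
  preprocess:
    cmd: python preprocess.py data/raw data/processed
    deps: [preprocess.py, data/raw]
    outs: [data/processed]
  train:
    cmd: python train.py data/processed models/model.pkl
    deps: [train.py, data/processed]
    outs: [models/model.pkl]
    metrics: [metrics.json]
```

DVC hashes every dep and out, so only stages whose inputs changed are re-run.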
Training Pipelines: Kubeflow, Airflow, Prefect
When is an orchestrator needed? A 100-line training script in cron is fine for simple tasks. But a multi-step pipeline (data loading → preprocessing → feature engineering → training → validation → deploy if quality is above a threshold) needs an orchestrator with retry logic, visualization, and alerts.
Kubeflow Pipelines is a Kubernetes-native ML orchestrator. Each pipeline step is a Docker container. It supports parallel steps, conditionals, and artifacts passed between steps, and integrates with Katib (AutoML), KServe (serving), and Feast (feature store). The entry barrier is high, but it scales to hundreds of parallel runs.
Apache Airflow is a more generic DAG orchestrator, not ML-specific, with a wide operator ecosystem (S3, Spark, dbt, Kubernetes). It is simpler to deploy than Kubeflow if the company already runs Airflow.
Prefect and Metaflow are more modern alternatives with less boilerplate. Prefect 2.x with its @flow and @task decorators is a fast start for small teams.
Typical Kubeflow training pipeline architecture:
- Data ingestion — fetch from S3/DB, validate schema via Great Expectations
- Preprocessing — transformations, normalization, train/val/test split
- Training — GPU training, MLflow logging
- Evaluation — compute metrics, compare with baseline in Model Registry
- Conditional deployment — deploy only if new model beats current by >2% F1
Each component is a separate Docker image. The pipeline is versioned in git and run manually or on a schedule (for example, weekly retraining on new data).
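The conditional-deployment step boils down to one comparison against the registry baseline. A sketch (function name is mine; I read ">2% F1" as absolute F1 points — adjust if the bar is relative):

```python
def should_deploy(new_f1: float, prod_f1: float, min_gain: float = 0.02) -> bool:
    """Gate for the pipeline's last step: deploy only on a clear improvement.

    min_gain is in absolute F1 points (0.02 = the >2% threshold above).
    """
    return new_f1 - prod_f1 > min_gain

# Example: current Production model has F1 = 0.91
assert should_deploy(0.94, 0.91) is True   # +3 points: deploy
assert should_deploy(0.92, 0.91) is False  # +1 point: keep the current model
```

Keeping the gate in one pure function makes it trivially unit-testable in CI, separate from the pipeline plumbing.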
Model Registry and Lifecycle Management
A Model Registry is not just checkpoint storage. It is a centralized system that knows:
- Current production model (and its metrics)
- History of all versions with training parameters
- Metadata: dataset, git commit, validation results
- Lifecycle: None → Staging → Production → Archived
MLflow Model Registry is the standard; for enterprise, there are Vertex AI (GCP), SageMaker (AWS), and Azure ML.
Models are promoted through stages: auto-promotion to Staging after a successful eval, then manual or automatic (via A/B test) promotion to Production. Rollback is a switch back to the previous Production version and takes seconds.
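The stage flow above can be encoded as a small transition table and checked in CI before any registry call is made. This is a policy sketch of my own, stricter than what a registry typically enforces, with rollback modeled as re-promoting an archived version:

```python
# Allowed lifecycle transitions: None -> Staging -> Production -> Archived,
# plus Archived -> Production for rollback.
ALLOWED = {
    "None": {"Staging", "Archived"},
    "Staging": {"Production", "Archived"},
    "Production": {"Archived"},
    "Archived": {"Production"},  # rollback: re-promote a previous version
}

def can_transition(current: str, target: str) -> bool:
    """True if moving a model version from `current` to `target` is allowed."""
    return target in ALLOWED.get(current, set())
```

A promotion script that calls this first fails fast on nonsense like None → Production, instead of silently skipping Staging.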
Serving: From Flask to Triton
The simple case: FastAPI + PyTorch/ONNX on a single server. Roughly 80% of production ML deployments look like this, and it is sufficient for most tasks under ~100 req/s.
```python
from fastapi import FastAPI
from pydantic import BaseModel
import onnxruntime as ort

app = FastAPI()
session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])

class PredictRequest(BaseModel):
    text: str

@app.post("/predict")
async def predict(request: PredictRequest):
    inputs = preprocess(request.text)  # tokenization etc., model-specific
    outputs = session.run(None, {"input_ids": inputs})
    return {"label": postprocess(outputs)}
```
Triton Inference Server is the production standard for high load: dynamic batching (requests are grouped automatically), concurrent model execution, model ensembles. It supports TensorRT, ONNX, PyTorch TorchScript, and TensorFlow SavedModel. At 500+ req/s with multiple models, Triton beats virtually any custom serving.
KServe (formerly KFServing) is Kubernetes-native ML serving with autoscaling, canary deployments, and A/B testing out of the box. Scale-to-zero for inactive models saves infrastructure cost.
vLLM for LLM serving is a separate story, covered in the LLM section.
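Dynamic batching is switched on in the model's config.pbtxt; a minimal sketch (model name, batch sizes, and instance count are illustrative):

```
name: "fraud_onnx"
platform: "onnxruntime_onnx"
max_batch_size: 64
dynamic_batching {
  preferred_batch_size: [ 8, 16, 32 ]
  max_queue_delay_microseconds: 100
}
instance_group [ { count: 2, kind: KIND_GPU } ]
```

max_queue_delay_microseconds is the latency you trade for larger batches: Triton waits up to that long to fill a preferred batch size.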
Monitoring: Data Drift, Model Drift, Infrastructure Metrics
Monitoring is usually built last and regretted first. There are three levels.
Infrastructure monitoring: latency (P50/P95/P99), throughput (req/s), error rate (4xx, 5xx), GPU/CPU utilization. Prometheus + Grafana is the standard. Alert when P99 latency exceeds a threshold or the error rate exceeds 1%.
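Those two alerts as a Prometheus rule sketch (the metric names and the 500 ms threshold are illustrative; use whatever your serving layer actually exports):

```yaml
groups:
  - name: ml-serving
    rules:
      - alert: HighP99Latency
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 0.5
        for: 5m
        labels:
          severity: warning
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
```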
Data drift monitoring: the input distribution changes over time. Detect it via:
- PSI (Population Stability Index) for numeric features: PSI > 0.2 signals strong drift
- Chi-squared test for categorical features
- Kolmogorov–Smirnov test for continuous features
- Evidently AI, an open-source library with ready-made drift tests and HTML reports
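PSI itself is only a few lines. A sketch over pre-binned counts, with epsilon smoothing so an empty bin does not produce log(0):

```python
import math

def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index over matching histogram bins.

    expected_counts: bin counts from the training (reference) window.
    actual_counts:   bin counts from the production window.
    """
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    score = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_pct = max(e / e_total, eps)
        a_pct = max(a / a_total, eps)
        score += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return score

# Identical distributions give PSI = 0; the 0.2 threshold above flags strong drift.
```

Run it per feature on a schedule (hourly or daily windows against the training distribution) and feed the scores into alerting.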
Model drift monitoring: if ground truth arrives with a lag, monitor the actual quality metrics. If not, use surrogate metrics: the prediction score distribution, the share of confident predictions.
Alerting has three levels: INFO (minor drift, just log it), WARNING (significant drift, notify the team), CRITICAL (quality below threshold, auto-switch to a fallback or route to human review).
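A sketch of routing those signals into levels (the thresholds are illustrative policy, not a standard; the function and parameter names are mine):

```python
from typing import Optional

def alert_level(psi_score: float,
                quality: Optional[float] = None,
                quality_floor: float = 0.80) -> str:
    """Map monitoring signals to an alert level.

    quality is the latest ground-truth metric if it is available yet (else None);
    psi_score is the worst per-feature drift score.
    """
    if quality is not None and quality < quality_floor:
        return "CRITICAL"   # quality below threshold: fallback / human review
    if psi_score > 0.2:
        return "WARNING"    # strong drift: notify the team
    if psi_score > 0.1:
        return "INFO"       # minor drift: just log it
    return "OK"
```

Because ground truth arrives with a lag, CRITICAL usually fires much later than the drift-based WARNING, which is exactly why both levels exist.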
Feature Store
A Feature Store solves the training-serving skew problem: if feature engineering differs between training and inference, a mismatch is inevitable.
Feast is the open-source Feature Store: an offline store (S3 + Parquet/Delta) for training and an online store (Redis, DynamoDB) for low-latency inference. Feature definitions live as code, and a materialization job syncs offline → online.
Tecton (commercial), Vertex AI Feature Store (GCP), and SageMaker Feature Store (AWS) are managed options with less ops overhead.
When it's needed: multiple models share the same features; features are computed from streaming data in real time; a large team where feature engineering and model training are done by different people.
CI/CD for ML
CI/CD for ML is regular CI/CD plus ML-specific steps.
ML-specific CI checks:
- Reproducibility check: run training with a fixed seed; the result must match
- Data validation: schema and distribution checks via Great Expectations or Pandera
- Model performance check: automatic eval on a holdout set; block the merge if degradation exceeds a threshold
- Latency regression test: inference must meet the SLA
GitOps for model deployment: merge to main → CI runs training → eval → if it passes → auto-deploy to Staging → smoke tests → manual promotion to Production (or automatic after a successful canary).
Tools: GitHub Actions / GitLab CI for CI, ArgoCD for GitOps on Kubernetes.
Common MLOps Mistakes
No training reproducibility. Fix the random seeds (torch.manual_seed, numpy.random.seed, random.seed) and record them in the experiment metadata. Without this, debugging irreproducible results is painful.
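A seed helper sketch, runnable with the standard library alone; the numpy/torch lines are shown as comments since they apply only when those libraries are in the stack:

```python
import os
import random

def seed_everything(seed: int = 42) -> None:
    """Pin every RNG the training run touches; log `seed` with the experiment."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    # numpy.random.seed(seed)      # if numpy is used
    # torch.manual_seed(seed)      # if torch is used (seeds CUDA devices too)

seed_everything(42)
a = [random.random() for _ in range(3)]
seed_everything(42)
b = [random.random() for _ in range(3)]
assert a == b  # same seed, same draws
```

Note that seeding alone does not guarantee bit-identical GPU training (nondeterministic kernels exist), but it removes the easiest source of irreproducibility.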
Training-serving skew. Preprocessing differs between training and inference. The solution is a single preprocessing module used in both; for sklearn, a Pipeline object that includes the transforms.
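The cheapest insurance is literally one function imported by both the training job and the serving endpoint. A pure-Python sketch (the feature names are illustrative, for a fraud-style model):

```python
# features.py -- imported by BOTH train.py and the serving app
def preprocess(raw: dict) -> list:
    """Single source of truth for feature engineering."""
    return [
        float(raw["amount"]),
        # spend relative to the customer's average, guarded against zero
        float(raw["amount"]) / max(float(raw.get("avg_amount", 1.0)), 1e-9),
        # cross-border flag
        1.0 if raw.get("country") != raw.get("card_country") else 0.0,
    ]

# train.py:      X = [preprocess(r) for r in training_rows]
# serving app:   features = preprocess(request_payload)
```

Any change to feature logic now hits both paths in the same commit, and a unit test over preprocess covers training and serving at once.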
No data versioning. "Dataset v3" in a filename is not versioning. DVC + S3 is the minimum.
No fallback. The model crashes or degrades, and there is no plan. Always have a fallback: the previous model version, rule-based logic, or a graceful "can't answer".
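A fallback wrapper sketch (names are mine): never let an inference failure surface as a raw 500 when a degraded answer exists.

```python
def predict_with_fallback(model_predict, fallback_predict, request):
    """Serve from the model; on any failure, degrade gracefully."""
    try:
        return {"label": model_predict(request), "source": "model"}
    except Exception:
        # previous model version / rule-based logic / honest "can't answer"
        return {"label": fallback_predict(request), "source": "fallback"}

# Example: a rule-based fallback for a fraud model
def rule_based(r):
    return "review" if r.get("amount", 0) > 10_000 else "ok"
```

Tagging the response with its source also lets monitoring count how often the fallback actually fires.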
Timelines and Stages
Basic MLOps infrastructure (experiment tracking, Model Registry, serving, basic monitoring): 4-6 weeks. A full platform with Kubeflow, a Feature Store, CI/CD, and advanced monitoring: 3-5 months. An audit of the existing infrastructure plus a roadmap: 1-2 weeks.







