ML lifecycle, CI/CD for ML, drift detection, and production deployment patterns
Machine learning models don't deploy themselves. Moving from a Jupyter notebook to production requires solving challenges that traditional software engineering addresses with CI/CD—testing, reproducibility, monitoring, and rollback. MLOps applies these principles to ML systems.
This guide covers the MLOps lifecycle, practical CI/CD patterns for ML, drift detection, serving architectures, and team organization.
ML systems have unique challenges that make them harder to build and maintain than traditional software. Unlike a conventional release cycle, the ML lifecycle is a loop: production monitoring feeds back into retraining whenever the data or the problem shifts:
┌─────────────────────────────────────────────────────────────┐
│ Business Goal → Data Collection → Data Processing → │
│ │
│ Feature Engineering → Model Training → Evaluation → │
│ │
│ Deployment → Monitoring → (Data/Concept Shift) → Retraining │
└─────────────────────────────────────────────────────────────┘
ML pipelines codify the steps from data to deployment:
Pipeline Components:
1. Data extraction: Load from warehouse, APIs, files
2. Data validation: Check schema, statistics, anomalies
3. Data transformation: Feature engineering, preprocessing
4. Model training: Fit model to processed data
5. Model evaluation: Validate on test set, check metrics
6. Model serving: Deploy to production
Tools: Kubeflow Pipelines, Apache Airflow, Metaflow, Prefect
For example, a minimal Kubeflow Pipelines definition (component URLs elided here):

```python
import kfp
from kfp import components
from kfp.dsl import pipeline

@pipeline(name="train-model")
def train_pipeline(
    data_path: str,
    model_type: str,
    learning_rate: float,
):
    # Components are loaded from reusable YAML definitions
    data_op = components.load_component_from_url('.../data.yaml')
    train_op = components.load_component_from_url('.../train.yaml')

    # Wire the steps: training consumes the data step's output
    data = data_op(path=data_path)
    train_task = train_op(
        data=data.output,
        model_type=model_type,
        lr=learning_rate,
    )
```
For ML, CI must validate more than code:
CI Pipeline Trigger: Pull request or push to main
1. Run code quality checks (lint, types, tests)
2. Validate training data schema
3. Run data quality checks (nulls, distributions)
4. Train model with test hyperparameters
5. Evaluate model metrics against baseline
6. If all pass → merge; if fail → reject
Deploy model artifacts, not just code:
CD Pipeline Trigger: Merge to main
1. Train on full dataset
2. Evaluate against production test set
3. Generate model card (documentation)
4. Store model in model registry
5. Deploy to staging environment
6. Run integration tests
7. Deploy to production (canary or blue-green)
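The evaluation gate in step 2 can be as simple as comparing the candidate's metrics against the current production baseline. A sketch with illustrative metric names:

```python
def passes_gate(candidate, baseline, min_improvement=0.0, max_latency_ms=100):
    """Decide whether a candidate model may proceed to staging.

    candidate/baseline: metric dicts, e.g. {"accuracy": 0.91, "latency_ms": 40}
    """
    # Quality must not regress relative to production
    if candidate["accuracy"] < baseline["accuracy"] + min_improvement:
        return False
    # Serving constraints must hold regardless of quality
    if candidate["latency_ms"] > max_latency_ms:
        return False
    return True
```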
Centralized storage for model versions:
Model Registry:
- Model name and version
- Training data (link/reference)
- Hyperparameters
- Metrics (accuracy, latency, etc.)
- Lineage (code, data, pipeline that produced it)
- Approval status (staging, production, archived)
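A registry can start as little more than a versioned record store. A minimal in-memory sketch of the fields above (not any particular product's API):

```python
from dataclasses import dataclass

@dataclass
class ModelVersion:
    name: str
    version: int
    data_ref: str            # link/reference to the training data snapshot
    hyperparams: dict
    metrics: dict
    lineage: dict            # e.g. git commit, pipeline run id
    status: str = "staging"  # staging | production | archived

class ModelRegistry:
    def __init__(self):
        self._models = {}    # name -> list of ModelVersion

    def register(self, mv: ModelVersion):
        self._models.setdefault(mv.name, []).append(mv)

    def promote(self, name: str, version: int):
        """Move one version to production, archiving the previous one."""
        for mv in self._models[name]:
            if mv.status == "production":
                mv.status = "archived"
        for mv in self._models[name]:
            if mv.version == version:
                mv.status = "production"

    def production_model(self, name: str):
        return next(m for m in self._models[name] if m.status == "production")
```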
Batch prediction processes requests asynchronously on a schedule:
Use cases:
- Recommendation systems (compute daily)
- Fraud scoring (score overnight)
- Report generation
Architecture:
Data → Pipeline → Batch predictions → Database
Tools: Spark, Airflow, Dataflow
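The batch flow above, sketched with a stand-in model and an in-memory "database" (both hypothetical; a real job would read from a warehouse and write to a serving store):

```python
def run_batch_job(records, model, prediction_store):
    """Score every record offline and persist results for later lookup."""
    for record in records:
        score = model(record)                   # e.g. a fraud score
        prediction_store[record["id"]] = score  # write to the serving database
    return len(records)
```

At request time, the application only does a key lookup in `prediction_store`; no model runs on the request path.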
Online serving returns predictions on demand; the right pattern depends on latency and traffic profile:
| Pattern | Use Case | Latency | Cost |
|---|---|---|---|
| Embedded model | Low-latency, high-throughput | <1ms | Low (compute) |
| Model server | General purpose | 10-100ms | Moderate |
| Serverless | Variable traffic | 100-500ms (cold) | Pay-per-use |
Shadow mode:
Production → Serve to user
Shadow → Serve same request, log but don't use
Benefit: Validate new model with live traffic without affecting users
A/B Testing:
90% traffic → Model A (control)
10% traffic → Model B (treatment)
Benefit: Measure the real performance difference on live users
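Both patterns can share one router: hash the user id into a stable bucket so a given user always sees the same variant, and call the shadow model off the serving path. A sketch with hypothetical model callables:

```python
import hashlib

def bucket(user_id: str, buckets: int = 100) -> int:
    """Deterministic 0-99 bucket so each user always gets the same variant."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return int(digest, 16) % buckets

def predict(user_id, features, control_model, treatment_model,
            treatment_pct=10, shadow_model=None, shadow_log=None):
    # A/B split: a fixed slice of users gets the treatment model
    if bucket(user_id) < treatment_pct:
        result = treatment_model(features)
    else:
        result = control_model(features)
    # Shadow mode: score with the candidate, log it, never serve it
    if shadow_model is not None:
        shadow_log.append((user_id, shadow_model(features)))
    return result
```

In production the shadow call would run asynchronously so it cannot add latency to the user-facing response.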
ML models degrade when the world changes. Detecting this drift is critical for maintaining quality.
Population Stability Index (PSI):
PSI = Σ ((Actual% - Expected%) × ln(Actual% / Expected%))
PSI < 0.1: No significant drift
PSI 0.1-0.2: Moderate drift, investigate
PSI > 0.2: Significant drift, retraining needed
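The formula can be computed directly by binning a production sample against the training (expected) sample. A minimal sketch, assuming equal-width bins derived from the training data's range:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between two numeric samples.

    expected: training-time sample (defines the bins)
    actual:   production sample to compare against it
    """
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def bucket_fractions(values):
        counts = [0] * bins
        for v in values:
            # Bin index = number of edges the value exceeds
            # (values outside the training range fall in the end bins)
            counts[sum(v > e for e in edges)] += 1
        eps = 1e-4  # floor empty buckets to avoid ln(0)
        return [max(c / len(values), eps) for c in counts]

    e_frac = bucket_fractions(expected)
    a_frac = bucket_fractions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(e_frac, a_frac))
```

Identical distributions give PSI = 0; a shifted production sample pushes it well past the 0.2 retraining threshold.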
Monitor PSI for each model input in production, and alert or trigger retraining when it crosses the drift threshold.
Feature stores centralize feature definitions for consistency between training and serving:
Feature Store Components:
1. Transformation layer: Feature computation logic
2. Storage layer: Low-latency serving store + high-volume offline store
3. Serving layer: API for online, batch for training
Benefits:
- Consistent features (same computation for train and serve)
- Reusable features across models
- Feature versioning and lineage
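A toy version of the three layers (the names are illustrative, not any particular product's API; tools like Feast provide this in production):

```python
class FeatureStore:
    """One set of transforms serves both training (offline) and serving (online)."""

    def __init__(self):
        self._transforms = {}  # feature name -> computation (transformation layer)
        self._online = {}      # entity id -> {feature: value} (low-latency store)

    def define(self, name, fn):
        self._transforms[name] = fn

    def materialize(self, entity_id, raw):
        # Compute every feature once and write it to the online store
        self._online[entity_id] = {n: fn(raw) for n, fn in self._transforms.items()}

    def get_online(self, entity_id):
        return self._online[entity_id]  # serving layer, online path

    def get_training_rows(self, raw_records):
        # Offline path reuses the exact same transforms -> no train/serve skew
        return [{n: fn(raw) for n, fn in self._transforms.items()}
                for raw in raw_records]
```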
Track experiments to understand what works:
What to track:
- Parameters (hyperparameters, data version)
- Metrics (training, validation, test)
- Artifacts (model files, visualizations)
- Logs (stdout, charts)
- Metadata (who ran it, when, git commit)
Logging a run with MLflow:

```python
import mlflow

mlflow.set_experiment("my-experiment")

with mlflow.start_run():
    # Parameters: what went into this run
    mlflow.log_param("lr", 0.001)
    mlflow.log_param("depth", 8)

    model = train_model(...)  # placeholder for your training code

    # Metrics and artifacts: what came out
    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_artifact("model.pkl")
    mlflow.sklearn.log_model(model, "model")
```
| Traditional | ML-First |
|---|---|
| ML engineers embedded in product teams | Centralized ML platform team |
| Data scientists do everything | Clear role separation: MLE, DS, MLOps |
| Ad-hoc processes | Standardized MLOps practices |
Early stage (first models in production):
Stack:
- GitHub Actions for CI/CD
- MLflow for experiment tracking
- Docker for containerization
- Kubernetes or SageMaker for serving
- Great Expectations for data validation
Process:
- Feature store (manual, simple)
- Basic monitoring (custom)
- Manual model promotion
Growing (multiple models, regular retraining):
Stack:
- Kubeflow or Airflow for pipelines
- Feature store (Feast)
- Model registry
- Automated retraining triggers
- Comprehensive monitoring
Process:
- Standardized experiment tracking
- Automated model promotion with gates
- Shadow mode testing before production
Mature (ML across the organization):
Stack:
- Full ML platform (internal or vendor)
- Centralized feature store
- Model governance and approval workflows
- Sophisticated monitoring and alerting
- A/B testing infrastructure
Process:
- Full MLOps maturity
- Self-service tools for data scientists
- Automated data and model quality gates
MLOps is about applying software engineering discipline to ML systems. The goal is reliable, reproducible, observable ML systems that can be maintained and improved over time.
Start simple: version your code and data, track experiments, build basic monitoring. As your ML practice matures, invest in more sophisticated tooling. Don't build a feature store if you're training five models—do build one when multiple teams are sharing features across dozens of models.