MLOps Engineering Practice

ML lifecycle, CI/CD for ML, drift detection, and production deployment patterns

Published: January 2026 | Reading Time: 15 minutes | Category: AI & Machine Learning


Machine learning models don't deploy themselves. Moving from a Jupyter notebook to production requires solving challenges that traditional software engineering addresses with CI/CD—testing, reproducibility, monitoring, and rollback. MLOps applies these principles to ML systems.

This guide covers the MLOps lifecycle, practical CI/CD patterns for ML, drift detection, serving architectures, and team organization.

The ML Lifecycle

ML systems are harder to build and maintain than traditional software because their behavior depends on data as well as code. The lifecycle reflects this: the work doesn't end at deployment.

The ML Lifecycle Stages

  Business Goal → Data Collection → Data Processing →
  Feature Engineering → Model Training → Evaluation →
  Deployment → Monitoring → (Data/Concept Drift) → Retraining
    

ML Pipelines

ML pipelines codify the steps from data to deployment:

Pipeline Components:
  1. Data extraction: Load from warehouse, APIs, files
  2. Data validation: Check schema, statistics, anomalies
  3. Data transformation: Feature engineering, preprocessing
  4. Model training: Fit model to processed data
  5. Model evaluation: Validate on test set, check metrics
  6. Model serving: Deploy to production

Tools: Kubeflow Pipelines, Apache Airflow, Metaflow, Prefect
    

Kubeflow Pipelines Example

from kfp import components
from kfp.dsl import pipeline

@pipeline(name="train-model")
def train_pipeline(
    data_path: str,
    model_type: str,
    learning_rate: float,
):
    # Components are reusable steps defined in YAML (URLs elided here)
    data_op = components.load_component_from_url('.../data.yaml')
    train_op = components.load_component_from_url('.../train.yaml')

    # Wiring one step's output into the next defines the pipeline DAG
    data = data_op(path=data_path)
    train_op(
        data=data.output,
        model_type=model_type,
        lr=learning_rate,
    )
    

CI/CD for ML

Continuous Integration

For ML, CI must validate more than code:

CI Pipeline Trigger: Pull request or push to main

1. Run code quality checks (lint, types, tests)
2. Validate training data schema
3. Run data quality checks (nulls, distributions)
4. Train model with test hyperparameters
5. Evaluate model metrics against baseline
6. If all pass → merge; if fail → reject
    
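The evaluation check in step 5 is typically a hard gate: the candidate model's metrics must not regress beyond an agreed margin against the current baseline, or the pipeline fails the build. A minimal sketch of such a gate (file names, metric names, and the 0.01 margin are illustrative assumptions; it also assumes higher-is-better metrics):

import json
import sys

def metric_gate(candidate_path: str, baseline_path: str, max_regression: float = 0.01) -> bool:
    """Fail if any metric drops more than max_regression below the baseline."""
    with open(candidate_path) as f:
        candidate = json.load(f)
    with open(baseline_path) as f:
        baseline = json.load(f)

    failed = {}
    for name, base_value in baseline.items():
        drop = base_value - candidate.get(name, float("-inf"))  # missing metric = automatic failure
        if drop > max_regression:
            failed[name] = drop

    if failed:
        print(f"Metrics regressed beyond {max_regression}: {failed}")
        return False
    return True

if __name__ == "__main__":
    # Invoked by the CI job, e.g. python metric_gate.py candidate_metrics.json baseline_metrics.json
    sys.exit(0 if metric_gate(sys.argv[1], sys.argv[2]) else 1)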

Continuous Delivery/Deployment

Deploy model artifacts, not just code:

CD Pipeline Trigger: Merge to main

1. Train on full dataset
2. Evaluate against production test set
3. Generate model card (documentation)
4. Store model in model registry
5. Deploy to staging environment
6. Run integration tests
7. Deploy to production (canary or blue-green)
    

Model Registry

Centralized storage for model versions:

Model Registry:
  - Model name and version
  - Training data (link/reference)
  - Hyperparameters
  - Metrics (accuracy, latency, etc.)
  - Lineage (code, data, pipeline that produced it)
  - Approval status (staging, production, archived)
    
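As one concrete example, MLflow's model registry records most of these fields and exposes approval status as stages. A minimal sketch (the model name and run ID are illustrative; newer MLflow releases favor version aliases, but the stage API shown here still exists):

import mlflow
from mlflow.tracking import MlflowClient

run_id = "abc123"  # illustrative: the tracked run that produced the candidate model
model_uri = f"runs:/{run_id}/model"

# Create a new version under a registered model name; lineage back to the run is kept automatically
version = mlflow.register_model(model_uri, "churn-classifier")

# Record approval status by moving the version through stages
client = MlflowClient()
client.transition_model_version_stage(
    name="churn-classifier",
    version=version.version,
    stage="Staging",  # promoted to "Production" only after integration tests pass
)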

Model Serving Patterns

Batch (Offline) Prediction

Process requests asynchronously on a schedule:

Use cases:
  - Recommendation systems (compute daily)
  - Fraud scoring (score overnight)
  - Report generation

Architecture:
  Data → Pipeline → Batch predictions → Database
  
Tools: Spark, Airflow, Dataflow
    
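In practice the batch path is often just a scheduled script: load the current production model, score the rows that arrived since the last run, and write predictions where the application can read them. A rough sketch, assuming a scikit-learn-style model and illustrative file paths and column names:

import joblib
import pandas as pd

def run_batch_scoring(model_path: str, input_path: str, output_path: str) -> None:
    model = joblib.load(model_path)     # current production model artifact
    rows = pd.read_parquet(input_path)  # rows collected since the last run

    features = rows.drop(columns=["user_id"])
    rows["score"] = model.predict_proba(features)[:, 1]

    # Downstream systems (or the serving database) read from this table
    rows[["user_id", "score"]].to_parquet(output_path)

if __name__ == "__main__":
    # Triggered nightly by the scheduler (Airflow, Dataflow, cron, ...)
    run_batch_scoring("models/fraud_v12.joblib", "data/daily_events.parquet", "data/scores.parquet")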

Online (Real-Time) Serving

Serve predictions on demand:

Pattern          Use Case                       Latency             Cost
Embedded model   Low latency, high throughput   <1 ms               Low (compute)
Model server     General purpose                10-100 ms           Moderate
Serverless       Variable traffic               100-500 ms (cold)   Pay-per-use
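The model server row is the most common starting point: the model is loaded once at startup and served over HTTP. A minimal sketch of that pattern, using FastAPI and a scikit-learn-style model purely as illustrative choices:

import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # loaded once at startup, reused across requests

class PredictRequest(BaseModel):
    features: list[float]

@app.post("/predict")
def predict(request: PredictRequest) -> dict:
    # scikit-learn-style models expect a 2-D array: one row per example
    score = float(model.predict_proba([request.features])[0][1])
    return {"score": score}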

Model Server Options

Shadow Mode and A/B Testing

Shadow mode:
  Production → Serve to user
  Shadow      → Serve same request, log but don't use
  
Benefit: Validate new model with live traffic without affecting users

A/B Testing:
  90% traffic → Model A (control)
  10% traffic → Model B (treatment)
  
Benefit: Measure real performance difference
    
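At the application layer, shadow mode is straightforward: every request is scored by both models, only the production result is returned, and the challenger's output is logged for offline comparison. A minimal illustration (the model objects and logger are assumptions, not a specific framework):

import logging

logger = logging.getLogger("shadow")

def predict_with_shadow(features, production_model, shadow_model):
    """Return the production prediction; score and log the shadow model on the side."""
    production_pred = production_model.predict([features])[0]

    try:
        shadow_pred = shadow_model.predict([features])[0]
        # Logged for offline comparison, never returned to the user
        logger.info("shadow_comparison production=%s shadow=%s", production_pred, shadow_pred)
    except Exception:
        # A failing shadow model must never affect the user-facing response
        logger.exception("shadow model failed")

    return production_pred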

Drift Detection

ML models degrade when the world changes. Detecting this drift is critical for maintaining quality.

Types of Drift

Data drift: the distribution of input features changes (for example, a new customer segment appears), even though the relationship between inputs and target is unchanged.
Concept drift: the relationship between inputs and target itself changes (for example, fraud patterns evolve), so predictions degrade even when inputs look familiar.

Statistical Tests for Drift

Population Stability Index (PSI):
  PSI < 0.1:  No significant drift
  PSI 0.1-0.2: Moderate drift, investigate
  PSI > 0.2:  Significant drift, retraining needed

PSI = Σ ((Actual% - Expected%) × ln(Actual% / Expected%))
    
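As a sketch, PSI can be computed with NumPy by bucketing both samples on bin edges taken from the reference (expected) distribution; the small epsilon guards against empty buckets:

import numpy as np

def population_stability_index(expected, actual, bins: int = 10) -> float:
    """PSI between a reference sample (e.g. training data) and a production sample."""
    # Bin edges come from the reference distribution so both samples share the same buckets
    edges = np.histogram_bin_edges(expected, bins=bins)
    expected_counts, _ = np.histogram(expected, bins=edges)
    actual_counts, _ = np.histogram(actual, bins=edges)

    expected_pct = expected_counts / expected_counts.sum()
    actual_pct = actual_counts / actual_counts.sum()

    eps = 1e-6  # avoids log(0) and division by zero for empty buckets
    return float(np.sum((actual_pct - expected_pct) * np.log((actual_pct + eps) / (expected_pct + eps))))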

Automated Monitoring

Monitor in production:
  - Input and feature distributions (data drift)
  - Prediction distributions, and model metrics where ground-truth labels arrive
  - Serving latency and error rates
  - Downstream business metrics

Monitor, Don't Hope: Production ML systems WILL degrade. Without monitoring, you won't know until business metrics suffer. Build monitoring before deployment, not after.

Feature Stores

Feature stores centralize feature definitions for consistency between training and serving:

Feature Store Components:
  1. Transformation layer: Feature computation logic
  2. Storage layer: Low-latency serving store + high-volume offline store
  3. Serving layer: API for online, batch for training
  
Benefits:
  - Consistent features (same computation for train and serve)
  - Reusable features across models
  - Feature versioning and lineage
    

Feature Store Options
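Feast (used in the medium-team stack below) is a widely adopted open-source option. A minimal sketch of its dual online/offline API, assuming a Feast repo with a hypothetical user_stats feature view keyed on user_id:

import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")
features = ["user_stats:txn_count_7d", "user_stats:avg_txn_amount"]

# Online path: low-latency lookup at serving time
online_features = store.get_online_features(
    features=features,
    entity_rows=[{"user_id": 1001}],
).to_dict()

# Offline path: point-in-time-correct features for training, from the same definitions
entity_df = pd.DataFrame({
    "user_id": [1001, 1002],
    "event_timestamp": pd.to_datetime(["2025-11-01", "2025-11-02"]),
})
training_df = store.get_historical_features(entity_df=entity_df, features=features).to_df()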

Experiment Tracking

Track experiments to understand what works:

What to track:
  - Parameters (hyperparameters, data version)
  - Metrics (training, validation, test)
  - Artifacts (model files, visualizations)
  - Logs (stdout, charts)
  - Metadata (who ran it, when, git commit)
    

MLflow Tracking

import mlflow

mlflow.set_experiment("my-experiment")

with mlflow.start_run():
    # Parameters: everything needed to reproduce the run
    mlflow.log_param("lr", 0.001)
    mlflow.log_param("depth", 8)

    model = train_model(...)  # placeholder for your own training code

    # Metrics and artifacts are attached to the run for later comparison
    mlflow.log_metric("accuracy", accuracy)   # accuracy computed during evaluation
    mlflow.log_artifact("model.pkl")          # assumes the model was also saved locally
    mlflow.sklearn.log_model(model, "model")
    

Experiment Tracking Tools

Team Structure

Traditional vs ML-First Organizations

Traditional                                ML-First
ML engineers embedded in product teams     Centralized ML platform team
Data scientists do everything              Clear role separation: MLE, DS, MLOps
Ad-hoc processes                           Standardized MLOps practices

ML Roles

Practical MLOps Architecture

Small Team (<5 ML engineers)

Stack:
  - GitHub Actions for CI/CD
  - MLflow for experiment tracking
  - Docker for containerization
  - Kubernetes or SageMaker for serving
  - Great Expectations for data validation

Process:
  - Feature store (manual, simple)
  - Basic monitoring (custom)
  - Manual model promotion
    

Medium Team (5-20 ML engineers)

Stack:
  - Kubeflow or Airflow for pipelines
  - Feature store (Feast)
  - Model registry
  - Automated retraining triggers
  - Comprehensive monitoring

Process:
  - Standardized experiment tracking
  - Automated model promotion with gates
  - Shadow mode testing before production
    

Large Organization (20+ ML engineers)

Stack:
  - Full ML platform (internal or vendor)
  - Centralized feature store
  - Model governance and approval workflows
  - Sophisticated monitoring and alerting
  - A/B testing infrastructure

Process:
  - Full MLOps maturity
  - Self-service tools for data scientists
  - Automated data and model quality gates
    

Common Pitfalls

Conclusion

MLOps is about applying software engineering discipline to ML systems. The goal is reliable, reproducible, observable ML systems that can be maintained and improved over time.

Start simple: version your code and data, track experiments, build basic monitoring. As your ML practice matures, invest in more sophisticated tooling. Don't build a feature store if you're training five models—do build one when multiple teams are sharing features across dozens of models.