ML lifecycle, CI/CD for ML, drift detection, and production deployment patterns
Machine learning models don't deploy themselves. Moving from a Jupyter notebook to production requires solving challenges that traditional software engineering addresses with CI/CD—testing, reproducibility, monitoring, and rollback. MLOps applies these principles to ML systems.
This guide covers the MLOps lifecycle, practical CI/CD patterns for ML, drift detection, serving architectures, and team organization.
ML systems have unique challenges that make them harder to build and maintain than traditional software. Unlike a conventional release cycle, the ML lifecycle is a loop: production monitoring feeds back into retraining whenever the data or the problem shifts:
┌─────────────────────────────────────────────────────────────┐
│ Business Goal → Data Collection → Data Processing → │
│ │
│ Feature Engineering → Model Training → Evaluation → │
│ │
│ Deployment → Monitoring → (Data/Concept Shift) → Retraining │
└─────────────────────────────────────────────────────────────┘
ML pipelines codify the steps from data to deployment:
Pipeline Components:
1. Data extraction: Load from warehouse, APIs, files
2. Data validation: Check schema, statistics, anomalies
3. Data transformation: Feature engineering, preprocessing
4. Model training: Fit model to processed data
5. Model evaluation: Validate on test set, check metrics
6. Model serving: Deploy to production
Tools: Kubeflow Pipelines, Apache Airflow, Metaflow, Prefect
For example, a minimal Kubeflow Pipelines definition (component URLs elided here):

```python
import kfp
from kfp import components
from kfp.dsl import pipeline

@pipeline(name="train-model")
def train_pipeline(
    data_path: str,
    model_type: str,
    learning_rate: float,
):
    # Components are loaded from reusable YAML definitions
    data_op = components.load_component_from_url('.../data.yaml')
    train_op = components.load_component_from_url('.../train.yaml')

    # Wire the steps: training consumes the data step's output
    data = data_op(path=data_path)
    train_task = train_op(
        data=data.output,
        model_type=model_type,
        lr=learning_rate,
    )
```
For ML, CI must validate more than code:
CI Pipeline Trigger: Pull request or push to main
1. Run code quality checks (lint, types, tests)
2. Validate training data schema
3. Run data quality checks (nulls, distributions)
4. Train model with test hyperparameters
5. Evaluate model metrics against baseline
6. If all pass → merge; if fail → reject
Deploy model artifacts, not just code:
CD Pipeline Trigger: Merge to main
1. Train on full dataset
2. Evaluate against production test set
3. Generate model card (documentation)
4. Store model in model registry
5. Deploy to staging environment
6. Run integration tests
7. Deploy to production (canary or blue-green)
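The evaluation gate in step 2 can be as simple as comparing the candidate's metrics against the current production baseline. A sketch with illustrative metric names:

```python
def passes_gate(candidate, baseline, min_improvement=0.0, max_latency_ms=100):
    """Decide whether a candidate model may proceed to staging.

    candidate/baseline: metric dicts, e.g. {"accuracy": 0.91, "latency_ms": 40}
    """
    # Quality must not regress relative to production
    if candidate["accuracy"] < baseline["accuracy"] + min_improvement:
        return False
    # Serving constraints must hold regardless of quality
    if candidate["latency_ms"] > max_latency_ms:
        return False
    return True
```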
Centralized storage for model versions:
Model Registry:
- Model name and version
- Training data (link/reference)
- Hyperparameters
- Metrics (accuracy, latency, etc.)
- Lineage (code, data, pipeline that produced it)
- Approval status (staging, production, archived)
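A registry can start as little more than a versioned record store. A minimal in-memory sketch of the fields above (not any particular product's API):

```python
from dataclasses import dataclass

@dataclass
class ModelVersion:
    name: str
    version: int
    data_ref: str            # link/reference to the training data snapshot
    hyperparams: dict
    metrics: dict
    lineage: dict            # e.g. git commit, pipeline run id
    status: str = "staging"  # staging | production | archived

class ModelRegistry:
    def __init__(self):
        self._models = {}    # name -> list of ModelVersion

    def register(self, mv: ModelVersion):
        self._models.setdefault(mv.name, []).append(mv)

    def promote(self, name: str, version: int):
        """Move one version to production, archiving the previous one."""
        for mv in self._models[name]:
            if mv.status == "production":
                mv.status = "archived"
        for mv in self._models[name]:
            if mv.version == version:
                mv.status = "production"

    def production_model(self, name: str):
        return next(m for m in self._models[name] if m.status == "production")
```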
Batch prediction processes requests asynchronously on a schedule:
Use cases:
- Recommendation systems (compute daily)
- Fraud scoring (score overnight)
- Report generation
Architecture:
Data → Pipeline → Batch predictions → Database
Tools: Spark, Airflow, Dataflow
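The batch flow above, sketched with a stand-in model and an in-memory "database" (both hypothetical; a real job would read from a warehouse and write to a serving store):

```python
def run_batch_job(records, model, prediction_store):
    """Score every record offline and persist results for later lookup."""
    for record in records:
        score = model(record)                   # e.g. a fraud score
        prediction_store[record["id"]] = score  # write to the serving database
    return len(records)
```

At request time, the application only does a key lookup in `prediction_store`; no model runs on the request path.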
Online serving returns predictions on demand; the right pattern depends on latency and traffic profile:
| Pattern | Use Case | Latency | Cost |
|---|---|---|---|
| Embedded model | Low-latency, high-throughput | <1ms | Low (compute) |
| Model server | General purpose | 10-100ms | Moderate |
| Serverless | Variable traffic | 100-500ms (cold) | Pay-per-use |
Shadow mode:
Production → Serve to user
Shadow → Serve same request, log but don't use
Benefit: Validate new model with live traffic without affecting users
A/B Testing:
90% traffic → Model A (control)
10% traffic → Model B (treatment)
Benefit: Measure the real performance difference on live users
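Both patterns can share one router: hash the user id into a stable bucket so a given user always sees the same variant, and call the shadow model off the serving path. A sketch with hypothetical model callables:

```python
import hashlib

def bucket(user_id: str, buckets: int = 100) -> int:
    """Deterministic 0-99 bucket so each user always gets the same variant."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return int(digest, 16) % buckets

def predict(user_id, features, control_model, treatment_model,
            treatment_pct=10, shadow_model=None, shadow_log=None):
    # A/B split: a fixed slice of users gets the treatment model
    if bucket(user_id) < treatment_pct:
        result = treatment_model(features)
    else:
        result = control_model(features)
    # Shadow mode: score with the candidate, log it, never serve it
    if shadow_model is not None:
        shadow_log.append((user_id, shadow_model(features)))
    return result
```

In production the shadow call would run asynchronously so it cannot add latency to the user-facing response.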
ML models degrade when the world changes. Detecting this drift is critical for maintaining quality.
Population Stability Index (PSI):
PSI = Σ ((Actual% - Expected%) × ln(Actual% / Expected%))
PSI < 0.1: No significant drift
PSI 0.1-0.2: Moderate drift, investigate
PSI > 0.2: Significant drift, retraining needed
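The formula can be computed directly by binning a production sample against the training (expected) sample. A minimal sketch, assuming equal-width bins derived from the training data's range:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between two numeric samples.

    expected: training-time sample (defines the bins)
    actual:   production sample to compare against it
    """
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def bucket_fractions(values):
        counts = [0] * bins
        for v in values:
            # Bin index = number of edges the value exceeds
            # (values outside the training range fall in the end bins)
            counts[sum(v > e for e in edges)] += 1
        eps = 1e-4  # floor empty buckets to avoid ln(0)
        return [max(c / len(values), eps) for c in counts]

    e_frac = bucket_fractions(expected)
    a_frac = bucket_fractions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(e_frac, a_frac))
```

Identical distributions give PSI = 0; a shifted production sample pushes it well past the 0.2 retraining threshold.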
Monitor PSI for each model input in production, and alert or trigger retraining when it crosses the drift threshold.
Feature stores centralize feature definitions for consistency between training and serving:
Feature Store Components:
1. Transformation layer: Feature computation logic
2. Storage layer: Low-latency serving store + high-volume offline store
3. Serving layer: API for online, batch for training
Benefits:
- Consistent features (same computation for train and serve)
- Reusable features across models
- Feature versioning and lineage
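A toy version of the three layers (the names are illustrative, not any particular product's API; tools like Feast provide this in production):

```python
class FeatureStore:
    """One set of transforms serves both training (offline) and serving (online)."""

    def __init__(self):
        self._transforms = {}  # feature name -> computation (transformation layer)
        self._online = {}      # entity id -> {feature: value} (low-latency store)

    def define(self, name, fn):
        self._transforms[name] = fn

    def materialize(self, entity_id, raw):
        # Compute every feature once and write it to the online store
        self._online[entity_id] = {n: fn(raw) for n, fn in self._transforms.items()}

    def get_online(self, entity_id):
        return self._online[entity_id]  # serving layer, online path

    def get_training_rows(self, raw_records):
        # Offline path reuses the exact same transforms -> no train/serve skew
        return [{n: fn(raw) for n, fn in self._transforms.items()}
                for raw in raw_records]
```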
Track experiments to understand what works:
What to track:
- Parameters (hyperparameters, data version)
- Metrics (training, validation, test)
- Artifacts (model files, visualizations)
- Logs (stdout, charts)
- Metadata (who ran it, when, git commit)
Logging a run with MLflow:

```python
import mlflow

mlflow.set_experiment("my-experiment")

with mlflow.start_run():
    # Parameters: what went into this run
    mlflow.log_param("lr", 0.001)
    mlflow.log_param("depth", 8)

    model = train_model(...)  # placeholder for your training code

    # Metrics and artifacts: what came out
    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_artifact("model.pkl")
    mlflow.sklearn.log_model(model, "model")
```
| Traditional | ML-First |
|---|---|
| ML engineers embedded in product teams | Centralized ML platform team |
| Data scientists do everything | Clear role separation: MLE, DS, MLOps |
| Ad-hoc processes | Standardized MLOps practices |
Early stage (first models in production):
Stack:
- GitHub Actions for CI/CD
- MLflow for experiment tracking
- Docker for containerization
- Kubernetes or SageMaker for serving
- Great Expectations for data validation
Process:
- Feature store (manual, simple)
- Basic monitoring (custom)
- Manual model promotion
Growing (multiple models, regular retraining):
Stack:
- Kubeflow or Airflow for pipelines
- Feature store (Feast)
- Model registry
- Automated retraining triggers
- Comprehensive monitoring
Process:
- Standardized experiment tracking
- Automated model promotion with gates
- Shadow mode testing before production
Mature (ML across the organization):
Stack:
- Full ML platform (internal or vendor)
- Centralized feature store
- Model governance and approval workflows
- Sophisticated monitoring and alerting
- A/B testing infrastructure
Process:
- Full MLOps maturity
- Self-service tools for data scientists
- Automated data and model quality gates
MLOps is about applying software engineering discipline to ML systems. The goal is reliable, reproducible, observable ML systems that can be maintained and improved over time.
Start simple: version your code and data, track experiments, build basic monitoring. As your ML practice matures, invest in more sophisticated tooling. Don't build a feature store if you're training five models—do build one when multiple teams are sharing features across dozens of models.