Classification, regression, ranking, and production monitoring metrics
Evaluating machine learning models requires more than a single accuracy number. The right metrics illuminate model behavior, identify failure modes, and guide optimization. A classifier with 95% accuracy might be useless if it misses the rare cases you care about most. A regression model might have low error on average but systematically overpredict for your most valuable customers.
This guide covers the essential metrics for classification, regression, and ranking tasks—explaining what each measures, when to use it, and how to avoid common pitfalls.
The foundation of classification evaluation is the confusion matrix, which cross-tabulates predictions against actual labels:
| | Predicted Neg | Predicted Pos |
|---|---|---|
| Actual Neg | TN | FP |
| Actual Pos | FN | TP |

- TN = True Negative (correctly predicted negative)
- FP = False Positive (incorrectly predicted positive)
- FN = False Negative (missed positive case)
- TP = True Positive (correctly predicted positive)
From these four numbers, dozens of metrics can be derived.
The most common metrics:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
= Correct predictions / Total predictions
Precision = TP / (TP + FP)
= Of predicted positives, how many are correct?
= "Does our positive prediction signal really positive?"
Recall = TP / (TP + FN)
= Of actual positives, how many did we find?
= "Did we miss any real positives?"
F1 = 2 × (Precision × Recall) / (Precision + Recall)
= Harmonic mean of precision and recall
= Ranges from 0 (worst) to 1 (perfect)
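As a minimal sketch, these four counts are enough to compute every metric above by hand; the counts passed in at the bottom are made-up illustrative values:

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute the basic metrics directly from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # guard against division by zero
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up counts, purely for illustration
print(classification_metrics(tp=80, tn=900, fp=20, fn=40))
```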
Consider a fraud detection model where 0.1% of transactions are fraudulent. A naive model that always predicts "not fraud" achieves 99.9% accuracy—but is completely useless. For imbalanced data:
| Scenario | Problem | Better Metric |
|---|---|---|
| 99:1 class ratio | Accuracy meaningless | Precision, Recall, F1 |
| Medical screening | Missing positives catastrophic | Recall, AUC-ROC |
| Search ranking | Order matters | NDCG, MAP |
| Multi-class | Per-class performance varies | Macro/Micro F1 |
ROC (Receiver Operating Characteristic) curves plot true positive rate against false positive rate at different classification thresholds. AUC (Area Under Curve) measures the area under this curve.
TPR (Sensitivity/Recall) = TP / (TP + FN)
FPR = FP / (FP + TN)
ROC curve: Plot TPR vs FPR as threshold varies from 0 to 1
AUC = 1.0: Perfect classifier
AUC = 0.5: Random classifier (diagonal line)
AUC < 0.5: Worse than random (invert predictions)
AUC is threshold-independent and insensitive to the class ratio. An AUC of 0.95 means that a randomly chosen positive example will be ranked higher than a randomly chosen negative example 95% of the time.
For highly imbalanced data, PR-AUC (Area Under the Precision-Recall Curve) is more informative than ROC-AUC. When the negative class vastly outnumbers the positive class, ROC curves can look deceptively good.
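A short sketch comparing the two summaries with scikit-learn, assuming you have true binary labels and predicted positive-class scores; the synthetic arrays here are illustrative only, and average_precision_score is the usual stand-in for PR-AUC:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# y_true: binary labels, y_score: predicted probabilities for the positive class
# (synthetic values, purely for illustration)
rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.01).astype(int)              # ~1% positives
y_score = np.clip(y_true * 0.3 + rng.random(10_000) * 0.7, 0, 1)

print("ROC-AUC:", roc_auc_score(y_true, y_score))
print("PR-AUC (average precision):", average_precision_score(y_true, y_score))
```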
Specificity = TN / (TN + FP)
= Of negatives, how many did we correctly identify?
Balanced Accuracy = (Recall + Specificity) / 2
= Average of recall for each class
= Better than accuracy for imbalanced data
For multi-class problems, averaging across classes:
Macro F1: Average F1 across all classes (unweighted)
Treats 100-sample class and 10000-sample class equally
Micro F1: Pool all TP, FP, FN globally, then compute F1
Equal to accuracy for single-label multi-class problems
Weighted F1: Average F1 weighted by class frequency
Most common choice for imbalanced multi-class
For each class, treat it as positive and all others as negative. This reveals per-class performance—critical when different errors have different costs.
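A sketch of the three averaging modes using scikit-learn's f1_score, plus classification_report for the per-class (one-vs-rest) view; the toy label arrays are purely illustrative:

```python
from sklearn.metrics import f1_score, classification_report

# Toy multi-class labels, for illustration only
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 0, 1, 2, 2, 2, 0, 2]

print("macro F1:   ", f1_score(y_true, y_pred, average="macro"))
print("micro F1:   ", f1_score(y_true, y_pred, average="micro"))
print("weighted F1:", f1_score(y_true, y_pred, average="weighted"))

# Per-class precision, recall, and F1
print(classification_report(y_true, y_pred))
```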
MAE = (1/n) × Σ|y_i - ŷ_i|
Pros:
- Interpretable (same units as output)
- Robust to outliers (no squaring)
Cons:
- Less sensitive to large errors
- Not differentiable at zero (harder to optimize)
MSE = (1/n) × Σ(y_i - ŷ_i)²
RMSE = √MSE
Pros:
- Differentiable everywhere (suits gradient-based optimization)
- Penalizes large errors heavily
Cons:
- Sensitive to outliers (squared penalty)
- Harder to interpret (MSE units are squared; RMSE restores the original units)
R² = 1 - (SS_res / SS_tot)
SS_res = Σ(y_i - ŷ_i)² (residual sum of squares)
SS_tot = Σ(y_i - ȳ)² (total sum of squares)
Interpretation:
R² = 1: Perfect prediction
R² = 0: No better than predicting the mean
R² < 0: Worse than predicting the mean
R² measures the proportion of variance explained by the model. However, adding features never decreases R² on the training data, so use adjusted R² to penalize unnecessary complexity.
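A minimal sketch computing MAE, RMSE, and R² with scikit-learn; the arrays are placeholders for your own targets and predictions:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Placeholder values, purely for illustration
y_true = np.array([3.0, 5.0, 2.5, 7.0, 4.5])
y_pred = np.array([2.8, 5.4, 2.0, 8.0, 4.3])

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # RMSE is in the original units
r2 = r2_score(y_true, y_pred)
print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  R²={r2:.3f}")
```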
MAPE = (100/n) × Σ |(y_i - ŷ_i) / y_i|
SMAPE = (100/n) × Σ|y_i - ŷ_i| / ((|y_i| + |ŷ_i|) / 2)
Interpretation:
MAPE = 10%: Predictions are off by 10% on average
Warning: MAPE is undefined when y_i = 0 and can be
artificially high for small y_i values
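A NumPy sketch of both metrics; the MAPE helper simply skips zero targets, which is one common (but not the only) way to handle the undefined case:

```python
import numpy as np

def mape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """MAPE over the points where y_true != 0 (undefined at zero)."""
    mask = y_true != 0
    return float(100 * np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])))

def smape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Symmetric MAPE; the denominator averages |y_true| and |y_pred|."""
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2
    return float(100 * np.mean(np.abs(y_true - y_pred) / denom))

# Illustrative values only; note the zero target in the last position
y_true = np.array([100.0, 50.0, 10.0, 0.0])
y_pred = np.array([110.0, 45.0, 12.0, 1.0])
print(mape(y_true, y_pred), smape(y_true, y_pred))
```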
Quantile Loss = (1/n) × Σ max(q × (y - ŷ), (1-q) × (ŷ - y))
where q is the quantile (e.g., 0.5 for median)
Use when:
- Asymmetric cost for over- vs under-prediction
- Want to predict medians rather than means
- Robust to outliers
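A minimal NumPy sketch of the pinball (quantile) loss; with q = 0.9, under-prediction costs nine times as much as over-prediction:

```python
import numpy as np

def quantile_loss(y_true: np.ndarray, y_pred: np.ndarray, q: float) -> float:
    """Pinball loss: under-prediction weighted by q, over-prediction by 1 - q."""
    diff = y_true - y_pred
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))

# Illustrative values only
y_true = np.array([10.0, 12.0, 9.0, 15.0])
y_pred = np.array([11.0, 10.0, 9.5, 14.0])
print(quantile_loss(y_true, y_pred, q=0.9))  # penalizes under-prediction more
```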
For query q:
AP(q) = Σ Precision@k × ΔRecall@k
where k ranges over all relevant items
MAP = Average of AP across all queries
Use for: Search engines, recommendation systems
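A sketch computing AP for a single query from a binary relevance list in ranked order; here AP is normalized by the number of relevant items retrieved, and MAP is just the mean over queries:

```python
def average_precision(ranked_relevance: list[int]) -> float:
    """AP for one query: ranked_relevance holds 1/0 relevance in ranked order."""
    hits, precisions = 0, []
    for k, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)   # Precision@k at each relevant position
    return sum(precisions) / hits if hits else 0.0

# One query with relevant items at ranks 1, 3, and 6 (toy example)
print(average_precision([1, 0, 1, 0, 0, 1]))   # ≈ 0.722
# MAP = mean of average_precision over all queries
```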
DCG@k = Σ (rel_i / log₂(i+1)) for i=1 to k
NDCG@k = DCG@k / IDCG@k
where IDCG is the ideal DCG (best possible ordering)
NDCG = 1.0: Perfect ranking
NDCG = 0.0: No relevant items in the top k
Use for: Ranking with graded relevance (not just relevant/not)
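A NumPy sketch of DCG and NDCG for graded relevance; the ideal ordering is obtained by sorting the same relevance list in descending order:

```python
import numpy as np

def dcg(relevances, k: int) -> float:
    """Discounted cumulative gain over the top-k graded relevances."""
    rel = np.asarray(relevances, dtype=float)[:k]
    ranks = np.arange(1, len(rel) + 1)
    return float(np.sum(rel / np.log2(ranks + 1)))

def ndcg(relevances, k: int) -> float:
    """Normalize DCG by the DCG of the ideal (descending-relevance) ordering."""
    ideal = dcg(np.sort(np.asarray(relevances))[::-1], k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

# Graded relevance of one query's ranked results (toy values)
print(ndcg([3, 2, 0, 1, 2], k=5))
```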
Single train-test splits can be misleading due to variance. Cross-validation provides more robust estimates:
Data: Split into K folds (typically 5 or 10)
For each fold i:
Train on all folds except i
Validate on fold i
Report: Average performance across folds, with variance
K-fold gives K estimates of performance. The mean is a more robust estimate; the variance indicates stability.
For classification, use stratified K-fold so that each fold preserves the class distribution of the full dataset. This is critical for imbalanced data.
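A sketch of stratified K-fold evaluation with scikit-learn; the synthetic dataset, model, and scoring choice are placeholders for your own:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data, purely for illustration (~5% positives)
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="f1")
print(scores.mean(), scores.std())   # mean and variability across folds
```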
For temporal data, never use random splits. Use forward-looking splits:
Week 1-8: Train
Week 9: Validate
Week 10: Test
Or: Rolling origin cross-validation
Train: Weeks 1-8 → Test: Week 9
Train: Weeks 1-9 → Test: Week 10
Train: Weeks 1-10 → Test: Week 11
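scikit-learn's TimeSeriesSplit implements this kind of expanding-window scheme; a sketch with 11 placeholder "weeks":

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(11).reshape(-1, 1)   # 11 "weeks" of data, already in time order

tscv = TimeSeriesSplit(n_splits=3, test_size=1)   # expanding training window
for train_idx, test_idx in tscv.split(X):
    print("train weeks:", train_idx + 1, "-> test week:", test_idx + 1)
```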
When comparing models, check whether the difference is statistically significant rather than noise. Common choices are McNemar's test on paired predictions from the same test set, or a paired t-test over per-fold scores from identical cross-validation folds, as sketched below.
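One minimal sketch, assuming you already have per-fold scores for two models evaluated on the same cross-validation folds; the numbers below are illustrative only:

```python
from scipy import stats

# Per-fold F1 scores for two models evaluated on the same folds (illustrative numbers)
model_a = [0.81, 0.79, 0.83, 0.80, 0.82]
model_b = [0.83, 0.80, 0.84, 0.83, 0.84]

t_stat, p_value = stats.ttest_rel(model_a, model_b)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")  # small p: difference unlikely to be chance
```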
Production models degrade when input distributions shift. Monitor:
Population Stability Index (PSI):
PSI = Σ ((Actual% - Expected%) × ln(Actual% / Expected%))
PSI < 0.1: No significant shift
PSI 0.1-0.2: Moderate shift, investigate
PSI > 0.2: Significant shift, retraining recommended
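A sketch of PSI in NumPy, assuming a production sample is binned against quantile bins derived from the training (reference) sample; the binning strategy and the small floor value are implementation choices:

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """PSI between a reference (training) sample and a production sample."""
    # Bin edges from the reference distribution's quantiles
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Small floor avoids log(0) and division by zero in empty bins
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

# Synthetic scores, purely for illustration: production distribution has shifted
rng = np.random.default_rng(0)
train_scores = rng.normal(0.0, 1.0, 10_000)
prod_scores = rng.normal(0.3, 1.0, 10_000)
print(psi(train_scores, prod_scores))
```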
Production ML requires rigorous experimentation. A/B testing (or multi-armed bandit testing) validates that model changes improve outcomes in the real world.
Randomized experiment:
Control: Users see model A (current)
Treatment: Users see model B (new)
Randomization ensures:
- No selection bias
- Similar user distributions
- Statistical validity
Offline metrics (accuracy, AUC) don't always translate to online improvements. Track business metrics as well: click-through rate, conversion rate, revenue per user, and retention.
Statistical power determines experiment duration:
For detecting a 4% relative improvement on a 1% baseline CTR (1.00% → 1.04%):
- Baseline CTR: 1%
- Minimum detectable effect: 4% relative
- Statistical power: 80%
- Significance level: 5% (two-sided)
Required sample size: roughly 1,000,000 impressions per variant
At 10,000 impressions per variant per day: ~100 days for the experiment
Larger effects can be detected much faster, since the required sample size scales with the inverse square of the minimum detectable effect. Calculate sample size before running experiments to avoid inconclusive results.
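A sketch of the standard normal-approximation sample-size formula for a two-proportion test; the baseline rate and relative lift are the illustrative values from above:

```python
import math
from scipy.stats import norm

def samples_per_variant(p_base: float, rel_lift: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-variant n for a two-sided two-proportion z-test."""
    p_new = p_base * (1 + rel_lift)
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    return math.ceil(z ** 2 * variance / (p_new - p_base) ** 2)

# 1% baseline CTR, 4% relative lift (1.00% -> 1.04%): roughly one million per variant
print(samples_per_variant(0.01, 0.04))
```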
Calibration measures whether predicted probabilities match observed frequencies:
Well-calibrated model:
When model predicts 80% probability, event occurs 80% of time
Poorly-calibrated model:
When model predicts 80% probability, event occurs only 60% of time
Calibration is critical for:
- Risk assessment
- Medical decisions
- Financial models
Visualize calibration by binning predictions and comparing the mean predicted probability in each bin with the observed frequency of the positive class.
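A sketch using scikit-learn's calibration_curve; the labels and probabilities here are synthetic and deliberately miscalibrated for illustration:

```python
import numpy as np
from sklearn.calibration import calibration_curve

# Synthetic example: predicted probabilities vs. outcomes drawn from a different rate
rng = np.random.default_rng(0)
y_prob = rng.random(5000)
y_true = (rng.random(5000) < y_prob ** 1.5).astype(int)   # slightly miscalibrated

frac_positives, mean_predicted = calibration_curve(y_true, y_prob, n_bins=10)
for pred, obs in zip(mean_predicted, frac_positives):
    print(f"predicted {pred:.2f} -> observed {obs:.2f}")
```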
Post-hoc calibration techniques such as Platt scaling, isotonic regression, and temperature scaling can improve a poorly calibrated model without retraining it. Temperature scaling is the simplest and often works well: divide the logits by a single learned temperature parameter before the softmax.
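A minimal sketch of fitting that temperature by minimizing negative log-likelihood on held-out validation logits with SciPy; val_logits and val_labels are placeholders for your own data:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def nll_at_temperature(T: float, logits: np.ndarray, labels: np.ndarray) -> float:
    """Negative log-likelihood of the labels under temperature-scaled softmax."""
    scaled = logits / T
    scaled = scaled - scaled.max(axis=1, keepdims=True)      # numerical stability
    log_probs = scaled - np.log(np.exp(scaled).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def fit_temperature(logits: np.ndarray, labels: np.ndarray) -> float:
    """Learn a single temperature T on held-out validation logits."""
    result = minimize_scalar(nll_at_temperature, bounds=(0.05, 10.0),
                             args=(logits, labels), method="bounded")
    return float(result.x)

# val_logits: (n_samples, n_classes) raw model outputs; val_labels: integer class indices.
# Both are placeholders for your own validation data:
#   T = fit_temperature(val_logits, val_labels)
#   calibrated_probs = softmax(val_logits / T)
```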
No single metric captures everything. The right metrics depend on your problem, your data, and the costs of different errors. Build a metric hierarchy: a primary metric for optimization, secondary metrics for understanding behavior, and monitoring metrics for production health.
Remember that optimizing for a metric can lead to Goodhart's Law—manipulating the metric rather than the underlying goal. Always validate that improvements in metrics translate to real-world value.