Model Evaluation & Metrics

ROC curves, AUC, cross-validation, overfitting vs underfitting, and choosing the right evaluation metrics.

Intermediate · 14 min read

Why Evaluation Matters

A model that gets 99% accuracy on training data might perform terribly in production. Proper evaluation tells you how the model will perform on unseen data — which is all that matters in the real world. Choosing the right metric depends on your problem and what errors cost you.

Classification Metrics

Metric Formula Range When to Use
Accuracy (TP+TN) / Total 0-1 Balanced classes only
Precision TP / (TP+FP) 0-1 When FP are costly (spam → inbox)
Recall (Sensitivity) TP / (TP+FN) 0-1 When FN are costly (missed cancer)
F1 Score 2×P×R / (P+R) 0-1 Balance precision & recall
Specificity TN / (TN+FP) 0-1 True negative rate
AUC-ROC Area under ROC curve 0-1 Overall model quality, threshold-independent

The ROC Curve

The ROC (Receiver Operating Characteristic) curve plots True Positive Rate vs False Positive Rate at every classification threshold. AUC (Area Under Curve) summarizes this into a single number — 1.0 is perfect, 0.5 is random guessing.

  • AUC = 1.0: Perfect model — separates classes completely
  • AUC = 0.9: Excellent — strong discrimination
  • AUC = 0.7-0.8: Fair — reasonable performance
  • AUC = 0.5: No discrimination — random coin flip

Use AUC-ROC when you need a threshold-independent measure of model quality. Use F1 when you care about a specific threshold and want to balance precision and recall.

Regression Metrics

Metric Formula Interpretation
MSE (1/n) Σ(ŷ-y)² Average squared error — penalizes large errors
RMSE √MSE Same units as target — more interpretable
MAE (1/n) Σ|ŷ-y| Average absolute error — robust to outliers
R² Score 1 - SS_res/SS_tot Fraction of variance explained (1.0 = perfect)

Overfitting vs Underfitting

The goal is a model that generalizes well — performing similarly on training and test data.

Underfitting

  • Model too simple to capture patterns
  • High training error AND high test error
  • Symptom: training loss plateaus high
  • Fix: more capacity (layers, neurons), more features, train longer

Overfitting

  • Model memorizes training data instead of learning patterns
  • Low training error BUT high test error
  • Symptom: training loss drops, validation loss rises
  • Fix: more data, dropout, regularization, early stopping

Cross-Validation

K-fold cross-validation gives a more robust estimate of model performance. Instead of a single train/test split, it rotates through K different splits and averages the results.

Split Data — Divide into K equal folds (e.g., K=5)

Fold 1 as Test — Train on folds 2-5, test on fold 1

Fold 2 as Test — Train on folds 1,3-5, test on fold 2

... Repeat K times — Each fold gets a turn as test set

Average Scores — Mean ± std of K evaluations

5-fold or 10-fold cross-validation is standard. Use stratified K-fold for classification to maintain class proportions in each fold. Leave-one-out (K=N) is for tiny datasets.

Training vs Test Accuracy

Scenario Train Accuracy Test Accuracy Diagnosis Action
Good fit 92% 90% Healthy generalization Deploy!
Overfitting 99% 75% Memorizing training data Add regularization, get more data
Underfitting 65% 63% Model too simple More capacity, better features
Data leakage 99% 99% Test data leaked into training Check preprocessing pipeline!

Key Takeaways

  1. Choose metrics based on your problem — accuracy is misleading with imbalanced data
  2. AUC-ROC is threshold-independent; F1 balances precision and recall at a specific threshold
  3. Overfitting: low train error, high test error — fix with regularization or more data
  4. Cross-validation gives robust estimates by averaging over K different train/test splits
  5. Always compare training vs test performance to diagnose model health

Part of the Training, Optimization & Deployment series on Tekivex. Browse all tutorials or explore our open-source products.