Model Evaluation & Metrics

ROC curves, AUC, cross-validation, overfitting vs underfitting, and choosing the right evaluation metrics.

Intermediate · 14 min read

Why Evaluation Matters

A model that gets 99% accuracy on training data might perform terribly in production. Proper evaluation tells you how the model will perform on unseen data — which is all that matters in the real world. Choosing the right metric depends on your problem and what errors cost you.

Classification Metrics

Metric	Formula	Range	When to Use
Accuracy	(TP+TN) / Total	0-1	Balanced classes only
Precision	TP / (TP+FP)	0-1	When FP are costly (legitimate email → spam folder)
Recall (Sensitivity)	TP / (TP+FN)	0-1	When FN are costly (missed cancer)
F1 Score	2×P×R / (P+R)	0-1	Balance precision & recall
Specificity	TN / (TN+FP)	0-1	True negative rate
AUC-ROC	Area under ROC curve	0-1	Overall model quality, threshold-independent
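The threshold-dependent formulas in the table can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the function name `classification_metrics` and the two example models are hypothetical, and the counts are made up to show why accuracy alone misleads on imbalanced data.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the table's metrics from confusion-matrix counts.

    Ratios with an empty denominator are reported as 0.0 (an assumed
    convention for this sketch; libraries differ on this point).
    """
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical imbalanced test set: 1,000 samples, only 10 positives.

# Model A predicts "negative" for every sample.
always_negative = classification_metrics(tp=0, fp=0, tn=990, fn=10)
# accuracy 0.99, but recall 0.0 and F1 0.0

# Model B finds 8 of the 10 positives at the cost of 12 false alarms.
detector = classification_metrics(tp=8, fp=12, tn=978, fn=2)
# accuracy 0.986, precision 0.4, recall 0.8, F1 about 0.533
```

Model A scores higher on accuracy even though it never detects a positive, which is why the table restricts accuracy to balanced classes.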

The ROC Curve

The ROC (Receiver Operating Characteristic) curve plots True Positive Rate vs False Positive Rate at every classification threshold. AUC (Area Under Curve) summarizes this into a single number — 1.0 is perfect, 0.5 is random guessing.

AUC = 1.0: Perfect model — separates classes completely
AUC = 0.9: Excellent — strong discrimination
AUC = 0.7-0.8: Fair — reasonable performance
AUC = 0.5: No discrimination — random coin flip

Use AUC-ROC when you need a threshold-independent measure of model quality. Use F1 when you care about a specific threshold and want to balance precision and recall.
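To make the threshold sweep concrete, here is a minimal sketch that builds the ROC points by lowering the threshold one distinct score at a time and then integrates them with the trapezoidal rule. The helper names `roc_points` and `auc` and the six example scores are hypothetical; in practice a library routine would normally be used instead.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs from (0, 0) to (1, 1), one per distinct score.

    Assumes labels are 0/1 and that both classes are present.
    """
    pos = sum(y_true)
    neg = len(y_true) - pos
    # Highest scores first: lowering the threshold admits them in this order.
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples tied at the same score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the piecewise-linear curve (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical scores: one negative (0.7) outranks one positive (0.6).
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
model_auc = auc(roc_points(y_true, scores))  # 8/9, about 0.889
```

The result matches the ranking interpretation of AUC: of the 9 positive/negative pairs, the positive is scored higher in 8.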

Regression Metrics

Metric	Formula	Interpretation
MSE	(1/n) Σ(ŷ-y)²	Average squared error — penalizes large errors
RMSE	√MSE	Same units as target — more interpretable
MAE	(1/n) Σ|ŷ-y|	Average absolute error — robust to outliers
R² Score	1 - SS_res/SS_tot	Fraction of variance explained (1.0 = perfect)
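The four formulas above can be computed directly, as in the sketch below. The function name `regression_metrics` and the sample values are hypothetical; the last prediction is deliberately far off to show how a single large error affects MSE much more than MAE.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE and R² as defined in the table.

    Assumes the targets are not all identical (otherwise SS_tot is zero
    and R² is undefined).
    """
    n = len(y_true)
    errors = [pred - true for pred, true in zip(y_pred, y_true)]
    ss_res = sum(e * e for e in errors)             # residual sum of squares
    mean_y = sum(y_true) / n
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)  # total sum of squares
    mse = ss_res / n
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1 - ss_res / ss_tot,
    }


# Hypothetical targets and predictions; errors are -1, 0, 1 and 4.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 13.0]
metrics = regression_metrics(y_true, y_pred)
# mse 4.5, rmse about 2.12, mae 1.5, r2 about 0.1
```

The single error of 4 contributes 16 of the 18 units in the squared-error sum but only 4 of the 6 units in the absolute-error sum.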

Overfitting vs Underfitting

The goal is a model that generalizes well: low error on both training and test data, with only a small gap between the two.

Underfitting

Model too simple to capture patterns
High training error AND high test error
Symptom: training loss plateaus high
Fix: more capacity (layers, neurons), more features, train longer

Overfitting

Model memorizes training data instead of learning patterns
Low training error BUT high test error
Symptom: training loss drops, validation loss rises
Fix: more data, dropout, regularization, early stopping
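Early stopping, one of the fixes listed above, acts on exactly that symptom: it watches validation loss and halts once it stops improving. The sketch below is one simple patience-based variant, not a specific framework's API; `early_stopping_epoch`, the `patience` parameter, and the loss values are all hypothetical.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch whose weights would be kept.

    Scans validation losses in order and stops after `patience`
    consecutive epochs without a new best value.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation loss has been rising or flat for too long
    return best_epoch


# Hypothetical run: validation loss bottoms out at epoch 3, then rises.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.58, 0.66]
best = early_stopping_epoch(val_losses, patience=2)  # 3
```

A larger `patience` tolerates noisy validation curves at the cost of a few extra training epochs.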

Cross-Validation

K-fold cross-validation gives a more robust estimate of model performance. Instead of a single train/test split, it rotates through K different splits and averages the results.

Split Data — Divide into K equal folds (e.g., K=5)

Fold 1 as Test — Train on folds 2-5, test on fold 1

Fold 2 as Test — Train on folds 1,3-5, test on fold 2

... Repeat K times — Each fold gets a turn as test set

Average Scores — Mean ± std of K evaluations

5-fold or 10-fold cross-validation is standard. Use stratified K-fold for classification to maintain class proportions in each fold. Leave-one-out (K=N, one sample per fold) suits only tiny datasets, because it requires training N models.
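The rotation described in the steps above can be sketched as follows. This is a plain (non-stratified) K-fold over sample indices, written only to illustrate the procedure; `k_fold_indices`, `cross_validate`, and the `fit_and_score` callback are hypothetical names, and the callback is assumed to train on the given training indices and return a score on the test indices.

```python
import random
import statistics


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of the K rotations."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]  # fold sizes differ by at most 1
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


def cross_validate(n_samples, fit_and_score, k=5):
    """Return (mean, sample standard deviation) of the K fold scores.

    `fit_and_score(train_idx, test_idx)` is a caller-supplied function
    (an assumption of this sketch) that trains a fresh model and scores it.
    """
    scores = [fit_and_score(train_idx, test_idx)
              for train_idx, test_idx in k_fold_indices(n_samples, k)]
    return statistics.mean(scores), statistics.stdev(scores)


# With 10 samples and K=5, every split has 8 training and 2 test indices,
# and each sample appears in exactly one test fold.
splits = list(k_fold_indices(10, k=5))
```

Training a fresh model inside each call to `fit_and_score` matters: reusing a model fitted on an earlier fold would leak test data into training.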

Training vs Test Accuracy

Scenario	Train Accuracy	Test Accuracy	Diagnosis	Action
Good fit	92%	90%	Healthy generalization	Deploy!
Overfitting	99%	75%	Memorizing training data	Add regularization, get more data
Underfitting	65%	63%	Model too simple	More capacity, better features
Data leakage	99%	99%	Suspiciously high on both; test data may have leaked into training	Check preprocessing pipeline!

Key Takeaways

Choose metrics based on your problem — accuracy is misleading with imbalanced data
AUC-ROC is threshold-independent; F1 balances precision and recall at a specific threshold
Overfitting: low train error, high test error — fix with regularization or more data
Cross-validation gives robust estimates by averaging over K different train/test splits
Always compare training vs test performance to diagnose model health

Part of the Training, Optimization & Deployment series on Tekivex.