Maria Aguilera

A single train/test split is a lottery:

Train/test split changes → score changes.
High-variance estimate. Two splits can disagree by 5–10 % on the same model.
You can't tell if a 0.92 vs 0.90 difference is real or noise.

The fix: cross-validation. Train and evaluate k times on different splits, average the scores.

CV scores are often slightly pessimistic compared with a model trained on all the data, because each fold trains on only (k−1)/k of it. That's expected. The averaged CV score is still a more honest estimate than any single split.

The standard recipe:

Shuffle the data.
Split into k roughly equal folds (5 or 10 is typical).
For each fold i:
- Train on the other k−1 folds.
- Score on fold i.
Report mean ± std of the k scores.
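
The recipe above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: `k_fold_indices`, `cross_validate`, `fit_majority` and `accuracy` are hypothetical helper names, and the "model" is a toy majority-class predictor chosen only so the example runs without any library.

```python
import random
import statistics

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle indices 0..n_samples-1 and split them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    return [indices[i::k] for i in range(k)]

def cross_validate(fit, score, X, y, k=5, seed=0):
    """Return the k per-fold scores for a user-supplied fit/score pair."""
    folds = k_fold_indices(len(X), k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        # Train on the other k-1 folds, score on fold i.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        model = fit([X[j] for j in train_idx], [y[j] for j in train_idx])
        scores.append(score(model, [X[j] for j in test_idx], [y[j] for j in test_idx]))
    return scores

# Toy model (assumption, for illustration only): always predict the majority class.
def fit_majority(X_train, y_train):
    return statistics.mode(y_train)

def accuracy(model, X_test, y_test):
    return sum(1 for label in y_test if label == model) / len(y_test)

X = list(range(20))
y = [0] * 14 + [1] * 6
scores = cross_validate(fit_majority, accuracy, X, y, k=5)
print(f"{statistics.mean(scores):.3f} ± {statistics.stdev(scores):.3f}")
```

In practice you would likely reach for scikit-learn's `KFold` and `cross_val_score` rather than hand-rolling the loop; the sketch only makes the steps explicit.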

Variants worth knowing:

Variant	When to use
Stratified k-fold	Classification with imbalanced classes. Keeps class ratios in each fold.
Group k-fold	When samples come in groups (same user, same patient) — never split a group across folds.
Time-series CV	Time-ordered data. Train on past, test on future. Never shuffle.
LOOCV	Tiny data. `k = n`, so it needs n model fits. Expensive, but every sample is used for testing exactly once.
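
Two of these variants are easy to get wrong by hand, so here is a small sketch of group-aware folds and of an expanding-window time-series split. The function names and the round-robin group assignment are assumptions made for illustration; scikit-learn provides `StratifiedKFold`, `GroupKFold`, `TimeSeriesSplit` and `LeaveOneOut` in `sklearn.model_selection`, which may balance folds differently.

```python
def group_k_fold(groups, k):
    """Assign whole groups to folds; return k lists of sample indices."""
    folds = [[] for _ in range(k)]
    for g_pos, g in enumerate(sorted(set(groups))):
        members = [i for i, gi in enumerate(groups) if gi == g]
        folds[g_pos % k].extend(members)  # round-robin, so a group never straddles folds
    return folds

def expanding_window_splits(n_samples, n_splits):
    """Return (train_idx, test_idx) pairs where training always precedes testing.

    Samples are assumed to be sorted by time already. Any remainder that does
    not fill a whole block at the end is left unused.
    """
    block = n_samples // (n_splits + 1)
    splits = []
    for s in range(1, n_splits + 1):
        train_idx = list(range(s * block))                  # everything before the test block
        test_idx = list(range(s * block, (s + 1) * block))  # the next block in time
        splits.append((train_idx, test_idx))
    return splits

groups = ["u1", "u1", "u2", "u3", "u3", "u3", "u4"]
group_folds = group_k_fold(groups, k=2)
print(group_folds)

time_splits = expanding_window_splits(n_samples=10, n_splits=4)
for train_idx, test_idx in time_splits:
    print(train_idx, "->", test_idx)
```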

Cross-validation handles model selection. For the final honest score, you still need a held-out test set:

Train (~60 %): fit the model parameters.
Validation (~20 %): tune hyperparameters / pick model family. With k-fold CV, the folds take over this role, rotating through the train + validation data.
Test (~20 %): evaluate exactly once at the end.

If you peek at the test set during selection → it's not a test set any more, it's just more validation. Your honest score is gone.

Small data trick: drop the explicit val set, do CV inside train, keep a separate test set.
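
As a rough sketch of the three-way split, the helper below shuffles indices once and carves off the test and validation parts. The name `train_val_test_split` and the default fractions are illustrative assumptions; the 60/20/20 proportions are the ones listed above.

```python
import random

def train_val_test_split(n_samples, val_frac=0.2, test_frac=0.2, seed=0):
    """Return disjoint (train_idx, val_idx, test_idx) index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_frac)
    n_val = round(n_samples * val_frac)
    test_idx = indices[:n_test]                # set aside first, touched once at the end
    val_idx = indices[n_test:n_test + n_val]   # used for tuning / model selection
    train_idx = indices[n_test + n_val:]       # everything else fits the parameters
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = train_val_test_split(100)
print(len(train_idx), len(val_idx), len(test_idx))
```

For the small-data variant, pass `val_frac=0.0` and run cross-validation over `train_idx` only, leaving `test_idx` untouched.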

The fundamental trade-off:

High bias (underfitting) — model too simple, can't capture the pattern. Both train and test error high.
High variance (overfitting) — model too complex, memorises noise. Train error low, test error high.
Sweet spot — generalisation gap (test − train) is small, both errors low.

Symptoms:

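Symptom (error)	Likely cause
Train low, test high	Variance / overfitting
Train high, test high	Bias / underfitting
Train high, test low	Usually a bug (e.g. swapped splits). Can also happen when regularisation such as dropout is active only during training, or when the test set is small or unusually easy.
Train low, test low	Good fit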

Fixes: more data → reduces variance. More features (or a more flexible model) → reduces bias. Regularisation → reduces variance at the cost of a little extra bias.

Practical recipe to land in the U-shape's valley:

Plot the learning curve. Train + val score vs training set size.
- Both flat at low score → high bias.
- Wide gap that doesn't close → high variance.
Plot the validation curve. Score vs a complexity knob (e.g. tree depth, regularisation strength λ, K in k-NN).
- Pick the complexity where val score peaks.
Cross-validate the chosen settings.
Final eval on the held-out test set.
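
A validation curve can be sketched without any plotting library by tabulating train and validation error against the complexity knob. The example below is an illustration under several assumptions: synthetic 1-D data, a hand-written k-NN regressor, and MSE as the metric (so "val score peaks" becomes "val error is lowest"). Note that for k-NN a larger K means a simpler model.

```python
import math
import random

def knn_predict(x_train, y_train, x_query, k):
    """Average the targets of the k nearest training points (1-D inputs)."""
    nearest = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x_query))[:k]
    return sum(y_train[i] for i in nearest) / k

def mse(x_train, y_train, x_eval, y_eval, k):
    """Mean squared error of k-NN predictions on (x_eval, y_eval)."""
    errors = [(knn_predict(x_train, y_train, x, k) - y) ** 2 for x, y in zip(x_eval, y_eval)]
    return sum(errors) / len(errors)

# Synthetic data (assumption): noisy sine wave on [0, 1].
rng = random.Random(0)
xs = [rng.random() for _ in range(120)]
ys = [math.sin(2 * math.pi * x) + rng.gauss(0, 0.3) for x in xs]
x_train, y_train = xs[:80], ys[:80]
x_val, y_val = xs[80:], ys[80:]

candidate_k = [1, 3, 5, 9, 15, 25, 40]
curve = []
for k in candidate_k:
    train_err = mse(x_train, y_train, x_train, y_train, k)
    val_err = mse(x_train, y_train, x_val, y_val, k)
    curve.append((k, train_err, val_err))
    print(f"K={k:>2}  train MSE={train_err:.3f}  val MSE={val_err:.3f}")

# Pick the complexity with the lowest validation error.
best_k = min(curve, key=lambda row: row[2])[0]
print("chosen K:", best_k)
```

With K = 1 the training error is zero because each training point is its own nearest neighbour, which is the overfitting end of the curve. The chosen K should then be cross-validated and finally checked once on the held-out test set, as in the steps above.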

Most classification models output probabilities or scores; a threshold turns them into decisions.

The ROC curve plots:

TPR (recall) on Y, vs
FPR on X,
across all thresholds from 0 to 1.

AUC (area under curve):

1.0 → perfect ranker.
0.5 → random.
< 0.5 → worse than random (flip predictions).

AUC is threshold-independent — it measures how well the model ranks positives above negatives, regardless of cut-off.
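
The ranking interpretation can be checked directly: AUC equals the fraction of (positive, negative) pairs in which the positive gets the higher score, with ties counted as half. The sketch below is a deliberately simple O(n²) illustration with a hypothetical function name, not how libraries compute it.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, t in zip(y_score, y_true) if t == 1]
    neg = [s for s, t in zip(y_score, y_true) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, y_score))                    # 0.75
print(roc_auc(y_true, [s ** 2 for s in y_score]))  # 0.75, the ordering is unchanged
```

Squaring the scores changes every probability but not their order, so the AUC stays the same. That is what "threshold-independent" means here.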

Caveat: under heavy class imbalance, PR-AUC is more honest than ROC-AUC. ROC can look fine even when most of your predicted positives are false alarms, because FPR is divided by the large number of negatives.
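
A quick worked calculation shows how a good-looking ROC operating point can hide poor precision under imbalance. The class counts and the TPR/FPR values below are made up for illustration.

```python
n_pos, n_neg = 100, 9_900   # hypothetical dataset: 1 % positives
tpr, fpr = 0.90, 0.05       # hypothetical operating point on the ROC curve

tp = tpr * n_pos            # 90 true positives
fp = fpr * n_neg            # about 495 false positives
precision = tp / (tp + fp)

print(f"recall={tpr:.2f}  FPR={fpr:.2f}  precision={precision:.3f}")
```

Recall of 0.90 at an FPR of 0.05 looks strong on a ROC plot, yet only about 15 % of the flagged cases are real positives. A precision-recall curve makes that visible.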

Threshold tuning: pick the threshold that matches the precision/recall trade-off you need for your business cost.
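
One way to make "business cost" concrete is to assign a cost to each false positive and false negative and pick the threshold with the lowest total on validation data. The cost values, data, and function names below are illustrative assumptions.

```python
def total_cost(y_true, y_prob, threshold, cost_fp, cost_fn):
    """Sum the cost of all errors made when predicting positive if prob >= threshold."""
    cost = 0.0
    for t, p in zip(y_true, y_prob):
        pred = 1 if p >= threshold else 0
        if pred == 1 and t == 0:
            cost += cost_fp   # false positive
        elif pred == 0 and t == 1:
            cost += cost_fn   # false negative
    return cost

def best_threshold(y_true, y_prob, cost_fp, cost_fn):
    """Try each observed probability as a cut-off, plus 'never predict positive'."""
    candidates = sorted(set(y_prob)) + [float("inf")]
    return min(candidates, key=lambda th: total_cost(y_true, y_prob, th, cost_fp, cost_fn))

y_true = [0, 0, 1, 0, 1, 1]
y_prob = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]

# Missing a positive is 5x worse than a false alarm: favour recall.
print(best_threshold(y_true, y_prob, cost_fp=1, cost_fn=5))  # 0.4
# A false alarm is 5x worse than a miss: favour precision.
print(best_threshold(y_true, y_prob, cost_fp=5, cost_fn=1))  # 0.7
```

The same model gives different cut-offs depending only on the cost ratio, which is why the threshold should be tuned after the model is chosen and on data the test set has not seen.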

For continuous targets:

Metric	Notes
MAE	Mean absolute error. Easiest to interpret. More robust to outliers than MSE / RMSE.
MSE	Mean squared error. Penalises large errors more. Differentiable.
RMSE	Square root of MSE. Same units as `y`. Most reported.
R²	Fraction of variance explained. 1 = perfect, 0 = no better than always predicting the mean, < 0 = worse than predicting the mean.
MAPE	Mean absolute percentage error. Breaks when `y` is near 0.

Default: report RMSE + R² together. RMSE for "typical error size, in the units of `y`" (large errors weigh more), R² for "how much variance is explained".
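
The metrics in the table are short enough to write out directly, which helps when checking what a library reports. This is a plain-Python sketch with a hypothetical function name; MAPE is returned as a fraction rather than a percentage.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE, R2 and MAPE for two equal-length sequences."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(r) for r in residuals) / n
    mse = sum(r * r for r in residuals) / n
    rmse = math.sqrt(mse)
    mean_y = sum(y_true) / n
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot  # undefined (division by zero) if y_true is constant
    mape = sum(abs(r / t) for r, t in zip(residuals, y_true)) / n  # fails if any y_true is 0
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2, "MAPE": mape}

y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]
metrics = regression_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")

# Always predicting the mean of y_true gives R2 = 0, the baseline in the table.
baseline = regression_metrics(y_true, [sum(y_true) / len(y_true)] * len(y_true))
print(f"baseline R2: {baseline['R2']:.4f}")
```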

Unsupervised — no labels to score against. Two paths:

Internal metrics (no labels needed):

Silhouette score — how tight clusters are vs how far apart. Range −1 to 1.
Davies–Bouldin — lower is better.
Inertia (KMeans) — sum of squared distances to centroid. Used in elbow method.

External metrics (with ground truth labels):

Adjusted Rand Index (ARI) — measures agreement with true clusters, adjusted for chance.
Normalised Mutual Information (NMI) — information shared between clustering and labels.

For KMeans: elbow method on inertia + silhouette score to pick k.
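
To make the two internal metrics concrete, here is a small sketch that computes inertia and the mean silhouette for a given labelling. It assumes Euclidean distance, at least two clusters, and that each centroid is the mean of its assigned points (which holds for a converged KMeans). Function names are hypothetical, and the loops are quadratic, so this is for illustration only.

```python
import math

def _clusters(points, labels):
    """Group points by cluster label."""
    grouped = {}
    for p, label in zip(points, labels):
        grouped.setdefault(label, []).append(p)
    return grouped

def inertia(points, labels):
    """Sum of squared distances from each point to the mean of its cluster."""
    total = 0.0
    for members in _clusters(points, labels).values():
        centroid = tuple(sum(c) / len(members) for c in zip(*members))
        total += sum(math.dist(p, centroid) ** 2 for p in members)
    return total

def silhouette_score(points, labels):
    """Mean of (b - a) / max(a, b) over all points; singleton clusters score 0."""
    grouped = _clusters(points, labels)
    total = 0.0
    for p, label in zip(points, labels):
        own = grouped[label]
        if len(own) == 1:
            continue  # convention: a lone point contributes 0
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)  # mean distance within own cluster
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_label, other in grouped.items() if other_label != label)
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(points)

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = [0, 0, 1, 1]   # matches the two obvious blobs
bad = [0, 1, 0, 1]    # mixes the blobs

print(inertia(points, good), silhouette_score(points, good))
print(inertia(points, bad), silhouette_score(points, bad))
```

The sensible labelling gives low inertia and a silhouette close to 1; the mixed-up labelling gives high inertia and a negative silhouette. scikit-learn offers `silhouette_score` and `davies_bouldin_score` in `sklearn.metrics` for real workloads.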

Part 5 · Cross-Validation, Bias–Variance & ROC — Cheat Sheet

The single-split problem

K-fold CV — the workhorse

Train / Val / Test — the 3-way split

Bias vs Variance — the U-shape

Finding the sweet spot

ROC, AUC & thresholds

Regression metrics

Clustering evaluation