Part 5 · Cross-Validation, Bias–Variance & ROC — Cheat Sheet

How to evaluate models honestly. K-fold, train/val/test, bias-variance, ROC, regression and clustering metrics.

1

The single-split problem

A single train/test split is a lottery:

  • Train/test split changes → score changes.
  • High-variance estimate. Two splits can disagree by 5–10 % on the same model.
  • You can't tell if a 0.92 vs 0.90 difference is real or noise.

The fix: cross-validation. Train and evaluate k times on different splits, average the scores.

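CV scores often come out slightly lower than the score from one lucky split. Each fold also trains on only (k−1)/k of the data, so the estimate is mildly pessimistic compared with a model trained on everything. That's expected. The averaged CV score is the more honest number.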
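
Why a 0.92 vs 0.90 gap can be noise: a minimal sketch, assuming the test samples are independent so that accuracy behaves like a binomial proportion. The 0.90 accuracy and the 500-sample test set are made-up numbers for illustration.

```python
import math

def accuracy_standard_error(accuracy: float, n_test: int) -> float:
    """Binomial standard error of an accuracy measured on n_test independent samples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_test)

# Hypothetical numbers: 0.90 accuracy measured on a 500-sample test set.
se = accuracy_standard_error(0.90, 500)
low, high = 0.90 - 1.96 * se, 0.90 + 1.96 * se

print(f"standard error: {se:.4f}")
print(f"rough 95% interval: {low:.3f} to {high:.3f}")
```

Under these assumptions the rough interval spans about 0.874 to 0.926, so a second model scoring 0.92 on the same-sized test set is not clearly better.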

2

K-fold CV — the workhorse

The standard recipe:

  1. Shuffle the data.
  2. Split into k equal folds (5 or 10 is typical).
  3. For each fold i:
    • Train on the other k−1 folds.
    • Score on fold i.
  4. Report mean ± std of the k scores.
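
The four steps above can be sketched in pure Python. This is an illustration of the mechanics, not a replacement for a library splitter such as scikit-learn's KFold. The names kfold_indices and cross_validate, and the toy "predict the mean" model, are hypothetical.

```python
import random
import statistics

def kfold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold CV."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)           # step 1: shuffle
    folds = [indices[i::k] for i in range(k)]      # step 2: k near-equal folds
    for i in range(k):                             # step 3: rotate the held-out fold
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

def cross_validate(fit, score, X, y, k=5, seed=0):
    """Return (mean, std) of the k fold scores. `fit` and `score` are caller-supplied."""
    scores = []
    for train_idx, test_idx in kfold_indices(len(X), k, seed):
        model = fit([X[j] for j in train_idx], [y[j] for j in train_idx])
        scores.append(score(model, [X[j] for j in test_idx], [y[j] for j in test_idx]))
    return statistics.mean(scores), statistics.stdev(scores)   # step 4: mean and std

# Toy usage: the "model" is just the training-set mean of y, scored by negative MSE.
X = list(range(20))
y = [float(v) for v in X]
fit = lambda X_tr, y_tr: statistics.mean(y_tr)
score = lambda model, X_te, y_te: -statistics.mean([(t - model) ** 2 for t in y_te])

mean_score, std_score = cross_validate(fit, score, X, y, k=5)
print(f"CV score: {mean_score:.2f} ± {std_score:.2f}")
```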

Variants worth knowing:

Variant             When to use
Stratified k-fold   Classification with imbalanced classes. Keeps class ratios in each fold.
Group k-fold        When samples come in groups (same user, same patient). Never split a group across folds.
Time-series CV      Time-ordered data. Train on past, test on future. Never shuffle.
LOOCV               Tiny data. k = n. Expensive but useful.
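
A sketch of the group k-fold idea: whole groups, not individual samples, are assigned to folds. The round-robin assignment and the patient IDs below are assumptions for illustration; library splitters (for example scikit-learn's GroupKFold) also try to balance fold sizes, which this sketch does not.

```python
def group_kfold_indices(groups, k):
    """Yield (train_idx, test_idx) so that every group lands in exactly one test fold."""
    unique_groups = sorted(set(groups))
    if k > len(unique_groups):
        raise ValueError("k cannot exceed the number of distinct groups")
    # Round-robin assignment of whole groups to folds (fold sizes are not balanced).
    fold_of_group = {g: i % k for i, g in enumerate(unique_groups)}
    for fold in range(k):
        test_idx = [i for i, g in enumerate(groups) if fold_of_group[g] == fold]
        train_idx = [i for i, g in enumerate(groups) if fold_of_group[g] != fold]
        yield train_idx, test_idx

# Hypothetical example: six samples from three patients.
groups = ["p1", "p1", "p2", "p2", "p3", "p3"]
for train_idx, test_idx in group_kfold_indices(groups, k=3):
    print("train:", train_idx, "test:", test_idx)
```

No patient ever appears on both sides of a split, which is the property that prevents leakage between train and test.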
3

Train / Val / Test — the 3-way split

Cross-validation handles model selection. For the final honest score, you still need a held-out test set:

  • Train (~60 %): fit the model parameters.
  • Validation (~20 %): tune hyperparameters / pick the model family. If you use CV for this, it runs over train + validation combined instead of one fixed validation split.
  • Test (~20 %): evaluate exactly once at the end.

If you peek at the test set during selection → it's not a test set any more, it's just more validation. Your honest score is gone.

Small data trick: drop the explicit val set, do CV inside train, keep a separate test set.
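
A minimal sketch of the 60/20/20 split as shuffled index lists. It assumes the samples are i.i.d. (no groups, no time order), and the function name is hypothetical.

```python
import random

def train_val_test_split(n_samples, val_frac=0.2, test_frac=0.2, seed=0):
    """Return disjoint (train_idx, val_idx, test_idx) lists covering all samples."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_frac))
    n_val = int(round(n_samples * val_frac))
    test_idx = indices[:n_test]                  # set aside first, touched once at the end
    val_idx = indices[n_test:n_test + n_val]     # used for tuning and model selection
    train_idx = indices[n_test + n_val:]         # used for fitting
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = train_val_test_split(100)
print(len(train_idx), len(val_idx), len(test_idx))   # 60 20 20
```

For the small-data trick, pass val_frac=0 and run k-fold CV over train_idx instead.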

4

Bias vs Variance — the U-shape

The fundamental trade-off:

  • High bias (underfitting) — model too simple, can't capture the pattern. Both train and test error high.
  • High variance (overfitting) — model too complex, memorises noise. Train error low, test error high.
  • Sweet spot — generalisation gap (test − train) is small, both errors low.

Symptoms:

Symptom (error)          Likely cause
Train low, test high     Variance / overfitting
Train high, test high    Bias / underfitting
Train high, test low     Bug or extreme regularisation
Train low, test low      Good fit

Fixes: more data → reduces variance. More features (or a more flexible model) → reduces bias. Regularisation → lowers variance at the cost of some bias.

5

Finding the sweet spot

Practical recipe to land in the U-shape's valley:

  1. Plot the learning curve. Train + val score vs training set size.
    • Both flat at low score → high bias.
    • Wide gap that doesn't close → high variance.
  2. Plot the validation curve. Score vs a complexity knob (depth, λ, K).
    • Pick the complexity where val score peaks.
  3. Cross-validate the chosen settings.
  4. Final eval on the held-out test set.
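
A validation-curve sketch for step 2, using a tiny 1-D k-nearest-neighbour regressor written from scratch. The noisy sine data, the 80/40 split, and the list of K values are all assumptions for illustration. Note that for k-NN a small K means high complexity (K = 1 memorises the training set) and a large K means low complexity.

```python
import math
import random

def knn_predict(x_train, y_train, x, k):
    """Average the targets of the k training points nearest to x (1-D inputs)."""
    nearest = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x))[:k]
    return sum(y_train[i] for i in nearest) / k

def mse(x_train, y_train, x_eval, y_eval, k):
    """Mean squared error of the k-NN predictions on (x_eval, y_eval)."""
    errors = [(knn_predict(x_train, y_train, x, k) - y) ** 2
              for x, y in zip(x_eval, y_eval)]
    return sum(errors) / len(errors)

# Synthetic data (assumption): a sine wave with Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(0, 6) for _ in range(120)]
ys = [math.sin(x) + rng.gauss(0, 0.3) for x in xs]
x_tr, y_tr, x_val, y_val = xs[:80], ys[:80], xs[80:], ys[80:]

ks = [1, 3, 5, 10, 20, 40, 80]
train_err = {k: mse(x_tr, y_tr, x_tr, y_tr, k) for k in ks}
val_err = {k: mse(x_tr, y_tr, x_val, y_val, k) for k in ks}
best_k = min(ks, key=lambda k: val_err[k])   # complexity where validation error bottoms out

for k in ks:
    print(f"K={k:>2}  train MSE={train_err[k]:.3f}  val MSE={val_err[k]:.3f}")
print("lowest validation error at K =", best_k)
```

At K = 1 the training error is exactly zero while the validation error is not (the variance end). At K = 80 the model predicts the training mean everywhere (the bias end).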
6

ROC, AUC & thresholds

Classification models output probabilities — the threshold turns them into decisions.

The ROC curve plots:

  • TPR (recall) on Y, vs
  • FPR on X,
  • across all thresholds from 0 to 1.

AUC (area under curve):

  • 1.0 → perfect ranker.
  • 0.5 → random.
  • < 0.5 → worse than random (flip predictions).

AUC is threshold-independent — it measures how well the model ranks positives above negatives, regardless of cut-off.
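
The ranking interpretation can be made concrete: AUC equals the fraction of (positive, negative) pairs in which the positive gets the higher score. A brute-force sketch, assuming 0/1 labels. It is quadratic in the number of samples, so it is for understanding rather than production use, where a library routine such as scikit-learn's roc_auc_score would be the usual choice.

```python
def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def roc_points(y_true, scores):
    """(FPR, TPR) pairs, using each distinct score as a threshold (predict 1 if score >= t)."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, y_true) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, y_true) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, scores))     # 0.75: three of the four pairs are ordered correctly
print(roc_points(y_true, scores))  # [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
```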

Caveat: under heavy class imbalance, PR-AUC is more honest than ROC-AUC. ROC-AUC can look fine even when most of your positive predictions are false alarms, because FPR stays small when negatives vastly outnumber positives.

Threshold tuning: pick the threshold that matches the precision/recall trade-off you need for your business cost.
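
One way to turn a business requirement into a threshold: require a minimum precision, then take the threshold with the highest recall among those that meet it. The selection rule, the function names, and the eight labelled probabilities below are assumptions for illustration. Tune on validation data, not on the test set.

```python
def precision_recall_at(y_true, probs, threshold):
    """Precision and recall when predicting 1 for every prob >= threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for y, yhat in zip(y_true, preds) if y == 1 and yhat == 1)
    fp = sum(1 for y, yhat in zip(y_true, preds) if y == 0 and yhat == 1)
    fn = sum(1 for y, yhat in zip(y_true, preds) if y == 1 and yhat == 0)
    precision = tp / (tp + fp) if tp + fp else 1.0   # convention when nothing is predicted positive
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def best_threshold(y_true, probs, min_precision):
    """Return (threshold, precision, recall) with the highest recall meeting the precision floor."""
    best = None
    for t in sorted(set(probs)):
        precision, recall = precision_recall_at(y_true, probs, t)
        if precision >= min_precision and (best is None or recall > best[2]):
            best = (t, precision, recall)
    return best   # None if no threshold reaches min_precision

# Hypothetical validation-set labels and predicted probabilities.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
probs = [0.05, 0.2, 0.3, 0.45, 0.6, 0.7, 0.75, 0.9]
print(best_threshold(y_true, probs, min_precision=0.75))   # (0.6, 0.75, 0.75)
```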

7

Regression metrics

For continuous targets:

Metric   Notes
MAE      Mean absolute error. Easiest to interpret. Robust to outliers.
MSE      Mean squared error. Penalises large errors more. Differentiable.
RMSE     Square root of MSE. Same units as y. Most reported.
R²       Fraction of variance explained. 1 = perfect, 0 = baseline mean, < 0 = worse than mean.
MAPE     Mean absolute percentage error. Breaks when y is near 0.

Default: report RMSE + R² together. RMSE for "how wrong on average", R² for "how much variance is explained".
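
The formulas behind the table, as a pure-Python sketch. The four-point example data is made up. R² is computed as 1 − SS_res / SS_tot, which is undefined when y is constant.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as y."""
    return math.sqrt(mse(y_true, y_pred))

def r2(y_true, y_pred):
    """1 - SS_res / SS_tot; the baseline is always predicting mean(y_true)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
print(f"MAE={mae(y_true, y_pred):.3f}  MSE={mse(y_true, y_pred):.3f}  "
      f"RMSE={rmse(y_true, y_pred):.3f}  R2={r2(y_true, y_pred):.3f}")
# MAE=0.500  MSE=0.375  RMSE=0.612  R2=0.949
```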

8

Clustering evaluation

Unsupervised — no labels to score against. Two paths:

Internal metrics (no labels needed):

  • Silhouette score — how tight clusters are vs how far apart. Range −1 to 1.
  • Davies–Bouldin — lower is better.
  • Inertia (KMeans) — sum of squared distances to centroid. Used in elbow method.
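
How the silhouette score is built, as a sketch for 1-D points with absolute difference as the distance (both simplifications are assumptions; real data would use a vector distance). For each point, a is the mean distance to the rest of its own cluster and b is the mean distance to the nearest other cluster; the coefficient is (b − a) / max(a, b).

```python
def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (1-D, absolute-difference distance)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue   # singleton clusters contribute 0 by convention
        # The point's distance to itself is 0, so divide by len(own) - 1.
        a = sum(abs(p - q) for q in own) / (len(own) - 1)
        b = min(sum(abs(p - q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

points = [0.0, 1.0, 10.0, 11.0]
good = silhouette_score(points, [0, 0, 1, 1])   # tight, well-separated clusters
bad = silhouette_score(points, [0, 1, 0, 1])    # each cluster straddles the gap
print(f"good labelling: {good:.3f}   bad labelling: {bad:.3f}")
```

The sensible labelling scores close to 1 and the scrambled one goes negative, matching the −1 to 1 range above.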

External metrics (with ground truth labels):

  • Adjusted Rand Index (ARI) — measures agreement with true clusters, adjusted for chance.
  • Normalised Mutual Information (NMI) — information shared between clustering and labels.
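
A sketch of ARI from pair counts: count sample pairs that are grouped together in both labelings, subtract what chance agreement would give, and normalise. It assumes at least two samples; returning 1.0 in the degenerate case is a convention chosen here. Cluster label names do not matter, only the grouping.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same samples."""
    n = len(labels_true)
    # Pairs placed together in both labelings (contingency-table cells).
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs placed together within each labeling on its own.
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)   # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0   # degenerate case, e.g. both labelings use a single cluster
    return (sum_cells - expected) / (max_index - expected)

print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))   # 1.0: same grouping, names swapped
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))   # -0.5: worse than chance
```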

For KMeans: elbow method on inertia + silhouette score to pick k.