The single-split problem
A single train/test split is a lottery:
- Train/test split changes → score changes.
- High-variance estimate. Two splits can disagree by 5–10 % on the same model.
- You can't tell if a 0.92 vs 0.90 difference is real or noise.
The fix: cross-validation. Train and evaluate k times on different splits, average the scores.
CV scores are usually slightly pessimistic: each fold's model trains on only (k−1)/k of the data, so it is a little weaker than a model fit on everything. That's expected. The CV score is still more honest than a single split.
K-fold CV: the workhorse
The standard recipe:
- Shuffle the data.
- Split into k equal folds (5 or 10 is typical).
- For each fold i:
  - Train on the other k−1 folds.
  - Score on fold i.
- Report mean ± std of the k scores.
Variants worth knowing:
| Variant | When to use |
|---|---|
| Stratified k-fold | Classification with imbalanced classes. Keeps class ratios in each fold. |
| Group k-fold | When samples come in groups (same user, same patient). Never split a group across folds. |
| Time-series CV | Time-ordered data. Train on past, test on future. Never shuffle. |
| LOOCV | Tiny datasets. k = n, so n model fits: expensive, but each fit trains on all but one sample. |
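The standard k-fold recipe can be sketched in plain Python as follows. This is a minimal illustration, not a library implementation: `k_fold_indices`, `cross_validate`, `fit_mean` and `mae` are hypothetical names chosen for this example, and the "model" is a toy that always predicts the training mean.

```python
import random
from statistics import mean, stdev


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold CV."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # shuffle the data
    folds = [indices[i::k] for i in range(k)]   # k near-equal folds
    for i in range(k):                          # each fold is the test fold once
        test_idx = folds[i]
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train_idx, test_idx


def cross_validate(fit, score, X, y, k=5, seed=0):
    """Return (mean, std) of the k per-fold scores."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k, seed):
        model = fit([X[j] for j in train_idx], [y[j] for j in train_idx])
        scores.append(score(model,
                            [X[j] for j in test_idx],
                            [y[j] for j in test_idx]))
    return mean(scores), stdev(scores)


# Toy usage (assumed data): the model is just the mean of the training targets.
def fit_mean(X_train, y_train):
    return mean(y_train)


def mae(model, X_test, y_test):
    return mean(abs(t - model) for t in y_test)


X = list(range(20))
y = [2.0 * x + 1.0 for x in X]
cv_mean, cv_std = cross_validate(fit_mean, mae, X, y, k=5)
print(f"MAE: {cv_mean:.2f} ± {cv_std:.2f}")
```

This sketch covers only the plain shuffled variant. Stratified, grouped and time-ordered data need a different fold assignment, as the variants above describe.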
Train / Val / Test — the 3-way split
Cross-validation handles model selection. For the final honest score, you still need a held-out test set:
- Train (~60 %): fit the model parameters.
- Validation (~20 %): tune hyperparameters / pick model family. If you use CV, it replaces this fixed split and runs on the train + validation data.
- Test (~20 %): evaluate exactly once at the end.
If you peek at the test set during selection → it's not a test set any more, it's just more validation. Your honest score is gone.
Small data trick: drop the explicit val set, do CV inside train, keep a separate test set.
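A minimal sketch of the 60/20/20 split, assuming the samples are independent and can be shuffled (so not time-ordered or grouped data). The function name `train_val_test_split` and its parameters are hypothetical and exist only for this example.

```python
import random


def train_val_test_split(n_samples, val_frac=0.2, test_frac=0.2, seed=0):
    """Return three disjoint index lists: train, validation, test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_frac))
    n_val = int(round(n_samples * val_frac))
    test_idx = indices[:n_test]                 # set aside first, touched once
    val_idx = indices[n_test:n_test + n_val]    # used for tuning
    train_idx = indices[n_test + n_val:]        # everything else
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = train_val_test_split(100)
print(len(train_idx), len(val_idx), len(test_idx))
```

For the small-data variant, pass `val_frac=0` and run cross-validation over `train_idx` instead of using a fixed validation set.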
Bias vs Variance — the U-shape
The fundamental trade-off:
- High bias (underfitting) — model too simple, can't capture the pattern. Both train and test error high.
- High variance (overfitting) — model too complex, memorises noise. Train error low, test error high.
- Sweet spot — generalisation gap (test − train) is small, both errors low.
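Symptoms (in terms of error):
| Symptom | Likely cause |
|---|---|
| Train error low, test error high | Variance / overfitting |
| Train error high, test error high | Bias / underfitting |
| Train error high, test error low | Bug or unrepresentative split; regularisation applied only during training can also cause it |
| Train error low, test error low | Good fit |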
Fixes: more data → reduces variance. More features or a more flexible model → reduces bias. Regularisation → lowers variance at the cost of some bias.
Finding the sweet spot
Practical recipe to land in the U-shape's valley:
- Plot the learning curve. Train + val score vs training set size.
  - Both flat at low score → high bias.
  - Wide gap that doesn't close → high variance.
- Plot the validation curve. Score vs a complexity knob (depth, λ, K).
  - Pick the complexity where val score peaks.
- Cross-validate the chosen settings.
- Final eval on the held-out test set.
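The validation-curve step can be illustrated with a tiny 1-D k-nearest-neighbour regressor, where K is the complexity knob (small K is the flexible, high-variance end; large K is the smooth, high-bias end). Everything here is an assumption made for illustration: the synthetic sine data, the noise level, the 80/40 split and the helper names `knn_predict` and `mse`.

```python
import math
import random


def knn_predict(x_train, y_train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(range(len(x_train)),
                     key=lambda j: abs(x_train[j] - x))[:k]
    return sum(y_train[j] for j in nearest) / k


def mse(x_train, y_train, x_eval, y_eval, k):
    """Mean squared error of the k-NN model on (x_eval, y_eval)."""
    errors = [(knn_predict(x_train, y_train, x, k) - y) ** 2
              for x, y in zip(x_eval, y_eval)]
    return sum(errors) / len(errors)


# Synthetic data (assumed): noisy sine wave, 80 training and 40 validation points.
rng = random.Random(0)
xs = [rng.uniform(0, 6) for _ in range(120)]
ys = [math.sin(x) + rng.gauss(0, 0.3) for x in xs]
x_tr, y_tr, x_val, y_val = xs[:80], ys[:80], xs[80:], ys[80:]

ks = [1, 3, 5, 9, 15, 25, 40, 80]
# curve[k] = (train MSE, validation MSE)
curve = {k: (mse(x_tr, y_tr, x_tr, y_tr, k),
             mse(x_tr, y_tr, x_val, y_val, k)) for k in ks}
best_k = min(ks, key=lambda k: curve[k][1])   # lowest validation error

for k in ks:
    print(f"K={k:>2}  train MSE={curve[k][0]:.3f}  val MSE={curve[k][1]:.3f}")
print("best K:", best_k)
```

At K=1 every training point is its own nearest neighbour, so the train error is zero while the validation error is not: the overfitting signature. At K=80 the model predicts one constant and both errors are high.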
ROC, AUC & thresholds
Classification models output probabilities — the threshold turns them into decisions.
The ROC curve plots:
- TPR (recall) on Y, vs
- FPR on X,
- across all thresholds from 0 to 1.
AUC (area under curve):
- 1.0 → perfect ranker.
- 0.5 → random.
- < 0.5 → worse than random (flip predictions).
AUC is threshold-independent — it measures how well the model ranks positives above negatives, regardless of cut-off.
Caveat: under heavy class imbalance, PR-AUC is more honest than ROC-AUC. ROC can look fine even when most of your predicted positives are false alarms, because the large negative class keeps FPR small.
Threshold tuning: pick the threshold that matches the precision/recall trade-off you need for your business cost.
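A minimal sketch of how the ROC curve, AUC and a threshold-specific precision/recall pair can be computed from raw scores. The function names are hypothetical, the four-sample data is a toy example, and the code assumes binary labels (0/1) with at least one sample of each class.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, sweeping the threshold from high to low."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]                       # threshold above every score
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Trapezoidal area under (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
roc_auc = auc(roc_points(y_true, scores))
print(roc_auc)                                   # 0.75
print(precision_recall_at(y_true, scores, 0.5))  # (1.0, 0.5)
```

The AUC of 0.75 matches the ranking view: of the four (positive, negative) pairs, three have the positive scored higher. Moving the threshold changes precision and recall but leaves the AUC untouched.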
Regression metrics
For continuous targets:
| Metric | Notes |
|---|---|
| MAE | Mean absolute error. Easiest to interpret. More robust to outliers than MSE. |
| MSE | Mean squared error. Penalises large errors more. Differentiable. |
| RMSE | Square root of MSE. Same units as y. The most commonly reported. |
| R² | Fraction of variance explained. 1 = perfect, 0 = baseline mean, < 0 = worse than mean. |
| MAPE | Mean absolute percentage error. Breaks when y is near 0. |
Default: report RMSE + R² together. RMSE for "typical error size, in the units of y", R² for "how much variance is explained".
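The metrics in the table can be written out in a few lines. This is a sketch with a hypothetical helper name (`regression_metrics`) and toy data; it assumes `y_true` is not constant, since R² divides by the variance of the targets. MAPE is left out because it is undefined when a target is 0.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R² for paired targets and predictions."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(r) for r in residuals) / n
    ss_res = sum(r * r for r in residuals)          # residual sum of squares
    mse = ss_res / n
    y_mean = sum(y_true) / n
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)  # spread around the mean
    r2 = 1.0 - ss_res / ss_tot                       # 0 means "no better than the mean"
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse), "R2": r2}


metrics = regression_metrics([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

On this toy data the single error of 1.0 accounts for two thirds of the squared error but only half of the absolute error, which is the "penalises large errors more" effect noted for MSE.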
Clustering evaluation
Unsupervised — no labels to score against. Two paths:
Internal metrics (no labels needed):
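- Silhouette score: how tight clusters are vs how far apart. Range −1 to 1, higher is better.
- Davies–Bouldin index: within-cluster scatter relative to between-cluster separation. Lower is better.
- Inertia (KMeans): sum of squared distances from each point to its centroid. Used in the elbow method.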
External metrics (with ground truth labels):
- Adjusted Rand Index (ARI) — measures agreement with true clusters, adjusted for chance.
- Normalised Mutual Information (NMI) — information shared between clustering and labels.
For KMeans: elbow method on inertia + silhouette score to pick k.
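To make the silhouette score concrete, here is a minimal sketch using Euclidean distance. The function name and the four toy points are assumptions for illustration; the code assumes at least two clusters and that the points are not all identical, and it follows the common convention that a point alone in its cluster contributes 0.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue                      # singleton cluster contributes 0
        # a: mean distance to the other members of p's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        total += (b - a) / max(a, b)
    return total / len(points)


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(points, [0, 0, 1, 1])   # groups the nearby points
bad = silhouette_score(points, [0, 1, 0, 1])    # pairs up distant points
print(f"good labelling: {good:.3f}, bad labelling: {bad:.3f}")
```

The sensible labelling scores close to 1 and the mismatched one goes negative, which is the behaviour the −1 to 1 range describes. To pick k, compute this score for each candidate k and compare it alongside the inertia elbow.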