Part 4 · Classification Metrics — Cheat Sheet

Confusion matrix, precision/recall trade-off, F1, MCC, Cohen's Kappa, and what to use when classes are imbalanced.

1

Evaluation vs Validation

Often confused — they're different stages:

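|           | Validation                          | Evaluation                       |
|-----------|-------------------------------------|----------------------------------|
| When      | During training / model selection   | Once, at the very end            |
| On what   | Validation fold / set               | Held-out test set                |
| Purpose   | Pick hyperparameters, compare models | Honest production-quality measure |
| Frequency | Many times                          | Exactly once                     |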

If you tune on the test set, your evaluation is no longer evaluation — it's just more validation.
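A minimal sketch of keeping the two stages apart, assuming a simple random three-way split. The function name `three_way_split` and the 60/20/20 fractions are illustrative choices, not something the cheat sheet prescribes:

```python
import random

def three_way_split(items, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle once, then carve out test and validation sets (hypothetical helper)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]                   # touched once, at the very end
    val = shuffled[n_test:n_test + n_val]      # reused for every tuning decision
    train = shuffled[n_test + n_val:]          # used to fit the model
    return train, val, test

train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))  # prints: 60 20 20
```

Every hyperparameter or model comparison would read only `train` and `val`; `test` is scored a single time after those choices are frozen.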

2

The confusion matrix

Every classification metric is born here. For binary:

|            | Predicted = 1 | Predicted = 0 |
|------------|---------------|---------------|
| Actual = 1 | TP            | FN            |
| Actual = 0 | FP            | TN            |

  • TP — correctly said yes.
  • FN — said no when it was yes. Type II error.
  • FP — said yes when it was no. Type I error.
  • TN — correctly said no.

Every metric is just a different ratio of these four numbers. Learn the matrix, the rest follows.
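As a sketch, the four cells can be counted directly from paired labels. The helper name `confusion_counts` and the toy labels are made up for illustration:

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fn, fp, tn) for binary labels."""
    tp = fn = fp = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive and predicted == positive:
            tp += 1   # correctly said yes
        elif actual == positive:
            fn += 1   # said no when it was yes (Type II)
        elif predicted == positive:
            fp += 1   # said yes when it was no (Type I)
        else:
            tn += 1   # correctly said no
    return tp, fn, fp, tn

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
tp, fn, fp, tn = confusion_counts(y_true, y_pred)
print(tp, fn, fp, tn)  # prints: 3 1 1 3
```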

3

Accuracy & when it lies

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

The headline metric — and the most overused.

It lies when:

  • Classes are imbalanced. Fraud is 0.1 % of transactions → predict "no fraud" always → 99.9 % accuracy → useless.
  • Mistakes have asymmetric costs. Missing a cancer is not equivalent to a false alarm. Accuracy weights them the same.

Use accuracy only when classes are roughly balanced AND mistakes cost the same on both sides.
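The fraud case above can be worked through with concrete counts. The figure of 10,000 transactions is an assumed sample size; the 0.1 % fraud rate comes from the text:

```python
n_total = 10_000
n_fraud = 10                      # 0.1 % of transactions

# A "model" that always predicts "no fraud":
tp, fn = 0, n_fraud               # every fraud case is missed
fp, tn = 0, n_total - n_fraud     # every legitimate case is correct

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
print(f"accuracy={accuracy:.3f} recall={recall:.3f}")
# prints: accuracy=0.999 recall=0.000
```

Accuracy looks excellent while the model catches none of the cases it exists to catch.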

4

Precision & Recall

The two real questions:

$$\text{Precision} = \frac{TP}{TP + FP} \qquad \text{Recall} = \frac{TP}{TP + FN}$$

  • Precision: "When my model says yes, how often is it right?" Punishes false alarms.
  • Recall: "Of all the actual yes's, how many did I catch?" Punishes misses.

Which one matters more? Depends on the cost of mistakes:

  • High-stakes screening (cancer, fraud) → Recall. Missing a case is catastrophic.
  • Costly intervention (spam filter, manual review queue) → Precision. Each false alarm wastes resources.

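The trade-off: raising the decision threshold usually ↑ precision but ↓ recall, and vice versa. Recall can never rise when the threshold goes up; precision typically rises but is not guaranteed to. There's no free lunch.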
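A small sketch of the trade-off, assuming the model outputs a score and predicts "yes" when the score is at or above the threshold. The scores, labels, and the helper `precision_recall_at` are invented for illustration:

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when predicting 1 for score >= threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    # Convention assumed here: report 0.0 when a ratio is undefined.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]

for t in (0.3, 0.5, 0.85):
    p, r = precision_recall_at(y_true, scores, t)
    print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
# prints:
# threshold=0.30 precision=0.57 recall=1.00
# threshold=0.50 precision=0.75 recall=0.75
# threshold=0.85 precision=1.00 recall=0.25
```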

5

F1 — balancing P & R

$$F_1 = 2 \cdot \frac{P \cdot R}{P + R}$$

Harmonic mean of precision and recall. Punishes extreme values — if either is near zero, F1 collapses.

  • F1 = 1 → perfect P and R both.
  • F1 = 0 → no true positives (TP = 0), so neither precision nor recall is above 0.

Use F1 when you want one number that balances precision and recall and don't have a clear preference.

F-beta lets you weight one more than the other:

$$F_\beta = (1+\beta^2) \frac{P R}{\beta^2 P + R}$$

β > 1 favours recall; β < 1 favours precision.
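A sketch of the F-beta formula with one worked pair of values. The function name `f_beta`, the choice to return 0.0 when the denominator is zero, and the sample values P = 0.9, R = 0.5 are assumptions for illustration:

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives the ordinary F1."""
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0  # assumed convention when both inputs are 0
    return (1 + b2) * precision * recall / denom

p, r = 0.9, 0.5
print(f"F1={f_beta(p, r):.3f}")          # prints: F1=0.643
print(f"F2={f_beta(p, r, 2):.3f}")       # prints: F2=0.549
print(f"F0.5={f_beta(p, r, 0.5):.3f}")   # prints: F0.5=0.776
```

With β = 2 the score moves toward the weaker recall (0.5); with β = 0.5 it moves toward the stronger precision (0.9). F1 itself (0.643) sits below the arithmetic mean of 0.7, which is the harmonic mean penalising the gap.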

6

MCC — the fair one

Matthews Correlation Coefficient — the most honest single-number metric under imbalance.

$$\text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$

  • Range: −1 to +1 (1 = perfect, 0 = random, −1 = perfectly wrong).
  • Uses all four cells of the confusion matrix — F1 ignores TN.
  • Honest under severe imbalance: accuracy and F1 can both look good while one class is handled badly; MCC is only high when all four cells are good.

Use MCC when classes are imbalanced and you want a single trustworthy number.
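A sketch comparing the three metrics on one made-up confusion matrix in which the positive class is the 95 % majority. The counts and the zero-denominator convention are assumptions, not from the text:

```python
import math

def mcc(tp, fn, fp, tn):
    """Matthews correlation coefficient from the four confusion-matrix cells."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention when a whole row or column is empty
    return (tp * tn - fp * fn) / denom

# 95 actual positives, 5 actual negatives; the model finds only 1 of the 5 negatives.
tp, fn, fp, tn = 94, 1, 4, 1
accuracy = (tp + tn) / (tp + fn + fp + tn)
f1 = 2 * tp / (2 * tp + fp + fn)   # algebraically the same as 2PR / (P + R)
print(f"accuracy={accuracy:.2f} f1={f1:.2f} mcc={mcc(tp, fn, fp, tn):.2f}")
# prints: accuracy=0.95 f1=0.97 mcc=0.29
```

Accuracy and F1 both sit above 0.95, yet MCC stays low because the true-negative cell, which F1 never looks at, is nearly empty.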

7

Cohen's Kappa

"Better than random chance?"

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

Where p_o = observed accuracy and p_e = accuracy expected by chance, computed from how often each class appears in the actual labels and in the predictions.

  • κ = 1 → perfect agreement.
  • κ = 0 → no better than random guessing weighted by class frequencies.
  • κ < 0 → worse than random.

Useful when comparing your model against a baseline guesser. Also classic in inter-annotator agreement.
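A sketch of the binary case, assuming p_e is taken as the sum over both classes of (fraction of actual labels) × (fraction of predictions). The function name `cohen_kappa` and the counts are illustrative:

```python
def cohen_kappa(tp, fn, fp, tn):
    """Cohen's kappa for a binary confusion matrix."""
    n = tp + fn + fp + tn
    p_o = (tp + tn) / n                               # observed accuracy
    p_yes = ((tp + fn) / n) * ((tp + fp) / n)         # chance agreement on "yes"
    p_no = ((fp + tn) / n) * ((fn + tn) / n)          # chance agreement on "no"
    p_e = p_yes + p_no
    if p_e == 1:
        return 0.0  # assumed convention: kappa is undefined when only one class occurs
    return (p_o - p_e) / (1 - p_e)

# 25 actual positives, 30 predicted positives, 100 samples in total.
kappa = cohen_kappa(tp=20, fn=5, fp=10, tn=65)
print(f"kappa={kappa:.3f}")  # prints: kappa=0.625
```

Here p_o = 0.85 and p_e = 0.25 × 0.30 + 0.75 × 0.70 = 0.60, so κ = 0.25 / 0.40 = 0.625.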

8

Multi-class metrics

For more than 2 classes, you compute per-class metrics then average. Three ways:

| Averaging | What it does | Use when |
|-----------|--------------|----------|
| Macro | Unweighted mean across classes. | All classes equally important. |
| Weighted | Mean weighted by class support. | Account for class imbalance. |
| Micro | Aggregate TP/FP/FN across all classes, then compute. | One overall number; micro-F1 equals accuracy in single-label multi-class problems, balanced or not. |

Rule of thumb: report macro-F1 if you care about every class equally (esp. minority classes). Report micro or weighted if you care about overall sample-level performance.
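A sketch of the three averages on a made-up three-class example with supports 6, 3, and 1. The helper names and the labels are assumptions for illustration:

```python
def per_class_counts(y_true, y_pred, labels):
    """Map each class to its one-vs-rest (tp, fp, fn)."""
    counts = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        counts[c] = (tp, fp, fn)
    return counts

def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn          # same value as 2PR / (P + R)
    return 2 * tp / denom if denom else 0.0

labels = ["a", "b", "c"]
y_true = ["a"] * 6 + ["b"] * 3 + ["c"]
y_pred = ["a", "a", "a", "a", "a", "b", "b", "b", "a", "a"]

counts = per_class_counts(y_true, y_pred, labels)
f1s = {c: f1_from_counts(*counts[c]) for c in labels}
support = {c: y_true.count(c) for c in labels}

macro = sum(f1s.values()) / len(labels)
weighted = sum(f1s[c] * support[c] for c in labels) / len(y_true)
micro = f1_from_counts(*(sum(col) for col in zip(*counts.values())))
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"macro={macro:.3f} weighted={weighted:.3f} micro={micro:.3f} accuracy={accuracy:.3f}")
# prints: macro=0.479 weighted=0.662 micro=0.700 accuracy=0.700
```

The rare class "c" is never predicted, which drags macro-F1 down to 0.479 while weighted and micro barely register it. Micro-F1 matches accuracy here even though the classes are far from balanced.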