Part 11 · LDA & QDA — Cheat Sheet

Generative classifiers. Linear and Quadratic Discriminant Analysis, the shared / per-class covariance trade-off, and why LDA also moonlights as dimensionality reduction.

1

Generative vs discriminative

            Generative                                      Discriminative
Models      P(X | y) and P(y)                               P(y | X) directly
Examples    LDA, QDA, Naïve Bayes                           Logistic Regression, SVM, NN, trees
Strength    Works with little data, can sample from P(X)    Often more accurate when the generative assumptions are wrong
Weakness    Strong distributional assumptions               Needs more data

LDA / QDA are generative: they fit the distribution of features within each class, then use Bayes' theorem to predict the class.

2

LDA — the assumption

LDA assumes:

  • Each class follows a multivariate Gaussian: X | y = k ∼ N(μ_k, Σ).
  • All classes share the same covariance matrix Σ — only the means differ.

Because covariance is shared, the decision boundary between any two classes is linear (a hyperplane).
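
A short derivation may help show why the shared covariance forces a linear boundary. It is a sketch under the Gaussian assumption above, writing π_k = P(y = k) for the class prior and comparing the log-posteriors of two classes k and l:

```latex
\begin{aligned}
\log\frac{P(y=k\mid X)}{P(y=l\mid X)}
 &= \log\frac{\pi_k}{\pi_l}
    - \tfrac12 (X-\mu_k)^\top \Sigma^{-1} (X-\mu_k)
    + \tfrac12 (X-\mu_l)^\top \Sigma^{-1} (X-\mu_l) \\
 &= \log\frac{\pi_k}{\pi_l}
    + X^\top \Sigma^{-1} (\mu_k-\mu_l)
    - \tfrac12 \mu_k^\top \Sigma^{-1} \mu_k
    + \tfrac12 \mu_l^\top \Sigma^{-1} \mu_l
\end{aligned}
```

The quadratic term in X and the Gaussian normalising constants cancel because Σ is the same for both classes, so setting the expression to zero leaves an equation that is linear in X.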

Posterior:

P(y = k | X) ∝ π_k · N(X; μ_k, Σ)

Predict the class with the highest posterior.
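
The posterior rule above can be sketched from scratch with NumPy. This is a minimal illustration rather than a reference implementation: the helper names lda_fit and lda_posterior and the synthetic data are assumptions made for the example, and the pooled covariance uses the maximum-likelihood estimate (dividing by n).

```python
import numpy as np

def lda_fit(X, y):
    """Estimate class priors, class means and the pooled covariance."""
    classes = np.unique(y)
    n = X.shape[0]
    priors = np.array([np.mean(y == k) for k in classes])
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    # Centre every sample on its own class mean, then pool the scatter.
    centred = X - means[np.searchsorted(classes, y)]
    cov = centred.T @ centred / n
    return classes, priors, means, cov

def lda_posterior(X, priors, means, cov):
    """Return P(y = k | X) for every row of X, one column per class."""
    cov_inv = np.linalg.inv(cov)
    # Linear score per class: x^T S^-1 mu_k - 0.5 mu_k^T S^-1 mu_k + log pi_k
    scores = (
        X @ cov_inv @ means.T
        - 0.5 * np.sum(means @ cov_inv * means, axis=1)
        + np.log(priors)
    )
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    post = np.exp(scores)
    return post / post.sum(axis=1, keepdims=True)

# Synthetic two-class data with a shared covariance (illustrative only).
rng = np.random.default_rng(0)
shared_cov = np.array([[1.0, 0.3], [0.3, 1.0]])
X = np.vstack([
    rng.multivariate_normal([0.0, 0.0], shared_cov, size=200),
    rng.multivariate_normal([3.0, 3.0], shared_cov, size=200),
])
y = np.repeat([0, 1], 200)

classes, priors, means, cov = lda_fit(X, y)
posterior = lda_posterior(X, priors, means, cov)
y_pred = classes[posterior.argmax(axis=1)]  # highest posterior wins
accuracy = np.mean(y_pred == y)
print(f"training accuracy: {accuracy:.3f}")
```

Only the terms that differ between classes are kept in the score, which is why normalising the exponentiated scores recovers the posterior.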

3

QDA — relaxing the assumption

QDA drops the shared-covariance assumption:

  • Each class has its own covariance matrix Σ_k.
  • X | y = k ∼ N(μ_k, Σ_k).

Result: the boundary between classes is quadratic — ellipses, hyperbolas, parabolas.

QDA has more flexibility but needs more data to estimate Σ_k per class. With small data, the per-class covariance estimates are noisy and QDA overfits.
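
A small experiment can make the difference concrete. The sketch below uses synthetic data chosen for illustration (an assumption, not from the original): two classes with the same mean but very different spreads, a case where a linear boundary has little to work with and per-class covariances should help.

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.model_selection import train_test_split

# Both classes are centred on the origin; only their spread differs.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.5, size=(500, 2)),  # tight cluster
    rng.normal(0.0, 3.0, size=(500, 2)),  # wide cluster around it
])
y = np.repeat([0, 1], 500)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

lda_acc = LinearDiscriminantAnalysis().fit(X_train, y_train).score(X_test, y_test)
qda_acc = QuadraticDiscriminantAnalysis().fit(X_train, y_train).score(X_test, y_test)
print(f"LDA accuracy: {lda_acc:.2f}")
print(f"QDA accuracy: {qda_acc:.2f}")
```

On data like this, QDA can draw a roughly circular boundary around the tight cluster, while LDA is limited to a straight line through overlapping classes.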

4

LDA vs QDA — the trade-off

                LDA                                QDA
Boundary        Linear                             Quadratic
Parameters      K · d means + 1 covariance         K · d means + K covariances
Data hungry?    No                                 Yes
Bias            Higher (boundary forced linear)    Lower
Variance        Lower                              Higher
Good when       Classes have similar shapes        Classes have different shapes

Try LDA first. Move to QDA only if classes clearly have different spreads or if LDA underfits visibly.
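
To put numbers on the parameter row of the comparison, the sketch below counts free parameters for K classes and d features. The function names are hypothetical, each symmetric d × d covariance is counted as d(d + 1)/2 parameters, and the K − 1 class priors are left out because both models share them.

```python
def lda_param_count(n_classes: int, n_features: int) -> int:
    """K * d class means plus one shared symmetric d x d covariance."""
    cov_params = n_features * (n_features + 1) // 2
    return n_classes * n_features + cov_params

def qda_param_count(n_classes: int, n_features: int) -> int:
    """K * d class means plus K symmetric d x d covariances."""
    cov_params = n_features * (n_features + 1) // 2
    return n_classes * n_features + n_classes * cov_params

for d in (2, 10, 50):
    print(d, lda_param_count(3, d), qda_param_count(3, d))
```

With K = 3 and d = 50, this gives 1,425 parameters for LDA against 3,975 for QDA, which is why QDA needs noticeably more data per class as d grows.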

5

LDA as dimensionality reduction

LDA can also project data to a lower-dimensional space while maximising class separability. It finds axes that:

  • Maximise between-class variance
  • Minimise within-class variance

Unlike PCA (unsupervised, max variance), LDA uses labels.
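
The two criteria above are usually combined into one generalised eigenproblem on the between-class and within-class scatter matrices. The sketch below is one way to write that with NumPy and SciPy. The function name fisher_axes and the synthetic data are assumptions for illustration, and scikit-learn's own solvers may differ in scaling and numerical details.

```python
import numpy as np
from scipy.linalg import eigh

def fisher_axes(X, y):
    """Return discriminant axes sorted by decreasing separability."""
    overall_mean = X.mean(axis=0)
    d = X.shape[1]
    S_w = np.zeros((d, d))  # within-class scatter
    S_b = np.zeros((d, d))  # between-class scatter
    for k in np.unique(y):
        X_k = X[y == k]
        mean_k = X_k.mean(axis=0)
        centred = X_k - mean_k
        S_w += centred.T @ centred
        diff = (mean_k - overall_mean)[:, None]
        S_b += len(X_k) * (diff @ diff.T)
    # Generalised symmetric eigenproblem: S_b v = lambda * S_w v.
    eigvals, eigvecs = eigh(S_b, S_w)
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

# Synthetic data (illustrative only): 3 classes in 4 dimensions.
rng = np.random.default_rng(0)
centres = np.array([[0, 0, 0, 0], [4, 0, 1, 0], [0, 4, 0, 1]], dtype=float)
X = np.vstack([rng.normal(c, 1.0, size=(100, 4)) for c in centres])
y = np.repeat([0, 1, 2], 100)

eigvals, axes = fisher_axes(X, y)
# Count the eigenvalues that are numerically non-zero.
n_useful = int(np.sum(eigvals > 1e-8 * eigvals[0]))
X_proj = (X - X.mean(axis=0)) @ axes[:, :n_useful]
print(n_useful, X_proj.shape)
```

Because the between-class scatter is built from K class means that sum (weighted) to the overall mean, its rank is at most K − 1, so only that many axes carry any class separation.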

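  • Max output dimensions: min(n_classes − 1, n_features).
  • Often beats PCA when the downstream task is classification.

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# X: (n_samples, n_features) array, y: class labels.
# n_components=2 requires at least 3 classes and 2 features.
lda = LinearDiscriminantAnalysis(n_components=2)
X_proj = lda.fit_transform(X, y)  # shape (n_samples, 2)

6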

When they win

  • Roughly Gaussian features within each class. Real-world: many natural measurements (heights, temperatures, lab values).
  • Small datasets. Few parameters to estimate, low variance.
  • You want a probabilistic output that's reasonably calibrated.
  • Classification with > 2 classes. LDA handles multiclass cleanly: one fit, not one-vs-rest.

7

When they lose

  • Features are far from Gaussian. Heavy tails, multimodal distributions, sparse text features.
  • Large datasets where you can afford a flexible model (boosting, NN).
  • Boundary is genuinely non-linear and non-quadratic — kernel SVM or trees would do better.
  • Strongly correlated (near-collinear) features make the covariance estimate ill-conditioned. Use shrinkage in that case.

8

Shrinkage & regularisation

When the number of features d is large relative to the number of samples n, the covariance estimate is noisy, and it is singular when d > n.

Shrinkage: pull the empirical covariance toward a simpler, better-conditioned target (e.g., a scaled identity matrix). scikit-learn supports this in LDA:

LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto")

  • Shrinkage requires solver="lsqr" or "eigen"; the default "svd" solver does not support it.
  • shrinkage="auto" uses the Ledoit-Wolf formula.
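
As a rough illustration of what shrinkage buys, the sketch below fits LDA with and without it on synthetic data that has more features than samples. The data and variable names are assumptions for the example. It compares the rank of the fitted covariance_ attribute, which the "lsqr" solver exposes.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic data (illustrative only): 40 samples, 60 features, so d > n
# and the empirical covariance cannot be full rank.
rng = np.random.default_rng(0)
n_samples, n_features = 40, 60
y = np.repeat([0, 1], n_samples // 2)
X = rng.normal(size=(n_samples, n_features))
X[:, 0] += 2.0 * y  # only feature 0 separates the classes

plain = LinearDiscriminantAnalysis(solver="lsqr", shrinkage=None).fit(X, y)
shrunk = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto").fit(X, y)

rank_plain = np.linalg.matrix_rank(plain.covariance_)
rank_shrunk = np.linalg.matrix_rank(shrunk.covariance_)
print(f"rank without shrinkage: {rank_plain} of {n_features}")
print(f"rank with shrinkage:    {rank_shrunk} of {n_features}")
```

The unshrunk estimate is rank-deficient, while the shrunk one is invertible, which is the property the discriminant needs. A float between 0 and 1 can be passed instead of "auto" to set the shrinkage strength by hand.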

For QDA: the regularisation parameter reg_param (λ) shrinks each class covariance toward the identity, Σ_k ← (1 − λ) · Σ_k + λ · I.
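
A hedged sketch of this regularisation follows. scikit-learn documents reg_param as blending each class covariance with the identity, (1 − reg_param) · Σ_k + reg_param · I. The NumPy lines below apply that transform by hand to one class so its effect on the smallest eigenvalue is visible. The data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# Synthetic data (illustrative only): few samples per class, so each
# per-class covariance estimate is noisy.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(15, 5)),
    rng.normal(1.5, 2.0, size=(15, 5)),
])
y = np.repeat([0, 1], 15)

# The documented transform, written out by hand for class 0.
lam = 0.3
cov_0 = np.cov(X[y == 0], rowvar=False)
cov_0_reg = (1 - lam) * cov_0 + lam * np.eye(cov_0.shape[0])
min_eig_raw = np.linalg.eigvalsh(cov_0).min()
min_eig_reg = np.linalg.eigvalsh(cov_0_reg).min()  # bounded below by lam
print(f"smallest eigenvalue: {min_eig_raw:.3f} -> {min_eig_reg:.3f}")

# The same strength passed to scikit-learn.
qda = QuadraticDiscriminantAnalysis(reg_param=lam).fit(X, y)
proba = qda.predict_proba(X)
```

Since every eigenvalue becomes (1 − λ) times its old value plus λ, the regularised covariance stays invertible even when the raw estimate is close to singular.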