Maria Aguilera

	Generative	Discriminative
Models	`P(X | y)` and `P(y)`	`P(y | X)` directly
Examples	LDA, QDA, Naïve Bayes	Logistic Regression, SVM, NN, trees
Strength	Works with little data, can sample from `P(X)`	Higher accuracy when the generative assumptions are wrong
Weakness	Strong distributional assumptions	Needs more data

LDA / QDA are generative: they fit the distribution of features within each class, then use Bayes' theorem to predict the class.

LDA assumes:

Each class follows a multivariate Gaussian: X | y = k ∼ N(μ_k, Σ).
All classes share the same covariance matrix Σ — only the means differ.

Because covariance is shared, the decision boundary between any two classes is linear (a hyperplane).
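A short derivation of why the shared covariance gives a linear boundary. It is a sketch that assumes the Gaussian class densities above and class priors π_k = P(y = k); the symbol δ_k (the discriminant score) and the constant C are introduced here for illustration.

```latex
\begin{aligned}
\log P(y = k \mid x)
  &= \log \pi_k - \tfrac{1}{2}(x - \mu_k)^\top \Sigma^{-1} (x - \mu_k) + C \\
  &= \log \pi_k + x^\top \Sigma^{-1} \mu_k
     - \tfrac{1}{2}\mu_k^\top \Sigma^{-1} \mu_k
     - \tfrac{1}{2} x^\top \Sigma^{-1} x + C .
\end{aligned}
% The term -1/2 x^T Sigma^{-1} x does not depend on k, so it drops out
% when classes are compared, leaving a score that is linear in x:
\delta_k(x) = x^\top \Sigma^{-1} \mu_k
            - \tfrac{1}{2}\mu_k^\top \Sigma^{-1} \mu_k + \log \pi_k .
% The boundary between classes k and l is the set where delta_k = delta_l:
x^\top \Sigma^{-1} (\mu_k - \mu_l)
  = \tfrac{1}{2}\left(\mu_k^\top \Sigma^{-1} \mu_k - \mu_l^\top \Sigma^{-1} \mu_l\right)
    - \log \frac{\pi_k}{\pi_l} .
```

The last line is a single linear equation in x, i.e. a hyperplane. With class-specific Σ_k the quadratic term no longer cancels, which is where QDA's curved boundaries come from.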

Posterior:

P(y = k | X) ∝ π_k · N(X; μ_k, Σ)

Predict the class with the highest posterior.
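The posterior rule can be sketched from scratch with NumPy and SciPy. This is a minimal illustration, not how any particular library implements LDA: the toy data, the pooled-covariance estimator (dividing by n − K) and the helper name `lda_posterior` are all assumptions made for the example.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Hypothetical toy data: 3 classes in 2-D that share one covariance matrix.
true_means = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
shared_cov = np.array([[1.0, 0.3], [0.3, 1.0]])
X = np.vstack([rng.multivariate_normal(m, shared_cov, size=50) for m in true_means])
y = np.repeat(np.arange(3), 50)

classes = np.unique(y)
priors = np.array([np.mean(y == k) for k in classes])        # pi_k
means = np.array([X[y == k].mean(axis=0) for k in classes])  # mu_k

# Pooled within-class covariance: a single Sigma for all classes.
centered = X - means[y]
sigma = centered.T @ centered / (len(X) - len(classes))


def lda_posterior(X_new):
    """Return P(y = k | x) for each row of X_new (hypothetical helper)."""
    # Unnormalised posterior pi_k * N(x; mu_k, Sigma), one column per class.
    joint = np.column_stack([
        priors[k] * multivariate_normal.pdf(X_new, mean=means[k], cov=sigma)
        for k in classes
    ])
    # Normalise across classes so each row sums to 1.
    return joint / joint.sum(axis=1, keepdims=True)


posterior = lda_posterior(X)
y_pred = posterior.argmax(axis=1)  # predict the class with the highest posterior
print("training accuracy:", np.mean(y_pred == y))
```

Normalising the joint is only needed when probabilities are wanted; the argmax is the same with or without it.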

QDA drops the shared-covariance assumption:

Each class has its own covariance matrix Σ_k.
X | y = k ∼ N(μ_k, Σ_k).

Result: the boundary between classes is quadratic — ellipses, hyperbolas, parabolas.

QDA has more flexibility but needs more data to estimate Σ_k per class. With small data, the per-class covariance estimates are noisy and QDA overfits.
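A quick way to see the difference is to fit both models on data whose classes have clearly different covariances. The sketch below uses synthetic data chosen for illustration (the means, covariances and sample size are assumptions); the scores it prints depend on the random draw, so treat them as indicative only.

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_per_class = 200

# Hypothetical 2-class problem: class 0 is round, class 1 is stretched and tilted.
X0 = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], size=n_per_class)
X1 = rng.multivariate_normal([1.0, 1.0], [[4.0, 1.5], [1.5, 1.0]], size=n_per_class)
X = np.vstack([X0, X1])
y = np.repeat([0, 1], n_per_class)

# 5-fold cross-validated accuracy for each model.
lda_scores = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=5)
qda_scores = cross_val_score(QuadraticDiscriminantAnalysis(), X, y, cv=5)

print(f"LDA accuracy: {lda_scores.mean():.3f}")
print(f"QDA accuracy: {qda_scores.mean():.3f}")
```

Shrinking `n_per_class` or raising the number of features is a simple way to probe the point at which QDA's extra covariance parameters start to hurt.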

	LDA	QDA
Boundary	Linear	Quadratic
Parameters	`K · d` means + 1 covariance (`d(d+1)/2` entries)	`K · d` means + `K` covariances (`K · d(d+1)/2` entries)
Data hungry?	No	Yes
Bias	Higher (boundary forced linear)	Lower
Variance	Lower	Higher
Good when	Classes have similar shapes	Classes have different shapes
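To make the parameter row concrete, the counts can be computed directly. The function names below are hypothetical, and the sketch ignores the K − 1 free class priors, which are the same for both models.

```python
def n_covariance_params(d):
    """Free entries in one symmetric d x d covariance matrix."""
    return d * (d + 1) // 2


def lda_param_count(K, d):
    """K class means plus one shared covariance."""
    return K * d + n_covariance_params(d)


def qda_param_count(K, d):
    """K class means plus one covariance per class."""
    return K * d + K * n_covariance_params(d)


for K, d in [(3, 10), (3, 100)]:
    print(f"K={K}, d={d}: LDA {lda_param_count(K, d)}, QDA {qda_param_count(K, d)}")
# K=3, d=10: LDA 85, QDA 195
# K=3, d=100: LDA 5350, QDA 15450
```

The gap grows with `K · d²`, which is why QDA becomes data hungry quickly as the number of features rises.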

Try LDA first. Move to QDA only if classes clearly have different spreads or if LDA underfits visibly.

LDA can also project data to a lower-dimensional space while maximising class separability. It finds axes that:

Maximise between-class variance
Minimise within-class variance

Unlike PCA (unsupervised, max variance), LDA uses labels.

Max output dimensions: min(n_classes − 1, n_features).
Often beats PCA when the downstream task is classification.

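from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)  # example data (an assumption): 3 classes, 4 features
lda = LinearDiscriminantAnalysis(n_components=2)  # at most n_classes - 1 = 2 here
X_proj = lda.fit_transform(X, y)  # shape (150, 2)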
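The two criteria (large between-class variance, small within-class variance) can be sketched from scratch as a generalised eigenproblem on the scatter matrices. This is an illustration under assumed synthetic data, not scikit-learn's internal algorithm; the names `S_W`, `S_B` and `n_useful` are introduced here.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)

# Hypothetical data: K = 3 classes in d = 4 dimensions.
K, d, n_per_class = 3, 4, 50
class_means = rng.normal(scale=3.0, size=(K, d))
X = np.vstack([m + rng.normal(size=(n_per_class, d)) for m in class_means])
y = np.repeat(np.arange(K), n_per_class)

overall_mean = X.mean(axis=0)
S_W = np.zeros((d, d))  # within-class scatter
S_B = np.zeros((d, d))  # between-class scatter
for k in range(K):
    X_k = X[y == k]
    mu_k = X_k.mean(axis=0)
    S_W += (X_k - mu_k).T @ (X_k - mu_k)
    diff = (mu_k - overall_mean)[:, None]
    S_B += len(X_k) * (diff @ diff.T)

# Solve S_B w = lambda * S_W w; eigh returns eigenvalues in ascending order.
eigvals, eigvecs = eigh(S_B, S_W)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# S_B has rank at most K - 1, so only K - 1 directions carry class information.
n_useful = int(np.sum(eigvals > 1e-8 * eigvals[0]))
W = eigvecs[:, :n_useful]
X_proj = X @ W
print("useful directions:", n_useful, "projected shape:", X_proj.shape)
```

The remaining eigenvalues are numerically zero, which is the reason the projection is capped at n_classes − 1 dimensions.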

When LDA / QDA win:

Roughly Gaussian features within each class. Many natural measurements (heights, temperatures, lab values) fit this.
Small datasets. Few parameters to estimate, so variance is low.
You want a probabilistic output that's reasonably calibrated.
Classification with > 2 classes. LDA handles multiclass natively: one fit, not one-vs-rest.

When LDA / QDA lose:

Features are far from Gaussian. Heavy tails, multimodal distributions, sparse text features.
Large datasets where you can afford a flexible model (boosting, NN).
Boundary is genuinely non-linear and non-quadratic. Kernel SVM or trees would do better.
Highly correlated (collinear) features make the covariance estimate ill-conditioned. Use shrinkage.

When d is large relative to n, the covariance estimate is noisy or singular.

Shrinkage: pull the empirical covariance estimate toward a simpler, well-conditioned target (e.g., a scaled identity matrix). scikit-learn supports this in LDA:

LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto")

Shrinkage requires solver="lsqr" or "eigen"; the default "svd" solver does not support it.
shrinkage="auto" uses the Ledoit-Wolf formula.
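The effect is easiest to see when there are fewer samples than features. The sketch below uses synthetic data with assumed sizes (n = 40, d = 60): the empirical covariance is rank-deficient, the Ledoit-Wolf estimate from `sklearn.covariance` is full rank, and LDA with shrinkage fits without trouble.

```python
import numpy as np
from sklearn.covariance import LedoitWolf
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Hypothetical setting: fewer samples (n) than features (d).
n, d = 40, 60
X = rng.normal(size=(n, d))
y = np.repeat([0, 1], n // 2)
X[y == 1, :5] += 1.0  # shift a few features so the classes differ

# Empirical covariance has rank at most n - 1, so it is singular here.
emp_cov = np.cov(X, rowvar=False)
rank_emp = np.linalg.matrix_rank(emp_cov)

# Ledoit-Wolf blends the empirical covariance with a scaled identity.
lw = LedoitWolf().fit(X)
rank_lw = np.linalg.matrix_rank(lw.covariance_)
print(f"rank: empirical {rank_emp}, shrunk {rank_lw}, shrinkage {lw.shrinkage_:.2f}")

# LDA with automatic shrinkage; "lsqr" is one of the solvers that supports it.
lda = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto").fit(X, y)
train_acc = lda.score(X, y)
print(f"training accuracy with shrinkage: {train_acc:.2f}")
```

If the shrunk model is also needed for projection, solver="eigen" is the one to reach for, since the "lsqr" solver does not implement `transform`.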

For QDA: the regularisation parameter reg_param = λ (between 0 and 1) shrinks each class covariance toward the identity: Σ_k ← (1 − λ) · Σ_k + λ · I.
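A minimal sketch of QDA regularisation on synthetic data. As scikit-learn documents it, `reg_param` blends each class covariance with the identity, (1 − λ) · Σ_k + λ · I; the hand-computed blend below is included only to show the effect on conditioning, and the value 0.1 and the data are assumptions for illustration.

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

rng = np.random.default_rng(0)

# Hypothetical data: 2 classes, 5 features, different spread per class.
d, n_per_class = 5, 30
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(n_per_class, d)),
    rng.normal(loc=1.0, scale=2.0, size=(n_per_class, d)),
])
y = np.repeat([0, 1], n_per_class)

lam = 0.1  # illustrative value; in practice tune it by cross-validation
qda = QuadraticDiscriminantAnalysis(reg_param=lam).fit(X, y)
proba = qda.predict_proba(X)

# The same blend applied by hand to one class covariance.
cov_1 = np.cov(X[y == 1], rowvar=False)
cov_1_reg = (1 - lam) * cov_1 + lam * np.eye(d)
print("condition number before:", np.linalg.cond(cov_1))
print("condition number after: ", np.linalg.cond(cov_1_reg))
```

Blending with the identity never increases the condition number, so the regularised covariance is at least as stable to invert as the raw estimate.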

Part 11 · LDA & QDA — Cheat Sheet

Generative vs discriminative

LDA — the assumption

QDA — relaxing the assumption

LDA vs QDA — the trade-off

LDA as dimensionality reduction

When they win

When they lose

Shrinkage & regularisation