| | Generative | Discriminative |
|---|---|---|
| Models | `P(X \| y)` and `P(y)` | `P(y \| X)` directly |
| Examples | LDA, QDA, Naïve Bayes | Logistic Regression, SVM, NN, trees |
| Strength | Works with little data, can sample from P(X) | Often more accurate when the generative model's assumptions are wrong |
| Weakness | Strong distributional assumptions | Needs more data |
Generative vs discriminative
LDA / QDA are generative: they fit the distribution of features within each class, then use Bayes' theorem to predict the class.
LDA — the assumption
LDA assumes:
- Each class follows a multivariate Gaussian: X | y = k ∼ N(μ_k, Σ).
- All classes share the same covariance matrix Σ; only the means differ.
Because covariance is shared, the decision boundary between any two classes is linear (a hyperplane).
Posterior:
P(y = k | X) ∝ π_k · N(X; μ_k, Σ)
Predict the class with the highest posterior.
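The posterior rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the toy data, the helper name `lda_posterior`, and the choice of the maximum-likelihood pooled covariance (dividing by n) are all assumptions made here. With that choice the result should agree with scikit-learn's `solver="lsqr"` fit.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Toy data (an assumption for illustration): two Gaussian classes, shared covariance.
cov = np.array([[1.0, 0.3], [0.3, 1.0]])
X = np.vstack([
    rng.multivariate_normal([0.0, 0.0], cov, size=100),
    rng.multivariate_normal([3.0, 1.0], cov, size=100),
])
y = np.repeat([0, 1], 100)


def lda_posterior(X_train, y_train, X_new):
    """Hypothetical helper: P(y = k | x) under the shared-covariance Gaussian model."""
    classes = np.unique(y_train)
    n = X_train.shape[0]
    priors = np.array([np.mean(y_train == k) for k in classes])             # pi_k
    means = np.array([X_train[y_train == k].mean(axis=0) for k in classes])  # mu_k
    # Pooled within-class covariance (maximum-likelihood version, divides by n).
    centered = X_train - means[np.searchsorted(classes, y_train)]
    sigma_inv = np.linalg.inv(centered.T @ centered / n)
    # log(pi_k) - 0.5 * Mahalanobis distance. The Gaussian normalising constant
    # is identical for every class because Sigma is shared, so it is dropped.
    log_post = np.empty((X_new.shape[0], len(classes)))
    for j, (pi_k, mu_k) in enumerate(zip(priors, means)):
        diff = X_new - mu_k
        maha = np.einsum("ij,jk,ik->i", diff, sigma_inv, diff)
        log_post[:, j] = np.log(pi_k) - 0.5 * maha
    # Normalise in a numerically stable way so each row sums to 1.
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)


post = lda_posterior(X, y, X)
pred = post.argmax(axis=1)  # predict the class with the highest posterior

# Cross-check against scikit-learn.
sk = LinearDiscriminantAnalysis(solver="lsqr").fit(X, y)
max_diff = np.abs(post - sk.predict_proba(X)).max()
print("training accuracy:", (pred == y).mean())
print("max |difference| vs scikit-learn:", max_diff)
```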
QDA — relaxing the assumption
QDA drops the shared-covariance assumption:
- Each class has its own covariance matrix Σ_k: X | y = k ∼ N(μ_k, Σ_k).
Result: the boundary between classes is quadratic — ellipses, hyperbolas, parabolas.
QDA has more flexibility but needs more data to estimate Σ_k per class. With small data, the per-class covariance estimates are noisy and QDA overfits.
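A small experiment makes the difference visible. The sketch below uses synthetic data chosen for illustration (an assumption, not from the notes): both classes share the same mean and differ only in spread, so no straight line can separate them, while a quadratic boundary (roughly a circle) can.

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
# Same mean, different spread: class 0 is tight, class 1 is wide.
X = np.vstack([
    rng.normal(0.0, 0.5, size=(n, 2)),
    rng.normal(0.0, 2.0, size=(n, 2)),
])
y = np.repeat([0, 1], n)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

lda = LinearDiscriminantAnalysis().fit(X_tr, y_tr)
qda = QuadraticDiscriminantAnalysis(store_covariance=True).fit(X_tr, y_tr)

lda_acc = lda.score(X_te, y_te)
qda_acc = qda.score(X_te, y_te)
print(f"LDA test accuracy: {lda_acc:.2f}")
print(f"QDA test accuracy: {qda_acc:.2f}")

# QDA keeps one covariance estimate per class.
for k, cov_k in zip(qda.classes_, qda.covariance_):
    print(f"class {k} covariance diagonal:", np.round(np.diag(cov_k), 2))
```

Shrinking `n` in this sketch is a quick way to see the per-class covariance estimates become noisier.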
LDA vs QDA — the trade-off
| | LDA | QDA |
|---|---|---|
| Boundary | Linear | Quadratic |
| Parameters | K · d mean values + 1 covariance matrix (d(d+1)/2 free entries) | K · d mean values + K covariance matrices (K · d(d+1)/2 free entries) |
| Data hungry? | No | Yes |
| Bias | Higher (boundary forced linear) | Lower |
| Variance | Lower | Higher |
| Good when | Classes have similar shapes | Classes have different shapes |
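To make the Parameters row concrete, here is a small counting sketch. The function names are hypothetical, and class priors are left out of the count to match the table. A symmetric d × d covariance matrix has d(d+1)/2 free entries.

```python
def covariance_params(d):
    """Free entries in a symmetric d x d covariance matrix."""
    return d * (d + 1) // 2


def lda_params(n_classes, d):
    """K mean vectors plus one shared covariance matrix."""
    return n_classes * d + covariance_params(d)


def qda_params(n_classes, d):
    """K mean vectors plus one covariance matrix per class."""
    return n_classes * d + n_classes * covariance_params(d)


# K = 3, d = 10 -> LDA 85, QDA 195
for d in (2, 10, 50):
    print(f"d={d:>2}: LDA {lda_params(3, d):>5}, QDA {qda_params(3, d):>5}")
```

The gap grows with d squared, which is why QDA becomes data hungry quickly as features are added.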
Try LDA first. Move to QDA only if classes clearly have different spreads or if LDA underfits visibly.
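One way to act on this advice is to score both models under cross-validation and switch to QDA only if it is clearly better. The sketch below assumes the Wine dataset bundled with scikit-learn and 5-fold cross-validation. Both are illustrative choices, not part of the original notes.

```python
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)  # 178 samples, 13 features, 3 classes

# For classifiers, an integer cv uses stratified folds.
lda_scores = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=5)
qda_scores = cross_val_score(QuadraticDiscriminantAnalysis(), X, y, cv=5)

print(f"LDA: {lda_scores.mean():.3f} +/- {lda_scores.std():.3f}")
print(f"QDA: {qda_scores.mean():.3f} +/- {qda_scores.std():.3f}")
```

If the two means are within a standard deviation of each other, the simpler model is usually the safer pick.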
LDA as dimensionality reduction
LDA can also project data to a lower-dimensional space while maximising class separability. It finds axes that:
- Maximise between-class variance
- Minimise within-class variance
Unlike PCA (unsupervised, max variance), LDA uses labels.
- Max output dimensions: n_classes − 1.
- Often beats PCA when the downstream task is classification.
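Both points can be checked on a small dataset. The sketch below assumes the Iris data (3 classes, so at most 2 LDA components) and a hypothetical helper `fisher_ratio` that measures between-class over within-class spread along a single axis. The first LDA axis maximises that ratio by construction, so it should never score below the first PCA axis.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features, 3 classes

# With 3 classes, LDA can return at most 3 - 1 = 2 components.
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)
X_pca = PCA(n_components=2).fit_transform(X)  # PCA never sees y


def fisher_ratio(z, labels):
    """Between-class over within-class sum of squares along one axis."""
    grand_mean = z.mean()
    between, within = 0.0, 0.0
    for k in np.unique(labels):
        z_k = z[labels == k]
        between += len(z_k) * (z_k.mean() - grand_mean) ** 2
        within += ((z_k - z_k.mean()) ** 2).sum()
    return between / within


ratio_lda = fisher_ratio(X_lda[:, 0], y)
ratio_pca = fisher_ratio(X_pca[:, 0], y)
print(f"first LDA axis: {ratio_lda:.1f}")
print(f"first PCA axis: {ratio_pca:.1f}")

# Requesting more than n_classes - 1 components is rejected at fit time.
try:
    LinearDiscriminantAnalysis(n_components=3).fit(X, y)
    too_many_rejected = False
except ValueError:
    too_many_rejected = True
print("n_components=3 rejected:", too_many_rejected)
```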
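```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# X: feature matrix of shape (n_samples, n_features); y: class labels
lda = LinearDiscriminantAnalysis(n_components=2)  # needs at least 3 classes
X_proj = lda.fit_transform(X, y)  # shape (n_samples, 2)
```
When they win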
- Roughly Gaussian features within each class. Real-world: many natural measurements (heights, temperatures, lab values).
- Small datasets. Few parameters to estimate, low variance.
- You want a probabilistic output that's reasonably calibrated.
- Classification with > 2 classes. LDA handles multiclass cleanly — one fit, not one-vs-rest.
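The last two points are easy to see in code. In this short sketch (Iris is an assumed example dataset), a single `fit` covers all three classes and `predict_proba` returns one posterior column per class.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis().fit(X, y)  # one fit, no one-vs-rest wrapper
proba = lda.predict_proba(X)                  # shape (n_samples, n_classes)
pred = lda.classes_[proba.argmax(axis=1)]     # highest posterior wins

print("classes:", lda.classes_)
print("first row of posteriors:", np.round(proba[0], 3))
```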
When they lose
- Features are far from Gaussian. Heavy tails, multimodal distributions, sparse text features.
- Large datasets where you can afford a flexible model (boosting, NN).
- Boundary is genuinely non-linear and non-quadratic — kernel SVM or trees would do better.
- Correlated features hurt the covariance estimate — use shrinkage.
Shrinkage & regularisation
When d is large relative to n, the covariance estimate is noisy or singular.
Shrinkage: pull the per-class covariance toward a simpler estimate (e.g., the identity matrix). scikit-learn supports this in LDA:
```python
LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto")
```
- `solver="lsqr"` or `"eigen"` enables shrinkage (the default `"svd"` solver does not support it).
- `shrinkage="auto"` uses the Ledoit-Wolf formula.
For QDA: the regularisation parameter `reg_param` (λ) blends each class covariance toward the identity. In scikit-learn, Σ_k is replaced by (1 − λ) · Σ_k + λ · I.
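The effect of shrinkage on LDA can be checked directly. The sketch below uses synthetic data with more features than samples (an assumption for illustration) and compares the smallest eigenvalue of the fitted covariance with and without shrinkage. A value near zero means the estimate is singular.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

# Synthetic data (an assumption): 40 samples, 60 features, so d > n.
X, y = make_classification(
    n_samples=40, n_features=60, n_informative=5, n_redundant=0, random_state=0
)

plain = LinearDiscriminantAnalysis(solver="lsqr").fit(X, y)
shrunk = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto").fit(X, y)

# Smallest eigenvalue of each fitted covariance matrix.
min_eig_plain = np.linalg.eigvalsh(plain.covariance_).min()
min_eig_shrunk = np.linalg.eigvalsh(shrunk.covariance_).min()
print(f"min eigenvalue, no shrinkage: {min_eig_plain:.2e}")
print(f"min eigenvalue, shrinkage:    {min_eig_shrunk:.2e}")

# Cross-validated accuracy for both settings (5 folds, an arbitrary choice).
for name, kwargs in [("no shrinkage", {}), ("shrinkage", {"shrinkage": "auto"})]:
    model = LinearDiscriminantAnalysis(solver="lsqr", **kwargs)
    acc = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: CV accuracy {acc:.2f}")
```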