| Variant | Likelihood model | Use for |
|---|---|---|
| Gaussian NB | Each feature is a normal distribution per class. | Continuous numeric features. |
| Multinomial NB | Features are counts. Probabilities from frequency. | Text classification (TF or word counts). |
| Bernoulli NB | Features are binary (present / absent). | Text with binary presence, fraud flags. |
Bayes' theorem
The formula every classifier secretly wants:
P(class | features) = P(features | class) · P(class) / P(features)
Where:
- P(class): the prior. How common is the class overall?
- P(features | class): the likelihood. How likely is this feature combination given the class?
- P(features): the evidence. It is the same for every class, so it is often ignored when comparing.
You pick the class with the highest posterior P(class | features).
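To make the formula concrete, here is a small worked example in plain Python. The priors and likelihoods are made-up numbers for a hypothetical two-class spam filter, not values from any real dataset.

```python
# Hypothetical numbers for a two-class spam filter (illustration only).
priors = {"spam": 0.3, "ham": 0.7}        # P(class)
likelihoods = {"spam": 0.8, "ham": 0.1}   # P(features | class), e.g. "message contains 'free'"

# Numerator of Bayes' theorem for each class: P(features | class) * P(class)
joint = {c: likelihoods[c] * priors[c] for c in priors}

# Evidence P(features): the sum of the numerators over all classes.
evidence = sum(joint.values())

# Posterior P(class | features)
posterior = {c: joint[c] / evidence for c in joint}

# Dividing by the evidence rescales every class equally,
# so the winning class is the same with or without it.
predicted = max(posterior, key=posterior.get)
assert predicted == max(joint, key=joint.get)

print(posterior, predicted)
```

Even though spam is the rarer class here, the much higher likelihood outweighs the prior, so the posterior favours spam.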
Why 'naïve'?
The trick that makes it tractable: assume all features are conditionally independent given the class.
P(x1, x2, ..., xn | class) = P(x1 | class) · P(x2 | class) · ... · P(xn | class)
In reality, features are usually correlated — your "naïve" model is wrong about the dependencies. But the ranking of classes is often still correct, which is all classification needs.
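The factorization above can be sketched in a few lines for binary features. The per-feature probabilities below are hypothetical, and summing logarithms instead of multiplying raw probabilities is a common implementation choice assumed here to avoid numerical underflow; it does not change which class wins.

```python
import math

# Hypothetical P(feature is present | class) for three binary features:
# contains "free", contains "meeting", has an attachment.
p_feature_given_class = {
    "spam": [0.8, 0.1, 0.4],
    "ham":  [0.1, 0.6, 0.3],
}
priors = {"spam": 0.3, "ham": 0.7}

def log_score(x, cls):
    """Return log P(cls) + sum_i log P(x_i | cls), the log of the naive product."""
    total = math.log(priors[cls])
    for xi, p in zip(x, p_feature_given_class[cls]):
        # Likelihood of one binary feature: p if present, 1 - p if absent.
        total += math.log(p if xi == 1 else 1.0 - p)
    return total

x = [1, 0, 1]  # "free" present, "meeting" absent, attachment present
scores = {cls: log_score(x, cls) for cls in priors}
predicted = max(scores, key=scores.get)

print(scores, predicted)
```

Each feature contributes one independent term to the score, which is exactly the simplification the independence assumption buys.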
The three variants
Picking the wrong variant for your feature type tanks accuracy. Numeric → Gaussian. Counts → Multinomial. Binary → Bernoulli.
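As a rough illustration of that matching rule, the sketch below fits each scikit-learn variant on the feature type it expects. The data is synthetic and generated only for this example, so the accuracies say nothing about real problems.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

rng = np.random.default_rng(0)
y = np.array([0] * 50 + [1] * 50)

# Continuous features: class 1 is shifted by +2 in every dimension.
X_numeric = rng.normal(loc=0.0, scale=1.0, size=(100, 3)) + 2.0 * y[:, None]

# Count features: the two classes have different Poisson rates per column.
rates = np.where(y[:, None] == 1, [1.0, 4.0, 4.0], [4.0, 1.0, 1.0])
X_counts = rng.poisson(rates)

# Binary features: presence/absence flags derived from the counts.
X_binary = (X_counts > 2).astype(int)

fits = {
    "gaussian": (GaussianNB(), X_numeric),
    "multinomial": (MultinomialNB(alpha=1.0), X_counts),
    "bernoulli": (BernoulliNB(alpha=1.0), X_binary),
}

train_accuracy = {}
for name, (model, X) in fits.items():
    model.fit(X, y)
    train_accuracy[name] = model.score(X, y)  # accuracy on the training data
    print(f"{name}: training accuracy {train_accuracy[name]:.2f}")
```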
Laplace smoothing
A single zero likelihood, P(word | class) = 0, collapses the whole product to zero. The fix: add 1 (or, more generally, α) to every count.
P(word | class) = (count + α) / (total + α · V)
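Where count is how often the word appears in the class, total is the number of word occurrences in that class, and V is the vocabulary size. α = 1 is Laplace smoothing; 0 < α < 1 is usually called Lidstone smoothing.
Without smoothing, any test word that never appeared with a class during training zeroes out that class's score.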
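Here is a minimal sketch of the smoothing formula in plain Python. The helper `word_likelihoods` and the toy tokens are hypothetical, made up for this example.

```python
from collections import Counter

def word_likelihoods(tokens, vocabulary, alpha=1.0):
    """Return P(word | class) for every vocabulary word, with additive smoothing.

    tokens: all training tokens observed for one class.
    vocabulary: every distinct word seen in training, across all classes.
    """
    counts = Counter(tokens)
    total = len(tokens)   # total word occurrences in this class
    V = len(vocabulary)   # vocabulary size
    return {w: (counts[w] + alpha) / (total + alpha * V) for w in vocabulary}

# Toy data: "refund" never appears in the ham class.
ham_tokens = ["meeting", "lunch", "meeting", "report"]
vocabulary = ["meeting", "lunch", "report", "refund"]

unsmoothed = word_likelihoods(ham_tokens, vocabulary, alpha=0.0)
smoothed = word_likelihoods(ham_tokens, vocabulary, alpha=1.0)

print(unsmoothed["refund"])  # 0.0, which would zero out the whole product
print(smoothed["refund"])    # 0.125, small but non-zero
```

With α = 1 the smoothed likelihoods still sum to 1 over the vocabulary; the unseen word simply borrows a little probability mass from the seen ones.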
When NB wins
- Text classification. Spam, sentiment, topic. Hard to beat with a simple model.
- Very high-dimensional sparse features. Word counts, n-grams.
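- Small training data. NB is data-efficient: it estimates each feature's distribution separately per class, so the number of parameters grows only linearly with the number of features.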
- You need a fast baseline. Train + predict are nearly free.
- You need a probability-style score, not just a class label (calibrate it if the exact values matter).
When NB loses
- Features are heavily correlated and the dependence carries signal (medical, financial).
- Numeric features with non-Gaussian distributions — Gaussian NB assumes normality.
- You need calibrated probabilities out of the box. NB's raw outputs are often overconfident, pushed close to 0 or 1.
- You have lots of data and can afford gradient-boosted trees or neural networks.
Practical workflow
- Pick a variant matching your feature types.
- Apply smoothing (default α = 1).
- Cross-validate. NB's main hyperparameter is α.
- Compare against a baseline: DummyClassifier, logistic regression.
- If you need probabilities for downstream decisions, calibrate with CalibratedClassifierCV.
```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Vectorize raw text, then fit Multinomial NB with Laplace smoothing (alpha = 1).
pipe = make_pipeline(
    TfidfVectorizer(),
    MultinomialNB(alpha=1.0),
)
```
The honest take
Naïve Bayes is the linear regression of classification: a baseline that's almost always too dumb but almost never embarrassing. If a fancier model can't beat NB by a clear margin, the fancier model probably isn't the right tool.
Common pattern: NB hits ~85 % on text in 3 lines of code. The team spends a week to push BERT to 88 %. Sometimes that's worth it. Sometimes it isn't.