Maria Aguilera

The formula every classifier secretly wants:

P(class | features) = P(features | class) · P(class) / P(features)

Where:

P(class) — prior. How common is the class overall?
P(features | class) — likelihood. How likely is this feature combination given the class?
P(features) — evidence. Same for all classes, so often ignored when comparing.

You pick the class with the highest posterior P(class | features).
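
A worked example may make the formula concrete. The sketch below uses made-up numbers (a 20 % spam prior and invented likelihoods for a single observed feature); the function name `posterior` is illustrative only.

```python
def posterior(priors, likelihoods):
    """Return P(class | features) for every class via Bayes' theorem."""
    # Numerator for each class: P(features | class) * P(class)
    joint = {c: likelihoods[c] * priors[c] for c in priors}
    # Evidence P(features): the sum of the numerators over all classes
    evidence = sum(joint.values())
    return {c: joint[c] / evidence for c in joint}


# Assumed numbers, for illustration only
priors = {"spam": 0.2, "ham": 0.8}          # P(class)
likelihoods = {"spam": 0.5, "ham": 0.05}    # P(features | class)

post = posterior(priors, likelihoods)
best = max(post, key=post.get)              # class with the highest posterior
print(post, best)
```

Dividing by the evidence only rescales the scores so they sum to 1; the winning class is the same with or without it.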

The trick that makes it tractable: assume all features are conditionally independent given the class.

P(x1, x2, ..., xn | class) = P(x1 | class) · P(x2 | class) · ... · P(xn | class)

In reality, features are usually correlated — your "naïve" model is wrong about the dependencies. But the ranking of classes is often still correct, which is all classification needs.
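
The factorization can be sketched in a few lines. The per-feature likelihoods below are hypothetical, and summing logs instead of multiplying raw probabilities is an added implementation choice (it avoids underflow when there are many features), not something the formula itself requires.

```python
import math


def naive_log_joint(prior, feature_likelihoods):
    """log P(class) + sum_i log P(x_i | class): the naive factorization in log space."""
    return math.log(prior) + sum(math.log(p) for p in feature_likelihoods)


# Hypothetical P(x_i | class) values for three observed features
scores = {
    "spam": naive_log_joint(0.2, [0.5, 0.4, 0.1]),
    "ham": naive_log_joint(0.8, [0.05, 0.1, 0.3]),
}

# Only the ranking matters for classification
predicted = max(scores, key=scores.get)
print(scores, predicted)
```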

Variant	Likelihood model	Use for
Gaussian NB	Each feature is a normal distribution per class.	Continuous numeric features.
Multinomial NB	Features are counts. Probabilities from frequency.	Text classification (TF or word counts).
Bernoulli NB	Features are binary (present / absent).	Text with binary presence, fraud flags.

Picking the wrong variant for your feature type tanks accuracy. Numeric → Gaussian. Counts → Multinomial. Binary → Bernoulli.
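
To see why the variant matters, here is a minimal sketch of the per-class log-likelihood each variant computes. The function names and parameter values are illustrative assumptions; a real library would also estimate the means, variances and probabilities from training data.

```python
import math


def gaussian_log_lik(x, mean, var):
    """Gaussian NB: log density of one continuous value under N(mean, var)."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def multinomial_log_lik(counts, word_probs):
    """Multinomial NB: each word's log-probability is weighted by its count.
    The multinomial coefficient is omitted; it is identical for every class."""
    return sum(n * math.log(p) for n, p in zip(counts, word_probs))


def bernoulli_log_lik(present, probs):
    """Bernoulli NB: absent features also contribute, through log(1 - p)."""
    return sum(math.log(p) if b else math.log(1 - p) for b, p in zip(present, probs))


g = gaussian_log_lik(1.0, mean=1.0, var=1.0)
m = multinomial_log_lik([2, 0, 1], [0.5, 0.3, 0.2])
b = bernoulli_log_lik([1, 0, 1], [0.5, 0.3, 0.2])
print(g, m, b)
```

Note the difference on the middle feature: Multinomial ignores a word with count 0, while Bernoulli treats its absence as evidence.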

A single zero likelihood, P(word | class) = 0, collapses the whole product to zero. The fix: add 1 (or α) to every count.

P(word | class) = (count + α) / (total + α · V)

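Where count is how often the word occurs in the class, total is the number of word occurrences in the class, and V is the vocabulary size. Default α = 1 (Laplace); 0 < α < 1 is usually called Lidstone smoothing.

Without smoothing, any word that never appeared with a class in training zeroes out that class's posterior.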
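
The smoothing formula can be checked by hand. The counts below are invented for one class, and `smoothed_prob` is an illustrative helper name.

```python
def smoothed_prob(count, total, vocab_size, alpha=1.0):
    """P(word | class) = (count + alpha) / (total + alpha * V)."""
    return (count + alpha) / (total + alpha * vocab_size)


# Hypothetical word counts for one class; "refund" never appeared in training
counts = {"free": 3, "offer": 1, "refund": 0}
total = sum(counts.values())   # total word occurrences in the class
V = len(counts)                # vocabulary size

unsmoothed = {w: c / total for w, c in counts.items()}
smoothed = {w: smoothed_prob(c, total, V) for w, c in counts.items()}
print(unsmoothed["refund"], smoothed["refund"])
```

With α = 1 the unseen word gets probability 1/7 instead of 0, and the smoothed probabilities still sum to 1.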

Text classification. Spam, sentiment, topic. Among simple models, NB is hard to beat.
Very high-dimensional sparse features. Word counts, n-grams.
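Small training data. NB is data-efficient: it estimates each feature's distribution separately per class, so the number of parameters grows only linearly with the number of features.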
You need a fast baseline. Train + predict are nearly free.
You need probability outputs, not just a class label.

Features are heavily correlated and the dependence carries signal (medical, financial).
Numeric features with non-Gaussian distributions — Gaussian NB assumes normality.
You need calibrated probabilities. NB's outputs are often overconfident (pushed close to 0 or 1) and therefore miscalibrated.
You have lots of data and can afford gradient-boosted trees or neural networks.
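
The correlation and calibration points are related, and a toy calculation shows how. The sketch below assumes the most extreme form of correlation, the same feature duplicated several times, with made-up likelihoods; NB counts each copy as fresh evidence.

```python
def posterior_spam(prior_spam, lik_spam, lik_ham, copies):
    """Posterior for 'spam' when one feature is counted `copies` times."""
    spam = prior_spam * lik_spam ** copies
    ham = (1 - prior_spam) * lik_ham ** copies
    return spam / (spam + ham)


# Assumed values: the feature is only mildly more likely under spam
once = posterior_spam(0.5, 0.6, 0.4, copies=1)
thrice = posterior_spam(0.5, 0.6, 0.4, copies=3)
print(once, thrice)
```

The information is identical in both cases, yet the posterior moves from 0.6 to about 0.77. The predicted class does not change, but the probability drifts toward the extremes.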

Pick a variant matching your feature types.
Apply smoothing (default α = 1).
Cross-validate. NB's main hyperparameter is α.
Compare against a baseline — DummyClassifier, logistic regression.
If you need probabilities for downstream decisions, calibrate with CalibratedClassifierCV.

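from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Raw text -> TF-IDF weights -> Multinomial NB with Laplace smoothing
pipe = make_pipeline(
    TfidfVectorizer(),         # turns documents into sparse term weights
    MultinomialNB(alpha=1.0),  # alpha is the smoothing strength to tune
)
# Train and predict with pipe.fit(texts, labels) and pipe.predict(new_texts)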
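
As a rough sketch of the cross-validation and baseline steps, the block below tunes α with a grid search and compares against a majority-class dummy. The twelve-document corpus and the α grid are invented for illustration; real scores on such a tiny set mean nothing.

```python
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus, made up for this example
texts = [
    "win a free prize now", "free offer click now", "claim your free cash",
    "cheap pills free shipping", "win cash prize today", "free free free offer",
    "meeting moved to monday", "please review the report", "lunch at noon tomorrow",
    "notes from the meeting", "report due on friday", "see you at lunch",
]
labels = ["spam"] * 6 + ["ham"] * 6

pipe = make_pipeline(TfidfVectorizer(), MultinomialNB())

# Cross-validate over alpha; make_pipeline names the NB step "multinomialnb"
search = GridSearchCV(pipe, {"multinomialnb__alpha": [0.1, 0.5, 1.0]}, cv=3)
search.fit(texts, labels)

# Baseline: always predict the most frequent class
dummy = make_pipeline(TfidfVectorizer(), DummyClassifier(strategy="most_frequent"))
baseline = cross_val_score(dummy, texts, labels, cv=3)

pred = search.predict(["free prize offer"])
print(search.best_params_, search.best_score_, baseline.mean(), pred)
```

If calibrated probabilities are needed afterwards, the tuned estimator could be wrapped in CalibratedClassifierCV, which needs more data than this toy set provides.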

Naïve Bayes is the linear regression of classification: a baseline that's almost always too dumb but almost never embarrassing. If a fancier model can't beat NB by a clear margin, the fancier model probably isn't the right tool.

Common pattern: NB hits ~85 % on text in 3 lines of code. The team spends a week to push BERT to 88 %. Sometimes that's worth it. Sometimes it isn't.

Part 6 · Naïve Bayes — Cheat Sheet

Bayes' theorem

Why 'naïve'?

The three variants

Laplace smoothing

When NB wins

When NB loses

Practical workflow

The honest take