Part 6 · Naïve Bayes — Cheat Sheet

The probabilistic baseline. Bayes' theorem, the conditional-independence assumption, Gaussian / Multinomial / Bernoulli variants, and when to actually use it.

1

Bayes' theorem

The formula every classifier secretly wants:

P(class | features) = P(features | class) · P(class) / P(features)

Where:

  • P(class) — prior. How common is the class overall?
  • P(features | class) — likelihood. How likely is this feature combination given the class?
  • P(features) — evidence. Same for all classes, so often ignored when comparing.

You pick the class with the highest posterior P(class | features).
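
A quick worked example may help. The priors and likelihoods below are made-up numbers for a hypothetical spam filter, not values from any dataset:

```python
# Hypothetical numbers for a two-class spam filter (assumed for illustration).
priors = {"spam": 0.3, "ham": 0.7}          # P(class)
likelihoods = {"spam": 0.2, "ham": 0.01}    # P(features | class)

# Unnormalized scores: P(features | class) * P(class)
scores = {c: likelihoods[c] * priors[c] for c in priors}

# The evidence P(features) is the sum over classes; it only rescales the scores.
evidence = sum(scores.values())
posteriors = {c: s / evidence for c, s in scores.items()}

best = max(posteriors, key=posteriors.get)
print(best, round(posteriors[best], 3))  # spam 0.896
```

Dividing by the evidence changes the numbers but not the winner, which is why it can be skipped when only the predicted class matters.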

2

Why 'naïve'?

The trick that makes it tractable: assume all features are conditionally independent given the class.

P(x1, x2, ..., xn | class) = P(x1 | class) · P(x2 | class) · ... · P(xn | class)

In reality, features are usually correlated — your "naïve" model is wrong about the dependencies. But the ranking of classes is often still correct, which is all classification needs.
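
A minimal sketch of the factorization, using hypothetical per-feature likelihoods. It sums logs instead of multiplying probabilities, a common way to avoid underflow when there are many features:

```python
import math

# Hypothetical values of P(x_i | class) for three observed features (assumed).
per_feature = {
    "spam": [0.8, 0.6, 0.3],
    "ham": [0.1, 0.4, 0.5],
}
priors = {"spam": 0.4, "ham": 0.6}

def log_score(cls):
    # log P(class) + sum_i log P(x_i | class): the naive product, in log space
    return math.log(priors[cls]) + sum(math.log(p) for p in per_feature[cls])

scores = {cls: log_score(cls) for cls in priors}
prediction = max(scores, key=scores.get)
print(prediction)  # spam
```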

3

The three variants

| Variant | Likelihood model | Use for |
| --- | --- | --- |
| Gaussian NB | Each feature is a normal distribution per class. | Continuous numeric features. |
| Multinomial NB | Features are counts. Probabilities from frequency. | Text classification (TF or word counts). |
| Bernoulli NB | Features are binary (present / absent). | Text with binary presence, fraud flags. |

Picking the wrong variant for your feature type tanks accuracy. Numeric → Gaussian. Counts → Multinomial. Binary → Bernoulli.
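
A small scikit-learn sketch of the mapping. The arrays are toy data invented for illustration, so only the pairing of feature type and estimator is meant to carry over:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])

# Continuous measurements -> Gaussian NB
X_numeric = np.array([[1.0, 2.1], [1.2, 1.9], [5.0, 6.2], [5.1, 5.8]])
gnb = GaussianNB().fit(X_numeric, y)

# Non-negative counts (e.g. word counts) -> Multinomial NB
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])
mnb = MultinomialNB(alpha=1.0).fit(X_counts, y)

# Binary presence / absence -> Bernoulli NB
X_binary = (X_counts > 0).astype(int)
bnb = BernoulliNB(alpha=1.0).fit(X_binary, y)

pred_gaussian = gnb.predict([[1.1, 2.0]])
pred_multinomial = mnb.predict([[0, 2, 1]])
pred_bernoulli = bnb.predict([[1, 0, 0]])
print(pred_gaussian, pred_multinomial, pred_bernoulli)
```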

4

Laplace smoothing

A single zero likelihood, P(word | class) = 0, collapses the whole product to zero. The fix: add α (commonly 1) to every count.

P(word | class) = (count + α) / (total + α · V)

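Where count is how often the word occurs in that class's training documents, total is the number of word occurrences in that class, and V is the vocabulary size. The usual default is α = 1 (Laplace); α < 1 is often called Lidstone smoothing.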

Without smoothing, a word that never appeared with a class in training drives that class's score to zero, no matter what the other words say.
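
The formula can be sketched in a few lines. The function name `smoothed_likelihoods`, the vocabulary, and the tokens are all hypothetical, chosen only to show a word with a zero count:

```python
from collections import Counter

def smoothed_likelihoods(class_tokens, vocabulary, alpha=1.0):
    """Return P(word | class) for every vocabulary word, with additive smoothing."""
    counts = Counter(class_tokens)
    total = sum(counts[w] for w in vocabulary)
    denom = total + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / denom for w in vocabulary}

vocabulary = ["free", "meeting", "prize"]   # V = 3 (toy vocabulary, assumed)
spam_tokens = ["free", "prize", "free"]     # "meeting" never appears in spam

probs = smoothed_likelihoods(spam_tokens, vocabulary, alpha=1.0)
unsmoothed = smoothed_likelihoods(spam_tokens, vocabulary, alpha=0.0)
print(probs["meeting"], unsmoothed["meeting"])
```

With α = 1 the unseen word gets 1/6 instead of 0, and the three probabilities still sum to 1.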

5

When NB wins

  • Text classification. Spam, sentiment, topic. Hard for other simple models to beat.
  • Very high-dimensional sparse features. Word counts, n-grams.
  • Small training data. NB is data-efficient: it estimates each feature's distribution separately per class, so the parameter count grows only linearly with the number of features.
  • You need a fast baseline. Train + predict are nearly free.
  • You need probability outputs, not just a class label.

6

When NB loses

  • Features are heavily correlated and the dependence carries signal (medical, financial).
  • Numeric features with non-Gaussian distributions — Gaussian NB assumes normality.
  • You need calibrated probabilities — NB's outputs are confident but often miscalibrated (close to 0 or 1).
  • You have lots of data and can afford gradient-boosted trees or neural networks.
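
One way to see where the miscalibration comes from: if the same evidence shows up as several perfectly correlated features, NB multiplies it in once per copy. A toy sketch with made-up likelihoods (the function `posterior_positive` is hypothetical):

```python
def posterior_positive(likelihood_pos, likelihood_neg, copies, prior_pos=0.5):
    """Posterior for the positive class when one feature is counted `copies` times."""
    score_pos = prior_pos * likelihood_pos ** copies
    score_neg = (1 - prior_pos) * likelihood_neg ** copies
    return score_pos / (score_pos + score_neg)

# Assumed likelihoods: the feature is 4x more likely under the positive class.
once = posterior_positive(0.8, 0.2, copies=1)
thrice = posterior_positive(0.8, 0.2, copies=3)
print(round(once, 3), round(thrice, 3))  # 0.8 0.985
```

The evidence is identical in both cases, but counting it three times pushes the posterior from 0.8 toward 1.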

7

Practical workflow

  1. Pick a variant matching your feature types.
  2. Apply smoothing (default α = 1).
  3. Cross-validate. NB's main hyperparameter is α.
  4. Compare against baselines: DummyClassifier, logistic regression.
  5. If you need probabilities for downstream decisions, calibrate with CalibratedClassifierCV.

from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(
    TfidfVectorizer(),
    MultinomialNB(alpha=1.0),
)

8

The honest take

Naïve Bayes is the linear regression of classification: a baseline that's almost always too dumb but almost never embarrassing. If a fancier model can't beat NB by a clear margin, the fancier model probably isn't the right tool.

Common pattern: NB hits ~85 % on text in 3 lines of code. The team spends a week to push BERT to 88 %. Sometimes that's worth it. Sometimes it isn't.