
Table of Contents
- 1. What text classification actually is
- 2. Annotated datasets are the fuel
- 3. How do we even feed text into a classifier?
- 4. Logistic regression: the baseline you always build first
- 5. Methodologies: the four classical generations
- 6. Why classical models can't "speak the language"
- 7. The real-world decision tree
- 8. Feature engineering: where most of the practical wins live
- 9. What this connects to
Last update: June 2026. All opinions are my own.
NLP from Scratch · Part 6/10
"All models are wrong, but some are useful. But Naïve Bayes is too wrong for text." — The line that opens the methodology arc.
What text classification actually is
Strip away the hype and text classification is one thing: given a document, assign it to one class from a fixed list. News article → sports / politics / technology. Email → spam / ham. Movie review → positive / negative. Tweet → toxic / safe. The shape of the problem is always the same — document in, class label out.

The simplest version is binary — only two classes. Spam classification is the canonical example, and it is the one this post leans on throughout because everyone has felt the pain.

The problems classical text classification has solved in production, mostly with the same handful of techniques:
- Personalization (Quora digests, Spotify "Discover Weekly" tagging)
- Authorship attribution (who wrote the Federalist Papers, was this essay written by a man or a woman)
- Sentiment analysis (movie reviews, product reviews, social media)
- Topic/genre assignment (news routing)
- Spam detection
- Age/gender identification from text
- Language identification
- Sarcasm detection
- Fake news detection
It's a long list. The interesting thing is that all of these problems were largely cracked before deep learning, with the same classical pipeline we are about to build.
For a tour of what each of these actually looks like in production — the shape of the problem, the technique that usually wins, and the headache that bites you when you ship — see Text Classification in the Wild.
Annotated datasets are the fuel
Supervised classification needs labels. This is the one thing classical NLP cannot fake: you need a training set of documents, each paired with its correct class. Without it, no classifier learns anything.

From the notes: "Quality and quantity of labels drive model performance." If you remember nothing else about classical text classification, remember this. A small clean dataset usually beats a big noisy one, and a big clean dataset beats anything.
How do we even feed text into a classifier?
A logistic regression takes a vector of numbers and outputs a probability. But a document is a string of words. So before any classifier touches the data, we need a way to turn each document into a vector. There are two classical answers — both of which you already met in Part 2.
The first is the document-term matrix (DTM). Rows are documents, columns are unique terms in the vocabulary, and each cell is a count — how many times this term appears in this document. Each document becomes a row vector with as many dimensions as your vocabulary has words.
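To make the rows-and-columns picture concrete, here is a minimal sketch that builds a document-term matrix by hand. The toy corpus and the whitespace tokenizer are assumptions made for illustration; a real pipeline would normally use a library vectorizer and a proper tokenizer.

```python
from collections import Counter

# Toy corpus, invented for illustration.
docs = [
    "free prize claim your free prize now",
    "meeting agenda for monday",
    "claim your prize now",
]

# Whitespace tokenization is a simplifying assumption.
tokenized = [doc.split() for doc in docs]

# Vocabulary: every unique term, sorted so the column order is stable.
vocab = sorted({term for tokens in tokenized for term in tokens})

# One row per document, one column per vocabulary term, each cell a raw count.
dtm = []
for tokens in tokenized:
    counts = Counter(tokens)
    dtm.append([counts[term] for term in vocab])

print(vocab)
for row in dtm:
    print(row)
```

Every row has the same length as the vocabulary, and most cells are zero, which is where the sparsity of these vectors comes from.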

Two problems with raw counts:
- Common words dominate. The word "the" will appear in every document and crush the signal of rarer, more meaningful words.
- Long documents look more important. A 2,000-word article looks "louder" than a tweet just because it has more of every word.
TF-IDF is the fix. Each cell is no longer a raw count — it is multiplied by the inverse document frequency of the term. Rare terms get amplified; common terms get downweighted.
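Written out, one common form of the weighting looks like this. Treat the exact variant as an assumption: it is chosen to match the +1 smoothing described next, and tf(t, d) stands for the raw count of term t in document d.

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log \frac{N}{1 + \mathrm{df}(t)}
```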
Where N is the total number of documents and df(t) is the number of documents containing term t. The +1 is a smoothing trick so we never divide by zero for unseen terms.

So now we have vectors — sparse, high-dimensional vectors, but vectors a classifier can chew on.
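A small sketch of the same weighting in code may help. It assumes the smoothed form idf(t) = log(N / (1 + df(t))), a toy corpus, and whitespace tokenization; libraries typically use slightly different smoothing, so the absolute numbers will not match theirs.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration.
docs = [
    "the prize is free",
    "the meeting is on monday",
    "claim the free prize",
]
tokenized = [doc.split() for doc in docs]
vocab = sorted({t for tokens in tokenized for t in tokens})
n_docs = len(docs)

# df(t): number of documents that contain term t at least once.
df = Counter(t for tokens in tokenized for t in set(tokens))

def idf(term):
    # Assumed form: log(N / (1 + df(t))). The +1 keeps the denominator
    # non-zero even for a term that appears in no document.
    return math.log(n_docs / (1 + df[term]))

def tfidf_row(tokens):
    # Raw count of each vocabulary term, scaled by that term's idf.
    counts = Counter(tokens)
    return [counts[t] * idf(t) for t in vocab]

matrix = [tfidf_row(tokens) for tokens in tokenized]

for term in ("the", "prize", "meeting"):
    print(term, round(idf(term), 3))
```

With this particular variant a term that occurs in every document ends up with a slightly negative weight, which is one reason other smoothing schemes exist. The ordering is what matters here: rarer terms get larger weights.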
Logistic regression: the baseline you always build first
Before we get fancy, install this rule in your head:
Always implement a logistic regression classifier as a baseline, even when you plan to ship something more sophisticated.
Repeated three times in the source notes. It is the single most important practitioner rule in this whole post.

Why this rule matters in practice: in the notes, logistic regression hits ~80% accuracy on the running text classification task; a fine-tuned transformer hits ~84%. The transformer wins, but the gap is much smaller than people expect. And the LR baseline tells you whether your data is good, your labels are clean, and your feature engineering is sane — long before you spend three weeks training a BERT.
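As a rough illustration of how little a baseline needs, here is a minimal sketch of binary logistic regression on bag-of-words counts, trained with plain stochastic gradient descent. The toy data, the learning rate, the epoch count, and the L2 strength are all assumptions picked for the example; in practice you would reach for a library implementation instead.

```python
import math
from collections import Counter

# Toy labelled data, invented for illustration (1 = spam, 0 = ham).
train = [
    ("win a free prize now", 1),
    ("claim your free prize", 1),
    ("free money claim now", 1),
    ("meeting agenda for monday", 0),
    ("lunch on monday", 0),
    ("notes from the meeting", 0),
]

vocab = sorted({t for text, _ in train for t in text.split()})
index = {t: i for i, t in enumerate(vocab)}

def featurize(text):
    # Bag-of-words counts; words outside the vocabulary are ignored.
    x = [0.0] * len(vocab)
    for term, count in Counter(text.split()).items():
        if term in index:
            x[index[term]] = float(count)
    return x

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    # Probability of the positive class (spam).
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def fit(data, epochs=200, lr=0.1, l2=0.01):
    w, b = [0.0] * len(vocab), 0.0
    examples = [(featurize(text), y) for text, y in data]
    for _ in range(epochs):
        for x, y in examples:
            err = predict_proba(w, b, x) - y  # gradient of log loss w.r.t. the score
            # Gradient step with a small L2 penalty on the weights.
            w = [wi - lr * (err * xi + l2 * wi) for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

w, b = fit(train)
print(predict_proba(w, b, featurize("free prize")))
print(predict_proba(w, b, featurize("monday meeting")))
```

The learned weights are directly readable per word, which is part of why this baseline is so useful for sanity-checking data and labels.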
A second rule worth installing now, also from the notes, but bolded and underlined:
❌ Do not use CNNs for text classification. CNNs had historical success with images, and people tried to copy that over to text. They can capture some local structure but they do not capture sequential information. Use RNNs or transformers instead.
Methodologies: the four classical generations
There are four generations of classifiers worth knowing, and they line up with how the field actually evolved. We will walk each one — including why each one wasn't enough on its own.
Generation 1 — Hand-coded rules
The first answer to "how do we classify text?" is: don't bother with machine learning. Write rules.

- If the email contains "Nigerian prince" and "wire transfer" → spam.
- If the review contains "don't buy" and the rating is missing → likely fake.
The honest truth from the notes: rules are still widely used in industry. Spam filters, fake-news detectors, content moderation. They are precision-driven, codify domain knowledge, and are carefully refined by experts over years.
The catch: they are very expensive to build and maintain. Every edge case is a new rule. Every new spam tactic needs a new rule. The system grows in complexity until nobody understands it. So we want to replace rules with something that learns from data.
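The spam rule above can be sketched in a few lines. The function names and the rule table are hypothetical, made up for this example; the point is that every new tactic means another hand-written entry.

```python
def mentions_prince_and_wire(text):
    # Hand-coded rule: both trigger phrases must be present.
    return "nigerian prince" in text and "wire transfer" in text

# Hypothetical rule table: (rule name, predicate). It only ever grows.
SPAM_RULES = [
    ("prince_and_wire", mentions_prince_and_wire),
]

def classify_email(email_text):
    text = email_text.lower()  # make the phrase match case-insensitive
    for _name, rule in SPAM_RULES:
        if rule(text):
            return "spam"
    return "ham"

print(classify_email("Dear friend, a Nigerian prince needs a wire transfer"))
print(classify_email("Lunch on Monday?"))
```

Nothing here is learned from data, so precision is high on the cases the rule was written for and the system is blind to everything else.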
Generation 2 — Naïve Bayes
The first probabilistic classifier most people meet, and the right one to start with because it gives you the mental model for the rest.

Naïve Bayes asks one question: given the words in this document, what is the most probable class?
Bayes' theorem gives us four pieces, each with a name:
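For a class c and a document d (the symbol names are an assumption made here), the standard statement of Bayes' theorem is:

```latex
\underbrace{P(c \mid d)}_{\text{posterior}}
  = \frac{\overbrace{P(d \mid c)}^{\text{likelihood}} \;\; \overbrace{P(c)}^{\text{prior}}}
         {\underbrace{P(d)}_{\text{marginal}}}
```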

The four named pieces are worth memorising — they will come back in every probabilistic NLP model after this:
- Posterior — how probable is our hypothesis given the observed evidence? This is what we want.
- Likelihood — how probable is the evidence given that our hypothesis is true?
- Prior — how probable was our hypothesis before we saw any evidence?
- Marginal — how probable is the new evidence under all possible hypotheses? In practice we ignore this because it's the same for every class.
For classification, we pick the class with the highest posterior. Naïve Bayes makes one cheating assumption to make the likelihood tractable: all words are independent given the class. The word prize and the word now are treated as if they have nothing to do with each other once you know the email is spam.

That assumption is, in the notes' phrasing, "completely wrong." Words in real language are deeply correlated — prize and free and claim all travel together. Naïve Bayes pretends they don't.
It still works surprisingly well — that is the famous result — but the wrongness of the independence assumption is what motivates the next generation.
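Here is a minimal sketch of a multinomial Naïve Bayes classifier, to show how little machinery the independence assumption leaves. The toy data is invented, and add-one (Laplace) smoothing is an assumption added so that an unseen word-class pair does not produce log(0). The marginal is dropped because it is the same for every class.

```python
import math
from collections import Counter, defaultdict

# Toy labelled data, invented for illustration.
train = [
    ("free prize claim now", "spam"),
    ("claim your free prize", "spam"),
    ("meeting on monday", "ham"),
    ("monday lunch meeting notes", "ham"),
]

class_docs = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
for text, label in train:
    word_counts[label].update(text.split())
vocab = {t for counts in word_counts.values() for t in counts}

def log_posterior(text, label):
    # Unnormalized log posterior: log prior + sum of per-word log likelihoods.
    # Summing per-word terms is exactly the independence assumption.
    score = math.log(class_docs[label] / len(train))
    total = sum(word_counts[label].values())
    for term in text.split():
        if term not in vocab:
            continue  # ignore words never seen in training
        # Add-one smoothing (an assumption) avoids log(0).
        score += math.log((word_counts[label][term] + 1) / (total + len(vocab)))
    return score

def classify(text):
    # Pick the class with the highest posterior.
    return max(class_docs, key=lambda label: log_posterior(text, label))

print(classify("free prize now"))
print(classify("monday meeting"))
```

Working in log space keeps the product of many small probabilities from underflowing on longer documents.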
Generation 3 — Maximum Entropy classifiers (a.k.a. logistic regression)
MaxEnt classifiers were the answer to "Naïve Bayes is too wrong for text". They drop the independence assumption.

The intuition is one of the most beautiful in all of NLP and it deserves to land:
Assume nothing about your probability distribution other than what you have observed.
That is the maximum-entropy principle. You start with the most uniform distribution possible (which makes no assumptions), and as you observe data, you add constraints, but only the ones the data forces. Among all the distributions that satisfy your constraints, you pick the one with the largest entropy. Largest entropy = most uniform = fewest assumptions.
The notes use a concrete example with four classes: economics, sports, politics, art. Before you see any data, the maximum-entropy distribution is uniform: each class gets probability 0.25.
Now you inspect documents. You notice that when the word ball appears, the document is about sports 70% of the time. That is a constraint. You update the conditional distribution: given ball, sports gets 0.7 and the remaining 0.3 is spread evenly over the other three classes.
You inspect more documents. You notice Bush correlates with politics (80%), game with sports (60%), stock with economics (50%). Each observation is another constraint on the conditional distribution for that word.
The key move: among all probability distributions that satisfy all of these constraints, MaxEnt picks the one with the largest entropy — the most uniform one — because that distribution makes the fewest extra assumptions.

There is an infinite number of distributions that satisfy any given set of constraints. The maximum-entropy distribution is the one that says "yes to these constraints, but no extra structure beyond that." That is the cleanest possible probabilistic model.
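A quick numeric check makes the "fewest extra assumptions" idea tangible. The sketch below takes the single constraint P(sports | ball) = 0.7 and compares three distributions that all satisfy it. The candidate numbers for the non-sports classes are made up for the comparison, not taken from the notes.

```python
import math

CLASSES = ["economics", "sports", "politics", "art"]

def entropy(dist):
    # Shannon entropy in bits; terms with p = 0 contribute nothing.
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# No observations yet: the uniform distribution has the largest entropy.
uniform = {c: 0.25 for c in CLASSES}

# One constraint: P(sports | "ball") = 0.7. Three ways to distribute the
# remaining 0.3, all of which satisfy the constraint.
candidates = {
    "even split": {"economics": 0.1, "sports": 0.7, "politics": 0.1, "art": 0.1},
    "lopsided": {"economics": 0.25, "sports": 0.7, "politics": 0.04, "art": 0.01},
    "all on one": {"economics": 0.3, "sports": 0.7, "politics": 0.0, "art": 0.0},
}

for name, dist in candidates.items():
    print(name, round(entropy(dist), 3))

best = max(candidates, key=lambda name: entropy(candidates[name]))
print("largest entropy:", best)
```

The even split wins because it adds no preference among economics, politics, and art that the data did not ask for.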
Here's the magic from Berger et al. (1996): solving the maximum-entropy problem turns out to be exactly equivalent to fitting a multinomial logistic regression whose weights maximize the likelihood of the training data. This is why:

- sklearn.linear_model.LogisticRegression is the MaxEnt classifier. There is no separate MaxEnt class. You search "sklearn maxent" and the documentation hands you LogisticRegression.
- Logistic regression works absurdly well for text classification. It is not a coincidence. It is solving the maximum-entropy problem.
That is the single most important fact in this whole post, and the reason the "always build an LR baseline" rule is so strong. Logistic regression is not a hack — it is the principled probabilistic model that makes the fewest assumptions about your data.
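For reference, the model being fitted can be written as a softmax over per-class scores. The notation is an assumption: f(d) is the feature vector of document d (for example its TF-IDF row) and w_c is the weight vector for class c.

```latex
P(c \mid d) = \frac{\exp\left( w_c \cdot f(d) \right)}{\sum_{c'} \exp\left( w_{c'} \cdot f(d) \right)}
```

Maximizing the training likelihood over the weights w_c is the fitting problem that the equivalence refers to.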
Generation 4 — Support Vector Machines
Before deep learning won, SVMs were state of the art for text classification. They have three properties that make them especially well-suited:

- Well-suited for sparse data — and TF-IDF features are extremely sparse.
- Well-suited for high-dimensional data — and vocabularies are huge.
- The large-margin constraint gives them a useful trade-off between robustness and accuracy.
The trade-off with SVMs: they are harder to train than logistic regression (especially with non-linear kernels), but on TF-IDF features and a reasonable dataset, they used to beat everything. They still hold their own in plenty of practical settings — especially when you have small datasets.
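The large-margin idea can be stated compactly for the linear case. This is the standard soft-margin objective, with labels y_i in {-1, +1} and a trade-off constant C; the symbols are an assumption, since the notes do not give a formula.

```latex
\min_{w,\, b} \;\; \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{n} \max\left( 0,\; 1 - y_i \left( w \cdot x_i + b \right) \right)
```

The first term prefers a wide margin, and the second penalizes training examples that fall inside the margin or on the wrong side of it.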
The reason classical NLP eventually moved past SVMs was not that SVMs were broken — it was that they hit a ceiling that no amount of feature engineering could break through. And that ceiling has a name.
Why classical models can't "speak the language"
Logistic regression on TF-IDF features will get you a long way. But there is a hard ceiling, and once you understand it, everything in Part 7 clicks.

The classical pipeline treats text as a bag of words. The order is gone. The relationships between words are gone. "not boring" and "boring not" look identical to the model. The phrase "this is the greatest screwball comedy ever filmed" and the phrase "unbelievably disappointing" both end up as TF-IDF vectors that — depending on vocabulary — might not even be that different.
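The loss of word order is easy to verify directly. A minimal sketch, assuming whitespace tokenization and raw counts as the representation:

```python
from collections import Counter

def bag_of_words(text):
    # Lowercase, split on whitespace, keep only the counts.
    return Counter(text.lower().split())

a = bag_of_words("not boring")
b = bag_of_words("boring not")
same_bag = a == b

c = bag_of_words("the dog bit the man")
d = bag_of_words("the man bit the dog")
same_bag_longer = c == d

print(same_bag, same_bag_longer)
```

Both pairs collapse to identical counts, so any classifier built on top of them receives exactly the same input for each pair.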
And it gets worse, because natural language is layered in ways the bag of words cannot see: word order, polysemy, and long-range dependencies all vanish once a document is reduced to counts.

The instinct here is: "ok, more data will fix this." It will not.

So just throwing more data at a logistic-regression baseline runs into two walls simultaneously: the bag-of-words representation can't capture meaning, and you can't label enough new data to drown out the noise.
The real-world decision tree
The practitioner answer from the notes is a small decision tree based on how much data you have. Memorize this one — it is the closest thing classical NLP has to "what should I actually try first?"

- No training data? Manually written rules. Need careful crafting, very time-consuming, but works for narrow domains.
- Very little data? Naïve Bayes. Or label more data — DIY, gamification, Mechanical Turk. Or semi-supervised annotation / bootstrapping (label a tiny seed set, train, use the model's confident predictions as new labels, retrain).
- A reasonable amount of data? SVM. Regularized logistic regression / MaxEnt. User-interpretable decision trees (useful because users like to hack rules into them, and management likes quick fixes).
- Big data? Try deep learning. This is where Part 7 picks up.

There is also the classic Brill & Banko spelling-correction plot — accuracy as a function of data size, where with enough data the choice of classifier almost stops mattering. The implication: feature engineering and data labelling often beat classifier choice. Spend your time on the input, not the algorithm.
Feature engineering: where most of the practical wins live
In real text classification systems, the feature engineering matters more than the classifier. From the notes:

- Domain-specific features and weights. Very important in real performance. Generic TF-IDF on a domain-specific corpus leaves money on the table.

- Sometimes need to collapse terms. Part numbers, chemical formulas, IP addresses: collapse them all into a single token like <PART_NUMBER> so the classifier learns "there is a part number here" rather than memorizing every part number.

- Upweighting: count a word as if it occurred twice when it appears in certain places: title words (Cohen & Singer 1996), the first sentence of each paragraph (Murata 1999), sentences that contain title words (Ko et al. 2002).

- N-grams: bigrams and trigrams capture some lexical information (negation: "not good", object-action pairs) and some semantic information (named entities, compound names).
There is also NLP-specific feature engineering on top of the bag of words:
- POS tags for first-level word-sense disambiguation. Book as a noun → book sales. Book as a verb → travel agency. Same word, completely different signal.
- Dependency parsing for word relationships that n-grams cannot capture. Two words may have a strong grammatical relationship across many filler words; n-grams only see adjacency.
The pattern: classical text classification stops being about "which classifier" and starts being about "which features".
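Three of the ideas from this section (collapsing terms, upweighting title words, and n-grams) fit in one small feature function. This is a sketch under assumptions: the part-number pattern, the placeholder tokens, and the function names are hypothetical, and the title weight of 2 follows the "as if it occurred twice" description.

```python
import re
from collections import Counter

# Hypothetical pattern for part numbers such as "ab-12345" (after lowercasing).
PART_NUMBER = re.compile(r"\b[a-z]{2}-\d{4,}\b")
# Loose IPv4 pattern; it does not validate that each octet is <= 255.
IP_ADDRESS = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")

def tokenize(text):
    # Lowercase first, then collapse whole families of terms into one token.
    text = text.lower()
    text = PART_NUMBER.sub("<PART_NUMBER>", text)
    text = IP_ADDRESS.sub("<IP_ADDRESS>", text)
    return text.split()

def ngrams(tokens, n):
    # Adjacent n-token windows joined with "_", e.g. "not_good".
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def features(title, body, title_weight=2):
    body_tokens = tokenize(body)
    counts = Counter(body_tokens)          # unigrams from the body
    counts.update(ngrams(body_tokens, 2))  # bigrams from the body
    for token in tokenize(title):
        counts[token] += title_weight      # each title word counts double
    return counts

example = features("Faulty pump", "Pump AB-12345 is not good")
print(example)
```

The resulting counts can be fed to any of the classifiers above in place of plain unigram counts.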
What this connects to
You now have the full classical text-classification stack: representation (DTM, TF-IDF), classifier (rules → NB → MaxEnt/LR → SVM), feature engineering, and the real-world decision tree that picks which to use. With this you can build a working classifier for almost any narrow-domain problem — spam, sentiment on product reviews, topic routing, content moderation.
The wall you hit is the bag-of-words ceiling. Counts and frequencies cannot capture meaning, word order, polysemy, or long-range dependencies. To break through, you need a representation that does — and that representation comes from language modelling, the deep-learning approach to text classification, and the transfer-learning workflow that ties it all together.
That is Part 7: the deep-learning side of the same problem.
