Part 2 · From Text to Vectors — Cheat Sheet

Every preprocessing decision, every formula, every trap from bag of words to lemmatization — condensed for fast revision.

1

Core terminology

| Term | Meaning |
|---|---|
| Corpus | The dataset: a collection of documents |
| Document | One row, one classification target (tweet, article, paragraph) |
| Words | Raw tokens as written |
| Terms | Words after preprocessing: the actual feature columns |
| Vocabulary | Set of all unique terms, size V |

Rule: the document is whatever you're classifying — sentence, paragraph, or chapter, depending on the task.

2

Bag of words

Split the document into words. Count them. Forget the order.

  • "dog toy" and "toy dog" → same vector. Order is lost.

Limitations:

  • Word order — gone
  • Context — gone
  • Synonyms (doctor / physician) — separate columns
  • Fillers (the, a) dominate counts

Still the foundation of most classical NLP pipelines.
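As a minimal sketch of the idea, the snippet below builds a bag of words with the standard library only. The helper name `bag_of_words` and the whitespace-only tokenization are assumptions made for illustration, not part of the original sheet.

```python
from collections import Counter

def bag_of_words(document: str) -> Counter:
    """Lowercase, split on whitespace, and count each word; order is discarded."""
    return Counter(document.lower().split())

a = bag_of_words("dog toy")
b = bag_of_words("toy dog")

print(a == b)  # True: identical counts, so both documents map to the same vector
print(bag_of_words("the dog saw the other dog"))  # "the" and "dog" each count 2
```

Note how the filler word "the" scores as high as "dog" in the last line, which is the fourth limitation listed above.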

3

CountVectorizer

Each document → vector of length V (vocabulary size).

Matrix shape: N × V · mostly zeros (sparse) · use scipy.sparse.

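from sklearn.feature_extraction.text import CountVectorizer

corpus = ["The dog chased the toy", "The toy dog barked"]  # placeholder documents
cv = CountVectorizer(lowercase=True, stop_words="english")  # lowercase, drop English stop words
X = cv.fit_transform(corpus)  # sparse N × V count matrix
print(cv.vocabulary_)  # term → column index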

You always need the word→index map: without it you cannot tell which term a column counts.

4

Weighting schemes

Binary

0 / 1 — does the word appear?

TF

Raw count of word in document.

TF-IDF

TF × log(N / df). Boosts rare informative terms, crushes stop words automatically.

TF-IDF formula: tf-idf(t, d) = tf(t, d) × log(N / df(t))
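The formula above can be checked by hand on a toy corpus. This sketch uses the natural log and plain whitespace tokens; the corpus and the helper name `tf_idf` are assumptions for illustration.

```python
import math
from collections import Counter

# Placeholder corpus, invented for this example
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
    "the quantum cat",
]
docs = [doc.split() for doc in corpus]
N = len(docs)

# df(t): number of documents that contain term t at least once
df = Counter(term for doc in docs for term in set(doc))

def tf_idf(term: str, doc: list[str]) -> float:
    """tf(t, d) * log(N / df(t)), using the natural log."""
    return doc.count(term) * math.log(N / df[term])

print(tf_idf("the", docs[0]))      # 0.0: "the" is in every document
print(tf_idf("quantum", docs[3]))  # log(4), about 1.386: in one document only
print(tf_idf("cat", docs[0]))      # log(4/3), about 0.288
```

Library implementations may differ from this textbook form. scikit-learn's `TfidfVectorizer`, for example, defaults to a smoothed IDF, ln((1 + N) / (1 + df)) + 1, followed by L2 normalisation, so a term that appears in every document is down-weighted there rather than zeroed.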

  • Word in every doc → df = N → IDF = log(1) = 0 → crushed.
  • Word in one doc → IDF = log(N) → max boost.
  • Log squashes the scale: with 1M documents the maximum IDF is ln(1,000,000) ≈ 13.8, not 1M.

5

Vector similarity

| Metric | When to use |
|---|---|
| Euclidean | Vectors already L2-normalised, or magnitudes truly matter |
| Cosine | Documents of different lengths; the default for NLP |

The long-book trap. A long biology book and a short biology pamphlet → far apart under Euclidean, parallel under cosine. Length ≠ topic.

cosine(u, v) = (u · v) / (‖u‖ · ‖v‖)

cosine_distance = 1 − cosine_similarity

After L2 normalisation, ‖u − v‖² = 2 − 2 · cosine(u, v), so cosine and Euclidean produce the same ranking.
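The long-book trap is easy to reproduce with plain Python. In this sketch the count vectors are hypothetical: the "book" is simply the "pamphlet" scaled by 100, so the two have identical word proportions.

```python
import math

def norm(v):
    """Euclidean (L2) length of a vector."""
    return math.sqrt(sum(x * x for x in v))

def cosine(u, v):
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def l2_normalise(v):
    n = norm(v)
    return [x / n for x in v]

pamphlet = [3.0, 1.0, 0.0, 2.0]        # hypothetical term counts
book = [x * 100 for x in pamphlet]     # same proportions, 100x longer

print(cosine(pamphlet, book))          # about 1.0: same direction
print(euclidean(pamphlet, book))       # large: dominated by document length
print(euclidean(l2_normalise(pamphlet), l2_normalise(book)))  # about 0.0

# On unit vectors, squared Euclidean distance equals 2 - 2 * cosine
other = l2_normalise([0.0, 2.0, 5.0, 1.0])
unit = l2_normalise(pamphlet)
print(euclidean(unit, other) ** 2, 2 - 2 * cosine(unit, other))
```

The last line prints the same value twice (up to floating-point error), which is why the two metrics rank neighbours identically once vectors are normalised.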

6

Stop words

Why: extremely common · uninformative · inflate dimensionality · dominate raw counts.

How:

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(stop_words="english")

Tradeoff:

  • ✅ Remove for: classification, search ranking
  • ❌ Keep for: summarization, generation (removing them destroys sentence structure)

TF-IDF down-weights stop words automatically — you often don't need a manual list at all.

7

Tokenization

| Flavour | Unit | Vocabulary | Used by |
|---|---|---|---|
| Word | word | up to 1M+ | classical NLP |
| Subword | piece (un + ##happy) | ~30–50k | BERT, GPT |
| Character | character | ~100 | OCR, low-resource |
| Sentence | sentence | n/a (segmentation) | summarization |

Word-level traps: It's → 1 or 2 tokens? Casing (Cat vs cat)? Punctuation (I hate cats? vs I hate cats)? Accents (Naïve vs Naive)?

Languages without spaces (Japanese, Chinese, Thai) → whitespace split is useless. Use a learned segmenter.
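A few of the word-level traps can be seen with the standard library alone. The sample sentence and both regular expressions below are illustrative choices, not a recommended tokenizer.

```python
import re

text = "It's true: I hate cats?"

# Whitespace split keeps punctuation glued to the words
print(text.split())
# ["It's", 'true:', 'I', 'hate', 'cats?']

# Word characters with an optional internal apostrophe: contraction stays whole
print(re.findall(r"\w+(?:'\w+)?", text.lower()))
# ["it's", 'true', 'i', 'hate', 'cats']

# Word characters only: the contraction splits into two tokens
print(re.findall(r"\w+", text.lower()))
# ['it', 's', 'true', 'i', 'hate', 'cats']
```

Each choice yields a different vocabulary, so the same rule has to be applied at training time and at inference time.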

8

Normalization

String-level — collapse surface variants:

| Surface | Target |
|---|---|
| U.S.A. / USA / U.S.A | usa |
| Cooooool | cool |
| Naïve | naive |
| running / ran / runs | run |
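The first three rows of the table can be sketched with the standard library. The function name `normalise` and its rules are assumptions for illustration: dropping every dot and capping repeated letters at two are blunt choices that would damage tokens such as "3.14". The last row (running / ran / runs → run) needs stemming or lemmatization, which this sketch does not attempt.

```python
import re
import unicodedata

def normalise(token: str) -> str:
    """Collapse case, accents, dots, and stretched letters (toy rules)."""
    token = token.casefold()                      # Cat -> cat
    # Decompose accented characters, then drop the combining marks
    token = unicodedata.normalize("NFKD", token)
    token = "".join(ch for ch in token if not unicodedata.combining(ch))
    token = token.replace(".", "")                # U.S.A. -> usa
    # Collapse runs of three or more identical characters down to two
    token = re.sub(r"(.)\1{2,}", r"\1\1", token)
    return token

for surface in ["U.S.A.", "USA", "U.S.A", "Cooooool", "Naïve"]:
    print(surface, "->", normalise(surface))
```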

Vector-level — L1 vs L2:

| Norm | Formula | Result |
|---|---|---|
| L2 | x / √(Σxᵢ²) | Vector length = 1 (default for TF-IDF) |
| L1 | x / Σ\|xᵢ\| | Cells sum to 1 for non-negative counts (probabilistic models) |

9

Stemming vs Lemmatization

| | Stemming | Lemmatization |
|---|---|---|
| Method | Chop endings via rules | Dictionary + POS |
| Output | May not be a real word | Always a real word |
| Speed | Fast | Slower |
| Example | replacement → replac | replacement → replacement |
| Tool | Porter Stemmer (NLTK) | WordNet (NLTK), spaCy |

from nltk.stem import PorterStemmer, WordNetLemmatizer

# The lemmatizer needs the WordNet data, e.g. nltk.download("wordnet")
PorterStemmer().stem("studies")           # 'studi'
WordNetLemmatizer().lemmatize("studies", pos="v")  # 'study'

Modern transformer pipeline: skip both. Subword tokenization lets related forms share pieces, so the model handles most morphological variation itself.
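To show the difference in mechanism without any downloads, here is a toy sketch: a rule-based suffix stripper next to a dictionary lookup. Both `toy_stem` and `toy_lemmatize`, along with their suffix list and lemma table, are hypothetical and far smaller than Porter's rules or WordNet.

```python
# Hypothetical suffix rules and lemma table, for illustration only
SUFFIXES = ("ement", "ing", "es", "s")
LEMMAS = {"studies": "study", "running": "run", "ran": "run"}

def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word: str) -> str:
    """Dictionary lookup; unknown words pass through unchanged."""
    return LEMMAS.get(word, word)

for w in ["studies", "running", "ran", "replacement"]:
    print(w, "| stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

The rules produce non-words such as "studi" and "replac" and cannot link "ran" to "run", while the lookup returns real words but only for entries it knows.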

10

Regular expressions

First-pass tool. Often the only tool you need.

import re

# Emails
re.findall(r"[\w\.-]+@[\w\.-]+\.\w+", text)

# Dates 2026-06-29
re.findall(r"\b\d{4}-\d{2}-\d{2}\b", text)

# Strip HTML
re.sub(r"<[^>]+>", "", text)

Used everywhere in NLP:

  • Named entities, dates, URLs, currency
  • Features inside ML classifiers
  • Cleaning before vectorization
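As one way regex output can feed a classifier, the sketch below turns pattern counts into a small feature dictionary. The feature names, the patterns, and the sample sentence are all assumptions made for illustration; real URL and currency formats are messier than these expressions allow.

```python
import re

# Hypothetical feature set: name -> compiled pattern
PATTERNS = {
    "n_urls": re.compile(r"https?://\S+"),
    "n_currency": re.compile(r"[$€£]\s?\d+(?:[.,]\d+)*"),
    "n_dates": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}

def regex_features(text: str) -> dict[str, int]:
    """Count the matches of each pattern in the text."""
    return {name: len(p.findall(text)) for name, p in PATTERNS.items()}

sample = "Pay $19.99 by 2026-06-29 at https://example.com/pay or lose £5."
print(regex_features(sample))
# {'n_urls': 1, 'n_currency': 2, 'n_dates': 1}
```

Counts like these can be appended as extra columns next to the bag-of-words matrix.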

Rule of thumb: try regex first. If it dissolves the problem, you don't need a model.