Maria Aguilera

For each new point:

Compute its distance to every training point.
Pick the K closest neighbours.
Classify → majority vote among the K labels.
Regress → mean (or weighted mean) of the K target values.

No model is fit at training time — just stored. All the work happens at inference.

K controls the bias-variance trade-off:

K	Behaviour
K = 1	Boundary follows every training point. High variance, overfits.
Small K (3–5)	Local detail, can capture small clusters.
Large K (50+)	Smooth boundary. High bias, underfits.
K = N	Always predicts the majority class. Useless.

Pick K by cross-validation. Use odd values for binary classification to avoid ties.

KNN uses Euclidean distance by default:

d(x, y) = sqrt(Σ (x_i − y_i)²)

If income ranges in 0–200,000 and age ranges in 0–80, the income axis dominates every distance. The model becomes "1-NN on income" and ignores age completely.

StandardScaler before KNN. Always. No exceptions.

Same applies to clustering (KMeans), PCA, and anything else distance-based.

KNN supports many distance functions:

Metric	Use for
Euclidean (L2)	Default. Continuous features.
Manhattan (L1)	High-dim or robust to outliers.
Minkowski (general)	Tune `p` parameter (1=Manhattan, 2=Euclidean).
Hamming	Binary / categorical features.
Cosine	Text vectors, recommender systems.

For text and recommenders, cosine similarity (1 − cosine distance) usually beats Euclidean.

Default: all K neighbours vote equally.

Weighted variant: weight each neighbour by 1 / distance — closer neighbours have more say.

KNeighborsClassifier(n_neighbors=5, weights="distance")

Helps when K is moderately large and you don't want distant outliers to dilute the vote.

Low-dimensional, dense data. Distances are meaningful.
Decision boundary is locally smooth but globally complex.
You can't afford to train a model but inference latency isn't critical.
Recommender systems — user-user or item-item similarity.
Anomaly detection — points far from their K nearest neighbours are suspicious.

High-dimensional data — curse of dimensionality. All points are "equally far". See Part 10 / PCA cheat sheet.
Massive datasets at inference time. O(n) per prediction. Use ANN libraries (FAISS, Annoy) or just pick another model.
Imbalanced classes. Majority class wins votes; minority class disappears.
You need interpretability beyond "these were its 5 nearest neighbours."

Memory-based collaborative filtering:

User-user CF: "Find K users similar to you (by past ratings), recommend what they liked."

Item-item CF: "Find K items similar to this one (by who rated them), suggest those."

Steps:

Build a user × item rating matrix.
Normalise (mean-centre each row).
Compute pairwise cosine similarities between users (or items).
For each unseen (user, item): weighted average of similar users' ratings of that item.

Pros: simple, no model training. Cons: cold start (new users / items), poor scaling.

Modern systems combine KNN with matrix factorisation or embeddings.

Part 12 · KNN & Recommender Systems — Cheat Sheet

The algorithm

The K knob

Why scale always

Distance metrics

Weighted KNN

When KNN wins

When KNN loses

KNN as a recommender