| K | Behaviour |
|---|---|
| K = 1 | Boundary follows every training point. High variance, overfits. |
| Small K (3–5) | Local detail, can capture small clusters. |
| Large K (50+) | Smooth boundary. High bias, underfits. |
| K = N | Always predicts the majority class. Useless. |
The algorithm
For each new point:
- Compute its distance to every training point.
- Pick the K closest neighbours.
- Classify → majority vote among the K labels.
- Regress → mean (or weighted mean) of the K target values.
No model is fit at training time — just stored. All the work happens at inference.
The K knob
K controls the bias-variance trade-off:
Pick K by cross-validation. Use odd values for binary classification to avoid ties.
Why scale always
KNN uses Euclidean distance by default:
d(x, y) = sqrt(Σ (x_i − y_i)²)
If income ranges in 0–200,000 and age ranges in 0–80, the income axis dominates every distance. The model becomes "1-NN on income" and ignores age completely.
StandardScaler before KNN. Always. No exceptions.
Same applies to clustering (KMeans), PCA, and anything else distance-based.
Distance metrics
KNN supports many distance functions:
| Metric | Use for |
|---|---|
| Euclidean (L2) | Default. Continuous features. |
| Manhattan (L1) | High-dim or robust to outliers. |
| Minkowski (general) | Tune p parameter (1=Manhattan, 2=Euclidean). |
| Hamming | Binary / categorical features. |
| Cosine | Text vectors, recommender systems. |
For text and recommenders, cosine similarity (1 − cosine distance) usually beats Euclidean.
Weighted KNN
Default: all K neighbours vote equally.
Weighted variant: weight each neighbour by 1 / distance — closer neighbours have more say.
KNeighborsClassifier(n_neighbors=5, weights="distance")Helps when K is moderately large and you don't want distant outliers to dilute the vote.
When KNN wins
- Low-dimensional, dense data. Distances are meaningful.
- Decision boundary is locally smooth but globally complex.
- You can't afford to train a model but inference latency isn't critical.
- Recommender systems — user-user or item-item similarity.
- Anomaly detection — points far from their K nearest neighbours are suspicious.
When KNN loses
- High-dimensional data — curse of dimensionality. All points are "equally far". See Part 10 / PCA cheat sheet.
- Massive datasets at inference time.
O(n)per prediction. Use ANN libraries (FAISS, Annoy) or just pick another model. - Imbalanced classes. Majority class wins votes; minority class disappears.
- You need interpretability beyond "these were its 5 nearest neighbours."
KNN as a recommender
Memory-based collaborative filtering:
User-user CF: "Find K users similar to you (by past ratings), recommend what they liked."
Item-item CF: "Find K items similar to this one (by who rated them), suggest those."
Steps:
- Build a user × item rating matrix.
- Normalise (mean-centre each row).
- Compute pairwise cosine similarities between users (or items).
- For each unseen
(user, item): weighted average of similar users' ratings of that item.
Pros: simple, no model training. Cons: cold start (new users / items), poor scaling.
Modern systems combine KNN with matrix factorisation or embeddings.