Maria Aguilera

At every node, the tree:

Considers each feature.
For each feature, tries possible split points (thresholds for numeric, group splits for categorical).
Scores each candidate by how much impurity it removes.
Picks the split with the largest impurity reduction.

Recursively splits until a stopping rule fires (max depth, min samples, no impurity gain).
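
The search described above can be sketched in plain Python. This is a minimal illustration, not how any particular library implements it: it assumes numeric features only, uses Gini impurity as the score, and takes midpoints between consecutive distinct values as candidate thresholds. The names `gini` and `best_split` and the toy data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return (feature_index, threshold, impurity_reduction) of the best split.

    rows is a list of equal-length numeric feature lists.
    Returns (None, None, 0.0) if no split reduces impurity.
    """
    n = len(rows)
    parent = gini(labels)
    best = (None, None, 0.0)
    for j in range(len(rows[0])):  # consider each feature
        values = sorted(set(r[j] for r in rows))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            # Impurity of the children, weighted by their share of the samples.
            child = (len(left) * gini(left) + len(right) * gini(right)) / n
            gain = parent - child
            if gain > best[2]:
                best = (j, t, gain)
    return best


# Toy data (made up): columns are [age, blood_pressure].
rows = [[25, 120], [30, 150], [55, 130], [60, 160]]
labels = ["low", "low", "high", "high"]
feature, threshold, gain = best_split(rows, labels)
print(feature, threshold, gain)
```

On this toy data, splitting on age between 30 and 55 separates the classes perfectly, so it removes all of the parent's impurity. A full tree would call `best_split` again on each child until a stopping rule fires.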

For classification (smaller = purer node):

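Gini impurity: 1 − Σ p², where p is the proportion of each class in the node. The probability of misclassifying a random sample from the node if you label it at random according to the node's class distribution.
Entropy: −Σ p · log(p). Information-theoretic measure of disorder; zero for a pure node.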

In practice, Gini and entropy almost always pick the same splits. Gini is slightly faster (no logarithm to compute); entropy slightly favours balanced splits. Don't lose sleep over it.
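
To make the two formulas concrete, here is a small sketch that evaluates both on a few class distributions. The function names are illustrative, and the base-2 logarithm is an assumption (the formula above does not fix a base; base 2 gives entropy in bits).

```python
import math


def gini(p):
    """Gini impurity for a list of class proportions p (summing to 1)."""
    return 1.0 - sum(q * q for q in p)


def entropy(p):
    """Shannon entropy in bits; classes with proportion 0 contribute nothing."""
    return sum(q * math.log2(1 / q) for q in p if q > 0)


# Pure node, skewed node, perfectly mixed two-class node.
for p in ([1.0, 0.0], [0.9, 0.1], [0.5, 0.5]):
    print(p, round(gini(p), 3), round(entropy(p), 3))
```

Both measures are zero for a pure node and peak at the evenly mixed node (0.5 for Gini, 1 bit for entropy with two classes), which is why they tend to rank candidate splits the same way.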

For regression:

MSE: mean squared error around the node's mean, i.e. the variance of the targets in the node. Splits minimise the weighted variance of the two children.
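
A worked sketch of the regression criterion, using made-up numbers. The helper names `mse` and `split_mse` are hypothetical, and the threshold is picked by hand rather than searched for.

```python
def mse(values):
    """Mean squared error around the node mean (the variance of the node)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def split_mse(x, y, threshold):
    """Weighted MSE of the two children produced by the rule x <= threshold."""
    left = [yi for xi, yi in zip(x, y) if xi <= threshold]
    right = [yi for xi, yi in zip(x, y) if xi > threshold]
    return (len(left) * mse(left) + len(right) * mse(right)) / len(y)


x = [1, 2, 3, 10, 11, 12]
y = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]
before = mse(y)                # impurity of the unsplit node
after = split_mse(x, y, 6.5)   # impurity after splitting between 3 and 10
print(before, after)
```

The split separates the two clusters of targets, so the weighted child variance is far below the variance of the parent node. That drop is the quantity a regression tree maximises.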

A decision tree with no depth limit will keep splitting until every leaf is pure — one sample per leaf if needed. Train accuracy = 100 %. Test accuracy = wherever the noise sends it.

Knobs to control overfitting:

Hyperparameter	Effect
`max_depth`	Hard cap on tree depth. Smaller = more bias, less variance.
`min_samples_split`	Need ≥ N samples to consider a split.
`min_samples_leaf`	Each leaf must have ≥ N samples. Strong regulariser.
`max_features`	Only consider a random subset of features at each split. Adds randomness, which decorrelates the trees in a bagged ensemble.
`ccp_alpha`	Cost-complexity pruning. Penalty per leaf; larger values prune more.
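
A minimal sketch of the overfitting behaviour and two of the knobs above, assuming scikit-learn is installed. The synthetic dataset and the specific hyperparameter values are illustrative choices, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; flip_y=0.2 assigns a random label to roughly 20% of the
# samples, so there is label noise for an unconstrained tree to memorise.
X, y = make_classification(n_samples=600, n_features=10, n_informative=4,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# No limits: the tree grows until every leaf is pure.
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Regularised: capped depth and a minimum leaf size.
pruned = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10,
                                random_state=0).fit(X_train, y_train)

for name, model in [("unlimited", deep), ("regularised", pruned)]:
    print(name, model.get_n_leaves(),
          model.score(X_train, y_train), model.score(X_test, y_test))
```

The unlimited tree reaches perfect training accuracy with many more leaves. Comparing the two test scores shows how much of that fit was noise.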

scikit-learn's DecisionTreeClassifier supports only numeric splits, so you have to encode categoricals first.

Encoding	Tree behaviour
Ordinal	Tree splits at a number, which only makes sense if the order is meaningful.
One-hot	Tree picks "is category X" or "not". Each category = one binary feature.
Target / mean	Tree splits at a threshold on the per-category mean target. Works well, but watch for target leakage.

LightGBM and CatBoost can handle categoricals natively without these workarounds — see Part 8.
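
The ordinal and one-hot encodings from the table can be sketched with scikit-learn's preprocessing classes. The toy columns (`size`, `colour`) and labels are hypothetical; target encoding is left out here because doing it without leakage needs out-of-fold statistics.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: one ordered and one unordered categorical column.
size = np.array([["S"], ["M"], ["L"], ["M"], ["S"], ["L"]])
colour = np.array([["red"], ["green"], ["blue"], ["red"], ["blue"], ["green"]])
y = np.array([0, 0, 1, 0, 0, 1])  # label is 1 exactly when size == "L"

# Ordinal: pass the order explicitly so S < M < L maps to 0 < 1 < 2.
size_ord = OrdinalEncoder(categories=[["S", "M", "L"]]).fit_transform(size)

# One-hot: one binary column per colour.
colour_oh = OneHotEncoder().fit_transform(colour).toarray()

X = np.hstack([size_ord, colour_oh])
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(size_ord.ravel())
print(colour_oh.shape)
print(clf.score(X, y))
```

Because the size order is meaningful here, a single threshold on the ordinal column (between M and L) is enough to separate the classes.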

The list of things trees don't care about:

Scaling. Splits use thresholds; multiplying a feature by 1000 doesn't change which rows are above or below.
Outliers. Splits depend on the order of feature values, not their magnitude, so a single extreme feature value only affects its own leaf.
Feature distributions. No Gaussian assumption.
Correlated features. Trees just pick one of the correlated pair.

This makes trees a fast first choice on messy tabular data — minimal preprocessing.
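
The scaling claim is easy to check empirically. The sketch below, assuming scikit-learn and its bundled iris dataset (a stand-in chosen only for convenience), fits the same tree on raw features and on features multiplied by 1000.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Same hyperparameters and seed; only the feature scale differs.
raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
scaled = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X * 1000, y)

# The thresholds differ by the scale factor, but each split sends the same
# rows left and right, so the predictions should agree.
same = np.array_equal(raw.predict(X), scaled.predict(X * 1000))
print(same)
```

Distance-based or gradient-based models fitted on the same two versions of the data would generally not agree, which is why they need a scaling step and trees do not.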

A small tree (depth ≤ 4) is the most interpretable model in ML:

def risk(age, blood_pressure):
    if age > 50:
        if blood_pressure > 140:
            return "high risk"
        else:
            return "low risk"
    else:
        return "low risk"

A doctor can read it and audit it. A deep tree loses this — by depth 20, the rules are gibberish to humans.

The trade-off: shallow trees are interpretable but biased. Use a small tree for explanation, a forest / boosting for performance.
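
To get readable rules out of a fitted scikit-learn tree, `sklearn.tree.export_text` prints the splits as nested conditions. This sketch uses the bundled iris dataset as a stand-in, since the medical example above has no data attached.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Keep the tree shallow so the printed rules stay readable.
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# export_text renders the fitted tree as indented threshold rules.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Raising `max_depth` makes the printed rule set grow quickly, which is the loss of readability described above.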

Trees can rank features by how much impurity they remove across all splits:

clf.feature_importances_

Caveats:

Biased toward high-cardinality features (more thresholds to try).
Credit goes to only one of a group of correlated features; the others look useless.
Not the same as causal importance.

For a more honest importance measure, use permutation importance (`sklearn.inspection.permutation_importance`).
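
A side-by-side sketch of impurity-based and permutation importance, assuming scikit-learn. The synthetic dataset and the settings (`max_depth=4`, `n_repeats=10`) are arbitrary illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

# Impurity-based importance: derived from the training splits, sums to 1.
impurity_imp = clf.feature_importances_

# Permutation importance: mean drop in held-out accuracy when one column
# is shuffled, averaged over n_repeats shuffles.
perm = permutation_importance(clf, X_test, y_test, n_repeats=10,
                              random_state=0)

for j in range(X.shape[1]):
    print(j, round(impurity_imp[j], 3), round(perm.importances_mean[j], 3))
```

Computing the permutation scores on held-out data is a deliberate choice here: it measures what the model actually relies on for generalisation rather than what it used to fit the training set.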

When to use a single tree? Mostly: don't. Single trees overfit too easily and lose to forests / boosting on pure performance.

Exceptions:

You need to explain the model to a stakeholder.
Inference must be < 1 ms on a constrained device.
You want a quick visualisation of which features split where.
You want a baseline before trying ensembles, to see what a depth-2 tree already gets you.

For real prediction tasks → jump straight to Random Forest / XGBoost / LightGBM (Part 8).

Part 7 · Decision Trees — Cheat Sheet

How a split is chosen

Impurity measures

Why trees overfit

Categorical handling

What trees don't need

Interpretability

Feature importance

When to use a single tree