Maria Aguilera

Step	Meaning
Select data	Choose and collect inputs that represent the future cases you want to generalise to.
Preprocess	Clean errors, nulls, outliers, formats, irrelevant info.
Transform	Scale, encode, bin, log-transform, build features.
Model	Train and validate. Use CV for tuning, test only at the end.
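
The "CV for tuning, test only at the end" rule in the Model row can be sketched as follows. This is a minimal illustration, not a prescribed workflow: the synthetic dataset, the `C` grid, and the 75/25 split are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data, purely illustrative (assumption, not from the notes).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Hold out the test set before any tuning happens.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Tune with 5-fold cross-validation on the training split only.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # hypothetical grid
    cv=5,
)
search.fit(X_train, y_train)

# Touch the test set once, at the very end.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, round(test_accuracy, 3))
```

The test score is read a single time; if it were used to pick `C`, it would no longer estimate performance on unseen data.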

Before you touch data: Is ML even the right tool? If a deterministic rule works, ML may be unnecessary.

Feature meaning first. -1 is an impossible age but may be a valid code in another column.
Representative data. Small samples → sampling noise. Flawed sampling → sampling bias.
Biases (biased data → biased model):
- Volunteer — participants differ from non-participants.
- Selection — sample drawn from a narrow subgroup.
- Survival — looking only at what passed the filter (the WWII planes).
Supervised = target given (regression / classification). Unsupervised = no target labels.

Scaling	Algorithms
Mandatory	KNN · SVM · PCA · gradient descent · regularised linear/logistic
Recommended	Neural nets · Linear/Logistic regression
Optional	Decision trees · Random Forest · Gradient Boosted Trees

Scaler	Use when
`StandardScaler`	Default. Mean 0, std 1. No strong outliers.
`RobustScaler`	Outliers present. Uses median + IQR.
`MinMaxScaler`	Need bounded range `[0, 1]` or non-negative.
`Normalizer`	Each row is a vector; angle/cosine matters (text, recommenders).

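from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)                   # learn mean and std from the training set only
X_train_s = scaler.transform(X_train)
X_test_s  = scaler.transform(X_test)  # reuse the training statistics; NEVER fit on test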
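
To make the scaler table concrete, here is a small sketch of how each scaler treats the same column when one strong outlier is present. The numbers are made up for illustration; only the scikit-learn classes named in the table are used.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer, RobustScaler, StandardScaler

# One feature with a single strong outlier (made-up values).
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standard = StandardScaler().fit_transform(X_train)
robust = RobustScaler().fit_transform(X_train)
minmax = MinMaxScaler().fit_transform(X_train)

# StandardScaler: the outlier inflates the std, so the four inliers
# end up squeezed into a very narrow band.
print(standard.ravel().round(3))

# RobustScaler centres on the median (3) and divides by the IQR (2),
# so the inliers keep a usable spread.
print(robust.ravel())

# MinMaxScaler maps the column onto [0, 1]; the outlier defines the top.
print(minmax.ravel().round(3))

# Normalizer works per ROW: both rows point in the same direction,
# so they become the same unit vector.
rows = np.array([[3.0, 4.0], [30.0, 40.0]])
unit_rows = Normalizer().fit_transform(rows)
print(unit_rows)
```

Note that `Normalizer` is the odd one out: it rescales samples, not features, which is why it suits cosine-style comparisons.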

Encoding	Use when	Main trap
Ordinal	Real order: `low < medium < high`.	Fake distance/order on nominal categories.
One-hot	Nominal, manageable cardinality.	More columns + collinearity.
Target	High cardinality (zip code).	Uses `y` → leakage if outside CV.
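
The ordinal row's trap can be seen in a few lines. In this sketch (the `sizes` column is a made-up example), passing the order explicitly gives `low < medium < high`, while the default ordering is alphabetical and silently puts `high` first.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature.
sizes = np.array([["medium"], ["low"], ["high"], ["low"]])

# Explicit order: low=0, medium=1, high=2.
encoder = OrdinalEncoder(categories=[["low", "medium", "high"]])
encoded = encoder.fit_transform(sizes)
print(encoded.ravel())

# Default order is sorted alphabetically: high=0, low=1, medium=2,
# which does not match the real ranking.
default_encoded = OrdinalEncoder().fit_transform(sizes)
print(default_encoded.ravel())
```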

Zip code / product ID = category, even when stored as a number.
Binary Yes/No → one dummy column is enough.
Use OneHotEncoder (not pd.get_dummies) inside pipelines: it learns the categories during fit and produces the same columns for new data.
Tree models tolerate correlated dummies better than linear models.

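from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# numeric_cols / cat_cols: lists of column names in your DataFrame
preprocess = make_column_transformer(
    (StandardScaler(), numeric_cols),
    (OneHotEncoder(handle_unknown="ignore"), cat_cols),
)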
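
Two of the notes above can be checked directly: what `handle_unknown="ignore"` does with a category that was never seen during fit, and how a Yes/No feature collapses to one dummy column. The borough names and answers below are invented for the sketch, and `drop="if_binary"` is one possible way to get the single column, not the only one.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Nominal feature; "Staten Island" never appears in training (made-up data).
boro_train = np.array([["Bronx"], ["Queens"], ["Brooklyn"]])
boro_test = np.array([["Staten Island"]])

boro_encoder = OneHotEncoder(handle_unknown="ignore")
boro_encoder.fit(boro_train)

# An unseen category becomes an all-zero row instead of raising an error.
unseen_row = boro_encoder.transform(boro_test).toarray()
print(boro_encoder.categories_[0], unseen_row)

# Binary feature: drop="if_binary" keeps a single indicator column.
answers = np.array([["Yes"], ["No"], ["No"]])
binary_encoder = OneHotEncoder(drop="if_binary")
binary = binary_encoder.fit_transform(answers).toarray()
print(binary.ravel())  # indicator for "Yes"
```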

Ask first, then act:

Question	Action
Is it an error?	Correct, remove, or impute.
Valid and meaningful?	Keep. Use robust scaler / log / add indicator.
Target of interest (fraud, failure)?	Do not remove — model needs it.
Tiny and non-representative?	Remove if doing so won't bias the sample.

Detection:

Z-score — |z| > 3. Assumes ~Gaussian.
IQR — outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR]. No distribution assumption.
Model-based — IsolationForest, OneClassSVM, LocalOutlierFactor, robust covariance. Best for high-dim / non-Gaussian.
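
The three detection routes above can be sketched on one toy column. The data (twenty evenly spaced values plus one extreme point) and `n_neighbors=5` are assumptions chosen for the example; `LocalOutlierFactor` stands in for the model-based family.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# 10, 11, ..., 29 plus one extreme value (made-up data).
values = np.concatenate([np.arange(10.0, 30.0), [200.0]])

# Z-score rule: flag |z| > 3.
z_scores = (values - values.mean()) / values.std()
z_outliers = values[np.abs(z_scores) > 3]

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

# Model-based: LocalOutlierFactor labels outliers as -1, inliers as 1.
lof = LocalOutlierFactor(n_neighbors=5)
lof_labels = lof.fit_predict(values.reshape(-1, 1))

print(z_outliers, iqr_outliers, (lower, upper), lof_labels[-1])
```

Flagging is only the detection step; whether to correct, keep, or remove the point still depends on the questions in the table above.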

Type	Meaning	Action
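MCAR	Missing completely at random: unrelated to any variable.	Remove rows if few; impute if many.
MAR	Missing at random: depends on observed variables.	Impute with a multivariate imputer (KNN/Iterative).
MNAR	Missing not at random: the missingness itself is signal.	Add `MissingIndicator` + impute.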
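
For the MAR row, a rough sketch of why a multivariate imputer can help: when salary is missing but age is observed, `KNNImputer` borrows the salary of the most similar row, whereas a median fill ignores age entirely. The four rows and `n_neighbors=1` are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Columns: age, salary (made-up values; last salary is missing).
X = np.array([
    [25.0, 30000.0],
    [26.0, 32000.0],
    [50.0, 80000.0],
    [52.0, np.nan],
])

# Univariate baseline: median of the observed salaries.
median_filled = SimpleImputer(strategy="median").fit_transform(X)

# Multivariate: use the nearest row by age (the only observed feature).
knn_filled = KNNImputer(n_neighbors=1).fit_transform(X)

print(median_filled[3, 1], knn_filled[3, 1])
```

In real data the features used for the distance usually need scaling first, since KNN is distance-based.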

Situation	Best start
Numerical missing	Median (robust to outliers).
Categorical missing	Most frequent or constant `"Missing"`.
Missingness meaningful	`add_indicator=True`.
Need relationships	`KNNImputer` / `IterativeImputer`.
> 50 % missing, no meaning	Drop column.

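from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Numeric: fill with the median, then scale
num_pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
# Categorical: fill with the most frequent level, then one-hot encode
cat_pipe = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)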
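
Two rows of the table above, `add_indicator=True` and the constant `"Missing"` fill, are easy to verify on tiny arrays. The values are invented; this is a sketch of the behaviour, not a recommendation for any particular dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Numeric column with two gaps (made-up values).
X_num = np.array([[20.0], [np.nan], [40.0], [np.nan], [30.0]])

# Median fill plus an extra 0/1 column marking which rows were missing.
num_imputer = SimpleImputer(strategy="median", add_indicator=True)
X_num_out = num_imputer.fit_transform(X_num)
print(X_num_out)

# Categorical column: make missingness its own explicit level.
X_cat = np.array([["a"], [np.nan], ["b"]], dtype=object)
cat_imputer = SimpleImputer(strategy="constant", fill_value="Missing")
X_cat_out = cat_imputer.fit_transform(X_cat)
print(X_cat_out.ravel())
```

The indicator column lets a downstream model use "was missing" as a feature even after the gap has been filled.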

Skewness — Log / Box-Cox on long right tails (prices, income, time).
Binning — Group levels to reduce cardinality.
Discretisation — Continuous → categories via KBinsDiscretizer.
Typing — Dates as datetime64, IDs as category, currency strings → numeric.
0 ≠ missing. Always check the domain meaning before treating zeros.
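
A compact sketch of the typing, skewness, and discretisation items above. The DataFrame and its column names are hypothetical; `log1p` is used as one common log variant, and the `uniform` binning strategy with three bins is an arbitrary choice for the example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical raw data as it might arrive from a CSV.
df = pd.DataFrame({
    "signup": ["2024-01-15", "2024-02-01", "2024-03-20"],
    "customer_id": [1001, 1002, 1001],
    "price": ["$1,200.50", "$15.00", "$330.00"],
    "age": [18, 42, 67],
})

# Typing: dates as datetimes, IDs as category, currency strings as floats.
df["signup"] = pd.to_datetime(df["signup"])
df["customer_id"] = df["customer_id"].astype("category")
df["price"] = df["price"].str.replace(r"[$,]", "", regex=True).astype(float)

# Skewness: compress the long right tail of prices.
df["log_price"] = np.log1p(df["price"])

# Discretisation: continuous age into 3 equal-width bins (0, 1, 2).
binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
df["age_bin"] = binner.fit_transform(df[["age"]]).ravel()

print(df.dtypes)
print(df[["price", "log_price", "age_bin"]])
```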

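from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features     = ["age", "salary"]
categorical_features = ["boro", "zipcode"]

num_pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
cat_pipe = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)

preprocess = make_column_transformer(
    (num_pipe, numeric_features),
    (cat_pipe, categorical_features),
)

pipe = make_pipeline(preprocess, LogisticRegression(max_iter=1000))

# Cross-validate the WHOLE pipeline, on the training split only
scores = cross_val_score(pipe, X_train, y_train, cv=5)

# Refit on all training data, then score once on the held-out test set
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)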
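
To see the template run end to end, the sketch below builds a small synthetic DataFrame with the same four column names, injects some missing values, and evaluates the pipeline. Everything about the data (sizes, distributions, the salary-based target, the 75/25 split) is an assumption made so the example is self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data, purely illustrative.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "age": rng.integers(18, 80, size=n).astype(float),
    "salary": rng.normal(50_000, 15_000, size=n),
    "boro": rng.choice(["Bronx", "Brooklyn", "Queens"], size=n),
    "zipcode": rng.choice(["10451", "11201", "11354"], size=n),
})
y = (X["salary"] > 50_000).astype(int)  # toy target tied to salary

# Knock out some values so the imputers have work to do.
X.loc[rng.choice(n, size=20, replace=False), "age"] = np.nan
X.loc[rng.choice(n, size=20, replace=False), "boro"] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

preprocess = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy="median"), StandardScaler()),
     ["age", "salary"]),
    (make_pipeline(SimpleImputer(strategy="most_frequent"),
                   OneHotEncoder(handle_unknown="ignore")),
     ["boro", "zipcode"]),
)
pipe = make_pipeline(preprocess, LogisticRegression(max_iter=1000))

# Imputer, scaler, and encoder are refit inside every CV fold.
scores = cross_val_score(pipe, X_train, y_train, cv=5)

pipe.fit(X_train, y_train)
test_score = pipe.score(X_test, y_test)
print(scores.round(3), round(test_score, 3))
```

Because preprocessing lives inside the pipeline, each fold computes its own medians, means, and category lists, so nothing from the validation fold or the test set leaks into training.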

Part 2 · Data Cleaning & Preprocessing — Cheat Sheet

ML Process

Understanding the data

Scaling — when it matters

Categorical features

Outliers

Null values

Other preprocessing

Pipeline template