Part 2 · Data Cleaning & Preprocessing — Cheat Sheet

Decisions, code patterns, and traps. Designed for fast exam-style revision and quick lookup during interviews.

1

ML Process

Step | Meaning
Select data | Choose and collect data representative of the future cases you want to generalise to.
Preprocess | Clean errors, nulls, outliers, formats, irrelevant info.
Transform | Scale, encode, bin, log-transform, build features.
Model | Train and validate. Use CV for tuning; touch the test set only at the end.
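
The "CV for tuning, test only at the end" rule from the Model step can be sketched as follows. This is a minimal illustration, not a prescribed workflow: the synthetic dataset, the LogisticRegression model, and the grid of C values are all assumptions chosen to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Hold out the test set once, before any tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Tune with cross-validation on the training portion only.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
)
search.fit(X_train, y_train)

# Touch the test set exactly once, after the model is chosen.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, round(test_accuracy, 3))
```

Because the test rows never take part in the search, the final score is an honest estimate rather than one inflated by tuning.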

Before you touch data: Is ML even the right tool? If a deterministic rule works, ML may be unnecessary.

2

Understanding the data

  • Feature meaning first. A value of -1 is impossible as an age but may be a valid code (for example "unknown") in another column.
  • Representative data. Small samples → sampling noise. Flawed sampling → sampling bias.
  • Biases (biased data → biased model):
    • Volunteer: participants differ from non-participants.
    • Selection: sample drawn from a narrow subgroup.
    • Survivorship: looking only at what passed the filter (the WWII planes).
  • Supervised = target given (regression / classification). Unsupervised = no target labels.
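
The "feature meaning first" point can be illustrated with a sentinel value that is invalid in one column and valid in another. The column names and the meaning of -1 below are hypothetical, made up for this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: -1 is impossible for age, but a valid
# "no previous purchase" code in last_purchase_code.
df = pd.DataFrame({
    "age": [34, -1, 52, 28],
    "last_purchase_code": [3, -1, 7, -1],
})

# Treat -1 as missing only in the column where it cannot be a real value.
df["age"] = df["age"].replace(-1, np.nan)

print(df["age"].isna().sum())                   # number of ages now marked missing
print((df["last_purchase_code"] == -1).sum())   # codes left untouched
```

A blanket `df.replace(-1, np.nan)` would have erased the valid codes too, which is why the replacement is applied per column.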
3

Scaling — when it matters

Mandatory

KNN · SVM · PCA · gradient descent · regularised linear/logistic regression

Recommended

Neural nets · unregularised linear/logistic regression

Optional

Decision trees · Random Forest · Gradient Boosted Trees (splits depend only on the ordering of values within a feature, so rescaling does not change them)
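
Why distance-based methods such as KNN sit in the mandatory group can be seen with a few hand-made rows. The numbers are hypothetical, and the standardisation is done by hand with NumPy to keep the sketch dependency-free.

```python
import numpy as np

# Hypothetical rows: [age in years, salary in dollars]. Row 0 is the query.
X = np.array([
    [25.0, 50_000.0],   # query point
    [60.0, 50_500.0],   # very different age, almost the same salary
    [26.0, 52_000.0],   # almost the same age, slightly different salary
    [45.0, 80_000.0],
    [35.0, 30_000.0],
])

def nearest_to_first(data):
    """Index of the row closest to row 0 in Euclidean distance."""
    dists = np.linalg.norm(data[1:] - data[0], axis=1)
    return int(np.argmin(dists)) + 1

# Raw units: salary differences swamp age differences.
raw_neighbor = nearest_to_first(X)

# Standardise each column (mean 0, std 1) so both features contribute.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
scaled_neighbor = nearest_to_first(X_std)

print(raw_neighbor, scaled_neighbor)
```

On raw values the 60-year-old is picked as the nearest neighbour only because the salary column has the larger numeric range; after standardising, the 26-year-old is closest.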

Scaler | Use when
StandardScaler | Default when there are no strong outliers. Rescales to mean 0, std 1.
RobustScaler | Outliers present. Uses median + IQR.
MinMaxScaler | Need a bounded range such as [0, 1], or non-negative values.
Normalizer | Rescales each row (not column) to unit norm; use when angle/cosine matters (text, recommenders).

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)                   # learn mean and std from train only
X_train_s = scaler.transform(X_train)
X_test_s  = scaler.transform(X_test)  # reuse train statistics; NEVER fit on test
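
The difference between the scalers is easiest to see on a column with one extreme value. The salaries below are hypothetical, and the "spread" measure is just an illustrative way to compare results, not a standard metric.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical salaries with one extreme but valid value.
salaries = np.array([[30_000.0], [32_000.0], [35_000.0], [38_000.0], [1_000_000.0]])

for scaler in (StandardScaler(), RobustScaler(), MinMaxScaler()):
    scaled = scaler.fit_transform(salaries)
    print(type(scaler).__name__, np.round(scaled.ravel(), 2))

# How far apart the four ordinary salaries end up after scaling.
spread_standard = np.ptp(StandardScaler().fit_transform(salaries)[:4])
spread_robust = np.ptp(RobustScaler().fit_transform(salaries)[:4])
```

With StandardScaler and MinMaxScaler the outlier stretches the scale, so the ordinary salaries are squeezed into a very narrow band. RobustScaler keeps them well separated because the median and IQR barely react to the extreme value.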
4

Categorical features

Encoding | Use when | Main trap
Ordinal | Real order: low < medium < high. | Fake distance/order on nominal categories.
One-hot | Nominal, manageable cardinality. | More columns + collinearity.
Target | High cardinality (zip code). | Uses y → leakage if fitted outside CV.
  • Zip code / product ID = category, even when stored as a number.
  • Binary Yes/No: one dummy column is enough.
  • Use OneHotEncoder (not pd.get_dummies) inside pipelines: it learns the categories on the training data and produces the same columns for new data.
  • Tree models tolerate correlated dummies better than linear models.
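
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# numeric_cols / cat_cols: lists of column names in the training DataFrame
preprocess = make_column_transformer(
    (StandardScaler(), numeric_cols),
    (OneHotEncoder(handle_unknown="ignore"), cat_cols),  # unseen categories become all zeros
)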
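
The leakage trap in the Target row of the table can be avoided by computing the encoding out-of-fold. The following is a hand-rolled sketch with pandas and KFold, not a library encoder; the data, the `zipcode_te` column name, and the fallback to the global mean are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical training data: a high-cardinality category and a binary target.
train = pd.DataFrame({
    "zipcode": ["10001", "10001", "10002", "10002", "10003", "10003", "10001", "10002"],
    "y":       [1,       0,       1,       1,       0,       0,       1,       0],
})

global_mean = train["y"].mean()
encoded = pd.Series(np.nan, index=train.index, dtype=float)

# Out-of-fold encoding: each row is encoded with target means computed
# from the OTHER folds, so a row never sees its own label.
for fit_idx, enc_idx in KFold(n_splits=4, shuffle=True, random_state=0).split(train):
    fold_means = train.iloc[fit_idx].groupby("zipcode")["y"].mean()
    encoded.iloc[enc_idx] = (
        train.iloc[enc_idx]["zipcode"].map(fold_means).fillna(global_mean).to_numpy()
    )

train["zipcode_te"] = encoded
print(train)
```

Categories that do not appear in the fitting folds fall back to the global target mean, which is one common convention among several.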
5

Outliers

Ask first, then act:

Question | Action
Is it an error? | Correct, remove, or impute.
Valid and meaningful? | Keep. Use a robust scaler or log transform, or add an indicator.
Target of interest (fraud, failure)? | Do not remove; the model needs it.
Tiny and non-representative? | Remove if doing so will not bias the data.
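
For the "valid and meaningful" case, one possible way to keep extreme values is to flag them and compress the scale instead of deleting rows. The amounts and the new column names below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical transaction amounts with one large but genuine value.
df = pd.DataFrame({"amount": [12.0, 15.0, 9.0, 20.0, 14.0, 5_000.0, 11.0, 18.0]})

# Flag values above the upper IQR fence instead of deleting them.
q1, q3 = df["amount"].quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
df["amount_is_extreme"] = (df["amount"] > upper_fence).astype(int)

# log1p compresses the long right tail while keeping the ordering.
df["amount_log"] = np.log1p(df["amount"])

print(df)
```

All eight rows survive; the model gets both a tamed numeric feature and an explicit flag it can use if extremeness is informative.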

Detection:

  • Z-score: |z| > 3. Assumes roughly Gaussian data.
  • IQR: outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR]. No distribution assumption.
  • Model-based: IsolationForest, OneClassSVM, LocalOutlierFactor, robust covariance. Best for high-dimensional or non-Gaussian data.
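
The first two detection rules are short enough to write out directly. This is a minimal NumPy sketch on made-up measurements; the thresholds are the ones quoted in the list above.

```python
import numpy as np

# Hypothetical measurements with one suspicious value.
values = np.array(
    [48, 50, 52, 49, 51, 50, 47, 53, 50, 49, 51, 52, 48, 50, 120], dtype=float
)

# Z-score rule: distance from the mean in standard deviations.
z = (values - values.mean()) / values.std()
z_outliers = values[np.abs(z) > 3]

# IQR rule: outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

print(z_outliers, iqr_outliers)
```

Both rules flag the same value here. In small samples a single extreme value inflates the standard deviation, so the z-score rule can miss outliers that the IQR rule still catches.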
6

Null values

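Type | Meaning | Action
MCAR | Missing completely at random: missingness is unrelated to any variable. | Drop rows if few; impute if many.
MAR | Missing at random: missingness depends on observed variables. | Impute with a multivariate method (KNNImputer / IterativeImputer).
MNAR | Missing not at random: missingness depends on the unobserved value, so it is itself signal. | Add MissingIndicator + impute.

Situation | Best start
Numerical missing | Median (robust to outliers).
Categorical missing | Most frequent, or a constant "Missing" category.
Missingness meaningful | add_indicator=True.
Need relationships between features | KNNImputer / IterativeImputer.
> 50 % missing, no meaning | Drop column.

from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
cat_pipe = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)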
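
What `add_indicator=True` and KNNImputer actually produce can be checked on a tiny array. The feature meanings (age, income) and the choice of `n_neighbors=2` are assumptions for this sketch.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical numeric features: [age, income], with gaps.
X = np.array([
    [25.0, 40_000.0],
    [np.nan, 42_000.0],
    [47.0, np.nan],
    [51.0, 90_000.0],
    [30.0, 45_000.0],
])

# Median imputation plus one 0/1 "was missing" column per feature with gaps.
median_imputer = SimpleImputer(strategy="median", add_indicator=True)
X_median = median_imputer.fit_transform(X)

# KNN imputation fills each gap from the rows that are closest
# on the features that are observed.
knn_imputer = KNNImputer(n_neighbors=2)
X_knn = knn_imputer.fit_transform(X)

print(X_median.shape, X_knn.shape)
```

The indicator columns are appended after the imputed features, so the output is wider than the input. KNNImputer measures distance on raw values, so scaling the features first is usually sensible. IterativeImputer is not shown because, at the time of writing, it is still marked experimental and needs `from sklearn.experimental import enable_iterative_imputer` before it can be imported.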
7

Other preprocessing

  • Skewness — Log / Box-Cox on long right tails (prices, income, time).
  • Binning — Group levels to reduce cardinality.
  • Discretisation — Continuous → categories via KBinsDiscretizer.
  • Typing — Dates as datetime64, IDs as category, currency strings → numeric.
  • 0 ≠ missing. Always check the domain meaning before treating zeros.
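
The typing, skewness, and discretisation items can be sketched together on a small raw export. All column names and values are hypothetical, and equal-width bins are used for KBinsDiscretizer only to keep the result easy to follow.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical raw export where dates and prices arrived as strings.
raw = pd.DataFrame({
    "signup": ["2024-01-05", "2024-02-17", "2024-03-09", "2024-03-30"],
    "customer_id": [1001, 1002, 1003, 1004],
    "price": ["$1,200.50", "$15.00", "$89.99", "$430.00"],
    "age": [23, 41, 35, 67],
})

df = raw.copy()
df["signup"] = pd.to_datetime(df["signup"])                # dates as datetime64
df["customer_id"] = df["customer_id"].astype("category")   # IDs are labels, not numbers
# Currency strings to numeric: strip "$" and thousands separators first.
df["price"] = df["price"].str.replace(r"[$,]", "", regex=True).astype(float)

# log1p tames the right tail and is defined at zero.
df["price_log"] = np.log1p(df["price"])

# Continuous age to 3 equal-width ordinal bins.
binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
df["age_bin"] = binner.fit_transform(df[["age"]]).ravel()

print(df.dtypes)
print(df[["age", "age_bin"]])
```

In a real workflow the discretiser would be fitted on training data only, like any other transformer.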
8

Pipeline template

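from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features     = ["age", "salary"]
categorical_features = ["boro", "zipcode"]

# Numeric: fill gaps with the median, then standardise
num_pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
# Categorical: fill gaps with the most frequent value, then one-hot encode
cat_pipe = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)

preprocess = make_column_transformer(
    (num_pipe, numeric_features),
    (cat_pipe, categorical_features),
)

pipe = make_pipeline(preprocess, LogisticRegression(max_iter=1000))

# Cross-validate the WHOLE pipeline, on the training split only
scores = cross_val_score(pipe, X_train, y_train, cv=5)

# Final fit on all training data, then score the held-out test set once
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)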
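
To try the template end to end, the sketch below runs it on synthetic data and then predicts for a row containing a category the pipeline has never seen. The data-generating rule, the missing-value rate, and the "Staten Island" row are assumptions invented for this example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (assumption): the target depends loosely on salary.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "age": rng.integers(18, 70, size=n).astype(float),
    "salary": rng.normal(50_000, 15_000, size=n),
    "boro": rng.choice(["Bronx", "Brooklyn", "Queens"], size=n),
    "zipcode": rng.choice(["10001", "10002", "10003", "10004"], size=n),
})
y = (X["salary"] + rng.normal(0, 10_000, size=n) > 50_000).astype(int)

# Knock out some ages so the median imputer has work to do.
X.loc[X.sample(frac=0.1, random_state=0).index, "age"] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

num_pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
cat_pipe = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)
preprocess = make_column_transformer(
    (num_pipe, ["age", "salary"]),
    (cat_pipe, ["boro", "zipcode"]),
)
pipe = make_pipeline(preprocess, LogisticRegression(max_iter=1000))

cv_scores = cross_val_score(pipe, X_train, y_train, cv=5)
pipe.fit(X_train, y_train)
test_score = pipe.score(X_test, y_test)

# A new row with a missing age and a borough never seen during training.
new_row = pd.DataFrame({
    "age": [np.nan], "salary": [72_000.0],
    "boro": ["Staten Island"], "zipcode": ["10001"],
})
prediction = pipe.predict(new_row)
print(round(cv_scores.mean(), 3), round(test_score, 3), prediction)
```

The last step works without special handling because imputation and `handle_unknown="ignore"` live inside the pipeline, so the same preprocessing learned on the training split is applied to every new row.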