Maria Aguilera

Kaggle binary classification: predict whether a passenger was transported to another dimension after a collision in the Spaceship Titanic.

~8,700 train rows, ~4,300 test. Features include passenger demographics, cabin, group bookings, and spend at on-board services.

A standard impute-encode-train task on the surface. The interesting layer is what the IDs actually mean.

Two hidden structures:

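PassengerId = "gggg_pp" (gggg = travel group, pp = passenger's number within the group)

Cabin = "Deck/Num/Side" (Side is P for port or S for starboard)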

Both are categorical structures encoded as strings. The model can't see them unless you split them out.
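A minimal pandas sketch of the split, on made-up rows in the competition's column format. The output names (group_id, group_pos, group_size, deck, cabin_num, side) are illustrative choices, not names taken from the original notebook.

```python
import pandas as pd

# Toy rows in the competition's column format; the values are made up.
df = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01", "0002_02"],
    "Cabin": ["B/0/P", "F/12/S", None],
})

# "gggg_pp" -> group id and position within the group.
df[["group_id", "group_pos"]] = df["PassengerId"].str.split("_", expand=True)

# Group size is a cheap derived feature once group_id exists.
df["group_size"] = df.groupby("group_id")["PassengerId"].transform("count")

# "Deck/Num/Side" -> three columns; a missing Cabin stays missing in all three.
df[["deck", "cabin_num", "side"]] = df["Cabin"].str.split("/", expand=True)
df["cabin_num"] = pd.to_numeric(df["cabin_num"])

print(df[["group_id", "group_size", "deck", "cabin_num", "side"]])
```

Keeping group_id as a string preserves the leading zeros, so it still matches the original IDs when used as a join or groupby key.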

Once group_id is exposed:

Same group → same Cabin / HomePlanet / Destination. If one passenger's value is known, fill the rest of the group.
Same group → similar CryoSleep status. Family decisions tend to align.
CryoSleep = True → all spend columns are 0. People in cryo can't shop.
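A sketch of the two hard rules above in pandas, on made-up rows. The helper name fill_from_group is hypothetical, and only two spend columns are shown. The CryoSleep-by-group rule is left out because it is a tendency, not an invariant, so how to apply it is a judgment call.

```python
import pandas as pd

# Toy rows; the values are made up for illustration.
df = pd.DataFrame({
    "group_id":    ["0002", "0002", "0003", "0003"],
    "HomePlanet":  ["Europa", None, None, None],
    "CryoSleep":   [True, True, False, False],
    "RoomService": [None, 0.0, None, 25.0],
    "Spa":         [None, None, 10.0, None],
})

def fill_from_group(frame, col, key="group_id"):
    """Fill missing values in `col` with the first known value in the same group."""
    known = frame.groupby(key)[col].transform("first")  # first non-null per group
    return frame[col].fillna(known)

# Rule 1: copy a known value to group members that are missing it.
# Groups with no known value stay missing; nothing is invented.
df["HomePlanet"] = fill_from_group(df, "HomePlanet")

# Rule 2: passengers in cryosleep cannot spend, so their missing spend is 0.
spend_cols = ["RoomService", "Spa"]
in_cryo = df["CryoSleep"].eq(True)  # a missing CryoSleep counts as "not known to be in cryo"
df.loc[in_cryo, spend_cols] = df.loc[in_cryo, spend_cols].fillna(0.0)

print(df)
```

Group 0003 has no known HomePlanet, so it stays missing and falls through to whatever fallback imputation runs afterwards.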

Replacing mean / median imputation with structural rules filled most of the missing values without inventing data.

Each rule came from a hypothesis: what would make a group behave the same way?

Ten models, same pipeline, same CV scheme:

Stratified 5-fold CV, accuracy + F1 reported.
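A sketch of that evaluation loop with scikit-learn, assuming a shared StratifiedKFold splitter and cross_validate. The data is synthetic, and the two models stand in for the ten; they are illustrative picks, not the original line-up.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for the engineered feature matrix (not competition data).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# One splitter shared by every model, so all scores come from identical folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Placeholder candidates; swap in the real model list here.
models = {
    "logreg": LogisticRegression(max_iter=1000),
    "gboost": GradientBoostingClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=["accuracy", "f1"])
    results[name] = {
        "accuracy": scores["test_accuracy"].mean(),
        "f1": scores["test_f1"].mean(),
    }

for name, r in results.items():
    print(f"{name}: accuracy={r['accuracy']:.3f}  f1={r['f1']:.3f}")
```

Fixing random_state on the splitter is what makes the comparison fair: each model is scored on exactly the same five train/validation partitions.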

Winner: gradient boosting variants came out on top, with an ensemble scoring slightly above any single model. The margin was a few tenths of a percent.

Read the data schema first. Anything stored as a compound string (A/137/P, 0007_02) is begging to be split.
Structural imputation > statistical imputation. Use relationships before you reach for the mean.
Bake-offs are useful, but the gap between #1 and #5 is usually smaller than the gap between bad and good features.
Stratified CV matters when classes are imbalanced, even slightly.

Spaceship Titanic — Cheat Sheet