The task
Kaggle binary classification: predict whether a passenger was transported to another dimension after a collision in the Spaceship Titanic.
~8,700 train rows, ~4,300 test. Features include passenger demographics, cabin, group bookings, and spend at on-board services.
A standard impute-encode-train task on the surface. The interesting layer is what the IDs actually mean.
The schema reading
Two hidden structures:
PassengerId = "gggg_pp"
- gggg = group / travel-party number.
- pp = position within the group.
- People with the same gggg are travelling together, often (but not always) as family.
Cabin = "Deck/Num/Side"
- Deck = letter A–G, plus a rare T.
- Num = cabin number.
- Side = P (port) or S (starboard).
Both are categorical structures encoded as strings. The model can't see them unless you split them out.
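A minimal sketch of that split with pandas, assuming the ID column is named `PassengerId` as in the Kaggle CSV. The helper name `split_ids` and the `group_pos` column are illustrative, not taken from the original notebook.

```python
import pandas as pd


def split_ids(df: pd.DataFrame) -> pd.DataFrame:
    """Expose the structure hidden in PassengerId and Cabin (hypothetical helper)."""
    out = df.copy()

    # "gggg_pp" -> group number and position within the group.
    pid = out["PassengerId"].str.extract(r"^(?P<group_id>\d+)_(?P<group_pos>\d+)$")
    out["group_id"] = pid["group_id"]  # kept as a string so leading zeros survive
    out["group_pos"] = pd.to_numeric(pid["group_pos"])

    # "Deck/Num/Side" -> three columns; rows with a missing Cabin stay missing.
    cabin = out["Cabin"].str.extract(
        r"^(?P<cabin_deck>[^/]+)/(?P<cabin_num>\d+)/(?P<cabin_side>[^/]+)$"
    )
    out["cabin_deck"] = cabin["cabin_deck"]
    out["cabin_num"] = pd.to_numeric(cabin["cabin_num"])
    out["cabin_side"] = cabin["cabin_side"]
    return out


# Tiny made-up frame, only to show the shape of the result.
demo = pd.DataFrame(
    {
        "PassengerId": ["0007_01", "0007_02", "0012_01"],
        "Cabin": ["A/137/P", None, "G/3/S"],
    }
)
parsed = split_ids(demo)
print(parsed[["group_id", "group_pos", "cabin_deck", "cabin_num", "cabin_side"]])
```

Using `str.extract` with named groups rather than `str.split` means a malformed or missing value yields missing fields instead of shifted columns.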
Relational imputation
Once group_id is exposed:
- Same group → same Cabin / HomePlanet / Destination. If one passenger's value is known, fill the rest of the group.
- Same group → similar CryoSleep status. Family decisions tend to align.
- Spend columns when CryoSleep = True → 0. People in cryo can't shop.
Replacing mean / median imputation with structural rules plugged the bulk of the missingness without inventing data.
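The two hard rules above could be sketched roughly as follows. This is an illustration, not the original pipeline. The function name `impute_structural` is hypothetical. The sketch also adds one assumption of its own: it only copies a group value when every known value in the group agrees. The softer CryoSleep alignment rule is left out.

```python
import numpy as np
import pandas as pd


def impute_structural(df, group_cols, spend_cols):
    """Fill gaps from group membership and CryoSleep (hypothetical helper)."""
    out = df.copy()
    group_id = out["PassengerId"].str.split("_").str[0]

    for col in group_cols:
        per_group = out[col].groupby(group_id)
        first_known = per_group.first()       # first non-missing value per group
        unanimous = per_group.nunique() == 1  # nunique ignores missing values
        # Assumption: only fill when the group's known values all agree.
        fill_values = group_id.map(first_known[unanimous])
        out[col] = out[col].fillna(fill_values)

    # Passengers in cryosleep cannot spend, so their missing spend is 0.
    in_cryo = out["CryoSleep"].eq(True)
    for col in spend_cols:
        out.loc[in_cryo & out[col].isna(), col] = 0.0
    return out


# Made-up rows: group 0001 agrees, group 0002 has nothing known, group 0003 disagrees.
demo = pd.DataFrame(
    {
        "PassengerId": ["0001_01", "0001_02", "0002_01", "0003_01", "0003_02", "0003_03"],
        "HomePlanet": ["Europa", np.nan, np.nan, "Earth", "Mars", np.nan],
        "CryoSleep": [True, False, np.nan, False, False, True],
        "RoomService": [np.nan, 10.0, np.nan, 5.0, 0.0, np.nan],
    }
)
filled = impute_structural(demo, group_cols=["HomePlanet"], spend_cols=["RoomService"])
print(filled)
```

Values that no rule covers (a solo passenger, or a group whose members disagree) stay missing, so a statistical fallback is still needed for the remainder.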
Feature engineering moves
- group_size: count of passengers per group_id.
- is_solo: binary, group_size == 1.
- cabin_deck, cabin_num, cabin_side: split out from raw Cabin.
- total_spend: sum of RoomService + FoodCourt + ShoppingMall + Spa + VRDeck.
- spent_anything: binary, total_spend > 0.
Each one came from a hypothesis: what would make a group behave the same way?
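As a rough sketch, the group and spend features might be derived like this. The helper name `add_features` is hypothetical, and treating a missing spend value as 0 inside the sum is an assumption of this example, not something the write-up specifies.

```python
import numpy as np
import pandas as pd

SPEND_COLS = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add group-size and spend features (hypothetical helper)."""
    out = df.copy()

    # Group number is the part of PassengerId before the underscore.
    group_id = out["PassengerId"].str.split("_").str[0]
    out["group_size"] = group_id.map(group_id.value_counts())
    out["is_solo"] = (out["group_size"] == 1).astype(int)

    # DataFrame.sum skips missing values by default, so NaN counts as 0 here.
    out["total_spend"] = out[SPEND_COLS].sum(axis=1)
    out["spent_anything"] = (out["total_spend"] > 0).astype(int)
    return out


# Made-up rows for illustration only.
demo = pd.DataFrame(
    {
        "PassengerId": ["0001_01", "0001_02", "0002_01"],
        "RoomService": [0.0, 10.0, np.nan],
        "FoodCourt": [0.0, 5.0, 0.0],
        "ShoppingMall": [0.0, 0.0, 0.0],
        "Spa": [0.0, 0.0, 0.0],
        "VRDeck": [0.0, 2.5, 0.0],
    }
)
features = add_features(demo)
print(features[["group_size", "is_solo", "total_spend", "spent_anything"]])
```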
The bake-off
Ten models, same pipeline, same CV scheme:
- Logistic Regression
- KNN
- Naïve Bayes
- Decision Tree
- Random Forest
- Gradient Boosting (sklearn)
- XGBoost
- LightGBM
- CatBoost
- Stacked ensemble
Stratified 5-fold CV, accuracy + F1 reported.
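The evaluation loop can be sketched with scikit-learn as below. To stay self-contained it runs on synthetic data from `make_classification` and covers only three of the ten models, so the scores it prints say nothing about the competition results. The model settings are placeholders, not the ones used in the bake-off.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for the engineered feature matrix (assumption).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# One shared splitter so every model sees exactly the same folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=["accuracy", "f1"])
    results[name] = {
        "accuracy": scores["test_accuracy"].mean(),
        "f1": scores["test_f1"].mean(),
    }

for name, metrics in results.items():
    print(f"{name}: accuracy={metrics['accuracy']:.3f} f1={metrics['f1']:.3f}")
```

Fixing `random_state` on the splitter is what makes "same CV scheme" hold in practice: differences between rows of the table then come from the models, not from the fold assignment.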
Winner: the gradient-boosting variants came out on top, with the stacked ensemble slightly above each of them individually. The margin was a few tenths of a percent.
What I learned
- Read the data schema first. Anything stored as a compound string (A/137/P, 0007_02) is begging to be split.
- Structural imputation > statistical imputation. Use relationships before you reach for the mean.
- Bake-offs are useful, but the gap between #1 and #5 is usually smaller than the gap between bad and good features.
- Stratified CV matters when classes are imbalanced, even slightly.