Forest Cover Type — Cheat Sheet

A practical walkthrough of feature interpretation and sanity checks on the Kaggle Forest Cover Type Prediction dataset. Read the data before you fit the model.

Updated April 2024

1

The task

Kaggle multi-class classification: predict one of 7 forest cover types from cartographic features. The Kaggle training set has 15,120 labeled rows; the full dataset it was sampled from has ~580k. 54 features.

Cover types:

  1. Spruce/Fir, 2. Lodgepole Pine, 3. Ponderosa Pine, 4. Cottonwood/Willow, 5. Aspen, 6. Douglas-fir, 7. Krummholz.

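The Kaggle training set is sampled to be balanced, but in nature the classes are far from equal: types 1 and 2 dominate, and types such as 4 and 6 are minorities. Check which evaluation metric the leaderboard uses before trusting a score from the balanced sample.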

2

The features

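| Group       | Features                                                                    | Meaning                           |
|-------------|-----------------------------------------------------------------------------|-----------------------------------|
| Topographic | Elevation, Aspect, Slope                                                    | Where the patch sits.             |
| Distance    | Horizontal/Vertical_Distance_To_Hydrology, _To_Roadways, _To_Fire_Points    | How far from water, roads, fires. |
| Hillshade   | Hillshade_9am, _Noon, _3pm                                                  | Light cast on the slope.          |
| Wilderness  | 4 binary columns                                                            | Which wilderness area.            |
| Soil type   | 40 binary columns                                                           | Specific soil classification.     |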

54 features in total: 10 continuous, 4 wilderness indicators, and 40 one-hot soil types, most of which are rare.
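
To see how rare the soil indicators actually are, the columns can be grouped by name prefix and each indicator's frequency computed. This is a minimal sketch: it assumes the Kaggle column naming (`Soil_Type*`, `Wilderness_Area*`, `Id`, `Cover_Type`), and the helper names `group_columns` and `soil_frequencies` plus the toy frame are made up for illustration.

```python
import pandas as pd


def group_columns(df: pd.DataFrame) -> dict:
    """Split columns into soil, wilderness, and continuous groups by name prefix."""
    soil = [c for c in df.columns if c.startswith("Soil_Type")]
    wilderness = [c for c in df.columns if c.startswith("Wilderness_Area")]
    # Assumption: "Id" and "Cover_Type" are the only non-feature columns.
    skip = set(soil) | set(wilderness) | {"Id", "Cover_Type"}
    continuous = [c for c in df.columns if c not in skip]
    return {"soil": soil, "wilderness": wilderness, "continuous": continuous}


def soil_frequencies(df: pd.DataFrame, soil_cols: list) -> pd.Series:
    """Fraction of rows in which each soil indicator is set, rarest first."""
    return df[soil_cols].mean().sort_values()


# Tiny made-up frame standing in for train.csv.
toy = pd.DataFrame({
    "Elevation": [2500, 3100, 2900, 3300],
    "Soil_Type1": [1, 0, 0, 0],
    "Soil_Type2": [0, 1, 1, 1],
    "Wilderness_Area1": [1, 1, 0, 0],
    "Wilderness_Area2": [0, 0, 1, 1],
    "Cover_Type": [1, 2, 2, 7],
})

groups = group_columns(toy)
freq = soil_frequencies(toy, groups["soil"])
print(freq)
```

On the real data, the head of `freq` lists the soil types worth a second look before they go into a model.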

3

Sanity checks

Before fitting anything, walk through every feature:

  • Elevation range? Should match Rocky Mountain ranges (~1800–4000 m). Anything outside is suspect.
  • Aspect is 0–360°. It's cyclical — encode sin/cos for linear models.
  • Hillshade values are 0–255 (an 8-bit illumination index, like pixel intensities). The three columns are often strongly correlated with one another, so look at their correlation matrix.
  • Soil type columns sum to 1? Each row should belong to exactly one soil class.
  • Wilderness area columns sum to 1? Same.
  • Are any features constant? Drop them.
  • Class balance? The dataset is sampled to be balanced — don't assume nature is.
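
The checklist above can be sketched as one function that returns a small report. This assumes the Kaggle column names and uses the ranges quoted in the list; the function name `sanity_report` and the toy frame (which contains deliberate violations) are illustrative, not part of the original project.

```python
import pandas as pd


def sanity_report(df: pd.DataFrame, target: str = "Cover_Type") -> dict:
    """Count violations of the basic expectations for the cover type data."""
    soil = [c for c in df.columns if c.startswith("Soil_Type")]
    wild = [c for c in df.columns if c.startswith("Wilderness_Area")]
    shade = [c for c in df.columns if c.startswith("Hillshade")]
    features = df.drop(columns=[target])
    return {
        # Rows outside the rough 1800-4000 m band.
        "elevation_out_of_range": int((~df["Elevation"].between(1800, 4000)).sum()),
        # Aspect is a compass bearing in degrees.
        "aspect_out_of_range": int((~df["Aspect"].between(0, 360)).sum()),
        # Rows where any hillshade value leaves 0-255.
        "hillshade_out_of_range": int(((df[shade] < 0) | (df[shade] > 255)).any(axis=1).sum()),
        # One-hot groups: every row should have exactly one indicator set.
        "soil_not_one_hot": int((df[soil].sum(axis=1) != 1).sum()),
        "wilderness_not_one_hot": int((df[wild].sum(axis=1) != 1).sum()),
        # Columns with a single distinct value carry no signal.
        "constant_columns": [c for c in features.columns if features[c].nunique() <= 1],
        "class_counts": df[target].value_counts().sort_index().to_dict(),
    }


# Made-up rows: the third one breaks the elevation, hillshade, and soil rules.
toy = pd.DataFrame({
    "Elevation": [2500, 3100, 4500],
    "Aspect": [10, 350, 180],
    "Hillshade_9am": [220, 200, 260],
    "Soil_Type1": [1, 0, 1],
    "Soil_Type2": [0, 1, 1],
    "Wilderness_Area1": [1, 1, 1],
    "Wilderness_Area2": [0, 0, 0],
    "Cover_Type": [1, 2, 2],
})

report = sanity_report(toy)
for check, result in report.items():
    print(f"{check}: {result}")
```

A constant column on the training sample is not necessarily constant on the test set, so it is worth running the same report on both before dropping anything.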

4

Feature engineering

Cartographic features have natural combinations:

  • Hydrology_Euclidean = sqrt(Horiz² + Vert²) — true distance to water.
  • Elevation × Distance_To_Hydrology — high+far = different ecosystem.
  • Aspect → sin/cos to handle circularity.
  • Bin elevation into ecological zones (montane, sub-alpine, alpine).
  • Collapse rare soil types into a "soil_other" bucket to reduce dimensionality.

Each one is a hypothesis about ecology, not just a transformation.
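
A rough sketch of these five transformations in pandas follows. Several details are assumptions rather than the author's choices: the interaction uses the Euclidean distance, the elevation cut points (2700 m and 3400 m) are placeholders and not ecological thresholds, and the 1% rarity threshold, the function name `engineer`, and the output column names other than Hydrology_Euclidean are invented for the example.

```python
import numpy as np
import pandas as pd


def engineer(df: pd.DataFrame, rare_threshold: float = 0.01) -> pd.DataFrame:
    """Add the engineered features described above to a copy of df."""
    out = df.copy()

    # Straight-line distance to water from its horizontal and vertical parts.
    out["Hydrology_Euclidean"] = np.hypot(
        out["Horizontal_Distance_To_Hydrology"],
        out["Vertical_Distance_To_Hydrology"],
    )

    # Interaction term (assumption: Euclidean distance is the one to use).
    out["Elev_x_Hydro"] = out["Elevation"] * out["Hydrology_Euclidean"]

    # Aspect is circular: 359 degrees and 1 degree should end up close.
    radians = np.deg2rad(out["Aspect"])
    out["Aspect_sin"] = np.sin(radians)
    out["Aspect_cos"] = np.cos(radians)

    # Placeholder cut points; replace with zones that fit the study area.
    out["Elev_zone"] = pd.cut(
        out["Elevation"],
        bins=[-np.inf, 2700, 3400, np.inf],
        labels=["montane", "sub-alpine", "alpine"],
    )

    # Collapse soil indicators that fire in fewer than rare_threshold of rows.
    soil = [c for c in out.columns if c.startswith("Soil_Type")]
    rare = [c for c in soil if out[c].mean() < rare_threshold]
    if rare:
        out["soil_other"] = out[rare].max(axis=1)
        out = out.drop(columns=rare)
    return out


# Made-up rows; the high threshold forces a collapse on this tiny sample.
toy = pd.DataFrame({
    "Elevation": [2500, 3000, 3600, 2800],
    "Aspect": [0, 90, 180, 359],
    "Horizontal_Distance_To_Hydrology": [3, 0, 30, 6],
    "Vertical_Distance_To_Hydrology": [4, 0, -40, 8],
    "Soil_Type1": [1, 1, 1, 0],
    "Soil_Type2": [0, 0, 0, 1],
})

features = engineer(toy, rare_threshold=0.5)
print(features.columns.tolist())
```

Computing the rare-soil list on the training data and reusing it for the test set keeps the two feature matrices aligned.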

5

The model bake-off

Tree-based models dominate this task because:

  • Many binary features (soil, wilderness) → trees handle them natively.
  • No scaling needed.
  • Non-linear interactions between elevation × hydrology distance × soil type.

Tried:

  • Random Forest
  • Gradient Boosting (sklearn)
  • XGBoost
  • LightGBM

LightGBM typically scores highest on the leaderboard once num_leaves and min_child_samples are tuned. Random Forest is the most stable baseline.
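
One way to run such a comparison is to push every candidate through the same stratified cross-validation split. The sketch below is not the author's setup: it uses a synthetic stand-in dataset from scikit-learn and only the two scikit-learn models, and the fold count and estimator settings are arbitrary. XGBoost and LightGBM ship scikit-learn-compatible classifiers that could be added to the same dictionary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data; swap in the real feature matrix and labels.
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=6, n_classes=3, random_state=0
)

candidates = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
    # e.g. "lightgbm": lightgbm.LGBMClassifier(num_leaves=..., min_child_samples=...)
}

# One shared split so every model is scored on identical folds.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
    for name, model in candidates.items()
}

for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.3f}")
```

Fixing the split and the random seeds keeps differences between models from being confused with differences between folds.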

6

What I learned

  • Cartographic features deserve domain reading. "Aspect" isn't abstract — it's the direction the slope faces, and it's cyclical.
  • One-hot columns that sum to 1 are a sanity check, not just a sparse format.
  • Tree models don't need scaling, but they still benefit from engineered features (Euclidean distance to hydrology > horizontal + vertical alone).
  • Class balance in the Kaggle sample ≠ class balance in reality — your CV scores can mislead in deployment.
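
The last point can be made concrete by re-weighting per-class recall with a different set of class priors. This is a sketch under the assumption that only the class proportions change between the sample and deployment (the features within each class look the same); the labels and priors below are made up, and the helper names are illustrative.

```python
from collections import defaultdict


def per_class_recall(y_true, y_pred):
    """Fraction of each true class that was predicted correctly."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        hits[truth] += int(truth == pred)
    return {label: hits[label] / totals[label] for label in totals}


def expected_accuracy(recall, priors):
    """Accuracy expected if classes occur with the given (unnormalised) priors."""
    total = sum(priors.values())
    return sum(recall[label] * weight / total for label, weight in priors.items())


# Made-up predictions on a balanced two-class sample.
y_true = [1, 1, 1, 1, 2, 2, 2, 2]
y_pred = [1, 1, 2, 2, 2, 2, 2, 2]

recall = per_class_recall(y_true, y_pred)          # class 1: 0.5, class 2: 1.0
balanced = expected_accuracy(recall, {1: 0.5, 2: 0.5})
skewed = expected_accuracy(recall, {1: 0.9, 2: 0.1})  # hypothetical field priors

print(f"balanced sample: {balanced:.2f}")  # 0.75
print(f"skewed priors:   {skewed:.2f}")    # 0.55
```

The same model drops from 0.75 to 0.55 simply because the class it handles poorly becomes the common one.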