| Group | Features | Meaning |
|---|---|---|
| Topographic | Elevation, Aspect, Slope | Where the patch sits. |
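| Distance | Horizontal_Distance_To_Hydrology, Vertical_Distance_To_Hydrology, Horizontal_Distance_To_Roadways, Horizontal_Distance_To_Fire_Points | How far from water, roads, and wildfire ignition points. |
| Hillshade | Hillshade_9am, Hillshade_Noon, Hillshade_3pm | Light cast on the slope at three times of day (0–255 index). |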
| Wilderness | 4 binary columns | Which wilderness area. |
| Soil type | 40 binary columns | Specific soil classification. |
The task
Kaggle multi-class classification: predict one of 7 forest cover types from cartographic features. The full Covertype data has ~580k samples and 54 features; the Kaggle training split is a much smaller, class-balanced sample of it.
Cover types:
1. Spruce/Fir
2. Lodgepole Pine
3. Ponderosa Pine
4. Cottonwood/Willow
5. Aspen
6. Douglas-fir
7. Krummholz
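The dataset is balanced after sampling, but in nature, classes 4 and 6 are minorities. Check which metric the leaderboard scores on: accuracy on a balanced sample weights every class equally, which the natural distribution does not.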
The features
54 features. Forty are one-hot soil types — most are rare.
Sanity checks
Before fitting anything, walk through every feature:
- Elevation range? Should match Rocky Mountain ranges (~1800–4000 m). Anything outside is suspect.
- Aspect is 0–360°. It's cyclical, so encode sin/cos for linear models.
- Hillshade values 0–255. They're image-like; sometimes one of the three is highly correlated with the others.
- Soil type columns sum to 1? Each row should belong to exactly one soil class.
- Wilderness area columns sum to 1? Same.
- Are any features constant? Drop them.
- Class balance? The dataset is sampled to be balanced — don't assume nature is.
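
The checks above can be sketched as a single pandas function. This is a minimal illustration, not the original notebook code: the function name `sanity_report` is hypothetical, and it assumes the columns follow the Kaggle naming pattern (`Soil_Type*`, `Wilderness_Area*`, `Hillshade_*`, `Cover_Type`). The toy frame is made-up data with deliberate faults.

```python
import pandas as pd


def sanity_report(df: pd.DataFrame) -> dict:
    """Count violations of each sanity check (hypothetical helper)."""
    # Assumed column prefixes; adjust if the file uses different names.
    soil_cols = [c for c in df.columns if c.startswith("Soil_Type")]
    wild_cols = [c for c in df.columns if c.startswith("Wilderness_Area")]
    shade = df[[c for c in df.columns if c.startswith("Hillshade_")]]

    return {
        # Rows outside the plausible elevation range in metres.
        "elevation_out_of_range": int((~df["Elevation"].between(1800, 4000)).sum()),
        "aspect_out_of_range": int((~df["Aspect"].between(0, 360)).sum()),
        # Rows where any hillshade value leaves the 0-255 range.
        "hillshade_out_of_range": int(((shade < 0) | (shade > 255)).any(axis=1).sum()),
        # One-hot groups: every row should sum to exactly 1.
        "soil_not_one_hot": int((df[soil_cols].sum(axis=1) != 1).sum()),
        "wilderness_not_one_hot": int((df[wild_cols].sum(axis=1) != 1).sum()),
        # Columns with a single distinct value carry no signal.
        "constant_columns": [c for c in df.columns if df[c].nunique() <= 1],
        "class_counts": df["Cover_Type"].value_counts().sort_index().to_dict(),
    }


# Made-up rows with one bad elevation, one bad hillshade, two bad soil rows.
toy = pd.DataFrame({
    "Elevation": [2500, 3100, 4500, 2900],
    "Aspect": [10, 350, 180, 90],
    "Hillshade_9am": [220, 200, 180, 255],
    "Hillshade_Noon": [230, 240, 250, 210],
    "Hillshade_3pm": [140, 150, 300, 120],
    "Wilderness_Area1": [1, 0, 1, 0],
    "Wilderness_Area2": [0, 1, 0, 1],
    "Soil_Type1": [1, 0, 1, 0],
    "Soil_Type2": [0, 1, 1, 0],
    "Soil_Type3": [0, 0, 0, 0],
    "Cover_Type": [1, 2, 2, 7],
})
report = sanity_report(toy)
for check, result in report.items():
    print(f"{check}: {result}")
```

Returning counts rather than raising on the first failure makes it easier to see every problem in one pass before deciding what to drop or repair.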
Feature engineering
Cartographic features have natural combinations:
- Hydrology_Euclidean = sqrt(Horiz² + Vert²): true distance to water.
- Elevation × Distance_To_Hydrology: high + far = different ecosystem.
- Aspect → sin/cos to handle circularity.
- Bin elevation into ecological zones (montane, sub-alpine, alpine).
- Collapse rare soil types into a "soil_other" bucket to reduce dimensionality.
Each one is a hypothesis about ecology, not just a transformation.
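
A rough sketch of these transformations in pandas follows. Several details are assumptions made for illustration rather than values from the original work: the helper name `add_features`, the zone cut points (2700 m and 3400 m), the `rare_threshold` default, the column name `Elev_x_Hydro`, and the choice to multiply elevation by the Euclidean distance rather than the horizontal one.

```python
import numpy as np
import pandas as pd


def add_features(df: pd.DataFrame, rare_threshold: int = 2) -> pd.DataFrame:
    """Return a copy of df with the engineered columns (hypothetical helper)."""
    out = df.copy()

    # Straight-line distance to water from its two components.
    out["Hydrology_Euclidean"] = np.hypot(
        out["Horizontal_Distance_To_Hydrology"],
        out["Vertical_Distance_To_Hydrology"],
    )
    # Interaction term: high and far from water vs. low and near.
    out["Elev_x_Hydro"] = out["Elevation"] * out["Hydrology_Euclidean"]

    # Aspect is an angle, so 359 and 1 degrees should end up close together.
    radians = np.deg2rad(out["Aspect"])
    out["Aspect_sin"] = np.sin(radians)
    out["Aspect_cos"] = np.cos(radians)

    # Placeholder cut points in metres; real zone boundaries need domain input.
    out["Elevation_zone"] = pd.cut(
        out["Elevation"],
        bins=[-np.inf, 2700, 3400, np.inf],
        labels=["montane", "subalpine", "alpine"],
    )

    # Merge soil columns with fewer than rare_threshold positives into one flag.
    soil_cols = [c for c in out.columns if c.startswith("Soil_Type")]
    rare = [c for c in soil_cols if out[c].sum() < rare_threshold]
    if rare:
        out["soil_other"] = out[rare].max(axis=1)
        out = out.drop(columns=rare)
    return out


# Made-up rows for demonstration only.
toy = pd.DataFrame({
    "Elevation": [2500, 3000, 3600],
    "Aspect": [0, 90, 180],
    "Horizontal_Distance_To_Hydrology": [3, 0, 30],
    "Vertical_Distance_To_Hydrology": [4, 5, -40],
    "Soil_Type1": [1, 1, 0],
    "Soil_Type2": [0, 0, 1],
})
feats = add_features(toy)
print(feats[["Hydrology_Euclidean", "Aspect_sin", "Aspect_cos", "Elevation_zone"]])
```

On real data the rare-soil threshold should be chosen from the training split only, so the same set of collapsed columns can be applied unchanged to the test set.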
The model bake-off
Tree-based models dominate this task because:
- Many binary features (soil, wilderness) → trees handle them natively.
- No scaling needed.
- Non-linear interactions between elevation × hydrology distance × soil type.
Tried:
- Random Forest
- Gradient Boosting (sklearn)
- XGBoost
- LightGBM
LightGBM typically wins on the leaderboard once num_leaves and min_child_samples are tuned. Random Forest is the most stable baseline.
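
A bake-off like this can be run as one cross-validation loop over a dictionary of candidates. The sketch below is illustrative only: it uses synthetic data from `make_classification` as a stand-in for the real features, covers just the two scikit-learn models, and picks small, arbitrary hyperparameters so it runs quickly. XGBoost and LightGBM ship scikit-learn-compatible classifiers that could be added to the same dictionary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in with the same shape of problem: 54 features, 7 classes.
X, y = make_classification(
    n_samples=700,
    n_features=54,
    n_informative=10,
    n_classes=7,
    n_clusters_per_class=1,
    random_state=0,
)

# Arbitrary small settings, chosen for speed rather than accuracy.
candidates = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=30, random_state=0),
}

# Stratified folds keep the class proportions equal across splits.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    results[name] = (scores.mean(), scores.std())
    print(f"{name:>18}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Using the same `cv` object for every model means each candidate sees identical folds, so score differences reflect the models rather than the split.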
What I learned
- Cartographic features deserve domain reading. "Aspect" isn't abstract — it's the direction the slope faces, and it's cyclical.
- One-hot columns that sum to 1 are a sanity check, not just a sparse format.
- Tree models don't need scaling, but they still benefit from engineered features (Euclidean distance to hydrology > horizontal + vertical alone).
- Class balance in the Kaggle sample ≠ class balance in reality — your CV scores can mislead in deployment.
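
To make the last point concrete, here is a small sketch of how the same per-class recall produces different accuracy under different class frequencies. Everything in it is hypothetical: the helper `accuracy_under_priors`, the 3-class confusion matrix, and the skewed priors are invented for illustration, and it assumes per-class recall stays the same when the class mix changes.

```python
import numpy as np


def accuracy_under_priors(confusion: np.ndarray, priors: np.ndarray) -> float:
    """Expected accuracy if true classes occurred with the given priors.

    confusion[i, j] counts samples of true class i predicted as class j.
    """
    recall = np.diag(confusion) / confusion.sum(axis=1)
    priors = priors / priors.sum()  # normalise so the weights sum to 1
    return float(recall @ priors)


# Hypothetical confusion matrix from a class-balanced validation fold.
# The two common classes are confused with each other; the rare one is easy.
confusion = np.array([
    [60, 35, 5],
    [30, 65, 5],
    [2, 3, 95],
])

balanced = accuracy_under_priors(confusion, np.array([1.0, 1.0, 1.0]))
skewed = accuracy_under_priors(confusion, np.array([0.45, 0.50, 0.05]))
print(f"balanced priors: {balanced:.4f}")
print(f"skewed priors:   {skewed:.4f}")
```

In this made-up case the balanced score is flattered by the easy rare class; once the common, harder classes dominate, the expected accuracy drops.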