Maria Aguilera

Kaggle multi-class classification: predict one of 7 forest cover types from cartographic features. ~580k samples in the full dataset (the Kaggle training file is a much smaller, class-balanced sample), 54 features.

Cover types:

1. Spruce/Fir, 2. Lodgepole Pine, 3. Ponderosa Pine, 4. Cottonwood/Willow, 5. Aspen, 6. Douglas-fir, 7. Krummholz.

The training sample is balanced across classes, but in nature classes 4 and 6 are minorities. Check which evaluation metric the leaderboard uses before optimizing for it.

| Group | Features | Meaning |
| --- | --- | --- |
| Topographic | `Elevation`, `Aspect`, `Slope` | Where the patch sits. |
| Distance | `Horizontal_Distance_To_Hydrology`, `Vertical_Distance_To_Hydrology`, `Horizontal_Distance_To_Roadways`, `Horizontal_Distance_To_Fire_Points` | How far from water, roads, and wildfire ignition points. |
| Hillshade | `Hillshade_9am`, `Hillshade_Noon`, `Hillshade_3pm` | Light cast on the slope at each time of day. |
| Wilderness | 4 binary columns | Which wilderness area. |
| Soil type | 40 binary columns | Specific soil classification. |

That makes 54 features in total. Forty of them are one-hot soil types, and most of those are rare.

Before fitting anything, walk through every feature:

Elevation range? Should match Rocky Mountain ranges (~1800–4000 m). Anything outside is suspect.
Aspect is 0–360°. It's cyclical — encode sin/cos for linear models.
Hillshade values run 0–255, like 8-bit pixel intensities. One of the three is sometimes highly correlated with the others, so check for redundancy.
Soil type columns sum to 1? Each row should belong to exactly one soil class.
Wilderness area columns sum to 1? Same.
Are any features constant? Drop them.
Class balance? The dataset is sampled to be balanced — don't assume nature is.
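
The checklist above can be sketched as a single pandas function. This is a minimal sketch under stated assumptions, not a definitive implementation: the column prefixes `Soil_Type` and `Wilderness_Area`, the target name `Cover_Type`, the helper name `sanity_report`, and the toy frame are all chosen for illustration, and the 1800 to 4000 m band is simply the rough range quoted above.

```python
import pandas as pd


def sanity_report(df: pd.DataFrame) -> dict:
    """Run the pre-modelling checks and return a small summary dict."""
    # Assumed naming: one-hot columns share a common prefix.
    soil_cols = [c for c in df.columns if c.startswith("Soil_Type")]
    wild_cols = [c for c in df.columns if c.startswith("Wilderness_Area")]
    return {
        # Rows whose elevation falls outside the plausible band (metres).
        "elevation_out_of_range": int((~df["Elevation"].between(1800, 4000)).sum()),
        # Aspect is a compass bearing, so it should stay within 0-360 degrees.
        "aspect_out_of_range": int((~df["Aspect"].between(0, 360)).sum()),
        # Each row should carry exactly one soil type and one wilderness area.
        "bad_soil_rows": int((df[soil_cols].sum(axis=1) != 1).sum()),
        "bad_wilderness_rows": int((df[wild_cols].sum(axis=1) != 1).sum()),
        # Columns with a single unique value carry no information.
        "constant_columns": [c for c in df.columns if df[c].nunique() <= 1],
        # Share of each class in the sample (not in nature).
        "class_share": df["Cover_Type"].value_counts(normalize=True).sort_index().to_dict(),
    }


# Toy frame (made-up values) with one bad elevation and one double-coded soil row.
toy = pd.DataFrame({
    "Elevation": [2500, 3100, 4500],
    "Aspect": [45, 350, 180],
    "Soil_Type1": [1, 0, 1],
    "Soil_Type2": [0, 1, 1],
    "Soil_Type3": [0, 0, 0],
    "Wilderness_Area1": [1, 0, 0],
    "Wilderness_Area2": [0, 1, 1],
    "Cover_Type": [1, 2, 2],
})
report = sanity_report(toy)
print(report)
```

Returning a dict rather than raising keeps the function usable as a report: a non-zero count is a prompt to look at the rows, not necessarily a reason to drop them.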

Cartographic features have natural combinations:

Hydrology_Euclidean = sqrt(Horiz² + Vert²): the straight-line distance to water.
Elevation × distance to hydrology: a patch that is both high and far from water is likely a different ecosystem.
Aspect → sin/cos to handle circularity.
Bin elevation into ecological zones (montane, sub-alpine, alpine).
Collapse rare soil types into a "soil_other" bucket to reduce dimensionality.
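
The feature ideas above might look like this in pandas. Treat it as a sketch: the function name `add_features`, the output column names, the `rare_threshold` parameter, and especially the elevation cut points (2700 m and 3400 m) are hypothetical placeholders, not values from the dataset documentation. The interaction term here uses the Euclidean distance, which is one possible reading of "Distance_To_Hydrology".

```python
import numpy as np
import pandas as pd


def add_features(df: pd.DataFrame, rare_threshold: float = 0.01) -> pd.DataFrame:
    """Return a copy of df with the engineered columns appended."""
    out = df.copy()

    # Straight-line distance to water from its horizontal and vertical parts.
    out["Hydrology_Euclidean"] = np.hypot(
        out["Horizontal_Distance_To_Hydrology"],
        out["Vertical_Distance_To_Hydrology"],
    )
    # Interaction: high and far from water vs. low and close.
    out["Elev_x_Hydro"] = out["Elevation"] * out["Hydrology_Euclidean"]

    # Aspect is circular, so 0 and 360 degrees must map to the same point.
    rad = np.deg2rad(out["Aspect"])
    out["Aspect_sin"] = np.sin(rad)
    out["Aspect_cos"] = np.cos(rad)

    # Elevation zones; the cut points are placeholders to be tuned.
    out["Elev_zone"] = pd.cut(
        out["Elevation"],
        bins=[-np.inf, 2700, 3400, np.inf],
        labels=["montane", "sub-alpine", "alpine"],
    )

    # Collapse soil columns that are active in fewer than rare_threshold of rows.
    soil_cols = [c for c in out.columns if c.startswith("Soil_Type")]
    rare = [c for c in soil_cols if out[c].mean() < rare_threshold]
    if rare:
        out["soil_other"] = out[rare].max(axis=1)
        out = out.drop(columns=rare)
    return out


# Toy frame with made-up values.
toy = pd.DataFrame({
    "Elevation": [2000, 3000, 3600, 2500],
    "Aspect": [0, 90, 180, 360],
    "Horizontal_Distance_To_Hydrology": [3, 0, 6, 30],
    "Vertical_Distance_To_Hydrology": [4, 5, -8, 40],
    "Soil_Type1": [1, 1, 1, 0],
    "Soil_Type2": [0, 0, 0, 1],
    "Soil_Type3": [0, 0, 0, 0],
})
feats = add_features(toy, rare_threshold=0.3)
print(feats[["Hydrology_Euclidean", "Aspect_sin", "Aspect_cos", "Elev_zone", "soil_other"]])
```

When fitting on a train/test split, the list of rare soil columns should be decided on the training data only and then reused on the test data, so both frames end up with the same columns.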

Each one is a hypothesis about ecology, not just a transformation.

Tree-based models dominate this task, in part because they need no feature scaling.

Among the models tried, LightGBM typically wins on the leaderboard once `num_leaves` and `min_child_samples` are tuned. Random forest (RF) is the most stable baseline.
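
A random forest baseline with stratified cross-validation could be set up roughly as follows. This is a sketch using scikit-learn on a synthetic stand-in, not the competition data: the array shapes, the noisy "elevation-like" feature, and the hyperparameters are all assumptions for illustration, so the printed score says nothing about real leaderboard performance.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
n_per_class, n_classes = 60, 7

# Balanced labels 1..7, mimicking the balanced training sample.
y = np.repeat(np.arange(1, n_classes + 1), n_per_class)

# Synthetic features: one informative column plus four pure-noise columns.
X = np.column_stack([
    y * 300 + rng.normal(0, 100, size=y.size),
    rng.normal(size=(y.size, 4)),
])

model = RandomForestClassifier(n_estimators=100, random_state=0)

# Stratified folds keep every class represented in each split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(f"RF accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If LightGBM is installed, `lightgbm.LGBMClassifier(num_leaves=..., min_child_samples=...)` should drop into the same `cross_val_score` call in place of `model`; the values to use are left open here because they depend on the data.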

Cartographic features deserve domain reading. "Aspect" isn't abstract — it's the direction the slope faces, and it's cyclical.
One-hot columns that sum to 1 are a sanity check, not just a sparse format.
Tree models don't need scaling, but they still benefit from engineered features (a single Euclidean distance to hydrology is more useful than the horizontal and vertical distances on their own).
Class balance in the Kaggle sample ≠ class balance in reality — your CV scores can mislead in deployment.
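
One way to see how far a balanced CV score can drift is to reweight per-class recall by the class shares expected in deployment. The sketch below assumes only the class proportions change (the features within each class look the same), and every number in it is invented for illustration: the confusion matrix, the three-class setup, and the `priors` vector are hypothetical, as is the helper name `prior_weighted_accuracy`.

```python
import numpy as np


def prior_weighted_accuracy(conf_mat, priors) -> float:
    """Expected accuracy when classes occur with the given priors.

    conf_mat[i][j] counts samples of true class i predicted as class j.
    """
    conf_mat = np.asarray(conf_mat, dtype=float)
    priors = np.asarray(priors, dtype=float)
    # Recall of each class: correct predictions over all samples of that class.
    recall = np.diag(conf_mat) / conf_mat.sum(axis=1)
    # Weight each class's recall by how often that class occurs.
    return float(np.dot(priors / priors.sum(), recall))


# Hypothetical confusion matrix from a balanced validation set (100 rows per class).
conf_mat = [
    [90, 10, 0],
    [20, 80, 0],
    [0, 50, 50],
]

balanced = prior_weighted_accuracy(conf_mat, [1, 1, 1])
# Hypothetical deployment mix where the hardest class is also the most common.
deployed = prior_weighted_accuracy(conf_mat, [0.2, 0.1, 0.7])
print(f"balanced: {balanced:.3f}, deployment estimate: {deployed:.3f}")
```

With these made-up numbers the estimate drops because the class with the weakest recall dominates the deployment mix; with a different mix it could just as easily rise.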

Forest Cover Type — Cheat Sheet