The problem
UCI / Kaggle bike-share dataset: hourly rentals in Washington, D.C., 2011–2012. Target: total rentals per hour.
The challenge is that hour of day has a circular structure: 23:00 is one hour before 00:00, not 23 hours away. How you tell the model this changes the answer.
Goal: cleanly compare three encoding strategies under time-aware validation.
The three encodings

| Encoding | Representation | Implies |
|---|---|---|
| Numerical | hour = 0..23 | Linear distance, which is wrong here (23 is not "far" from 0). |
| One-hot | 24 binary columns | Every hour is equally far from every other, so adjacency is lost. 24 sparse columns. |
| Cyclic | sin(2πh/24), cos(2πh/24) | True circular distance. Two columns. |

Same trick works for: day-of-week, month, season, wind direction — any cyclical feature.
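
To make the "Implies" column of the table above concrete, here is a minimal sketch of the three encodings with the distances each one induces. The function names are illustrative and not taken from the project code; only the standard library is used.

```python
import math

def encode_numerical(hour):
    """Raw hour as a single number in 0..23."""
    return [float(hour)]

def encode_one_hot(hour, period=24):
    """One binary column per hour; exactly one column is set."""
    return [1.0 if i == hour else 0.0 for i in range(period)]

def encode_cyclic(value, period=24):
    """Map a periodic value onto the unit circle as (sin, cos)."""
    angle = 2.0 * math.pi * value / period
    return [math.sin(angle), math.cos(angle)]

def distance(a, b):
    """Euclidean distance between two encoded vectors."""
    return math.dist(a, b)

for name, enc in [("numerical", encode_numerical),
                  ("one-hot", encode_one_hot),
                  ("cyclic", encode_cyclic)]:
    near = distance(enc(23), enc(0))   # neighbours across midnight
    far = distance(enc(12), enc(0))    # opposite sides of the day
    print(f"{name:>9}: d(23, 0) = {near:.3f}   d(12, 0) = {far:.3f}")
```

The numerical encoding places 23:00 farther from 00:00 than noon is, one-hot gives both pairs the same distance, and only the cyclic encoding puts 23:00 next to 00:00. Passing `period=7` or `period=12` reuses `encode_cyclic` for day-of-week or month.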
The horse race
Three models × three encodings = nine cells. Each model trained, validated, scored.
Linear models strongly preferred cyclic — they can't recover the wrap-around from numerical, and one-hot exploded dimensionality.
Tree-based models (Random Forest, Gradient Boosting): cyclic won slightly, even though in theory trees should handle the numerical encoding fine. The win came from needing fewer splits to express "evening rush hour".
Bottom line: cyclic > one-hot > numerical, consistently.
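
The linear-model result can be illustrated on a toy problem. The sketch below is not the project's experiment: the demand curve is synthetic (an assumption chosen so that the peak straddles midnight), and the least-squares fit is a small hand-rolled solver so that the example needs only the standard library.

```python
import math

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_ols(features, targets):
    """Ordinary least squares with an intercept, via the normal equations."""
    rows = [[1.0] + list(f) for f in features]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)

def rmse(coef, features, targets):
    """Root mean squared error of the fitted linear model."""
    preds = [coef[0] + sum(c * v for c, v in zip(coef[1:], f)) for f in features]
    return math.sqrt(sum((p - y) ** 2 for p, y in zip(preds, targets)) / len(targets))

hours = list(range(24))
# Synthetic demand (an assumption, not the bike-share data):
# one smooth daily cycle peaking at 01:00, so the bump wraps around midnight.
demand = [100 + 50 * math.cos(2 * math.pi * (h - 1) / 24) for h in hours]

numerical = [[float(h)] for h in hours]
cyclic = [[math.sin(2 * math.pi * h / 24), math.cos(2 * math.pi * h / 24)]
          for h in hours]

rmse_numerical = rmse(fit_ols(numerical, demand), numerical, demand)
rmse_cyclic = rmse(fit_ols(cyclic, demand), cyclic, demand)
print(f"numerical RMSE: {rmse_numerical:.2f}")
print(f"cyclic RMSE:    {rmse_cyclic:.2f}")
```

With the cyclic columns the error is zero up to floating-point noise, because a shifted cosine is an exact linear combination of the sin and cos features. A straight line in the raw hour cannot bend back to meet itself at midnight, so its error stays large. Real rental curves are not a single sinusoid, so the gap on actual data will be less extreme.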
Time-aware CV
Never shuffle time-series data for CV. I used:
- Rolling-origin evaluation: train on [0, t], score on [t, t+h], slide forward.
- Strict no-leakage: features that depend on the future of the train window are out.
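
The rolling-origin scheme above can be sketched as a small index generator. This is an illustration rather than the code used in the project: the function name, the half-open index ranges, and the window sizes are all assumptions, and rows are assumed to be sorted by time.

```python
def rolling_origin_splits(n_samples, initial_train, horizon, step=None):
    """Yield (train_indices, test_indices) for an expanding training window.

    Training always starts at row 0 and ends right before the test block,
    so every training row is strictly earlier than every test row.
    """
    step = horizon if step is None else step
    train_end = initial_train
    while train_end + horizon <= n_samples:
        train = range(0, train_end)                    # rows [0, t)
        test = range(train_end, train_end + horizon)   # rows [t, t+h)
        yield train, test
        train_end += step                              # slide the origin forward

# Hypothetical sizes: 10 weeks of hourly rows, a 4-week initial window,
# and a 1-week validation horizon.
hours_per_week = 24 * 7
splits = list(rolling_origin_splits(10 * hours_per_week,
                                    initial_train=4 * hours_per_week,
                                    horizon=hours_per_week))
for train, test in splits:
    print(f"train rows [0, {train.stop})  ->  score rows [{test.start}, {test.stop})")
```

scikit-learn's `TimeSeriesSplit` offers a comparable expanding-window splitter. Whichever splitter is used, any lagged or rolling feature has to be computed from rows inside the training range only, otherwise the no-leakage rule is broken even though the indices look clean.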
A naïve shuffled k-fold split would have inflated scores by 10% or more and produced a model that fails in production.
The October 2012 anomaly
A whole week of October 2012 had zero rentals — Hurricane Sandy shut down the city. A weather event the dataset's weather features didn't fully capture.
If that week falls inside the training window → no problem. If it falls in the validation window → the model looks like garbage.
Lesson: always plot residuals over time. The biggest errors cluster at the events the model couldn't see coming.
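
A minimal sketch of that residual check, assuming hourly timestamps alongside actual and predicted counts. The numbers below are made up for illustration (a flat forecast and one day of collapsed demand) and the dates are arbitrary placeholders, not values from the dataset; the text bar chart stands in for a real plot.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def mean_abs_residual_by_day(timestamps, actual, predicted):
    """Average |actual - predicted| per calendar day, in chronological order."""
    buckets = defaultdict(list)
    for ts, y, y_hat in zip(timestamps, actual, predicted):
        buckets[ts.date()].append(abs(y - y_hat))
    return {day: sum(v) / len(v) for day, v in sorted(buckets.items())}

# Made-up data: four days of hourly rows, a constant forecast of 100,
# and actual demand dropping to 0 on the third day.
start = datetime(2012, 10, 27)
timestamps = [start + timedelta(hours=i) for i in range(24 * 4)]
predicted = [100.0] * len(timestamps)
actual = [0.0 if ts.day == 29 else 95.0 for ts in timestamps]

daily = mean_abs_residual_by_day(timestamps, actual, predicted)
for day, err in daily.items():
    bar = "#" * int(err / 5)   # crude text "plot" of error over time
    print(f"{day}  {err:6.1f}  {bar}")

worst_day = max(daily, key=daily.get)
print("largest mean error on", worst_day)
```

Aggregating by day (or week) before plotting keeps a single bad stretch visible instead of burying it in hourly noise, and the worst bucket is a natural place to start looking for an event the features did not describe.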
What I learned
- Cyclical features deserve cyclical encoding, even for trees.
- Time-aware CV is non-negotiable. Shuffle once, lie about your scores forever.
- Anomalies + small data = unstable validation. Always inspect residuals along time.
- Domain features beat tuning. "Is rush hour" was a stronger feature than any hyperparameter change.
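
As an illustration of the last point, a rush-hour flag could look like the sketch below. The commute windows and the weekday rule are assumptions; the write-up does not state the exact definition that was used.

```python
from datetime import datetime

# Assumed commute windows; adjust to what the data actually shows.
MORNING_RUSH = range(7, 10)    # 07:00-09:59
EVENING_RUSH = range(16, 20)   # 16:00-19:59

def is_rush_hour(ts):
    """Return 1 if the timestamp falls in a weekday commute window, else 0."""
    is_weekday = ts.weekday() < 5   # Monday=0 ... Friday=4
    in_window = ts.hour in MORNING_RUSH or ts.hour in EVENING_RUSH
    return int(is_weekday and in_window)

print(is_rush_hour(datetime(2012, 10, 1, 8)))    # Monday morning
print(is_rush_hour(datetime(2012, 10, 6, 8)))    # Saturday morning
```

This version ignores public holidays. If the data carries a working-day flag, gating on that instead of the weekday would likely be the better choice.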