
Table of Contents
- 1. A quick story
- 2. What problem RL actually solves
- 3. How RL fits next to supervised and unsupervised learning
- 4. The loop, explained like you’re watching the puppy
- 5. The key ingredients (jargon translated)
- 6. Why RL is tricky in practice
- 7. When RL shines (and when it doesn’t)
- 8. A tiny concrete example: CartPole reward shaping
- 9. From puppy to production: a lightweight roadmap
- 10. Common pitfalls (and quick fixes)
- 11. If you want to go deeper
- 12. Takeaway
Last update: February 2025. All opinions are my own.
A quick story
Think about teaching a puppy to fetch. You cannot hand it a labeled dataset of “perfect fetch” examples. You throw a stick, the puppy tries something, you cheer or stay silent, and over time the puppy discovers the sequence of moves that earns the reward. That is reinforcement learning (RL) in miniature: learning behaviour through interaction and feedback, not through an answer key.

What problem RL actually solves
- You cannot pre-label the correct action for every situation.
- The world changes, sometimes in ways you cannot model upfront.
- The system must learn while it runs, improving its policy as it gathers experience.
RL is built for these cases: it learns a strategy that maximises long-term reward, not a one-shot label prediction.
How RL fits next to supervised and unsupervised learning
- Supervised: predict a known label from examples (best when you know the answer for many cases).
- Unsupervised: discover structure without labels (clusters, topics, embeddings).
- Reinforcement: pick actions now to improve future reward, even when the right move is unknown and may only pay off later.
The loop, explained like you’re watching the puppy
- Observe the current state (where is the stick?).
- Pick an action (run left, run right, wait).
- See the new state + reward (did you get closer? did you hear “good dog”?).
- Update the policy so actions that led to better rewards become more likely.
- Repeat—many, many times—until the behaviour sticks.
Behaviours that lead to positive outcomes are reinforced. Everything else becomes less likely.
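The loop above fits in a few lines of code. Here is a minimal sketch using a two-action bandit as a stand-in for the puppy; the action names and reward numbers are invented for illustration:

```python
import random

# Toy environment: two actions with different average payoffs
# (a stand-in for the puppy's choices; the numbers are made up).
TRUE_REWARD = {"run_left": 0.2, "run_right": 0.8}

def step(action):
    """Environment responds with a noisy reward around the action's true value."""
    return TRUE_REWARD[action] + random.uniform(-0.1, 0.1)

# Policy state: a running average value estimate per action.
values = {a: 0.0 for a in TRUE_REWARD}
counts = {a: 0 for a in TRUE_REWARD}

random.seed(0)
for _ in range(500):
    # Explore 10% of the time, otherwise exploit the current best estimate.
    if random.random() < 0.1:
        action = random.choice(list(values))
    else:
        action = max(values, key=values.get)
    reward = step(action)                       # observe the reward
    counts[action] += 1
    # Incremental mean update: better-rewarded actions become more likely.
    values[action] += (reward - values[action]) / counts[action]

print(max(values, key=values.get))  # → run_right
```

Run it a few times: the estimate for the better action climbs and the greedy choice shifts toward it, which is the whole loop in miniature.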
The key ingredients (jargon translated)
- Agent: the decision maker (puppy, robot, trading bot).
- Environment: whatever responds to the agent (room, simulator, market).
- State: what the agent can sense about “now.”
- Action: what the agent can do next.
- Reward: scalar feedback; higher is better.
- Policy: the rule that maps state → action.
- Value: how promising a state (or state–action pair) looks for future reward.
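If it helps to see these terms as code, here is one possible mapping into Python types; the names and the toy environment are my own, not from any particular library:

```python
from dataclasses import dataclass, field
from typing import Callable

State = tuple[float, ...]            # what the agent can sense about "now"
Action = str                         # what the agent can do next
Reward = float                       # scalar feedback; higher is better
Policy = Callable[[State], Action]   # the rule that maps state -> action

@dataclass
class Agent:
    """The decision maker: holds a policy and value estimates."""
    policy: Policy
    values: dict[State, float] = field(default_factory=dict)  # how promising each state looks

def environment_step(state: State, action: Action) -> tuple[State, Reward]:
    """A hypothetical 1-D environment: reward peaks when the agent sits at the origin."""
    x = state[0] + (1.0 if action == "right" else -1.0)
    return (x,), -abs(x)

# A hand-written policy that always heads back toward the origin.
agent = Agent(policy=lambda s: "left" if s[0] > 0 else "right")
state, reward = environment_step((2.0,), agent.policy((2.0,)))
```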
Why RL is tricky in practice
- Delayed rewards: a harmless move now can lose the game ten steps later.
- Exploration vs exploitation: try new actions vs repeat what works; too much of either hurts.
- Non-IID data: your actions change the future data distribution.
- Safety & cost: real-world exploration can be expensive or dangerous.
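The delayed-reward problem becomes concrete once you compute a discounted return: a cost ten steps away still lowers the value of the first move. A minimal sketch, where the 0.9 discount is an arbitrary choice:

```python
def discounted_return(rewards, gamma=0.9):
    """Value of a trajectory seen from its first step: r0 + gamma*r1 + gamma^2*r2 + ..."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# A "harmless" opening followed by a big loss ten steps later still
# drags down the value of the first move once you look far enough ahead.
trajectory = [0.0] * 10 + [-10.0]
print(discounted_return(trajectory))  # ≈ -3.49
```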
When RL shines (and when it doesn’t)
- Shines: games with complex strategy, robotics/control, navigation, adaptive recommendations, resource allocation.
- Struggles: when reward is sparse or poorly shaped; when sim-to-real gap is huge; when exploration is unsafe or extremely costly.
A tiny concrete example: CartPole reward shaping
Goal: keep a pole balanced on a cart. A naive reward might be +1 every timestep the pole stays upright. That works, but learning speeds up if you shape the reward to include distance from center and pole angle, e.g. r = 1 - 0.1*|x| - 0.5*|theta|. You’re still aiming for balance, but you’re giving the agent a smoother gradient about what “better” means.
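That shaped reward is a one-liner in code, following the formula above (the sample inputs are arbitrary):

```python
def shaped_reward(x, theta):
    """CartPole reward shaping: the naive reward is a flat +1 per upright
    timestep; this version also rewards staying near the centre (x) and
    keeping the pole vertical (theta), per r = 1 - 0.1*|x| - 0.5*|theta|."""
    return 1.0 - 0.1 * abs(x) - 0.5 * abs(theta)

# Centered and vertical beats drifting and tilted:
print(shaped_reward(0.0, 0.0))   # → 1.0
print(shaped_reward(1.5, 0.2))   # close to 0.75
```

The shaped version ranks nearby states, so the agent gets useful feedback even before it has learned to balance at all.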
From puppy to production: a lightweight roadmap
- Prototype in simulation: iterate fast where failure is cheap.
- Define reward carefully: align it with the real goal; avoid loopholes.
- Tune exploration: schedule epsilon/temperature so you explore early, exploit later.
- Stabilize training: use replay buffers, target networks, advantage estimators (A2C/PPO) to reduce variance.
- Plan sim-to-real: domain randomization or fine-tuning on real data to close the gap.
- Monitor live: track reward, safety constraints, and drift; be ready to roll back.
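The "explore early, exploit later" schedule from the list above is often just a linear decay on epsilon; a sketch with placeholder numbers:

```python
def epsilon(step, start=1.0, end=0.05, decay_steps=10_000):
    """Linear epsilon schedule: fully exploratory at step 0,
    mostly greedy once decay_steps have passed."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)

print(epsilon(0))       # → 1.0   (explore everything at first)
print(epsilon(20_000))  # → 0.05  (settle into exploiting what works)
```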
Common pitfalls (and quick fixes)
- Reward hacking: agent finds shortcuts. → Add constraints or penalties; audit behaviour regularly.
- Dying at episode start: exploration too timid or reward too sparse. → Add shaping, curiosity, or better initialization.
- Training collapse: unstable updates. → Smaller learning rate, clip gradients, use PPO/A2C style objectives.
- Sim-to-real failure: dynamics mismatch. → Randomize physics/visuals in sim; fine-tune on limited real rollouts.
If you want to go deeper
- Try a minimal notebook with gymnasium + PPO on CartPole.
- Swap the reward and watch how learning speed changes.
- Add a safety constraint (e.g., limit force) and see how the policy adapts.
- Read about policy gradients vs Q-learning; they solve the same loop with different math.
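The Q-learning half of that comparison is small enough to write out: a tabular sketch of the update Q(s, a) ← Q(s, a) + α·(r + γ·max Q(s', ·) − Q(s, a)), with made-up states and actions:

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9, actions=("left", "right")):
    """Tabular Q-learning: nudge Q(s, a) toward r + gamma * best next value."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

Q = defaultdict(float)
q_update(Q, s=0, a="right", r=1.0, s_next=1)
print(Q[(0, "right")])  # → 0.1 (one step of learning from a +1 reward)
```

Policy gradients attack the same loop from the other side: instead of estimating values and acting greedily, they adjust the policy's action probabilities directly.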
Takeaway
RL is about learning a strategy through interaction. If you can name the goal, the actions, and a reward signal, you can frame the problem as reinforcement learning. Then let the agent practice—cheaply in sim when possible—until the behaviour you want becomes the behaviour it chooses.
