Last update: February 2025. All opinions are my own.

A quick story

Think about teaching a puppy to fetch. You cannot hand it a labeled dataset of “perfect fetch” examples. You throw a stick, the puppy tries something, you cheer or stay silent, and over time the puppy discovers the sequence of moves that earns the reward. That is reinforcement learning (RL) in miniature: learning behaviour through interaction and feedback, not through an answer key.

Figure: a dog (agent) learning from a human (environment) via actions, rewards, and observations.
Dog = agent. Human + world = environment. The agent acts, the environment responds with observations and rewards.

What problem RL actually solves

  • You cannot pre-label the correct action for every situation.
  • The world changes, sometimes in ways you cannot model upfront.
  • The system must learn while it runs, improving its policy as it gathers experience.

RL is built for these cases: it learns a strategy that maximises long-term reward, not a one-shot label prediction.

How RL fits next to supervised and unsupervised learning

  • Supervised: predict a known label from examples (best when you know the answer for many cases).
  • Unsupervised: discover structure without labels (clusters, topics, embeddings).
  • Reinforcement: pick actions now to improve future reward, even when the right move is unknown and may only pay off later.

The loop, explained like you’re watching the puppy

  1. Observe the current state (where is the stick?).
  2. Pick an action (run left, run right, wait).
  3. See the new state + reward (did you get closer? did you hear “good dog”?).
  4. Update the policy so actions that led to better rewards become more likely.
  5. Repeat—many, many times—until the behaviour sticks.
Figure: the agent-policy-environment workflow.
State → Policy → Action → Environment → Reward + next state → Update.

Behaviours that lead to positive outcomes are reinforced; everything else becomes relatively less likely.
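The loop above fits in a few lines of Python. This is a toy sketch, not real training code: a made-up one-dimensional "fetch" environment (stick at position +3) and a deliberately crude update that reinforces only the final action, so you can see the observe → act → reward → update cycle without any RL library.

```python
import random

# Toy "fetch" world: the agent walks along a line; the stick sits at +3.
class FetchEnv:
    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):  # action: -1 (left) or +1 (right)
        self.pos += action
        done = self.pos == 3
        reward = 1.0 if done else 0.0
        return self.pos, reward, done

# Policy: a preference score per action; pick the higher one, tie → random.
prefs = {-1: 0.0, +1: 0.0}

def policy(state):
    if prefs[-1] == prefs[+1]:
        return random.choice([-1, +1])  # explore while we know nothing
    return max(prefs, key=prefs.get)    # exploit what worked

env = FetchEnv()
for episode in range(100):
    state = env.reset()
    for _ in range(30):
        action = policy(state)
        state, reward, done = env.step(action)
        prefs[action] += reward  # crude: only the final action gets credit
        if done:
            break
```

After a few lucky episodes, "run right" accumulates preference and the puppy fetches every time. Real algorithms spread credit back over the whole action sequence, which is exactly the delayed-reward problem discussed below.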

The key ingredients (jargon translated)

  • Agent: the decision maker (puppy, robot, trading bot).
  • Environment: whatever responds to the agent (room, simulator, market).
  • State: what the agent can sense about “now.”
  • Action: what the agent can do next.
  • Reward: scalar feedback; higher is better.
  • Policy: the rule that maps state → action.
  • Value: how promising a state (or state–action pair) looks for future reward.

Why RL is tricky in practice

  • Delayed rewards: a harmless move now can lose the game ten steps later.
  • Exploration vs exploitation: try new actions vs repeat what works; too much of either hurts.
  • Non-IID data: your actions change the future data distribution.
  • Safety & cost: real-world exploration can be expensive or dangerous.
Figure: exploration versus exploitation reward curves.
Balanced exploration can uncover higher long-term reward; pure exploitation often plateaus.
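You can watch that plateau happen in a two-armed bandit, the simplest exploration testbed. The setup below is invented for illustration: arm 1 genuinely pays better, but a purely greedy agent that gets an early payout from arm 0 can lock onto it forever, while a little epsilon-exploration keeps sampling both arms until the truth shows up.

```python
import random

random.seed(0)

# Two slot machines: arm 0 pays off 30% of the time, arm 1 pays off 70%.
TRUE_MEANS = [0.3, 0.7]

def pull(arm):
    return 1.0 if random.random() < TRUE_MEANS[arm] else 0.0

def run(epsilon, steps=2000):
    counts = [0, 0]
    values = [0.0, 0.0]  # running average reward per arm
    total = 0.0
    for _ in range(steps):
        if random.random() < epsilon:
            arm = random.randrange(2)                     # explore
        else:
            arm = 0 if values[0] >= values[1] else 1      # exploit
        r = pull(arm)
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]    # incremental mean
        total += r
    return total / steps

greedy = run(epsilon=0.0)       # pure exploitation: tends to lock onto arm 0
eps_greedy = run(epsilon=0.1)   # 10% exploration: finds the better arm
```

With epsilon = 0.1 the average reward lands near the good arm's 0.7 payout; pure exploitation typically stalls near 0.3. Too much exploration hurts too: epsilon = 1.0 is a coin flip forever, averaging 0.5.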

When RL shines (and when it doesn’t)

  • Shines: games with complex strategy, robotics/control, navigation, adaptive recommendations, resource allocation.
  • Struggles: when reward is sparse or poorly shaped; when sim-to-real gap is huge; when exploration is unsafe or extremely costly.

A tiny concrete example: CartPole reward shaping

Goal: keep a pole balanced on a cart. A naive reward might be +1 every timestep the pole stays upright. That works, but learning speeds up if you shape the reward to include distance from center and pole angle, e.g. r = 1 - 0.1*|x| - 0.5*|theta|. You’re still aiming for balance, but you’re giving the agent a smoother gradient about what “better” means.
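The shaped reward from the paragraph above is a one-liner. The 0.1 and 0.5 weights are illustrative choices, not tuned values; the point is only that reward now declines smoothly as the cart drifts or the pole tilts, instead of staying flat at +1 until sudden failure.

```python
def shaped_reward(x, theta):
    """Shaped CartPole reward: full credit when balanced at the centre,
    smoothly penalised by cart offset x (metres) and pole angle theta (radians)."""
    return 1.0 - 0.1 * abs(x) - 0.5 * abs(theta)

shaped_reward(0.0, 0.0)  # 1.0: centred, upright
shaped_reward(1.0, 0.2)  # 0.8: drifting and tilting, so slightly less reward
```

Be careful with shaping, though: if the weights dominate, the agent may optimise "stay centred" instead of "stay balanced." That tension is the reward-hacking pitfall covered later.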

From puppy to production: a lightweight roadmap

  • Prototype in simulation: iterate fast where failure is cheap.
  • Define reward carefully: align it with the real goal; avoid loopholes.
  • Tune exploration: schedule epsilon/temperature so you explore early, exploit later.
  • Stabilize training: use replay buffers, target networks, advantage estimators (A2C/PPO) to reduce variance.
  • Plan sim-to-real: domain randomization or fine-tuning on real data to close the gap.
  • Monitor live: track reward, safety constraints, and drift; be ready to roll back.
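The "tune exploration" step often amounts to a decay schedule. Here's a minimal linear-decay sketch (the start/end values and step count are placeholder choices): epsilon starts high so the agent explores broadly, then anneals toward a small floor so it mostly exploits what it has learned.

```python
def epsilon_schedule(step, total_steps, eps_start=1.0, eps_end=0.05):
    """Linearly decay exploration rate from eps_start to eps_end,
    then hold at the floor for the rest of training."""
    frac = min(step / total_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

epsilon_schedule(0, 10_000)       # 1.0: explore everything early
epsilon_schedule(5_000, 10_000)   # 0.525: halfway through the decay
epsilon_schedule(20_000, 10_000)  # 0.05: held at the floor
```

Exponential decay or a temperature schedule on a softmax policy follows the same shape; what matters is "explore early, exploit later," with a nonzero floor so the agent never stops probing entirely.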

Common pitfalls (and quick fixes)

  • Reward hacking: agent finds shortcuts. → Add constraints or penalties; audit behaviour regularly.
  • Dying at episode start: exploration too timid or reward too sparse. → Add shaping, curiosity, or better initialization.
  • Training collapse: unstable updates. → Smaller learning rate, clip gradients, use PPO/A2C style objectives.
  • Sim-to-real failure: dynamics mismatch. → Randomize physics/visuals in sim; fine-tune on limited real rollouts.
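One of the quick fixes for training collapse, gradient clipping, is simple enough to show directly. This is a framework-free sketch of global-norm clipping (PyTorch's `clip_grad_norm_` does the same thing on tensors): if the combined gradient gets too large, scale the whole vector down so one bad batch can't blow up the policy.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """If the L2 norm of all gradients exceeds max_norm, rescale them so
    the norm equals max_norm; otherwise leave them untouched."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

clip_by_global_norm([3.0, 4.0], 1.0)   # norm 5 → scaled to [0.6, 0.8]
clip_by_global_norm([0.1, 0.2], 1.0)   # norm < 1 → unchanged
```

Clipping preserves the gradient's direction and caps only its magnitude, which is why it pairs well with the clipped surrogate objectives in PPO.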

If you want to go deeper

  • Try a minimal notebook with gymnasium + PPO on CartPole.
  • Swap the reward and watch how learning speed changes.
  • Add a safety constraint (e.g., limit force) and see how the policy adapts.
  • Read about policy gradients vs Q-learning; they solve the same loop with different math.
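To make the Q-learning side of that comparison concrete, here is the standard tabular update in a few lines (state and action names here are hypothetical). Q-learning nudges a value estimate for each state-action pair toward a bootstrapped target; policy gradients instead adjust action probabilities directly.

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: move Q(s, a) toward the target
    r + gamma * max over a' of Q(s', a'), with learning rate alpha."""
    best_next = max(Q[s_next].values(), default=0.0)
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])

Q = defaultdict(lambda: defaultdict(float))  # Q[state][action], default 0
q_update(Q, s=0, a="right", r=1.0, s_next=1)
# Q[0]["right"] is now 0.1 * (1.0 + 0.99 * 0.0 - 0.0) = 0.1
```

Both families run the same observe-act-update loop from earlier; they differ only in what gets updated: a value table (or network) here, action probabilities in policy-gradient methods.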

Takeaway

RL is about learning a strategy through interaction. If you can name the goal, the actions, and a reward signal, you can frame the problem as reinforcement learning. Then let the agent practice—cheaply in sim when possible—until the behaviour you want becomes the behaviour it chooses.