
Last update: February 2026. All opinions are my own.
Overview
Reinforcement learning (RL) is the science of decision-making.
Unlike most machine learning, it is not about predicting labels. It is about learning how to act.
If supervised learning says "here is the right answer," RL says "try something and see what happens."
This post is a short, visual-first intro so you can read papers and code with confidence.
1) From prediction to decision
Most ML problems are about mapping inputs to outputs. RL is about choosing actions over time.
In RL, actions change what you see next. The data depends on the policy, and the policy keeps changing. That loop is the whole point.
A quick comparison
| Topic | Supervised | Reinforcement |
|---|---|---|
| Goal | Predict labels | Maximize long-term reward |
| Feedback | Immediate and direct | Often delayed |
| Data | Fixed dataset | Collected by the agent |
| Output | A prediction | A decision policy |
2) How RL differs from supervised and unsupervised learning
To understand RL clearly, compare it to the other two paradigms.
Supervised learning
You are given:
input -> correct output

The model's job is to minimize prediction error.
Examples:
- Spam detection
- Image classification
- House price prediction
You measure performance with accuracy, loss, MSE, and related metrics. The dataset is fixed.
Unsupervised learning
You are given:
input only

The model discovers structure:
- clusters
- patterns
- low-dimensional representations
There is no "correct answer." But the dataset is still static.
Reinforcement learning
You are given:
state -> choose action -> receive reward

There are no labels, no predefined dataset, and no immediate error signal.
Instead:
- The agent collects its own data.
- Its actions influence what it sees next.
- The goal is to maximize long-term reward, not minimize prediction error.
Think of how a baby learns to walk:
- Try a step.
- Fall.
- Adjust.
- Try again.
No labels. Just feedback.
| Property | Supervised | Unsupervised | Reinforcement |
|---|---|---|---|
| Data | Labeled data | No labels | Reward signal |
| Dataset | Static dataset | Static dataset | Live interaction |
| Feedback | Immediate error | No clear target | Delayed reward |
| Assumption | I.I.D. data | I.I.D. data | Sequential data |
| Objective | Predict | Discover | Decide |
Reinforcement learning is about decisions, not predictions.
LunarLander is not a prediction problem. It is a decision-making problem. The agent does not predict where the lander should be. It learns which engines to fire, and when, so that the lander touches down safely.
3) The RL loop
Every RL system is the same loop:
state -> action -> reward -> next state -> ...

The agent tries something, gets feedback, and updates its policy.
Visual idea: A simple agent-environment loop diagram with arrows labeled "state," "action," "reward."
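The loop can be sketched in a few lines of Python. Everything here is illustrative: `WalkRight` is a made-up toy environment standing in for a real one, and `run_episode` is just the state -> action -> reward cycle written out.

```python
class WalkRight:
    """Toy environment: positions 0..4, episode ends when position 4 is reached."""
    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):          # action is +1 (right) or -1 (left)
        self.pos = max(0, min(4, self.pos + action))
        done = self.pos == 4
        reward = 1.0 if done else 0.0
        return self.pos, reward, done


def run_episode(env, policy, max_steps=100):
    """One pass through the RL loop: observe state, act, receive reward, repeat."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                  # agent chooses an action
        state, reward, done = env.step(action)  # environment responds
        total_reward += reward                  # feedback accumulates
        if done:
            break
    return total_reward
```

A policy here is just any function from state to action; `run_episode(WalkRight(), lambda s: 1)` walks straight to the goal and returns a total reward of 1.0.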
4) The core pieces (plain language)
You will see these terms everywhere:
- Agent: the decision-maker.
- Environment: the world the agent interacts with.
- State: what the agent observes.
- Action: what the agent does.
- Reward: the feedback signal.
- Policy: the rule that maps state to action.
- Episode: one full run from start to finish.
If the policy is the behavior, the value function is the prediction of how good a state or action is.
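One way to make the policy/value distinction concrete: a policy maps states to actions, while a value function maps states to numbers. A minimal tabular sketch (the state names, values, and transitions below are made up for illustration, not taken from any library):

```python
# Hypothetical value estimates for a three-state chain: higher means closer to the goal.
value = {"far": 0.1, "near": 0.5, "goal": 1.0}

# Which state each action would lead to from each state (illustrative only).
transitions = {
    "far":  {"forward": "near", "stay": "far"},
    "near": {"forward": "goal", "stay": "near"},
}

def greedy_policy(state):
    """A policy derived from values: pick the action whose successor is worth most."""
    actions = transitions[state]
    return max(actions, key=lambda a: value[actions[a]])
```

This is the basic pattern behind value-based methods: estimate how good states are, then act greedily with respect to those estimates.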
5) A tiny example
Imagine a robot in a maze:
- Each move is an action.
- The goal is to reach the exit.
- The reward is +1 at the exit, 0 elsewhere.
At first, the robot wanders randomly. Over time, it learns which paths lead to the exit and repeats them more often.
Visual idea: A gridworld with a start, a goal, and a few failed paths in light gray.
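The maze story can be made concrete with tabular Q-learning, one standard algorithm for exactly this setting (random wandering that gradually concentrates on rewarding paths). The corridor below is a one-dimensional stand-in for the maze; all names and parameters are illustrative.

```python
import random

def q_learning_corridor(n_states=5, episodes=200, alpha=0.5, gamma=0.9,
                        epsilon=0.1, seed=0):
    """Learn to walk a corridor: reward +1 at the last cell, 0 everywhere else."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(n_states) for a in (-1, 1)}
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            # Epsilon-greedy exploration: mostly greedy, occasionally random.
            if rng.random() < epsilon:
                a = rng.choice((-1, 1))
            else:
                a = max((-1, 1), key=lambda act: q[(s, act)])
            s2 = max(0, min(n_states - 1, s + a))
            r = 1.0 if s2 == n_states - 1 else 0.0
            # Q-learning update: nudge the estimate toward
            # reward + discounted value of the best next action.
            best_next = 0.0 if s2 == n_states - 1 else max(q[(s2, -1)], q[(s2, 1)])
            q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
            s = s2
    return q
```

After training, the greedy action in every non-terminal state is "step right": the reward at the exit has propagated backward through the Q-table, which is exactly the "repeats them more often" behavior described above.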
6) Why RL is hard in practice
RL is powerful but fragile:
- Rewards can be sparse or noisy.
- Exploration can be expensive or unsafe.
- Small reward tweaks can change behavior a lot.
Reward shaping is not cheating. It is how you make learning practical, as long as the shaped reward still matches the real goal.
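One principled form of this is potential-based shaping: add gamma * Phi(s') - Phi(s) to the reward, a construction known to leave the optimal policy unchanged. A minimal sketch, where `potential` is a made-up distance-to-goal heuristic:

```python
GAMMA = 0.9

def potential(state, goal=4):
    """Illustrative potential function: closer to the goal means a higher value."""
    return -abs(goal - state)

def shaped_reward(reward, state, next_state):
    """Potential-based shaping bonus: dense feedback, same optimal policy."""
    return reward + GAMMA * potential(next_state) - potential(state)
```

Moving toward the goal now earns a positive bonus and moving away earns a negative one, even in states where the environment's own reward is zero; that denser signal is what makes sparse-reward problems practical.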
7) Where to go next
If you want to go deeper, try these steps:
- Build a small RL agent for CartPole.
- Plot reward curves and watch for instability.
- Read Sutton and Barto, Chapters 1-3.
- Study PPO to see how modern agents learn reliably.
RL is a big field, but the core idea is simple: learn to make good decisions through experience.
