From Prediction to Decision — RL Cheat Sheet

A gentle introduction to reinforcement learning. How RL differs from supervised learning, the agent-environment loop, value vs policy methods, and key algorithms.

Updated February 2026
1

The setup

┌─────────┐         action a_t          ┌─────────────┐
│  AGENT  │ ─────────────────────────►  │ ENVIRONMENT │
└─────────┘                              └─────────────┘
     ▲                                          │
     │ state s_{t+1}, reward r_{t+1}            │
     └──────────────────────────────────────────┘

At each step:

  1. Agent observes state s.
  2. Picks action a from a policy π(a | s).
  3. Environment returns next state s' and reward r.
  4. Goal: maximise the expected discounted return G = Σ_t γ^t · r_{t+1}, where the discount factor γ (0 ≤ γ ≤ 1) down-weights later rewards.
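
The loop and the return can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: `CorridorEnv`, its `reset`/`step` methods, and the `always_right` policy are hypothetical names invented for this example, and the policy is deterministic for simplicity.

```python
class CorridorEnv:
    """Hypothetical toy environment: cells 0..length-1, reward 1.0 on reaching the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); the position is clamped to the corridor.
        self.state = min(max(self.state + action, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k], accumulated from the last reward backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def run_episode(env, policy, max_steps=100):
    """Observe the state, pick an action, receive next state and reward; repeat until done."""
    state = env.reset()
    rewards = []
    for _ in range(max_steps):
        action = policy(state)
        state, reward, done = env.step(action)
        rewards.append(reward)
        if done:
            break
    return rewards


def always_right(state):
    """Hypothetical fixed policy: ignore the state and always move right."""
    return +1


rewards = run_episode(CorridorEnv(length=5), always_right)
print(rewards)                          # [0.0, 0.0, 0.0, 1.0]
print(discounted_return(rewards, 0.9))  # roughly 0.729, i.e. 0.9 ** 3
```

The single reward arrives on the fourth step, so it is discounted three times. A smaller γ makes the same late reward worth less to the agent.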
2

Supervised vs RL

               Supervised                  Reinforcement
Signal         Correct label per sample    Scalar reward, often delayed
Data           Static dataset              Generated by interaction
Goal           Predict                     Decide / control
Trade-off      Bias vs variance            Exploration vs exploitation
Failure mode   Overfit                     Bad reward → bad policy

In supervised learning, the dataset is given. In RL, the dataset is created by the policy itself, and bad policies generate bad data. That feedback loop accounts for much of the difficulty.

3

Exploration vs exploitation

The defining tension of RL.

  • Exploit: pick the action your current best estimate says is best.
  • Explore: try other actions to learn more.

Pure exploit → stuck in local optimum. Pure explore → never converge.

Standard approaches:

  • ε-greedy — pick best action with probability 1−ε, random otherwise.
  • Boltzmann (softmax) exploration — sample actions with probability proportional to exp(Q(s, a) / τ); the temperature τ controls how random the choice is.
  • Optimism in the face of uncertainty (UCB) — explore actions with high upper-confidence bounds.
  • Entropy regularisation in policy methods — penalise overly deterministic policies.
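
The first three rules in this list can be sketched as small action-selection functions over a list of estimated action values. This is an illustrative sketch under stated assumptions: the function names, the `temperature` parameter, and the UCB1-style bonus with constant `c` are choices made for this example rather than anything fixed by the text above.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def boltzmann_probs(q_values, temperature=1.0):
    """Softmax over action values: P(a) is proportional to exp(Q(a) / temperature)."""
    m = max(q_values)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def boltzmann(q_values, temperature=1.0, rng=random):
    """Sample one action index from the softmax distribution."""
    probs = boltzmann_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


def ucb1(q_values, counts, t, c=2.0):
    """UCB1-style rule: value estimate plus a bonus that shrinks as an action is tried more."""
    for a, n in enumerate(counts):
        if n == 0:
            return a  # try every action once before comparing bounds
    scores = [q + math.sqrt(c * math.log(t) / n) for q, n in zip(q_values, counts)]
    return max(range(len(scores)), key=lambda a: scores[a])


q = [0.1, 0.9, 0.3]
print(epsilon_greedy(q, epsilon=0.0))       # 1 (pure exploitation)
print(boltzmann_probs(q, temperature=0.1))  # sharply peaked on action 1
print(boltzmann_probs(q, temperature=10))   # close to uniform
```

A low temperature behaves almost greedily and a high one almost uniformly, which is the same dial that ε turns in ε-greedy.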
4

Value-based methods

Learn a value function that estimates how good each state (or state-action pair) is.

  • V(s) — expected return from state s.
  • Q(s, a) — expected return from taking action a in state s, then following the policy.

Policy is derived: π(s) = argmax_a Q(s, a).

Key algorithms:

  • Q-Learning — off-policy, updates Q toward r + γ · max_a' Q(s', a').
  • SARSA — on-policy, uses the actual next action.
  • DQN — deep Q-network. Replaces Q-table with a neural net. Atari-era breakthrough.
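
The difference between the Q-Learning and SARSA targets is easiest to see in tabular form. The sketch below is a minimal illustration assuming a dictionary-backed Q-table keyed by `(state, action)`; the state and action names, the learning rate `alpha`, and the helper names are hypothetical.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99, done=False):
    """Off-policy: bootstrap from the best next action, whatever the agent does next."""
    best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99, done=False):
    """On-policy: bootstrap from the action the agent actually takes next."""
    next_value = 0.0 if done else Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (r + gamma * next_value - Q[(s, a)])


def greedy_action(Q, s, actions):
    """Derived policy: pi(s) = argmax_a Q(s, a)."""
    return max(actions, key=lambda a: Q[(s, a)])


actions = ["left", "right"]
Q = defaultdict(float)  # unseen (state, action) pairs start at 0.0
Q[("s1", "left")] = 1.0
Q[("s1", "right")] = 5.0

# Same transition (s0, right, r=0, s1), two different targets.
Q_ql = defaultdict(float, Q)
q_learning_update(Q_ql, "s0", "right", r=0.0, s_next="s1",
                  actions=actions, alpha=0.5, gamma=0.9)

Q_sarsa = defaultdict(float, Q)
sarsa_update(Q_sarsa, "s0", "right", r=0.0, s_next="s1", a_next="left",
             alpha=0.5, gamma=0.9)

print(Q_ql[("s0", "right")])            # 2.25: target used max over next actions (5.0)
print(Q_sarsa[("s0", "right")])         # 0.45: target used the sampled action "left" (1.0)
print(greedy_action(Q, "s1", actions))  # right
```

When the behaviour policy explores, SARSA's estimates reflect the cost of that exploration, while Q-Learning's estimates describe the greedy policy regardless of how the data was collected.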
5

Policy-based methods

Directly parameterise the policy π_θ(a | s) and optimise θ to maximise the expected return.

  • No value function needed (in pure form).
  • Naturally handles continuous action spaces.
  • Stochastic by default, which helps under partial observability, where the best policy may need to randomise.

Key algorithms:

  • REINFORCE — vanilla policy gradient. High variance.
  • A2C / A3C — actor-critic, adds a value baseline to reduce variance.
  • PPO — clips the probability ratio between the new and old policy, so a single update cannot move the policy too far. A common default: stable and comparatively easy to tune.
  • SAC — off-policy with entropy regularisation. Sample-efficient.
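
As a rough illustration of the clipping idea behind PPO, the per-sample clipped surrogate can be written as a scalar function. This is a sketch, not a training loop: `ratio` stands for π_new(a | s) / π_old(a | s), `advantage` is assumed to come from some separate estimator, and `clip_eps=0.2` is simply a commonly used value.

```python
def ppo_clip_objective(ratio, advantage, clip_eps=0.2):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# Good action (A > 0): raising its probability past 1 + eps earns nothing extra.
print(ppo_clip_objective(1.5, advantage=1.0))   # capped near 1.2
print(ppo_clip_objective(0.5, advantage=1.0))   # 0.5, the unclipped term is lower

# Bad action (A < 0): lowering its probability past 1 - eps earns nothing extra.
print(ppo_clip_objective(0.5, advantage=-1.0))  # capped near -0.8
print(ppo_clip_objective(1.5, advantage=-1.0))  # -1.5, the unclipped term is lower
```

Taking the minimum makes the objective pessimistic: the clip removes the incentive to move further in the helpful direction but never hides a change that makes things worse.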
6

Actor-Critic

The dominant modern architecture combines both:

  • Actor — the policy network. Decides actions.
  • Critic — the value network. Estimates how good the state (or state-action pair) is.

The critic's estimate is used as a baseline to reduce variance in the actor's gradient updates.

PPO, A2C, SAC — all actor-critic under the hood.
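
The variance claim can be checked exactly on a tiny case. The sketch below assumes a hypothetical two-action softmax policy with one fixed reward per action, and enumerates both actions to get the exact mean and variance of the one-sample policy-gradient estimate with and without a baseline. The function name and the numbers are made up for illustration.

```python
def gradient_stats(p1, rewards, baseline):
    """Exact mean and variance of the one-sample gradient estimate
    with respect to the logit of action 1, for a two-action softmax policy."""
    probs = [1.0 - p1, p1]
    # For a softmax policy, d/d(logit_1) log pi(a) = 1[a == 1] - p1.
    samples = [((1.0 if a == 1 else 0.0) - p1) * (rewards[a] - baseline)
               for a in (0, 1)]
    mean = sum(p * g for p, g in zip(probs, samples))
    var = sum(p * (g - mean) ** 2 for p, g in zip(probs, samples))
    return mean, var


p1 = 0.5
rewards = [10.0, 11.0]                               # action 1 is slightly better
state_value = (1.0 - p1) * rewards[0] + p1 * rewards[1]  # what an ideal critic would output

print(gradient_stats(p1, rewards, baseline=0.0))          # (0.25, 27.5625)
print(gradient_stats(p1, rewards, baseline=state_value))  # (0.25, 0.0)
```

Both estimators point the same way on average, so the baseline does not bias the gradient. It only removes the large shared offset in the rewards that was dominating each individual sample.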

7

Reward shaping

The reward function defines the problem. Designing it is most of the work.

  • Sparse rewards (only at the goal) → learning is slow or stalls, because the agent rarely sees any signal. Long credit-assignment chain.
  • Dense rewards (every step) → fast learning, but you can shape behaviour you didn't intend.
  • Shaped rewards can produce reward hacking: the agent finds a loophole that scores well without doing the intended task.
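
A small example makes the loophole concrete. The sketch below compares a naive dense bonus with potential-based shaping, a technique brought in here for illustration and not mentioned above. The corridor, the `GOAL` cell, the `potential` function, and both bonus functions are hypothetical.

```python
GOAL = 4  # hypothetical goal cell in a 1-D corridor


def naive_bonus(s, s_next):
    """Naive dense reward: +1 for every step to the right."""
    return 1.0 if s_next > s else 0.0


def potential(s):
    """Assumed potential function: negative distance to the goal."""
    return -abs(GOAL - s)


def potential_based_bonus(s, s_next, gamma=1.0):
    """Shaping term of the form gamma * phi(s') - phi(s)."""
    return gamma * potential(s_next) - potential(s)


def total(bonus, path):
    """Sum a bonus function over consecutive (s, s_next) pairs of a path."""
    return sum(bonus(s, s_next) for s, s_next in zip(path, path[1:]))


direct = [0, 1, 2, 3, 4]              # walks straight to the goal
dither = [0, 1, 0, 1, 0, 1, 0, 1, 0]  # oscillates and never arrives

print(total(naive_bonus, direct), total(naive_bonus, dither))                      # 4.0 4.0
print(total(potential_based_bonus, direct), total(potential_based_bonus, dither))  # 4.0 0.0
```

Under the naive bonus the oscillating path already matches the direct one and would overtake it by oscillating longer, which is exactly a loophole. With the potential-based form and `gamma=1.0`, the bonuses telescope to `potential(end) - potential(start)`, so going back and forth earns nothing.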

Rules of thumb:

8

Common environments

Practice grounds:

  • Gymnasium (formerly OpenAI Gym) — CartPole, MountainCar, Lunar Lander, Atari.
  • PettingZoo — multi-agent.
  • MuJoCo / Brax — physics simulators for continuous control.
  • RoboMaker + Gazebo — robotics-grade simulation. Used in AWS DeepRacer.

Libraries:

  • Stable-Baselines3 — reliable, well-tested implementations of PPO, A2C, SAC, DQN.
  • RLlib (Ray) — distributed RL.