SageMaker (RL training job)
↑
│ policy update
│
RoboMaker (simulation)
└─ Gazebo (physics)
└─ ROS (state messages)
↓
Redis (state + reward storage)
The architecture
- SageMaker holds the RL algorithm (PPO or SAC) + neural-net policy.
- RoboMaker runs the simulated track in Gazebo with the ROS robotics framework.
- Redis mediates state and reward signals between them.
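The data flow can be sketched as a toy loop. This is only an illustration, not DeepRacer's actual code: a `deque` stands in for Redis, and `Experience`, `ExperienceBuffer`, `simulate_step` and `run_rollout` are hypothetical names invented for the example.

```python
import random
from collections import deque
from dataclasses import dataclass


@dataclass
class Experience:
    state: list      # observation produced by the simulator
    action: int      # index of the action the policy chose
    reward: float    # output of the reward function for this step
    done: bool       # True when the episode ends


class ExperienceBuffer:
    """In-memory stand-in for the Redis store (assumption: simple FIFO)."""

    def __init__(self):
        self._queue = deque()

    def push(self, exp):
        self._queue.append(exp)

    def pop_batch(self, size):
        batch = []
        while self._queue and len(batch) < size:
            batch.append(self._queue.popleft())
        return batch

    def __len__(self):
        return len(self._queue)


def simulate_step(rng, step, last_step):
    """Hypothetical simulator step: fake state, random action, dummy reward."""
    state = [rng.random() for _ in range(4)]
    action = rng.randrange(3)
    return Experience(state, action, reward=1.0, done=(step == last_step))


def run_rollout(buffer, steps=10, seed=0):
    """Simulation side: write one episode of experience into the buffer."""
    rng = random.Random(seed)
    for step in range(steps):
        buffer.push(simulate_step(rng, step, last_step=steps - 1))


buffer = ExperienceBuffer()
run_rollout(buffer)                 # RoboMaker side writes
batch = buffer.pop_batch(size=4)    # SageMaker side reads a training batch
```

The point of the indirection is that the simulator and the trainer never call each other directly; each only reads from or writes to the shared store.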
The reward function
The single Python function that decides whether the car learns to drive or learns to spin.
Parameters available at every step:
| Parameter | Meaning |
|---|---|
| progress | % of track completed (0–100). |
| speed | Current car speed (m/s). |
| steering_angle | −30° to +30°. |
| all_wheels_on_track | Boolean. |
| distance_from_center | Distance from track centreline (m). |
| closest_waypoints | Indices of the two nearest waypoints. |
| is_offtrack | Boolean; terminal failure. |
Reward functions are shaping problems — too sparse and nothing learns, too dense and you overfit to the simulator.
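A minimal sketch of a shaped reward built from those parameters follows. It is one possible shaping, not a recommended or official one. It also reads `track_width`, a DeepRacer input that is not in the table above, and the constants `MAX_SPEED`, `SHARP_TURN_DEG` and `FAST_SPEED` are assumptions picked for illustration.

```python
MAX_SPEED = 4.0        # assumed top speed (m/s) used to normalise the speed term
SHARP_TURN_DEG = 20.0  # assumed threshold for a "sharp" steering angle
FAST_SPEED = 3.0       # assumed threshold for "high" speed (m/s)
MIN_REWARD = 1e-3      # small floor so the reward is never exactly zero


def reward_function(params):
    """Reward staying centred and moving, penalise sharp turns at speed."""
    # Off the track: return the floor reward immediately.
    if params["is_offtrack"] or not params["all_wheels_on_track"]:
        return MIN_REWARD

    # Centring term: 1.0 on the centreline, falling linearly to 0.0 at the edge.
    half_width = 0.5 * params["track_width"]
    centring = max(0.0, 1.0 - params["distance_from_center"] / half_width)

    # Speed term: fraction of the assumed top speed, capped at 1.0.
    speed_term = min(params["speed"] / MAX_SPEED, 1.0)

    reward = centring + 0.5 * speed_term

    # Halve the reward for sharp steering at high speed.
    if abs(params["steering_angle"]) > SHARP_TURN_DEG and params["speed"] > FAST_SPEED:
        reward *= 0.5

    return float(max(reward, MIN_REWARD))
```

The centring and speed terms give a signal at every step, which keeps the reward from being too sparse; how heavily each term is weighted is exactly the shaping decision the paragraph above warns about.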
PPO vs SAC
| | PPO | SAC |
|---|---|---|
| Type | On-policy | Off-policy |
| Action space | Discrete or continuous | Continuous |
| Exploration | Stochastic policy | Entropy-regularised |
| Sample efficiency | Lower | Higher |
| Stability | More stable | Sometimes brittle |
| DeepRacer default | ✅ | Option |
Use PPO for first runs — it converges reliably with discrete action spaces. Use SAC when you need smoother continuous steering / throttle and have the compute budget.
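To make the discrete vs continuous distinction concrete, here is a rough sketch. A discrete action space is a fixed menu of (steering, speed) pairs; a continuous one is a pair of ranges the policy samples from. The specific angles, speeds, and the helper names `build_discrete_actions` and `clip_continuous_action` are assumptions for illustration, not DeepRacer's configuration format.

```python
from itertools import product

STEERING_ANGLES_DEG = [-30.0, -15.0, 0.0, 15.0, 30.0]  # assumed menu of angles
SPEEDS = [1.0, 2.0, 3.0]                               # assumed menu of speeds (m/s)


def build_discrete_actions(angles, speeds):
    """Every (angle, speed) combination becomes one selectable action."""
    return [{"steering_angle": a, "speed": s} for a, s in product(angles, speeds)]


def clip_continuous_action(steering_angle, speed,
                           angle_limit=30.0, speed_range=(0.5, 3.0)):
    """Clamp a raw continuous policy output into the allowed ranges."""
    low, high = speed_range
    return {
        "steering_angle": max(-angle_limit, min(angle_limit, steering_angle)),
        "speed": max(low, min(high, speed)),
    }


discrete_actions = build_discrete_actions(STEERING_ANGLES_DEG, SPEEDS)
print(len(discrete_actions), "discrete actions")
print(clip_continuous_action(steering_angle=45.0, speed=5.0))
```

With a discrete menu the policy only has to pick one of a handful of options, which is part of why first runs tend to behave; the continuous version can express any angle in between, at the cost of a harder learning problem.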
Hyperparameters that matter
- Learning rate: too high and the policy oscillates, too low and it stalls. 3e-4 is a sane default.
- Batch size: bigger = smoother gradient, slower per step.
- Discount factor γ: closer to 1 = longer-term thinking. 0.999 for racing, since reward is sparse until lap end.
- Entropy coefficient: knob for exploration vs exploitation.
- Number of epochs per update: too many → over-optimised against current rollouts → instability.
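The effect of the discount factor is easy to check with a little arithmetic. The sketch below uses the standard discounting weight γ^k for a reward k steps ahead and the common rule-of-thumb horizon 1 / (1 − γ); the function names and the 500-step example distance are illustrative choices.

```python
def discount_weight(gamma, steps_ahead):
    """Weight applied to a reward received `steps_ahead` steps in the future."""
    return gamma ** steps_ahead


def effective_horizon(gamma):
    """Rule-of-thumb horizon 1 / (1 - gamma), in steps."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must be in [0, 1)")
    return 1.0 / (1.0 - gamma)


for gamma in (0.9, 0.99, 0.999):
    print(
        f"gamma={gamma}: horizon ~{effective_horizon(gamma):.0f} steps, "
        f"weight of a reward 500 steps away = {discount_weight(gamma, 500):.4f}"
    )
```

At γ = 0.999 a reward 500 steps away keeps roughly 60 % of its weight, while at γ = 0.99 it keeps under 1 %, which is why a value very close to 1 suits rewards that only arrive late in the lap.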
Sim-to-real transfer
The hardest part. The simulator is clean; the real track is messy. Three strategies:
- Domain randomisation — train across many simulated track conditions (lighting, friction, slight track shifts). The policy learns the invariants.
- Robust reward shaping — penalise behaviours that are fragile in physical space (sharp turns at high speed, hugging track edges).
- Conservative deployment — cap maximum speed in the deployed model below the simulator best, expecting real-world to underperform.
The car always drives worse on the real track. The question is by how much, and whether it finishes the lap.
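Two of these strategies are simple enough to sketch. The example below samples randomised track conditions per episode and caps the deployed speed below the simulator best. Everything specific here is an assumption made for illustration: the `SimConditions` fields, the sampling ranges, and the 0.8 safety factor are not values from DeepRacer or from my runs.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SimConditions:
    light_intensity: float  # relative to nominal lighting (1.0)
    friction: float         # surface friction coefficient
    track_shift_m: float    # lateral offset of the track, in metres


def sample_conditions(rng):
    """Domain randomisation: draw one set of conditions per training episode."""
    return SimConditions(
        light_intensity=rng.uniform(0.5, 1.5),
        friction=rng.uniform(0.6, 1.0),
        track_shift_m=rng.uniform(-0.05, 0.05),
    )


def cap_speed(requested_speed, sim_best_speed, safety_factor=0.8):
    """Conservative deployment: never exceed a fraction of the sim's best speed."""
    return min(requested_speed, safety_factor * sim_best_speed)


rng = random.Random(42)
for episode in range(3):
    print(episode, sample_conditions(rng))

print(cap_speed(requested_speed=4.0, sim_best_speed=4.0))  # capped
print(cap_speed(requested_speed=2.0, sim_best_speed=4.0))  # unchanged
```

Seeding the generator keeps the randomised conditions reproducible, so a bad episode can be replayed when debugging.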
What I learned
- Reward shaping is 80 % of the project. The same RL algorithm with two different rewards produces wildly different drivers.
- PPO is the right default. Don't reach for SAC unless you've exhausted PPO tuning.
- Sim-to-real underperforms by 10–30 %. Plan for it.
- The simulator's pretty visuals are the trap. Optimising for the sim's leaderboard ≠ optimising for the real track.