Maria Aguilera

SageMaker (RL training job)
        ↑
        │ policy update
        │
RoboMaker (simulation)
   ├─ Gazebo (physics)
   └─ ROS (state messages)
        ↓
Redis (state + reward storage)

SageMaker holds the RL algorithm (PPO or SAC) and the neural-network policy.
RoboMaker runs the simulated track in Gazebo with the ROS robotics framework.
Redis mediates state and reward signals between them.

The reward function is the single Python function that decides whether the car learns to drive or learns to spin.

Parameters available at every step:

Parameter	Meaning
`progress`	% of track completed (0–100).
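`speed`	Current car speed, in m/s.
`steering_angle`	Steering angle in degrees, −30° to +30° (negative steers right, positive steers left).
`all_wheels_on_track`	Boolean. True only if all wheels are inside the track borders.
`distance_from_center`	Distance from the track centreline, in metres.
`closest_waypoints`	Indices of the two nearest waypoints (the one behind and the one ahead).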
`is_offtrack`	Boolean — terminal failure.

Writing a reward function is a shaping problem: too sparse and the agent gets no learning signal, too dense and the policy overfits to the simulator.
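A minimal sketch of a reward function built from the parameters above, assuming the standard DeepRacer entry point `reward_function(params)`, where `params` is a dict. `track_width` is not in the table but is also one of the DeepRacer input parameters. The thresholds and weights are illustrative assumptions, not tuned values.

```python
def reward_function(params):
    """Reward staying on track and near the centreline, plus a small speed bonus."""
    # Off the track or a wheel over the border: return a near-zero reward.
    if params["is_offtrack"] or not params["all_wheels_on_track"]:
        return 1e-3

    # Normalise the offset by half the track width (0 = centreline, 1 = edge).
    half_width = 0.5 * params["track_width"]
    offset = min(params["distance_from_center"] / half_width, 1.0)

    # Dense shaping term: 1.0 on the centreline, falling linearly to 0.1 at the edge.
    centre_reward = 1.0 - 0.9 * offset

    # Small speed bonus so the car does not learn to crawl (weight is an assumption).
    speed_bonus = 0.1 * params["speed"]

    return float(centre_reward + speed_bonus)
```

The centreline term keeps the signal dense enough to learn from early on. The speed bonus is deliberately small so it does not outweigh staying on the track.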

	PPO	SAC
Type	On-policy	Off-policy
Action space	Discrete or continuous	Continuous
Exploration	Stochastic policy	Entropy-regularised
Sample efficiency	Lower	Higher
Stability	More stable	Sometimes brittle
DeepRacer default	✅	Option

Use PPO for first runs — it converges reliably with discrete action spaces. Use SAC when you need smoother continuous steering / throttle and have the compute budget.
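To make "discrete action space" concrete, the sketch below enumerates steering/speed pairs on a grid. The function name, the grid size and the speed values are assumptions for illustration. Only the ±30° steering limit comes from the parameter table.

```python
from itertools import product


def build_discrete_actions(steering_levels=5, speeds=(1.0, 2.0, 3.0), max_steering=30.0):
    """Return a list of {steering_angle, speed} actions on an evenly spaced grid."""
    if steering_levels < 2:
        raise ValueError("need at least two steering levels")
    # Evenly spaced angles from -max_steering to +max_steering, inclusive.
    step = 2 * max_steering / (steering_levels - 1)
    angles = [-max_steering + i * step for i in range(steering_levels)]
    # Every combination of angle and speed becomes one discrete action.
    return [{"steering_angle": a, "speed": s} for a, s in product(angles, speeds)]


actions = build_discrete_actions()
print(len(actions))  # 5 angles x 3 speeds
```

A smaller grid trains faster but gives coarser steering. A continuous action space under SAC removes that trade-off at the cost of more tuning.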

Learning rate — too high and the policy oscillates, too low and it stalls. 3e-4 is a sane default.
Batch size — bigger = smoother gradient, slower per step.
Discount factor γ — closer to 1 = longer-term thinking. 0.999 for racing, since reward is sparse until lap end.
Entropy coefficient — knob for exploration vs exploitation.
Number of epochs per update — too many → over-optimised against current rollouts → instability.
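The effect of the discount factor is easier to see as arithmetic. The sketch below uses the common rule of thumb that the effective planning horizon is about 1 / (1 − γ) steps. The helper names are made up for illustration.

```python
import math


def effective_horizon(gamma):
    """Rule-of-thumb number of steps the agent looks ahead: 1 / (1 - gamma)."""
    return 1.0 / (1.0 - gamma)


def steps_to_half_weight(gamma):
    """Number of steps after which a future reward counts for half its value."""
    return math.log(0.5) / math.log(gamma)


for gamma in (0.9, 0.99, 0.999):
    print(gamma, round(effective_horizon(gamma)), round(steps_to_half_weight(gamma)))
```

With γ = 0.999 the horizon is roughly 1,000 steps, so a reward that only arrives near the end of a lap still reaches early decisions. With γ = 0.9 the horizon is about 10 steps.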

Sim-to-real transfer is the hardest part. The simulator is clean; the real track is messy. Three strategies:

Domain randomisation — train across many simulated track conditions (lighting, friction, slight track shifts). The policy learns the invariants.
Robust reward shaping — penalise behaviours that are fragile in physical space (sharp turns at high speed, hugging track edges).
Conservative deployment — cap maximum speed in the deployed model below the simulator best, expecting real-world to underperform.
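The robust-reward-shaping strategy above could be sketched as a multiplier applied to whatever base reward is in use. The function name and all three thresholds are assumptions for illustration. `track_width` is a DeepRacer input parameter that the table above does not list.

```python
def fragile_behaviour_multiplier(params, speed_limit=2.5, steering_limit=15.0, edge_fraction=0.8):
    """Return a factor in (0, 1] that shrinks the reward for fragile driving."""
    multiplier = 1.0

    # Sharp turn at high speed: likely to slide on a real surface.
    if params["speed"] > speed_limit and abs(params["steering_angle"]) > steering_limit:
        multiplier *= 0.5

    # Hugging the track edge: leaves no margin for real-world noise.
    half_width = 0.5 * params["track_width"]
    if params["distance_from_center"] > edge_fraction * half_width:
        multiplier *= 0.5

    return multiplier
```

Usage would be `reward = base_reward * fragile_behaviour_multiplier(params)` inside the reward function. A multiplier keeps the reward positive, unlike a subtracted penalty.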

The car always drives worse on the real track. The question is by how much, and whether it finishes the lap.

Reward shaping is 80 % of the project. The same RL algorithm with two different rewards produces wildly different drivers.
PPO is the right default. Don't reach for SAC unless you've exhausted PPO tuning.
The real car underperforms the simulated one by 10–30 %. Plan for it.
The simulator's pretty visuals are the trap. Optimising for the sim's leaderboard ≠ optimising for the real track.

Training AWS DeepRacer — Cheat Sheet

The architecture

The reward function

PPO vs SAC

Hyperparameters that matter

Sim-to-real transfer

What I learned