Transition dynamics tell us how the world works. But knowing how actions affect states is not enough—we need to specify what the agent should achieve. This is the role of the reward function.
The reward function is deceptively simple in its mathematical form—just a real number $R(s, a, s')$ for each transition—but its proper design is one of the most challenging and philosophically deep aspects of reinforcement learning.
Rewards are the only feedback an RL agent receives about the quality of its behavior. Everything the agent learns about what it should do comes from this single scalar signal. As the aphorism goes:
"The reward hypothesis: All goals can be described by the maximization of expected cumulative reward."
Whether this hypothesis is true is debated, but it is the foundation upon which all of RL is built.
By the end of this page, you will understand reward functions in depth—their formal definition, common design patterns, the challenges of sparse vs. dense rewards, the art of reward shaping, potential-based shaping that preserves optimality, and the philosophical challenges of reward specification. You'll have practical skills for designing rewards that lead to desired behavior.
The reward function defines the immediate feedback received after each transition. There are several common formulations:
Transition-based reward: $$R: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow \mathbb{R}$$ $$R(s, a, s')$$ Reward depends on the full transition: starting state, action taken, and resulting state.
State-action reward: $$R: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$$ $$R(s, a) = \mathbb{E}_{s' \sim P(\cdot|s,a)}[R(s, a, s')]$$ Reward depends only on state and action (expected over next states).
State-only reward: $$R: \mathcal{S} \rightarrow \mathbb{R}$$ $$R(s)$$ Reward for being in a state (regardless of how you got there or what you do).
The transition-based form is the most general: the others follow from it by taking expectations over next states (and, for state-only rewards, by dropping the dependence on the action). Because value functions and optimal policies depend only on the expected reward $R(s, a)$, the first two formulations are interchangeable for planning purposes, and the state-only form is a convenient special case when the reward does not depend on the action; a short numerical example of the reduction follows below.
| Formulation | Notation | When to Use |
|---|---|---|
| Full (transition) | $R(s, a, s')$ | When reward depends on outcome (e.g., goal reached) |
| Action-based | $R(s, a)$ | When action cost matters (e.g., fuel consumption) |
| State-only | $R(s)$ | When state quality is intrinsic (e.g., health level) |
When the reward is stochastic, we denote the random variable as $R_t$ and write $r_t$ for a specific realized value. The expected reward is $\bar{R}(s, a, s') = \mathbb{E}[R_t | S_t=s, A_t=a, S_{t+1}=s']$. Most theoretical analysis uses expected rewards, but practical implementations often encounter noisy reward signals.
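To make the reduction concrete, here is a minimal sketch; the two-state, two-action MDP and all of its numbers are made up for illustration, and only NumPy is assumed.

```python
import numpy as np

# Reduce a transition-based reward R(s, a, s') to a state-action reward
# R(s, a) = E_{s' ~ P(.|s,a)}[R(s, a, s')] for a toy 2-state, 2-action MDP.
P = np.array([                      # P[s, a, s'] = transition probability
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.0, 1.0]],
])
R_sas = np.array([                  # R[s, a, s'] = transition-based reward
    [[0.0, 1.0], [0.0, 1.0]],
    [[-1.0, 0.0], [-1.0, 0.0]],
])

# Expectation over next states collapses the last axis
R_sa = np.einsum("ijk,ijk->ij", P, R_sas)
print(R_sa)                         # e.g. R_sa[0, 0] = 0.8*0.0 + 0.2*1.0 = 0.2
```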
Different problem structures call for different reward patterns. Understanding these patterns helps you design effective reward functions:
1. Goal-Based Rewards: Reward upon reaching a goal state, often with zero reward elsewhere. $$R(s, a, s') = \begin{cases} +1 & \text{if } s' \in \mathcal{G} \\ 0 & \text{otherwise} \end{cases}$$ Examples: Reaching a target, winning a game, completing a task.
2. Penalty-Based Rewards: Continuous penalty to encourage efficiency. $$R(s, a, s') = -1 \text{ for every step}$$ Agent learns to reach the goal as quickly as possible.
3. Cost-Based Rewards: Negative rewards for resource consumption. $$R(s, a, s') = -\text{cost}(a)$$ Examples: Fuel consumption, energy usage, financial cost.
4. Shaping Rewards: Supplement sparse goal rewards with continuous guidance. $$R_{\text{total}}(s, a, s') = R_{\text{goal}}(s, a, s') + \gamma\,\phi(s') - \phi(s)$$ Potential-based shaping adds learning signal without changing the optimal policy.
```python
import numpy as np
from typing import Set, Callable
from dataclasses import dataclass

# ================================================
# Pattern 1: Goal-Based Reward
# ================================================
class GoalReward:
    """
    Binary reward: +1 for reaching goal, 0 otherwise.
    Classic pattern for navigation, puzzles, games.
    """
    def __init__(self, goal_states: Set[int], goal_reward: float = 1.0):
        self.goal_states = goal_states
        self.goal_reward = goal_reward

    def __call__(self, s: int, a: int, s_prime: int) -> float:
        if s_prime in self.goal_states:
            return self.goal_reward
        return 0.0

# ================================================
# Pattern 2: Step Penalty
# ================================================
class StepPenaltyReward:
    """
    Constant penalty per step to encourage efficiency.
    Agent learns to solve task quickly.
    """
    def __init__(
        self,
        goal_states: Set[int],
        step_penalty: float = -1.0,
        goal_bonus: float = 0.0
    ):
        self.goal_states = goal_states
        self.step_penalty = step_penalty
        self.goal_bonus = goal_bonus

    def __call__(self, s: int, a: int, s_prime: int) -> float:
        if s_prime in self.goal_states:
            return self.goal_bonus  # Often 0: ending the penalties is itself the reward
        return self.step_penalty

# ================================================
# Pattern 3: Distance-Based Shaping
# ================================================
class DistanceShapingReward:
    """
    Reward based on progress toward goal.
    R = γ·φ(s') - φ(s) where φ(s) = -distance(s, goal)
    This is potential-based shaping that doesn't change Q*.
    """
    def __init__(
        self,
        distance_fn: Callable[[int], float],
        goal_states: Set[int],
        gamma: float = 0.99,
        goal_reward: float = 10.0
    ):
        self.distance_fn = distance_fn
        self.goal_states = goal_states
        self.gamma = gamma
        self.goal_reward = goal_reward

    def potential(self, s: int) -> float:
        """Potential function: negative distance to goal."""
        return -self.distance_fn(s)

    def __call__(self, s: int, a: int, s_prime: int) -> float:
        # Base goal reward
        r_task = self.goal_reward if s_prime in self.goal_states else 0.0

        # Potential-based shaping (preserves optimal policy)
        if s_prime in self.goal_states:
            r_shaping = -self.potential(s)  # Terminal: φ(s') = 0
        else:
            r_shaping = self.gamma * self.potential(s_prime) - self.potential(s)

        return r_task + r_shaping

# ================================================
# Pattern 4: Multi-Objective Reward
# ================================================
@dataclass
class MultiObjectiveReward:
    """
    Weighted combination of multiple reward components.
    Common in complex tasks with multiple objectives.
    Example: Robot task = reach_goal - energy - collision_risk
    """
    weights: dict     # Component name -> weight
    components: dict  # Component name -> reward function

    def __call__(self, s: int, a: int, s_prime: int) -> float:
        total = 0.0
        for name, reward_fn in self.components.items():
            weight = self.weights.get(name, 1.0)
            total += weight * reward_fn(s, a, s_prime)
        return total

# ================================================
# Pattern 5: Survival Reward (Keep-Alive)
# ================================================
class SurvivalReward:
    """
    Positive reward for staying alive, penalty for death.
    Common in games and control tasks.
    """
    def __init__(
        self,
        death_states: Set[int],
        alive_reward: float = 0.1,
        death_penalty: float = -10.0
    ):
        self.death_states = death_states
        self.alive_reward = alive_reward
        self.death_penalty = death_penalty

    def __call__(self, s: int, a: int, s_prime: int) -> float:
        if s_prime in self.death_states:
            return self.death_penalty
        return self.alive_reward

# ================================================
# Demonstration: Gridworld with Different Rewards
# ================================================
def demo_reward_effects():
    """
    Show how different reward functions lead to different behaviors.
    """
    # Simple 1D gridworld: states 0, 1, 2, 3, 4 (goal at 4)
    goal = {4}
    death = {0}  # Falling off left edge

    # Create different reward functions
    goal_only = GoalReward(goal)
    step_penalty = StepPenaltyReward(goal, step_penalty=-0.1)

    distance_fn = lambda s: abs(s - 4)  # Manhattan distance to goal
    shaped = DistanceShapingReward(distance_fn, goal, gamma=0.99)

    survival = SurvivalReward(death, alive_reward=0.01, death_penalty=-1.0)

    # Compare rewards for the same transition: s=2, a=RIGHT (action index 0), s'=3
    s, s_prime = 2, 3
    print("Rewards for transition s=2 → s=3:")
    print(f"  Goal only:       {goal_only(s, 0, s_prime):+.3f}")
    print(f"  Step penalty:    {step_penalty(s, 0, s_prime):+.3f}")
    print(f"  Distance shaped: {shaped(s, 0, s_prime):+.3f}")
    print(f"  Survival:        {survival(s, 0, s_prime):+.3f}")

    # Transition to goal: s=3 → s'=4
    s, s_prime = 3, 4
    print("\nRewards for transition s=3 → s=4 (goal):")
    print(f"  Goal only:       {goal_only(s, 0, s_prime):+.3f}")
    print(f"  Step penalty:    {step_penalty(s, 0, s_prime):+.3f}")
    print(f"  Distance shaped: {shaped(s, 0, s_prime):+.3f}")
    print(f"  Survival:        {survival(s, 0, s_prime):+.3f}")

demo_reward_effects()
```

One of the most fundamental distinctions in reward design is between sparse and dense rewards:
Sparse Rewards: Non-zero feedback arrives only at rare events, such as +1 on reaching the goal and 0 everywhere else. They are easy to specify correctly, but the agent may act for thousands of steps before receiving any learning signal.
Dense Rewards: Informative feedback arrives at most or every step, such as per-step costs or distance-based progress signals. They give the agent constant guidance, but every such signal encodes designer assumptions that can be gamed.
The sparse vs. dense trade-off is fundamental because sparse rewards are often easier to specify correctly but harder to learn from, while dense rewards are easier to learn from but harder to specify without introducing artifacts.
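The practical consequence is easy to see in simulation. The sketch below is illustrative only: a hypothetical 20-state chain with the goal at the right end and a purely random-walk agent, counting how often each reward type returns a non-zero value.

```python
import numpy as np

# Count non-zero reward signals a random-walk agent sees on a 1-D chain.
rng = np.random.default_rng(0)
N, GOAL, STEPS = 20, 19, 200

def count_signals(reward_fn):
    s, signals = N // 2, 0
    for _ in range(STEPS):
        s_next = int(np.clip(s + rng.choice([-1, 1]), 0, N - 1))
        if reward_fn(s, s_next) != 0.0:
            signals += 1
        if s_next == GOAL:          # episode ends at the goal; restart in the middle
            s_next = N // 2
        s = s_next
    return signals

sparse = lambda s, sp: 1.0 if sp == GOAL else 0.0        # goal-only reward
dense = lambda s, sp: abs(s - GOAL) - abs(sp - GOAL)     # progress toward goal

print("non-zero signals, sparse:", count_signals(sparse))  # a handful at best
print("non-zero signals, dense: ", count_signals(dense))   # nearly every step
```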
In Montezuma's Revenge, the classic hard-exploration Atari game, a random policy scores ≈ 0 because it never stumbles upon a reward. DQN and many other standard algorithms also score 0 on this game without specialized exploration. This revealed a fundamental limitation of standard RL: sparse rewards require directed exploration.
Reward shaping is the practice of adding supplementary rewards to guide learning while (ideally) preserving the optimal policy. The idea is intuitive: if the sparse goal reward is +1 at the finish line, why not add smaller rewards for intermediate progress?
Naive shaping is dangerous:
Adding arbitrary shaping rewards $F(s, a, s')$ to the original reward changes the MDP: $$R'(s, a, s') = R(s, a, s') + F(s, a, s')$$
The optimal policy for $R'$ may differ from the optimal policy for $R$. The agent maximizes what you actually reward, not what you intended to reward.
Example of shaping gone wrong:
Goal: Teach a soccer-playing agent to score goals. Naive shaping: Add reward for moving toward the ball. Result: The agent learns to approach and hover near the ball but never shoots, because approaching yields a steady stream of reward while shooting (which may miss) risks cutting that stream off.
This is called reward hacking—the agent finds an unintended way to maximize the shaped reward.
Reward hacking is pervasive. Boat-racing agents learn to spin in circles collecting small rewards instead of finishing the race. Cleaning robots knock over chairs to reduce the amount of floor to clean. Every shaping reward you add is a potential exploit. The only truly safe rewards are those that exactly specify what you want—but those are often sparse and hard to learn.
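To make the failure mode concrete, here is a minimal sketch of the soccer example as a made-up three-state MDP (the states, transition probabilities, and the +0.5 "moved toward the ball" bonus are all illustrative). Value iteration under the true reward learns to shoot; adding the naive, non-potential bonus flips the optimal policy to circling the ball forever.

```python
import numpy as np

# Toy "soccer" MDP: 0 = FAR from ball, 1 = NEAR ball, 2 = SCORED (terminal).
# Actions: 0 = APPROACH, 1 = SHOOT.
GAMMA = 0.99
P = {  # P[s][a] = list of (probability, next_state, true_reward)
    0: {0: [(1.0, 1, 0.0)],              # approach: FAR -> NEAR
        1: [(1.0, 0, 0.0)]},             # shooting from far away does nothing
    1: {0: [(1.0, 1, 0.0)],              # keep circling next to the ball
        1: [(0.3, 2, 10.0),              # shoot: score with probability 0.3
            (0.7, 0, 0.0)]},             # miss: ball cleared, back to FAR
}

def solve(approach_bonus=0.0, iters=3000):
    """Value iteration when every APPROACH action also earns `approach_bonus`."""
    V = np.zeros(3)                      # V[2] stays 0 (terminal)
    q = lambda s, a: sum(
        p * ((approach_bonus if a == 0 else 0.0) + r + GAMMA * V[sp])
        for p, sp, r in P[s][a]
    )
    for _ in range(iters):
        for s in (0, 1):
            V[s] = max(q(s, 0), q(s, 1))
    return {s: ("APPROACH", "SHOOT")[int(q(s, 1) > q(s, 0))] for s in (0, 1)}

print("True reward only:   ", solve(0.0))  # state 1 (NEAR) -> SHOOT
print("Naive +0.5 shaping: ", solve(0.5))  # state 1 (NEAR) -> APPROACH forever
```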
A remarkable theoretical result guarantees that potential-based shaping does not change the optimal policy:
Theorem (Ng, Harada, Russell, 1999):
If the shaping reward has the form: $$F(s, a, s') = \gamma \phi(s') - \phi(s)$$
where $\phi: \mathcal{S} \rightarrow \mathbb{R}$ is any real-valued potential function, then the optimal policy under the shaped reward $R + F$ is the same as under the original reward $R$. In fact, the ranking of all policies is preserved, and the optimal values shift by exactly the potential: $Q'^*(s, a) = Q^*(s, a) - \phi(s)$ and $V'^*(s) = V^*(s) - \phi(s)$.
Why does this work?
The key insight is that potential-based shaping is telescoping. Along any trajectory, the discounted sum of shaping rewards is: $$\sum_{t=0}^{T-1} \gamma^t F(s_t, a_t, s_{t+1}) = \left(\gamma \phi(s_1) - \phi(s_0)\right) + \left(\gamma^2 \phi(s_2) - \gamma \phi(s_1)\right) + \ldots$$
This telescopes to $\gamma^T \phi(s_T) - \phi(s_0)$, which depends only on the start and end states—not on the path taken. With the usual convention that terminal states have zero potential (or with $\gamma < 1$ as $T \to \infty$, so the final term vanishes), the shaping contribution reduces to $-\phi(s_0)$. Comparing any two policies from the same start state, the shaping contribution is therefore identical, leaving their ranking unchanged.
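Before the full implementation below, here is a quick numerical sanity check of the telescoping identity; the potential values and the trajectory are arbitrary, drawn at random over ten hypothetical integer states.

```python
import numpy as np

# Sanity check: the discounted sum of F(s_t, s_{t+1}) = γ·φ(s_{t+1}) - φ(s_t)
# collapses to γ^T·φ(s_T) - φ(s_0), regardless of the path in between.
rng = np.random.default_rng(0)
gamma = 0.9
phi = rng.normal(size=10)            # arbitrary potential over 10 states
traj = rng.integers(0, 10, size=8)   # random state sequence s_0, ..., s_7
T = len(traj) - 1

shaping_sum = sum(
    gamma**t * (gamma * phi[traj[t + 1]] - phi[traj[t]]) for t in range(T)
)
closed_form = gamma**T * phi[traj[-1]] - phi[traj[0]]

print(shaping_sum, closed_form)      # identical up to floating-point error
assert np.isclose(shaping_sum, closed_form)
```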
```python
import numpy as np
from typing import Callable, Tuple

class PotentialBasedShaping:
    """
    Implements potential-based reward shaping that preserves the optimal policy.
    F(s, a, s') = γ·φ(s') - φ(s)
    This is the ONLY form of shaping guaranteed not to change π*.
    """
    def __init__(
        self,
        base_reward: Callable,
        potential_fn: Callable[[int], float],
        gamma: float = 0.99
    ):
        """
        Args:
            base_reward: Original reward function R(s, a, s')
            potential_fn: φ(s) - higher potential = "closer to goal"
            gamma: Discount factor (must match the MDP's γ)
        """
        self.base_reward = base_reward
        self.potential = potential_fn
        self.gamma = gamma

    def shaping_reward(
        self, s: int, a: int, s_prime: int, terminal: bool = False
    ) -> float:
        """
        Compute the shaping reward F(s, a, s').
        For terminal states, φ(s') = 0 by convention.
        """
        phi_s = self.potential(s)
        if terminal:
            phi_s_prime = 0.0  # Terminal potential is 0
        else:
            phi_s_prime = self.potential(s_prime)
        return self.gamma * phi_s_prime - phi_s

    def total_reward(
        self, s: int, a: int, s_prime: int, terminal: bool = False
    ) -> float:
        """Total reward = base + shaping."""
        return (
            self.base_reward(s, a, s_prime)
            + self.shaping_reward(s, a, s_prime, terminal)
        )

# ================================================
# Example: Gridworld with Distance-Based Potential
# ================================================
"""
Gridworld navigation where we shape with distance to goal.

Key insight: φ(s) = -distance(s, goal)
This makes states closer to the goal have higher potential.
"""

class ShapedGridworld:
    def __init__(self, grid_size: int = 5, goal: Tuple[int, int] = (4, 4)):
        self.size = grid_size
        self.goal = goal
        self.gamma = 0.99

        # Base reward: +1 at goal only (sparse!)
        self.base_reward = lambda s, a, sp: 1.0 if sp == self._state_idx(goal) else 0.0

        # Potential: negative Manhattan distance to goal
        def potential(s):
            row, col = s // self.size, s % self.size
            dist = abs(row - goal[0]) + abs(col - goal[1])
            return -dist  # Negative distance = higher potential near goal

        self.shaping = PotentialBasedShaping(
            self.base_reward, potential, self.gamma
        )

    def _state_idx(self, pos: Tuple[int, int]) -> int:
        return pos[0] * self.size + pos[1]

    def step_reward(self, s: int, a: int, s_prime: int) -> float:
        """Get shaped reward for a transition."""
        is_terminal = s_prime == self._state_idx(self.goal)
        return self.shaping.total_reward(s, a, s_prime, is_terminal)

    def demo_trajectory_rewards(self):
        """Show rewards along two paths to the goal."""
        # Path 1: Direct diagonal (optimal)
        path1 = [(0, 0), (1, 0), (2, 1), (3, 2), (4, 3), (4, 4)]

        # Path 2: Go around (suboptimal)
        path2 = [(0, 0), (0, 1), (0, 2), (0, 3), (0, 4),
                 (1, 4), (2, 4), (3, 4), (4, 4)]

        for name, path in [("Direct", path1), ("Around", path2)]:
            total = 0.0
            discounted = 0.0
            for i in range(len(path) - 1):
                s = self._state_idx(path[i])
                s_prime = self._state_idx(path[i + 1])
                r = self.step_reward(s, 0, s_prime)
                discounted += (self.gamma ** i) * r
                total += r
            print(f"{name} path: total_reward={total:.3f}, "
                  f"discounted={discounted:.3f}, steps={len(path) - 1}")

# Run demo
env = ShapedGridworld()
env.demo_trajectory_rewards()

print("\n✓ Notice: Despite different total rewards,")
print("  optimal policy is unchanged - both reach goal,")
print("  but shorter path has higher DISCOUNTED return.")
```

Good potential functions encode domain knowledge about "progress toward the goal." Common choices: negative distance to goal, number of subgoals completed, game score, or even learned value function estimates from previous runs.
The potential should be high for good states and low for bad states—but any function works in terms of preserving optimality.
Individual rewards are not optimized directly. Instead, agents maximize the return—the cumulative reward over a trajectory:
Finite-horizon return: $$G = \sum_{t=0}^{T-1} R_t = R_0 + R_1 + \ldots + R_{T-1}$$
Infinite-horizon discounted return: $$G = \sum_{t=0}^{\infty} \gamma^t R_t = R_0 + \gamma R_1 + \gamma^2 R_2 + \ldots$$
The goal of RL is to find a policy $\pi$ that maximizes expected return: $$\max_\pi \mathbb{E}_{\tau \sim \pi} [G(\tau)]$$
where $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots)$ is a trajectory sampled by following policy $\pi$.
Important insight:
The return is a random variable—it depends on the stochasticity in the environment dynamics and (possibly) the policy. Different runs of the same policy yield different returns. We optimize the expected return, averaging over all possible trajectories.
```python
import numpy as np
from typing import List, Tuple

def compute_return(
    rewards: List[float],
    gamma: float = 1.0
) -> float:
    """
    Compute the (discounted) return from a sequence of rewards.

    G = Σ γ^t R_t

    Args:
        rewards: [r_0, r_1, ..., r_{T-1}]
        gamma: Discount factor (1.0 for undiscounted)

    Returns:
        Discounted return
    """
    G = 0.0
    for t, r in enumerate(rewards):
        G += (gamma ** t) * r
    return G

def compute_returns_to_go(
    rewards: np.ndarray,
    gamma: float = 0.99
) -> np.ndarray:
    """
    Compute the return-to-go for each timestep in a single backward pass.

    G_t = R_t + γ R_{t+1} + γ² R_{t+2} + ...

    Useful for policy gradient methods.
    """
    T = len(rewards)
    returns = np.zeros(T)
    # Work backwards: each return reuses the one after it
    G = 0.0
    for t in reversed(range(T)):
        G = rewards[t] + gamma * G
        returns[t] = G
    return returns

def monte_carlo_return_estimate(
    env,
    policy,
    gamma: float,
    n_episodes: int = 100
) -> Tuple[float, float]:
    """
    Estimate expected return via Monte Carlo sampling.

    Returns:
        (mean_return, std_return)
    """
    returns = []
    for _ in range(n_episodes):
        state = env.reset()
        episode_rewards = []
        done = False
        while not done:
            action = policy(state)
            state, reward, done, _ = env.step(action)
            episode_rewards.append(reward)
        G = compute_return(episode_rewards, gamma)
        returns.append(G)
    return np.mean(returns), np.std(returns)

# ================================================
# Example: Compare policies by expected return
# ================================================
def demo_return_comparison():
    """
    Two policies, same start, different expected returns.
    """
    # Simulate simple episodes
    np.random.seed(42)

    # Policy A: Conservative, small consistent rewards
    # Each step: R ~ N(1, 0.5)
    policy_a_episode = [np.random.normal(1, 0.5) for _ in range(10)]

    # Policy B: Aggressive, high-variance rewards
    # Each step: R ~ N(0.5, 2.0)
    policy_b_episode = [np.random.normal(0.5, 2.0) for _ in range(10)]

    gamma = 0.95
    G_a = compute_return(policy_a_episode, gamma)
    G_b = compute_return(policy_b_episode, gamma)

    print("Single episode comparison:")
    print(f"  Policy A return: {G_a:.2f}")
    print(f"  Policy B return: {G_b:.2f}")

    # Monte Carlo estimate with many episodes
    n_episodes = 1000
    returns_a = [compute_return(
        [np.random.normal(1, 0.5) for _ in range(10)], gamma
    ) for _ in range(n_episodes)]
    returns_b = [compute_return(
        [np.random.normal(0.5, 2.0) for _ in range(10)], gamma
    ) for _ in range(n_episodes)]

    print(f"\nMonte Carlo estimates ({n_episodes} episodes):")
    print(f"  Policy A: mean={np.mean(returns_a):.2f}, std={np.std(returns_a):.2f}")
    print(f"  Policy B: mean={np.mean(returns_b):.2f}, std={np.std(returns_b):.2f}")
    print("  Policy A is better in expectation!")

demo_return_comparison()
```

Designing reward functions that capture exactly what we want is surprisingly difficult—a problem that has significant implications for AI safety:
Specification Gaming:
Agents optimize the reward function you provide, not the objective you intended. As the reward-hacking examples above show, the two frequently differ.
Goodhart's Law:
"When a measure becomes a target, it ceases to be a good measure."
The act of optimizing a proxy for the true objective distorts that proxy. Rewards that seem to capture the objective become exploited once optimized against.
Reward Tampering:
Sufficiently advanced agents might directly modify their reward signal: for example, by interfering with the sensors or code that compute the reward rather than doing the task the reward was meant to measure.
This is a research frontier in AI alignment: how do we specify rewards that remain meaningful under optimization pressure?
Even if we specify the perfect reward function, the agent's learned objective (the function it actually optimizes after training) may differ from the intended objective. This 'mesa-optimization' problem is a frontier challenge in AI safety: ensuring the agent's internal goals match the external reward signal.
Intrinsic motivation addresses the sparse reward problem by generating internal reward signals based on the agent's own learning progress or novelty-seeking behavior:
Curiosity / Novelty: Reward for visiting states the agent hasn't seen before or that are hard to predict. $$R_{\text{intrinsic}} = \|\hat{s}_{t+1} - s_{t+1}\|^2$$ High prediction error = novel = rewarding.
Information Gain: Reward for reducing uncertainty about the environment. $$R_{\text{intrinsic}} = D_{KL}[p(\theta | \text{history}) || p(\theta)]$$ Transitions that are maximally informative about environment parameters.
Count-Based Exploration: Reward inversely proportional to visit frequency. $$R_{\text{intrinsic}} = \frac{1}{\sqrt{N(s)}}$$ Encourages visiting less-explored states.
Empowerment: Reward for states where the agent has control over future outcomes. $$R_{\text{intrinsic}} = I(a_t; s_{t+1} | s_t)$$ Mutual information between action and next state.
These intrinsic rewards supplement sparse extrinsic rewards, providing learning signal even when task reward is absent.
| Method | Reward Signal | Best For |
|---|---|---|
| ICM (Curiosity) | Forward model prediction error | Video games, exploration tasks |
| RND | Fixed random network prediction error | Hard exploration (Montezuma) |
| Count-based | $1/\sqrt{N(s)}$ or pseudo-counts | Tabular & discrete states |
| Empowerment | Mutual information $I(a; s'|s)$ | Robotics, skill discovery |
| Surprise | Unexpected transitions | Model-based exploration |
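As a concrete example of the simplest entry in the table, here is a minimal count-based bonus sketch; the `CountBonus` name and the $\beta$ scale are illustrative choices, not taken from any particular library.

```python
import numpy as np
from collections import defaultdict

class CountBonus:
    """Intrinsic bonus R_int(s) = beta / sqrt(N(s)), decaying with visit count."""
    def __init__(self, beta: float = 0.1):
        self.beta = beta
        self.counts = defaultdict(int)

    def __call__(self, state) -> float:
        self.counts[state] += 1
        return self.beta / np.sqrt(self.counts[state])

bonus = CountBonus(beta=0.1)
for s in [0, 0, 0, 1]:
    print(f"state {s}: intrinsic bonus = {bonus(s):.3f}")
# Repeated visits to state 0 earn shrinking bonuses; the first visit to state 1
# earns the full bonus again, nudging the agent toward novelty.
```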
We have explored the reward function—the component that transforms MDPs from descriptions of dynamics into optimization problems with clear objectives. Reward design is both an art and a science, with profound implications for agent behavior.
What's Next:
With rewards defined, we must address how much to value future rewards versus immediate ones. This is the role of the discount factor $\gamma$—a seemingly simple parameter with profound implications for agent behavior and mathematical tractability.
You now understand reward functions in depth—from formal definitions to practical design patterns, from the sparse-dense trade-off to potential-based shaping, from specification challenges to intrinsic motivation. This knowledge is essential for designing RL systems that actually learn the behaviors you want.