Transition dynamics tell us how the world works. But knowing how actions affect states is not enough—we need to specify what the agent should achieve. This is the role of the reward function.
The reward function is deceptively simple in its mathematical form—just a real number $R(s, a, s')$ for each transition—but its proper design is one of the most challenging and philosophically deep aspects of reinforcement learning.
Rewards are the only feedback an RL agent receives about the quality of its behavior. Everything the agent learns about what it should do comes from this single scalar signal. As the aphorism goes:
"The reward hypothesis: All goals can be described by the maximization of expected cumulative reward."
Whether this hypothesis is true is debated, but it is the foundation upon which all of RL is built.
By the end of this page, you will understand reward functions in depth—their formal definition, common design patterns, the challenges of sparse vs. dense rewards, the art of reward shaping, potential-based shaping that preserves optimality, and the philosophical challenges of reward specification. You'll have practical skills for designing rewards that lead to desired behavior.
The reward function defines the immediate feedback received after each transition. There are several common formulations:
Transition-based reward: $$R: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow \mathbb{R}$$ $$R(s, a, s')$$ Reward depends on the full transition: starting state, action taken, and resulting state.
State-action reward: $$R: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$$ $$R(s, a) = \mathbb{E}_{s' \sim P(\cdot|s,a)}[R(s, a, s')]$$ Reward depends only on state and action (expected over next states).
State-only reward: $$R: \mathcal{S} \rightarrow \mathbb{R}$$ $$R(s)$$ Reward for being in a state (regardless of how you got there or what you do).
The transition-based form is the most general: the others follow from it by taking expectations over next states (and, for state-only rewards, by dropping the dependence on the action). Because value functions and optimal policies depend only on the expected reward $R(s, a)$, the first two formulations are interchangeable for planning purposes, and the state-only form is a convenient special case when the reward does not depend on the action; a short numerical example of the reduction follows below.
| Formulation | Notation | When to Use |
|---|---|---|
| Full (transition) | $R(s, a, s')$ | When reward depends on outcome (e.g., goal reached) |
| Action-based | $R(s, a)$ | When action cost matters (e.g., fuel consumption) |
| State-only | $R(s)$ | When state quality is intrinsic (e.g., health level) |
When the reward is stochastic, we denote the random variable as $R_t$ and write $r_t$ for a specific realized value. The expected reward is $\bar{R}(s, a, s') = \mathbb{E}[R_t | S_t=s, A_t=a, S_{t+1}=s']$. Most theoretical analysis uses expected rewards, but practical implementations often encounter noisy reward signals.
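To make the reduction concrete, here is a minimal sketch; the two-state, two-action MDP and all of its numbers are made up for illustration, and only NumPy is assumed.

```python
import numpy as np

# Reduce a transition-based reward R(s, a, s') to a state-action reward
# R(s, a) = E_{s' ~ P(.|s,a)}[R(s, a, s')] for a toy 2-state, 2-action MDP.
P = np.array([                      # P[s, a, s'] = transition probability
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.0, 1.0]],
])
R_sas = np.array([                  # R[s, a, s'] = transition-based reward
    [[0.0, 1.0], [0.0, 1.0]],
    [[-1.0, 0.0], [-1.0, 0.0]],
])

# Expectation over next states collapses the last axis
R_sa = np.einsum("ijk,ijk->ij", P, R_sas)
print(R_sa)                         # e.g. R_sa[0, 0] = 0.8*0.0 + 0.2*1.0 = 0.2
```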
Different problem structures call for different reward patterns. Understanding these patterns helps you design effective reward functions:
1. Goal-Based Rewards: Reward upon reaching a goal state, often with zero reward elsewhere. $$R(s, a, s') = \begin{cases} +1 & \text{if } s' \in \mathcal{G} \\ 0 & \text{otherwise} \end{cases}$$ Examples: Reaching a target, winning a game, completing a task.
2. Penalty-Based Rewards: Continuous penalty to encourage efficiency. $$R(s, a, s') = -1 \text{ for every step}$$ Agent learns to reach the goal as quickly as possible.
3. Cost-Based Rewards: Negative rewards for resource consumption. $$R(s, a, s') = -\text{cost}(a)$$ Examples: Fuel consumption, energy usage, financial cost.
4. Shaping Rewards: Supplement sparse goal rewards with continuous guidance. $$R_{\text{total}}(s, a, s') = R_{\text{goal}}(s, a, s') + \gamma\,\phi(s') - \phi(s)$$ Potential-based shaping adds learning signal without changing the optimal policy.
```python
import numpy as np
from typing import Set, Callable
from dataclasses import dataclass

# ================================================
# Pattern 1: Goal-Based Reward
# ================================================
class GoalReward:
    """
    Binary reward: +1 for reaching goal, 0 otherwise.
    Classic pattern for navigation, puzzles, games.
    """
    def __init__(self, goal_states: Set[int], goal_reward: float = 1.0):
        self.goal_states = goal_states
        self.goal_reward = goal_reward

    def __call__(self, s: int, a: int, s_prime: int) -> float:
        if s_prime in self.goal_states:
            return self.goal_reward
        return 0.0

# ================================================
# Pattern 2: Step Penalty
# ================================================
class StepPenaltyReward:
    """
    Constant penalty per step to encourage efficiency.
    Agent learns to solve task quickly.
    """
    def __init__(
        self,
        goal_states: Set[int],
        step_penalty: float = -1.0,
        goal_bonus: float = 0.0
    ):
        self.goal_states = goal_states
        self.step_penalty = step_penalty
        self.goal_bonus = goal_bonus

    def __call__(self, s: int, a: int, s_prime: int) -> float:
        if s_prime in self.goal_states:
            return self.goal_bonus  # Often 0: ending the penalties is itself the reward
        return self.step_penalty

# ================================================
# Pattern 3: Distance-Based Shaping
# ================================================
class DistanceShapingReward:
    """
    Reward based on progress toward goal.
    R = γ·φ(s') - φ(s) where φ(s) = -distance(s, goal)
    This is potential-based shaping that doesn't change Q*.
    """
    def __init__(
        self,
        distance_fn: Callable[[int], float],
        goal_states: Set[int],
        gamma: float = 0.99,
        goal_reward: float = 10.0
    ):
        self.distance_fn = distance_fn
        self.goal_states = goal_states
        self.gamma = gamma
        self.goal_reward = goal_reward

    def potential(self, s: int) -> float:
        """Potential function: negative distance to goal."""
        return -self.distance_fn(s)

    def __call__(self, s: int, a: int, s_prime: int) -> float:
        # Base goal reward
        r_task = self.goal_reward if s_prime in self.goal_states else 0.0

        # Potential-based shaping (preserves optimal policy)
        if s_prime in self.goal_states:
            r_shaping = -self.potential(s)  # Terminal: φ(s') = 0
        else:
            r_shaping = self.gamma * self.potential(s_prime) - self.potential(s)

        return r_task + r_shaping

# ================================================
# Pattern 4: Multi-Objective Reward
# ================================================
@dataclass
class MultiObjectiveReward:
    """
    Weighted combination of multiple reward components.
    Common in complex tasks with multiple objectives.
    Example: Robot task = reach_goal - energy - collision_risk
    """
    weights: dict     # Component name -> weight
    components: dict  # Component name -> reward function

    def __call__(self, s: int, a: int, s_prime: int) -> float:
        total = 0.0
        for name, reward_fn in self.components.items():
            weight = self.weights.get(name, 1.0)
            total += weight * reward_fn(s, a, s_prime)
        return total

# ================================================
# Pattern 5: Survival Reward (Keep-Alive)
# ================================================
class SurvivalReward:
    """
    Positive reward for staying alive, penalty for death.
    Common in games and control tasks.
    """
    def __init__(
        self,
        death_states: Set[int],
        alive_reward: float = 0.1,
        death_penalty: float = -10.0
    ):
        self.death_states = death_states
        self.alive_reward = alive_reward
        self.death_penalty = death_penalty

    def __call__(self, s: int, a: int, s_prime: int) -> float:
        if s_prime in self.death_states:
            return self.death_penalty
        return self.alive_reward

# ================================================
# Demonstration: Gridworld with Different Rewards
# ================================================
def demo_reward_effects():
    """
    Show how different reward functions lead to different behaviors.
    """
    # Simple 1D gridworld: states 0, 1, 2, 3, 4 (goal at 4)
    goal = {4}
    death = {0}  # Falling off left edge

    # Create different reward functions
    goal_only = GoalReward(goal)
    step_penalty = StepPenaltyReward(goal, step_penalty=-0.1)

    distance_fn = lambda s: abs(s - 4)  # Manhattan distance to goal
    shaped = DistanceShapingReward(distance_fn, goal, gamma=0.99)

    survival = SurvivalReward(death, alive_reward=0.01, death_penalty=-1.0)

    # Compare rewards for the same transition: s=2, a=RIGHT (action index 0), s'=3
    s, s_prime = 2, 3
    print("Rewards for transition s=2 → s=3:")
    print(f"  Goal only:       {goal_only(s, 0, s_prime):+.3f}")
    print(f"  Step penalty:    {step_penalty(s, 0, s_prime):+.3f}")
    print(f"  Distance shaped: {shaped(s, 0, s_prime):+.3f}")
    print(f"  Survival:        {survival(s, 0, s_prime):+.3f}")

    # Transition to goal: s=3 → s'=4
    s, s_prime = 3, 4
    print("\nRewards for transition s=3 → s=4 (goal):")
    print(f"  Goal only:       {goal_only(s, 0, s_prime):+.3f}")
    print(f"  Step penalty:    {step_penalty(s, 0, s_prime):+.3f}")
    print(f"  Distance shaped: {shaped(s, 0, s_prime):+.3f}")
    print(f"  Survival:        {survival(s, 0, s_prime):+.3f}")

demo_reward_effects()
```

One of the most fundamental distinctions in reward design is between sparse and dense rewards:
Sparse Rewards: Non-zero feedback arrives only at rare events, such as +1 on reaching the goal and 0 everywhere else. They are easy to specify correctly, but the agent may act for thousands of steps before receiving any learning signal.
Dense Rewards: Informative feedback arrives at most or every step, such as per-step costs or distance-based progress signals. They give the agent constant guidance, but every such signal encodes designer assumptions that can be gamed.
The sparse vs. dense trade-off is fundamental because sparse rewards are often easier to specify correctly but harder to learn from, while dense rewards are easier to learn from but harder to specify without introducing artifacts.
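The practical consequence is easy to see in simulation. The sketch below is illustrative only: a hypothetical 20-state chain with the goal at the right end and a purely random-walk agent, counting how often each reward type returns a non-zero value.

```python
import numpy as np

# Count non-zero reward signals a random-walk agent sees on a 1-D chain.
rng = np.random.default_rng(0)
N, GOAL, STEPS = 20, 19, 200

def count_signals(reward_fn):
    s, signals = N // 2, 0
    for _ in range(STEPS):
        s_next = int(np.clip(s + rng.choice([-1, 1]), 0, N - 1))
        if reward_fn(s, s_next) != 0.0:
            signals += 1
        if s_next == GOAL:          # episode ends at the goal; restart in the middle
            s_next = N // 2
        s = s_next
    return signals

sparse = lambda s, sp: 1.0 if sp == GOAL else 0.0        # goal-only reward
dense = lambda s, sp: abs(s - GOAL) - abs(sp - GOAL)     # progress toward goal

print("non-zero signals, sparse:", count_signals(sparse))  # a handful at best
print("non-zero signals, dense: ", count_signals(dense))   # nearly every step
```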
In Montezuma's Revenge, the classic hard-exploration Atari game, a random policy scores ≈ 0 because it never stumbles upon a reward. DQN and many other standard algorithms also score 0 on this game without specialized exploration. This revealed a fundamental limitation of standard RL: sparse rewards require directed exploration.
Reward shaping is the practice of adding supplementary rewards to guide learning while (ideally) preserving the optimal policy. The idea is intuitive: if the sparse goal reward is +1 at the finish line, why not add smaller rewards for intermediate progress?
Naive shaping is dangerous:
Adding arbitrary shaping rewards $F(s, a, s')$ to the original reward changes the MDP: $$R'(s, a, s') = R(s, a, s') + F(s, a, s')$$
The optimal policy for $R'$ may differ from the optimal policy for $R$. The agent maximizes what you actually reward, not what you intended to reward.
Example of shaping gone wrong:
Goal: Teach a soccer-playing agent to score goals. Naive shaping: Add reward for moving toward the ball. Result: The agent learns to approach and hover near the ball but never shoots, because approaching yields a steady stream of reward while shooting (which may miss) risks cutting that stream off.
This is called reward hacking—the agent finds an unintended way to maximize the shaped reward.
Reward hacking is pervasive. Boat-racing agents learn to spin in circles collecting small rewards instead of finishing the race. Cleaning robots knock over chairs to reduce the amount of floor to clean. Every shaping reward you add is a potential exploit. The only truly safe rewards are those that exactly specify what you want—but those are often sparse and hard to learn.
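To make the failure mode concrete, here is a minimal sketch of the soccer example as a made-up three-state MDP (the states, transition probabilities, and the +0.5 "moved toward the ball" bonus are all illustrative). Value iteration under the true reward learns to shoot; adding the naive, non-potential bonus flips the optimal policy to circling the ball forever.

```python
import numpy as np

# Toy "soccer" MDP: 0 = FAR from ball, 1 = NEAR ball, 2 = SCORED (terminal).
# Actions: 0 = APPROACH, 1 = SHOOT.
GAMMA = 0.99
P = {  # P[s][a] = list of (probability, next_state, true_reward)
    0: {0: [(1.0, 1, 0.0)],              # approach: FAR -> NEAR
        1: [(1.0, 0, 0.0)]},             # shooting from far away does nothing
    1: {0: [(1.0, 1, 0.0)],              # keep circling next to the ball
        1: [(0.3, 2, 10.0),              # shoot: score with probability 0.3
            (0.7, 0, 0.0)]},             # miss: ball cleared, back to FAR
}

def solve(approach_bonus=0.0, iters=3000):
    """Value iteration when every APPROACH action also earns `approach_bonus`."""
    V = np.zeros(3)                      # V[2] stays 0 (terminal)
    q = lambda s, a: sum(
        p * ((approach_bonus if a == 0 else 0.0) + r + GAMMA * V[sp])
        for p, sp, r in P[s][a]
    )
    for _ in range(iters):
        for s in (0, 1):
            V[s] = max(q(s, 0), q(s, 1))
    return {s: ("APPROACH", "SHOOT")[int(q(s, 1) > q(s, 0))] for s in (0, 1)}

print("True reward only:   ", solve(0.0))  # state 1 (NEAR) -> SHOOT
print("Naive +0.5 shaping: ", solve(0.5))  # state 1 (NEAR) -> APPROACH forever
```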
A remarkable theoretical result guarantees that potential-based shaping does not change the optimal policy:
Theorem (Ng, Harada, Russell, 1999):
If the shaping reward has the form: $$F(s, a, s') = \gamma \phi(s') - \phi(s)$$
where $\phi: \mathcal{S} \rightarrow \mathbb{R}$ is any real-valued potential function, then the optimal policy under the shaped reward $R + F$ is the same as under the original reward $R$. In fact, the ranking of all policies is preserved, and the optimal values shift by exactly the potential: $Q'^*(s, a) = Q^*(s, a) - \phi(s)$ and $V'^*(s) = V^*(s) - \phi(s)$.
Why does this work?
The key insight is that potential-based shaping is telescoping. Along any trajectory, the discounted sum of shaping rewards is: $$\sum_{t=0}^{T-1} \gamma^t F(s_t, a_t, s_{t+1}) = \left(\gamma \phi(s_1) - \phi(s_0)\right) + \left(\gamma^2 \phi(s_2) - \gamma \phi(s_1)\right) + \ldots$$
This telescopes to $\gamma^T \phi(s_T) - \phi(s_0)$, which depends only on the start and end states—not on the path taken. With the usual convention that terminal states have zero potential (or with $\gamma < 1$ as $T \to \infty$, so the final term vanishes), the shaping contribution reduces to $-\phi(s_0)$. Comparing any two policies from the same start state, the shaping contribution is therefore identical, leaving their ranking unchanged.
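Before the full implementation below, here is a quick numerical sanity check of the telescoping identity; the potential values and the trajectory are arbitrary, drawn at random over ten hypothetical integer states.

```python
import numpy as np

# Sanity check: the discounted sum of F(s_t, s_{t+1}) = γ·φ(s_{t+1}) - φ(s_t)
# collapses to γ^T·φ(s_T) - φ(s_0), regardless of the path in between.
rng = np.random.default_rng(0)
gamma = 0.9
phi = rng.normal(size=10)            # arbitrary potential over 10 states
traj = rng.integers(0, 10, size=8)   # random state sequence s_0, ..., s_7
T = len(traj) - 1

shaping_sum = sum(
    gamma**t * (gamma * phi[traj[t + 1]] - phi[traj[t]]) for t in range(T)
)
closed_form = gamma**T * phi[traj[-1]] - phi[traj[0]]

print(shaping_sum, closed_form)      # identical up to floating-point error
assert np.isclose(shaping_sum, closed_form)
```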
```python
import numpy as np
from typing import Callable, Tuple

class PotentialBasedShaping:
    """
    Implements potential-based reward shaping that preserves the optimal policy.
    F(s, a, s') = γ·φ(s') - φ(s)
    This is the ONLY form of shaping guaranteed not to change π*.
    """
    def __init__(
        self,
        base_reward: Callable,
        potential_fn: Callable[[int], float],
        gamma: float = 0.99
    ):
        """
        Args:
            base_reward: Original reward function R(s, a, s')
            potential_fn: φ(s) - higher potential = "closer to goal"
            gamma: Discount factor (must match the MDP's γ)
        """
        self.base_reward = base_reward
        self.potential = potential_fn
        self.gamma = gamma

    def shaping_reward(
        self, s: int, a: int, s_prime: int, terminal: bool = False
    ) -> float:
        """
        Compute the shaping reward F(s, a, s').
        For terminal states, φ(s') = 0 by convention.
        """
        phi_s = self.potential(s)
        if terminal:
            phi_s_prime = 0.0  # Terminal potential is 0
        else:
            phi_s_prime = self.potential(s_prime)
        return self.gamma * phi_s_prime - phi_s

    def total_reward(
        self, s: int, a: int, s_prime: int, terminal: bool = False
    ) -> float:
        """Total reward = base + shaping."""
        return (
            self.base_reward(s, a, s_prime)
            + self.shaping_reward(s, a, s_prime, terminal)
        )

# ================================================
# Example: Gridworld with Distance-Based Potential
# ================================================
"""
Gridworld navigation where we shape with distance to goal.

Key insight: φ(s) = -distance(s, goal)
This makes states closer to the goal have higher potential.
"""

class ShapedGridworld:
    def __init__(self, grid_size: int = 5, goal: Tuple[int, int] = (4, 4)):
        self.size = grid_size
        self.goal = goal
        self.gamma = 0.99

        # Base reward: +1 at goal only (sparse!)
        self.base_reward = lambda s, a, sp: 1.0 if sp == self._state_idx(goal) else 0.0

        # Potential: negative Manhattan distance to goal
        def potential(s):
            row, col = s // self.size, s % self.size
            dist = abs(row - goal[0]) + abs(col - goal[1])
            return -dist  # Negative distance = higher potential near goal

        self.shaping = PotentialBasedShaping(
            self.base_reward, potential, self.gamma
        )

    def _state_idx(self, pos: Tuple[int, int]) -> int:
        return pos[0] * self.size + pos[1]

    def step_reward(self, s: int, a: int, s_prime: int) -> float:
        """Get shaped reward for a transition."""
        is_terminal = s_prime == self._state_idx(self.goal)
        return self.shaping.total_reward(s, a, s_prime, is_terminal)

    def demo_trajectory_rewards(self):
        """Show rewards along two paths to the goal."""
        # Path 1: Direct diagonal (optimal)
        path1 = [(0, 0), (1, 0), (2, 1), (3, 2), (4, 3), (4, 4)]

        # Path 2: Go around (suboptimal)
        path2 = [(0, 0), (0, 1), (0, 2), (0, 3), (0, 4),
                 (1, 4), (2, 4), (3, 4), (4, 4)]

        for name, path in [("Direct", path1), ("Around", path2)]:
            total = 0.0
            discounted = 0.0
            for i in range(len(path) - 1):
                s = self._state_idx(path[i])
                s_prime = self._state_idx(path[i + 1])
                r = self.step_reward(s, 0, s_prime)
                discounted += (self.gamma ** i) * r
                total += r
            print(f"{name} path: total_reward={total:.3f}, "
                  f"discounted={discounted:.3f}, steps={len(path) - 1}")

# Run demo
env = ShapedGridworld()
env.demo_trajectory_rewards()

print("\n✓ Notice: Despite different total rewards,")
print("  optimal policy is unchanged - both reach goal,")
print("  but shorter path has higher DISCOUNTED return.")
```

Good potential functions encode domain knowledge about "progress toward the goal." Common choices: negative distance to goal, number of subgoals completed, game score, or even learned value function estimates from previous runs.
The potential should be high for good states and low for bad states—but any function works in terms of preserving optimality.
Individual rewards are not optimized directly. Instead, agents maximize the return—the cumulative reward over a trajectory:
Finite-horizon return: $$G = \sum_{t=0}^{T-1} R_t = R_0 + R_1 + \ldots + R_{T-1}$$
Infinite-horizon discounted return: $$G = \sum_{t=0}^{\infty} \gamma^t R_t = R_0 + \gamma R_1 + \gamma^2 R_2 + \ldots$$
The goal of RL is to find a policy $\pi$ that maximizes expected return: $$\max_\pi \mathbb{E}_{\tau \sim \pi} [G(\tau)]$$
where $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots)$ is a trajectory sampled by following policy $\pi$.
Important insight:
The return is a random variable—it depends on the stochasticity in the environment dynamics and (possibly) the policy. Different runs of the same policy yield different returns. We optimize the expected return, averaging over all possible trajectories.
```python
import numpy as np
from typing import List, Tuple

def compute_return(
    rewards: List[float],
    gamma: float = 1.0
) -> float:
    """
    Compute the (discounted) return from a sequence of rewards.

    G = Σ γ^t R_t

    Args:
        rewards: [r_0, r_1, ..., r_{T-1}]
        gamma: Discount factor (1.0 for undiscounted)

    Returns:
        Discounted return
    """
    G = 0.0
    for t, r in enumerate(rewards):
        G += (gamma ** t) * r
    return G

def compute_returns_to_go(
    rewards: np.ndarray,
    gamma: float = 0.99
) -> np.ndarray:
    """
    Compute the return-to-go for each timestep in a single backward pass.

    G_t = R_t + γ R_{t+1} + γ² R_{t+2} + ...

    Useful for policy gradient methods.
    """
    T = len(rewards)
    returns = np.zeros(T)
    # Work backwards: each return reuses the one after it
    G = 0.0
    for t in reversed(range(T)):
        G = rewards[t] + gamma * G
        returns[t] = G
    return returns

def monte_carlo_return_estimate(
    env,
    policy,
    gamma: float,
    n_episodes: int = 100
) -> Tuple[float, float]:
    """
    Estimate expected return via Monte Carlo sampling.

    Returns:
        (mean_return, std_return)
    """
    returns = []
    for _ in range(n_episodes):
        state = env.reset()
        episode_rewards = []
        done = False
        while not done:
            action = policy(state)
            state, reward, done, _ = env.step(action)
            episode_rewards.append(reward)
        G = compute_return(episode_rewards, gamma)
        returns.append(G)
    return np.mean(returns), np.std(returns)

# ================================================
# Example: Compare policies by expected return
# ================================================
def demo_return_comparison():
    """
    Two policies, same start, different expected returns.
    """
    # Simulate simple episodes
    np.random.seed(42)

    # Policy A: Conservative, small consistent rewards
    # Each step: R ~ N(1, 0.5)
    policy_a_episode = [np.random.normal(1, 0.5) for _ in range(10)]

    # Policy B: Aggressive, high-variance rewards
    # Each step: R ~ N(0.5, 2.0)
    policy_b_episode = [np.random.normal(0.5, 2.0) for _ in range(10)]

    gamma = 0.95
    G_a = compute_return(policy_a_episode, gamma)
    G_b = compute_return(policy_b_episode, gamma)

    print("Single episode comparison:")
    print(f"  Policy A return: {G_a:.2f}")
    print(f"  Policy B return: {G_b:.2f}")

    # Monte Carlo estimate with many episodes
    n_episodes = 1000
    returns_a = [compute_return(
        [np.random.normal(1, 0.5) for _ in range(10)], gamma
    ) for _ in range(n_episodes)]
    returns_b = [compute_return(
        [np.random.normal(0.5, 2.0) for _ in range(10)], gamma
    ) for _ in range(n_episodes)]

    print(f"\nMonte Carlo estimates ({n_episodes} episodes):")
    print(f"  Policy A: mean={np.mean(returns_a):.2f}, std={np.std(returns_a):.2f}")
    print(f"  Policy B: mean={np.mean(returns_b):.2f}, std={np.std(returns_b):.2f}")
    print("  Policy A is better in expectation!")

demo_return_comparison()
```

Designing reward functions that capture exactly what we want is surprisingly difficult—a problem that has significant implications for AI safety:
Specification Gaming:
Agents optimize the reward function you provide, not the objective you intended. As the reward-hacking examples above show, the two frequently differ.
Goodhart's Law:
"When a measure becomes a target, it ceases to be a good measure."
The act of optimizing a proxy for the true objective distorts that proxy. Rewards that seem to capture the objective become exploited once optimized against.
Reward Tampering:
Sufficiently advanced agents might directly modify their reward signal: for example, by interfering with the sensors or code that compute the reward rather than doing the task the reward was meant to measure.
This is a research frontier in AI alignment: how do we specify rewards that remain meaningful under optimization pressure?
Even if we specify the perfect reward function, the agent's learned objective (the function it actually optimizes after training) may differ from the intended objective. This 'mesa-optimization' problem is a frontier challenge in AI safety: ensuring the agent's internal goals match the external reward signal.
Intrinsic motivation addresses the sparse reward problem by generating internal reward signals based on the agent's own learning progress or novelty-seeking behavior:
Curiosity / Novelty: Reward for visiting states the agent hasn't seen before or that are hard to predict. $$R_{\text{intrinsic}} = \|\hat{s}_{t+1} - s_{t+1}\|^2$$ High prediction error = novel = rewarding.
Information Gain: Reward for reducing uncertainty about the environment. $$R_{\text{intrinsic}} = D_{KL}[p(\theta | \text{history}) || p(\theta)]$$ Transitions that are maximally informative about environment parameters.
Count-Based Exploration: Reward inversely proportional to visit frequency. $$R_{\text{intrinsic}} = \frac{1}{\sqrt{N(s)}}$$ Encourages visiting less-explored states.
Empowerment: Reward for states where the agent has control over future outcomes. $$R_{\text{intrinsic}} = I(a_t; s_{t+1} | s_t)$$ Mutual information between action and next state.
These intrinsic rewards supplement sparse extrinsic rewards, providing learning signal even when task reward is absent.
| Method | Reward Signal | Best For |
|---|---|---|
| ICM (Curiosity) | Forward model prediction error | Video games, exploration tasks |
| RND | Fixed random network prediction error | Hard exploration (Montezuma) |
| Count-based | $1/\sqrt{N(s)}$ or pseudo-counts | Tabular & discrete states |
| Empowerment | Mutual information $I(a; s'|s)$ | Robotics, skill discovery |
| Surprise | Unexpected transitions | Model-based exploration |
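As a concrete example of the simplest entry in the table, here is a minimal count-based bonus sketch; the `CountBonus` name and the $\beta$ scale are illustrative choices, not taken from any particular library.

```python
import numpy as np
from collections import defaultdict

class CountBonus:
    """Intrinsic bonus R_int(s) = beta / sqrt(N(s)), decaying with visit count."""
    def __init__(self, beta: float = 0.1):
        self.beta = beta
        self.counts = defaultdict(int)

    def __call__(self, state) -> float:
        self.counts[state] += 1
        return self.beta / np.sqrt(self.counts[state])

bonus = CountBonus(beta=0.1)
for s in [0, 0, 0, 1]:
    print(f"state {s}: intrinsic bonus = {bonus(s):.3f}")
# Repeated visits to state 0 earn shrinking bonuses; the first visit to state 1
# earns the full bonus again, nudging the agent toward novelty.
```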
We have explored the reward function—the component that transforms MDPs from descriptions of dynamics into optimization problems with clear objectives. Reward design is both an art and a science, with profound implications for agent behavior.
What's Next:
With rewards defined, we must address how much to value future rewards versus immediate ones. This is the role of the discount factor $\gamma$—a seemingly simple parameter with profound implications for agent behavior and mathematical tractability.
You now understand reward functions in depth—from formal definitions to practical design patterns, from the sparse-dense trade-off to potential-based shaping, from specification challenges to intrinsic motivation. This knowledge is essential for designing RL systems that actually learn the behaviors you want.