Every interaction between an agent and its environment is mediated by three fundamental quantities: states, actions, and rewards. These aren't just abstract mathematical objects—they're the concrete reality of reinforcement learning, and their design determines whether learning succeeds or fails.
States describe where you are—the complete summary of your situation relevant to future outcomes.
Actions describe what you can do—the set of interventions available to the agent at each moment.
Rewards describe how well you did—the scalar feedback signal that guides learning toward desirable behavior.
Mastering RL requires deep understanding of each: what makes a good state representation, how action spaces affect algorithm choice, and how reward design shapes learned behavior—sometimes in unexpected ways.
This page develops rigorous understanding of states, actions, and rewards. You'll learn how to design state representations that are both sufficient and efficient, how different action space structures affect learning, and the subtle art of reward engineering—including the pitfalls that cause well-intentioned rewards to produce unintended behavior.
A state $s \in \mathcal{S}$ is a complete description of the world from the agent's perspective—complete in the sense that it contains all information relevant to predicting future states and rewards. This completeness property is formalized as the Markov property:
$$P(s_{t+1}, r_{t+1} | s_t, a_t) = P(s_{t+1}, r_{t+1} | s_0, a_0, s_1, a_1, \ldots, s_t, a_t)$$
In words: the future depends on the present state and action, but not on how we got to the present. The history is irrelevant once we know the current state.
States vs. Observations
A critical distinction: the state $s_t$ is the true, complete configuration of the environment, while the observation $o_t$ is what the agent actually perceives.
In a Markov Decision Process (MDP), $o_t = s_t$—the agent sees everything. In a Partially Observable MDP (POMDP), $o_t = O(s_t)$ for some (possibly noisy) observation function—the agent sees only partial information.
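To make the distinction concrete, here is a minimal sketch, using a hypothetical state layout with `position` and `velocity` fields, of an observation function that turns a fully observed MDP into a POMDP by hiding the velocity:

```python
import numpy as np

def partial_observation(state: dict) -> np.ndarray:
    """Observation function O(s): the agent sees position but not velocity.

    Hypothetical state layout: {'position': np.ndarray, 'velocity': np.ndarray}.
    Dropping velocity makes the observation non-Markovian: two states with the
    same position but different velocities look identical to the agent.
    """
    return np.asarray(state['position'], dtype=np.float32)

# Two distinct true states...
s1 = {'position': np.array([1.0, 2.0]), 'velocity': np.array([0.5, 0.0])}
s2 = {'position': np.array([1.0, 2.0]), 'velocity': np.array([-3.0, 1.0])}

# ...produce the same observation, so the agent cannot tell them apart.
assert np.allclose(partial_observation(s1), partial_observation(s2))
```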
| Aspect | Type | Examples | Algorithm Implications |
|---|---|---|---|
| Size | Finite/Discrete | Board games, gridworlds | Tabular methods applicable; exact solutions possible |
| Size | Infinite/Continuous | Robotics, physics simulations | Function approximation required; no exact solutions |
| Dimension | Low-dimensional | Pendulum (4D), CartPole (4D) | Simple networks; fast training |
| Dimension | High-dimensional | Images (thousands of pixels) | Deep networks required; sample inefficient |
| Structure | Vector | Sensor readings | Standard neural networks |
| Structure | Image/Grid | Atari games, vision tasks | CNNs for spatial structure |
| Structure | Graph | Molecules, social networks | GNNs for relational structure |
| Structure | Sequence | Text, time series | RNNs/Transformers for temporal structure |
The State Design Problem
Designing the state representation is one of the most important decisions in RL system design. The state must satisfy competing requirements:
Sufficiency: The state must contain enough information to predict future states and rewards. Missing critical information makes the problem non-Markovian, which complicates learning.
Efficiency: Larger states require more samples to learn from and more computation per step. The state should be as compact as possible while remaining sufficient.
Learnability: The state representation affects how easily the underlying patterns can be learned. Good representations make value functions and policies smooth and easier to approximate.
Accessibility: The state must be based on information the agent can actually observe in deployment. Using privileged information available only during training leads to sim-to-real failures.
```python
import numpy as np
from dataclasses import dataclass
from typing import List, Tuple
from abc import ABC, abstractmethod


class StateRepresentation(ABC):
    """Base class for state representations."""

    @abstractmethod
    def observe(self, raw_state: dict) -> np.ndarray:
        """Convert raw environment state to agent observation."""
        pass

    @property
    @abstractmethod
    def shape(self) -> Tuple[int, ...]:
        """Shape of the observation vector/tensor."""
        pass


class RobotArmState(StateRepresentation):
    """
    State representation for a robotic arm.

    Design decision: Include joint positions AND velocities to satisfy the
    Markov property (position alone isn't sufficient to predict future motion).
    """

    def __init__(self, num_joints: int = 6,
                 include_velocities: bool = True,
                 include_target: bool = True):
        self.num_joints = num_joints
        self.include_velocities = include_velocities
        self.include_target = include_target

    @property
    def shape(self) -> Tuple[int]:
        dim = self.num_joints          # Joint positions
        if self.include_velocities:
            dim += self.num_joints     # Joint velocities
        if self.include_target:
            dim += 3                   # Target position (x, y, z)
        return (dim,)

    def observe(self, raw_state: dict) -> np.ndarray:
        """
        Extract relevant features from raw sensor data.

        Args:
            raw_state: Dictionary containing:
                - 'joint_positions': array of joint angles
                - 'joint_velocities': array of angular velocities
                - 'target_position': 3D target coordinates
        """
        obs = [raw_state['joint_positions']]
        if self.include_velocities:
            obs.append(raw_state['joint_velocities'])
        if self.include_target:
            obs.append(raw_state['target_position'])
        return np.concatenate(obs)


class FrameStackState(StateRepresentation):
    """
    Stack multiple frames to infer velocity/motion.

    Used when single frames don't satisfy the Markov property
    (e.g., can't determine ball direction from a single image).
    """

    def __init__(self, frame_shape: Tuple[int, int, int], num_frames: int = 4):
        self.frame_shape = frame_shape  # (H, W, C)
        self.num_frames = num_frames
        self.frame_buffer: List[np.ndarray] = []

    @property
    def shape(self) -> Tuple[int, int, int]:
        H, W, C = self.frame_shape
        return (H, W, C * self.num_frames)

    def reset(self, initial_frame: np.ndarray):
        """Initialize buffer with copies of the initial frame."""
        self.frame_buffer = [initial_frame.copy() for _ in range(self.num_frames)]

    def observe(self, new_frame: np.ndarray) -> np.ndarray:
        """Add new frame, return stacked representation."""
        self.frame_buffer.pop(0)
        self.frame_buffer.append(new_frame)
        # Stack along channel dimension
        return np.concatenate(self.frame_buffer, axis=-1)


@dataclass
class StateNormalizer:
    """
    Running normalization for continuous states.

    Critical for stable learning—neural networks expect inputs
    with zero mean and unit variance.
    """
    mean: np.ndarray
    var: np.ndarray
    count: int

    @classmethod
    def create(cls, state_dim: int):
        return cls(
            mean=np.zeros(state_dim),
            var=np.ones(state_dim),
            count=0
        )

    def update(self, state: np.ndarray):
        """Update running statistics with a new observation."""
        self.count += 1
        delta = state - self.mean
        self.mean += delta / self.count
        self.var += delta * (state - self.mean)

    def normalize(self, state: np.ndarray) -> np.ndarray:
        """Normalize state to ~zero mean, unit variance."""
        std = np.sqrt(self.var / max(self.count, 1)) + 1e-8
        return (state - self.mean) / std
```

The Markov property is an assumption, not a guarantee. Many practical problems violate it, and understanding these violations is crucial for successful RL deployment.
Common Sources of Non-Markovian Behavior:
Missing velocities: A ball's position doesn't tell you where it's going. Solution: Include velocities or stack multiple frames.
Hidden modes: A machine may behave differently when hot vs. cold, but temperature isn't observed. Solution: Include temperature sensor or use recurrent policies.
Opponent modeling: In games, the opponent's strategy affects optimal play but isn't directly observed. Solution: Model opponent or use history-based policies.
Partial observability: Through walls, around corners, in fog—any occlusion creates hidden state. Solution: Use belief states or recurrent processing.
Aliased states: Different underlying states produce identical observations. Solution: Include distinguishing features or use memory.
State aliasing occurs when distinct true states produce identical observations. In a gridworld with local observation, the agent might see 'wall to the north, empty to the south' in two completely different hallways. Without memory, the agent cannot determine which hallway it's in and may take suboptimal actions. This is a fundamental challenge in POMDPs.
Practical Strategies for Non-Markovian Domains
1. State Augmentation
The simplest approach: add more information to the state until it becomes sufficient. Include velocities, accelerations, sensor histories, or any relevant context.
2. Frame Stacking
For visual domains, stack the last $k$ frames. The DQN paper used 4-frame stacks for Atari—enough to infer velocity from consecutive frames.
3. Recurrent Policies
Use an LSTM or GRU that maintains hidden state across time steps. The hidden state can encode arbitrary history, enabling the policy to learn what to remember.
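As a minimal sketch of this idea in PyTorch (the `RecurrentPolicy` name and layer sizes are illustrative, not from the original text), an LSTM carries hidden state across steps so the action can depend on history, not just the current observation:

```python
import torch
import torch.nn as nn
import torch.distributions as D

class RecurrentPolicy(nn.Module):
    """Illustrative recurrent policy: the LSTM hidden state acts as learned memory."""

    def __init__(self, obs_dim: int, n_actions: int, hidden_dim: int = 128):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden_dim)
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.policy_head = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs_seq: torch.Tensor, hidden=None):
        # obs_seq: (batch, time, obs_dim); `hidden` carries memory between calls
        x = torch.relu(self.encoder(obs_seq))
        x, hidden = self.lstm(x, hidden)
        return D.Categorical(logits=self.policy_head(x)), hidden

# Usage: feed one observation at a time, threading `hidden` through the episode.
policy = RecurrentPolicy(obs_dim=8, n_actions=4)
obs = torch.randn(1, 1, 8)           # single step for a single environment
dist, hidden = policy(obs)           # first step: hidden defaults to zeros
action = dist.sample()
dist, hidden = policy(obs, hidden)   # later steps reuse the carried hidden state
```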
4. Attention/Transformer Policies
Process the full episode history (or a window) with attention mechanisms. More flexible than RNNs but computationally heavier.
5. Belief State Methods
Explicitly maintain a probability distribution over possible true states. Computationally intensive but theoretically principled for POMDPs.
An action $a \in \mathcal{A}$ is the agent's intervention in the world—the way it affects state transitions. The nature of the action space profoundly impacts algorithm design and learning difficulty.
Discrete Action Spaces
Finite set of distinct actions: $\mathcal{A} = \{a_1, a_2, \ldots, a_n\}$
Examples: Atari button presses, moves in board games, high-level navigation commands (up/down/left/right).
Advantages: easy to enumerate and maximize over (argmax of Q-values), simple exploration via ε-greedy, and a natural fit for categorical policies.
Challenges: combinatorial growth when actions are factored (e.g., multiple simultaneous buttons), and coarse discretization can make fine control impossible.
Continuous Action Spaces
Actions are real-valued vectors: $\mathcal{A} \subseteq \mathbb{R}^n$
Examples: joint torques for a robot arm, steering angle and throttle for driving, continuous control signals in physics simulations.
Advantages: natural fit for physical control, arbitrarily fine precision, and smooth policies that can be optimized with gradient-based methods.
Challenges: actions cannot be enumerated, so maximizing over actions is itself an optimization problem; exploration relies on injected noise; and bounds must be enforced by squashing or clipping.
| Algorithm Family | Discrete Actions | Continuous Actions |
|---|---|---|
| Q-Learning / DQN | ✓ Native support | ✗ Requires discretization or modifications |
| Policy Gradient / REINFORCE | ✓ Via softmax policy | ✓ Via Gaussian policy |
| Actor-Critic / A2C, PPO | ✓ Full support | ✓ Full support |
| DDPG / TD3 / SAC | ✗ Designed for continuous | ✓ Native support |
| TRPO | ✓ Full support | ✓ Full support |
```python
import numpy as np
from abc import ABC, abstractmethod
from typing import Union, Tuple
import torch
import torch.nn as nn
import torch.distributions as D


class ActionSpace(ABC):
    """Abstract base class for action spaces."""

    @abstractmethod
    def sample(self) -> np.ndarray:
        """Sample a random action."""
        pass

    @abstractmethod
    def contains(self, action) -> bool:
        """Check if action is valid."""
        pass


class DiscreteActionSpace(ActionSpace):
    """
    Finite set of discrete actions.

    For Q-learning: output Q(s,a) for all a, take argmax.
    For policy gradient: output π(a|s) as a categorical distribution.
    """

    def __init__(self, n_actions: int, action_meanings: list = None):
        self.n_actions = n_actions
        self.action_meanings = action_meanings or list(range(n_actions))

    def sample(self) -> int:
        return np.random.randint(self.n_actions)

    def contains(self, action: int) -> bool:
        return 0 <= action < self.n_actions


class ContinuousActionSpace(ActionSpace):
    """
    Continuous action space (bounded box).

    Actions are real-valued vectors constrained to [low, high].
    """

    def __init__(self, low: np.ndarray, high: np.ndarray):
        self.low = np.asarray(low)
        self.high = np.asarray(high)
        self.shape = self.low.shape

    def sample(self) -> np.ndarray:
        return np.random.uniform(self.low, self.high)

    def contains(self, action: np.ndarray) -> bool:
        return np.all(action >= self.low) and np.all(action <= self.high)

    def clip(self, action: np.ndarray) -> np.ndarray:
        """Clip action to valid range."""
        return np.clip(action, self.low, self.high)


class DiscretePolicy(nn.Module):
    """
    Policy network for discrete actions.
    Outputs a categorical distribution over actions.
    """

    def __init__(self, state_dim: int, n_actions: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, n_actions)
        )

    def forward(self, state: torch.Tensor) -> D.Categorical:
        logits = self.net(state)
        return D.Categorical(logits=logits)

    def act(self, state: torch.Tensor, deterministic: bool = False) -> Tuple[int, float]:
        dist = self.forward(state)
        if deterministic:
            action = dist.probs.argmax()
        else:
            action = dist.sample()
        log_prob = dist.log_prob(action)
        return action.item(), log_prob


class ContinuousPolicy(nn.Module):
    """
    Policy network for continuous actions.
    Outputs a Gaussian distribution with learned mean and std.
    """

    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 64,
                 log_std_min: float = -20, log_std_max: float = 2):
        super().__init__()
        self.log_std_min = log_std_min
        self.log_std_max = log_std_max
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU()
        )
        self.mean_head = nn.Linear(hidden_dim, action_dim)
        self.log_std_head = nn.Linear(hidden_dim, action_dim)

    def forward(self, state: torch.Tensor) -> D.Normal:
        features = self.shared(state)
        mean = self.mean_head(features)
        log_std = self.log_std_head(features)
        log_std = torch.clamp(log_std, self.log_std_min, self.log_std_max)
        std = torch.exp(log_std)
        return D.Normal(mean, std)

    def act(self, state: torch.Tensor, deterministic: bool = False) -> Tuple[torch.Tensor, torch.Tensor]:
        dist = self.forward(state)
        if deterministic:
            action = dist.mean
        else:
            action = dist.rsample()  # Reparameterized sample
        log_prob = dist.log_prob(action).sum(-1)  # Sum over action dims
        return action, log_prob
```

Real-world problems often have action spaces more complex than simple discrete sets or continuous vectors.
Structured Action Spaces
Hierarchical Actions: Actions have multi-level structure. Example: In an RTS game, select (unit, action_type, target)—a combinatorial explosion of primitive actions.
Parameterized Actions: Discrete action types with continuous parameters. Example: In soccer, 'kick' is discrete but kick direction and force are continuous.
Variable-Size Actions: The number of valid actions changes with state. Example: In card games, legal plays depend on hand contents.
Multi-Agent/Multi-Dimensional: Multiple decisions must be made simultaneously. Example: Controlling multiple robots or bid/ask prices.
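For instance, a parameterized action (the 'kick with direction and force' case above) can be represented as a discrete type paired with a continuous parameter vector. The dataclass below is a hypothetical sketch, not an API from any particular library:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class ParameterizedAction:
    """A discrete action type paired with its continuous parameters."""
    action_type: int          # e.g., 0 = kick, 1 = dash, 2 = turn
    parameters: np.ndarray    # e.g., [direction_radians, force] for a kick

# A policy for this space typically has one discrete head choosing action_type
# and one continuous head per type producing its parameters (as in P-DQN).
kick = ParameterizedAction(action_type=0, parameters=np.array([0.35, 0.8]))
```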
When facing combinatorially large action spaces, decompose the action into sequential or factored sub-decisions. Instead of choosing from 10,000 joint configurations, choose joint 1, then joint 2, etc. Auto-regressive action generation (as in Decision Transformer) handles this elegantly.
Large Discrete Action Spaces
When the discrete action space is huge (thousands to millions), standard methods fail:
Action Embeddings: Represent actions as vectors; use nearest-neighbor search or learn to generate action embeddings.
Action Elimination: Learn to mask obviously bad actions, reducing the effective action space.
Hierarchical RL: Decompose into high-level options and low-level primitives (Option-Critic, Feudal Networks).
Slate/Combinatorial Methods: For recommendation/ranking, use methods designed for selecting item sets.
Wolpertinger Architecture: Embed actions in continuous space, propose continuous action, map to nearest discrete neighbors, evaluate with Q-network.
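The sketch below shows the Wolpertinger selection step, under the assumption that every discrete action already has an embedding vector; `propose` and `q_value` are hypothetical stand-ins for the actor and critic networks:

```python
import numpy as np

def wolpertinger_select(state, action_embeddings, propose, q_value, k=10):
    """Wolpertinger-style selection over a large discrete action set.

    action_embeddings: (n_actions, embed_dim) array of per-action embeddings.
    propose(state) -> (embed_dim,) continuous 'proto-action' from the actor.
    q_value(state, action_index) -> scalar critic estimate.
    """
    proto = propose(state)                                    # 1. propose in embedding space
    dists = np.linalg.norm(action_embeddings - proto, axis=1)
    neighbors = np.argsort(dists)[:k]                         # 2. k nearest discrete actions
    q_values = [q_value(state, a) for a in neighbors]         # 3. evaluate with the critic
    return neighbors[int(np.argmax(q_values))]                # 4. pick the best neighbor
```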
| Action Space Type | Size | Recommended Approaches |
|---|---|---|
| Small Discrete | < 100 | DQN, A2C, PPO with categorical policy |
| Large Discrete | 100 - 10K | Dueling DQN, Action embeddings, Hierarchical |
| Huge Discrete | > 10K | Wolpertinger, Slate methods, Action elimination |
| Low-dim Continuous | < 20 | DDPG, TD3, SAC, PPO with Gaussian policy |
| High-dim Continuous | > 20 | SAC, PPO, Normalized Actor-Critic |
| Parameterized | Discrete + Continuous | P-DQN, Hybrid architectures |
| Hierarchical | Multi-level | Options, Feudal Networks, HAM |
The reward $r \in \mathbb{R}$ is the scalar feedback signal that defines the learning objective. It's the only channel through which the designer communicates goals to the agent—and this simplicity is both powerful and dangerous.
The Reward Hypothesis
"All of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward)." — Sutton & Barto
This hypothesis is the foundational assumption of RL. It asserts that any goal can be expressed as a reward function. Whether this is true in general is debatable, but it's the working assumption that enables RL algorithms.
The Reward Function
Rewards can depend on states, actions, and transitions:
$$r_{t+1} = R(s_t, a_t, s_{t+1})$$
In practice, rewards often simplify to $R(s_t, a_t)$ (depending only on the state and action) or $R(s_{t+1})$ (depending only on the resulting state) when the extra arguments add nothing.
Types of Reward Signals
Dense Rewards: Feedback at every step. Example: Negative reward proportional to distance from goal.
Sparse Rewards: Feedback only at episode end or key events. Example: +1 for winning, 0 otherwise.
Shaped Rewards: Dense rewards engineered to guide learning. Example: Distance to goal plus bonus for facing goal direction.
Intrinsic Rewards: Self-generated rewards for exploration or curiosity. Example: Reward for visiting novel states.
Agents optimize the reward you give them, not the reward you meant. A cleaning robot rewarded for 'dirt cleaned' might dump out dirt just so it can clean it up again. A game-playing agent rewarded for points might find exploits the designers never imagined. Reward hacking is not a bug in the algorithm—it's a feature. The agent is doing exactly what you asked.
```python
import numpy as np
from typing import Callable, Tuple, Optional
from dataclasses import dataclass


@dataclass
class RewardConfig:
    """Configuration for reward function design."""
    sparse_goal_reward: float = 10.0
    step_penalty: float = -0.01
    shaping_weight: float = 1.0


class RewardFunction:
    """
    Composable reward function with multiple components.

    Demonstrates reward engineering patterns:
    - Sparse goal rewards
    - Dense shaping rewards
    - Penalty terms
    - Temporal discounting of shaping
    """

    def __init__(self, config: RewardConfig = None):
        self.config = config or RewardConfig()

    def compute(self, state: np.ndarray, action: np.ndarray,
                next_state: np.ndarray, goal: np.ndarray,
                done: bool) -> Tuple[float, dict]:
        """
        Compute reward with decomposition for analysis.

        Returns:
            reward: Total scalar reward
            info: Dictionary breaking down reward components
        """
        components = {}

        # 1. Sparse goal reward
        if done and self._is_success(next_state, goal):
            components['goal'] = self.config.sparse_goal_reward
        else:
            components['goal'] = 0.0

        # 2. Step penalty (encourages efficiency)
        components['step_penalty'] = self.config.step_penalty

        # 3. Potential-based shaping (provably safe)
        #    Using distance-based potential: φ(s) = -||s - goal||
        potential_current = -np.linalg.norm(state - goal)
        potential_next = -np.linalg.norm(next_state - goal)
        # F(s, s') = γ * φ(s') - φ(s)
        gamma = 0.99
        shaping = gamma * potential_next - potential_current
        components['shaping'] = self.config.shaping_weight * shaping

        # Total reward
        reward = sum(components.values())
        return reward, components

    def _is_success(self, state: np.ndarray, goal: np.ndarray,
                    threshold: float = 0.1) -> bool:
        """Check if goal is reached."""
        return np.linalg.norm(state - goal) < threshold


class PotentialBasedShaping:
    """
    Potential-based reward shaping (Ng et al., 1999).

    THEOREM: If the shaping reward is F(s,a,s') = γφ(s') - φ(s) for any
    potential function φ, then the optimal policy is unchanged from the
    original MDP.

    This is the ONLY form of shaping guaranteed not to change
    the optimal policy!
    """

    def __init__(self, potential_fn: Callable[[np.ndarray], float],
                 gamma: float = 0.99):
        self.potential = potential_fn
        self.gamma = gamma
        self.prev_potential = None

    def reset(self, initial_state: np.ndarray):
        """Reset at episode start."""
        self.prev_potential = self.potential(initial_state)

    def shape(self, next_state: np.ndarray, done: bool) -> float:
        """
        Compute shaping reward.

        F = γ * φ(s') - φ(s)  for non-terminal
        F = 0 - φ(s)          for terminal (φ(terminal) = 0)
        """
        if done:
            next_potential = 0.0
        else:
            next_potential = self.potential(next_state)
        shaping_reward = self.gamma * next_potential - self.prev_potential
        self.prev_potential = next_potential
        return shaping_reward


# Example: Distance-based potential
def distance_potential(state: np.ndarray, goal: np.ndarray) -> float:
    """Potential proportional to negative distance."""
    return -np.linalg.norm(state - goal)


class NormalizedReward:
    """
    Running normalization of rewards.

    Important for algorithms sensitive to reward scale:
    - PPO (clipping depends on advantage scale)
    - A2C (gradient magnitude affected)
    - Soft Actor-Critic (temperature calibration)
    """

    def __init__(self, clip: float = 10.0):
        self.clip = clip
        self.mean = 0.0
        self.var = 1.0
        self.count = 0

    def __call__(self, reward: float) -> float:
        # Update running statistics
        self.count += 1
        delta = reward - self.mean
        self.mean += delta / self.count
        self.var += delta * (reward - self.mean)

        # Normalize
        std = np.sqrt(self.var / max(self.count, 1)) + 1e-8
        normalized = (reward - self.mean) / std

        # Clip for stability
        return np.clip(normalized, -self.clip, self.clip)
```

Designing reward functions is arguably the hardest part of RL system development. The core challenge: specifying what you want is fundamentally difficult.
Specification Gaming
Agents are creative optimizers. Given any reward function, they will find ways to maximize it that you didn't anticipate—and often wouldn't endorse. A famous example: an agent playing the boat-racing game CoastRunners learned to circle endlessly through a cluster of respawning targets, racking up points without ever finishing the race.
These behaviors are 'correct' given the reward—the agent is doing exactly what you asked. The problem is that you asked for the wrong thing.
Principled Reward Design
1. Potential-Based Shaping
The only form of reward shaping guaranteed to preserve the optimal policy: $$F(s, a, s') = \gamma \phi(s') - \phi(s)$$
Any other shaping risks changing what behavior is optimal.
2. Inverse Reinforcement Learning (IRL)
Learn the reward function from expert demonstrations. Instead of specifying rewards, demonstrate desired behavior and let the system infer what reward would produce it.
3. Preference Learning
Show the agent pairs of behaviors and indicate which is preferred. The reward function is learned to explain these preferences (RLHF - Reinforcement Learning from Human Feedback).
4. Constrained RL
Instead of putting everything into one reward, specify constraints that must be satisfied. Optimize primary reward subject to safety/feasibility constraints.
5. Multi-Objective RL
Maintain multiple reward signals and optimize the Pareto frontier. Allows human-in-the-loop selection of operating point.
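To make approach 3 (preference learning) concrete, here is a minimal sketch of the Bradley-Terry-style loss commonly used to fit a reward model from pairwise preferences; `reward_model` is a hypothetical network mapping trajectory features to a scalar, not an API from the text:

```python
import torch
import torch.nn as nn

def preference_loss(reward_model: nn.Module,
                    traj_a: torch.Tensor,
                    traj_b: torch.Tensor,
                    a_preferred: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry preference loss.

    traj_a, traj_b: (batch, feature_dim) summaries of two behaviors.
    a_preferred:    (batch,) 1.0 if the human preferred A, else 0.0.
    The model is trained so that preferred behaviors receive higher reward.
    """
    r_a = reward_model(traj_a).squeeze(-1)
    r_b = reward_model(traj_b).squeeze(-1)
    # P(A preferred) = sigmoid(r_A - r_B); maximize its log-likelihood
    return nn.functional.binary_cross_entropy_with_logits(r_a - r_b, a_preferred)
```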
Start with the sparsest reward that defines success, then add shaping only if needed for learning. Each shaping term you add is an opportunity for unintended behavior. When in doubt, prefer simplicity. Better to have slow learning than fast learning of the wrong behavior.
Sparse rewards are safer (less prone to hacking) but harder to learn from. If the agent only receives feedback upon success, how does it discover successful behavior in the first place?
The Exploration Problem Under Sparse Rewards
Consider a maze where reward is +1 for reaching the goal and 0 otherwise. Random exploration in a large maze will almost never find the goal, so the agent never experiences positive reward and cannot learn. This is the sparse reward exploration problem.
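As a rough illustration (a hypothetical 20x20 open gridworld with the goal in the far corner and a 100-step limit, not an example from the text), a quick Monte Carlo estimate shows how rarely a uniform random policy ever sees the +1 reward:

```python
import numpy as np

def random_walk_success_rate(size=20, horizon=100, episodes=5_000, seed=0):
    """Fraction of purely random episodes that ever reach the goal corner."""
    rng = np.random.default_rng(seed)
    moves = np.array([[0, 1], [0, -1], [1, 0], [-1, 0]])
    successes = 0
    for _ in range(episodes):
        pos = np.array([0, 0])
        for _ in range(horizon):
            pos = np.clip(pos + moves[rng.integers(4)], 0, size - 1)
            if (pos == size - 1).all():   # goal at the far corner
                successes += 1
                break
    return successes / episodes

print(random_walk_success_rate())  # typically a tiny fraction of episodes
```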
Solution Strategies
1. Curiosity-Driven Exploration
Reward the agent for 'surprising' states—states where its world model prediction is poor. This intrinsic reward drives exploration even without extrinsic signal.
$$r^{\text{intr}}_t = \|\hat{s}_{t+1} - s_{t+1}\|$$
Limitations: 'Noisy TV problem'—purely stochastic elements are infinitely surprising but not useful.
2. Count-Based Exploration
Bonus for visiting less-frequent states:
$$r^{bonus}_t = \beta / \sqrt{N(s_t)}$$
In large state spaces, use pseudo-counts or density models.
3. Hindsight Experience Replay (HER)
Retroactively relabel failed episodes with achieved goals. If the agent tried to reach goal A but reached B, store a 'successful' trajectory reaching B.
4. Go-Explore
First, explore the state space by returning to promising states and exploring from there. Second, robustify the discovered paths with RL.
5. Curriculum Learning
Start with easy versions of the task (goal nearby) and gradually increase difficulty. Automated curriculum methods (Asymmetric Self-Play, POET) can generate curricula without human design.
```python
import numpy as np
from collections import defaultdict
from typing import List, Tuple, Optional


class HindsightExperienceReplay:
    """
    Hindsight Experience Replay (HER) - Andrychowicz et al., 2017

    Key insight: A failed attempt to reach goal A is a successful
    attempt to reach wherever we actually ended up.

    Dramatically improves learning in sparse-reward, goal-conditioned tasks.
    """

    def __init__(self, replay_k: int = 4, strategy: str = 'future'):
        """
        Args:
            replay_k: Number of hindsight goals to sample per transition
            strategy: 'final', 'future', or 'episode'
                - 'final': Use the final achieved goal
                - 'future': Sample from states later in the episode
                - 'episode': Sample from anywhere in the episode
        """
        self.replay_k = replay_k
        self.strategy = strategy

    def augment_episode(self, episode: List[dict], goal_dim: int) -> List[dict]:
        """
        Augment episode with hindsight experience.

        Args:
            episode: List of transitions with 'state', 'action', 'next_state',
                     'goal', 'achieved_goal', 'reward', 'done'
                     (states are assumed to be laid out as [observation, goal])
            goal_dim: Dimensionality of the goal space

        Returns:
            Augmented list including original + hindsight transitions
        """
        augmented = list(episode)  # Keep original transitions

        for t, transition in enumerate(episode):
            # Sample hindsight goals
            hindsight_goals = self._sample_goals(episode, t)

            for new_goal in hindsight_goals:
                # Recompute reward for the new goal
                achieved = transition['achieved_goal']
                new_reward = self._compute_reward(achieved, new_goal)
                new_done = np.linalg.norm(achieved - new_goal) < 0.05

                # Create a new transition with the hindsight goal substituted
                # for the original goal: keep the observation part (everything
                # except the trailing goal_dim entries) and append the new goal
                hindsight_transition = {
                    'state': np.concatenate([
                        transition['state'][:-goal_dim], new_goal
                    ]),
                    'action': transition['action'],
                    'next_state': np.concatenate([
                        transition['next_state'][:-goal_dim], new_goal
                    ]),
                    'goal': new_goal,
                    'achieved_goal': achieved,
                    'reward': new_reward,
                    'done': new_done
                }
                augmented.append(hindsight_transition)

        return augmented

    def _sample_goals(self, episode: List[dict], t: int) -> List[np.ndarray]:
        """Sample hindsight goals according to strategy."""
        goals = []
        if self.strategy == 'final':
            # Always use the final achieved goal
            goals = [episode[-1]['achieved_goal']] * self.replay_k
        elif self.strategy == 'future':
            # Sample from later timesteps
            future_indices = range(t + 1, len(episode))
            if len(future_indices) > 0:
                sampled = np.random.choice(
                    list(future_indices),
                    size=min(self.replay_k, len(future_indices)),
                    replace=False
                )
                goals = [episode[i]['achieved_goal'] for i in sampled]
        elif self.strategy == 'episode':
            # Sample from anywhere in the episode
            sampled = np.random.choice(
                len(episode),
                size=min(self.replay_k, len(episode)),
                replace=False
            )
            goals = [episode[i]['achieved_goal'] for i in sampled]
        return goals

    def _compute_reward(self, achieved: np.ndarray, goal: np.ndarray,
                        threshold: float = 0.05) -> float:
        """Sparse reward: 0 for success, -1 otherwise."""
        if np.linalg.norm(achieved - goal) < threshold:
            return 0.0
        return -1.0


class CuriosityDrivenExploration:
    """
    Intrinsic Curiosity Module (ICM) - Pathak et al., 2017

    Provides intrinsic reward based on the prediction error of a forward
    dynamics model in a learned feature space.
    """

    def __init__(self, feature_dim: int = 64, curiosity_weight: float = 0.01):
        self.feature_dim = feature_dim
        self.curiosity_weight = curiosity_weight
        # In practice, these would be neural networks
        self.inverse_model = None    # Predicts action from (s, s')
        self.forward_model = None    # Predicts φ(s') from (φ(s), a)
        self.feature_encoder = None  # Maps s -> φ(s)

    def compute_intrinsic_reward(self, state: np.ndarray, action: np.ndarray,
                                 next_state: np.ndarray) -> float:
        """
        Compute intrinsic reward as forward-model prediction error.

        States where the model is surprised (poor prediction) get higher
        intrinsic reward, encouraging exploration.
        """
        # Encode states to feature space
        phi_s = self.feature_encoder(state)
        phi_s_next = self.feature_encoder(next_state)

        # Predict features of the next state
        phi_s_next_pred = self.forward_model(phi_s, action)

        # Intrinsic reward = prediction error
        prediction_error = np.mean((phi_s_next_pred - phi_s_next) ** 2)
        return self.curiosity_weight * prediction_error


class CountBasedExploration:
    """
    Count-based exploration bonus.

    For tabular states: exact visitation counts.
    For large state spaces: density estimation or hash-based counts.
    """

    def __init__(self, bonus_coef: float = 0.1):
        self.bonus_coef = bonus_coef
        self.state_counts = defaultdict(int)

    def get_bonus(self, state: np.ndarray) -> float:
        """Compute exploration bonus as inverse sqrt of visit count."""
        # For continuous states, discretize or hash
        state_key = self._hash_state(state)
        self.state_counts[state_key] += 1
        count = self.state_counts[state_key]
        # Bonus decays with visits
        return self.bonus_coef / np.sqrt(count)

    def _hash_state(self, state: np.ndarray, precision: int = 100) -> tuple:
        """Simple discretization for hashing."""
        discretized = np.round(state * precision).astype(int)
        return tuple(discretized)
```

States, actions, and rewards don't exist in isolation—their design is deeply interconnected. Understanding these interactions is crucial for effective RL system design.
State ↔ Action Interactions
State affects valid actions: In many domains, the set of valid actions depends on the state. Chess legal moves depend on piece positions. A robot can't move an arm that's at joint limits.
Action granularity affects state requirements: Fine-grained actions (precise motor torques) require detailed state (joint velocities, applied forces). Coarse actions (go left/right) can work with simpler state.
State representation affects action value learning: If similar states map to different optimal actions, value learning struggles. Good representations should make action-value functions smooth.
State ↔ Reward Interactions
State determines reward accessibility: Reward functions often depend on state features. If those features aren't in the state, reward can't be computed correctly.
Reward structure affects state value: Dense rewards make nearby states have similar values (smooth value function). Sparse rewards create sharp value gradients at goal boundaries.
State abstraction can destroy reward signal: Coarse state representations might alias high-reward and low-reward states, making learning impossible.
Action ↔ Reward Interactions
Action space affects exploration for reward: Discrete actions can be explored combinatorially. Continuous actions require gradient-based or noise-based exploration, affecting how rewards are encountered.
Reward structure affects action preferences: If rewards are always positive, long episodes accumulate more reward, potentially encouraging dawdling. Negative step costs encourage efficiency.
Action abstraction can simplify reward learning: Temporally extended actions (options, skills) can bridge sparse rewards, making credit assignment easier.
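A quick numeric illustration of the dawdling effect noted above (the reward values are made up for illustration): with a positive per-step reward, a longer path to the goal earns a larger undiscounted return, whereas a small step penalty makes the shorter path preferable.

```python
def undiscounted_return(steps: int, per_step: float, terminal: float = 10.0) -> float:
    """Total return for an episode that takes `steps` steps and then succeeds."""
    return steps * per_step + terminal

# Positive living reward: the 50-step detour beats the 10-step solution.
print(undiscounted_return(10, +0.1), undiscounted_return(50, +0.1))   # 11.0 vs 15.0
# Small step penalty: efficiency is rewarded.
print(undiscounted_return(10, -0.1), undiscounted_return(50, -0.1))   # 9.0 vs 5.0
```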
The System Design Triangle
Designing any one of the three (state, action, or reward) poorly can make learning impossible even if the other two are perfect.
When debugging RL failures, consider all three components. Sometimes the fix for a 'reward design problem' is actually in the state representation (adding features that make reward learnable) or the action space (adding actions that make the task feasible).
When learning fails: (1) Verify reward is correct—print rewards during episodes. (2) Check state sufficiency—can a human predict the optimal action from the state? (3) Test action space—can the optimal behavior be expressed with available actions? (4) Visualize value estimates—do they make intuitive sense? Systematic debugging prevents wasted experimentation.
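One way to operationalize step (1), assuming the classic `(obs, reward, done, info)` step signature, is a thin wrapper that logs per-episode rewards so you can see what the agent is actually being paid for; this is a debugging sketch, not part of any library:

```python
class RewardLogger:
    """Wraps an environment to record rewards, assuming the classic
    (obs, reward, done, info) step signature. Purely a debugging aid."""

    def __init__(self, env):
        self.env = env
        self.episode_rewards = []

    def reset(self, **kwargs):
        self.episode_rewards = []
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.episode_rewards.append(reward)
        if done:
            total = sum(self.episode_rewards)
            print(f"episode return={total:.3f} over {len(self.episode_rewards)} steps")
        return obs, reward, done, info
```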
States, actions, and rewards are the atomic elements of reinforcement learning. Everything else—value functions, policies, algorithms—is built from these foundations. The key insights: the state must be Markovian, or made approximately so; the action space dictates which algorithm families apply; and the reward defines exactly what the agent will optimize, for better or worse.
Looking Ahead
With states, actions, and rewards defined, we can now formalize the agent's decision-making procedure: the policy. The next page explores how policies map states to actions, the distinction between deterministic and stochastic policies, how policies are represented and parameterized, and how they can be evaluated and improved—the heart of what RL algorithms do.
You now have deep understanding of states, actions, and rewards—the raw materials of reinforcement learning. These concepts will appear throughout your study of RL, and the design principles here will guide your practical applications. Next, we formalize decision-making through the policy abstraction.