Every interaction between an agent and its environment is mediated by three fundamental quantities: states, actions, and rewards. These aren't just abstract mathematical objects—they're the concrete reality of reinforcement learning, and their design determines whether learning succeeds or fails.
States describe where you are—the complete summary of your situation relevant to future outcomes.
Actions describe what you can do—the set of interventions available to the agent at each moment.
Rewards describe how well you did—the scalar feedback signal that guides learning toward desirable behavior.
Mastering RL requires deep understanding of each: what makes a good state representation, how action spaces affect algorithm choice, and how reward design shapes learned behavior—sometimes in unexpected ways.
This page develops rigorous understanding of states, actions, and rewards. You'll learn how to design state representations that are both sufficient and efficient, how different action space structures affect learning, and the subtle art of reward engineering—including the pitfalls that cause well-intentioned rewards to produce unintended behavior.
A state $s \in \mathcal{S}$ is a complete description of the world from the agent's perspective—complete in the sense that it contains all information relevant to predicting future states and rewards. This completeness property is formalized as the Markov property:
$$P(s_{t+1}, r_{t+1} | s_t, a_t) = P(s_{t+1}, r_{t+1} | s_0, a_0, s_1, a_1, \ldots, s_t, a_t)$$
In words: the future depends on the present state and action, but not on how we got to the present. The history is irrelevant once we know the current state.
States vs. Observations
A critical distinction: the state $s_t$ is the true, complete configuration of the environment, while the observation $o_t$ is what the agent actually perceives.
In a Markov Decision Process (MDP), $o_t = s_t$—the agent sees everything. In a Partially Observable MDP (POMDP), $o_t = O(s_t)$ for some (possibly noisy) observation function—the agent sees only partial information.
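To make the distinction concrete, here is a minimal sketch, using a hypothetical state layout with `position` and `velocity` fields, of an observation function that turns a fully observed MDP into a POMDP by hiding the velocity:

```python
import numpy as np

def partial_observation(state: dict) -> np.ndarray:
    """Observation function O(s): the agent sees position but not velocity.

    Hypothetical state layout: {'position': np.ndarray, 'velocity': np.ndarray}.
    Dropping velocity makes the observation non-Markovian: two states with the
    same position but different velocities look identical to the agent.
    """
    return np.asarray(state['position'], dtype=np.float32)

# Two distinct true states...
s1 = {'position': np.array([1.0, 2.0]), 'velocity': np.array([0.5, 0.0])}
s2 = {'position': np.array([1.0, 2.0]), 'velocity': np.array([-3.0, 1.0])}

# ...produce the same observation, so the agent cannot tell them apart.
assert np.allclose(partial_observation(s1), partial_observation(s2))
```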
| Aspect | Type | Examples | Algorithm Implications |
|---|---|---|---|
| Size | Finite/Discrete | Board games, gridworlds | Tabular methods applicable; exact solutions possible |
| Size | Infinite/Continuous | Robotics, physics simulations | Function approximation required; no exact solutions |
| Dimension | Low-dimensional | Pendulum (4D), CartPole (4D) | Simple networks; fast training |
| Dimension | High-dimensional | Images (thousands of pixels) | Deep networks required; sample inefficient |
| Structure | Vector | Sensor readings | Standard neural networks |
| Structure | Image/Grid | Atari games, vision tasks | CNNs for spatial structure |
| Structure | Graph | Molecules, social networks | GNNs for relational structure |
| Structure | Sequence | Text, time series | RNNs/Transformers for temporal structure |
The State Design Problem
Designing the state representation is one of the most important decisions in RL system design. The state must satisfy competing requirements:
Sufficiency: The state must contain enough information to predict future states and rewards. Missing critical information makes the problem non-Markovian, which complicates learning.
Efficiency: Larger states require more samples to learn from and more computation per step. The state should be as compact as possible while remaining sufficient.
Learnability: The state representation affects how easily the underlying patterns can be learned. Good representations make value functions and policies smooth and easier to approximate.
Accessibility: The state must be based on information the agent can actually observe in deployment. Using privileged information available only during training leads to sim-to-real failures.
```python
import numpy as np
from dataclasses import dataclass
from typing import List, Tuple
from abc import ABC, abstractmethod


class StateRepresentation(ABC):
    """Base class for state representations."""

    @abstractmethod
    def observe(self, raw_state: dict) -> np.ndarray:
        """Convert raw environment state to agent observation."""
        pass

    @property
    @abstractmethod
    def shape(self) -> Tuple[int, ...]:
        """Shape of the observation vector/tensor."""
        pass


class RobotArmState(StateRepresentation):
    """
    State representation for a robotic arm.

    Design decision: Include joint positions AND velocities to satisfy the
    Markov property (position alone isn't sufficient to predict future motion).
    """

    def __init__(self, num_joints: int = 6,
                 include_velocities: bool = True,
                 include_target: bool = True):
        self.num_joints = num_joints
        self.include_velocities = include_velocities
        self.include_target = include_target

    @property
    def shape(self) -> Tuple[int]:
        dim = self.num_joints          # Joint positions
        if self.include_velocities:
            dim += self.num_joints     # Joint velocities
        if self.include_target:
            dim += 3                   # Target position (x, y, z)
        return (dim,)

    def observe(self, raw_state: dict) -> np.ndarray:
        """
        Extract relevant features from raw sensor data.

        Args:
            raw_state: Dictionary containing:
                - 'joint_positions': array of joint angles
                - 'joint_velocities': array of angular velocities
                - 'target_position': 3D target coordinates
        """
        obs = [raw_state['joint_positions']]
        if self.include_velocities:
            obs.append(raw_state['joint_velocities'])
        if self.include_target:
            obs.append(raw_state['target_position'])
        return np.concatenate(obs)


class FrameStackState(StateRepresentation):
    """
    Stack multiple frames to infer velocity/motion.

    Used when single frames don't satisfy the Markov property
    (e.g., can't determine ball direction from a single image).
    """

    def __init__(self, frame_shape: Tuple[int, int, int], num_frames: int = 4):
        self.frame_shape = frame_shape  # (H, W, C)
        self.num_frames = num_frames
        self.frame_buffer: List[np.ndarray] = []

    @property
    def shape(self) -> Tuple[int, int, int]:
        H, W, C = self.frame_shape
        return (H, W, C * self.num_frames)

    def reset(self, initial_frame: np.ndarray):
        """Initialize buffer with copies of the initial frame."""
        self.frame_buffer = [initial_frame.copy() for _ in range(self.num_frames)]

    def observe(self, new_frame: np.ndarray) -> np.ndarray:
        """Add new frame, return stacked representation."""
        self.frame_buffer.pop(0)
        self.frame_buffer.append(new_frame)
        # Stack along channel dimension
        return np.concatenate(self.frame_buffer, axis=-1)


@dataclass
class StateNormalizer:
    """
    Running normalization for continuous states.

    Critical for stable learning—neural networks expect inputs
    with zero mean and unit variance.
    """
    mean: np.ndarray
    var: np.ndarray
    count: int

    @classmethod
    def create(cls, state_dim: int):
        return cls(
            mean=np.zeros(state_dim),
            var=np.ones(state_dim),
            count=0
        )

    def update(self, state: np.ndarray):
        """Update running statistics with a new observation."""
        self.count += 1
        delta = state - self.mean
        self.mean += delta / self.count
        self.var += delta * (state - self.mean)

    def normalize(self, state: np.ndarray) -> np.ndarray:
        """Normalize state to ~zero mean, unit variance."""
        std = np.sqrt(self.var / max(self.count, 1)) + 1e-8
        return (state - self.mean) / std
```

The Markov property is an assumption, not a guarantee. Many practical problems violate it, and understanding these violations is crucial for successful RL deployment.
Common Sources of Non-Markovian Behavior:
Missing velocities: A ball's position doesn't tell you where it's going. Solution: Include velocities or stack multiple frames.
Hidden modes: A machine may behave differently when hot vs. cold, but temperature isn't observed. Solution: Include temperature sensor or use recurrent policies.
Opponent modeling: In games, the opponent's strategy affects optimal play but isn't directly observed. Solution: Model opponent or use history-based policies.
Partial observability: Through walls, around corners, in fog—any occlusion creates hidden state. Solution: Use belief states or recurrent processing.
Aliased states: Different underlying states produce identical observations. Solution: Include distinguishing features or use memory.
State aliasing occurs when distinct true states produce identical observations. In a gridworld with local observation, the agent might see 'wall to the north, empty to the south' in two completely different hallways. Without memory, the agent cannot determine which hallway it's in and may take suboptimal actions. This is a fundamental challenge in POMDPs.
Practical Strategies for Non-Markovian Domains
1. State Augmentation
The simplest approach: add more information to the state until it becomes sufficient. Include velocities, accelerations, sensor histories, or any relevant context.
2. Frame Stacking
For visual domains, stack the last $k$ frames. The DQN paper used 4-frame stacks for Atari—enough to infer velocity from consecutive frames.
3. Recurrent Policies
Use an LSTM or GRU that maintains hidden state across time steps. The hidden state can encode arbitrary history, enabling the policy to learn what to remember.
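As a minimal sketch of this idea in PyTorch (the `RecurrentPolicy` name and layer sizes are illustrative, not from the original text), an LSTM carries hidden state across steps so the action can depend on history, not just the current observation:

```python
import torch
import torch.nn as nn
import torch.distributions as D

class RecurrentPolicy(nn.Module):
    """Illustrative recurrent policy: the LSTM hidden state acts as learned memory."""

    def __init__(self, obs_dim: int, n_actions: int, hidden_dim: int = 128):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden_dim)
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.policy_head = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs_seq: torch.Tensor, hidden=None):
        # obs_seq: (batch, time, obs_dim); `hidden` carries memory between calls
        x = torch.relu(self.encoder(obs_seq))
        x, hidden = self.lstm(x, hidden)
        return D.Categorical(logits=self.policy_head(x)), hidden

# Usage: feed one observation at a time, threading `hidden` through the episode.
policy = RecurrentPolicy(obs_dim=8, n_actions=4)
obs = torch.randn(1, 1, 8)           # single step for a single environment
dist, hidden = policy(obs)           # first step: hidden defaults to zeros
action = dist.sample()
dist, hidden = policy(obs, hidden)   # later steps reuse the carried hidden state
```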
4. Attention/Transformer Policies
Process the full episode history (or a window) with attention mechanisms. More flexible than RNNs but computationally heavier.
5. Belief State Methods
Explicitly maintain a probability distribution over possible true states. Computationally intensive but theoretically principled for POMDPs.
An action $a \in \mathcal{A}$ is the agent's intervention in the world—the way it affects state transitions. The nature of the action space profoundly impacts algorithm design and learning difficulty.
Discrete Action Spaces
Finite set of distinct actions: $\mathcal{A} = \{a_1, a_2, \ldots, a_n\}$
Examples: Atari button presses, moves in board games, high-level navigation commands (up/down/left/right).
Advantages: easy to enumerate and maximize over (argmax of Q-values), simple exploration via ε-greedy, and a natural fit for categorical policies.
Challenges: combinatorial growth when actions are factored (e.g., multiple simultaneous buttons), and coarse discretization can make fine control impossible.
Continuous Action Spaces
Actions are real-valued vectors: $\mathcal{A} \subseteq \mathbb{R}^n$
Examples: joint torques for a robot arm, steering angle and throttle for driving, continuous control signals in physics simulations.
Advantages: natural fit for physical control, arbitrarily fine precision, and smooth policies that can be optimized with gradient-based methods.
Challenges: actions cannot be enumerated, so maximizing over actions is itself an optimization problem; exploration relies on injected noise; and bounds must be enforced by squashing or clipping.
| Algorithm Family | Discrete Actions | Continuous Actions |
|---|---|---|
| Q-Learning / DQN | ✓ Native support | ✗ Requires discretization or modifications |
| Policy Gradient / REINFORCE | ✓ Via softmax policy | ✓ Via Gaussian policy |
| Actor-Critic / A2C, PPO | ✓ Full support | ✓ Full support |
| DDPG / TD3 / SAC | ✗ Designed for continuous | ✓ Native support |
| TRPO | ✓ Full support | ✓ Full support |
```python
import numpy as np
from abc import ABC, abstractmethod
from typing import Union, Tuple
import torch
import torch.nn as nn
import torch.distributions as D


class ActionSpace(ABC):
    """Abstract base class for action spaces."""

    @abstractmethod
    def sample(self) -> np.ndarray:
        """Sample a random action."""
        pass

    @abstractmethod
    def contains(self, action) -> bool:
        """Check if action is valid."""
        pass


class DiscreteActionSpace(ActionSpace):
    """
    Finite set of discrete actions.

    For Q-learning: output Q(s,a) for all a, take argmax.
    For policy gradient: output π(a|s) as a categorical distribution.
    """

    def __init__(self, n_actions: int, action_meanings: list = None):
        self.n_actions = n_actions
        self.action_meanings = action_meanings or list(range(n_actions))

    def sample(self) -> int:
        return np.random.randint(self.n_actions)

    def contains(self, action: int) -> bool:
        return 0 <= action < self.n_actions


class ContinuousActionSpace(ActionSpace):
    """
    Continuous action space (bounded box).

    Actions are real-valued vectors constrained to [low, high].
    """

    def __init__(self, low: np.ndarray, high: np.ndarray):
        self.low = np.asarray(low)
        self.high = np.asarray(high)
        self.shape = self.low.shape

    def sample(self) -> np.ndarray:
        return np.random.uniform(self.low, self.high)

    def contains(self, action: np.ndarray) -> bool:
        return np.all(action >= self.low) and np.all(action <= self.high)

    def clip(self, action: np.ndarray) -> np.ndarray:
        """Clip action to valid range."""
        return np.clip(action, self.low, self.high)


class DiscretePolicy(nn.Module):
    """
    Policy network for discrete actions.
    Outputs a categorical distribution over actions.
    """

    def __init__(self, state_dim: int, n_actions: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, n_actions)
        )

    def forward(self, state: torch.Tensor) -> D.Categorical:
        logits = self.net(state)
        return D.Categorical(logits=logits)

    def act(self, state: torch.Tensor, deterministic: bool = False) -> Tuple[int, float]:
        dist = self.forward(state)
        if deterministic:
            action = dist.probs.argmax()
        else:
            action = dist.sample()
        log_prob = dist.log_prob(action)
        return action.item(), log_prob


class ContinuousPolicy(nn.Module):
    """
    Policy network for continuous actions.
    Outputs a Gaussian distribution with learned mean and std.
    """

    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 64,
                 log_std_min: float = -20, log_std_max: float = 2):
        super().__init__()
        self.log_std_min = log_std_min
        self.log_std_max = log_std_max
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU()
        )
        self.mean_head = nn.Linear(hidden_dim, action_dim)
        self.log_std_head = nn.Linear(hidden_dim, action_dim)

    def forward(self, state: torch.Tensor) -> D.Normal:
        features = self.shared(state)
        mean = self.mean_head(features)
        log_std = self.log_std_head(features)
        log_std = torch.clamp(log_std, self.log_std_min, self.log_std_max)
        std = torch.exp(log_std)
        return D.Normal(mean, std)

    def act(self, state: torch.Tensor, deterministic: bool = False) -> Tuple[torch.Tensor, torch.Tensor]:
        dist = self.forward(state)
        if deterministic:
            action = dist.mean
        else:
            action = dist.rsample()  # Reparameterized sample
        log_prob = dist.log_prob(action).sum(-1)  # Sum over action dims
        return action, log_prob
```

Real-world problems often have action spaces more complex than simple discrete sets or continuous vectors.
Structured Action Spaces
Hierarchical Actions: Actions have multi-level structure. Example: In an RTS game, select (unit, action_type, target)—a combinatorial explosion of primitive actions.
Parameterized Actions: Discrete action types with continuous parameters. Example: In soccer, 'kick' is discrete but kick direction and force are continuous.
Variable-Size Actions: The number of valid actions changes with state. Example: In card games, legal plays depend on hand contents.
Multi-Agent/Multi-Dimensional: Multiple decisions must be made simultaneously. Example: Controlling multiple robots or bid/ask prices.
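For instance, a parameterized action (the 'kick with direction and force' case above) can be represented as a discrete type paired with a continuous parameter vector. The dataclass below is a hypothetical sketch, not an API from any particular library:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class ParameterizedAction:
    """A discrete action type paired with its continuous parameters."""
    action_type: int          # e.g., 0 = kick, 1 = dash, 2 = turn
    parameters: np.ndarray    # e.g., [direction_radians, force] for a kick

# A policy for this space typically has one discrete head choosing action_type
# and one continuous head per type producing its parameters (as in P-DQN).
kick = ParameterizedAction(action_type=0, parameters=np.array([0.35, 0.8]))
```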
When facing combinatorially large action spaces, decompose the action into sequential or factored sub-decisions. Instead of choosing from 10,000 joint configurations, choose joint 1, then joint 2, etc. Auto-regressive action generation (as in Decision Transformer) handles this elegantly.
Large Discrete Action Spaces
When the discrete action space is huge (thousands to millions), standard methods fail:
Action Embeddings: Represent actions as vectors; use nearest-neighbor search or learn to generate action embeddings.
Action Elimination: Learn to mask obviously bad actions, reducing the effective action space.
Hierarchical RL: Decompose into high-level options and low-level primitives (Option-Critic, Feudal Networks).
Slate/Combinatorial Methods: For recommendation/ranking, use methods designed for selecting item sets.
Wolpertinger Architecture: Embed actions in continuous space, propose continuous action, map to nearest discrete neighbors, evaluate with Q-network.
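The sketch below shows the Wolpertinger selection step, under the assumption that every discrete action already has an embedding vector; `propose` and `q_value` are hypothetical stand-ins for the actor and critic networks:

```python
import numpy as np

def wolpertinger_select(state, action_embeddings, propose, q_value, k=10):
    """Wolpertinger-style selection over a large discrete action set.

    action_embeddings: (n_actions, embed_dim) array of per-action embeddings.
    propose(state) -> (embed_dim,) continuous 'proto-action' from the actor.
    q_value(state, action_index) -> scalar critic estimate.
    """
    proto = propose(state)                                    # 1. propose in embedding space
    dists = np.linalg.norm(action_embeddings - proto, axis=1)
    neighbors = np.argsort(dists)[:k]                         # 2. k nearest discrete actions
    q_values = [q_value(state, a) for a in neighbors]         # 3. evaluate with the critic
    return neighbors[int(np.argmax(q_values))]                # 4. pick the best neighbor
```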
| Action Space Type | Size | Recommended Approaches |
|---|---|---|
| Small Discrete | < 100 | DQN, A2C, PPO with categorical policy |
| Large Discrete | 100 - 10K | Dueling DQN, Action embeddings, Hierarchical |
| Huge Discrete | > 10K | Wolpertinger, Slate methods, Action elimination |
| Low-dim Continuous | < 20 | DDPG, TD3, SAC, PPO with Gaussian policy |
| High-dim Continuous | > 20 | SAC, PPO, Normalized Actor-Critic |
| Parameterized | Discrete + Continuous | P-DQN, Hybrid architectures |
| Hierarchical | Multi-level | Options, Feudal Networks, HAM |
The reward $r \in \mathbb{R}$ is the scalar feedback signal that defines the learning objective. It's the only channel through which the designer communicates goals to the agent—and this simplicity is both powerful and dangerous.
The Reward Hypothesis
"All of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward)." — Sutton & Barto
This hypothesis is the foundational assumption of RL. It asserts that any goal can be expressed as a reward function. Whether this is true in general is debatable, but it's the working assumption that enables RL algorithms.
The Reward Function
Rewards can depend on states, actions, and transitions:
$$r_{t+1} = R(s_t, a_t, s_{t+1})$$
In practice, rewards often simplify to $R(s_t, a_t)$ (depending only on the state and action) or $R(s_{t+1})$ (depending only on the resulting state) when the extra arguments add nothing.
Types of Reward Signals
Dense Rewards: Feedback at every step. Example: Negative reward proportional to distance from goal.
Sparse Rewards: Feedback only at episode end or key events. Example: +1 for winning, 0 otherwise.
Shaped Rewards: Dense rewards engineered to guide learning. Example: Distance to goal plus bonus for facing goal direction.
Intrinsic Rewards: Self-generated rewards for exploration or curiosity. Example: Reward for visiting novel states.
Agents optimize the reward you give them, not the reward you meant. A cleaning robot rewarded for 'dirt cleaned' might dump out dirt just so it can clean it up again. A game-playing agent rewarded for points might find exploits the designers never imagined. Reward hacking is not a bug in the algorithm—it's a feature. The agent is doing exactly what you asked.
```python
import numpy as np
from typing import Callable, Tuple, Optional
from dataclasses import dataclass


@dataclass
class RewardConfig:
    """Configuration for reward function design."""
    sparse_goal_reward: float = 10.0
    step_penalty: float = -0.01
    shaping_weight: float = 1.0


class RewardFunction:
    """
    Composable reward function with multiple components.

    Demonstrates reward engineering patterns:
    - Sparse goal rewards
    - Dense shaping rewards
    - Penalty terms
    - Temporal discounting of shaping
    """

    def __init__(self, config: RewardConfig = None):
        self.config = config or RewardConfig()

    def compute(self, state: np.ndarray, action: np.ndarray,
                next_state: np.ndarray, goal: np.ndarray,
                done: bool) -> Tuple[float, dict]:
        """
        Compute reward with decomposition for analysis.

        Returns:
            reward: Total scalar reward
            info: Dictionary breaking down reward components
        """
        components = {}

        # 1. Sparse goal reward
        if done and self._is_success(next_state, goal):
            components['goal'] = self.config.sparse_goal_reward
        else:
            components['goal'] = 0.0

        # 2. Step penalty (encourages efficiency)
        components['step_penalty'] = self.config.step_penalty

        # 3. Potential-based shaping (provably safe)
        #    Using distance-based potential: φ(s) = -||s - goal||
        potential_current = -np.linalg.norm(state - goal)
        potential_next = -np.linalg.norm(next_state - goal)
        # F(s, s') = γ * φ(s') - φ(s)
        gamma = 0.99
        shaping = gamma * potential_next - potential_current
        components['shaping'] = self.config.shaping_weight * shaping

        # Total reward
        reward = sum(components.values())
        return reward, components

    def _is_success(self, state: np.ndarray, goal: np.ndarray,
                    threshold: float = 0.1) -> bool:
        """Check if goal is reached."""
        return np.linalg.norm(state - goal) < threshold


class PotentialBasedShaping:
    """
    Potential-based reward shaping (Ng et al., 1999).

    THEOREM: If the shaping reward is F(s,a,s') = γφ(s') - φ(s) for any
    potential function φ, then the optimal policy is unchanged from the
    original MDP.

    This is the ONLY form of shaping guaranteed not to change
    the optimal policy!
    """

    def __init__(self, potential_fn: Callable[[np.ndarray], float],
                 gamma: float = 0.99):
        self.potential = potential_fn
        self.gamma = gamma
        self.prev_potential = None

    def reset(self, initial_state: np.ndarray):
        """Reset at episode start."""
        self.prev_potential = self.potential(initial_state)

    def shape(self, next_state: np.ndarray, done: bool) -> float:
        """
        Compute shaping reward.

        F = γ * φ(s') - φ(s)  for non-terminal
        F = 0 - φ(s)          for terminal (φ(terminal) = 0)
        """
        if done:
            next_potential = 0.0
        else:
            next_potential = self.potential(next_state)
        shaping_reward = self.gamma * next_potential - self.prev_potential
        self.prev_potential = next_potential
        return shaping_reward


# Example: Distance-based potential
def distance_potential(state: np.ndarray, goal: np.ndarray) -> float:
    """Potential proportional to negative distance."""
    return -np.linalg.norm(state - goal)


class NormalizedReward:
    """
    Running normalization of rewards.

    Important for algorithms sensitive to reward scale:
    - PPO (clipping depends on advantage scale)
    - A2C (gradient magnitude affected)
    - Soft Actor-Critic (temperature calibration)
    """

    def __init__(self, clip: float = 10.0):
        self.clip = clip
        self.mean = 0.0
        self.var = 1.0
        self.count = 0

    def __call__(self, reward: float) -> float:
        # Update running statistics
        self.count += 1
        delta = reward - self.mean
        self.mean += delta / self.count
        self.var += delta * (reward - self.mean)

        # Normalize
        std = np.sqrt(self.var / max(self.count, 1)) + 1e-8
        normalized = (reward - self.mean) / std

        # Clip for stability
        return np.clip(normalized, -self.clip, self.clip)
```

Designing reward functions is arguably the hardest part of RL system development. The core challenge: specifying what you want is fundamentally difficult.
Specification Gaming
Agents are creative optimizers. Given any reward function, they will find ways to maximize it that you didn't anticipate—and often wouldn't endorse. A famous example: an agent playing the boat-racing game CoastRunners learned to circle endlessly through a cluster of respawning targets, racking up points without ever finishing the race.
These behaviors are 'correct' given the reward—the agent is doing exactly what you asked. The problem is that you asked for the wrong thing.
Principled Reward Design
1. Potential-Based Shaping
The only form of reward shaping guaranteed to preserve the optimal policy: $$F(s, a, s') = \gamma \phi(s') - \phi(s)$$
Any other shaping risks changing what behavior is optimal.
2. Inverse Reinforcement Learning (IRL)
Learn the reward function from expert demonstrations. Instead of specifying rewards, demonstrate desired behavior and let the system infer what reward would produce it.
3. Preference Learning
Show the agent pairs of behaviors and indicate which is preferred. The reward function is learned to explain these preferences (RLHF - Reinforcement Learning from Human Feedback).
4. Constrained RL
Instead of putting everything into one reward, specify constraints that must be satisfied. Optimize primary reward subject to safety/feasibility constraints.
5. Multi-Objective RL
Maintain multiple reward signals and optimize the Pareto frontier. Allows human-in-the-loop selection of operating point.
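To make approach 3 (preference learning) concrete, here is a minimal sketch of the Bradley-Terry-style loss commonly used to fit a reward model from pairwise preferences; `reward_model` is a hypothetical network mapping trajectory features to a scalar, not an API from the text:

```python
import torch
import torch.nn as nn

def preference_loss(reward_model: nn.Module,
                    traj_a: torch.Tensor,
                    traj_b: torch.Tensor,
                    a_preferred: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry preference loss.

    traj_a, traj_b: (batch, feature_dim) summaries of two behaviors.
    a_preferred:    (batch,) 1.0 if the human preferred A, else 0.0.
    The model is trained so that preferred behaviors receive higher reward.
    """
    r_a = reward_model(traj_a).squeeze(-1)
    r_b = reward_model(traj_b).squeeze(-1)
    # P(A preferred) = sigmoid(r_A - r_B); maximize its log-likelihood
    return nn.functional.binary_cross_entropy_with_logits(r_a - r_b, a_preferred)
```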
Start with the sparsest reward that defines success, then add shaping only if needed for learning. Each shaping term you add is an opportunity for unintended behavior. When in doubt, prefer simplicity. Better to have slow learning than fast learning of the wrong behavior.
Sparse rewards are safer (less prone to hacking) but harder to learn from. If the agent only receives feedback upon success, how does it discover successful behavior in the first place?
The Exploration Problem Under Sparse Rewards
Consider a maze where reward is +1 for reaching the goal and 0 otherwise. Random exploration in a large maze will almost never find the goal, so the agent never experiences positive reward and cannot learn. This is the sparse reward exploration problem.
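As a rough illustration (a hypothetical 20x20 open gridworld with the goal in the far corner and a 100-step limit, not an example from the text), a quick Monte Carlo estimate shows how rarely a uniform random policy ever sees the +1 reward:

```python
import numpy as np

def random_walk_success_rate(size=20, horizon=100, episodes=5_000, seed=0):
    """Fraction of purely random episodes that ever reach the goal corner."""
    rng = np.random.default_rng(seed)
    moves = np.array([[0, 1], [0, -1], [1, 0], [-1, 0]])
    successes = 0
    for _ in range(episodes):
        pos = np.array([0, 0])
        for _ in range(horizon):
            pos = np.clip(pos + moves[rng.integers(4)], 0, size - 1)
            if (pos == size - 1).all():   # goal at the far corner
                successes += 1
                break
    return successes / episodes

print(random_walk_success_rate())  # typically a tiny fraction of episodes
```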
Solution Strategies
1. Curiosity-Driven Exploration
Reward the agent for 'surprising' states—states where its world model prediction is poor. This intrinsic reward drives exploration even without extrinsic signal.
$$r^{\text{intr}}_t = \|\hat{s}_{t+1} - s_{t+1}\|$$
Limitations: 'Noisy TV problem'—purely stochastic elements are infinitely surprising but not useful.
2. Count-Based Exploration
Bonus for visiting less-frequent states:
$$r^{bonus}_t = \beta / \sqrt{N(s_t)}$$
In large state spaces, use pseudo-counts or density models.
3. Hindsight Experience Replay (HER)
Retroactively relabel failed episodes with achieved goals. If the agent tried to reach goal A but reached B, store a 'successful' trajectory reaching B.
4. Go-Explore
First, explore the state space by returning to promising states and exploring from there. Second, robustify the discovered paths with RL.
5. Curriculum Learning
Start with easy versions of the task (goal nearby) and gradually increase difficulty. Automated curriculum methods (Asymmetric Self-Play, POET) can generate curricula without human design.
```python
import numpy as np
from collections import defaultdict
from typing import List, Tuple, Optional


class HindsightExperienceReplay:
    """
    Hindsight Experience Replay (HER) - Andrychowicz et al., 2017

    Key insight: A failed attempt to reach goal A is a successful
    attempt to reach wherever we actually ended up.

    Dramatically improves learning in sparse-reward, goal-conditioned tasks.
    """

    def __init__(self, replay_k: int = 4, strategy: str = 'future'):
        """
        Args:
            replay_k: Number of hindsight goals to sample per transition
            strategy: 'final', 'future', or 'episode'
                - 'final': Use the final achieved goal
                - 'future': Sample from states later in the episode
                - 'episode': Sample from anywhere in the episode
        """
        self.replay_k = replay_k
        self.strategy = strategy

    def augment_episode(self, episode: List[dict], goal_dim: int) -> List[dict]:
        """
        Augment episode with hindsight experience.

        Args:
            episode: List of transitions with 'state', 'action', 'next_state',
                     'goal', 'achieved_goal', 'reward', 'done'
                     (states are assumed to be laid out as [observation, goal])
            goal_dim: Dimensionality of the goal space

        Returns:
            Augmented list including original + hindsight transitions
        """
        augmented = list(episode)  # Keep original transitions

        for t, transition in enumerate(episode):
            # Sample hindsight goals
            hindsight_goals = self._sample_goals(episode, t)

            for new_goal in hindsight_goals:
                # Recompute reward for the new goal
                achieved = transition['achieved_goal']
                new_reward = self._compute_reward(achieved, new_goal)
                new_done = np.linalg.norm(achieved - new_goal) < 0.05

                # Create a new transition with the hindsight goal substituted
                # for the original goal: keep the observation part (everything
                # except the trailing goal_dim entries) and append the new goal
                hindsight_transition = {
                    'state': np.concatenate([
                        transition['state'][:-goal_dim], new_goal
                    ]),
                    'action': transition['action'],
                    'next_state': np.concatenate([
                        transition['next_state'][:-goal_dim], new_goal
                    ]),
                    'goal': new_goal,
                    'achieved_goal': achieved,
                    'reward': new_reward,
                    'done': new_done
                }
                augmented.append(hindsight_transition)

        return augmented

    def _sample_goals(self, episode: List[dict], t: int) -> List[np.ndarray]:
        """Sample hindsight goals according to strategy."""
        goals = []
        if self.strategy == 'final':
            # Always use the final achieved goal
            goals = [episode[-1]['achieved_goal']] * self.replay_k
        elif self.strategy == 'future':
            # Sample from later timesteps
            future_indices = range(t + 1, len(episode))
            if len(future_indices) > 0:
                sampled = np.random.choice(
                    list(future_indices),
                    size=min(self.replay_k, len(future_indices)),
                    replace=False
                )
                goals = [episode[i]['achieved_goal'] for i in sampled]
        elif self.strategy == 'episode':
            # Sample from anywhere in the episode
            sampled = np.random.choice(
                len(episode),
                size=min(self.replay_k, len(episode)),
                replace=False
            )
            goals = [episode[i]['achieved_goal'] for i in sampled]
        return goals

    def _compute_reward(self, achieved: np.ndarray, goal: np.ndarray,
                        threshold: float = 0.05) -> float:
        """Sparse reward: 0 for success, -1 otherwise."""
        if np.linalg.norm(achieved - goal) < threshold:
            return 0.0
        return -1.0


class CuriosityDrivenExploration:
    """
    Intrinsic Curiosity Module (ICM) - Pathak et al., 2017

    Provides intrinsic reward based on the prediction error of a forward
    dynamics model in a learned feature space.
    """

    def __init__(self, feature_dim: int = 64, curiosity_weight: float = 0.01):
        self.feature_dim = feature_dim
        self.curiosity_weight = curiosity_weight
        # In practice, these would be neural networks
        self.inverse_model = None    # Predicts action from (s, s')
        self.forward_model = None    # Predicts φ(s') from (φ(s), a)
        self.feature_encoder = None  # Maps s -> φ(s)

    def compute_intrinsic_reward(self, state: np.ndarray, action: np.ndarray,
                                 next_state: np.ndarray) -> float:
        """
        Compute intrinsic reward as forward-model prediction error.

        States where the model is surprised (poor prediction) get higher
        intrinsic reward, encouraging exploration.
        """
        # Encode states to feature space
        phi_s = self.feature_encoder(state)
        phi_s_next = self.feature_encoder(next_state)

        # Predict features of the next state
        phi_s_next_pred = self.forward_model(phi_s, action)

        # Intrinsic reward = prediction error
        prediction_error = np.mean((phi_s_next_pred - phi_s_next) ** 2)
        return self.curiosity_weight * prediction_error


class CountBasedExploration:
    """
    Count-based exploration bonus.

    For tabular states: exact visitation counts.
    For large state spaces: density estimation or hash-based counts.
    """

    def __init__(self, bonus_coef: float = 0.1):
        self.bonus_coef = bonus_coef
        self.state_counts = defaultdict(int)

    def get_bonus(self, state: np.ndarray) -> float:
        """Compute exploration bonus as inverse sqrt of visit count."""
        # For continuous states, discretize or hash
        state_key = self._hash_state(state)
        self.state_counts[state_key] += 1
        count = self.state_counts[state_key]
        # Bonus decays with visits
        return self.bonus_coef / np.sqrt(count)

    def _hash_state(self, state: np.ndarray, precision: int = 100) -> tuple:
        """Simple discretization for hashing."""
        discretized = np.round(state * precision).astype(int)
        return tuple(discretized)
```

States, actions, and rewards don't exist in isolation—their design is deeply interconnected. Understanding these interactions is crucial for effective RL system design.
State ↔ Action Interactions
State affects valid actions: In many domains, the set of valid actions depends on the state. Chess legal moves depend on piece positions. A robot can't move an arm that's at joint limits.
Action granularity affects state requirements: Fine-grained actions (precise motor torques) require detailed state (joint velocities, applied forces). Coarse actions (go left/right) can work with simpler state.
State representation affects action value learning: If similar states map to different optimal actions, value learning struggles. Good representations should make action-value functions smooth.
State ↔ Reward Interactions
State determines reward accessibility: Reward functions often depend on state features. If those features aren't in the state, reward can't be computed correctly.
Reward structure affects state value: Dense rewards make nearby states have similar values (smooth value function). Sparse rewards create sharp value gradients at goal boundaries.
State abstraction can destroy reward signal: Coarse state representations might alias high-reward and low-reward states, making learning impossible.
Action ↔ Reward Interactions
Action space affects exploration for reward: Discrete actions can be explored combinatorially. Continuous actions require gradient-based or noise-based exploration, affecting how rewards are encountered.
Reward structure affects action preferences: If rewards are always positive, long episodes accumulate more reward, potentially encouraging dawdling. Negative step costs encourage efficiency.
Action abstraction can simplify reward learning: Temporally extended actions (options, skills) can bridge sparse rewards, making credit assignment easier.
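A quick numeric illustration of the dawdling effect noted above (the reward values are made up for illustration): with a positive per-step reward, a longer path to the goal earns a larger undiscounted return, whereas a small step penalty makes the shorter path preferable.

```python
def undiscounted_return(steps: int, per_step: float, terminal: float = 10.0) -> float:
    """Total return for an episode that takes `steps` steps and then succeeds."""
    return steps * per_step + terminal

# Positive living reward: the 50-step detour beats the 10-step solution.
print(undiscounted_return(10, +0.1), undiscounted_return(50, +0.1))   # 11.0 vs 15.0
# Small step penalty: efficiency is rewarded.
print(undiscounted_return(10, -0.1), undiscounted_return(50, -0.1))   # 9.0 vs 5.0
```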
The System Design Triangle
Designing any one of the three (state, action, or reward) poorly can make learning impossible even if the other two are perfect.
When debugging RL failures, consider all three components. Sometimes the fix for a 'reward design problem' is actually in the state representation (adding features that make reward learnable) or the action space (adding actions that make the task feasible).
When learning fails: (1) Verify reward is correct—print rewards during episodes. (2) Check state sufficiency—can a human predict the optimal action from the state? (3) Test action space—can the optimal behavior be expressed with available actions? (4) Visualize value estimates—do they make intuitive sense? Systematic debugging prevents wasted experimentation.
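One way to operationalize step (1), assuming the classic `(obs, reward, done, info)` step signature, is a thin wrapper that logs per-episode rewards so you can see what the agent is actually being paid for; this is a debugging sketch, not part of any library:

```python
class RewardLogger:
    """Wraps an environment to record rewards, assuming the classic
    (obs, reward, done, info) step signature. Purely a debugging aid."""

    def __init__(self, env):
        self.env = env
        self.episode_rewards = []

    def reset(self, **kwargs):
        self.episode_rewards = []
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.episode_rewards.append(reward)
        if done:
            total = sum(self.episode_rewards)
            print(f"episode return={total:.3f} over {len(self.episode_rewards)} steps")
        return obs, reward, done, info
```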
States, actions, and rewards are the atomic elements of reinforcement learning. Everything else—value functions, policies, algorithms—is built from these foundations. The key insights: the state must be Markovian, or made approximately so; the action space dictates which algorithm families apply; and the reward defines exactly what the agent will optimize, for better or worse.
Looking Ahead
With states, actions, and rewards defined, we can now formalize the agent's decision-making procedure: the policy. The next page explores how policies map states to actions, the distinction between deterministic and stochastic policies, how policies are represented and parameterized, and how they can be evaluated and improved—the heart of what RL algorithms do.
You now have deep understanding of states, actions, and rewards—the raw materials of reinforcement learning. These concepts will appear throughout your study of RL, and the design principles here will guide your practical applications. Next, we formalize decision-making through the policy abstraction.