Imagine you're in a new city without a map. You know your hotel is somewhere nearby, but you don't know which direction. Every step you take could bring you closer to your hotel—or lead you further away. How do you search effectively?
Option 1: Random walk. Take steps in random directions. You'll eventually find the hotel, but it might take a very long time.
Option 2: Systematic coverage. Explore the entire city grid by grid. Thorough, but extremely slow.
Option 3: Informed search. Ask locals, follow signs, use landmarks. Much faster, but requires prior knowledge you may not have.
This is the exploration problem in reinforcement learning—and it's arguably the most fundamental challenge in the field. Before an agent can learn anything, it must first discover rewarding experiences through its own actions. In environments with sparse rewards, random exploration might take astronomical time to stumble upon any positive signal. Yet systematic exploration can be equally impractical in large state spaces.
Exploration is where theory meets practice in RL. Algorithms that work brilliantly on benchmarks with dense rewards often fail completely when rewards are sparse. Understanding exploration is essential for deploying RL in real-world scenarios.
By the end of this page, you will understand: (1) The exploration-exploitation tradeoff and its mathematical formalization, (2) Classical exploration strategies (ε-greedy, UCB, Thompson Sampling), (3) Intrinsic motivation and curiosity-driven exploration, (4) Count-based and prediction-error bonuses, and (5) Modern approaches including Random Network Distillation (RND) and Go-Explore.
The exploration-exploitation tradeoff is the central tension in sequential decision-making: should you exploit known good actions to maximize immediate reward, or explore uncertain actions that might lead to better long-term outcomes?
Consider a restaurant choice problem: do you return to a reliable favorite (exploit), or try a new place that might be better, or might be worse (explore)?
Too much exploitation means you might miss superior options. Too much exploration means you waste time on poor choices when better ones are known.
The multi-armed bandit problem crystallizes the exploration-exploitation tradeoff in its purest form:
```python
import numpy as np

# Multi-Armed Bandit: k slot machines (arms).
# Each arm i has an unknown reward distribution with mean mu_i.
# Goal: maximize total reward over T pulls.

class MultiArmedBandit:
    def __init__(self, k_arms):
        # True (unknown) mean rewards for each arm
        self.true_means = np.random.randn(k_arms)

    def pull(self, arm):
        # Return a noisy reward drawn from the arm's distribution
        return self.true_means[arm] + np.random.randn()

# The challenge: find the arm with the highest mu_i quickly,
# trading off exploring to find the best arm vs. exploiting known good arms.
#
# Regret: difference between the optimal strategy and ours:
#   R(T) = T * mu_star - sum_{t=1}^{T} mu_{a_t}
#
# Good algorithms achieve sublinear regret: R(T) = O(log T),
# so the exploration overhead becomes negligible as T -> infinity.
```

Regret measures the cost of not knowing the optimal actions in advance. It accumulates the difference between what we earned and what we could have earned with perfect knowledge:
$$\text{Regret}(T) = \sum_{t=1}^{T} \left( \mu^* - \mu_{a_t} \right)$$
where $\mu^*$ is the mean reward of the best arm, and $\mu_{a_t}$ is the mean reward of the arm we pulled at time $t$.
The goal of exploration strategies is to minimize regret—learning which actions are best with minimal cost.
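To make the definition concrete, here is a minimal sketch (the `compute_regret` helper and the example arm means are illustrative, not from any library). It accumulates regret for a sequence of pulls using the true means, which a real agent would not know, so this is a diagnostic rather than something computable online:

```python
import numpy as np

def compute_regret(true_means, pulled_arms):
    """Cumulative regret for a sequence of arm pulls.

    Regret(T) = sum_t (mu* - mu_{a_t}), evaluated with the true
    (normally unknown) means.
    """
    mu_star = np.max(true_means)
    per_step = mu_star - np.asarray(true_means)[pulled_arms]
    return per_step.cumsum()

true_means = np.array([0.1, 0.5, 0.9])
pulls = [0, 1, 2, 2, 2]  # two suboptimal pulls, then the best arm
regret = compute_regret(true_means, pulls)
# Regret grows only while suboptimal arms are pulled, then flattens
```

Note how the curve flattens once only the best arm is pulled — this plateau is exactly what sublinear-regret algorithms achieve asymptotically.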
| Strategy | Regret Bound | Characteristics |
|---|---|---|
| Pure Exploitation | O(T) — Linear | Never finds better arms; stuck on initial best |
| Pure Random | O(T) — Linear | Never converges; equal pulls on all arms |
| ε-Greedy | O(T) — Linear | Fixed exploration; never stops exploring |
| Decaying ε-Greedy | O(log T) — Sublinear | Decreasing exploration; asymptotically optimal |
| UCB | O(log T) — Sublinear | Optimism-based; provably near-optimal |
| Thompson Sampling | O(log T) — Sublinear | Bayesian; often best empirically |
The multi-armed bandit is a simplified RL problem with only one state. In full RL, exploration is dramatically harder because: (1) Actions affect future states, (2) The value of exploration is itself uncertain, (3) The state space may be vast or continuous. Bandit algorithms provide intuition, but don't directly solve the general RL exploration problem.
Let's examine the classical approaches to exploration, building from simple to sophisticated.
The most common exploration strategy in deep RL is ε-greedy: with probability ε take a random action, otherwise take the greedy action.
```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon):
    """
    With probability epsilon: random action.
    With probability (1 - epsilon): best action according to Q.
    """
    if np.random.random() < epsilon:
        return np.random.randint(len(q_values))  # Explore
    else:
        return np.argmax(q_values)               # Exploit

# In practice: decay epsilon over training
def get_epsilon(step, epsilon_start=1.0, epsilon_end=0.01, decay_steps=100000):
    """Linear decay from epsilon_start to epsilon_end."""
    progress = min(step / decay_steps, 1.0)
    return epsilon_start - progress * (epsilon_start - epsilon_end)

# DQN anneals epsilon from 1.0 to 0.1 over the first 1M steps, then keeps it at 0.1
```

Limitations of ε-greedy: exploration is undirected (the random action is chosen uniformly, ignoring which actions are actually uncertain) and state-independent, so known-bad actions are sampled as often as promising ones, and with a fixed ε the agent never stops paying an exploration tax.
Boltzmann exploration samples actions proportionally to their Q-values:
```python
import numpy as np

def boltzmann_action(q_values, temperature):
    """
    Sample an action with probability proportional to exp(Q / tau).

    Temperature tau controls exploration:
    - High tau -> nearly uniform (exploration)
    - Low tau  -> nearly greedy (exploitation)
    """
    scaled_q = q_values / temperature
    scaled_q -= scaled_q.max()  # For numerical stability
    exp_q = np.exp(scaled_q)
    probs = exp_q / exp_q.sum()
    return np.random.choice(len(q_values), p=probs)

# Advantages over ε-greedy:
# - Actions with higher Q are more likely even during exploration
# - Actions with similar Q get similar probabilities
```

UCB (Upper Confidence Bound) implements the principle of optimism in the face of uncertainty: prefer actions that might be good because they haven't been tried enough.
```python
import numpy as np

def ucb_action(q_values, action_counts, total_count, c=2.0):
    """
    Select the action that maximizes:

        Q(a)   +   c * sqrt(log(t) / n(a))
        exploit            explore

    Actions with low counts get a large exploration bonus;
    c controls the exploration-exploitation balance.
    """
    exploration_bonus = c * np.sqrt(
        np.log(max(total_count, 1)) / (action_counts + 1e-8)
    )
    ucb_values = q_values + exploration_bonus
    return np.argmax(ucb_values)

# Why it works:
# - Uncertainty shrinks as sqrt(1/n) for each action
# - Unexplored actions have high uncertainty -> high bonus
# - As an action is tried more, its bonus decreases -> exploit
#
# Provably achieves O(log T) regret in multi-armed bandits
```

Thompson Sampling maintains a full probability distribution over each action's value and samples from these distributions to choose actions:
```python
import numpy as np

class ThompsonSampling:
    """
    Bayesian approach: maintain a posterior over each arm's reward
    distribution, sample from the posteriors, and act greedily on the sample.
    """
    def __init__(self, n_arms):
        # Beta(alpha, beta) models the probability of success for each arm
        self.alphas = np.ones(n_arms)  # Successes + 1
        self.betas = np.ones(n_arms)   # Failures + 1

    def select_action(self):
        # Sample from the posterior for each arm
        sampled_probs = np.random.beta(self.alphas, self.betas)
        return np.argmax(sampled_probs)

    def update(self, arm, reward):
        # Update the posterior from the observed (assumed binary) reward
        if reward > 0.5:
            self.alphas[arm] += 1
        else:
            self.betas[arm] += 1

# Why Thompson Sampling works:
# - Naturally balances exploration (uncertain arms have wide posteriors)
# - Automatically reduces exploration as uncertainty shrinks
# - Often outperforms UCB empirically
# - Computationally simple for conjugate priors
```

Classical exploration works well when actions are independent (bandits) or the state space is small enough for count-based uncertainty estimates. In large state spaces with deep function approximation, these methods struggle: you never see exactly the same state twice, so counts are always 1. This motivates intrinsic motivation and learned exploration strategies.
Human learning isn't driven purely by external rewards. Children explore their environment with apparent curiosity—drawn to novelty, surprised by unexpected outcomes, satisfied by understanding. Can we instill similar drives in RL agents?
Intrinsic motivation provides internal reward signals that encourage exploration, supplementing or replacing sparse external rewards. The agent receives reward for discovering new things, even without explicit task reward.
```python
# Total reward = extrinsic (from environment) + intrinsic (from curiosity)

def combined_reward(state, action, next_state, env_reward,
                    intrinsic_module, beta=0.1):
    """
    Combine the external task reward with an intrinsic exploration reward.
    beta controls the exploration vs. exploitation balance.
    """
    extrinsic = env_reward  # From the environment
    intrinsic = intrinsic_module.compute_bonus(state, action, next_state)
    return extrinsic + beta * intrinsic

# Common pattern:
# - Early training: beta high   -> explore widely
# - Later training: beta decays -> focus on the task reward
#
# Key design choice: what should the intrinsic reward measure?
# Options: novelty, surprise, prediction error, information gain
```

| Method | Intrinsic Reward Based On | Intuition |
|---|---|---|
| Count-Based | Visitation count: 1/√N(s) | Visit rarely-seen states |
| Prediction Error | ‖ŝ' - s'‖² | Seek surprising transitions |
| Information Gain | Reduction in model uncertainty | Learn about unknown dynamics |
| Empowerment | Mutual information I(a; s'|s) | Maximize influence on future |
| Disagreement | Variance across model ensemble | Explore where models disagree |
In tabular RL, count-based exploration is simple: track N(s) and give bonus 1/√N(s). But in deep RL, states are continuous and never repeat exactly. How do we count?
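The tabular recipe fits in a few lines. This is a minimal sketch, where the bonus scale `c` is an illustrative hyperparameter:

```python
from collections import defaultdict
import math

class TabularCountBonus:
    """Exploration bonus c / sqrt(N(s)) from exact visit counts."""
    def __init__(self, c=1.0):
        self.c = c
        self.counts = defaultdict(int)

    def bonus(self, state):
        # Count this visit, then return the bonus for the updated count
        self.counts[state] += 1
        return self.c / math.sqrt(self.counts[state])

bonus = TabularCountBonus()
first = bonus.bonus("s0")   # first visit: 1/sqrt(1) = 1.0
second = bonus.bonus("s0")  # second visit: 1/sqrt(2) ≈ 0.707
```

The bonus decays toward zero for frequently visited states, pushing the agent toward the frontier of its experience.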
Pseudo-counts approximate visitation by measuring how "familiar" a state is:
```python
import numpy as np

class PseudoCountExploration:
    """
    Use a density model rho(s) to derive pseudo-counts.

    With rho = density before observing s and rho' = density after one
    update on s, the pseudo-count is approximately
        N_hat(s) ≈ rho / (rho' - rho)
    (the full formula includes a (1 - rho') factor, which is near 1 for
    rare states). Low density -> low pseudo-count -> high bonus.
    """
    def __init__(self, state_dim):
        # Train a density model (e.g., PixelCNN for images);
        # DensityModel is a placeholder for such a model
        self.density_model = DensityModel(state_dim)

    def compute_bonus(self, state):
        # Density before and after observing this state
        rho_old = self.density_model.predict_density(state)
        self.density_model.update(state)
        rho_new = self.density_model.predict_density(state)

        if rho_old == 0:
            return 1.0  # Completely novel state

        pseudo_count = rho_old / max(1e-10, rho_new - rho_old)

        # Exploration bonus inversely proportional to the pseudo-count
        return 1.0 / np.sqrt(pseudo_count + 1e-8)
```

Prediction-error curiosity has a failure mode: a "noisy TV" showing random images generates unbounded prediction error (it is unpredictable by design) and therefore unbounded intrinsic reward. The agent becomes mesmerized, watching random noise instead of exploring meaningfully. This motivates more sophisticated curiosity signals that distinguish learnable novelty from inherent stochasticity.
Intrinsic Curiosity Module (ICM) is a landmark approach to curiosity-driven exploration. It addresses the noisy TV problem by measuring prediction error in a learned feature space that captures only controllable aspects of the environment.
ICM trains two networks on top of a shared feature encoder φ: a forward model that predicts φ(s') from (φ(s), a), and an inverse model that predicts the action a from (φ(s), φ(s')).
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICM(nn.Module):
    """
    Intrinsic Curiosity Module: curiosity = forward-model prediction
    error in a learned feature space.

    Key insight: the inverse model forces the features to capture only
    aspects of the observation that actions affect, filtering out noise.
    """
    def __init__(self, obs_dim, action_dim, feature_dim=256):
        super().__init__()
        self.action_dim = action_dim

        # Feature encoder: observation -> features
        # (assumes 84x84 RGB inputs, so the conv stack flattens to 64*7*7)
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, feature_dim)
        )

        # Forward model: (phi(s), a) -> predicted phi(s')
        self.forward_model = nn.Sequential(
            nn.Linear(feature_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, feature_dim)
        )

        # Inverse model: (phi(s), phi(s')) -> predicted action
        self.inverse_model = nn.Sequential(
            nn.Linear(feature_dim * 2, 256), nn.ReLU(),
            nn.Linear(256, action_dim)
        )

    def compute_intrinsic_reward(self, state, action, next_state):
        """Intrinsic reward = forward prediction error in feature space."""
        phi_s = self.encoder(state)
        phi_next_s = self.encoder(next_state)

        action_onehot = F.one_hot(action, num_classes=self.action_dim).float()
        predicted_phi_next = self.forward_model(
            torch.cat([phi_s, action_onehot], dim=-1)
        )

        # Curiosity = squared L2 prediction error
        return 0.5 * ((predicted_phi_next - phi_next_s) ** 2).sum(dim=-1)

    def compute_losses(self, state, action, next_state):
        """Training losses for ICM."""
        phi_s = self.encoder(state)
        phi_next_s = self.encoder(next_state)

        # Forward model loss (don't backprop through the target features)
        action_onehot = F.one_hot(action, num_classes=self.action_dim).float()
        predicted_phi_next = self.forward_model(
            torch.cat([phi_s, action_onehot], dim=-1)
        )
        forward_loss = F.mse_loss(predicted_phi_next, phi_next_s.detach())

        # Inverse model loss
        predicted_action = self.inverse_model(
            torch.cat([phi_s, phi_next_s], dim=-1)
        )
        inverse_loss = F.cross_entropy(predicted_action, action)

        return forward_loss, inverse_loss
```

ICM enables exploration in challenging sparse-reward environments:
Super Mario Bros: Without any extrinsic reward, an ICM-driven agent explores levels, collects coins, and defeats enemies—purely from curiosity.
VizDoom: Navigates complex 3D mazes to find novel locations.
Key Insight: The inverse model is crucial. By predicting actions from state transitions, it forces the feature space to represent only aspects that the agent can affect—filtering out visual noise, weather effects, and other stochastic elements.
ICM can fail when: (1) Important state aspects aren't action-predictable (e.g., hidden enemy positions), (2) Feature learning is hard (very high-dimensional observations), (3) Tasks require long-horizon planning beyond local curiosity. These limitations motivate more advanced exploration methods.
Random Network Distillation (RND) is an elegantly simple method that achieves state-of-the-art exploration in many challenging environments. The idea is almost too simple to believe it works: train a predictor to match the output of a random target network.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RND(nn.Module):
    """
    Random Network Distillation: intrinsic reward = error in predicting
    the output of a fixed, randomly initialized target network.

    Brilliantly simple: random targets + prediction error = novelty detection.
    """
    def __init__(self, obs_dim, feature_dim=512):
        super().__init__()

        # Target network: FIXED random weights (never trained!)
        self.target = nn.Sequential(
            nn.Linear(obs_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, feature_dim)
        )
        for param in self.target.parameters():
            param.requires_grad = False

        # Predictor network: trained to match the target
        self.predictor = nn.Sequential(
            nn.Linear(obs_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, feature_dim)
        )

    def compute_intrinsic_reward(self, obs):
        """
        Novel states: the predictor hasn't learned the target's output -> high error.
        Familiar states: the predictor has learned it -> low error.
        """
        with torch.no_grad():
            target_features = self.target(obs)
            predicted_features = self.predictor(obs)
        return ((predicted_features - target_features) ** 2).sum(dim=-1)

    def update(self, obs):
        """Train the predictor to match the target on visited states."""
        target_features = self.target(obs).detach()
        predicted_features = self.predictor(obs)
        return F.mse_loss(predicted_features, target_features)
```

The magic of RND lies in understanding what the predictor learns:
Generalization: The predictor generalizes from visited states to similar states. It can predict the random target's output for states similar to those it's seen.
Novelty Detection: For truly novel states (far from training distribution), the predictor cannot accurately predict the random target, even though the target is deterministic.
No Noisy TV Problem: The target network's output is deterministic. There's no stochasticity to get confused by—the predictor just needs to learn the target's fixed mapping.
| Game | Previous Best | RND+PPO | Human |
|---|---|---|---|
| Montezuma's Revenge | ~2,500 | ~10,000+ | ~4,700 |
| Gravitar | ~500 | ~4,000+ | ~3,350 |
| Private Eye | ~100 | ~15,000+ | ~69,500 |
| Venture | ~0 | ~1,800+ | ~1,188 |
Montezuma's Revenge is the canonical hard-exploration benchmark. The game requires complex sequences of actions (keys, doors, ladders) with extremely sparse rewards. DQN scores near 0. Prior to RND, only carefully designed algorithms could achieve significant scores. RND exceeds average human performance on it with a simple, general approach.
Go-Explore takes a radically different approach to hard exploration problems. Instead of hoping the policy discovers interesting states through intrinsic motivation, Go-Explore first explores the environment through a separate mechanism, then learns to reach discovered goals.
Many exploration failures occur because policies "forget" how to reach promising states. The agent might briefly discover a distant room, but then the policy drifts and can never return. Go-Explore addresses this by remembering promising states in an archive, returning to them reliably, and exploring further from there.
```python
import random

class GoExplore:
    """
    Go-Explore: remember promising states, return to them, explore further.

    Two phases:
    1. Exploration phase: find goals and trajectories in a deterministic setting
    2. Robustification phase: learn a policy that works under stochasticity
    """
    def __init__(self, env, state_encoder, explore_steps=100):
        self.env = env
        self.state_encoder = state_encoder  # Maps states to discrete cells
        self.explore_steps = explore_steps
        self.archive = {}  # cell -> (state, trajectory, score)

    def exploration_phase(self, num_iterations):
        """Phase 1: explore using archive-based search."""
        # Seed the archive with the initial state
        state = self.env.reset()
        self.archive[self.state_encoder(state)] = (state, [], 0.0)

        for _ in range(num_iterations):
            # 1. SELECT: choose a cell to explore from (here: uniformly;
            #    the full algorithm prefers under-explored cells)
            cell = random.choice(list(self.archive))
            state, trajectory, _ = self.archive[cell]
            trajectory = list(trajectory)  # Copy: don't mutate the archive entry

            # 2. GO: return to the selected state by deterministic replay
            self.env.reset()
            for action in trajectory:
                self.env.step(action)

            # 3. EXPLORE: take random actions from this state
            for _ in range(self.explore_steps):
                action = self.env.action_space.sample()
                next_state, reward, done, _ = self.env.step(action)
                trajectory.append(action)

                # Keep the new cell if it's new, or reached by a shorter path
                new_cell = self.state_encoder(next_state)
                if (new_cell not in self.archive
                        or len(trajectory) < len(self.archive[new_cell][1])):
                    self.archive[new_cell] = (next_state, list(trajectory), reward)

                if done:
                    break

    def robustification_phase(self, trajectories):
        """
        Phase 2: learn a robust policy from the demonstrated trajectories,
        using imitation learning + RL to handle stochasticity.
        (train_goal_conditioned_policy is a placeholder.)
        """
        return train_goal_conditioned_policy(trajectories)
```

Go-Explore groups similar states into cells for tractable archive management. The cell representation is critical:
For Atari (images), frames are aggressively downsampled to a tiny grayscale image with a handful of intensity levels, so that visually similar frames map to the same cell.
Go-Explore is particularly effective when the simulator can be reset to saved states, rewards are sparse or deceptive, and a meaningful low-dimensional cell representation is available.
Go-Explore's exploration phase requires resetting the environment to arbitrary discovered states. This is natural in simulation but impossible in the physical world. The robustification phase addresses this by learning a policy that can reach target states from initial states, but the reliance on reset during exploration limits applicability.
Exploration remains an active research frontier. Let's survey promising modern directions:
Instead of exploring individual states, skill discovery methods learn reusable exploration primitives—diverse behaviors that can be composed for efficient exploration:
```python
import numpy as np
import torch

# DIAYN (Diversity is All You Need):
# learn diverse skills without extrinsic reward.
# Skills that are distinguishable -> diverse exploration.
#
# Objective: maximize the mutual information I(s; z).
# Intuition: each skill z should visit distinctive states.

class DIAYN:
    def __init__(self, policy, discriminator, n_skills):
        self.policy = policy                # pi(a | s, z)
        self.discriminator = discriminator  # q(z | s)
        self.n_skills = n_skills

    def compute_intrinsic_reward(self, state, skill):
        """
        Reward for visiting states that identify the skill.
        High reward = the discriminator correctly infers the skill from the state.
        """
        skill_probs = self.discriminator(state)
        log_q_z_given_s = torch.log(skill_probs[skill] + 1e-8)

        # Subtract the log prior to encourage diversity
        log_p_z = np.log(1.0 / self.n_skills)
        return log_q_z_given_s - log_p_z
```

Agents can fail to explore if they don't remember what they've already seen. Episodic curiosity methods maintain explicit memory of visited states within an episode:
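As one illustration, here is a minimal sketch of an episodic novelty bonus (a simplified, hypothetical variant, not a specific published algorithm): approximate a within-episode visit count by the number of previously stored embeddings within a radius, and pay 1/√(n + 1):

```python
import numpy as np

class EpisodicNovelty:
    """Within-episode novelty bonus via an explicit memory of embeddings.

    The count of nearby stored embeddings stands in for a visit count;
    the bonus decays as a region of state space is revisited.
    """
    def __init__(self, radius=0.5):
        self.radius = radius
        self.memory = []  # embeddings seen this episode

    def reset(self):
        # Novelty is per-episode: clear memory at episode start
        self.memory = []

    def bonus(self, embedding):
        embedding = np.asarray(embedding, dtype=float)
        if self.memory:
            dists = np.linalg.norm(np.stack(self.memory) - embedding, axis=1)
            n_similar = int((dists < self.radius).sum())
        else:
            n_similar = 0
        self.memory.append(embedding)
        return 1.0 / np.sqrt(n_similar + 1)
```

A first visit to any region pays the full bonus of 1.0; revisits pay less, and a `reset()` at episode start makes the same region novel again next episode — unlike lifetime counts or RND, which saturate permanently.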
Goal-conditioned RL learns policies that can achieve arbitrary goals, enabling structured exploration: by setting and pursuing its own goals, the agent turns exploration into goal-reaching practice.
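A widely used technique in this family is hindsight relabeling, popularized by Hindsight Experience Replay: even a failed trajectory succeeded at reaching the states it actually visited, so we can relabel its transitions with those states as goals. A minimal sketch, where the transition format and `reward_fn` are illustrative assumptions:

```python
import random

def hindsight_relabel(trajectory, reward_fn, k=4):
    """Relabel each transition with goals sampled from states actually
    reached later in the episode (the 'future' strategy).

    trajectory: list of (state, action, next_state) tuples
    reward_fn(next_state, goal) -> reward under the relabeled goal
    """
    relabeled = []
    for t, (s, a, s_next) in enumerate(trajectory):
        # Achieved states from this step onward serve as substitute goals
        future = [sn for (_, _, sn) in trajectory[t:]]
        for goal in random.sample(future, min(k, len(future))):
            relabeled.append((s, a, s_next, goal, reward_fn(s_next, goal)))
    return relabeled

# Sparse goal-reaching reward: 0 if the goal is achieved, -1 otherwise
reward_fn = lambda s_next, goal: 0.0 if s_next == goal else -1.0
traj = [(0, "right", 1), (1, "right", 2)]
batch = hindsight_relabel(traj, reward_fn, k=2)
```

Even though the original episode may have earned no reward, the relabeled batch contains transitions with positive learning signal, which is exactly what sparse-reward exploration lacks.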
Large language models open new exploration paradigms: they can propose plausible subgoals, inject prior knowledge about which actions are worth trying, and describe novelty in natural language.
Despite progress, exploration remains a fundamental open problem. No method reliably solves all hard-exploration tasks. The field lacks a theoretical framework that predicts when exploration will succeed or fail. State-of-the-art methods still require millions of samples on challenging domains. Progress continues, but general-purpose, sample-efficient exploration is not yet in hand.
To consolidate the key insights from our exploration of... exploration (pun intended): strategies range from simple randomness (ε-greedy, Boltzmann) through uncertainty-based selection (UCB, Thompson Sampling) to intrinsic motivation (count-based bonuses, ICM, RND) and archive-based search (Go-Explore), with each step up in sophistication buying effectiveness in sparser-reward, larger-state-space settings.
What's Next:
We've covered how to explore. But what motivates exploration? The answer is reward. The next page tackles reward engineering—the challenge of specifying what we want agents to optimize, the subtle ways this can go wrong, and the emerging research on learning rewards from human feedback.
You now understand the exploration challenge in RL, from the theoretical exploration-exploitation tradeoff to practical methods including intrinsic motivation, ICM, RND, and Go-Explore. Next, we examine reward engineering—specifying what agents should actually optimize.