Imagine you're in a new city without a map. You know your hotel is somewhere nearby, but you don't know which direction. Every step you take could bring you closer to your hotel—or lead you further away. How do you search effectively?
Option 1: Random walk. Take steps in random directions. You'll eventually find the hotel, but it might take a very long time.
Option 2: Systematic coverage. Explore the entire city grid by grid. Thorough, but extremely slow.
Option 3: Informed search. Ask locals, follow signs, use landmarks. Much faster, but requires prior knowledge you may not have.
This is the exploration problem in reinforcement learning—and it's arguably the most fundamental challenge in the field. Before an agent can learn anything, it must first discover rewarding experiences through its own actions. In environments with sparse rewards, random exploration might take astronomical time to stumble upon any positive signal. Yet systematic exploration can be equally impractical in large state spaces.
Exploration is where theory meets practice in RL. Algorithms that work brilliantly on benchmarks with dense rewards often fail completely when rewards are sparse. Understanding exploration is essential for deploying RL in real-world scenarios.
By the end of this page, you will understand: (1) The exploration-exploitation tradeoff and its mathematical formalization, (2) Classical exploration strategies (ε-greedy, UCB, Thompson Sampling), (3) Intrinsic motivation and curiosity-driven exploration, (4) Count-based and prediction-error bonuses, and (5) Modern approaches including Random Network Distillation (RND) and Go-Explore.
The exploration-exploitation tradeoff is the central tension in sequential decision-making: should you exploit known good actions to maximize immediate reward, or explore uncertain actions that might lead to better long-term outcomes?
Consider a restaurant choice problem: do you return to a reliable favorite (exploit), or try a new place that might be better, or might be worse (explore)?
Too much exploitation means you might miss superior options. Too much exploration means you waste time on poor choices when better ones are known.
The multi-armed bandit problem crystallizes the exploration-exploitation tradeoff in its purest form:
```python
import numpy as np

# Multi-Armed Bandit: k slot machines (arms).
# Each arm i has an unknown reward distribution with mean mu_i.
# Goal: maximize total reward over T pulls.

class MultiArmedBandit:
    def __init__(self, k_arms):
        # True (unknown) mean rewards for each arm
        self.true_means = np.random.randn(k_arms)

    def pull(self, arm):
        # Return a noisy reward drawn from the arm's distribution
        return self.true_means[arm] + np.random.randn()

# The challenge: find the arm with the highest mu_i quickly,
# trading off exploring to find the best arm vs. exploiting known good arms.
#
# Regret: difference between the optimal strategy and ours:
#   R(T) = T * mu_star - sum_{t=1}^{T} mu_{a_t}
#
# Good algorithms achieve sublinear regret: R(T) = O(log T),
# so the exploration overhead becomes negligible as T -> infinity.
```

Regret measures the cost of not knowing the optimal actions in advance. It accumulates the difference between what we earned and what we could have earned with perfect knowledge:
$$\text{Regret}(T) = \sum_{t=1}^{T} \left( \mu^* - \mu_{a_t} \right)$$
where $\mu^*$ is the mean reward of the best arm, and $\mu_{a_t}$ is the mean reward of the arm we pulled at time $t$.
The goal of exploration strategies is to minimize regret—learning which actions are best with minimal cost.
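To make the definition concrete, here is a minimal sketch (the `compute_regret` helper and the example arm means are illustrative, not from any library). It accumulates regret for a sequence of pulls using the true means, which a real agent would not know, so this is a diagnostic rather than something computable online:

```python
import numpy as np

def compute_regret(true_means, pulled_arms):
    """Cumulative regret for a sequence of arm pulls.

    Regret(T) = sum_t (mu* - mu_{a_t}), evaluated with the true
    (normally unknown) means.
    """
    mu_star = np.max(true_means)
    per_step = mu_star - np.asarray(true_means)[pulled_arms]
    return per_step.cumsum()

true_means = np.array([0.1, 0.5, 0.9])
pulls = [0, 1, 2, 2, 2]  # two suboptimal pulls, then the best arm
regret = compute_regret(true_means, pulls)
# Regret grows only while suboptimal arms are pulled, then flattens
```

Note how the curve flattens once only the best arm is pulled — this plateau is exactly what sublinear-regret algorithms achieve asymptotically.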
| Strategy | Regret Bound | Characteristics |
|---|---|---|
| Pure Exploitation | O(T) — Linear | Never finds better arms; stuck on initial best |
| Pure Random | O(T) — Linear | Never converges; equal pulls on all arms |
| ε-Greedy | O(T) — Linear | Fixed exploration; never stops exploring |
| Decaying ε-Greedy | O(log T) — Sublinear | Decreasing exploration; asymptotically optimal |
| UCB | O(log T) — Sublinear | Optimism-based; provably near-optimal |
| Thompson Sampling | O(log T) — Sublinear | Bayesian; often best empirically |
The multi-armed bandit is a simplified RL problem with only one state. In full RL, exploration is dramatically harder because: (1) Actions affect future states, (2) The value of exploration is itself uncertain, (3) The state space may be vast or continuous. Bandit algorithms provide intuition, but don't directly solve the general RL exploration problem.
Let's examine the classical approaches to exploration, building from simple to sophisticated.
The most common exploration strategy in deep RL is ε-greedy: with probability ε take a random action, otherwise take the greedy action.
```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon):
    """
    With probability epsilon: random action.
    With probability (1 - epsilon): best action according to Q.
    """
    if np.random.random() < epsilon:
        return np.random.randint(len(q_values))  # Explore
    else:
        return np.argmax(q_values)               # Exploit

# In practice: decay epsilon over training
def get_epsilon(step, epsilon_start=1.0, epsilon_end=0.01, decay_steps=100000):
    """Linear decay from epsilon_start to epsilon_end."""
    progress = min(step / decay_steps, 1.0)
    return epsilon_start - progress * (epsilon_start - epsilon_end)

# DQN anneals epsilon from 1.0 to 0.1 over the first 1M steps, then keeps it at 0.1
```

Limitations of ε-greedy: exploration is undirected (the random action is chosen uniformly, ignoring which actions are actually uncertain) and state-independent, so known-bad actions are sampled as often as promising ones, and with a fixed ε the agent never stops paying an exploration tax.
Boltzmann exploration samples actions proportionally to their Q-values:
```python
import numpy as np

def boltzmann_action(q_values, temperature):
    """
    Sample an action with probability proportional to exp(Q / tau).

    Temperature tau controls exploration:
    - High tau -> nearly uniform (exploration)
    - Low tau  -> nearly greedy (exploitation)
    """
    scaled_q = q_values / temperature
    scaled_q -= scaled_q.max()  # For numerical stability
    exp_q = np.exp(scaled_q)
    probs = exp_q / exp_q.sum()
    return np.random.choice(len(q_values), p=probs)

# Advantages over ε-greedy:
# - Actions with higher Q are more likely even during exploration
# - Actions with similar Q get similar probabilities
```

UCB (Upper Confidence Bound) implements the principle of optimism in the face of uncertainty: prefer actions that might be good because they haven't been tried enough.
```python
import numpy as np

def ucb_action(q_values, action_counts, total_count, c=2.0):
    """
    Select the action that maximizes:

        Q(a)   +   c * sqrt(log(t) / n(a))
        exploit            explore

    Actions with low counts get a large exploration bonus;
    c controls the exploration-exploitation balance.
    """
    exploration_bonus = c * np.sqrt(
        np.log(max(total_count, 1)) / (action_counts + 1e-8)
    )
    ucb_values = q_values + exploration_bonus
    return np.argmax(ucb_values)

# Why it works:
# - Uncertainty shrinks as sqrt(1/n) for each action
# - Unexplored actions have high uncertainty -> high bonus
# - As an action is tried more, its bonus decreases -> exploit
#
# Provably achieves O(log T) regret in multi-armed bandits
```

Thompson Sampling maintains a full probability distribution over each action's value and samples from these distributions to choose actions:
```python
import numpy as np

class ThompsonSampling:
    """
    Bayesian approach: maintain a posterior over each arm's reward
    distribution, sample from the posteriors, and act greedily on the sample.
    """
    def __init__(self, n_arms):
        # Beta(alpha, beta) models the probability of success for each arm
        self.alphas = np.ones(n_arms)  # Successes + 1
        self.betas = np.ones(n_arms)   # Failures + 1

    def select_action(self):
        # Sample from the posterior for each arm
        sampled_probs = np.random.beta(self.alphas, self.betas)
        return np.argmax(sampled_probs)

    def update(self, arm, reward):
        # Update the posterior from the observed (assumed binary) reward
        if reward > 0.5:
            self.alphas[arm] += 1
        else:
            self.betas[arm] += 1

# Why Thompson Sampling works:
# - Naturally balances exploration (uncertain arms have wide posteriors)
# - Automatically reduces exploration as uncertainty shrinks
# - Often outperforms UCB empirically
# - Computationally simple for conjugate priors
```

Classical exploration works well when actions are independent (bandits) or the state space is small enough for count-based uncertainty estimates. In large state spaces with deep function approximation, these methods struggle: you never see exactly the same state twice, so counts are always 1. This motivates intrinsic motivation and learned exploration strategies.
Human learning isn't driven purely by external rewards. Children explore their environment with apparent curiosity—drawn to novelty, surprised by unexpected outcomes, satisfied by understanding. Can we instill similar drives in RL agents?
Intrinsic motivation provides internal reward signals that encourage exploration, supplementing or replacing sparse external rewards. The agent receives reward for discovering new things, even without explicit task reward.
```python
# Total reward = extrinsic (from environment) + intrinsic (from curiosity)

def combined_reward(state, action, next_state, env_reward,
                    intrinsic_module, beta=0.1):
    """
    Combine the external task reward with an intrinsic exploration reward.
    beta controls the exploration vs. exploitation balance.
    """
    extrinsic = env_reward  # From the environment
    intrinsic = intrinsic_module.compute_bonus(state, action, next_state)
    return extrinsic + beta * intrinsic

# Common pattern:
# - Early training: beta high   -> explore widely
# - Later training: beta decays -> focus on the task reward
#
# Key design choice: what should the intrinsic reward measure?
# Options: novelty, surprise, prediction error, information gain
```

| Method | Intrinsic Reward Based On | Intuition |
|---|---|---|
| Count-Based | Visitation count: 1/√N(s) | Visit rarely-seen states |
| Prediction Error | ‖ŝ' - s'‖² | Seek surprising transitions |
| Information Gain | Reduction in model uncertainty | Learn about unknown dynamics |
| Empowerment | Mutual information I(a; s'|s) | Maximize influence on future |
| Disagreement | Variance across model ensemble | Explore where models disagree |
In tabular RL, count-based exploration is simple: track N(s) and give bonus 1/√N(s). But in deep RL, states are continuous and never repeat exactly. How do we count?
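The tabular recipe fits in a few lines. This is a minimal sketch, where the bonus scale `c` is an illustrative hyperparameter:

```python
from collections import defaultdict
import math

class TabularCountBonus:
    """Exploration bonus c / sqrt(N(s)) from exact visit counts."""
    def __init__(self, c=1.0):
        self.c = c
        self.counts = defaultdict(int)

    def bonus(self, state):
        # Count this visit, then return the bonus for the updated count
        self.counts[state] += 1
        return self.c / math.sqrt(self.counts[state])

bonus = TabularCountBonus()
first = bonus.bonus("s0")   # first visit: 1/sqrt(1) = 1.0
second = bonus.bonus("s0")  # second visit: 1/sqrt(2) ≈ 0.707
```

The bonus decays toward zero for frequently visited states, pushing the agent toward the frontier of its experience.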
Pseudo-counts approximate visitation by measuring how "familiar" a state is:
```python
import numpy as np

class PseudoCountExploration:
    """
    Use a density model rho(s) to derive pseudo-counts.

    With rho = density before observing s and rho' = density after one
    update on s, the pseudo-count is approximately
        N_hat(s) ≈ rho / (rho' - rho)
    (the full formula includes a (1 - rho') factor, which is near 1 for
    rare states). Low density -> low pseudo-count -> high bonus.
    """
    def __init__(self, state_dim):
        # Train a density model (e.g., PixelCNN for images);
        # DensityModel is a placeholder for such a model
        self.density_model = DensityModel(state_dim)

    def compute_bonus(self, state):
        # Density before and after observing this state
        rho_old = self.density_model.predict_density(state)
        self.density_model.update(state)
        rho_new = self.density_model.predict_density(state)

        if rho_old == 0:
            return 1.0  # Completely novel state

        pseudo_count = rho_old / max(1e-10, rho_new - rho_old)

        # Exploration bonus inversely proportional to the pseudo-count
        return 1.0 / np.sqrt(pseudo_count + 1e-8)
```

Prediction-error curiosity has a failure mode: a "noisy TV" showing random images generates unbounded prediction error (it is unpredictable by design) and therefore unbounded intrinsic reward. The agent becomes mesmerized, watching random noise instead of exploring meaningfully. This motivates more sophisticated curiosity signals that distinguish learnable novelty from inherent stochasticity.
Intrinsic Curiosity Module (ICM) is a landmark approach to curiosity-driven exploration. It addresses the noisy TV problem by measuring prediction error in a learned feature space that captures only controllable aspects of the environment.
ICM trains two networks on top of a shared feature encoder φ: a forward model that predicts φ(s') from (φ(s), a), and an inverse model that predicts the action a from (φ(s), φ(s')).
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICM(nn.Module):
    """
    Intrinsic Curiosity Module: curiosity = forward-model prediction
    error in a learned feature space.

    Key insight: the inverse model forces the features to capture only
    aspects of the observation that actions affect, filtering out noise.
    """
    def __init__(self, obs_dim, action_dim, feature_dim=256):
        super().__init__()
        self.action_dim = action_dim

        # Feature encoder: observation -> features
        # (assumes 84x84 RGB inputs, so the conv stack flattens to 64*7*7)
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, feature_dim)
        )

        # Forward model: (phi(s), a) -> predicted phi(s')
        self.forward_model = nn.Sequential(
            nn.Linear(feature_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, feature_dim)
        )

        # Inverse model: (phi(s), phi(s')) -> predicted action
        self.inverse_model = nn.Sequential(
            nn.Linear(feature_dim * 2, 256), nn.ReLU(),
            nn.Linear(256, action_dim)
        )

    def compute_intrinsic_reward(self, state, action, next_state):
        """Intrinsic reward = forward prediction error in feature space."""
        phi_s = self.encoder(state)
        phi_next_s = self.encoder(next_state)

        action_onehot = F.one_hot(action, num_classes=self.action_dim).float()
        predicted_phi_next = self.forward_model(
            torch.cat([phi_s, action_onehot], dim=-1)
        )

        # Curiosity = squared L2 prediction error
        return 0.5 * ((predicted_phi_next - phi_next_s) ** 2).sum(dim=-1)

    def compute_losses(self, state, action, next_state):
        """Training losses for ICM."""
        phi_s = self.encoder(state)
        phi_next_s = self.encoder(next_state)

        # Forward model loss (don't backprop through the target features)
        action_onehot = F.one_hot(action, num_classes=self.action_dim).float()
        predicted_phi_next = self.forward_model(
            torch.cat([phi_s, action_onehot], dim=-1)
        )
        forward_loss = F.mse_loss(predicted_phi_next, phi_next_s.detach())

        # Inverse model loss
        predicted_action = self.inverse_model(
            torch.cat([phi_s, phi_next_s], dim=-1)
        )
        inverse_loss = F.cross_entropy(predicted_action, action)

        return forward_loss, inverse_loss
```

ICM enables exploration in challenging sparse-reward environments:
Super Mario Bros: Without any extrinsic reward, an ICM-driven agent explores levels, collects coins, and defeats enemies—purely from curiosity.
VizDoom: Navigates complex 3D mazes to find novel locations.
Key Insight: The inverse model is crucial. By predicting actions from state transitions, it forces the feature space to represent only aspects that the agent can affect—filtering out visual noise, weather effects, and other stochastic elements.
ICM can fail when: (1) Important state aspects aren't action-predictable (e.g., hidden enemy positions), (2) Feature learning is hard (very high-dimensional observations), (3) Tasks require long-horizon planning beyond local curiosity. These limitations motivate more advanced exploration methods.
Random Network Distillation (RND) is an elegantly simple method that achieves state-of-the-art exploration in many challenging environments. The idea is almost too simple to believe it works: train a predictor to match the output of a random target network.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RND(nn.Module):
    """
    Random Network Distillation: intrinsic reward = error in predicting
    the output of a fixed, randomly initialized target network.

    Brilliantly simple: random targets + prediction error = novelty detection.
    """
    def __init__(self, obs_dim, feature_dim=512):
        super().__init__()

        # Target network: FIXED random weights (never trained!)
        self.target = nn.Sequential(
            nn.Linear(obs_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, feature_dim)
        )
        for param in self.target.parameters():
            param.requires_grad = False

        # Predictor network: trained to match the target
        self.predictor = nn.Sequential(
            nn.Linear(obs_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, feature_dim)
        )

    def compute_intrinsic_reward(self, obs):
        """
        Novel states: the predictor hasn't learned the target's output -> high error.
        Familiar states: the predictor has learned it -> low error.
        """
        with torch.no_grad():
            target_features = self.target(obs)
            predicted_features = self.predictor(obs)
        return ((predicted_features - target_features) ** 2).sum(dim=-1)

    def update(self, obs):
        """Train the predictor to match the target on visited states."""
        target_features = self.target(obs).detach()
        predicted_features = self.predictor(obs)
        return F.mse_loss(predicted_features, target_features)
```

The magic of RND lies in understanding what the predictor learns:
Generalization: The predictor generalizes from visited states to similar states. It can predict the random target's output for states similar to those it's seen.
Novelty Detection: For truly novel states (far from training distribution), the predictor cannot accurately predict the random target, even though the target is deterministic.
No Noisy TV Problem: The target network's output is deterministic. There's no stochasticity to get confused by—the predictor just needs to learn the target's fixed mapping.
| Game | Previous Best | RND+PPO | Human |
|---|---|---|---|
| Montezuma's Revenge | ~2,500 | ~10,000+ | ~4,700 |
| Gravitar | ~500 | ~4,000+ | ~3,350 |
| Private Eye | ~100 | ~15,000+ | ~69,500 |
| Venture | ~0 | ~1,800+ | ~1,188 |
Montezuma's Revenge is the canonical hard-exploration benchmark. The game requires complex sequences of actions (keys, doors, ladders) with extremely sparse rewards. DQN scores near 0. Prior to RND, only carefully designed algorithms could achieve significant scores. RND exceeds average human performance on it with a simple, general approach.
Go-Explore takes a radically different approach to hard exploration problems. Instead of hoping the policy discovers interesting states through intrinsic motivation, Go-Explore first explores the environment through a separate mechanism, then learns to reach discovered goals.
Many exploration failures occur because policies "forget" how to reach promising states. The agent might briefly discover a distant room, but then the policy drifts and can never return. Go-Explore addresses this by remembering promising states in an archive, returning to them reliably, and exploring further from there.
```python
import random

class GoExplore:
    """
    Go-Explore: remember promising states, return to them, explore further.

    Two phases:
    1. Exploration phase: find goals and trajectories in a deterministic setting
    2. Robustification phase: learn a policy that works under stochasticity
    """
    def __init__(self, env, state_encoder, explore_steps=100):
        self.env = env
        self.state_encoder = state_encoder  # Maps states to discrete cells
        self.explore_steps = explore_steps
        self.archive = {}  # cell -> (state, trajectory, score)

    def exploration_phase(self, num_iterations):
        """Phase 1: explore using archive-based search."""
        # Seed the archive with the initial state
        state = self.env.reset()
        self.archive[self.state_encoder(state)] = (state, [], 0.0)

        for _ in range(num_iterations):
            # 1. SELECT: choose a cell to explore from (here: uniformly;
            #    the full algorithm prefers under-explored cells)
            cell = random.choice(list(self.archive))
            state, trajectory, _ = self.archive[cell]
            trajectory = list(trajectory)  # Copy: don't mutate the archive entry

            # 2. GO: return to the selected state by deterministic replay
            self.env.reset()
            for action in trajectory:
                self.env.step(action)

            # 3. EXPLORE: take random actions from this state
            for _ in range(self.explore_steps):
                action = self.env.action_space.sample()
                next_state, reward, done, _ = self.env.step(action)
                trajectory.append(action)

                # Keep the new cell if it's new, or reached by a shorter path
                new_cell = self.state_encoder(next_state)
                if (new_cell not in self.archive
                        or len(trajectory) < len(self.archive[new_cell][1])):
                    self.archive[new_cell] = (next_state, list(trajectory), reward)

                if done:
                    break

    def robustification_phase(self, trajectories):
        """
        Phase 2: learn a robust policy from the demonstrated trajectories,
        using imitation learning + RL to handle stochasticity.
        (train_goal_conditioned_policy is a placeholder.)
        """
        return train_goal_conditioned_policy(trajectories)
```

Go-Explore groups similar states into cells for tractable archive management. The cell representation is critical:
For Atari (images), frames are aggressively downsampled to a tiny grayscale image with a handful of intensity levels, so that visually similar frames map to the same cell.
Go-Explore is particularly effective when the simulator can be reset to saved states, rewards are sparse or deceptive, and a meaningful low-dimensional cell representation is available.
Go-Explore's exploration phase requires resetting the environment to arbitrary discovered states. This is natural in simulation but impossible in the physical world. The robustification phase addresses this by learning a policy that can reach target states from initial states, but the reliance on reset during exploration limits applicability.
Exploration remains an active research frontier. Let's survey promising modern directions:
Instead of exploring individual states, skill discovery methods learn reusable exploration primitives—diverse behaviors that can be composed for efficient exploration:
```python
import numpy as np
import torch

# DIAYN (Diversity is All You Need):
# learn diverse skills without extrinsic reward.
# Skills that are distinguishable -> diverse exploration.
#
# Objective: maximize the mutual information I(s; z).
# Intuition: each skill z should visit distinctive states.

class DIAYN:
    def __init__(self, policy, discriminator, n_skills):
        self.policy = policy                # pi(a | s, z)
        self.discriminator = discriminator  # q(z | s)
        self.n_skills = n_skills

    def compute_intrinsic_reward(self, state, skill):
        """
        Reward for visiting states that identify the skill.
        High reward = the discriminator correctly infers the skill from the state.
        """
        skill_probs = self.discriminator(state)
        log_q_z_given_s = torch.log(skill_probs[skill] + 1e-8)

        # Subtract the log prior to encourage diversity
        log_p_z = np.log(1.0 / self.n_skills)
        return log_q_z_given_s - log_p_z
```

Agents can fail to explore if they don't remember what they've already seen. Episodic curiosity methods maintain explicit memory of visited states within an episode:
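As one illustration, here is a minimal sketch of an episodic novelty bonus (a simplified, hypothetical variant, not a specific published algorithm): approximate a within-episode visit count by the number of previously stored embeddings within a radius, and pay 1/√(n + 1):

```python
import numpy as np

class EpisodicNovelty:
    """Within-episode novelty bonus via an explicit memory of embeddings.

    The count of nearby stored embeddings stands in for a visit count;
    the bonus decays as a region of state space is revisited.
    """
    def __init__(self, radius=0.5):
        self.radius = radius
        self.memory = []  # embeddings seen this episode

    def reset(self):
        # Novelty is per-episode: clear memory at episode start
        self.memory = []

    def bonus(self, embedding):
        embedding = np.asarray(embedding, dtype=float)
        if self.memory:
            dists = np.linalg.norm(np.stack(self.memory) - embedding, axis=1)
            n_similar = int((dists < self.radius).sum())
        else:
            n_similar = 0
        self.memory.append(embedding)
        return 1.0 / np.sqrt(n_similar + 1)
```

A first visit to any region pays the full bonus of 1.0; revisits pay less, and a `reset()` at episode start makes the same region novel again next episode — unlike lifetime counts or RND, which saturate permanently.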
Goal-conditioned RL learns policies that can achieve arbitrary goals, enabling structured exploration: by setting and pursuing its own goals, the agent turns exploration into goal-reaching practice.
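A widely used technique in this family is hindsight relabeling, popularized by Hindsight Experience Replay: even a failed trajectory succeeded at reaching the states it actually visited, so we can relabel its transitions with those states as goals. A minimal sketch, where the transition format and `reward_fn` are illustrative assumptions:

```python
import random

def hindsight_relabel(trajectory, reward_fn, k=4):
    """Relabel each transition with goals sampled from states actually
    reached later in the episode (the 'future' strategy).

    trajectory: list of (state, action, next_state) tuples
    reward_fn(next_state, goal) -> reward under the relabeled goal
    """
    relabeled = []
    for t, (s, a, s_next) in enumerate(trajectory):
        # Achieved states from this step onward serve as substitute goals
        future = [sn for (_, _, sn) in trajectory[t:]]
        for goal in random.sample(future, min(k, len(future))):
            relabeled.append((s, a, s_next, goal, reward_fn(s_next, goal)))
    return relabeled

# Sparse goal-reaching reward: 0 if the goal is achieved, -1 otherwise
reward_fn = lambda s_next, goal: 0.0 if s_next == goal else -1.0
traj = [(0, "right", 1), (1, "right", 2)]
batch = hindsight_relabel(traj, reward_fn, k=2)
```

Even though the original episode may have earned no reward, the relabeled batch contains transitions with positive learning signal, which is exactly what sparse-reward exploration lacks.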
Large language models open new exploration paradigms: they can propose plausible subgoals, inject prior knowledge about which actions are worth trying, and describe novelty in natural language.
Despite progress, exploration remains a fundamental open problem. No method reliably solves all hard-exploration tasks. The field lacks a theoretical framework that predicts when exploration will succeed or fail. State-of-the-art methods still require millions of samples on challenging domains. Progress continues, but general-purpose, sample-efficient exploration is not yet in hand.
To consolidate the key insights from our exploration of... exploration (pun intended): strategies range from simple randomness (ε-greedy, Boltzmann) through uncertainty-based selection (UCB, Thompson Sampling) to intrinsic motivation (count-based bonuses, ICM, RND) and archive-based search (Go-Explore), with each step up in sophistication buying effectiveness in sparser-reward, larger-state-space settings.
What's Next:
We've covered how to explore. But what motivates exploration? The answer is reward. The next page tackles reward engineering—the challenge of specifying what we want agents to optimize, the subtle ways this can go wrong, and the emerging research on learning rewards from human feedback.
You now understand the exploration challenge in RL, from the theoretical exploration-exploitation tradeoff to practical methods including intrinsic motivation, ICM, RND, and Go-Explore. Next, we examine reward engineering—specifying what agents should actually optimize.