Imagine trying to learn a new skill, but you're forced to practice in strict chronological order—you can never revisit past attempts, never focus on challenging sections, never compare your performance across different sessions. Your learning would be inefficient, dominated by whatever you experienced most recently, prone to forgetting earlier lessons.
This is precisely the problem faced by naive reinforcement learning agents. As they interact with their environment, experiences arrive in a temporal stream where consecutive samples are heavily correlated. State s_{t+1} is highly similar to state s_t. The rewards and transitions observed in one phase of a game dominate training until the agent moves to a new phase.
Experience replay solves this problem by storing experiences in a buffer and sampling them randomly during training. This simple idea—revisiting stored experiences rather than learning only from the immediate present—transforms an unstable learning process into one that converges reliably.
The impact is dramatic.
By the end of this page, you will understand why temporal correlations destabilize neural network training, how experience replay breaks these correlations, the mechanics of implementing replay buffers, the impact on sample efficiency, and advanced variants like prioritized experience replay that further accelerate learning.
To understand why experience replay is essential, we must first understand what goes wrong without it. In online learning, the agent learns immediately from each experience as it arrives:
```
for each timestep t:
    observe state s_t
    take action a_t
    receive reward r_t and next state s_{t+1}
    immediately update network using (s_t, a_t, r_t, s_{t+1})
```
This approach seems natural—why wait to learn from an experience? But it fails catastrophically with neural networks for several interconnected reasons.
Problem 1: Temporal Correlation
Consecutive experiences are highly correlated. In a game of Pong, frame t and frame t+1 differ by only a few pixels—the ball moves slightly, the paddle shifts marginally. Training on such similar samples violates a fundamental assumption of stochastic gradient descent: that samples are independent and identically distributed (i.i.d.).
When samples are correlated, gradient updates become biased toward the current region of state space. The network overfits to recent experiences while forgetting earlier ones, which manifests as oscillating training losses and sudden collapses in performance.
Stochastic gradient descent assumes each minibatch is drawn independently from the data distribution. Correlated samples cause the gradient estimator to have high variance in directions that don't reflect the true loss surface. This isn't just a minor inefficiency—it can cause optimization to fail entirely.
Problem 2: Non-Stationary Distribution
As the agent learns, its policy changes. A better policy visits different states than a poor policy. This means the distribution of collected data shifts continuously:
The network must simultaneously learn Q-values and adapt to a changing input distribution. This creates a feedback loop: the network's predictions affect the policy, which affects the data distribution, which affects the predictions.
Problem 3: Catastrophic Forgetting
Neural networks are susceptible to catastrophic forgetting—training on new data causes performance on old data to degrade dramatically. In RL, this means an agent that has mastered the early stages of a task can abruptly lose that competence while learning later stages.
Without mechanisms to revisit old experiences, the network's capacity is dominated by recent data, and hard-won knowledge is lost.
| Problem | Symptom | Root Cause |
|---|---|---|
| Temporal correlation | Training loss oscillates wildly | Gradient bias from similar consecutive samples |
| Distribution shift | Performance plateaus then degrades | Network adapts to changing data distribution |
| Catastrophic forgetting | Mastered behaviors suddenly fail | New learning overwrites previous knowledge |
| Sample inefficiency | Slow learning despite many interactions | Each experience used only once |
Experience replay introduces a memory buffer that stores past experiences and randomly samples from them during training. The key insight is separating data collection from data consumption.
The Replay Buffer Paradigm
This simple modification achieves multiple goals simultaneously: random sampling breaks temporal correlations, mixing old and new experiences smooths over the shifting data distribution, and each experience can contribute to many gradient updates instead of just one.
Mathematical Perspective
Formally, let D = {(s_i, a_i, r_i, s'_i)} be the replay buffer containing N transitions. Instead of computing gradients on the current transition:
$$\nabla_\theta L(\theta) = \nabla_\theta \left( Q(s_t, a_t; \theta) - y_t \right)^2$$
We compute gradients on a minibatch sampled uniformly from D:
$$\nabla_\theta L(\theta) = \frac{1}{|B|} \sum_{i \in B} \nabla_\theta \left( Q(s_i, a_i; \theta) - y_i \right)^2$$
where B ⊂ {1, ..., N} is a random subset of indices.
This transforms RL into something resembling supervised learning: we have a fixed dataset (the buffer) from which we sample batches. The key difference is that the dataset grows and changes as we collect new experiences.
```python
import random
from collections import deque
from typing import Tuple

import numpy as np
import torch


class ExperienceReplay:
    """
    Standard experience replay buffer with uniform sampling.

    This implementation uses a deque (double-ended queue) for O(1)
    insertion and automatic oldest-element removal when full.
    """

    def __init__(self, capacity: int = 1_000_000):
        """
        Initialize the replay buffer.

        Args:
            capacity: Maximum number of transitions to store.
                Typical values: 100,000 - 10,000,000.
                Larger buffers = more diverse samples, more memory.
        """
        self.capacity = capacity
        self.buffer = deque(maxlen=capacity)

        # Statistics tracking
        self.total_added = 0
        self.episode_boundaries = []

    def push(
        self,
        state: np.ndarray,
        action: int,
        reward: float,
        next_state: np.ndarray,
        done: bool,
    ) -> None:
        """
        Store a single transition in the buffer.

        When the buffer is full, the oldest transition is
        automatically removed (FIFO eviction policy).

        Args:
            state: Current observation
            action: Action taken
            reward: Reward received
            next_state: Resulting observation
            done: Whether episode terminated
        """
        transition = (state, action, reward, next_state, done)
        self.buffer.append(transition)
        self.total_added += 1

        if done:
            self.episode_boundaries.append(len(self.buffer) - 1)

    def sample(self, batch_size: int) -> Tuple[torch.Tensor, ...]:
        """
        Sample a random minibatch of transitions.

        Uniform random sampling ensures approximately i.i.d. batches,
        breaking temporal correlations in the data.

        Args:
            batch_size: Number of transitions to sample

        Returns:
            Tuple of (states, actions, rewards, next_states, dones)
            as tensors ready for training
        """
        # Random sampling - the key to breaking correlations!
        batch = random.sample(self.buffer, batch_size)

        # Unzip the batch
        states, actions, rewards, next_states, dones = zip(*batch)

        # Convert to tensors
        return (
            torch.FloatTensor(np.array(states)),
            torch.LongTensor(actions),
            torch.FloatTensor(rewards),
            torch.FloatTensor(np.array(next_states)),
            torch.BoolTensor(dones),
        )

    def __len__(self) -> int:
        """Return the current number of stored transitions."""
        return len(self.buffer)

    def is_ready(self, batch_size: int) -> bool:
        """Check if enough samples exist for training."""
        return len(self) >= batch_size

    # ----- Diagnostic Methods -----

    def get_statistics(self) -> dict:
        """
        Compute buffer statistics for monitoring.

        These metrics help diagnose training issues:
        - Low reward variance suggests exploration problems
        - High done ratio means many short episodes
        """
        if len(self.buffer) == 0:
            return {}

        rewards = [t[2] for t in self.buffer]
        dones = [t[4] for t in self.buffer]

        return {
            "size": len(self.buffer),
            "capacity": self.capacity,
            "utilization": len(self.buffer) / self.capacity,
            "total_added": self.total_added,
            "reward_mean": np.mean(rewards),
            "reward_std": np.std(rewards),
            "done_ratio": np.mean(dones),
            "episodes_complete": sum(dones),
        }


class ReplayBufferWithMetadata(ExperienceReplay):
    """
    Extended replay buffer that tracks additional metadata.
    Useful for debugging and analysis during development.
    """

    def __init__(self, capacity: int = 1_000_000):
        super().__init__(capacity)
        self.timesteps = deque(maxlen=capacity)
        self.episode_ids = deque(maxlen=capacity)
        self.current_episode = 0

    def push(
        self,
        state: np.ndarray,
        action: int,
        reward: float,
        next_state: np.ndarray,
        done: bool,
        timestep: int = 0,
    ) -> None:
        super().push(state, action, reward, next_state, done)
        self.timesteps.append(timestep)
        self.episode_ids.append(self.current_episode)

        if done:
            self.current_episode += 1

    def sample_recent(self, batch_size: int, recency_window: int = 10000):
        """
        Sample from recent experiences with higher probability.

        This can be useful when the policy has changed significantly
        and old experiences are less relevant.
        """
        window_start = max(0, len(self.buffer) - recency_window)
        indices = list(range(window_start, len(self.buffer)))

        if len(indices) < batch_size:
            indices = list(range(len(self.buffer)))

        sampled_indices = random.sample(indices, batch_size)
        batch = [self.buffer[i] for i in sampled_indices]

        states, actions, rewards, next_states, dones = zip(*batch)
        return (
            torch.FloatTensor(np.array(states)),
            torch.LongTensor(actions),
            torch.FloatTensor(rewards),
            torch.FloatTensor(np.array(next_states)),
            torch.BoolTensor(dones),
        )
```

One of the most significant benefits of experience replay is improved sample efficiency. In reinforcement learning, collecting experience is often expensive—running real robots, simulating complex physics, or playing games for millions of steps all require substantial resources. Using each sample only once is wasteful.
The Replay Ratio
The replay ratio (also called update-to-data ratio) measures how many gradient updates we perform per environment step:
$$\text{Replay Ratio} = \frac{\text{Gradient Updates}}{\text{Environment Steps}}$$
With experience replay, we can increase the replay ratio without collecting more data. A replay ratio of 4 means we extract 4× more learning from the same environmental interactions.
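As a sketch, the interleaving of collection and updates can be written as follows (the `env_step` and `gradient_update` callables here are placeholders for illustration, not part of any real API):

```python
import random

def run_training(total_env_steps, replay_ratio, batch_size,
                 env_step, gradient_update, buffer):
    """Interleave environment steps with replay updates.

    env_step() collects one transition into `buffer`;
    gradient_update(batch) performs one SGD step. Returns the
    number of gradient updates, which approaches
    replay_ratio * total_env_steps once the buffer can fill a batch.
    """
    updates = 0
    for _ in range(total_env_steps):
        buffer.append(env_step())          # one unit of "data"
        if len(buffer) >= batch_size:
            for _ in range(replay_ratio):  # several units of "updates"
                gradient_update(random.sample(buffer, batch_size))
                updates += 1
    return updates
```

With stub callables, 100 environment steps at replay ratio 4 and batch size 32 yield 276 updates: the first 31 steps only fill the buffer, and every step after that triggers 4 updates.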
Empirical Impact
Experiments demonstrate the dramatic effect of experience replay:
| Configuration | Atari Breakout Score (10M frames) | Environment Steps to Competence |
|---|---|---|
| No replay | ~2 (barely above random) | Does not converge |
| Replay (1M buffer) | ~300 (superhuman) | ~5M steps |
| Replay + 4× ratio | ~350 | ~1.5M steps |
| Enhanced replay (PER) | ~400 | ~1M steps |
Experience replay doesn't just help—it's frequently the difference between complete failure and superhuman performance.
When Replay Ratio Matters Most
High replay ratios are especially valuable when environment interaction is expensive relative to computation—real robots, slow physics simulators, or costly data collection—so extracting more learning per sample matters most.
Very high replay ratios risk overfitting to the buffer contents. If you sample and update on the same experiences hundreds of times, the network may memorize specific transitions rather than learn generalizable Q-values. Modern algorithms like REDQ and DroQ specifically address this by using ensembles and dropout to maintain generalization at high replay ratios.
The replay buffer size is a critical hyperparameter that balances several competing concerns. Understanding these trade-offs helps in selecting appropriate sizes for different applications.
Larger Buffers: Pros and Cons
✅ Advantages of large buffers: greater sample diversity, broader coverage of the state space, and more protection against catastrophic forgetting, since old experiences remain available for replay.
❌ Disadvantages of large buffers: substantial memory requirements (especially for visual observations) and a higher proportion of stale data collected by long-outdated policies.
Smaller Buffers: Pros and Cons
✅ Advantages of small buffers: a modest memory footprint and data that stays close to the current policy's distribution.
❌ Disadvantages of small buffers: less diversity, weaker decorrelation, and a higher risk of overfitting to recent experiences and forgetting earlier ones.
| Scenario | Recommended Size | Rationale |
|---|---|---|
| Atari games (DQN) | 1,000,000 | Covers ~10-20 hours of gameplay, diverse strategies |
| Continuous control (SAC) | 100,000 - 1,000,000 | Robotics often has expensive samples |
| Simple environments | 10,000 - 50,000 | Less diversity needed, faster iteration |
| Non-stationary tasks | 10,000 - 100,000 | Old data becomes harmful quickly |
| Multi-task learning | 1,000,000+ | Need experiences from all tasks |
| Demonstration learning | Matches demo size | Keep all demonstrations permanently |
Memory Calculation
For visual observations, buffer memory is substantial:
Memory = Buffer Size × Transition Size
Transition Size (Atari, naive) =
State: 84 × 84 × 4 × 4 bytes (float32) = 112,896 bytes
+ Next State: 112,896 bytes
+ Action: 4 bytes
+ Reward: 4 bytes
+ Done: 1 byte
≈ 226 KB per transition
1M buffer = 226 GB (too large!)
This is why efficient storage is critical:
Transition Size (Atari, efficient) =
State: 84 × 84 × 1 × 1 byte (uint8) = 7,056 bytes
+ Action: 1 byte
+ Reward: 4 bytes
+ Done: 1 bit
≈ 7 KB per transition
1M buffer ≈ 7 GB (manageable)
Note: Next states are computed from stored observations using lazy frame stacking, avoiding duplicate storage.
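The arithmetic above is easy to sanity-check in code (the per-field byte counts for action, reward, and done are the same assumptions used in the text):

```python
def transition_bytes(height, width, channels, bytes_per_pixel,
                     store_next_state=True, action_bytes=4):
    """Approximate memory footprint of one stored transition."""
    state = height * width * channels * bytes_per_pixel
    next_state = state if store_next_state else 0
    reward_bytes, done_bytes = 4, 1  # float32 reward, 1-byte done flag
    return state + next_state + action_bytes + reward_bytes + done_bytes

# Naive: float32 frames, 4-frame stack, duplicated next states
naive = transition_bytes(84, 84, 4, 4)              # 225801 bytes ≈ 226 KB
# Efficient: single uint8 frame, next state rebuilt via lazy stacking
efficient = transition_bytes(84, 84, 1, 1,
                             store_next_state=False,
                             action_bytes=1)        # 7062 bytes ≈ 7 KB

print(naive * 1_000_000 / 1e9)  # ≈ 226 GB for a 1M-transition buffer
```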
In off-policy learning, experiences in the buffer were collected by old policies. As the policy improves dramatically, these old experiences may no longer be representative. This 'distribution mismatch' can cause issues. Solutions include: smaller buffers, importance sampling corrections, or simply accepting some inefficiency in exchange for stability.
Uniform random sampling treats all experiences equally, but not all experiences are equally valuable for learning. Some transitions are surprising, contain rare rewards, or reveal errors in the current Q-function. Prioritized Experience Replay (PER) samples transitions proportionally to their learning potential, accelerating training significantly.
The Core Idea: TD Error as Priority
PER uses the temporal difference (TD) error as a measure of how 'surprising' or 'informative' a transition is:
$$\delta_i = | r_i + \gamma \max_{a'} Q(s'_i, a'; \theta^-) - Q(s_i, a_i; \theta) |$$
A high TD error means our value estimate was wrong—we predicted one thing and observed another. These surprising transitions likely contain information that, once learned, will improve our predictions substantially.
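As a scalar sketch, the error that drives prioritization can be computed as follows (terminal transitions drop the bootstrapped term):

```python
def td_error(reward, gamma, max_q_next, q_pred, done):
    """Absolute TD error |r + gamma * max_a' Q(s', a') - Q(s, a)|.

    `max_q_next` is the greedy target-network value at s';
    terminal transitions have no bootstrapped term.
    """
    target = reward + (0.0 if done else gamma * max_q_next)
    return abs(target - q_pred)
```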
Sampling Probability
Transitions are sampled with probability proportional to their priority:
$$P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}$$
where p_i > 0 is the priority of transition i (derived from the TD error above) and the exponent α ≥ 0 controls how strongly priorities skew sampling—α = 0 recovers uniform replay.
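The effect of α is easy to see numerically: α = 0 gives uniform sampling, while α = 1 samples fully proportionally to priority.

```python
import numpy as np

def sampling_probs(priorities, alpha):
    """P(i) = p_i^alpha / sum_k p_k^alpha."""
    scaled = np.asarray(priorities, dtype=float) ** alpha
    return scaled / scaled.sum()

print(sampling_probs([1.0, 3.0], alpha=0.0))  # [0.5  0.5 ] — uniform
print(sampling_probs([1.0, 3.0], alpha=1.0))  # [0.25 0.75] — proportional
```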
```python
from typing import Tuple

import numpy as np
import torch


class SumTree:
    """
    Binary tree data structure for O(log n) sampling by priority.

    Each leaf stores a priority value. Parent nodes store the sum
    of their children. This enables efficient:
    - Priority-weighted sampling: O(log n)
    - Priority updates: O(log n)
    - Total priority sum: O(1)
    """

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.tree = np.zeros(2 * capacity - 1)        # Binary tree storage
        self.data = np.zeros(capacity, dtype=object)  # Transition storage
        self.write_ptr = 0
        self.size = 0

    def _propagate(self, idx: int, change: float):
        """Update parent nodes after priority change."""
        parent = (idx - 1) // 2
        self.tree[parent] += change
        if parent != 0:
            self._propagate(parent, change)

    def _retrieve(self, idx: int, s: float) -> int:
        """Find leaf index for a given cumulative priority value."""
        left = 2 * idx + 1
        right = left + 1

        if left >= len(self.tree):
            return idx

        if s <= self.tree[left]:
            return self._retrieve(left, s)
        else:
            return self._retrieve(right, s - self.tree[left])

    def total(self) -> float:
        """Return sum of all priorities (root node)."""
        return self.tree[0]

    def add(self, priority: float, data):
        """Add or overwrite a transition with given priority."""
        tree_idx = self.write_ptr + self.capacity - 1
        self.data[self.write_ptr] = data
        self.update(tree_idx, priority)

        self.write_ptr = (self.write_ptr + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def update(self, tree_idx: int, priority: float):
        """Update priority of a transition."""
        change = priority - self.tree[tree_idx]
        self.tree[tree_idx] = priority
        self._propagate(tree_idx, change)

    def get(self, s: float) -> Tuple[int, float, object]:
        """
        Sample a transition by cumulative priority.

        Returns (tree_idx, priority, data).
        """
        tree_idx = self._retrieve(0, s)
        data_idx = tree_idx - self.capacity + 1
        return tree_idx, self.tree[tree_idx], self.data[data_idx]


class PrioritizedReplayBuffer:
    """
    Prioritized Experience Replay (PER) implementation.

    Uses a SumTree for efficient priority-weighted sampling.
    Implements proportional prioritization with importance
    sampling correction.
    """

    def __init__(
        self,
        capacity: int = 1_000_000,
        alpha: float = 0.6,         # Prioritization exponent
        beta_start: float = 0.4,    # Initial importance sampling weight
        beta_frames: int = 100_000, # Frames to anneal beta to 1.0
        epsilon: float = 1e-6,      # Small constant for stability
    ):
        self.tree = SumTree(capacity)
        self.capacity = capacity
        self.alpha = alpha
        self.beta_start = beta_start
        self.beta_frames = beta_frames
        self.epsilon = epsilon
        self.frame = 0

        # Track max (raw, un-exponentiated) priority for new transitions
        self.max_priority = 1.0

    def _get_beta(self) -> float:
        """Anneal beta from beta_start to 1.0 over training."""
        return min(1.0, self.beta_start
                   + self.frame * (1.0 - self.beta_start) / self.beta_frames)

    def push(self, state, action, reward, next_state, done):
        """
        Add transition with maximum priority.

        New transitions get highest priority to ensure they're
        sampled at least once before being deprioritized.
        """
        data = (state, action, reward, next_state, done)

        # The tree stores p^alpha; alpha is applied exactly once
        self.tree.add(self.max_priority ** self.alpha, data)

    def sample(self, batch_size: int) -> Tuple:
        """
        Sample batch with priority-weighted probabilities.

        Returns:
            transitions: (states, actions, rewards, next_states, dones)
            indices: Tree indices for priority updates
            weights: Importance sampling weights
        """
        batch = []
        indices = []
        priorities = []

        # Divide the priority range into batch_size segments and sample
        # uniformly within each segment (stratified sampling)
        segment = self.tree.total() / batch_size

        for i in range(batch_size):
            s = np.random.uniform(segment * i, segment * (i + 1))
            tree_idx, priority, data = self.tree.get(s)
            batch.append(data)
            indices.append(tree_idx)
            priorities.append(priority)

        # Compute importance sampling weights
        beta = self._get_beta()
        sampling_probs = np.array(priorities) / self.tree.total()
        weights = (self.tree.size * sampling_probs) ** (-beta)
        weights = weights / weights.max()  # Normalize for stability

        # Unpack batch
        states, actions, rewards, next_states, dones = zip(*batch)

        self.frame += 1

        return (
            torch.FloatTensor(np.array(states)),
            torch.LongTensor(actions),
            torch.FloatTensor(rewards),
            torch.FloatTensor(np.array(next_states)),
            torch.BoolTensor(dones),
            indices,
            torch.FloatTensor(weights),
        )

    def update_priorities(self, indices: list, td_errors: np.ndarray):
        """
        Update priorities based on new TD errors.
        Called after computing TD errors in the training loop.
        """
        for idx, td_error in zip(indices, td_errors):
            # Raw priority = |TD error| + epsilon; the tree stores p^alpha
            priority = np.abs(td_error) + self.epsilon
            self.tree.update(idx, priority ** self.alpha)
            self.max_priority = max(self.max_priority, priority)

    def __len__(self):
        return self.tree.size
```

Importance Sampling Correction
Prioritized sampling introduces bias: high-priority transitions are seen more often than they would be under the true data distribution. This can skew gradient updates.
We correct for this using importance sampling weights:
$$w_i = \left( \frac{1}{N \cdot P(i)} \right)^\beta$$
where β controls the correction amount (β = 1 fully corrects, β = 0 ignores bias). The loss becomes:
$$L = \frac{1}{|B|} \sum_{i \in B} w_i \cdot \delta_i^2$$
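A small sketch of the weight computation, with max-normalization as is commonly done for stability (so weights only scale updates down, never up):

```python
import numpy as np

def importance_weights(sampling_probs, buffer_size, beta):
    """w_i = (1 / (N * P(i)))^beta, normalized by the max weight."""
    w = (buffer_size * np.asarray(sampling_probs, dtype=float)) ** (-beta)
    return w / w.max()
```

Under uniform sampling every weight is 1; transitions that are over-sampled relative to uniform receive proportionally smaller weights.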
Why anneal β?
Initially (β low), we accept bias in exchange for faster learning from important transitions. As training progresses (β → 1), we need unbiased updates for convergence guarantees. Annealing β balances these needs.
α = 0.6 and β starting at 0.4 annealing to 1.0 are common defaults. Higher α means more aggressive prioritization—effective when some transitions are much more informative than others. If training becomes unstable, reduce α or increase β to reduce sampling bias.
Beyond basic uniform replay and PER, researchers have developed numerous sophisticated replay strategies, each addressing specific limitations or optimizing for particular scenarios.
Hindsight Experience Replay (HER)
For goal-conditioned RL with sparse rewards, HER is transformative. The key insight: even failed attempts teach us something. If we tried to reach goal A but ended up at state B, we can retroactively pretend B was our goal—rewriting history to create a successful experience:
This dramatically increases the density of reward signals in sparse-reward environments like robotic manipulation.
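A minimal sketch of the "final" relabeling strategy (transitions as plain dicts; `reward_fn` is a hypothetical goal-conditioned reward function, not part of any specific library):

```python
def her_relabel(episode, reward_fn):
    """Relabel an episode as if the state actually reached at the
    end had been the goal all along ('final' HER strategy).

    Each transition is a dict with keys:
    state, action, reward, next_state, goal.
    """
    achieved_goal = episode[-1]["next_state"]
    relabeled = []
    for t in episode:
        new_t = dict(t, goal=achieved_goal)             # copy, don't mutate
        new_t["reward"] = reward_fn(t["next_state"], achieved_goal)
        relabeled.append(new_t)
    return relabeled
```

The relabeled copies are stored in the replay buffer alongside the originals, so every episode yields at least one trajectory that reaches "its" goal.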
N-Step Returns
Instead of using 1-step TD targets, n-step replay uses multi-step returns:
$$y^{(n)} = \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n \max_{a'} Q(s_{t+n}, a')$$
Advantages: rewards propagate over n steps at once, speeding up credit assignment, and targets lean less on bootstrapped estimates early in training; the trade-offs are higher target variance and a mild off-policy bias for n > 1.
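The target computation mirrors the formula above directly (a sketch with scalar inputs; `bootstrap_value` stands in for max_a' Q(s_{t+n}, a')):

```python
def n_step_target(rewards, gamma, bootstrap_value):
    """y^(n) = sum_{k=0}^{n-1} gamma^k r_{t+k} + gamma^n * bootstrap_value.

    `rewards` holds the n rewards r_t ... r_{t+n-1};
    `bootstrap_value` is the greedy Q-value at s_{t+n}.
    """
    n = len(rewards)
    discounted = sum((gamma ** k) * r for k, r in enumerate(rewards))
    return discounted + (gamma ** n) * bootstrap_value
```

With n = 1 this reduces to the standard one-step TD target.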
| Variant | Key Idea | Best For |
|---|---|---|
| Prioritized ER (PER) | Sample by TD error magnitude | General performance improvement |
| Hindsight ER (HER) | Relabel failed goals retroactively | Sparse reward, goal-conditioned |
| N-step Returns | Multi-step bootstrapping | Faster credit assignment |
| Combined ER (CER) | Always include most recent transition | Ensure freshness |
| Episodic Memory | Store and replay entire episodes | Credit assignment, imitation |
| Demonstration Replay | Mix expert demos with self-play | Learning from demonstrations |
| Reverse Sweep ER | Replay episodes backward | Better credit assignment |
| Attentive ER | Use attention for transition selection | Complex state spaces |
Combined Experience Replay (CER)
A simple but effective modification: always include the most recent transition in each batch. This ensures that fresh experiences immediately influence learning, while random sampling provides diversity.
```python
def sample_cer(buffer, batch_size):
    # Sample batch_size - 1 random transitions
    random_batch = random.sample(buffer, batch_size - 1)
    # Add the most recent transition
    return random_batch + [buffer[-1]]
```
CER can improve learning in rapidly changing environments where the most recent experience is most relevant.
Episodic Replay
Some methods store and replay entire episodes rather than individual transitions. This is particularly useful for recurrent policies that need temporal context, for credit assignment across long horizons, and for imitation-style learning from complete behaviors.
Start with uniform replay. It's simple, well-understood, and works well in many scenarios. Add PER if learning seems slow despite sufficient samples. Use HER for goal-conditioned tasks with sparse rewards. Consider n-step returns for faster credit assignment. Most advanced variants provide diminishing returns and increase implementation complexity.
Correct implementation of experience replay requires attention to several subtle details. Bugs in replay mechanics can silently degrade learning without obvious errors.
Critical Implementation Details
```python
import numpy as np
import torch


class RobustReplayBuffer:
    """
    Production-quality replay buffer with common pitfalls addressed.
    """

    def __init__(self, capacity: int, observation_shape: tuple):
        self.capacity = capacity
        self.obs_shape = observation_shape

        # Pre-allocate memory for efficiency and to avoid fragmentation
        self.observations = np.zeros((capacity, *observation_shape), dtype=np.uint8)
        self.next_observations = np.zeros((capacity, *observation_shape), dtype=np.uint8)
        self.actions = np.zeros(capacity, dtype=np.int32)
        self.rewards = np.zeros(capacity, dtype=np.float32)
        self.dones = np.zeros(capacity, dtype=np.bool_)

        self.ptr = 0
        self.size = 0

    def __len__(self) -> int:
        return self.size

    def push(
        self,
        obs: np.ndarray,
        action: int,
        reward: float,
        next_obs: np.ndarray,
        done: bool,
    ):
        """
        Store a transition with explicit copying.

        Note: We store observation contents, not references. This
        prevents bugs where the environment reuses observation arrays.
        """
        # Explicit copy to prevent aliasing issues
        np.copyto(self.observations[self.ptr], obs)
        np.copyto(self.next_observations[self.ptr], next_obs)
        self.actions[self.ptr] = action
        self.rewards[self.ptr] = reward
        self.dones[self.ptr] = done

        self.ptr = (self.ptr + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def _gather(self, indices: np.ndarray, device: torch.device) -> dict:
        """Assemble a batch dict from the given row indices."""
        return {
            "observations": torch.FloatTensor(
                self.observations[indices]
            ).to(device) / 255.0,
            "actions": torch.LongTensor(self.actions[indices]).to(device),
            "rewards": torch.FloatTensor(self.rewards[indices]).to(device),
            "next_observations": torch.FloatTensor(
                self.next_observations[indices]
            ).to(device) / 255.0,
            "dones": torch.BoolTensor(self.dones[indices]).to(device),
        }

    def sample(
        self,
        batch_size: int,
        device: torch.device = torch.device("cpu"),
    ) -> dict:
        """
        Sample a random batch with proper randomization.

        Returns a dict for clarity (avoids tuple index confusion).
        """
        # Truly random sampling over the filled portion of the buffer
        indices = np.random.randint(0, self.size, size=batch_size)
        return self._gather(indices, device)

    def sample_with_indices(
        self,
        batch_size: int,
        device: torch.device = torch.device("cpu"),
    ) -> dict:
        """
        Sample returning indices (for priority updates in PER).

        The batch is gathered from the SAME indices that are returned;
        sampling twice would silently break priority updates.
        """
        indices = np.random.randint(0, self.size, size=batch_size)
        batch = self._gather(indices, device)
        batch["indices"] = indices
        return batch


# ----- Testing utilities -----

def verify_replay_randomness(buffer, n_samples=10_000, batch_size=32):
    """
    Statistical check that sampling is approximately uniform.

    Sequential or biased sampling produces a chi-square statistic
    far above its expected value (roughly the number of buffer slots).
    """
    if len(buffer) < 1000:
        raise ValueError("Need at least 1000 samples for a meaningful test")

    index_counts = np.zeros(len(buffer))
    for _ in range(n_samples):
        batch = buffer.sample_with_indices(batch_size)
        for i in batch["indices"]:
            index_counts[i] += 1

    # Chi-square test for uniformity
    expected = n_samples * batch_size / len(buffer)
    chi_square = np.sum((index_counts - expected) ** 2 / expected)
    print(f"Sampling uniformity test: χ² = {chi_square:.2f}")


def verify_episode_boundaries(buffer):
    """
    Check that frame stacking respects episode boundaries.

    For buffers with frame stacking, verify that stacked frames
    after a 'done' flag come from the new episode, not the old one.
    """
    pass  # Implementation depends on buffer structure
```

Performance Optimization Tips
Pre-allocate memory: Allocating arrays once at initialization prevents memory fragmentation and repeated allocation overhead.
Use numpy for storage: Python lists have significant overhead. Numpy arrays are contiguous and cache-friendly.
Minimize data movement: Keep data on CPU until needed for GPU training. Converting to GPU tensors for every operation is wasteful.
Batch preprocessing: Apply normalization (divide by 255) during batching, not during storage.
Consider compression: For very large buffers, LZ4 compression can reduce memory by 2-3× with minimal CPU overhead.
Profile memory usage: Memory leaks in replay buffers are common. Monitor buffer memory consumption over long training runs.
Experience replay is more than an optimization—it's a fundamental technique that transforms reinforcement learning from unstable to reliable. The key insights: random sampling from a buffer breaks temporal correlations; reusing each transition many times multiplies sample efficiency; buffer size trades diversity against staleness; and prioritized variants focus updates on the most informative transitions.
What's Next: Target Networks
Experience replay addresses the correlation and sample efficiency problems, but one major instability remains: the moving target problem. In DQN, we train the network to predict Q-values, but our targets are computed using the same network. As the network updates, the targets change, creating a feedback loop that can cause divergence.
Target networks solve this by maintaining a separate, slowly-updated copy of the network for computing targets. This provides stable learning signals despite the fundamentally non-stationary nature of RL.
Together, experience replay and target networks form the twin pillars of stable deep RL. Understanding both is essential for implementing any deep RL algorithm successfully.
You now understand experience replay in depth—why it's necessary, how it works, and how to implement it correctly. This technique appears in virtually every deep RL algorithm, making it foundational knowledge for the field. Next, we'll explore target networks, the second innovation that made DQN work.