The policy gradient theorem gives us a mathematical recipe for optimizing policies. But how do we turn this recipe into a working algorithm? The answer is REINFORCE, introduced by Ronald Williams in 1992—one of the most elegant and historically important algorithms in reinforcement learning.
REINFORCE is deceptively simple: collect trajectories, compute returns, multiply by log-probabilities, update parameters. Yet within this simplicity lies the essence of all policy gradient methods. Understanding REINFORCE deeply prepares you for every advanced algorithm that follows, from Actor-Critic to PPO to TRPO.
In this page, we'll implement REINFORCE from scratch, run experiments to understand its behavior, analyze why it struggles in practice, and identify the key improvements needed for practical applications.
By the end of this page, you will be able to implement REINFORCE completely, understand each component's role, run it on standard environments, interpret training dynamics, and articulate its fundamental limitations. This foundation is essential for understanding all subsequent policy gradient algorithms.
REINFORCE is a Monte Carlo policy gradient algorithm. It uses complete episode trajectories to estimate policy gradients and update the policy parameters. The algorithm directly instantiates the policy gradient theorem.
Algorithm Overview:
```
REINFORCE Algorithm
═══════════════════════════════════════════════════════════════

Initialize:
    Policy network π_θ with random parameters θ
    Learning rate α
    Discount factor γ

Repeat for each episode:

    1. COLLECT TRAJECTORY
       τ = {s₀, a₀, r₀, s₁, a₁, r₁, ..., s_T, a_T, r_T}
       where a_t ~ π_θ(·|s_t)

    2. COMPUTE RETURNS (reward-to-go)
       For t = T, T-1, ..., 0:
           G_t = r_t + γ·G_{t+1}    (with G_{T+1} = 0)

    3. COMPUTE POLICY GRADIENT
       ∇_θ J(θ) ≈ Σ_{t=0}^{T} ∇_θ log π_θ(a_t|s_t) · G_t

    4. UPDATE PARAMETERS
       θ ← θ + α · ∇_θ J(θ)

Until convergence or maximum episodes reached
═══════════════════════════════════════════════════════════════

Key Property: No value function needed—pure policy gradient!
```
Why 'REINFORCE'?
The name comes from the psychological concept of reinforcement learning (the algorithm reinforces behaviors that lead to positive outcomes). Williams chose this name to emphasize that the algorithm strengthens actions associated with high returns and weakens actions associated with low returns.
Mathematically, for each action a_t taken in state s_t, the update is:

θ ← θ + α · G_t · ∇_θ log π_θ(a_t|s_t)

A positive return G_t increases the log-probability of a_t (reinforcing it); a negative return decreases it (weakening it).
This is trial-and-error learning formalized as gradient ascent.
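To see this in action, here is a minimal sketch (a toy setup of my own, not from the original: a 1-D state, two actions, and a single linear layer as the "policy") showing that one REINFORCE-style update raises the probability of an action paired with a positive return and lowers it when the return is negative.

```python
import torch
import torch.optim as optim
from torch.distributions import Categorical

torch.manual_seed(0)
policy = torch.nn.Linear(1, 2)                 # tiny "policy": 1-D state -> 2 action logits
optimizer = optim.SGD(policy.parameters(), lr=0.5)

state = torch.tensor([[1.0]])
action = torch.tensor([0])                      # the action we pretend was taken

for G in (1.0, -1.0):                           # a positive and a negative return
    dist = Categorical(logits=policy(state))
    p_before = dist.probs[0, 0].item()

    loss = -(G * dist.log_prob(action)).sum()   # single-step REINFORCE loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    p_after = Categorical(logits=policy(state)).probs[0, 0].item()
    print(f"G={G:+.0f}: P(a=0|s) {p_before:.3f} -> {p_after:.3f}")
# Expected: the probability rises after the G=+1 update and falls after G=-1.
```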
REINFORCE is a Monte Carlo method because it requires complete episodes to compute returns G_t. You cannot update until an episode terminates. This contrasts with temporal difference methods (like Actor-Critic) that can learn from incomplete episodes. The Monte Carlo property makes REINFORCE simple but limits its applicability to episodic tasks.
Let's implement REINFORCE from scratch, building each component carefully and explaining design decisions along the way.
Step 1: The Policy Network
```python
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.distributions import Categorical
import gymnasium as gym
import numpy as np
from typing import List, Tuple
from collections import deque


class PolicyNetwork(nn.Module):
    """
    Neural network that outputs action probabilities.

    Architecture:
    - Input: State observation (flattened if needed)
    - Hidden layers: ReLU activations for non-linearity
    - Output: Softmax over actions (categorical distribution)
    """

    def __init__(
        self,
        state_dim: int,
        action_dim: int,
        hidden_dims: List[int] = [128, 128]
    ):
        super().__init__()

        # Build hidden layers dynamically
        layers = []
        prev_dim = state_dim
        for hidden_dim in hidden_dims:
            layers.append(nn.Linear(prev_dim, hidden_dim))
            layers.append(nn.ReLU())
            prev_dim = hidden_dim

        self.hidden_layers = nn.Sequential(*layers)
        self.output_layer = nn.Linear(prev_dim, action_dim)

        # Initialize weights for better training stability
        self._initialize_weights()

    def _initialize_weights(self):
        """Initialize with small weights for stable starting policy."""
        for layer in self.hidden_layers:
            if isinstance(layer, nn.Linear):
                nn.init.orthogonal_(layer.weight, gain=np.sqrt(2))
                nn.init.constant_(layer.bias, 0)
        # Smaller initialization for output layer
        nn.init.orthogonal_(self.output_layer.weight, gain=0.01)
        nn.init.constant_(self.output_layer.bias, 0)

    def forward(self, state: torch.Tensor) -> Categorical:
        """
        Forward pass returns a categorical distribution.

        Using Categorical allows us to:
        1. Sample actions: action = dist.sample()
        2. Get log probs: log_prob = dist.log_prob(action)
        3. Get entropy: entropy = dist.entropy()
        """
        features = self.hidden_layers(state)
        logits = self.output_layer(features)
        return Categorical(logits=logits)

    def get_action(self, state: np.ndarray) -> Tuple[int, torch.Tensor]:
        """
        Sample an action and return its log probability.

        Returns:
            action: The sampled action (integer for discrete)
            log_prob: Log probability of the action (for gradient)
        """
        state_tensor = torch.FloatTensor(state).unsqueeze(0)
        dist = self.forward(state_tensor)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        return action.item(), log_prob
```
Step 2: The REINFORCE Agent
```python
class REINFORCEAgent:
    """
    Complete REINFORCE implementation with episode collection
    and policy gradient updates.
    """

    def __init__(
        self,
        state_dim: int,
        action_dim: int,
        learning_rate: float = 1e-3,
        gamma: float = 0.99,
        hidden_dims: List[int] = [128, 128]
    ):
        self.gamma = gamma

        # Initialize policy network
        self.policy = PolicyNetwork(state_dim, action_dim, hidden_dims)
        self.optimizer = optim.Adam(self.policy.parameters(), lr=learning_rate)

        # Storage for episode data
        self.log_probs: List[torch.Tensor] = []
        self.rewards: List[float] = []

    def select_action(self, state: np.ndarray) -> int:
        """Select action and store log probability for learning."""
        action, log_prob = self.policy.get_action(state)
        self.log_probs.append(log_prob)
        return action

    def store_reward(self, reward: float):
        """Store reward received after taking action."""
        self.rewards.append(reward)

    def compute_returns(self) -> torch.Tensor:
        """
        Compute discounted returns (reward-to-go) for the episode.

        G_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + ...

        Computed efficiently backwards:
        G_T = r_T
        G_t = r_t + γ·G_{t+1}
        """
        returns = []
        G = 0

        # Iterate backwards through rewards
        for reward in reversed(self.rewards):
            G = reward + self.gamma * G
            returns.insert(0, G)

        returns = torch.tensor(returns, dtype=torch.float32)

        # Normalize returns for training stability
        # This is a simple but effective variance reduction technique
        if len(returns) > 1:
            returns = (returns - returns.mean()) / (returns.std() + 1e-8)

        return returns

    def update(self) -> float:
        """
        Perform REINFORCE update using collected episode data.
        Returns the loss value for logging.
        """
        # Compute discounted returns
        returns = self.compute_returns()

        # Stack log probabilities
        log_probs = torch.cat(self.log_probs)

        # Policy gradient loss: -E[log π(a|s) · G]
        # Negative because optimizer minimizes, but we want to maximize
        policy_loss = -(log_probs * returns).sum()

        # Perform gradient update
        self.optimizer.zero_grad()
        policy_loss.backward()

        # Optional: Gradient clipping for stability
        torch.nn.utils.clip_grad_norm_(self.policy.parameters(), max_norm=1.0)

        self.optimizer.step()

        # Clear episode data
        loss_value = policy_loss.item()
        self.log_probs = []
        self.rewards = []

        return loss_value
```
Step 3: The Training Loop
```python
def train_reinforce(
    env_name: str = "CartPole-v1",
    num_episodes: int = 1000,
    learning_rate: float = 1e-3,
    gamma: float = 0.99,
    render: bool = False,
    log_interval: int = 100
) -> Tuple[List[float], REINFORCEAgent]:
    """
    Train a REINFORCE agent on the specified environment.

    Args:
        env_name: Gymnasium environment name
        num_episodes: Total training episodes
        learning_rate: Learning rate for policy optimization
        gamma: Discount factor
        render: Whether to render environment
        log_interval: Episodes between logging

    Returns:
        episode_rewards: List of total rewards per episode
        agent: Trained REINFORCE agent
    """
    # Create environment
    env = gym.make(env_name, render_mode="human" if render else None)

    # Get dimensions
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n

    # Create agent
    agent = REINFORCEAgent(
        state_dim=state_dim,
        action_dim=action_dim,
        learning_rate=learning_rate,
        gamma=gamma
    )

    # Training tracking
    episode_rewards = []
    recent_rewards = deque(maxlen=100)

    for episode in range(num_episodes):
        state, _ = env.reset()
        episode_reward = 0
        done = False

        # Collect trajectory
        while not done:
            action = agent.select_action(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            agent.store_reward(reward)
            episode_reward += reward
            state = next_state

        # Update policy after episode completes
        loss = agent.update()

        # Track rewards
        episode_rewards.append(episode_reward)
        recent_rewards.append(episode_reward)

        # Logging
        if (episode + 1) % log_interval == 0:
            avg_reward = np.mean(recent_rewards)
            print(f"Episode {episode + 1:4d} | "
                  f"Avg Reward: {avg_reward:7.2f} | "
                  f"Loss: {loss:8.2f}")

    env.close()
    return episode_rewards, agent


# Run training
if __name__ == "__main__":
    rewards, agent = train_reinforce(
        env_name="CartPole-v1",
        num_episodes=1000,
        learning_rate=1e-3
    )

    # Plot learning curve
    import matplotlib.pyplot as plt

    plt.figure(figsize=(10, 5))
    plt.plot(rewards, alpha=0.3, label="Episode Reward")

    # Smooth with moving average
    window = 50
    smoothed = np.convolve(rewards, np.ones(window)/window, mode='valid')
    plt.plot(range(window-1, len(rewards)), smoothed, label="Smoothed (50 ep)")

    plt.xlabel("Episode")
    plt.ylabel("Total Reward")
    plt.title("REINFORCE on CartPole-v1")
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()
```
Key implementation details: (1) Returns normalization helps stabilize training by keeping gradient magnitudes reasonable. (2) Gradient clipping prevents catastrophic updates. (3) Small output layer initialization ensures the initial policy is close to uniform. (4) Using PyTorch's Categorical distribution handles log probability computation correctly.
To truly understand REINFORCE, we must trace exactly how gradients flow through the computation. This understanding is crucial for debugging and extending the algorithm.
The Computation Graph:
```
REINFORCE Computation Graph
═══════════════════════════════════════════════════════════════

Forward Pass (during episode):

    state s_t
        ↓
    Policy Network π_θ
        ↓
    Action logits z = [z_1, z_2, ..., z_k]
        ↓
    Softmax: π(a|s) = exp(z_a) / Σ_i exp(z_i)
        ↓
    Sample action a_t ~ Categorical(π)
        ↓
    Log probability: log π_θ(a_t|s_t) = z_{a_t} - log Σ_i exp(z_i)

After episode ends:

    Rewards [r_0, r_1, ..., r_T]
        ↓
    Returns [G_0, G_1, ..., G_T]   (computed backwards)
        ↓
    Loss = -Σ_t log π_θ(a_t|s_t) · G_t

Backward Pass:

    ∂Loss/∂logits = -G_t · (e_{a_t} - π_θ(·|s_t))

    where e_{a_t} is the one-hot vector for the action taken
    and π_θ(·|s_t) is the probability vector
═══════════════════════════════════════════════════════════════
```
Gradient Interpretation:
The gradient of the loss with respect to the logits reveals exactly how REINFORCE learns:
∂Loss/∂z_a = -G_t · (1_{a=a_t} - π_θ(a|s_t))
For the action taken (a = a_t): the gradient is -G_t · (1 - π_θ(a_t|s_t)). A positive return pushes the logit up, increasing the probability of repeating a_t; a negative return pushes it down.
For other actions (a ≠ a_t): the gradient is G_t · π_θ(a|s_t). A positive return shrinks their probabilities; a negative return increases them.
This is probability mass redistribution: mass flows toward successful actions and away from unsuccessful ones.
Notice that ∂log π_θ(a|s)/∂z = e_a - π_θ(·|s). This is exactly the gradient of cross-entropy loss! REINFORCE can be interpreted as weighted cross-entropy, where we're trying to match a 'target distribution' that puts all probability on the action taken, weighted by the return.
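To make this connection concrete, here is a small numerical check (my own illustrative example; the logits, action index, and return value are arbitrary) confirming that autograd's gradient of the REINFORCE loss matches the analytic expression -G_t · (e_{a_t} - π_θ(·|s_t)) from the computation graph above.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([0.5, -1.0, 2.0, 0.1], requires_grad=True)
action = torch.tensor(2)     # index of the action actually taken
G = 3.0                      # observed return for this step

# REINFORCE loss for a single (state, action) pair
loss = -G * torch.log_softmax(logits, dim=0)[action]
loss.backward()

# Analytic gradient from the text: ∂Loss/∂z = -G · (one_hot(a_t) - π_θ(·|s_t))
probs = torch.softmax(logits.detach(), dim=0)
one_hot = F.one_hot(action, num_classes=4).float()
analytic = -G * (one_hot - probs)

print(torch.allclose(logits.grad, analytic))   # True

# The same gradient would come from G-weighted cross-entropy with the taken
# action as the "target class":
#   loss = G * F.cross_entropy(logits.unsqueeze(0), action.unsqueeze(0))
```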
REINFORCE has a critical weakness: extremely high variance of gradient estimates. This isn't a minor inconvenience—it's the central challenge that spawned decades of research into variance reduction. Let's understand exactly where this variance comes from.
Sources of Variance: every source of randomness along a trajectory contributes—stochastic action sampling from π_θ, stochastic environment transitions and rewards, and the return G_t, which accumulates this noise over the entire episode. Longer episodes and larger return magnitudes both inflate the variance of the resulting gradient estimate.
Quantifying the Variance:
```
# Variance of policy gradient estimator
Var[∇̂_θ J(θ)] = E[(∇̂_θ J(θ) - ∇_θ J(θ))²]

# For a single trajectory:
∇̂_θ J(θ) = Σ_t ∇_θ log π_θ(a_t|s_t) · G_t

# Variance decomposes as:
Var[∇̂_θ J(θ)] = Var[Σ_t ∇_θ log π_θ(a_t|s_t) · G_t]

# Key insight: Variance scales with:
- |G_t|²: Square of return magnitude
- T: Episode length (more terms to sum)
- Policy entropy: Low entropy → high log prob variance
- Environment stochasticity

# Example: CartPole with 500-step episode
# Returns range from ~10 to ~500
# Variance can be O(10⁵) to O(10⁶) in gradient magnitude!
```
Visualizing Variance:
Let's examine how gradient estimates vary across episodes:
```python
def analyze_gradient_variance(agent, env, num_episodes=100):
    """
    Collect gradient estimates across multiple episodes
    to visualize variance.
    """
    gradient_samples = []
    return_samples = []

    for _ in range(num_episodes):
        state, _ = env.reset()
        done = False

        # Collect episode
        log_probs = []
        rewards = []

        while not done:
            state_tensor = torch.FloatTensor(state).unsqueeze(0)
            dist = agent.policy(state_tensor)
            action = dist.sample()
            log_prob = dist.log_prob(action)

            next_state, reward, terminated, truncated, _ = env.step(action.item())
            done = terminated or truncated

            log_probs.append(log_prob)
            rewards.append(reward)
            state = next_state

        # Compute returns
        returns = []
        G = 0
        for r in reversed(rewards):
            G = r + 0.99 * G
            returns.insert(0, G)
        returns = torch.tensor(returns)

        return_samples.append(returns[0].item())  # First return

        # Compute gradient (without updating)
        agent.optimizer.zero_grad()
        log_probs = torch.cat(log_probs)
        loss = -(log_probs * returns).sum()
        loss.backward()

        # Extract gradient norm
        total_grad_norm = 0
        for param in agent.policy.parameters():
            if param.grad is not None:
                total_grad_norm += param.grad.norm().item() ** 2
        gradient_samples.append(np.sqrt(total_grad_norm))

    # Analysis
    print(f"Return mean: {np.mean(return_samples):.2f}, std: {np.std(return_samples):.2f}")
    print(f"Gradient norm mean: {np.mean(gradient_samples):.2f}, std: {np.std(gradient_samples):.2f}")
    print(f"Coefficient of variation: {np.std(gradient_samples)/np.mean(gradient_samples):.2f}")

    return gradient_samples, return_samples
```
High variance means: (1) Many episodes needed for reliable gradient estimates. (2) Learning progress is noisy and unpredictable. (3) Hyperparameters (especially learning rate) are extremely sensitive. (4) Catastrophic forgetting can occur from bad gradient updates. This is why vanilla REINFORCE often requires millions of environment steps to learn even simple tasks.
While we'll cover advanced variance reduction in the next page, several simple techniques can significantly improve REINFORCE's practical performance.
Improvement 1: Batch Updates
Instead of updating after each episode, collect a batch of episodes and average gradients:
```python
class BatchREINFORCE:
    """REINFORCE with batch updates for reduced variance."""

    def __init__(self, state_dim, action_dim, lr=1e-3, gamma=0.99, batch_size=10):
        self.policy = PolicyNetwork(state_dim, action_dim)
        self.optimizer = optim.Adam(self.policy.parameters(), lr=lr)
        self.gamma = gamma
        self.batch_size = batch_size

        # Batch storage
        self.batch_log_probs = []
        self.batch_returns = []

    def collect_episode(self, env):
        """Collect one episode and store data."""
        state, _ = env.reset()
        done = False

        log_probs = []
        rewards = []

        while not done:
            action, log_prob = self.policy.get_action(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            log_probs.append(log_prob)
            rewards.append(reward)
            state = next_state

        # Compute returns
        returns = []
        G = 0
        for r in reversed(rewards):
            G = r + self.gamma * G
            returns.insert(0, G)

        self.batch_log_probs.append(torch.cat(log_probs))
        self.batch_returns.append(torch.tensor(returns))

        return sum(rewards)

    def update(self):
        """Update policy using batched gradients."""
        if len(self.batch_log_probs) < self.batch_size:
            return None

        # Concatenate all episode data
        all_log_probs = torch.cat(self.batch_log_probs)
        all_returns = torch.cat(self.batch_returns)

        # Normalize returns across the entire batch
        all_returns = (all_returns - all_returns.mean()) / (all_returns.std() + 1e-8)

        # Compute loss
        loss = -(all_log_probs * all_returns).mean()

        # Update
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        # Clear batch
        self.batch_log_probs = []
        self.batch_returns = []

        return loss.item()
```
Improvement 2: Entropy Regularization
Encourage exploration by adding an entropy bonus to the objective:
```python
def compute_loss_with_entropy(
    log_probs: torch.Tensor,
    returns: torch.Tensor,
    policy: PolicyNetwork,
    states: torch.Tensor,
    entropy_coef: float = 0.01
) -> Tuple[torch.Tensor, float]:
    """
    Policy gradient loss with entropy regularization.

    L = -E[log π(a|s) · G] - β · H(π(·|s))

    The entropy term:
    - Encourages exploration (uniform policy has max entropy)
    - Prevents premature convergence to deterministic policies
    - Helps escape local optima
    """
    # Policy gradient term
    policy_loss = -(log_probs * returns).mean()

    # Compute entropy for each state
    distributions = policy(states)
    entropy = distributions.entropy().mean()

    # Negative because we want to maximize entropy
    # (but the optimizer minimizes the loss)
    total_loss = policy_loss - entropy_coef * entropy

    return total_loss, entropy.item()


# Typical entropy coefficient: 0.01 to 0.1
# Higher values → more exploration, slower convergence
# Lower values → faster convergence, risk of local optima
```
Improvement 3: Learning Rate Scheduling
Adapt learning rate over training for better convergence:
```python
# Linear decay: common for policy gradients
def linear_schedule(initial_lr: float, final_lr: float, total_steps: int):
    """Linear learning rate decay."""
    def get_lr(step: int) -> float:
        fraction = min(step / total_steps, 1.0)
        return initial_lr + fraction * (final_lr - initial_lr)
    return get_lr


# Usage with PyTorch optimizer
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda episode: max(0.1, 1 - episode / total_episodes)
)

# Alternative: Reduce on plateau when performance stagnates
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer,
    mode='max',       # We're maximizing reward
    factor=0.5,
    patience=50,
    verbose=True
)
```
For best results, combine: (1) Batch updates (batch_size=10-20). (2) Return normalization within batches. (3) Entropy regularization (coef=0.01). (4) Gradient clipping (max_norm=0.5-1.0). (5) Learning rate scheduling. These together can make REINFORCE surprisingly competitive on simple tasks.
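As a sketch of how these pieces fit together in a single update, here is one possible batched update step. It assumes the imports and `PolicyNetwork` defined earlier; the function name `combined_update` and the `batch_states` argument (per-step state tensors of shape `(1, state_dim)`, which the `BatchREINFORCE` snippet above does not store) are my own additions for illustration, and the coefficients are just example values.

```python
def combined_update(policy, optimizer, batch_log_probs, batch_returns, batch_states,
                    entropy_coef=0.01, max_grad_norm=0.5):
    """One batched REINFORCE update combining improvements (1)-(4); a sketch."""
    log_probs = torch.cat(batch_log_probs)   # per-episode log-prob tensors
    returns = torch.cat(batch_returns)       # per-episode return tensors
    states = torch.cat(batch_states)         # assumed per-step states, shape (N, state_dim)

    # (2) Normalize returns across the whole batch
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)

    # (3) Entropy bonus computed from the current policy
    entropy = policy(states).entropy().mean()

    # (1) Batched policy gradient loss with entropy regularization
    loss = -(log_probs * returns).mean() - entropy_coef * entropy

    optimizer.zero_grad()
    loss.backward()

    # (4) Clip gradients before the optimizer step
    torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm=max_grad_norm)
    optimizer.step()

    # (5) A learning rate scheduler, if used, would call scheduler.step() here
    return loss.item()
```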
Let's analyze REINFORCE's behavior experimentally to build intuition about its learning dynamics.
Experiment 1: Effect of Learning Rate
| Learning Rate | Episodes to Solve | Final Performance | Stability |
|---|---|---|---|
| 1e-2 | Did not solve | ~50 (oscillating) | Very unstable |
| 5e-3 | ~3000 episodes | ~400 (variable) | Some instability |
| 1e-3 | ~800 episodes | ~480 (stable) | Good |
| 5e-4 | ~1500 episodes | ~490 (very stable) | Excellent |
| 1e-4 | ~4000 episodes | ~450 (slow improvement) | Excellent |
Observation: REINFORCE is extremely sensitive to learning rate. Too high causes catastrophic forgetting; too low makes learning impractically slow. The optimal range is narrow.
Experiment 2: Effect of Episode Length
| Max Episode Length | Gradient Variance | Learning Speed | Notes |
|---|---|---|---|
| 100 | Low | Fast | But cannot achieve high returns |
| 200 | Medium | Medium | Reasonable tradeoff |
| 500 | High | Slow | Default for CartPole-v1 |
| 1000 | Very High | Very Slow | Credit assignment breaks down |
Observation: Gradient variance grows rapidly with episode length, since the gradient sums more terms and the returns themselves grow larger. This is why REINFORCE struggles with tasks requiring long-horizon planning.
Experiment 3: Policy Evolution During Training
```python
def visualize_policy_evolution(agent, env, checkpoints=[0, 100, 500, 1000]):
    """
    Visualize how the policy changes during training.
    """
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(2, 2, figsize=(12, 10))
    axes = axes.flatten()

    # Sample states across the observation space
    # For CartPole: [cart_position, cart_velocity, pole_angle, pole_velocity]
    angles = np.linspace(-0.2, 0.2, 50)      # Pole angle range
    velocities = np.linspace(-1, 1, 50)      # Angular velocity range

    for idx, checkpoint in enumerate(checkpoints):
        # Load checkpoint (assume we saved them during training)
        agent.policy.load_state_dict(torch.load(f"policy_ep{checkpoint}.pt"))

        # Compute action probabilities for each (angle, velocity) pair
        probs = np.zeros((50, 50))
        for i, angle in enumerate(angles):
            for j, vel in enumerate(velocities):
                state = np.array([0, 0, angle, vel])  # Fix cart pos/vel
                state_tensor = torch.FloatTensor(state).unsqueeze(0)
                with torch.no_grad():
                    dist = agent.policy(state_tensor)
                    probs[i, j] = dist.probs[0, 1].item()  # P(push right)

        # Plot
        ax = axes[idx]
        im = ax.imshow(probs, extent=[-1, 1, -0.2, 0.2],
                       aspect='auto', origin='lower', cmap='RdBu')
        ax.set_xlabel('Angular Velocity')
        ax.set_ylabel('Pole Angle')
        ax.set_title(f'Episode {checkpoint}')
        plt.colorbar(im, ax=ax, label='P(push right)')

    plt.suptitle('Policy Evolution During REINFORCE Training')
    plt.tight_layout()
    plt.show()
```
Early in training, the policy is nearly uniform (random). Gradually, it develops regions where it confidently pushes left or right. The boundary between these regions corresponds to the decision surface. A well-trained policy shows a clean diagonal boundary: push right when the pole is falling right, push left when falling left.
How does REINFORCE compare to DQN and other value-based approaches? Let's establish a clear comparison.
Sample Efficiency:
| Algorithm | Episodes to Solve | Environment Steps | Experience Reuse |
|---|---|---|---|
| REINFORCE | 800-2000 | ~200,000 | None (on-policy) |
| DQN | 300-500 | ~60,000 | High (replay buffer) |
| A2C | 400-700 | ~100,000 | None (on-policy) |
| PPO | 200-400 | ~50,000 | Limited (few epochs) |
Strengths of REINFORCE:

- Conceptual and implementation simplicity—the complete algorithm fits in a few dozen lines.
- Unbiased gradient estimates that follow directly from the policy gradient theorem.
- No value function, environment model, or replay buffer required.
- Works with any differentiable policy parameterization, over discrete or continuous actions.
Weaknesses of REINFORCE:

- Extremely high gradient variance, making learning noisy and unpredictable.
- Poor sample efficiency: each trajectory is used for a single on-policy update and then discarded.
- Requires complete episodes, so it applies only to episodic tasks.
- Highly sensitive to the learning rate, with a risk of catastrophic updates.
In practice, vanilla REINFORCE is rarely the best choice. Use it for: (1) Educational purposes to understand policy gradients. (2) Very simple environments where sample efficiency doesn't matter. (3) As a baseline for comparing more sophisticated algorithms. For real applications, use Actor-Critic or PPO instead.
REINFORCE implementations are prone to subtle bugs that can completely prevent learning. Here's a guide to common issues and how to diagnose them.
Bug 1: Sign Error in Loss
```python
# WRONG: Implements gradient descent on positive loss
# This MINIMIZES expected return!
loss = (log_probs * returns).mean()

# CORRECT: Negative sign for gradient ascent via descent
loss = -(log_probs * returns).mean()

# Alternative: Use optimizer with maximize=True (PyTorch 2.0+)
optimizer = optim.Adam(params, lr=1e-3, maximize=True)
loss = (log_probs * returns).mean()  # Positive is fine now
```
Bug 2: Detached Returns
```python
# Returns should NOT require gradients
# They are constants for the purpose of gradient computation

# WRONG: Returns computed with gradient tracking
returns = torch.tensor(returns_list, requires_grad=True)  # BAD

# CORRECT: Returns are just constants
returns = torch.tensor(returns_list, dtype=torch.float32)  # No grad
# or explicitly
returns = torch.tensor(returns_list).detach()
```
Bug 3: Incorrect Discount Factor Application
```python
# WRONG: Forgetting to discount
def wrong_returns(rewards):
    returns = []
    G = 0
    for r in reversed(rewards):
        G = r + G  # Missing gamma!
        returns.insert(0, G)
    return returns


# CORRECT: Proper discounting
def correct_returns(rewards, gamma=0.99):
    returns = []
    G = 0
    for r in reversed(rewards):
        G = r + gamma * G  # gamma applied
        returns.insert(0, G)
    return returns
```
Debugging Checklist:

- The loss carries a negative sign (or the optimizer is configured to maximize).
- Returns are plain tensors with no gradient tracking.
- The discount factor γ is actually applied when computing returns.
- Stored log probabilities remain connected to the policy's computation graph (no .detach() or .item() before the loss).
- Episode storage (log_probs, rewards) is cleared after every update.
- The agent learns a trivial sanity-check environment before you debug anything else (see below).
Create a trivial environment where the optimal policy is obvious (e.g., always take action 0 gives +1 reward, action 1 gives -1). If REINFORCE can't learn this, there's a bug. This eliminates environment complexity from debugging.
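Here is one possible sanity-check environment of the kind described above, sketched in the Gymnasium-style reset/step interface used throughout this page. The class name `TrivialBanditEnv` and the single-step episode structure are my own illustrative choices.

```python
import numpy as np

class TrivialBanditEnv:
    """One-step episodes: action 0 gives +1 reward, action 1 gives -1."""

    def __init__(self):
        self.state = np.zeros(1, dtype=np.float32)   # single dummy observation

    def reset(self):
        # Return (observation, info), matching the Gymnasium convention
        return self.state.copy(), {}

    def step(self, action):
        reward = 1.0 if action == 0 else -1.0
        # (observation, reward, terminated, truncated, info);
        # terminated=True makes every episode exactly one step long
        return self.state.copy(), reward, True, False, {}
```

Plugging this into the training loop above with `state_dim=1` and `action_dim=2`, π_θ(a=0|s) should climb toward 1 within a few hundred episodes; if it does not, the bug is in the agent rather than the environment.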
We've thoroughly examined REINFORCE, the foundational policy gradient algorithm. Let's consolidate what we've learned:

- REINFORCE is a Monte Carlo policy gradient method: collect a full episode, compute discounted returns, and update θ along Σ_t ∇_θ log π_θ(a_t|s_t) · G_t.
- The implementation needs only a policy network, a return computation, and a gradient step—no value function.
- Gradient estimates are unbiased but high-variance, which makes training sample-inefficient and sensitive to hyperparameters, especially the learning rate.
- Simple improvements—batch updates, return normalization, entropy regularization, gradient clipping, and learning rate scheduling—make it far more usable in practice.
What's next:
The high variance of REINFORCE motivates the core question we'll tackle next: How can we reduce variance while maintaining unbiased gradients? The next page covers variance reduction techniques, including baselines—the key insight that leads to Actor-Critic methods.
You now understand REINFORCE deeply—its implementation, behavior, strengths, and critical weaknesses. This understanding is essential because every advanced policy gradient algorithm (A2C, TRPO, PPO) addresses REINFORCE's limitations in specific ways. Next, we'll learn to tame the variance problem.