The policy gradient theorem gives us a mathematical recipe for optimizing policies. But how do we turn this recipe into a working algorithm? The answer is REINFORCE, introduced by Ronald Williams in 1992—one of the most elegant and historically important algorithms in reinforcement learning.
REINFORCE is deceptively simple: collect trajectories, compute returns, multiply by log-probabilities, update parameters. Yet within this simplicity lies the essence of all policy gradient methods. Understanding REINFORCE deeply prepares you for every advanced algorithm that follows, from Actor-Critic to PPO to TRPO.
In this page, we'll implement REINFORCE from scratch, run experiments to understand its behavior, analyze why it struggles in practice, and identify the key improvements needed for practical applications.
By the end of this page, you will be able to implement REINFORCE completely, understand each component's role, run it on standard environments, interpret training dynamics, and articulate its fundamental limitations. This foundation is essential for understanding all subsequent policy gradient algorithms.
REINFORCE is a Monte Carlo policy gradient algorithm. It uses complete episode trajectories to estimate policy gradients and update the policy parameters. The algorithm directly instantiates the policy gradient theorem.
Algorithm Overview:
```
REINFORCE Algorithm
═══════════════════════════════════════════════════════════════

Initialize:
    Policy network π_θ with random parameters θ
    Learning rate α
    Discount factor γ

Repeat for each episode:

    1. COLLECT TRAJECTORY
       τ = {s₀, a₀, r₀, s₁, a₁, r₁, ..., s_T, a_T, r_T}
       where a_t ~ π_θ(·|s_t)

    2. COMPUTE RETURNS (reward-to-go)
       For t = T, T-1, ..., 0:
           G_t = r_t + γ·G_{t+1}    (with G_{T+1} = 0)

    3. COMPUTE POLICY GRADIENT
       ∇_θ J(θ) ≈ Σ_{t=0}^{T} ∇_θ log π_θ(a_t|s_t) · G_t

    4. UPDATE PARAMETERS
       θ ← θ + α · ∇_θ J(θ)

Until convergence or maximum episodes reached
═══════════════════════════════════════════════════════════════

Key Property: No value function needed—pure policy gradient!
```
Why 'REINFORCE'?
The name comes from the psychological concept of reinforcement learning (the algorithm reinforces behaviors that lead to positive outcomes). Williams chose this name to emphasize that the algorithm strengthens actions associated with high returns and weakens actions associated with low returns.
Mathematically, for each action a_t taken in state s_t, the update is:

θ ← θ + α · G_t · ∇_θ log π_θ(a_t|s_t)

A positive return G_t increases the log-probability of a_t (reinforcing it); a negative return decreases it (weakening it).
This is trial-and-error learning formalized as gradient ascent.
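To see this in action, here is a minimal sketch (a toy setup of my own, not from the original: a 1-D state, two actions, and a single linear layer as the "policy") showing that one REINFORCE-style update raises the probability of an action paired with a positive return and lowers it when the return is negative.

```python
import torch
import torch.optim as optim
from torch.distributions import Categorical

torch.manual_seed(0)
policy = torch.nn.Linear(1, 2)                 # tiny "policy": 1-D state -> 2 action logits
optimizer = optim.SGD(policy.parameters(), lr=0.5)

state = torch.tensor([[1.0]])
action = torch.tensor([0])                      # the action we pretend was taken

for G in (1.0, -1.0):                           # a positive and a negative return
    dist = Categorical(logits=policy(state))
    p_before = dist.probs[0, 0].item()

    loss = -(G * dist.log_prob(action)).sum()   # single-step REINFORCE loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    p_after = Categorical(logits=policy(state)).probs[0, 0].item()
    print(f"G={G:+.0f}: P(a=0|s) {p_before:.3f} -> {p_after:.3f}")
# Expected: the probability rises after the G=+1 update and falls after G=-1.
```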
REINFORCE is a Monte Carlo method because it requires complete episodes to compute returns G_t. You cannot update until an episode terminates. This contrasts with temporal difference methods (like Actor-Critic) that can learn from incomplete episodes. The Monte Carlo property makes REINFORCE simple but limits its applicability to episodic tasks.
Let's implement REINFORCE from scratch, building each component carefully and explaining design decisions along the way.
Step 1: The Policy Network
```python
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.distributions import Categorical
import gymnasium as gym
import numpy as np
from typing import List, Tuple
from collections import deque


class PolicyNetwork(nn.Module):
    """
    Neural network that outputs action probabilities.

    Architecture:
    - Input: State observation (flattened if needed)
    - Hidden layers: ReLU activations for non-linearity
    - Output: Softmax over actions (categorical distribution)
    """

    def __init__(
        self,
        state_dim: int,
        action_dim: int,
        hidden_dims: List[int] = [128, 128]
    ):
        super().__init__()

        # Build hidden layers dynamically
        layers = []
        prev_dim = state_dim
        for hidden_dim in hidden_dims:
            layers.append(nn.Linear(prev_dim, hidden_dim))
            layers.append(nn.ReLU())
            prev_dim = hidden_dim

        self.hidden_layers = nn.Sequential(*layers)
        self.output_layer = nn.Linear(prev_dim, action_dim)

        # Initialize weights for better training stability
        self._initialize_weights()

    def _initialize_weights(self):
        """Initialize with small weights for stable starting policy."""
        for layer in self.hidden_layers:
            if isinstance(layer, nn.Linear):
                nn.init.orthogonal_(layer.weight, gain=np.sqrt(2))
                nn.init.constant_(layer.bias, 0)
        # Smaller initialization for output layer
        nn.init.orthogonal_(self.output_layer.weight, gain=0.01)
        nn.init.constant_(self.output_layer.bias, 0)

    def forward(self, state: torch.Tensor) -> Categorical:
        """
        Forward pass returns a categorical distribution.

        Using Categorical allows us to:
        1. Sample actions: action = dist.sample()
        2. Get log probs: log_prob = dist.log_prob(action)
        3. Get entropy: entropy = dist.entropy()
        """
        features = self.hidden_layers(state)
        logits = self.output_layer(features)
        return Categorical(logits=logits)

    def get_action(self, state: np.ndarray) -> Tuple[int, torch.Tensor]:
        """
        Sample an action and return its log probability.

        Returns:
            action: The sampled action (integer for discrete)
            log_prob: Log probability of the action (for gradient)
        """
        state_tensor = torch.FloatTensor(state).unsqueeze(0)
        dist = self.forward(state_tensor)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        return action.item(), log_prob
```
Step 2: The REINFORCE Agent
```python
class REINFORCEAgent:
    """
    Complete REINFORCE implementation with episode collection
    and policy gradient updates.
    """

    def __init__(
        self,
        state_dim: int,
        action_dim: int,
        learning_rate: float = 1e-3,
        gamma: float = 0.99,
        hidden_dims: List[int] = [128, 128]
    ):
        self.gamma = gamma

        # Initialize policy network
        self.policy = PolicyNetwork(state_dim, action_dim, hidden_dims)
        self.optimizer = optim.Adam(self.policy.parameters(), lr=learning_rate)

        # Storage for episode data
        self.log_probs: List[torch.Tensor] = []
        self.rewards: List[float] = []

    def select_action(self, state: np.ndarray) -> int:
        """Select action and store log probability for learning."""
        action, log_prob = self.policy.get_action(state)
        self.log_probs.append(log_prob)
        return action

    def store_reward(self, reward: float):
        """Store reward received after taking action."""
        self.rewards.append(reward)

    def compute_returns(self) -> torch.Tensor:
        """
        Compute discounted returns (reward-to-go) for the episode.

        G_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + ...

        Computed efficiently backwards:
        G_T = r_T
        G_t = r_t + γ·G_{t+1}
        """
        returns = []
        G = 0

        # Iterate backwards through rewards
        for reward in reversed(self.rewards):
            G = reward + self.gamma * G
            returns.insert(0, G)

        returns = torch.tensor(returns, dtype=torch.float32)

        # Normalize returns for training stability
        # This is a simple but effective variance reduction technique
        if len(returns) > 1:
            returns = (returns - returns.mean()) / (returns.std() + 1e-8)

        return returns

    def update(self) -> float:
        """
        Perform REINFORCE update using collected episode data.
        Returns the loss value for logging.
        """
        # Compute discounted returns
        returns = self.compute_returns()

        # Stack log probabilities
        log_probs = torch.cat(self.log_probs)

        # Policy gradient loss: -E[log π(a|s) · G]
        # Negative because optimizer minimizes, but we want to maximize
        policy_loss = -(log_probs * returns).sum()

        # Perform gradient update
        self.optimizer.zero_grad()
        policy_loss.backward()

        # Optional: Gradient clipping for stability
        torch.nn.utils.clip_grad_norm_(self.policy.parameters(), max_norm=1.0)

        self.optimizer.step()

        # Clear episode data
        loss_value = policy_loss.item()
        self.log_probs = []
        self.rewards = []

        return loss_value
```
Step 3: The Training Loop
```python
def train_reinforce(
    env_name: str = "CartPole-v1",
    num_episodes: int = 1000,
    learning_rate: float = 1e-3,
    gamma: float = 0.99,
    render: bool = False,
    log_interval: int = 100
) -> Tuple[List[float], REINFORCEAgent]:
    """
    Train a REINFORCE agent on the specified environment.

    Args:
        env_name: Gymnasium environment name
        num_episodes: Total training episodes
        learning_rate: Learning rate for policy optimization
        gamma: Discount factor
        render: Whether to render environment
        log_interval: Episodes between logging

    Returns:
        episode_rewards: List of total rewards per episode
        agent: Trained REINFORCE agent
    """
    # Create environment
    env = gym.make(env_name, render_mode="human" if render else None)

    # Get dimensions
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n

    # Create agent
    agent = REINFORCEAgent(
        state_dim=state_dim,
        action_dim=action_dim,
        learning_rate=learning_rate,
        gamma=gamma
    )

    # Training tracking
    episode_rewards = []
    recent_rewards = deque(maxlen=100)

    for episode in range(num_episodes):
        state, _ = env.reset()
        episode_reward = 0
        done = False

        # Collect trajectory
        while not done:
            action = agent.select_action(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            agent.store_reward(reward)
            episode_reward += reward
            state = next_state

        # Update policy after episode completes
        loss = agent.update()

        # Track rewards
        episode_rewards.append(episode_reward)
        recent_rewards.append(episode_reward)

        # Logging
        if (episode + 1) % log_interval == 0:
            avg_reward = np.mean(recent_rewards)
            print(f"Episode {episode + 1:4d} | "
                  f"Avg Reward: {avg_reward:7.2f} | "
                  f"Loss: {loss:8.2f}")

    env.close()
    return episode_rewards, agent


# Run training
if __name__ == "__main__":
    rewards, agent = train_reinforce(
        env_name="CartPole-v1",
        num_episodes=1000,
        learning_rate=1e-3
    )

    # Plot learning curve
    import matplotlib.pyplot as plt

    plt.figure(figsize=(10, 5))
    plt.plot(rewards, alpha=0.3, label="Episode Reward")

    # Smooth with moving average
    window = 50
    smoothed = np.convolve(rewards, np.ones(window)/window, mode='valid')
    plt.plot(range(window-1, len(rewards)), smoothed, label="Smoothed (50 ep)")

    plt.xlabel("Episode")
    plt.ylabel("Total Reward")
    plt.title("REINFORCE on CartPole-v1")
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()
```
Key implementation details: (1) Returns normalization helps stabilize training by keeping gradient magnitudes reasonable. (2) Gradient clipping prevents catastrophic updates. (3) Small output layer initialization ensures the initial policy is close to uniform. (4) Using PyTorch's Categorical distribution handles log probability computation correctly.
To truly understand REINFORCE, we must trace exactly how gradients flow through the computation. This understanding is crucial for debugging and extending the algorithm.
The Computation Graph:
```
REINFORCE Computation Graph
═══════════════════════════════════════════════════════════════

Forward Pass (during episode):

    state s_t
        ↓
    Policy Network π_θ
        ↓
    Action logits z = [z_1, z_2, ..., z_k]
        ↓
    Softmax: π(a|s) = exp(z_a) / Σ_i exp(z_i)
        ↓
    Sample action a_t ~ Categorical(π)
        ↓
    Log probability: log π_θ(a_t|s_t) = z_{a_t} - log Σ_i exp(z_i)

After episode ends:

    Rewards [r_0, r_1, ..., r_T]
        ↓
    Returns [G_0, G_1, ..., G_T]   (computed backwards)
        ↓
    Loss = -Σ_t log π_θ(a_t|s_t) · G_t

Backward Pass:

    ∂Loss/∂logits = -G_t · (e_{a_t} - π_θ(·|s_t))

    where e_{a_t} is the one-hot vector for the action taken
    and π_θ(·|s_t) is the probability vector
═══════════════════════════════════════════════════════════════
```
Gradient Interpretation:
The gradient of the loss with respect to the logits reveals exactly how REINFORCE learns:
∂Loss/∂z_a = -G_t · (1_{a=a_t} - π_θ(a|s_t))
For the action taken (a = a_t): the gradient is -G_t · (1 - π_θ(a_t|s_t)). A positive return pushes the logit up, increasing the probability of repeating a_t; a negative return pushes it down.
For other actions (a ≠ a_t): the gradient is G_t · π_θ(a|s_t). A positive return shrinks their probabilities; a negative return increases them.
This is probability mass redistribution: mass flows toward successful actions and away from unsuccessful ones.
Notice that ∂log π_θ(a|s)/∂z = e_a - π_θ(·|s). This is exactly the gradient of cross-entropy loss! REINFORCE can be interpreted as weighted cross-entropy, where we're trying to match a 'target distribution' that puts all probability on the action taken, weighted by the return.
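To make this connection concrete, here is a small numerical check (my own illustrative example; the logits, action index, and return value are arbitrary) confirming that autograd's gradient of the REINFORCE loss matches the analytic expression -G_t · (e_{a_t} - π_θ(·|s_t)) from the computation graph above.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([0.5, -1.0, 2.0, 0.1], requires_grad=True)
action = torch.tensor(2)     # index of the action actually taken
G = 3.0                      # observed return for this step

# REINFORCE loss for a single (state, action) pair
loss = -G * torch.log_softmax(logits, dim=0)[action]
loss.backward()

# Analytic gradient from the text: ∂Loss/∂z = -G · (one_hot(a_t) - π_θ(·|s_t))
probs = torch.softmax(logits.detach(), dim=0)
one_hot = F.one_hot(action, num_classes=4).float()
analytic = -G * (one_hot - probs)

print(torch.allclose(logits.grad, analytic))   # True

# The same gradient would come from G-weighted cross-entropy with the taken
# action as the "target class":
#   loss = G * F.cross_entropy(logits.unsqueeze(0), action.unsqueeze(0))
```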
REINFORCE has a critical weakness: extremely high variance of gradient estimates. This isn't a minor inconvenience—it's the central challenge that spawned decades of research into variance reduction. Let's understand exactly where this variance comes from.
Sources of Variance: every source of randomness along a trajectory contributes—stochastic action sampling from π_θ, stochastic environment transitions and rewards, and the return G_t, which accumulates this noise over the entire episode. Longer episodes and larger return magnitudes both inflate the variance of the resulting gradient estimate.
Quantifying the Variance:
```
# Variance of policy gradient estimator
Var[∇̂_θ J(θ)] = E[(∇̂_θ J(θ) - ∇_θ J(θ))²]

# For a single trajectory:
∇̂_θ J(θ) = Σ_t ∇_θ log π_θ(a_t|s_t) · G_t

# Variance decomposes as:
Var[∇̂_θ J(θ)] = Var[Σ_t ∇_θ log π_θ(a_t|s_t) · G_t]

# Key insight: Variance scales with:
- |G_t|²: Square of return magnitude
- T: Episode length (more terms to sum)
- Policy entropy: Low entropy → high log prob variance
- Environment stochasticity

# Example: CartPole with 500-step episode
# Returns range from ~10 to ~500
# Variance can be O(10⁵) to O(10⁶) in gradient magnitude!
```
Visualizing Variance:
Let's examine how gradient estimates vary across episodes:
```python
def analyze_gradient_variance(agent, env, num_episodes=100):
    """
    Collect gradient estimates across multiple episodes
    to visualize variance.
    """
    gradient_samples = []
    return_samples = []

    for _ in range(num_episodes):
        state, _ = env.reset()
        done = False

        # Collect episode
        log_probs = []
        rewards = []

        while not done:
            state_tensor = torch.FloatTensor(state).unsqueeze(0)
            dist = agent.policy(state_tensor)
            action = dist.sample()
            log_prob = dist.log_prob(action)

            next_state, reward, terminated, truncated, _ = env.step(action.item())
            done = terminated or truncated

            log_probs.append(log_prob)
            rewards.append(reward)
            state = next_state

        # Compute returns
        returns = []
        G = 0
        for r in reversed(rewards):
            G = r + 0.99 * G
            returns.insert(0, G)
        returns = torch.tensor(returns)

        return_samples.append(returns[0].item())  # First return

        # Compute gradient (without updating)
        agent.optimizer.zero_grad()
        log_probs = torch.cat(log_probs)
        loss = -(log_probs * returns).sum()
        loss.backward()

        # Extract gradient norm
        total_grad_norm = 0
        for param in agent.policy.parameters():
            if param.grad is not None:
                total_grad_norm += param.grad.norm().item() ** 2
        gradient_samples.append(np.sqrt(total_grad_norm))

    # Analysis
    print(f"Return mean: {np.mean(return_samples):.2f}, std: {np.std(return_samples):.2f}")
    print(f"Gradient norm mean: {np.mean(gradient_samples):.2f}, std: {np.std(gradient_samples):.2f}")
    print(f"Coefficient of variation: {np.std(gradient_samples)/np.mean(gradient_samples):.2f}")

    return gradient_samples, return_samples
```
High variance means: (1) Many episodes needed for reliable gradient estimates. (2) Learning progress is noisy and unpredictable. (3) Hyperparameters (especially learning rate) are extremely sensitive. (4) Catastrophic forgetting can occur from bad gradient updates. This is why vanilla REINFORCE often requires millions of environment steps to learn even simple tasks.
While we'll cover advanced variance reduction in the next page, several simple techniques can significantly improve REINFORCE's practical performance.
Improvement 1: Batch Updates
Instead of updating after each episode, collect a batch of episodes and average gradients:
```python
class BatchREINFORCE:
    """REINFORCE with batch updates for reduced variance."""

    def __init__(self, state_dim, action_dim, lr=1e-3, gamma=0.99, batch_size=10):
        self.policy = PolicyNetwork(state_dim, action_dim)
        self.optimizer = optim.Adam(self.policy.parameters(), lr=lr)
        self.gamma = gamma
        self.batch_size = batch_size

        # Batch storage
        self.batch_log_probs = []
        self.batch_returns = []

    def collect_episode(self, env):
        """Collect one episode and store data."""
        state, _ = env.reset()
        done = False

        log_probs = []
        rewards = []

        while not done:
            action, log_prob = self.policy.get_action(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            log_probs.append(log_prob)
            rewards.append(reward)
            state = next_state

        # Compute returns
        returns = []
        G = 0
        for r in reversed(rewards):
            G = r + self.gamma * G
            returns.insert(0, G)

        self.batch_log_probs.append(torch.cat(log_probs))
        self.batch_returns.append(torch.tensor(returns))

        return sum(rewards)

    def update(self):
        """Update policy using batched gradients."""
        if len(self.batch_log_probs) < self.batch_size:
            return None

        # Concatenate all episode data
        all_log_probs = torch.cat(self.batch_log_probs)
        all_returns = torch.cat(self.batch_returns)

        # Normalize returns across the entire batch
        all_returns = (all_returns - all_returns.mean()) / (all_returns.std() + 1e-8)

        # Compute loss
        loss = -(all_log_probs * all_returns).mean()

        # Update
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        # Clear batch
        self.batch_log_probs = []
        self.batch_returns = []

        return loss.item()
```
Improvement 2: Entropy Regularization
Encourage exploration by adding an entropy bonus to the objective:
```python
def compute_loss_with_entropy(
    log_probs: torch.Tensor,
    returns: torch.Tensor,
    policy: PolicyNetwork,
    states: torch.Tensor,
    entropy_coef: float = 0.01
) -> Tuple[torch.Tensor, float]:
    """
    Policy gradient loss with entropy regularization.

    L = -E[log π(a|s) · G] - β · H(π(·|s))

    The entropy term:
    - Encourages exploration (uniform policy has max entropy)
    - Prevents premature convergence to deterministic policies
    - Helps escape local optima
    """
    # Policy gradient term
    policy_loss = -(log_probs * returns).mean()

    # Compute entropy for each state
    distributions = policy(states)
    entropy = distributions.entropy().mean()

    # Negative because we want to maximize entropy
    # (but the optimizer minimizes the loss)
    total_loss = policy_loss - entropy_coef * entropy

    return total_loss, entropy.item()


# Typical entropy coefficient: 0.01 to 0.1
# Higher values → more exploration, slower convergence
# Lower values → faster convergence, risk of local optima
```
Improvement 3: Learning Rate Scheduling
Adapt learning rate over training for better convergence:
```python
# Linear decay: common for policy gradients
def linear_schedule(initial_lr: float, final_lr: float, total_steps: int):
    """Linear learning rate decay."""
    def get_lr(step: int) -> float:
        fraction = min(step / total_steps, 1.0)
        return initial_lr + fraction * (final_lr - initial_lr)
    return get_lr


# Usage with PyTorch optimizer
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda episode: max(0.1, 1 - episode / total_episodes)
)

# Alternative: Reduce on plateau when performance stagnates
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer,
    mode='max',       # We're maximizing reward
    factor=0.5,
    patience=50,
    verbose=True
)
```
For best results, combine: (1) Batch updates (batch_size=10-20). (2) Return normalization within batches. (3) Entropy regularization (coef=0.01). (4) Gradient clipping (max_norm=0.5-1.0). (5) Learning rate scheduling. These together can make REINFORCE surprisingly competitive on simple tasks.
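As a sketch of how these pieces fit together in a single update, here is one possible batched update step. It assumes the imports and `PolicyNetwork` defined earlier; the function name `combined_update` and the `batch_states` argument (per-step state tensors of shape `(1, state_dim)`, which the `BatchREINFORCE` snippet above does not store) are my own additions for illustration, and the coefficients are just example values.

```python
def combined_update(policy, optimizer, batch_log_probs, batch_returns, batch_states,
                    entropy_coef=0.01, max_grad_norm=0.5):
    """One batched REINFORCE update combining improvements (1)-(4); a sketch."""
    log_probs = torch.cat(batch_log_probs)   # per-episode log-prob tensors
    returns = torch.cat(batch_returns)       # per-episode return tensors
    states = torch.cat(batch_states)         # assumed per-step states, shape (N, state_dim)

    # (2) Normalize returns across the whole batch
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)

    # (3) Entropy bonus computed from the current policy
    entropy = policy(states).entropy().mean()

    # (1) Batched policy gradient loss with entropy regularization
    loss = -(log_probs * returns).mean() - entropy_coef * entropy

    optimizer.zero_grad()
    loss.backward()

    # (4) Clip gradients before the optimizer step
    torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm=max_grad_norm)
    optimizer.step()

    # (5) A learning rate scheduler, if used, would call scheduler.step() here
    return loss.item()
```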
Let's analyze REINFORCE's behavior experimentally to build intuition about its learning dynamics.
Experiment 1: Effect of Learning Rate
| Learning Rate | Episodes to Solve | Final Performance | Stability |
|---|---|---|---|
| 1e-2 | Did not solve | ~50 (oscillating) | Very unstable |
| 5e-3 | ~3000 episodes | ~400 (variable) | Some instability |
| 1e-3 | ~800 episodes | ~480 (stable) | Good |
| 5e-4 | ~1500 episodes | ~490 (very stable) | Excellent |
| 1e-4 | ~4000 episodes | ~450 (slow improvement) | Excellent |
Observation: REINFORCE is extremely sensitive to learning rate. Too high causes catastrophic forgetting; too low makes learning impractically slow. The optimal range is narrow.
Experiment 2: Effect of Episode Length
| Max Episode Length | Gradient Variance | Learning Speed | Notes |
|---|---|---|---|
| 100 | Low | Fast | But cannot achieve high returns |
| 200 | Medium | Medium | Reasonable tradeoff |
| 500 | High | Slow | Default for CartPole-v1 |
| 1000 | Very High | Very Slow | Credit assignment breaks down |
Observation: Gradient variance grows rapidly with episode length, since the gradient sums more terms and the returns themselves grow larger. This is why REINFORCE struggles with tasks requiring long-horizon planning.
Experiment 3: Policy Evolution During Training
```python
def visualize_policy_evolution(agent, env, checkpoints=[0, 100, 500, 1000]):
    """
    Visualize how the policy changes during training.
    """
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(2, 2, figsize=(12, 10))
    axes = axes.flatten()

    # Sample states across the observation space
    # For CartPole: [cart_position, cart_velocity, pole_angle, pole_velocity]
    angles = np.linspace(-0.2, 0.2, 50)      # Pole angle range
    velocities = np.linspace(-1, 1, 50)      # Angular velocity range

    for idx, checkpoint in enumerate(checkpoints):
        # Load checkpoint (assume we saved them during training)
        agent.policy.load_state_dict(torch.load(f"policy_ep{checkpoint}.pt"))

        # Compute action probabilities for each (angle, velocity) pair
        probs = np.zeros((50, 50))
        for i, angle in enumerate(angles):
            for j, vel in enumerate(velocities):
                state = np.array([0, 0, angle, vel])  # Fix cart pos/vel
                state_tensor = torch.FloatTensor(state).unsqueeze(0)
                with torch.no_grad():
                    dist = agent.policy(state_tensor)
                    probs[i, j] = dist.probs[0, 1].item()  # P(push right)

        # Plot
        ax = axes[idx]
        im = ax.imshow(probs, extent=[-1, 1, -0.2, 0.2],
                       aspect='auto', origin='lower', cmap='RdBu')
        ax.set_xlabel('Angular Velocity')
        ax.set_ylabel('Pole Angle')
        ax.set_title(f'Episode {checkpoint}')
        plt.colorbar(im, ax=ax, label='P(push right)')

    plt.suptitle('Policy Evolution During REINFORCE Training')
    plt.tight_layout()
    plt.show()
```
Early in training, the policy is nearly uniform (random). Gradually, it develops regions where it confidently pushes left or right. The boundary between these regions corresponds to the decision surface. A well-trained policy shows a clean diagonal boundary: push right when the pole is falling right, push left when falling left.
How does REINFORCE compare to DQN and other value-based approaches? Let's establish a clear comparison.
Sample Efficiency:
| Algorithm | Episodes to Solve | Environment Steps | Experience Reuse |
|---|---|---|---|
| REINFORCE | 800-2000 | ~200,000 | None (on-policy) |
| DQN | 300-500 | ~60,000 | High (replay buffer) |
| A2C | 400-700 | ~100,000 | None (on-policy) |
| PPO | 200-400 | ~50,000 | Limited (few epochs) |
Strengths of REINFORCE:

- Conceptual and implementation simplicity—the complete algorithm fits in a few dozen lines.
- Unbiased gradient estimates that follow directly from the policy gradient theorem.
- No value function, environment model, or replay buffer required.
- Works with any differentiable policy parameterization, over discrete or continuous actions.
Weaknesses of REINFORCE:

- Extremely high gradient variance, making learning noisy and unpredictable.
- Poor sample efficiency: each trajectory is used for a single on-policy update and then discarded.
- Requires complete episodes, so it applies only to episodic tasks.
- Highly sensitive to the learning rate, with a risk of catastrophic updates.
In practice, vanilla REINFORCE is rarely the best choice. Use it for: (1) Educational purposes to understand policy gradients. (2) Very simple environments where sample efficiency doesn't matter. (3) As a baseline for comparing more sophisticated algorithms. For real applications, use Actor-Critic or PPO instead.
REINFORCE implementations are prone to subtle bugs that can completely prevent learning. Here's a guide to common issues and how to diagnose them.
Bug 1: Sign Error in Loss
```python
# WRONG: Implements gradient descent on positive loss
# This MINIMIZES expected return!
loss = (log_probs * returns).mean()

# CORRECT: Negative sign for gradient ascent via descent
loss = -(log_probs * returns).mean()

# Alternative: Use optimizer with maximize=True (PyTorch 2.0+)
optimizer = optim.Adam(params, lr=1e-3, maximize=True)
loss = (log_probs * returns).mean()  # Positive is fine now
```
Bug 2: Detached Returns
```python
# Returns should NOT require gradients
# They are constants for the purpose of gradient computation

# WRONG: Returns computed with gradient tracking
returns = torch.tensor(returns_list, requires_grad=True)  # BAD

# CORRECT: Returns are just constants
returns = torch.tensor(returns_list, dtype=torch.float32)  # No grad
# or explicitly
returns = torch.tensor(returns_list).detach()
```
Bug 3: Incorrect Discount Factor Application
```python
# WRONG: Forgetting to discount
def wrong_returns(rewards):
    returns = []
    G = 0
    for r in reversed(rewards):
        G = r + G  # Missing gamma!
        returns.insert(0, G)
    return returns


# CORRECT: Proper discounting
def correct_returns(rewards, gamma=0.99):
    returns = []
    G = 0
    for r in reversed(rewards):
        G = r + gamma * G  # gamma applied
        returns.insert(0, G)
    return returns
```
Debugging Checklist:

- The loss carries a negative sign (or the optimizer is configured to maximize).
- Returns are plain tensors with no gradient tracking.
- The discount factor γ is actually applied when computing returns.
- Stored log probabilities remain connected to the policy's computation graph (no .detach() or .item() before the loss).
- Episode storage (log_probs, rewards) is cleared after every update.
- The agent learns a trivial sanity-check environment before you debug anything else (see below).
Create a trivial environment where the optimal policy is obvious (e.g., always take action 0 gives +1 reward, action 1 gives -1). If REINFORCE can't learn this, there's a bug. This eliminates environment complexity from debugging.
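Here is one possible sanity-check environment of the kind described above, sketched in the Gymnasium-style reset/step interface used throughout this page. The class name `TrivialBanditEnv` and the single-step episode structure are my own illustrative choices.

```python
import numpy as np

class TrivialBanditEnv:
    """One-step episodes: action 0 gives +1 reward, action 1 gives -1."""

    def __init__(self):
        self.state = np.zeros(1, dtype=np.float32)   # single dummy observation

    def reset(self):
        # Return (observation, info), matching the Gymnasium convention
        return self.state.copy(), {}

    def step(self, action):
        reward = 1.0 if action == 0 else -1.0
        # (observation, reward, terminated, truncated, info);
        # terminated=True makes every episode exactly one step long
        return self.state.copy(), reward, True, False, {}
```

Plugging this into the training loop above with `state_dim=1` and `action_dim=2`, π_θ(a=0|s) should climb toward 1 within a few hundred episodes; if it does not, the bug is in the agent rather than the environment.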
We've thoroughly examined REINFORCE, the foundational policy gradient algorithm. Let's consolidate what we've learned:

- REINFORCE is a Monte Carlo policy gradient method: collect a full episode, compute discounted returns, and update θ along Σ_t ∇_θ log π_θ(a_t|s_t) · G_t.
- The implementation needs only a policy network, a return computation, and a gradient step—no value function.
- Gradient estimates are unbiased but high-variance, which makes training sample-inefficient and sensitive to hyperparameters, especially the learning rate.
- Simple improvements—batch updates, return normalization, entropy regularization, gradient clipping, and learning rate scheduling—make it far more usable in practice.
What's next:
The high variance of REINFORCE motivates the core question we'll tackle next: How can we reduce variance while maintaining unbiased gradients? The next page covers variance reduction techniques, including baselines—the key insight that leads to Actor-Critic methods.
You now understand REINFORCE deeply—its implementation, behavior, strengths, and critical weaknesses. This understanding is essential because every advanced policy gradient algorithm (A2C, TRPO, PPO) addresses REINFORCE's limitations in specific ways. Next, we'll learn to tame the variance problem.