In value-based reinforcement learning, we learned to estimate the value of states and actions, then extract policies from these values. But what if we could directly optimize the policy itself—the mapping from observations to actions—without the intermediary of value functions?
This seemingly simple question leads to one of the most profound results in reinforcement learning: the Policy Gradient Theorem. This theorem provides the mathematical foundation for learning policies through gradient ascent, enabling neural networks to directly output action probabilities that maximize expected cumulative reward.
The policy gradient approach unlocks capabilities that value-based methods struggle with: continuous action spaces, stochastic policies, and better convergence guarantees in certain settings. Many modern deep RL breakthroughs—from AlphaGo to robotic manipulation to large language model fine-tuning via RLHF—build on the theoretical foundation we'll establish on this page.
By the end of this page, you will deeply understand the policy gradient theorem—its derivation, assumptions, mathematical structure, and implications. You'll see how we can compute gradients of expected return with respect to policy parameters, even when the dynamics of the environment are unknown. This understanding is essential for everything that follows in policy optimization.
Before deriving the policy gradient theorem, we must precisely define what we're optimizing. In policy-based methods, we parameterize the policy directly with parameters θ (typically neural network weights) and seek to maximize expected cumulative reward.
The Parameterized Policy:
A policy π_θ(a|s) defines the probability of taking action a in state s, given parameters θ. For discrete actions, this might be a softmax over action logits; for continuous actions, it might be a Gaussian distribution with learned mean and variance.
The Objective Function:
We want to find parameters θ* that maximize the expected return:
```
# The Policy Optimization Objective

J(θ) = E_τ~π_θ[R(τ)]

where:
- τ = (s₀, a₀, r₀, s₁, a₁, r₁, ...) is a trajectory
- R(τ) = Σ_{t=0}^{T} γ^t r_t is the discounted return
- The expectation is over trajectories induced by policy π_θ

# Goal: Find θ* = argmax_θ J(θ)

# Trajectory probability under π_θ:
P(τ|θ) = p(s₀) Π_{t=0}^{T} π_θ(a_t|s_t) · P(s_{t+1}|s_t, a_t)

# Expanding the objective:
J(θ) = ∫ P(τ|θ) R(τ) dτ    (continuous)
J(θ) = Σ_τ P(τ|θ) R(τ)     (discrete)
```
Understanding the Trajectory Distribution:
The probability of a trajectory τ under policy π_θ depends on three factors: the initial state distribution p(s₀), the policy's action probabilities π_θ(a_t|s_t) at each step, and the environment's transition probabilities P(s_{t+1}|s_t, a_t).
A crucial observation: the environment dynamics P(s_{t+1}|s_t, a_t) and initial distribution p(s₀) do not depend on θ. Only the policy terms contain our learnable parameters. This separation is what makes the policy gradient theorem possible.
The objective J(θ) involves an expectation over trajectories. We can't simply backpropagate through it because: (1) we don't know the environment dynamics to differentiate through transitions, and (2) even if we did, the expectation involves a sum/integral over exponentially many trajectories. The policy gradient theorem gives us a way to estimate gradients using only sampled trajectories.
Now we arrive at the heart of policy-based methods: deriving a tractable expression for ∇_θ J(θ). This derivation uses the famous log-derivative trick (also called the REINFORCE trick or score function estimator) and reveals a beautiful mathematical structure.
Step 1: Write out the gradient of the objective
```
# Starting point: gradient of expected return
∇_θ J(θ) = ∇_θ E_τ~π_θ[R(τ)]
         = ∇_θ Σ_τ P(τ|θ) R(τ)
         = Σ_τ ∇_θ P(τ|θ) R(τ)    # Leibniz integral rule

# Problem: ∇_θ P(τ|θ) requires differentiating through the trajectory distribution
# This seems to require knowledge of environment dynamics...
```
Step 2: Apply the log-derivative trick
The key insight is the identity: ∇_θ log f(θ) = ∇_θ f(θ) / f(θ), which rearranges to:
∇_θ f(θ) = f(θ) · ∇_θ log f(θ)
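As a quick sanity check, the identity can be verified numerically on a toy distribution. The sketch below is purely illustrative (the Bernoulli parameterization and the function f are arbitrary choices, not part of the derivation): it compares a score-function estimate of ∇_θ E_{x~π_θ}[f(x)] against the analytic gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy check of the log-derivative trick on a Bernoulli "policy" over one binary action.
# J(θ) = E_{x ~ Bernoulli(p)}[f(x)] with p = sigmoid(θ), so dJ/dθ = p(1-p)·(f(1) - f(0)).
theta = 0.3
f = lambda x: 3.0 * x - 1.0                      # arbitrary reward function
p = 1.0 / (1.0 + np.exp(-theta))
analytic_grad = p * (1.0 - p) * (f(1) - f(0))

# Score-function estimator: dJ/dθ = E[f(x) · ∇_θ log π_θ(x)], and for a Bernoulli
# parameterized by its logit, ∇_θ log π_θ(x) = x - p.
x = rng.binomial(1, p, size=200_000)
estimate = np.mean(f(x) * (x - p))

print(f"analytic: {analytic_grad:.4f}   score-function estimate: {estimate:.4f}")
```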
Applying this to the trajectory probability:
```
# The log-derivative trick:
∇_θ P(τ|θ) = P(τ|θ) · ∇_θ log P(τ|θ)

# Substituting back:
∇_θ J(θ) = Σ_τ P(τ|θ) · ∇_θ log P(τ|θ) · R(τ)
         = E_τ~π_θ[∇_θ log P(τ|θ) · R(τ)]

# This is now an expectation we can estimate via sampling!
```
Step 3: Simplify the log-probability of trajectories
Now we examine what ∇_θ log P(τ|θ) actually looks like:
```
# Trajectory probability:
P(τ|θ) = p(s₀) Π_{t=0}^{T} π_θ(a_t|s_t) · P(s_{t+1}|s_t, a_t)

# Taking the log:
log P(τ|θ) = log p(s₀) + Σ_{t=0}^{T} log π_θ(a_t|s_t) + Σ_{t=0}^{T} log P(s_{t+1}|s_t, a_t)

# Taking the gradient w.r.t. θ:
∇_θ log P(τ|θ) = 0 + Σ_{t=0}^{T} ∇_θ log π_θ(a_t|s_t) + 0
               = Σ_{t=0}^{T} ∇_θ log π_θ(a_t|s_t)

# The environment dynamics terms vanish because they don't depend on θ!
```

∇_θ J(θ) = E_τ~π_θ[ Σ_{t=0}^{T} ∇_θ log π_θ(a_t|s_t) · R(τ) ]
This is the fundamental result. The gradient of expected return equals the expected value of the gradient of log-policy probabilities, weighted by trajectory returns. We can estimate this using only sampled trajectories, without knowing environment dynamics!
Why this works: Intuitive understanding
The policy gradient has a beautiful interpretation. The score ∇_θ log π_θ(a_t|s_t) points in the direction in parameter space that makes action a_t more likely in state s_t. Weighting each score by the return R(τ) means that actions from high-return trajectories receive a strong push toward higher probability, while actions from low-return (or negative-return) trajectories receive a weak push or are actively suppressed.
This is trial and error learning at its mathematical finest—we reinforce behaviors that lead to good outcomes.
The term ∇_θ log π_θ(a|s) has a special name in statistics: the score function. Understanding the score function perspective provides deeper insight into policy gradients and connects them to broader statistical theory.
Properties of the Score Function:
The key property is that the expected score under the policy is zero; this is exactly what later lets us subtract a baseline without biasing the gradient:
```
# Proof that the expected score is zero:

E_{a~π_θ}[∇_θ log π_θ(a|s)] = Σ_a π_θ(a|s) · ∇_θ log π_θ(a|s)
                            = Σ_a π_θ(a|s) · ∇_θ π_θ(a|s) / π_θ(a|s)
                            = Σ_a ∇_θ π_θ(a|s)
                            = ∇_θ Σ_a π_θ(a|s)
                            = ∇_θ 1
                            = 0 ✓

# This property allows us to add baselines without introducing bias!
```
The Score Function in Practice:
For common policy parameterizations, the score function has closed-form expressions:
| Policy Type | Parameters | Score Function ∇_θ log π_θ(a|s) |
|---|---|---|
| Softmax (Discrete) | θ = logits for each action | e_a - softmax(θ) where e_a is one-hot |
| Gaussian (Continuous) | θ = (μ, σ) mean and std | ∂/∂μ: (a - μ)/σ², ∂/∂σ: ((a-μ)²/σ² - 1)/σ |
| Categorical | θ = probability vector | e_a/π_θ(a|s) (simplex constraint) |
| Beta (Bounded) | θ = (α, β) shape parameters | ∂ log Beta(a;α,β) / ∂(α,β) |
In deep RL, θ represents neural network weights. The score ∇_θ log π_θ(a|s) is computed via backpropagation through the network. For a softmax output layer with discrete actions, this is simply the negative of the gradient of the cross-entropy loss between the one-hot action and the network's output distribution.
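To make that concrete, here is a small self-contained check (a toy softmax over raw logits, not one of the networks discussed later) that PyTorch autograd reproduces the closed-form score e_a − softmax(θ) from the table:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy softmax policy parameterized directly by its logits θ (4 discrete actions).
logits = torch.randn(4, requires_grad=True)
action = torch.tensor(2)                        # arbitrary sampled action

# Score via autograd: backpropagate through log π_θ(a)
log_prob = F.log_softmax(logits, dim=-1)[action]
log_prob.backward()

# Closed form from the table: ∇_θ log π_θ(a) = e_a - softmax(θ)
closed_form = F.one_hot(action, num_classes=4).float() - F.softmax(logits.detach(), dim=-1)

print(torch.allclose(logits.grad, closed_form, atol=1e-6))   # True
```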
The basic policy gradient theorem sums over all timesteps and weights by total trajectory return. But there's finer temporal structure we can exploit. Actions at time t cannot influence rewards that have already occurred at times t' < t.
Causality in Policy Gradients:
Consider the contribution from a single timestep t in our gradient:
```
# Original form (all rewards):
∇_θ J(θ) = E_τ[ Σ_{t=0}^{T} ∇_θ log π_θ(a_t|s_t) · R(τ) ]

  where R(τ) = Σ_{t'=0}^{T} γ^{t'} r_{t'}

# Exploiting causality:
Action a_t can only affect rewards r_t, r_{t+1}, r_{t+2}, ...
It cannot affect r_0, r_1, ..., r_{t-1}

# Therefore we can rewrite:
∇_θ J(θ) = E_τ[ Σ_{t=0}^{T} ∇_θ log π_θ(a_t|s_t) · G_t ]

  where G_t = Σ_{t'=t}^{T} γ^{t'-t} r_{t'}   (reward-to-go from time t)

# This is a more efficient and lower-variance estimator!
```
Why Reward-to-Go Reduces Variance:
Using the reward-to-go G_t instead of the total return R(τ) doesn't change the expected gradient (the estimator remains unbiased), but it dramatically reduces variance: each log-probability term is now weighted only by the rewards it could actually have influenced, rather than by noise from rewards that occurred before the action was taken.
While mathematically equivalent in expectation, reward-to-go can reduce variance by orders of magnitude. In a 1000-step episode, using full return means each gradient is weighted by ~1000 reward terms. Using reward-to-go, the last action is weighted by only 1 reward term. This makes learning much more efficient.
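The following toy simulation illustrates this claim under assumptions chosen only for clarity (a 100-step episode, a single-logit Bernoulli policy that earns reward 1 whenever it picks action 1, and γ = 1): both estimators have the same mean, but the reward-to-go version has a much smaller standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_episodes = 100, 5000
theta = 0.0
p = 1.0 / (1.0 + np.exp(-theta))                 # π_θ(a=1)

full_return_grads, reward_to_go_grads = [], []
for _ in range(n_episodes):
    actions = rng.binomial(1, p, size=T)
    rewards = actions.astype(float)              # reward 1 whenever action 1 is taken
    scores = actions - p                         # ∇_θ log π_θ(a_t) for a Bernoulli logit

    R = rewards.sum()                            # total return R(τ) with γ = 1
    G = np.cumsum(rewards[::-1])[::-1]           # reward-to-go G_t

    full_return_grads.append(np.sum(scores * R))
    reward_to_go_grads.append(np.sum(scores * G))

# Both estimators target the same true gradient, dJ/dθ = T·p·(1-p) = 25 here.
print("full return:  mean %.1f  std %.1f" % (np.mean(full_return_grads), np.std(full_return_grads)))
print("reward-to-go: mean %.1f  std %.1f" % (np.mean(reward_to_go_grads), np.std(reward_to_go_grads)))
```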
The State-Action Value Connection:
We can push this further by recognizing a connection to value functions. The expected reward-to-go starting from state s, taking action a, and following policy π thereafter is exactly the Q-function:
```
# Definition of Q-function:
Q^π(s, a) = E_π[G_t | s_t = s, a_t = a]
          = E_π[Σ_{k=0}^{∞} γ^k r_{t+k} | s_t = s, a_t = a]

# Alternative form of policy gradient:
∇_θ J(θ) = E_{s~ρ^π, a~π_θ}[∇_θ log π_θ(a|s) · Q^{π_θ}(s, a)]

  where ρ^π(s) = Σ_{t=0}^{∞} γ^t P(s_t = s | π)
  is the discounted state visitation distribution

# This connects policy gradients to value-based concepts!
```
The policy gradient theorem admits several equivalent formulations, each offering different insights and computational properties. Understanding these alternatives is essential for implementing and improving policy gradient methods.
Formulation 1: Trajectory-Based
```
# Trajectory-based policy gradient:
∇_θ J(θ) = E_τ~π_θ[(Σ_{t=0}^{T} ∇_θ log π_θ(a_t|s_t)) · R(τ)]

# Monte Carlo estimate (sample N trajectories):
∇̂_θ J(θ) ≈ (1/N) Σ_{i=1}^{N} (Σ_{t=0}^{T_i} ∇_θ log π_θ(a_t^i|s_t^i)) · R(τ^i)

# Advantages:
# - Simple to implement
# - Works with any trajectory length
# Disadvantages:
# - High variance
# - Inefficient use of temporal structure
```
Formulation 2: Per-Step with Reward-to-Go
```
# Per-step policy gradient with reward-to-go:
∇_θ J(θ) = E_τ~π_θ[Σ_{t=0}^{T} ∇_θ log π_θ(a_t|s_t) · G_t]

# Monte Carlo estimate:
∇̂_θ J(θ) ≈ (1/N) Σ_{i=1}^{N} Σ_{t=0}^{T_i} ∇_θ log π_θ(a_t^i|s_t^i) · G_t^i

  where G_t^i = Σ_{k=t}^{T_i} γ^{k-t} r_k^i

# Advantages:
# - Lower variance than full trajectory
# - Respects causal structure
# Disadvantages:
# - Still high variance in long episodes
```
Formulation 3: State-Value Form
```
# Using Q-function:
∇_θ J(θ) = E_{s~ρ^π, a~π_θ}[∇_θ log π_θ(a|s) · Q^π(s, a)]

# Using Advantage function:
∇_θ J(θ) = E_{s~ρ^π, a~π_θ}[∇_θ log π_θ(a|s) · A^π(s, a)]

  where A^π(s, a) = Q^π(s, a) - V^π(s)

# Advantages:
# - Advantage form has lower variance
# - Enables actor-critic methods
# Disadvantages:
# - Requires learning value functions
```
In practice, the advantage formulation is most commonly used in modern algorithms. It combines the unbiased nature of policy gradients with variance reduction from learned value functions. We'll explore this extensively in the Actor-Critic section.
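As a preview of how the advantage form is typically wired up in code (a minimal sketch, assuming the log-probabilities, empirical returns, and a learned baseline's value predictions are already available; the full actor-critic machinery comes later), the policy loss simply swaps returns for advantages, with the baseline detached so the policy gradient does not flow into the value network:

```python
import torch
import torch.nn.functional as F

def advantage_policy_loss(
    log_probs: torch.Tensor,   # log π_θ(a_t|s_t) for the actions actually taken
    returns: torch.Tensor,     # empirical returns G_t
    values: torch.Tensor       # V(s_t) predicted by a learned baseline (critic)
) -> tuple[torch.Tensor, torch.Tensor]:
    """Advantage-weighted policy loss plus a regression loss for the baseline."""
    # A(s_t, a_t) ≈ G_t - V(s_t); detached so the policy update treats it as a constant
    advantages = (returns - values).detach()

    policy_loss = -(log_probs * advantages).mean()
    value_loss = F.mse_loss(values, returns)      # trains the baseline toward the returns
    return policy_loss, value_loss
```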
The policy gradient theorem gives rise to optimization algorithms with specific properties. Understanding these properties helps us appreciate both the power and limitations of policy-based methods.
Property 1: Unbiased Gradient Estimation
```
# The Monte Carlo policy gradient estimator is unbiased:

E[∇̂_θ J(θ)] = E[(1/N) Σ_{i=1}^{N} (Σ_{t} ∇_θ log π_θ(a_t^i|s_t^i)) · R(τ^i)]
            = (1/N) Σ_{i=1}^{N} E[(Σ_{t} ∇_θ log π_θ(a_t|s_t)) · R(τ)]
            = E[(Σ_{t} ∇_θ log π_θ(a_t|s_t)) · R(τ)]
            = ∇_θ J(θ) ✓

# No bias, but variance can be very high!
```
Property 2: Guaranteed Policy Improvement (with small steps)
Unlike Q-learning which can oscillate or diverge, policy gradient methods with small enough step sizes guarantee improvement:
```
# With sufficiently small learning rate α:
J(θ + α∇_θ J(θ)) ≥ J(θ)

# This follows from the definition of the gradient:
# In a local neighborhood, the gradient points uphill

# However, this guarantee only holds:
# 1. With exact gradients (not estimates)
# 2. With small enough step sizes
# 3. Locally (not globally optimal)

# In practice with noisy estimates, we hope for improvement on average
```
The high variance of policy gradients is the central challenge. A single trajectory provides a noisy signal—good actions might accidentally precede bad rewards due to environment stochasticity. We need many samples or sophisticated variance reduction to learn reliably. The next few pages address this critical issue.
Translating the policy gradient theorem into working code requires careful attention to numerical stability, gradient computation, and policy representation. Let's examine the practical implementation details.
Computing Log-Probabilities for Common Policies:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical, Normal


class DiscretePolicy(nn.Module):
    """Policy for discrete action spaces using softmax."""

    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim)
        )

    def forward(self, state: torch.Tensor) -> Categorical:
        """Returns a categorical distribution over actions."""
        logits = self.network(state)
        return Categorical(logits=logits)

    def get_action_and_log_prob(self, state: torch.Tensor):
        """Sample action and compute log probability."""
        dist = self.forward(state)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        return action, log_prob


class ContinuousPolicy(nn.Module):
    """Policy for continuous action spaces using Gaussian."""

    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU()
        )
        self.mean_head = nn.Linear(hidden_dim, action_dim)
        # Log std as learnable parameter for stability
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state: torch.Tensor) -> Normal:
        """Returns a Gaussian distribution over actions."""
        features = self.shared(state)
        mean = self.mean_head(features)
        std = torch.exp(self.log_std)  # Ensure positive std
        return Normal(mean, std)

    def get_action_and_log_prob(self, state: torch.Tensor):
        """Sample action and compute log probability."""
        dist = self.forward(state)
        action = dist.rsample()  # Reparameterized sample for gradients
        # Sum log probs for independent dimensions
        log_prob = dist.log_prob(action).sum(dim=-1)
        return action, log_prob
```
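For reference, a quick usage sketch of the two classes defined above on dummy inputs (the batch size and dimensions are arbitrary, chosen only to show the expected tensor shapes):

```python
# Hypothetical dimensions, for illustration only
states = torch.randn(8, 4)                             # batch of 8 states with 4 features

discrete = DiscretePolicy(state_dim=4, action_dim=2)
actions, log_probs = discrete.get_action_and_log_prob(states)
print(actions.shape, log_probs.shape)                  # torch.Size([8]) torch.Size([8])

continuous = ContinuousPolicy(state_dim=4, action_dim=3)
actions, log_probs = continuous.get_action_and_log_prob(states)
print(actions.shape, log_probs.shape)                  # torch.Size([8, 3]) torch.Size([8])
```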
Computing the Policy Gradient:
With policies defined, we can implement the basic policy gradient computation:
```python
def compute_policy_gradient_loss(
    log_probs: torch.Tensor,   # Log probs of actions taken
    returns: torch.Tensor      # Returns (or advantages) for weighting
) -> torch.Tensor:
    """
    Compute the policy gradient loss for a batch of transitions.

    The loss is negative because we use gradient DESCENT on the loss,
    which is equivalent to gradient ASCENT on the objective J(θ).

    Loss = -E[log π_θ(a|s) · G]
    ∇Loss = -∇_θ J(θ)

    Minimizing this loss maximizes expected return.
    """
    # Element-wise product: each log prob weighted by its return
    weighted_log_probs = log_probs * returns

    # Negative because we do gradient descent
    # Mean over batch for stable gradients
    loss = -weighted_log_probs.mean()

    return loss


def compute_returns(rewards: list, gamma: float = 0.99) -> torch.Tensor:
    """
    Compute discounted returns (reward-to-go) for each timestep.

    G_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + ...

    Computed efficiently in reverse order:
    G_T = r_T
    G_t = r_t + γ·G_{t+1}
    """
    returns = []
    G = 0
    # Iterate backwards through rewards
    for reward in reversed(rewards):
        G = reward + gamma * G
        returns.insert(0, G)

    returns = torch.tensor(returns, dtype=torch.float32)

    # Optional: normalize returns for training stability
    # returns = (returns - returns.mean()) / (returns.std() + 1e-8)

    return returns
```
Several numerical issues can arise: (1) Log probabilities can be very negative for unlikely actions—use log-sum-exp tricks. (2) Returns can have widely varying magnitudes—normalization helps. (3) Gradients can explode with long episodes—gradient clipping is essential.
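A minimal training-step sketch that applies two of these mitigations, return normalization and gradient clipping, on top of `compute_policy_gradient_loss` from above (the optimizer and the `max_grad_norm` value are illustrative assumptions, not prescribed settings):

```python
def policy_gradient_step(policy, optimizer, log_probs, returns, max_grad_norm: float = 0.5):
    """One parameter update with return normalization and gradient clipping."""
    # Normalize returns to zero mean / unit variance so gradient magnitudes stay reasonable
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)

    loss = compute_policy_gradient_loss(log_probs, returns)

    optimizer.zero_grad()
    loss.backward()
    # Clip the global gradient norm to guard against exploding gradients in long episodes
    torch.nn.utils.clip_grad_norm_(policy.parameters(), max_grad_norm)
    optimizer.step()
    return loss.item()
```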
Policy gradient methods represent a fundamentally different approach to reinforcement learning than value-based methods like Q-learning. Understanding when to use each is essential for effective RL practice.
The Two Paradigms:
Value-based methods learn a value function such as Q(s, a) and derive a policy from it (for example, by acting greedily); policy-based methods parameterize the policy π_θ(a|s) directly and follow the gradient of J(θ). The table below summarizes when each tends to be the better fit.
| Scenario | Recommended Approach | Reasoning |
|---|---|---|
| Discrete actions, need sample efficiency | Value-based (DQN) | Can use experience replay for data reuse |
| Continuous action space | Policy gradient | No need for maximization over actions |
| Stochastic policy required | Policy gradient | Naturally represents action distributions |
| High-dimensional action space | Policy gradient | Avoids curse of dimensionality in max |
| Partial observability (POMDP) | Policy gradient | Stochastic policies can handle uncertainty |
| Competitive/game settings | Policy gradient | Mixed strategies require stochasticity |
Actor-Critic methods (covered later in this module) combine policy gradients with value function learning. The 'critic' provides lower-variance gradient estimates while the 'actor' learns the policy. This hybrid approach powers most modern deep RL successes.
We've established the theoretical foundation for policy gradient methods. Let's consolidate the key insights:
- The policy gradient theorem expresses ∇_θ J(θ) as an expectation of score functions ∇_θ log π_θ(a_t|s_t) weighted by returns, which we can estimate from sampled trajectories.
- The environment dynamics drop out of the gradient, so no model of the environment is required.
- The Monte Carlo estimator is unbiased, but its variance is high; reward-to-go and advantage weighting reduce variance without introducing bias.
- Equivalent formulations (trajectory-based, reward-to-go, Q/advantage-weighted) trade implementation simplicity against variance and pave the way to actor-critic methods.
What's next:
The policy gradient theorem gives us a theoretically sound approach, but the high variance makes naive implementation impractical. In the next page, we'll implement REINFORCE—the simplest policy gradient algorithm—and see firsthand both its elegance and its challenges.
You now understand the policy gradient theorem—the mathematical foundation of policy-based reinforcement learning. This theorem enables a fundamentally different approach to RL: directly optimizing behavior through gradient ascent on expected return. Next, we'll put theory into practice with the REINFORCE algorithm.