In value-based reinforcement learning, we learned to estimate the value of states and actions, then extract policies from these values. But what if we could directly optimize the policy itself—the mapping from observations to actions—without the intermediary of value functions?
This seemingly simple question leads to one of the most profound results in reinforcement learning: the Policy Gradient Theorem. This theorem provides the mathematical foundation for learning policies through gradient ascent, enabling neural networks to directly output action probabilities that maximize expected cumulative reward.
The policy gradient approach unlocks capabilities that value-based methods struggle with: continuous action spaces, stochastic policies, and better convergence guarantees in certain settings. Many modern deep RL breakthroughs—from AlphaGo to robotic manipulation to large language model fine-tuning via RLHF—build on the theoretical foundation we'll establish on this page.
By the end of this page, you will deeply understand the policy gradient theorem—its derivation, assumptions, mathematical structure, and implications. You'll see how we can compute gradients of expected return with respect to policy parameters, even when the dynamics of the environment are unknown. This understanding is essential for everything that follows in policy optimization.
Before deriving the policy gradient theorem, we must precisely define what we're optimizing. In policy-based methods, we parameterize the policy directly with parameters θ (typically neural network weights) and seek to maximize expected cumulative reward.
The Parameterized Policy:
A policy π_θ(a|s) defines the probability of taking action a in state s, given parameters θ. For discrete actions, this might be a softmax over action logits; for continuous actions, it might be a Gaussian distribution with learned mean and variance.
The Objective Function:
We want to find parameters θ* that maximize the expected return:
```
# The Policy Optimization Objective

J(θ) = E_τ~π_θ[R(τ)]

where:
- τ = (s₀, a₀, r₀, s₁, a₁, r₁, ...) is a trajectory
- R(τ) = Σ_{t=0}^{T} γ^t r_t is the discounted return
- The expectation is over trajectories induced by policy π_θ

# Goal: Find θ* = argmax_θ J(θ)

# Trajectory probability under π_θ:
P(τ|θ) = p(s₀) Π_{t=0}^{T} π_θ(a_t|s_t) · P(s_{t+1}|s_t, a_t)

# Expanding the objective:
J(θ) = ∫ P(τ|θ) R(τ) dτ    (continuous)
J(θ) = Σ_τ P(τ|θ) R(τ)     (discrete)
```
Understanding the Trajectory Distribution:
The probability of a trajectory τ under policy π_θ depends on three factors: the initial state distribution p(s₀), the policy's action probabilities π_θ(a_t|s_t) at each step, and the environment's transition probabilities P(s_{t+1}|s_t, a_t).
A crucial observation: the environment dynamics P(s_{t+1}|s_t, a_t) and initial distribution p(s₀) do not depend on θ. Only the policy terms contain our learnable parameters. This separation is what makes the policy gradient theorem possible.
The objective J(θ) involves an expectation over trajectories. We can't simply backpropagate through it because: (1) we don't know the environment dynamics to differentiate through transitions, and (2) even if we did, the expectation involves a sum/integral over exponentially many trajectories. The policy gradient theorem gives us a way to estimate gradients using only sampled trajectories.
Now we arrive at the heart of policy-based methods: deriving a tractable expression for ∇_θ J(θ). This derivation uses the famous log-derivative trick (also called the REINFORCE trick or score function estimator) and reveals a beautiful mathematical structure.
Step 1: Write out the gradient of the objective
```
# Starting point: gradient of expected return
∇_θ J(θ) = ∇_θ E_τ~π_θ[R(τ)]
         = ∇_θ Σ_τ P(τ|θ) R(τ)
         = Σ_τ ∇_θ P(τ|θ) R(τ)    # Leibniz integral rule

# Problem: ∇_θ P(τ|θ) requires differentiating through the trajectory distribution
# This seems to require knowledge of environment dynamics...
```
Step 2: Apply the log-derivative trick
The key insight is the identity: ∇_θ log f(θ) = ∇_θ f(θ) / f(θ), which rearranges to:
∇_θ f(θ) = f(θ) · ∇_θ log f(θ)
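As a quick sanity check, the identity can be verified numerically on a toy distribution. The sketch below is purely illustrative (the Bernoulli parameterization and the function f are arbitrary choices, not part of the derivation): it compares a score-function estimate of ∇_θ E_{x~π_θ}[f(x)] against the analytic gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy check of the log-derivative trick on a Bernoulli "policy" over one binary action.
# J(θ) = E_{x ~ Bernoulli(p)}[f(x)] with p = sigmoid(θ), so dJ/dθ = p(1-p)·(f(1) - f(0)).
theta = 0.3
f = lambda x: 3.0 * x - 1.0                      # arbitrary reward function
p = 1.0 / (1.0 + np.exp(-theta))
analytic_grad = p * (1.0 - p) * (f(1) - f(0))

# Score-function estimator: dJ/dθ = E[f(x) · ∇_θ log π_θ(x)], and for a Bernoulli
# parameterized by its logit, ∇_θ log π_θ(x) = x - p.
x = rng.binomial(1, p, size=200_000)
estimate = np.mean(f(x) * (x - p))

print(f"analytic: {analytic_grad:.4f}   score-function estimate: {estimate:.4f}")
```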
Applying this to the trajectory probability:
```
# The log-derivative trick:
∇_θ P(τ|θ) = P(τ|θ) · ∇_θ log P(τ|θ)

# Substituting back:
∇_θ J(θ) = Σ_τ P(τ|θ) · ∇_θ log P(τ|θ) · R(τ)
         = E_τ~π_θ[∇_θ log P(τ|θ) · R(τ)]

# This is now an expectation we can estimate via sampling!
```
Step 3: Simplify the log-probability of trajectories
Now we examine what ∇_θ log P(τ|θ) actually looks like:
```
# Trajectory probability:
P(τ|θ) = p(s₀) Π_{t=0}^{T} π_θ(a_t|s_t) · P(s_{t+1}|s_t, a_t)

# Taking the log:
log P(τ|θ) = log p(s₀) + Σ_{t=0}^{T} log π_θ(a_t|s_t) + Σ_{t=0}^{T} log P(s_{t+1}|s_t, a_t)

# Taking the gradient w.r.t. θ:
∇_θ log P(τ|θ) = 0 + Σ_{t=0}^{T} ∇_θ log π_θ(a_t|s_t) + 0
               = Σ_{t=0}^{T} ∇_θ log π_θ(a_t|s_t)

# The environment dynamics terms vanish because they don't depend on θ!
```

∇_θ J(θ) = E_τ~π_θ[ Σ_{t=0}^{T} ∇_θ log π_θ(a_t|s_t) · R(τ) ]
This is the fundamental result. The gradient of expected return equals the expected value of the gradient of log-policy probabilities, weighted by trajectory returns. We can estimate this using only sampled trajectories, without knowing environment dynamics!
Why this works: Intuitive understanding
The policy gradient has a beautiful interpretation. The score ∇_θ log π_θ(a_t|s_t) points in the direction in parameter space that makes action a_t more likely in state s_t. Weighting each score by the return R(τ) means that actions from high-return trajectories receive a strong push toward higher probability, while actions from low-return (or negative-return) trajectories receive a weak push or are actively suppressed.
This is trial and error learning at its mathematical finest—we reinforce behaviors that lead to good outcomes.
The term ∇_θ log π_θ(a|s) has a special name in statistics: the score function. Understanding the score function perspective provides deeper insight into policy gradients and connects them to broader statistical theory.
Properties of the Score Function:
The key property is that the expected score under the policy is zero; this is exactly what later lets us subtract a baseline without biasing the gradient:
```
# Proof that the expected score is zero:

E_{a~π_θ}[∇_θ log π_θ(a|s)] = Σ_a π_θ(a|s) · ∇_θ log π_θ(a|s)
                            = Σ_a π_θ(a|s) · ∇_θ π_θ(a|s) / π_θ(a|s)
                            = Σ_a ∇_θ π_θ(a|s)
                            = ∇_θ Σ_a π_θ(a|s)
                            = ∇_θ 1
                            = 0 ✓

# This property allows us to add baselines without introducing bias!
```
The Score Function in Practice:
For common policy parameterizations, the score function has closed-form expressions:
| Policy Type | Parameters | Score Function ∇_θ log π_θ(a|s) |
|---|---|---|
| Softmax (Discrete) | θ = logits for each action | e_a - softmax(θ) where e_a is one-hot |
| Gaussian (Continuous) | θ = (μ, σ) mean and std | ∂/∂μ: (a - μ)/σ², ∂/∂σ: ((a-μ)²/σ² - 1)/σ |
| Categorical | θ = probability vector | e_a/π_θ(a|s) (simplex constraint) |
| Beta (Bounded) | θ = (α, β) shape parameters | ∂ log Beta(a;α,β) / ∂(α,β) |
In deep RL, θ represents neural network weights. The score ∇_θ log π_θ(a|s) is computed via backpropagation through the network. For a softmax output layer with discrete actions, this is simply the negative of the gradient of the cross-entropy loss between the one-hot action and the network's output distribution.
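To make that concrete, here is a small self-contained check (a toy softmax over raw logits, not one of the networks discussed later) that PyTorch autograd reproduces the closed-form score e_a − softmax(θ) from the table:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy softmax policy parameterized directly by its logits θ (4 discrete actions).
logits = torch.randn(4, requires_grad=True)
action = torch.tensor(2)                        # arbitrary sampled action

# Score via autograd: backpropagate through log π_θ(a)
log_prob = F.log_softmax(logits, dim=-1)[action]
log_prob.backward()

# Closed form from the table: ∇_θ log π_θ(a) = e_a - softmax(θ)
closed_form = F.one_hot(action, num_classes=4).float() - F.softmax(logits.detach(), dim=-1)

print(torch.allclose(logits.grad, closed_form, atol=1e-6))   # True
```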
The basic policy gradient theorem sums over all timesteps and weights by total trajectory return. But there's finer temporal structure we can exploit. Actions at time t cannot influence rewards that have already occurred at times t' < t.
Causality in Policy Gradients:
Consider the contribution from a single timestep t in our gradient:
```
# Original form (all rewards):
∇_θ J(θ) = E_τ[ Σ_{t=0}^{T} ∇_θ log π_θ(a_t|s_t) · R(τ) ]

  where R(τ) = Σ_{t'=0}^{T} γ^{t'} r_{t'}

# Exploiting causality:
Action a_t can only affect rewards r_t, r_{t+1}, r_{t+2}, ...
It cannot affect r_0, r_1, ..., r_{t-1}

# Therefore we can rewrite:
∇_θ J(θ) = E_τ[ Σ_{t=0}^{T} ∇_θ log π_θ(a_t|s_t) · G_t ]

  where G_t = Σ_{t'=t}^{T} γ^{t'-t} r_{t'}   (reward-to-go from time t)

# This is a more efficient and lower-variance estimator!
```
Why Reward-to-Go Reduces Variance:
Using the reward-to-go G_t instead of the total return R(τ) doesn't change the expected gradient (the estimator remains unbiased), but it dramatically reduces variance: each log-probability term is now weighted only by the rewards it could actually have influenced, rather than by noise from rewards that occurred before the action was taken.
While mathematically equivalent in expectation, reward-to-go can reduce variance by orders of magnitude. In a 1000-step episode, using full return means each gradient is weighted by ~1000 reward terms. Using reward-to-go, the last action is weighted by only 1 reward term. This makes learning much more efficient.
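The following toy simulation illustrates this claim under assumptions chosen only for clarity (a 100-step episode, a single-logit Bernoulli policy that earns reward 1 whenever it picks action 1, and γ = 1): both estimators have the same mean, but the reward-to-go version has a much smaller standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_episodes = 100, 5000
theta = 0.0
p = 1.0 / (1.0 + np.exp(-theta))                 # π_θ(a=1)

full_return_grads, reward_to_go_grads = [], []
for _ in range(n_episodes):
    actions = rng.binomial(1, p, size=T)
    rewards = actions.astype(float)              # reward 1 whenever action 1 is taken
    scores = actions - p                         # ∇_θ log π_θ(a_t) for a Bernoulli logit

    R = rewards.sum()                            # total return R(τ) with γ = 1
    G = np.cumsum(rewards[::-1])[::-1]           # reward-to-go G_t

    full_return_grads.append(np.sum(scores * R))
    reward_to_go_grads.append(np.sum(scores * G))

# Both estimators target the same true gradient, dJ/dθ = T·p·(1-p) = 25 here.
print("full return:  mean %.1f  std %.1f" % (np.mean(full_return_grads), np.std(full_return_grads)))
print("reward-to-go: mean %.1f  std %.1f" % (np.mean(reward_to_go_grads), np.std(reward_to_go_grads)))
```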
The State-Action Value Connection:
We can push this further by recognizing a connection to value functions. The expected reward-to-go starting from state s, taking action a, and following policy π thereafter is exactly the Q-function:
```
# Definition of Q-function:
Q^π(s, a) = E_π[G_t | s_t = s, a_t = a]
          = E_π[Σ_{k=0}^{∞} γ^k r_{t+k} | s_t = s, a_t = a]

# Alternative form of policy gradient:
∇_θ J(θ) = E_{s~ρ^π, a~π_θ}[∇_θ log π_θ(a|s) · Q^{π_θ}(s, a)]

  where ρ^π(s) = Σ_{t=0}^{∞} γ^t P(s_t = s | π)
  is the discounted state visitation distribution

# This connects policy gradients to value-based concepts!
```
The policy gradient theorem admits several equivalent formulations, each offering different insights and computational properties. Understanding these alternatives is essential for implementing and improving policy gradient methods.
Formulation 1: Trajectory-Based
```
# Trajectory-based policy gradient:
∇_θ J(θ) = E_τ~π_θ[(Σ_{t=0}^{T} ∇_θ log π_θ(a_t|s_t)) · R(τ)]

# Monte Carlo estimate (sample N trajectories):
∇̂_θ J(θ) ≈ (1/N) Σ_{i=1}^{N} (Σ_{t=0}^{T_i} ∇_θ log π_θ(a_t^i|s_t^i)) · R(τ^i)

# Advantages:
# - Simple to implement
# - Works with any trajectory length
# Disadvantages:
# - High variance
# - Inefficient use of temporal structure
```
Formulation 2: Per-Step with Reward-to-Go
```
# Per-step policy gradient with reward-to-go:
∇_θ J(θ) = E_τ~π_θ[Σ_{t=0}^{T} ∇_θ log π_θ(a_t|s_t) · G_t]

# Monte Carlo estimate:
∇̂_θ J(θ) ≈ (1/N) Σ_{i=1}^{N} Σ_{t=0}^{T_i} ∇_θ log π_θ(a_t^i|s_t^i) · G_t^i

  where G_t^i = Σ_{k=t}^{T_i} γ^{k-t} r_k^i

# Advantages:
# - Lower variance than full trajectory
# - Respects causal structure
# Disadvantages:
# - Still high variance in long episodes
```
Formulation 3: State-Value Form
```
# Using Q-function:
∇_θ J(θ) = E_{s~ρ^π, a~π_θ}[∇_θ log π_θ(a|s) · Q^π(s, a)]

# Using Advantage function:
∇_θ J(θ) = E_{s~ρ^π, a~π_θ}[∇_θ log π_θ(a|s) · A^π(s, a)]

  where A^π(s, a) = Q^π(s, a) - V^π(s)

# Advantages:
# - Advantage form has lower variance
# - Enables actor-critic methods
# Disadvantages:
# - Requires learning value functions
```
In practice, the advantage formulation is most commonly used in modern algorithms. It combines the unbiased nature of policy gradients with variance reduction from learned value functions. We'll explore this extensively in the Actor-Critic section.
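As a preview of how the advantage form is typically wired up in code (a minimal sketch, assuming the log-probabilities, empirical returns, and a learned baseline's value predictions are already available; the full actor-critic machinery comes later), the policy loss simply swaps returns for advantages, with the baseline detached so the policy gradient does not flow into the value network:

```python
import torch
import torch.nn.functional as F

def advantage_policy_loss(
    log_probs: torch.Tensor,   # log π_θ(a_t|s_t) for the actions actually taken
    returns: torch.Tensor,     # empirical returns G_t
    values: torch.Tensor       # V(s_t) predicted by a learned baseline (critic)
) -> tuple[torch.Tensor, torch.Tensor]:
    """Advantage-weighted policy loss plus a regression loss for the baseline."""
    # A(s_t, a_t) ≈ G_t - V(s_t); detached so the policy update treats it as a constant
    advantages = (returns - values).detach()

    policy_loss = -(log_probs * advantages).mean()
    value_loss = F.mse_loss(values, returns)      # trains the baseline toward the returns
    return policy_loss, value_loss
```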
The policy gradient theorem gives rise to optimization algorithms with specific properties. Understanding these properties helps us appreciate both the power and limitations of policy-based methods.
Property 1: Unbiased Gradient Estimation
```
# The Monte Carlo policy gradient estimator is unbiased:

E[∇̂_θ J(θ)] = E[(1/N) Σ_{i=1}^{N} (Σ_{t} ∇_θ log π_θ(a_t^i|s_t^i)) · R(τ^i)]
            = (1/N) Σ_{i=1}^{N} E[(Σ_{t} ∇_θ log π_θ(a_t|s_t)) · R(τ)]
            = E[(Σ_{t} ∇_θ log π_θ(a_t|s_t)) · R(τ)]
            = ∇_θ J(θ) ✓

# No bias, but variance can be very high!
```
Property 2: Guaranteed Policy Improvement (with small steps)
Unlike Q-learning which can oscillate or diverge, policy gradient methods with small enough step sizes guarantee improvement:
```
# With sufficiently small learning rate α:
J(θ + α∇_θ J(θ)) ≥ J(θ)

# This follows from the definition of the gradient:
# In a local neighborhood, the gradient points uphill

# However, this guarantee only holds:
# 1. With exact gradients (not estimates)
# 2. With small enough step sizes
# 3. Locally (not globally optimal)

# In practice with noisy estimates, we hope for improvement on average
```
The high variance of policy gradients is the central challenge. A single trajectory provides a noisy signal—good actions might accidentally precede bad rewards due to environment stochasticity. We need many samples or sophisticated variance reduction to learn reliably. The next few pages address this critical issue.
Translating the policy gradient theorem into working code requires careful attention to numerical stability, gradient computation, and policy representation. Let's examine the practical implementation details.
Computing Log-Probabilities for Common Policies:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical, Normal


class DiscretePolicy(nn.Module):
    """Policy for discrete action spaces using softmax."""

    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim)
        )

    def forward(self, state: torch.Tensor) -> Categorical:
        """Returns a categorical distribution over actions."""
        logits = self.network(state)
        return Categorical(logits=logits)

    def get_action_and_log_prob(self, state: torch.Tensor):
        """Sample action and compute log probability."""
        dist = self.forward(state)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        return action, log_prob


class ContinuousPolicy(nn.Module):
    """Policy for continuous action spaces using Gaussian."""

    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU()
        )
        self.mean_head = nn.Linear(hidden_dim, action_dim)
        # Log std as learnable parameter for stability
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state: torch.Tensor) -> Normal:
        """Returns a Gaussian distribution over actions."""
        features = self.shared(state)
        mean = self.mean_head(features)
        std = torch.exp(self.log_std)  # Ensure positive std
        return Normal(mean, std)

    def get_action_and_log_prob(self, state: torch.Tensor):
        """Sample action and compute log probability."""
        dist = self.forward(state)
        action = dist.rsample()  # Reparameterized sample for gradients
        # Sum log probs for independent dimensions
        log_prob = dist.log_prob(action).sum(dim=-1)
        return action, log_prob
```
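For reference, a quick usage sketch of the two classes defined above on dummy inputs (the batch size and dimensions are arbitrary, chosen only to show the expected tensor shapes):

```python
# Hypothetical dimensions, for illustration only
states = torch.randn(8, 4)                             # batch of 8 states with 4 features

discrete = DiscretePolicy(state_dim=4, action_dim=2)
actions, log_probs = discrete.get_action_and_log_prob(states)
print(actions.shape, log_probs.shape)                  # torch.Size([8]) torch.Size([8])

continuous = ContinuousPolicy(state_dim=4, action_dim=3)
actions, log_probs = continuous.get_action_and_log_prob(states)
print(actions.shape, log_probs.shape)                  # torch.Size([8, 3]) torch.Size([8])
```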
Computing the Policy Gradient:
With policies defined, we can implement the basic policy gradient computation:
```python
def compute_policy_gradient_loss(
    log_probs: torch.Tensor,   # Log probs of actions taken
    returns: torch.Tensor      # Returns (or advantages) for weighting
) -> torch.Tensor:
    """
    Compute the policy gradient loss for a batch of transitions.

    The loss is negative because we use gradient DESCENT on the loss,
    which is equivalent to gradient ASCENT on the objective J(θ).

    Loss = -E[log π_θ(a|s) · G]
    ∇Loss = -∇_θ J(θ)

    Minimizing this loss maximizes expected return.
    """
    # Element-wise product: each log prob weighted by its return
    weighted_log_probs = log_probs * returns

    # Negative because we do gradient descent
    # Mean over batch for stable gradients
    loss = -weighted_log_probs.mean()

    return loss


def compute_returns(rewards: list, gamma: float = 0.99) -> torch.Tensor:
    """
    Compute discounted returns (reward-to-go) for each timestep.

    G_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + ...

    Computed efficiently in reverse order:
    G_T = r_T
    G_t = r_t + γ·G_{t+1}
    """
    returns = []
    G = 0
    # Iterate backwards through rewards
    for reward in reversed(rewards):
        G = reward + gamma * G
        returns.insert(0, G)

    returns = torch.tensor(returns, dtype=torch.float32)

    # Optional: normalize returns for training stability
    # returns = (returns - returns.mean()) / (returns.std() + 1e-8)

    return returns
```
Several numerical issues can arise: (1) Log probabilities can be very negative for unlikely actions—use log-sum-exp tricks. (2) Returns can have widely varying magnitudes—normalization helps. (3) Gradients can explode with long episodes—gradient clipping is essential.
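A minimal training-step sketch that applies two of these mitigations, return normalization and gradient clipping, on top of `compute_policy_gradient_loss` from above (the optimizer and the `max_grad_norm` value are illustrative assumptions, not prescribed settings):

```python
def policy_gradient_step(policy, optimizer, log_probs, returns, max_grad_norm: float = 0.5):
    """One parameter update with return normalization and gradient clipping."""
    # Normalize returns to zero mean / unit variance so gradient magnitudes stay reasonable
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)

    loss = compute_policy_gradient_loss(log_probs, returns)

    optimizer.zero_grad()
    loss.backward()
    # Clip the global gradient norm to guard against exploding gradients in long episodes
    torch.nn.utils.clip_grad_norm_(policy.parameters(), max_grad_norm)
    optimizer.step()
    return loss.item()
```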
Policy gradient methods represent a fundamentally different approach to reinforcement learning than value-based methods like Q-learning. Understanding when to use each is essential for effective RL practice.
The Two Paradigms:
Value-based methods learn a value function such as Q(s, a) and derive a policy from it (for example, by acting greedily); policy-based methods parameterize the policy π_θ(a|s) directly and follow the gradient of J(θ). The table below summarizes when each tends to be the better fit.
| Scenario | Recommended Approach | Reasoning |
|---|---|---|
| Discrete actions, need sample efficiency | Value-based (DQN) | Can use experience replay for data reuse |
| Continuous action space | Policy gradient | No need for maximization over actions |
| Stochastic policy required | Policy gradient | Naturally represents action distributions |
| High-dimensional action space | Policy gradient | Avoids curse of dimensionality in max |
| Partial observability (POMDP) | Policy gradient | Stochastic policies can handle uncertainty |
| Competitive/game settings | Policy gradient | Mixed strategies require stochasticity |
Actor-Critic methods (covered later in this module) combine policy gradients with value function learning. The 'critic' provides lower-variance gradient estimates while the 'actor' learns the policy. This hybrid approach powers most modern deep RL successes.
We've established the theoretical foundation for policy gradient methods. Let's consolidate the key insights:
- The policy gradient theorem expresses ∇_θ J(θ) as an expectation of score functions ∇_θ log π_θ(a_t|s_t) weighted by returns, which we can estimate from sampled trajectories.
- The environment dynamics drop out of the gradient, so no model of the environment is required.
- The Monte Carlo estimator is unbiased, but its variance is high; reward-to-go and advantage weighting reduce variance without introducing bias.
- Equivalent formulations (trajectory-based, reward-to-go, Q/advantage-weighted) trade implementation simplicity against variance and pave the way to actor-critic methods.
What's next:
The policy gradient theorem gives us a theoretically sound approach, but the high variance makes naive implementation impractical. In the next page, we'll implement REINFORCE—the simplest policy gradient algorithm—and see firsthand both its elegance and its challenges.
You now understand the policy gradient theorem—the mathematical foundation of policy-based reinforcement learning. This theorem enables a fundamentally different approach to RL: directly optimizing behavior through gradient ascent on expected return. Next, we'll put theory into practice with the REINFORCE algorithm.