In reinforcement learning, policy gradient methods represent a powerful family of algorithms that directly optimize the policy parameters to maximize expected cumulative rewards. Unlike value-based methods that learn action values and derive policies indirectly, policy gradient approaches adjust parameters in the direction that increases the probability of high-reward trajectories.
The REINFORCE algorithm (REward Increment = Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility) is the foundational Monte Carlo policy gradient method. It uses complete episode trajectories to estimate gradients, making it conceptually simple yet surprisingly effective for many problems.
The policy is parameterized by a 2D array θ (theta) of shape (num_states × num_actions). For each state s, the probability of selecting action a is computed using the softmax function over the corresponding row:
$$\pi_\theta(a|s) = \frac{e^{\theta_{s,a}}}{\sum_{a'} e^{\theta_{s,a'}}}$$
This ensures that action probabilities for each state sum to 1 and are always positive.
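As a quick illustration, the softmax over one row of θ can be computed in plain Python (the helper name `action_probs` is just for illustration; subtracting the row maximum is an optional numerical-stability trick):

```python
import math

def action_probs(theta, s):
    # Softmax over row s of theta: pi(a|s) = exp(theta[s][a]) / sum_a' exp(theta[s][a']).
    # Subtracting the row max before exponentiating avoids overflow for large entries.
    row = theta[s]
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

# With a zero row, the probabilities are uniform: [0.5, 0.5]
probs = action_probs([[0.0, 0.0], [0.5, -0.5]], 0)
```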
The REINFORCE algorithm leverages the policy gradient theorem, which states that the gradient of the expected return with respect to policy parameters can be written as:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi} \left[ \sum_{t=0}^{T-1} G_t \cdot \nabla_\theta \log \pi_\theta(a_t|s_t) \right]$$
Where:
- $G_t = \sum_{k=t}^{T-1} r_k$ is the return: the sum of rewards from step t to the end of the episode
- $\pi_\theta(a_t|s_t)$ is the probability the policy assigns to the action taken at step t
- $T$ is the length of the episode
For a softmax policy, the gradient of the log-probability with respect to θ has a particularly elegant form. For the action a taken in state s:
$$\nabla_{\theta_{s,a'}} \log \pi_\theta(a|s) = \mathbb{1}[a' = a] - \pi_\theta(a'|s)$$
This means:
- For the action actually taken (a' = a), the gradient is 1 - π_θ(a|s)
- For every other action a' ≠ a, the gradient is -π_θ(a'|s)
The gradient is localized to state s; all other rows of θ have zero gradient for this transition.
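Putting the two cases together, the score function gradient for a single transition can be sketched as follows (the helper name `score_gradient` is illustrative; it returns only the nonzero row, for state s):

```python
import math

def score_gradient(theta, s, a):
    # Gradient of log pi(a|s) with respect to row s of theta:
    #   d/d theta[s][a'] = 1[a' == a] - pi(a'|s)
    # All other rows of theta have zero gradient for this transition.
    row = theta[s]
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    probs = [e / total for e in exps]
    return [(1.0 if ap == a else 0.0) - p for ap, p in enumerate(probs)]

# With uniform probabilities (zero theta), taking action 1 in state 0
# gives the row gradient [-0.5, 0.5].
g = score_gradient([[0.0, 0.0], [0.0, 0.0]], 0, 1)
```

Note that the entries of the returned row always sum to zero, since the probabilities sum to 1.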
Implement the REINFORCE policy gradient estimator. Given the policy parameters θ and a list of episodes (each containing state-action-reward transitions), compute the average gradient across all episodes. Each transition contributes its return (sum of rewards from that step to the end of the episode) multiplied by the score function gradient.
Algorithm Steps:
1. Initialize a gradient accumulator of the same shape as θ with zeros.
2. For each episode, compute the return G_t for every step t (the sum of rewards from step t to the end of the episode).
3. For each transition (s_t, a_t), compute the softmax probabilities for state s_t and add G_t · (𝟙[a' = a_t] - π_θ(a'|s_t)) to row s_t of the accumulator, for every action a'.
4. Divide the accumulated gradient by the number of episodes.
5. Round each entry to 2 decimal places.
Return Format: A 2D list of the same shape as θ, with values rounded to 2 decimal places.
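The steps above can be sketched in plain Python; this is one possible implementation under the stated input format, and the function name `reinforce_gradient` is illustrative, not prescribed:

```python
import math

def reinforce_gradient(theta, episodes):
    # Average REINFORCE gradient over a list of episodes.
    # Each episode is a list of [state, action, reward] transitions.
    num_states, num_actions = len(theta), len(theta[0])
    grad = [[0.0] * num_actions for _ in range(num_states)]

    for episode in episodes:
        # Returns: G_t = sum of rewards from step t to the end of the episode,
        # computed by a backward pass.
        g, returns = 0.0, []
        for _, _, r in reversed(episode):
            g += r
            returns.append(g)
        returns.reverse()

        for (s, a, _), G in zip(episode, returns):
            # Softmax probabilities for row s (max-shifted for stability).
            row = theta[s]
            m = max(row)
            exps = [math.exp(x - m) for x in row]
            total = sum(exps)
            for ap in range(num_actions):
                p = exps[ap] / total
                score = (1.0 if ap == a else 0.0) - p
                grad[s][ap] += G * score

    n = len(episodes)
    return [[round(v / n, 2) for v in row] for row in grad]
```

This runs in O(total transitions × num_actions) time, and the backward pass computes all returns of an episode in a single sweep.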
theta = [[0.0, 0.0], [0.0, 0.0]]
episodes = [[[0, 1, 0.0], [1, 0, 1.0]], [[0, 0, 0.0]]]

Expected output: [[-0.25, 0.25], [0.25, -0.25]]

We have 2 states and 2 actions. With theta initialized to zeros, the softmax gives uniform probabilities π(a|s) = 0.5 for all state-action pairs.
Episode 1: Transitions [(s=0, a=1, r=0), (s=1, a=0, r=1)]
- Returns: G_0 = 0 + 1 = 1, G_1 = 1
- (s=0, a=1, G=1): row 0 gradient = 1 · [0 - 0.5, 1 - 0.5] = [-0.5, 0.5]
- (s=1, a=0, G=1): row 1 gradient = 1 · [1 - 0.5, 0 - 0.5] = [0.5, -0.5]

Episode 2: Transitions [(s=0, a=0, r=0)]
- Return: G_0 = 0, so this episode contributes zero gradient.

Average across 2 episodes:
- Row 0: [-0.5, 0.5] / 2 = [-0.25, 0.25]
- Row 1: [0.5, -0.5] / 2 = [0.25, -0.25]
theta = [[0.0, 0.0], [0.0, 0.0]]
episodes = [[[0, 0, 1.0], [1, 1, 2.0]]]

Expected output: [[1.5, -1.5], [-1.0, 1.0]]

Single episode with transitions [(s=0, a=0, r=1), (s=1, a=1, r=2)].
Returns:
- G_0 = 1 + 2 = 3, G_1 = 2

Gradients (with uniform π = 0.5):
- (s=0, a=0, G=3): row 0 gradient = 3 · [1 - 0.5, 0 - 0.5] = [1.5, -1.5]
- (s=1, a=1, G=2): row 1 gradient = 2 · [0 - 0.5, 1 - 0.5] = [-1.0, 1.0]
Since there's only 1 episode, the average equals the sum.
theta = [[0.5, -0.5], [-0.5, 0.5]]
episodes = [[[0, 1, 1.0], [1, 0, 1.0]], [[1, 1, 2.0]]]

Expected output: [[-0.73, 0.73], [0.1, -0.1]]

Non-uniform theta leads to non-uniform softmax probabilities.
Computing softmax probabilities:
- Row 0 (θ = [0.5, -0.5]): π(0|0) = e^0.5 / (e^0.5 + e^-0.5) ≈ 0.731, π(1|0) ≈ 0.269
- Row 1 (θ = [-0.5, 0.5]): π(0|1) ≈ 0.269, π(1|1) ≈ 0.731

Episode 1: [(s=0, a=1, r=1), (s=1, a=0, r=1)]
- Returns: G_0 = 1 + 1 = 2, G_1 = 1
- (s=0, a=1, G=2): row 0 gradient = 2 · [0 - 0.731, 1 - 0.269] ≈ [-1.462, 1.462]
- (s=1, a=0, G=1): row 1 gradient = 1 · [1 - 0.269, 0 - 0.731] ≈ [0.731, -0.731]

Episode 2: [(s=1, a=1, r=2)]
- Return: G_0 = 2
- (s=1, a=1, G=2): row 1 gradient = 2 · [0 - 0.269, 1 - 0.731] ≈ [-0.538, 0.538]

Averaging across 2 episodes and rounding:
- Row 0: [-1.462, 1.462] / 2 ≈ [-0.73, 0.73]
- Row 1: ([0.731, -0.731] + [-0.538, 0.538]) / 2 ≈ [0.1, -0.1]
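The arithmetic for this example can be reproduced with a few lines of plain Python, as a sanity check rather than part of the required implementation:

```python
import math

# Softmax of [0.5, -0.5] (and of [-0.5, 0.5], reversed) reduces to a sigmoid:
e = math.exp
p_hi = e(0.5) / (e(0.5) + e(-0.5))   # ~0.731: pi(0|s=0) and pi(1|s=1)
p_lo = 1.0 - p_hi                    # ~0.269

# Episode 1 contributions: (s=0, a=1, G=2) and (s=1, a=0, G=1)
ep1 = [[2 * (0 - p_hi), 2 * (1 - p_lo)],
       [1 * (1 - p_lo), 1 * (0 - p_hi)]]
# Episode 2 contribution: (s=1, a=1, G=2); state 0 is never visited
ep2 = [[0.0, 0.0],
       [2 * (0 - p_lo), 2 * (1 - p_hi)]]

# Average the two per-episode gradients and round to 2 decimal places:
avg = [[round((a + b) / 2, 2) for a, b in zip(r1, r2)]
       for r1, r2 in zip(ep1, ep2)]
# avg == [[-0.73, 0.73], [0.1, -0.1]]
```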
Constraints