In reinforcement learning, policy gradient methods represent a powerful family of algorithms that directly optimize the policy parameters to maximize expected cumulative rewards. Unlike value-based methods that learn action values and derive policies indirectly, policy gradient approaches adjust parameters in the direction that increases the probability of high-reward trajectories.
The REINFORCE algorithm (REward Increment = Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility) is the foundational Monte Carlo policy gradient method. It uses complete episode trajectories to estimate gradients, making it conceptually simple yet surprisingly effective for many problems.
The policy is parameterized by a 2D array θ (theta) of shape (num_states × num_actions). For each state s, the probability of selecting action a is computed using the softmax function over the corresponding row:
$$\pi_\theta(a|s) = \frac{e^{\theta_{s,a}}}{\sum_{a'} e^{\theta_{s,a'}}}$$
This ensures that action probabilities for each state sum to 1 and are always positive.
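As a quick illustration, the softmax over one row of θ can be computed in plain Python (the helper name `action_probs` is just for illustration; subtracting the row maximum is an optional numerical-stability trick):

```python
import math

def action_probs(theta, s):
    # Softmax over row s of theta: pi(a|s) = exp(theta[s][a]) / sum_a' exp(theta[s][a']).
    # Subtracting the row max before exponentiating avoids overflow for large entries.
    row = theta[s]
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

# With a zero row, the probabilities are uniform: [0.5, 0.5]
probs = action_probs([[0.0, 0.0], [0.5, -0.5]], 0)
```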
The REINFORCE algorithm leverages the policy gradient theorem, which states that the gradient of the expected return with respect to policy parameters can be written as:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi} \left[ \sum_{t=0}^{T-1} G_t \cdot \nabla_\theta \log \pi_\theta(a_t|s_t) \right]$$
Where:
- $G_t = \sum_{k=t}^{T-1} r_k$ is the return: the sum of rewards from step t to the end of the episode
- $\pi_\theta(a_t|s_t)$ is the probability the policy assigns to the action taken at step t
- $T$ is the length of the episode
For a softmax policy, the gradient of the log-probability with respect to θ has a particularly elegant form. For the action a taken in state s:
$$\nabla_{\theta_{s,a'}} \log \pi_\theta(a|s) = \mathbb{1}[a' = a] - \pi_\theta(a'|s)$$
This means:
- For the action actually taken (a' = a), the gradient is 1 - π_θ(a|s)
- For every other action a' ≠ a, the gradient is -π_θ(a'|s)
The gradient is localized to state s; all other rows of θ have zero gradient for this transition.
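Putting the two cases together, the score function gradient for a single transition can be sketched as follows (the helper name `score_gradient` is illustrative; it returns only the nonzero row, for state s):

```python
import math

def score_gradient(theta, s, a):
    # Gradient of log pi(a|s) with respect to row s of theta:
    #   d/d theta[s][a'] = 1[a' == a] - pi(a'|s)
    # All other rows of theta have zero gradient for this transition.
    row = theta[s]
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    probs = [e / total for e in exps]
    return [(1.0 if ap == a else 0.0) - p for ap, p in enumerate(probs)]

# With uniform probabilities (zero theta), taking action 1 in state 0
# gives the row gradient [-0.5, 0.5].
g = score_gradient([[0.0, 0.0], [0.0, 0.0]], 0, 1)
```

Note that the entries of the returned row always sum to zero, since the probabilities sum to 1.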
Implement the REINFORCE policy gradient estimator. Given the policy parameters θ and a list of episodes (each containing state-action-reward transitions), compute the average gradient across all episodes. Each transition contributes its return (sum of rewards from that step to the end of the episode) multiplied by the score function gradient.
Algorithm Steps:
1. Initialize a gradient accumulator of the same shape as θ with zeros.
2. For each episode, compute the return G_t for every step t (the sum of rewards from step t to the end of the episode).
3. For each transition (s_t, a_t), compute the softmax probabilities for state s_t and add G_t · (𝟙[a' = a_t] - π_θ(a'|s_t)) to row s_t of the accumulator, for every action a'.
4. Divide the accumulated gradient by the number of episodes.
5. Round each entry to 2 decimal places.
Return Format: A 2D list of the same shape as θ, with values rounded to 2 decimal places.
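The steps above can be sketched in plain Python; this is one possible implementation under the stated input format, and the function name `reinforce_gradient` is illustrative, not prescribed:

```python
import math

def reinforce_gradient(theta, episodes):
    # Average REINFORCE gradient over a list of episodes.
    # Each episode is a list of [state, action, reward] transitions.
    num_states, num_actions = len(theta), len(theta[0])
    grad = [[0.0] * num_actions for _ in range(num_states)]

    for episode in episodes:
        # Returns: G_t = sum of rewards from step t to the end of the episode,
        # computed by a backward pass.
        g, returns = 0.0, []
        for _, _, r in reversed(episode):
            g += r
            returns.append(g)
        returns.reverse()

        for (s, a, _), G in zip(episode, returns):
            # Softmax probabilities for row s (max-shifted for stability).
            row = theta[s]
            m = max(row)
            exps = [math.exp(x - m) for x in row]
            total = sum(exps)
            for ap in range(num_actions):
                p = exps[ap] / total
                score = (1.0 if ap == a else 0.0) - p
                grad[s][ap] += G * score

    n = len(episodes)
    return [[round(v / n, 2) for v in row] for row in grad]
```

This runs in O(total transitions × num_actions) time, and the backward pass computes all returns of an episode in a single sweep.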
theta = [[0.0, 0.0], [0.0, 0.0]]
episodes = [[[0, 1, 0.0], [1, 0, 1.0]], [[0, 0, 0.0]]]

Expected output: [[-0.25, 0.25], [0.25, -0.25]]

We have 2 states and 2 actions. With theta initialized to zeros, the softmax gives uniform probabilities π(a|s) = 0.5 for all state-action pairs.
Episode 1: Transitions [(s=0, a=1, r=0), (s=1, a=0, r=1)]
- Returns: G_0 = 0 + 1 = 1, G_1 = 1
- (s=0, a=1, G=1): row 0 gradient = 1 · [0 - 0.5, 1 - 0.5] = [-0.5, 0.5]
- (s=1, a=0, G=1): row 1 gradient = 1 · [1 - 0.5, 0 - 0.5] = [0.5, -0.5]

Episode 2: Transitions [(s=0, a=0, r=0)]
- Return: G_0 = 0, so this episode contributes zero gradient.

Average across 2 episodes:
- Row 0: [-0.5, 0.5] / 2 = [-0.25, 0.25]
- Row 1: [0.5, -0.5] / 2 = [0.25, -0.25]
theta = [[0.0, 0.0], [0.0, 0.0]]
episodes = [[[0, 0, 1.0], [1, 1, 2.0]]]

Expected output: [[1.5, -1.5], [-1.0, 1.0]]

Single episode with transitions [(s=0, a=0, r=1), (s=1, a=1, r=2)].
Returns:
- G_0 = 1 + 2 = 3, G_1 = 2

Gradients (with uniform π = 0.5):
- (s=0, a=0, G=3): row 0 gradient = 3 · [1 - 0.5, 0 - 0.5] = [1.5, -1.5]
- (s=1, a=1, G=2): row 1 gradient = 2 · [0 - 0.5, 1 - 0.5] = [-1.0, 1.0]
Since there's only 1 episode, the average equals the sum.
theta = [[0.5, -0.5], [-0.5, 0.5]]
episodes = [[[0, 1, 1.0], [1, 0, 1.0]], [[1, 1, 2.0]]]

Expected output: [[-0.73, 0.73], [0.1, -0.1]]

Non-uniform theta leads to non-uniform softmax probabilities.
Computing softmax probabilities:
- Row 0 (θ = [0.5, -0.5]): π(0|0) = e^0.5 / (e^0.5 + e^-0.5) ≈ 0.731, π(1|0) ≈ 0.269
- Row 1 (θ = [-0.5, 0.5]): π(0|1) ≈ 0.269, π(1|1) ≈ 0.731

Episode 1: [(s=0, a=1, r=1), (s=1, a=0, r=1)]
- Returns: G_0 = 1 + 1 = 2, G_1 = 1
- (s=0, a=1, G=2): row 0 gradient = 2 · [0 - 0.731, 1 - 0.269] ≈ [-1.462, 1.462]
- (s=1, a=0, G=1): row 1 gradient = 1 · [1 - 0.269, 0 - 0.731] ≈ [0.731, -0.731]

Episode 2: [(s=1, a=1, r=2)]
- Return: G_0 = 2
- (s=1, a=1, G=2): row 1 gradient = 2 · [0 - 0.269, 1 - 0.731] ≈ [-0.538, 0.538]

Averaging across 2 episodes and rounding:
- Row 0: [-1.462, 1.462] / 2 ≈ [-0.73, 0.73]
- Row 1: ([0.731, -0.731] + [-0.538, 0.538]) / 2 ≈ [0.1, -0.1]
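The arithmetic for this example can be reproduced with a few lines of plain Python, as a sanity check rather than part of the required implementation:

```python
import math

# Softmax of [0.5, -0.5] (and of [-0.5, 0.5], reversed) reduces to a sigmoid:
e = math.exp
p_hi = e(0.5) / (e(0.5) + e(-0.5))   # ~0.731: pi(0|s=0) and pi(1|s=1)
p_lo = 1.0 - p_hi                    # ~0.269

# Episode 1 contributions: (s=0, a=1, G=2) and (s=1, a=0, G=1)
ep1 = [[2 * (0 - p_hi), 2 * (1 - p_lo)],
       [1 * (1 - p_lo), 1 * (0 - p_hi)]]
# Episode 2 contribution: (s=1, a=1, G=2); state 0 is never visited
ep2 = [[0.0, 0.0],
       [2 * (0 - p_lo), 2 * (1 - p_hi)]]

# Average the two per-episode gradients and round to 2 decimal places:
avg = [[round((a + b) / 2, 2) for a, b in zip(r1, r2)]
       for r1, r2 in zip(ep1, ep2)]
# avg == [[-0.73, 0.73], [0.1, -0.1]]
```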
Constraints