In reinforcement learning, accurately estimating the value of states is crucial for learning optimal policies. Multi-step value estimation represents a powerful approach that interpolates between two fundamental methods: single-step bootstrapping (which updates after observing just one transition) and Monte Carlo methods (which wait until the episode terminates to update).
The core insight is that instead of immediately bootstrapping from the next state's value estimate, we can look ahead n steps into the future, accumulating actual rewards before bootstrapping from a more distant state. This n-step lookahead often accelerates learning by incorporating more observed trajectory information while still avoiding the high variance of waiting for complete episodes.
The n-Step Return:
For a state visited at time t, the n-step return is computed as:
$$G_t^{(n)} = r_{t+1} + \gamma \cdot r_{t+2} + \gamma^2 \cdot r_{t+3} + \ldots + \gamma^{n-1} \cdot r_{t+n} + \gamma^n \cdot V(S_{t+n})$$
Where:
• Gₜ⁽ⁿ⁾ is the n-step return for the state visited at time t
• rₜ₊ₖ is the reward received k steps after time t
• γ ∈ [0, 1] is the discount factor
• n is the number of lookahead steps
• V(Sₜ₊ₙ) is the current value estimate of the state reached n steps later
Value Update Rule:
Once the n-step return is computed, the value estimate is updated using:
$$V(S_t) \leftarrow V(S_t) + \alpha \cdot (G_t^{(n)} - V(S_t))$$
Where α (alpha) is the learning rate controlling how much we adjust our estimate toward the observed return.
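The two formulas can be sketched together in a few lines of Python (the numbers below are illustrative placeholders, not taken from the worked examples):

```python
# Sketch of one n-step return and the corresponding value update.
gamma = 0.9   # discount factor
alpha = 0.1   # learning rate
n = 3

rewards = [1.0, 0.5, 2.0]   # r_{t+1}, ..., r_{t+n}
V_bootstrap = 4.0           # V(S_{t+n}), the estimate we bootstrap from
V_current = 1.0             # V(S_t) before the update

# G_t^(n) = sum_{k=0}^{n-1} gamma^k * r_{t+k+1} + gamma^n * V(S_{t+n})
G = sum(gamma**k * r for k, r in enumerate(rewards)) + gamma**n * V_bootstrap

# V(S_t) <- V(S_t) + alpha * (G - V(S_t))
V_updated = V_current + alpha * (G - V_current)
```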
Edge Cases:
When fewer than n steps remain before the episode ends, we shorten the lookahead to the steps that are available, accumulate those rewards, and bootstrap from the value of the state reached at the end of the trajectory (a terminal state's value is 0, so in that case the return reduces to the accumulated rewards).
Your Task:
Implement a function that takes a sequence of states and rewards from a trajectory, along with current value estimates, and computes updated value estimates using n-step temporal difference learning. Process each state in the sequence, computing its n-step return and applying the update rule.
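One possible realization is sketched below. The indexing conventions and the terminal-state handling are inferred from the worked examples rather than specified above, so treat this as a sketch under those assumptions, not a definitive reference solution:

```python
def n_step_td_update(states, rewards, V, n, gamma, alpha):
    """Sketch of n-step TD value updates along one trajectory.

    Assumptions inferred from the examples:
    - states[i] is the state at time i; rewards[i] is received on leaving it
    - V maps str(state) -> value, matching the example format
    - states are processed in trajectory order, updating V in place
    - past the end of the trajectory we bootstrap from the successor
      state's entry in V if one exists, else treat it as terminal (0)
    """
    T = len(states)
    for t in range(T):
        steps = min(n, T - t)  # truncate the lookahead near episode end
        # Accumulate the discounted rewards actually observed.
        G = sum(gamma**k * rewards[t + k] for k in range(steps))
        if t + steps < T:
            # Bootstrap from the value of the state reached after n steps.
            G += gamma**steps * V[str(states[t + steps])]
        else:
            # Trajectory ended: bootstrap from the successor state's entry
            # if V has one (its value is 0 in all the examples). Assumes
            # integer state ids that increase by 1 along the trajectory.
            successor = str(states[-1] + 1)
            G += gamma**steps * V.get(successor, 0.0)
        # Move V(S_t) toward the n-step return.
        V[str(states[t])] += alpha * (G - V[str(states[t])])
    return V
```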
Example 1:
states = [0, 1, 2]
rewards = [1.0, 1.0, 1.0]
V = {"0": 0.0, "1": 0.0, "2": 0.0, "3": 0.0}
n = 2
gamma = 0.9
alpha = 0.1

Output: {"0": 0.19, "1": 0.19, "2": 0.1, "3": 0.0}

For each state, we compute the 2-step return:
• State 0: Look 2 steps ahead. G₀⁽²⁾ = r₁ + γ·r₂ + γ²·V[2] = 1.0 + 0.9×1.0 + 0.81×0.0 = 1.9, so V[0] ← 0.0 + 0.1 × (1.9 − 0.0) = 0.19
• State 1: Look 2 steps ahead. G₁⁽²⁾ = r₂ + γ·r₃ + γ²·V[3] = 1.0 + 0.9×1.0 + 0.81×0.0 = 1.9, so V[1] ← 0.0 + 0.1 × (1.9 − 0.0) = 0.19
• State 2: Only 1 step available. G₂⁽¹⁾ = r₃ + γ·V[3] = 1.0 + 0.9×0.0 = 1.0, so V[2] ← 0.0 + 0.1 × (1.0 − 0.0) = 0.1
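The walkthrough above can be replayed step by step in code (a sketch mirroring this example's numbers; state keys are strings as in the input):

```python
gamma, alpha = 0.9, 0.1
V = {"0": 0.0, "1": 0.0, "2": 0.0, "3": 0.0}
rewards = [1.0, 1.0, 1.0]

# State 0: two observed rewards, then bootstrap from V[2] (still 0.0 here).
G0 = rewards[0] + gamma * rewards[1] + gamma**2 * V["2"]   # = 1.9
V["0"] += alpha * (G0 - V["0"])                            # 0.0 + 0.1*(1.9-0.0) = 0.19

# State 1: two observed rewards, then bootstrap from V[3].
G1 = rewards[1] + gamma * rewards[2] + gamma**2 * V["3"]   # = 1.9
V["1"] += alpha * (G1 - V["1"])                            # = 0.19

# State 2: only one reward remains; the truncated return bootstraps from V[3].
G2 = rewards[2] + gamma * V["3"]                           # = 1.0
V["2"] += alpha * (G2 - V["2"])                            # = 0.1
```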
Example 2:
states = [0, 1]
rewards = [5.0, 3.0]
V = {"0": 0.0, "1": 10.0, "2": 0.0}
n = 1
gamma = 0.9
alpha = 0.5

Output: {"0": 7.0, "1": 6.5, "2": 0.0}

With n=1, this reduces to standard TD(0) learning:
• State 0: G₀⁽¹⁾ = r₁ + γ·V[1] = 5.0 + 0.9×10.0 = 14.0, so V[0] ← 0.0 + 0.5 × (14.0 − 0.0) = 7.0
• State 1: G₁⁽¹⁾ = r₂ + γ·V[2] = 3.0 + 0.9×0.0 = 3.0, so V[1] ← 10.0 + 0.5 × (3.0 − 10.0) = 10.0 − 3.5 = 6.5
Note how V[1] decreases because the bootstrapped return (3.0) is less than the current estimate (10.0).
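With n = 1 the n-step return collapses to the one-step TD target, which makes this example easy to verify directly (a sketch using the example's numbers):

```python
gamma, alpha = 0.9, 0.5
V = {"0": 0.0, "1": 10.0, "2": 0.0}
rewards = [5.0, 3.0]

# State 0: one-step target r1 + gamma * V[1], computed before V[1] changes.
G0 = rewards[0] + gamma * V["1"]   # 5.0 + 0.9*10.0 = 14.0
V["0"] += alpha * (G0 - V["0"])    # 0.0 + 0.5*14.0 = 7.0

# State 1: one-step target r2 + gamma * V[2].
G1 = rewards[1] + gamma * V["2"]   # 3.0 + 0.9*0.0 = 3.0
V["1"] += alpha * (G1 - V["1"])    # 10.0 + 0.5*(3.0 - 10.0) = 6.5
```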
Example 3:
states = [0, 1, 2, 3]
rewards = [1.0, 2.0, 3.0, 4.0]
V = {"0": 0.0, "1": 0.0, "2": 0.0, "3": 0.0}
n = 4
gamma = 0.99
alpha = 0.1

Output: {"0": 0.9801, "1": 0.889, "2": 0.696, "3": 0.4}

With n=4, we use all available future rewards before bootstrapping:
• State 0: Full 4-step return (exactly 4 rewards remain, then the episode ends, so the bootstrap term is the terminal value 0): G = 1.0 + 0.99×2.0 + 0.99²×3.0 + 0.99³×4.0 = 1.0 + 1.98 + 2.9403 + 3.8812 = 9.8015, so V[0] ← 0.1 × 9.8015 ≈ 0.9801
• State 1: 3 remaining steps: G = 2.0 + 0.99×3.0 + 0.99²×4.0 = 2.0 + 2.97 + 3.9204 = 8.8904, so V[1] ← 0.1 × 8.8904 ≈ 0.889
• State 2: 2 remaining steps: G = 3.0 + 0.99×4.0 = 3.0 + 3.96 = 6.96, so V[2] ← 0.1 × 6.96 = 0.696
• State 3: 1 remaining step: G = 4.0, so V[3] ← 0.1 × 4.0 = 0.4
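Since every bootstrap value in this example is 0, each return is just a truncated discounted reward sum, which a short loop can confirm (a sketch; the states are their own indices here and all initial estimates are 0.0):

```python
rewards = [1.0, 2.0, 3.0, 4.0]
gamma, alpha, n = 0.99, 0.1, 4

V_new = {}
T = len(rewards)
for t in range(T):
    steps = min(n, T - t)  # lookahead truncated near the episode end
    # Every bootstrap value is 0 here, so only observed rewards contribute.
    G = sum(gamma**k * rewards[t + k] for k in range(steps))
    V_new[str(t)] = alpha * G  # update from an initial estimate of 0.0

print({s: round(v, 4) for s, v in V_new.items()})
# prints {'0': 0.9801, '1': 0.889, '2': 0.696, '3': 0.4}
```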
Constraints: