In reinforcement learning, TD(λ) is one of the most elegant algorithms: it unifies two seemingly different approaches to value estimation, one-step temporal difference learning (TD(0)) and full-episode Monte Carlo methods. The key innovation is eligibility traces: memory structures that maintain a decaying record of recently visited states, enabling efficient credit assignment across multi-step transitions.
Consider an agent navigating through an environment, collecting rewards along the way. When a reward is finally observed, a fundamental question arises: which past states should receive credit for this outcome?
When λ = 0, TD(λ) collapses exactly to TD(0). When λ = 1, it approximates Monte Carlo estimation. Intermediate values of λ provide a smooth interpolation between these extremes.
An eligibility trace e(s) for each state s tracks how "eligible" that state is to receive credit for the current temporal difference error. The trace operates with two key dynamics:
• Decay: at every time step, every trace is multiplied by γλ.
• Accumulation: the trace of the currently visited state is incremented by 1.
This creates a fading memory where recently visited states have higher eligibility, and frequently visited states accumulate even higher traces.
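The decay-and-accumulate dynamics can be seen in a tiny sketch (the values γ = 0.9, λ = 0.8 and the two-state visit sequence are just for illustration):

```python
# Two trace dynamics: every step, all traces decay by gamma*lambda;
# the visited state's trace then gains +1.
gamma, lam = 0.9, 0.8
e = [0.0, 0.0]
for s in [0, 1, 0]:                     # state 0 is revisited
    e = [gamma * lam * x for x in e]    # decay all traces
    e[s] += 1.0                         # accumulate at the visited state
print(e)  # e[0] ≈ 1.5184 > 1: the revisit stacks on top of the decayed trace
```

Because traces accumulate rather than reset, a revisited state ends up with a trace above 1, matching the "even higher traces" behavior described above.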
The backward view of TD(λ) processes an episode step-by-step:
$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$
$$e(s) \leftarrow \gamma \lambda \cdot e(s) + \mathbf{1}[s = S_t]$$
$$V(s) \leftarrow V(s) + \alpha \cdot \delta_t \cdot e(s) \quad \forall s$$
Where:
• δ_t is the TD error at time t
• R_{t+1} is the reward received on the transition out of S_t
• V(s) is the current value estimate for state s
• e(s) is the eligibility trace for state s
• γ is the discount factor, λ is the trace decay parameter, and α is the learning rate
For the terminal transition, the next-state value V(S_{t+1}) is 0.
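The backward-view update loop above can be sketched directly from these equations. This is a minimal sketch, not a reference solution; the function name and the episode format (a list of [state, reward] pairs, with the last pair terminal) are assumptions:

```python
# Backward-view TD(lambda) with accumulating eligibility traces.
def td_lambda(episode, gamma, lam, alpha, num_states):
    V = [0.0] * num_states               # value estimates
    e = [0.0] * num_states               # eligibility traces
    for t, (s, r) in enumerate(episode):
        # value of the successor state; 0 for the terminal transition
        v_next = V[episode[t + 1][0]] if t + 1 < len(episode) else 0.0
        delta = r + gamma * v_next - V[s]     # TD error
        e = [gamma * lam * x for x in e]      # decay all traces
        e[s] += 1.0                           # accumulate at current state
        for i in range(num_states):           # broadcast delta to all states
            V[i] += alpha * delta * e[i]
    return V
```

Note that δ does not depend on e, so the trace update and the TD-error computation can happen in either order within a step.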
Implement TD(λ) value estimation using the backward view with accumulating eligibility traces. Given an episode trajectory, the discount factor γ, the trace decay parameter λ, the learning rate α, and the number of states, compute the estimated state values after processing the entire episode.
Implementation Details:
episode = [[0, 1], [1, 1], [2, 0]]
gamma = 0.9
lambda = 0.8
alpha = 0.1
num_states = 3
Expected output: [0.172, 0.1, 0.0]
This episode visits states 0 → 1 → 2 with rewards 1, 1, 0 respectively.
Step 1: Transition from state 0 (reward = 1)
• Before: V = [0, 0, 0], e = [0, 0, 0]
• Update e[0] = 0 × 0.9 × 0.8 + 1 = 1.0
• TD error δ = 1 + 0.9 × V[1] - V[0] = 1 + 0 - 0 = 1.0
• Update V[0] += 0.1 × 1.0 × 1.0 = 0.1
• After: V = [0.1, 0, 0], e = [1.0, 0, 0]
Step 2: Transition from state 1 (reward = 1)
• Decay and update e: e[0] = 1.0 × 0.72 = 0.72, e[1] = 0 + 1 = 1.0
• TD error δ = 1 + 0.9 × 0 - 0 = 1.0
• Update V[0] += 0.1 × 1.0 × 0.72 = 0.072, V[1] += 0.1 × 1.0 × 1.0 = 0.1
• After: V = [0.172, 0.1, 0], e = [0.72, 1.0, 0]
Step 3: Transition from state 2 (reward = 0, terminal)
• TD error δ = 0 + 0 - 0 = 0 (terminal state)
• No value updates since δ = 0
Final state values: [0.172, 0.1, 0.0]
Notice how state 0 received credit from both its own reward AND the subsequent reward from state 1 due to the eligibility trace mechanism.
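The walkthrough above can be checked with a short script (a self-contained sketch; the variable names are just for illustration):

```python
# Replay example 1 step by step with the backward-view updates.
gamma, lam, alpha = 0.9, 0.8, 0.1
episode = [[0, 1], [1, 1], [2, 0]]
V, e = [0.0] * 3, [0.0] * 3
for t, (s, r) in enumerate(episode):
    v_next = V[episode[t + 1][0]] if t + 1 < len(episode) else 0.0
    delta = r + gamma * v_next - V[s]              # TD error
    e = [gamma * lam * x for x in e]               # decay traces
    e[s] += 1.0                                    # accumulate
    V = [v + alpha * delta * ei for v, ei in zip(V, e)]
print([round(v, 3) for v in V])  # [0.172, 0.1, 0.0]
```

State 0's final value 0.172 = 0.1 (its own TD update) + 0.072 (credit flowing back through its decayed trace of 0.72).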
episode = [[0, 0], [1, 0], [2, 1]]
gamma = 0.9
lambda = 0.0
alpha = 0.1
num_states = 3
Expected output: [0.0, 0.0, 0.1]
With λ = 0, this reduces to TD(0), where only the immediately preceding state receives credit.
Step 1: Transition from state 0 (reward = 0)
• TD error δ = 0 + 0.9 × 0 - 0 = 0
• No updates needed
Step 2: Transition from state 1 (reward = 0)
• Since λ = 0, previous traces immediately decay to 0
• TD error δ = 0 + 0.9 × 0 - 0 = 0
• No updates needed
Step 3: Transition from state 2 (reward = 1, terminal)
• e[2] = 1.0 (with λ = 0, only the current state has a non-zero trace)
• TD error δ = 1 + 0 - 0 = 1.0
• Update V[2] += 0.1 × 1.0 × 1.0 = 0.1
Final state values: [0.0, 0.0, 0.1]
Only the final state receives any value update because the reward arrives there and λ=0 prevents earlier states from maintaining eligibility.
episode = [[0, 0], [1, 0], [2, 1]]
gamma = 0.9
lambda = 1.0
alpha = 0.1
num_states = 3
Expected output: [0.081, 0.09, 0.1]
With λ = 1, this approaches Monte Carlo behavior, where all states in the trajectory receive credit.
Steps 1-2: Transitions with reward = 0
• TD errors are 0, so no value updates
• But eligibility traces accumulate and decay by γ only (since λ = 1, the decay factor is γλ = 0.9)
Step 3: Transition from state 2 (reward = 1, terminal)
• Eligibility traces: e[0] = 0.9² = 0.81, e[1] = 0.9, e[2] = 1.0
• TD error δ = 1.0
• Updates: V[0] += 0.1 × 1.0 × 0.81 = 0.081, V[1] += 0.1 × 1.0 × 0.9 = 0.09, V[2] += 0.1 × 1.0 × 1.0 = 0.1
Final state values: [0.081, 0.09, 0.1]
All states receive credit proportional to their discounted distance from the rewarding terminal state, mimicking Monte Carlo's approach of backing up the full return.
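Because only the terminal step has a non-zero TD error here, the whole episode reduces to one update whose weights are pure powers of γ. A minimal check (with δ = 1.0 taken from the walkthrough above):

```python
# With lambda = 1, traces at the terminal step are gamma^k for a state
# visited k steps before the terminal transition.
gamma, alpha, delta = 0.9, 0.1, 1.0
e = [gamma**2, gamma**1, gamma**0]        # [0.81, 0.9, 1.0]
V = [alpha * delta * ei for ei in e]      # the single nonzero-delta update
print([round(v, 3) for v in V])  # [0.081, 0.09, 0.1]
```

This geometric weighting is exactly how Monte Carlo distributes a discounted return over the states that preceded it.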
Constraints