In reinforcement learning, TD(λ) is one of the most elegant algorithms: it unifies two seemingly different approaches to value estimation, one-step temporal difference learning (TD(0)) and full-episode Monte Carlo methods. The key innovation is eligibility traces: memory structures that maintain a decaying record of recently visited states, enabling efficient credit assignment across multi-step transitions.
Consider an agent navigating through an environment, collecting rewards along the way. When a reward is finally observed, a fundamental question arises: which past states should receive credit for this outcome?
When λ = 0, TD(λ) collapses exactly to TD(0). When λ = 1, it approximates Monte Carlo estimation. Intermediate values of λ provide a smooth interpolation between these extremes.
An eligibility trace e(s) for each state s tracks how "eligible" that state is to receive credit for the current temporal difference error. The trace operates with two key dynamics:
• Decay: at every time step, every trace is multiplied by γλ.
• Accumulation: the trace of the currently visited state is incremented by 1.
This creates a fading memory where recently visited states have higher eligibility, and frequently visited states accumulate even higher traces.
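The decay-and-accumulate dynamics can be seen in a tiny sketch (the values γ = 0.9, λ = 0.8 and the two-state visit sequence are just for illustration):

```python
# Two trace dynamics: every step, all traces decay by gamma*lambda;
# the visited state's trace then gains +1.
gamma, lam = 0.9, 0.8
e = [0.0, 0.0]
for s in [0, 1, 0]:                     # state 0 is revisited
    e = [gamma * lam * x for x in e]    # decay all traces
    e[s] += 1.0                         # accumulate at the visited state
print(e)  # e[0] ≈ 1.5184 > 1: the revisit stacks on top of the decayed trace
```

Because traces accumulate rather than reset, a revisited state ends up with a trace above 1, matching the "even higher traces" behavior described above.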
The backward view of TD(λ) processes an episode step-by-step:
$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$
$$e(s) \leftarrow \gamma \lambda \cdot e(s) + \mathbf{1}[s = S_t]$$
$$V(s) \leftarrow V(s) + \alpha \cdot \delta_t \cdot e(s) \quad \forall s$$
Where:
• δ_t is the TD error at time t
• R_{t+1} is the reward received on the transition out of S_t
• V(s) is the current value estimate for state s
• e(s) is the eligibility trace for state s
• γ is the discount factor, λ is the trace decay parameter, and α is the learning rate
For the terminal transition, the next-state value V(S_{t+1}) is 0.
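The backward-view update loop above can be sketched directly from these equations. This is a minimal sketch, not a reference solution; the function name and the episode format (a list of [state, reward] pairs, with the last pair terminal) are assumptions:

```python
# Backward-view TD(lambda) with accumulating eligibility traces.
def td_lambda(episode, gamma, lam, alpha, num_states):
    V = [0.0] * num_states               # value estimates
    e = [0.0] * num_states               # eligibility traces
    for t, (s, r) in enumerate(episode):
        # value of the successor state; 0 for the terminal transition
        v_next = V[episode[t + 1][0]] if t + 1 < len(episode) else 0.0
        delta = r + gamma * v_next - V[s]     # TD error
        e = [gamma * lam * x for x in e]      # decay all traces
        e[s] += 1.0                           # accumulate at current state
        for i in range(num_states):           # broadcast delta to all states
            V[i] += alpha * delta * e[i]
    return V
```

Note that δ does not depend on e, so the trace update and the TD-error computation can happen in either order within a step.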
Implement TD(λ) value estimation using the backward view with accumulating eligibility traces. Given an episode trajectory, the discount factor γ, the trace decay parameter λ, the learning rate α, and the number of states, compute the estimated state values after processing the entire episode.
Implementation Details:
episode = [[0, 1], [1, 1], [2, 0]]
gamma = 0.9
lambda = 0.8
alpha = 0.1
num_states = 3
Expected output: [0.172, 0.1, 0.0]
This episode visits states 0 → 1 → 2 with rewards 1, 1, 0 respectively.
Step 1: Transition from state 0 (reward = 1)
• Before: V = [0, 0, 0], e = [0, 0, 0]
• Update e[0] = 0 × 0.9 × 0.8 + 1 = 1.0
• TD error δ = 1 + 0.9 × V[1] - V[0] = 1 + 0 - 0 = 1.0
• Update V[0] += 0.1 × 1.0 × 1.0 = 0.1
• After: V = [0.1, 0, 0], e = [1.0, 0, 0]
Step 2: Transition from state 1 (reward = 1)
• Decay and update e: e[0] = 1.0 × 0.72 = 0.72, e[1] = 0 + 1 = 1.0
• TD error δ = 1 + 0.9 × 0 - 0 = 1.0
• Update V[0] += 0.1 × 1.0 × 0.72 = 0.072, V[1] += 0.1 × 1.0 × 1.0 = 0.1
• After: V = [0.172, 0.1, 0], e = [0.72, 1.0, 0]
Step 3: Transition from state 2 (reward = 0, terminal)
• TD error δ = 0 + 0 - 0 = 0 (terminal state)
• No value updates since δ = 0
Final state values: [0.172, 0.1, 0.0]
Notice how state 0 received credit from both its own reward AND the subsequent reward from state 1 due to the eligibility trace mechanism.
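The walkthrough above can be checked with a short script (a self-contained sketch; the variable names are just for illustration):

```python
# Replay example 1 step by step with the backward-view updates.
gamma, lam, alpha = 0.9, 0.8, 0.1
episode = [[0, 1], [1, 1], [2, 0]]
V, e = [0.0] * 3, [0.0] * 3
for t, (s, r) in enumerate(episode):
    v_next = V[episode[t + 1][0]] if t + 1 < len(episode) else 0.0
    delta = r + gamma * v_next - V[s]              # TD error
    e = [gamma * lam * x for x in e]               # decay traces
    e[s] += 1.0                                    # accumulate
    V = [v + alpha * delta * ei for v, ei in zip(V, e)]
print([round(v, 3) for v in V])  # [0.172, 0.1, 0.0]
```

State 0's final value 0.172 = 0.1 (its own TD update) + 0.072 (credit flowing back through its decayed trace of 0.72).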
episode = [[0, 0], [1, 0], [2, 1]]
gamma = 0.9
lambda = 0.0
alpha = 0.1
num_states = 3
Expected output: [0.0, 0.0, 0.1]
With λ = 0, this reduces to TD(0), where only the immediately preceding state receives credit.
Step 1: Transition from state 0 (reward = 0)
• TD error δ = 0 + 0.9 × 0 - 0 = 0
• No updates needed
Step 2: Transition from state 1 (reward = 0)
• Since λ = 0, previous traces immediately decay to 0
• TD error δ = 0 + 0.9 × 0 - 0 = 0
• No updates needed
Step 3: Transition from state 2 (reward = 1, terminal)
• e[2] = 1.0 (with λ = 0, only the current state has a non-zero trace)
• TD error δ = 1 + 0 - 0 = 1.0
• Update V[2] += 0.1 × 1.0 × 1.0 = 0.1
Final state values: [0.0, 0.0, 0.1]
Only the final state receives any value update because the reward arrives there and λ=0 prevents earlier states from maintaining eligibility.
episode = [[0, 0], [1, 0], [2, 1]]
gamma = 0.9
lambda = 1.0
alpha = 0.1
num_states = 3
Expected output: [0.081, 0.09, 0.1]
With λ = 1, this approaches Monte Carlo behavior, where all states in the trajectory receive credit.
Steps 1-2: Transitions with reward = 0
• TD errors are 0, so no value updates
• But eligibility traces accumulate and decay by γ only (since λ = 1, the decay factor is γλ = 0.9)
Step 3: Transition from state 2 (reward = 1, terminal)
• Eligibility traces: e[0] = 0.9² = 0.81, e[1] = 0.9, e[2] = 1.0
• TD error δ = 1.0
• Updates: V[0] += 0.1 × 1.0 × 0.81 = 0.081, V[1] += 0.1 × 1.0 × 0.9 = 0.09, V[2] += 0.1 × 1.0 × 1.0 = 0.1
Final state values: [0.081, 0.09, 0.1]
All states receive credit proportional to their discounted distance from the rewarding terminal state, mimicking Monte Carlo's approach of backing up the full return.
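Because only the terminal step has a non-zero TD error here, the whole episode reduces to one update whose weights are pure powers of γ. A minimal check (with δ = 1.0 taken from the walkthrough above):

```python
# With lambda = 1, traces at the terminal step are gamma^k for a state
# visited k steps before the terminal transition.
gamma, alpha, delta = 0.9, 0.1, 1.0
e = [gamma**2, gamma**1, gamma**0]        # [0.81, 0.9, 1.0]
V = [alpha * delta * ei for ei in e]      # the single nonzero-delta update
print([round(v, 3) for v in V])  # [0.081, 0.09, 0.1]
```

This geometric weighting is exactly how Monte Carlo distributes a discounted return over the states that preceded it.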
Constraints