Consider a chess game where you make a brilliant move on turn 10, but the decisive advantage only manifests 20 moves later. When you finally win, how should the learning algorithm know that turn 10 was special? This is the temporal credit assignment problem—one of the deepest challenges in reinforcement learning.
Eligibility traces are the elegant solution. They maintain a memory of recently visited states, allowing rewards to propagate backward to credit earlier actions. Rather than updating only the current state (TD(0)) or waiting until episode end (Monte Carlo), eligibility traces update all recently visited states, with updates decaying exponentially into the past.
Eligibility traces implement TD(λ) efficiently, bridging the full spectrum from one-step TD to Monte Carlo. They are conceptually beautiful, computationally efficient, and often the key to learning in domains with delayed rewards. Understanding traces is essential for mastering reinforcement learning.
This page provides comprehensive coverage of eligibility traces: the intuition behind trace mechanisms, accumulating vs replacing traces, the backward view of TD(λ), the mathematical equivalence to the forward λ-return view, SARSA(λ) and Watkins's Q(λ), practical implementation, and modern developments like true online TD(λ).
Before diving into the mechanics of eligibility traces, let's understand the problem they solve.
In supervised learning, credit assignment is immediate: the label tells us exactly what the output should have been. In RL, rewards are delayed and must be attributed to earlier states and actions.
Example: In a maze, the agent takes many moves, some leading toward the goal and some that are wrong turns, and receives a reward only upon reaching the goal. Clearly, the moves toward the goal deserve credit and the wrong turns deserve blame. But how do we identify which is which when only the final reward is observed?
TD(0) updates only the immediately preceding state: $$V(S_t) \leftarrow V(S_t) + \alpha \delta_t$$
where $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$.
Problem: If reward arrives at step 50, only $V(S_{49})$ is updated toward that reward. $V(S_{48})$ only gets updated when $V(S_{49})$ has already increased (on a future visit). Credit propagates one step per episode—requiring many episodes to propagate rewards to early states.
MC updates all visited states toward the final return: $$V(S_t) \leftarrow V(S_t) + \alpha (G_t - V(S_t))$$
Problem: All states get equal credit for the return, regardless of when in the trajectory they occurred. This is often too much—early states had less influence on the final outcome than late states.
We want something in between: update all recently visited states, but give more credit to recent states than distant ones. Eligibility traces achieve exactly this through an exponentially decaying memory of visited states.
An eligibility trace $e_t(s)$ is a scalar associated with each state that marks how 'eligible' that state is for learning—how responsible it might be for the current TD error.
Traces accumulate when states are visited and decay over time:
$$e_t(s) = \begin{cases} \gamma \lambda \cdot e_{t-1}(s) + 1 & \text{if } s = S_t \text{ (state visited)} \\ \gamma \lambda \cdot e_{t-1}(s) & \text{otherwise (no visit, decay)} \end{cases}$$
where $\gamma$ is the discount factor, $\lambda \in [0, 1]$ is the trace-decay parameter, and $e_{t-1}(s)$ is the previous trace value (initialized to 0 at the start of each episode).
Interpretation: Each step, all traces decay by factor $\gamma\lambda$. When a state is visited, its trace increases by 1. States visited recently have high traces; states visited long ago have near-zero traces.
For example, after a single visit at $t = 0$ with $\gamma\lambda = 0.8$, the trace decays as $(\gamma\lambda)^t$:

| Time t | Steps Since Visit | Trace Value e(s) |
|---|---|---|
| 0 | 0 | 1.0 (just visited) |
| 1 | 1 | 0.8 |
| 2 | 2 | 0.64 |
| 5 | 5 | 0.33 |
| 10 | 10 | 0.11 |
| 20 | 20 | 0.01 |
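A minimal sketch (assuming a single visit at $t = 0$ and $\gamma\lambda = 0.8$, as in the table) that reproduces these values, since the trace after a lone visit is simply $(\gamma\lambda)^t$:

```python
gamma_lambda = 0.8  # assumed value of γλ for this illustration

e = 1.0  # trace immediately after the visit at t = 0
for t in range(21):
    if t > 0:
        e *= gamma_lambda  # each step without a visit decays the trace by γλ
    if t in (0, 1, 2, 5, 10, 20):
        print(f"t={t:2d}  e(s)={e:.2f}")
# Output: 1.00, 0.80, 0.64, 0.33, 0.11, 0.01
```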
Instead of updating only $V(S_t)$, TD(λ) updates all states proportionally to their eligibility:
$$V(s) \leftarrow V(s) + \alpha \delta_t e_t(s) \quad \text{for all } s$$
The TD error $\delta_t$ thus affects every state in proportion to its trace: the just-visited state most strongly, the state visited one step ago scaled by roughly $\gamma\lambda$, the one before that by $(\gamma\lambda)^2$, and so on.
This propagates credit backward—a reward now benefits all recently visited states, with benefit decaying exponentially into the past.
The decay factor is γλ (not just λ) because: (1) γ accounts for discounting—distant states contribute less to current value anyway; (2) λ is the additional 'eligibility' parameter controlling how far back credit flows. At λ=0, only the current state is eligible (TD(0)). At λ=1, credit flows to all visited states proportional to γ^k (MC-like).
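To see the $\lambda = 1$ limit concretely, consider a state visited once at time $k$ and assume $V$ is held fixed for the rest of the episode. Its trace at a later time $t$ is $e_t(s) = (\gamma\lambda)^{t-k}$, so its accumulated update is $$\alpha \sum_{t \ge k} (\gamma\lambda)^{t-k}\, \delta_t,$$ and at $\lambda = 1$ the TD errors telescope, making this sum equal to $\alpha\,(G_k - V(S_k))$, exactly the Monte Carlo update.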
There are two main types of eligibility traces, differing in how they handle repeated visits to the same state.
The update rule we've seen: $$e_t(s) = \gamma \lambda \cdot e_{t-1}(s) + \mathbf{1}_{s = S_t}$$
If a state is visited multiple times in quick succession, its trace accumulates—can exceed 1.
Example: State $s$ is visited at t=0, t=2, and t=3 (with $\gamma\lambda = 0.8$). The trace goes $e_0 = 1.0$, $e_1 = 0.8$, $e_2 = 0.8 \cdot 0.8 + 1 = 1.64$, $e_3 = 0.8 \cdot 1.64 + 1 = 2.31$.
Interpretation: States visited more frequently get larger updates. This makes intuitive sense: if you keep returning to a state, it is probably important.
An alternative that caps the trace at 1: $$e_t(s) = \begin{cases} 1 & \text{if } s = S_t \\ \gamma \lambda \cdot e_{t-1}(s) & \text{otherwise} \end{cases}$$
Visiting a state replaces the trace with 1, rather than adding to it.
Same example ($s$ visited at t=0, t=2, t=3; $\gamma\lambda = 0.8$): the trace goes $e_0 = 1.0$, $e_1 = 0.8$, $e_2 = 1.0$ (reset on the visit), $e_3 = 1.0$ (reset again), never exceeding 1.
Interpretation: What matters is how recently the state was visited, not how often. This can be more stable when states are revisited frequently (e.g., loops in the state space).
| Aspect | Accumulating | Replacing |
|---|---|---|
| Trace on revisit | Adds to existing trace | Resets to 1 |
| Maximum trace value | Unbounded | 1 |
| Reflects | Frequency × recency | Recency only |
| Loop behavior | Can explode in tight loops | Bounded, stable |
| Theoretical match | Forward view (exactly) | Forward view (approximately) |
| Common usage | Standard choice | When loops are common |
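A small sketch (assuming the visit pattern from the examples above: visits at t = 0, 2, 3 with $\gamma\lambda = 0.8$) contrasting the two trace types side by side:

```python
gamma_lambda = 0.8
visits = {0, 2, 3}  # assumed time steps at which state s is visited

e_acc, e_rep = 0.0, 0.0
for t in range(5):
    # Both traces decay every step
    e_acc *= gamma_lambda
    e_rep *= gamma_lambda
    if t in visits:
        e_acc += 1.0  # accumulating: add 1 on each visit (can exceed 1)
        e_rep = 1.0   # replacing: reset to 1 on each visit (capped at 1)
    print(f"t={t}  accumulating={e_acc:.2f}  replacing={e_rep:.2f}")
# accumulating: 1.00, 0.80, 1.64, 2.31, 1.85
# replacing:    1.00, 0.80, 1.00, 1.00, 0.80
```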
A third option, 'Dutch traces', modifies the update: e(s) = γλ·e(s) + 1 - α·γλ·e(s) = (1-α)·γλ·e(s) + 1. This exactly matches the forward view for linear function approximation and is used in true online TD(λ). It's theoretically cleaner but more complex to implement.
Let's formalize the complete TD(λ) algorithm for policy evaluation (prediction).
Algorithm:
```python
import numpy as np
from typing import Callable, Tuple, List

def td_lambda_prediction(
    env,
    policy: Callable[[int], int],
    num_episodes: int = 1000,
    alpha: float = 0.1,
    gamma: float = 0.99,
    lmbda: float = 0.9,  # 'lambda' is a Python keyword
    num_states: int = None,
    trace_type: str = "accumulating"  # or "replacing"
) -> Tuple[np.ndarray, List[float]]:
    """
    TD(λ) prediction using eligibility traces (backward view).

    Args:
        env: Environment with reset() and step() methods
        policy: Policy to evaluate (state -> action)
        num_episodes: Training episodes
        alpha: Learning rate
        gamma: Discount factor
        lmbda: Trace decay parameter (λ)
        num_states: Size of state space
        trace_type: "accumulating" or "replacing"

    Returns:
        V: Estimated value function
        episode_returns: Return per episode for monitoring
    """
    V = np.zeros(num_states)
    episode_returns = []

    for episode in range(num_episodes):
        # Reset eligibility traces at episode start
        e = np.zeros(num_states)
        state = env.reset()
        done = False
        episode_return = 0

        while not done:
            # Follow the policy being evaluated
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            episode_return += reward

            # Compute TD error
            if done:
                td_error = reward - V[state]  # V(terminal) = 0
            else:
                td_error = reward + gamma * V[next_state] - V[state]

            # Update eligibility for current state
            if trace_type == "accumulating":
                e[state] += 1  # Accumulate
            else:  # replacing
                e[state] = 1   # Replace

            # Update all values proportional to their eligibility
            V += alpha * td_error * e

            # Decay all traces
            e *= gamma * lmbda

            state = next_state

        episode_returns.append(episode_return)

    return V, episode_returns


def compare_td_lambda_values():
    """Compare different λ values on a random walk."""
    # Lower λ: faster initial learning, may have higher asymptotic error
    # Higher λ: slower start, often lower asymptotic error
    # λ = 0: Equivalent to TD(0)
    # λ = 1: Equivalent to every-visit MC (in expectation)
    lambda_values = [0.0, 0.5, 0.8, 0.95, 1.0]
    # ... comparison code would go here
```

Eligibility traces must be reset to 0 at the start of each episode! Continuing traces across episodes would propagate credit from one episode to another, which is incorrect for episodic tasks. This is easy to forget and causes subtle bugs.
One of the most beautiful results in RL theory: the backward view (eligibility traces) is equivalent to the forward view (λ-returns). Let's unpack this.
The λ-return at time $t$: $$G_t^\lambda = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_{t:t+n}$$
This is a weighted average of all n-step returns, with geometrically decaying weights.
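As a concrete sketch (not from the original text), here is how one might compute $G_t^\lambda$ from a list of n-step returns for an episode that terminates at time $T$; in that case the tail of the geometric weighting, $\lambda^{T-t-1}$, collapses onto the full return $G_t$:

```python
from typing import List

def lambda_return(n_step_returns: List[float], lmbda: float) -> float:
    """Weighted average of n-step returns with geometrically decaying weights.

    Assumes n_step_returns[i] holds the (i+1)-step return G_{t:t+i+1},
    with the last entry being the full Monte Carlo return G_t.
    """
    T = len(n_step_returns)
    g_lambda = 0.0
    for n in range(1, T):
        # Weight (1 - λ) λ^(n-1) on the n-step return
        g_lambda += (1 - lmbda) * lmbda ** (n - 1) * n_step_returns[n - 1]
    # The remaining weight λ^(T-1) goes to the full return
    g_lambda += lmbda ** (T - 1) * n_step_returns[-1]
    return g_lambda
```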
Forward view update (offline, at episode end): $$V(S_t) \leftarrow V(S_t) + \alpha (G_t^\lambda - V(S_t))$$
Problem: Computing $G_t^\lambda$ requires knowing all future rewards—can't update until episode ends.
The update we've described: $$V(s) \leftarrow V(s) + \alpha \delta_t e(s) \quad \text{for all } s$$
Advantage: Updates happen at every step using only past information.
Theorem: For linear function approximation (including the tabular case), the sum of all backward-view updates over an episode equals the forward-view update for each state visited, provided the updates are accumulated and applied only at the end of the episode (offline updating).
Specifically, at episode end: $$\sum_{t=0}^{T-1} \alpha \delta_t e_t(s) = \alpha \sum_{t: S_t = s} (G_t^\lambda - V(S_t))$$
Proof sketch: Show that the accumulation of TD errors weighted by traces equals the λ-return minus the initial estimate. This uses the telescoping property of TD errors.
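To make the telescoping step concrete, assume $V$ is held fixed over the episode (the offline setting of the theorem). Each n-step return can then be written as the current estimate plus a discounted sum of TD errors, $$G_{t:t+n} = V(S_t) + \sum_{k=t}^{t+n-1} \gamma^{k-t} \delta_k,$$ and substituting this into the definition of $G_t^\lambda$ and collecting the weights on each $\delta_k$ gives $$G_t^\lambda - V(S_t) = \sum_{k=t}^{T-1} (\gamma\lambda)^{k-t} \delta_k.$$ The right-hand side is exactly the total credit that the eligibility trace of $S_t$ assigns to later TD errors.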
Implication: With offline updating, the backward view is not an approximation—it computes exactly the same updates as the forward view, just incrementally. When updates are applied online at every step, the match is only approximate (though usually very close); true online TD(λ), discussed later, restores exact equivalence.
| Aspect | Forward View | Backward View |
|---|---|---|
| Computes | λ-return G_t^λ | TD errors δ_t with traces |
| Update timing | Episode end (offline) | Every step (online) |
| Information needed | All future rewards | Only past and present |
| Memory cost | Store full trajectory | O(|S|) for traces |
| Result (offline updating) | Same updates! | Same updates! |
Eligibility traces perform a seemingly magical trick: they produce updates equivalent to looking into the future, while only using past information. This is possible because the λ-return's structure—a sum of n-step returns—can be decomposed into a sum of TD errors, which can be computed incrementally.
Eligibility traces extend naturally to control algorithms. The idea: maintain traces for state-action pairs $(s, a)$ instead of just states.
Extend SARSA with eligibility traces:
$$\delta_t = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)$$ $$e(S_t, A_t) \leftarrow e(S_t, A_t) + 1 \text{ (or } = 1 \text{ for replacing)}$$ $$Q(s, a) \leftarrow Q(s, a) + \alpha \delta_t e(s, a) \quad \forall (s, a)$$ $$e(s, a) \leftarrow \gamma \lambda e(s, a) \quad \forall (s, a)$$
SARSA(λ) maintains credit assignment for state-action pairs, enabling faster learning when rewards are delayed from the actions that caused them.
```python
import numpy as np

class SARSALambdaAgent:
    """
    SARSA(λ): On-policy TD control with eligibility traces.

    Extends SARSA to update all recently visited state-action pairs,
    weighted by their eligibility traces.
    """

    def __init__(
        self,
        num_states: int,
        num_actions: int,
        alpha: float = 0.1,
        gamma: float = 0.99,
        lmbda: float = 0.9,
        epsilon: float = 0.1,
        trace_type: str = "replacing"
    ):
        self.num_states = num_states
        self.num_actions = num_actions
        self.alpha = alpha
        self.gamma = gamma
        self.lmbda = lmbda
        self.epsilon = epsilon
        self.trace_type = trace_type

        self.Q = np.zeros((num_states, num_actions))
        self.e = np.zeros((num_states, num_actions))  # Eligibility traces

    def select_action(self, state: int) -> int:
        """ε-greedy action selection."""
        if np.random.random() < self.epsilon:
            return np.random.randint(self.num_actions)
        return np.argmax(self.Q[state])

    def reset_traces(self):
        """Reset traces at start of episode."""
        self.e[:, :] = 0

    def update(
        self,
        state: int,
        action: int,
        reward: float,
        next_state: int,
        next_action: int,
        done: bool
    ):
        """SARSA(λ) update with eligibility traces."""
        # Compute TD error using SARSA target
        if done:
            td_error = reward - self.Q[state, action]
        else:
            td_error = (reward + self.gamma * self.Q[next_state, next_action]
                        - self.Q[state, action])

        # Update eligibility for current state-action
        if self.trace_type == "accumulating":
            self.e[state, action] += 1
        elif self.trace_type == "replacing":
            self.e[state, action] = 1

        # Update all Q-values proportional to eligibility
        self.Q += self.alpha * td_error * self.e

        # Decay all traces
        self.e *= self.gamma * self.lmbda


def train_sarsa_lambda(
    env,
    agent: SARSALambdaAgent,
    num_episodes: int = 1000
):
    """Training loop for SARSA(λ)."""
    episode_returns = []

    for episode in range(num_episodes):
        agent.reset_traces()  # Critical! Reset at episode start

        state = env.reset()
        action = agent.select_action(state)
        total_return = 0
        done = False

        while not done:
            next_state, reward, done, _ = env.step(action)
            next_action = agent.select_action(next_state)

            agent.update(state, action, reward, next_state, next_action, done)

            total_return += reward
            state = next_state
            action = next_action

        episode_returns.append(total_return)

    return episode_returns
```

Extending Q-learning with traces is trickier because Q-learning is off-policy. The issue: traces are computed under the behavior policy, but Q-learning targets the greedy policy.
Watkins's Q(λ): Cut traces to zero when an exploratory (non-greedy) action is taken.
$$e(s, a) = \begin{cases} \gamma \lambda e(s, a) + 1 & \text{if } (s, a) = (S_t, A_t) \text{ and } A_t = \arg\max_a Q(S_t, a) \\ 0 & \text{if } A_t \neq \arg\max_a Q(S_t, a) \text{ (cut traces)} \\ \gamma \lambda e(s, a) & \text{otherwise} \end{cases}$$
Rationale: When an exploratory action is taken, the trajectory diverges from the greedy policy that Q-learning evaluates. Continuing traces would propagate credit incorrectly.
Cutting traces at every exploratory action means most traces are zero most of the time (especially with high ε). This significantly reduces the benefits of eligibility traces. For this reason, SARSA(λ) is often preferred over Q(λ) when traces are desired. Alternative formulations like Tree-backup and Retrace address this issue.
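A minimal sketch of one such update step, following the formulation above (tabular arrays `Q` and `e` shaped like those in the SARSA(λ) agent; the function name and its in-place update style are illustrative assumptions, not from the original text):

```python
import numpy as np

def watkins_q_lambda_step(Q, e, state, action, reward, next_state, done,
                          alpha=0.1, gamma=0.99, lmbda=0.9):
    """One Watkins's Q(λ) update; traces are cut after a non-greedy action."""
    # Was the action taken greedy? (Comparing values handles ties.)
    was_greedy = Q[state, action] == np.max(Q[state])

    # Q-learning target: greedy value of the next state
    target = reward if done else reward + gamma * np.max(Q[next_state])
    td_error = target - Q[state, action]

    # Bump the trace for the current state-action pair (accumulating variant)
    e[state, action] += 1

    # Update all Q-values in proportion to their eligibility (in place)
    Q += alpha * td_error * e

    if was_greedy:
        e *= gamma * lmbda  # greedy action: decay traces as usual
    else:
        e[:] = 0.0          # exploratory action: cut all traces to zero
```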
Implementing eligibility traces efficiently requires attention to computational details.
A naive implementation updates all state(-action) pairs every step, costing $O(|S| \times |A|)$ work per step even though most traces are effectively zero.
Sparse update optimization: Most traces decay to near-zero quickly. Track only states with non-negligible traces:
```python
# Track only entries whose trace is still non-negligible
active_pairs = [(s, a) for (s, a) in traces if traces[s, a] > threshold]
for s, a in active_pairs:
    Q[s, a] += alpha * td_error * traces[s, a]
    traces[s, a] *= gamma * lmbda
```
With threshold = 0.01 and γλ = 0.8, traces stay active for about 20 steps before dropping below the threshold. This bounds the number of active pairs at any moment to roughly the 20 most recently visited, regardless of episode length.
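The "about 20 steps" figure follows directly from the decay rate: a trace started at 1 drops below the threshold once $(\gamma\lambda)^n < 0.01$, i.e. $$n > \frac{\ln 0.01}{\ln 0.8} \approx 20.6.$$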
| Parameter | Typical Range | Effect of Increasing |
|---|---|---|
| λ | 0.7–0.95 | More MC-like, lower bias, higher variance |
| α (given λ) | 0.01–0.1 (lower than TD(0)) | Faster learning but potential instability |
| γλ product | 0.6–0.9 | Longer credit assignment horizon |
λ selection heuristic: start around λ ≈ 0.9 and tune within the 0.7–0.95 range; lean toward higher λ when rewards are sparse or delayed (more MC-like), and toward lower λ when returns are noisy and variance is the bigger problem.
Eligibility traces provide the biggest benefit when: (1) rewards are sparse or delayed, (2) state transitions are not too stochastic (deterministic paths benefit most), (3) episodes are long (more room for credit propagation). In environments with dense rewards and short episodes, TD(0) may be equally effective and simpler.
The classical backward view matches the forward view exactly only under offline updating; when updates are applied online, during the episode, the match is approximate. True online TD(λ) achieves exact equivalence online.
With offline updating (accumulating updates and applying them at episode end), classical TD(λ) exactly matches the forward view for linear and tabular value functions. With online updating there is a subtle discrepancy: the forward view's λ-returns are defined with respect to a fixed $V$, while the backward view updates $V$ incrementally at every step, so later TD errors are computed with already-modified values.
True online TD(λ) uses modified Dutch traces:
$$e_t = \gamma \lambda e_{t-1} + \alpha(1 - \gamma \lambda e_{t-1}^\top \phi_t) \phi_t$$
where $\phi_t$ is the feature vector for state $S_t$.
The key difference: the trace update includes a correction term that accounts for the immediate effect of the update on the current state's value.
```python
import numpy as np

class TrueOnlineTDLambda:
    """
    True online TD(λ) - exactly equivalent to the forward view
    for linear function approximation.

    Uses Dutch traces with a correction term.
    """

    def __init__(
        self,
        num_features: int,
        alpha: float = 0.1,
        gamma: float = 0.99,
        lmbda: float = 0.9
    ):
        self.w = np.zeros(num_features)  # Weight vector
        self.alpha = alpha
        self.gamma = gamma
        self.lmbda = lmbda

    def value(self, phi: np.ndarray) -> float:
        """Compute value as w^T * phi."""
        return np.dot(self.w, phi)

    def reset_traces(self):
        """Reset traces at episode start."""
        self.e = np.zeros_like(self.w)
        self.V_old = 0.0

    def update(
        self,
        phi: np.ndarray,       # Current state features
        reward: float,
        phi_next: np.ndarray,  # Next state features (or zeros if terminal)
        done: bool
    ):
        """
        True online TD(λ) update.

        Uses Dutch traces for exact equivalence with the forward view.
        """
        V = self.value(phi)
        V_next = 0.0 if done else self.value(phi_next)

        # TD error (using current weights)
        delta = reward + self.gamma * V_next - V

        # Dutch trace update:
        # e = γλe + α(1 - γλ * e^T * φ) * φ
        e_dot_phi = np.dot(self.e, phi)
        self.e = (self.gamma * self.lmbda * self.e
                  + self.alpha * (1 - self.gamma * self.lmbda * e_dot_phi) * phi)

        # Weight update with correction term:
        # Δw = δ·e + (V - V_old)(e - α·φ)
        self.w += delta * self.e + (V - self.V_old) * (self.e - self.alpha * phi)

        # Store old value for next update
        self.V_old = V_next
```

Use true online TD(λ) when: you are using linear function approximation, exact equivalence with the forward view matters, or the step size and λ are large enough that the discrepancy of classical traces becomes noticeable.
Classical TD(λ) is often sufficient when: the problem is tabular, the step size is small (so the online/offline discrepancy is negligible), or implementation simplicity matters more than exactness.
With neural networks, eligibility traces are less common because: (1) no exact forward-view equivalence exists, (2) experience replay (common in deep RL) doesn't naturally support traces, and (3) updating traces for every network parameter is computationally costly. However, researchers have explored 'expected eligibility traces' and other adaptations for deep RL.
Eligibility traces are the elegant mechanism for assigning credit across time, unifying TD learning and Monte Carlo methods. Let's consolidate the key insights:

- Traces are an exponentially decaying memory of visited states (or state-action pairs), and each TD error updates every entry in proportion to its trace.
- λ interpolates between TD(0) (λ = 0) and Monte Carlo (λ = 1), trading bias against variance.
- Accumulating traces reflect frequency and recency; replacing traces reflect recency only and are more stable in loops.
- The backward view (traces) reproduces the forward view (λ-returns): exactly under offline updating, approximately online, and exactly online with true online TD(λ) and Dutch traces.
- Traces extend to control as SARSA(λ) and Watkins's Q(λ), which cuts traces after exploratory actions.
- Reset traces at the start of every episode, and exploit their sparsity for efficient implementation.
Module Complete: You've now mastered the core value-based methods of reinforcement learning: temporal-difference learning and TD errors, Q-learning and SARSA, their convergence properties, and eligibility traces with TD(λ).
These form the foundation for deep RL methods covered in Module 5, where function approximation extends these ideas to large-scale problems.
Congratulations! You've completed Module 3: Value-Based Methods. You now understand how agents can learn optimal behavior from experience using temporal-difference learning, from the foundational algorithms (Q-learning, SARSA) through the theoretical underpinnings (TD errors, convergence) to advanced techniques (eligibility traces, TD(λ)). This knowledge forms the bedrock for all value-based reinforcement learning.