Consider a chess game where you make a brilliant move on turn 10, but the decisive advantage only manifests 20 moves later. When you finally win, how should the learning algorithm know that turn 10 was special? This is the temporal credit assignment problem—one of the deepest challenges in reinforcement learning.
Eligibility traces are the elegant solution. They maintain a memory of recently visited states, allowing rewards to propagate backward to credit earlier actions. Rather than updating only the current state (TD(0)) or waiting until episode end (Monte Carlo), eligibility traces update all recently visited states, with updates decaying exponentially into the past.
Eligibility traces implement TD(λ) efficiently, bridging the full spectrum from one-step TD to Monte Carlo. They are conceptually beautiful, computationally efficient, and often the key to learning in domains with delayed rewards. Understanding traces is essential for mastering reinforcement learning.
This page provides comprehensive coverage of eligibility traces: the intuition behind trace mechanisms, accumulating vs replacing traces, the backward view of TD(λ), the mathematical equivalence to the forward λ-return view, SARSA(λ) and Watkins's Q(λ), practical implementation, and modern developments like true online TD(λ).
Before diving into the mechanics of eligibility traces, let's understand the problem they solve.
In supervised learning, credit assignment is immediate: the label tells us exactly what the output should have been. In RL, rewards are delayed and must be attributed to earlier states and actions.
Example: In a maze, the agent takes many moves, some leading toward the goal and some that are wrong turns, and receives a reward only upon reaching the goal. Clearly, the moves toward the goal deserve credit and the wrong turns deserve blame. But how do we identify which is which when only the final reward is observed?
TD(0) updates only the immediately preceding state: $$V(S_t) \leftarrow V(S_t) + \alpha \delta_t$$
where $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$.
Problem: If reward arrives at step 50, only $V(S_{49})$ is updated toward that reward. $V(S_{48})$ only gets updated when $V(S_{49})$ has already increased (on a future visit). Credit propagates one step per episode—requiring many episodes to propagate rewards to early states.
MC updates all visited states toward the final return: $$V(S_t) \leftarrow V(S_t) + \alpha (G_t - V(S_t))$$
Problem: All states get equal credit for the return, regardless of when in the trajectory they occurred. This is often too much—early states had less influence on the final outcome than late states.
We want something in between: update all recently visited states, but give more credit to recent states than distant ones. Eligibility traces achieve exactly this through an exponentially decaying memory of visited states.
An eligibility trace $e_t(s)$ is a scalar associated with each state that marks how 'eligible' that state is for learning—how responsible it might be for the current TD error.
Traces accumulate when states are visited and decay over time:
$$e_t(s) = \begin{cases} \gamma \lambda \cdot e_{t-1}(s) + 1 & \text{if } s = S_t \text{ (state visited)} \\ \gamma \lambda \cdot e_{t-1}(s) & \text{otherwise (no visit, decay)} \end{cases}$$
where $\gamma$ is the discount factor, $\lambda \in [0, 1]$ is the trace-decay parameter, and $e_{t-1}(s)$ is the previous trace value (initialized to 0 at the start of each episode).
Interpretation: Each step, all traces decay by factor $\gamma\lambda$. When a state is visited, its trace increases by 1. States visited recently have high traces; states visited long ago have near-zero traces.
For example, after a single visit at $t = 0$ with $\gamma\lambda = 0.8$, the trace decays as $(\gamma\lambda)^t$:

| Time t | Steps Since Visit | Trace Value e(s) |
|---|---|---|
| 0 | 0 | 1.0 (just visited) |
| 1 | 1 | 0.8 |
| 2 | 2 | 0.64 |
| 5 | 5 | 0.33 |
| 10 | 10 | 0.11 |
| 20 | 20 | 0.01 |
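A minimal sketch (assuming a single visit at $t = 0$ and $\gamma\lambda = 0.8$, as in the table) that reproduces these values, since the trace after a lone visit is simply $(\gamma\lambda)^t$:

```python
gamma_lambda = 0.8  # assumed value of γλ for this illustration

e = 1.0  # trace immediately after the visit at t = 0
for t in range(21):
    if t > 0:
        e *= gamma_lambda  # each step without a visit decays the trace by γλ
    if t in (0, 1, 2, 5, 10, 20):
        print(f"t={t:2d}  e(s)={e:.2f}")
# Output: 1.00, 0.80, 0.64, 0.33, 0.11, 0.01
```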
Instead of updating only $V(S_t)$, TD(λ) updates all states proportionally to their eligibility:
$$V(s) \leftarrow V(s) + \alpha \delta_t e_t(s) \quad \text{for all } s$$
The TD error $\delta_t$ thus affects every state in proportion to its trace: the just-visited state most strongly, the state visited one step ago scaled by roughly $\gamma\lambda$, the one before that by $(\gamma\lambda)^2$, and so on.
This propagates credit backward—a reward now benefits all recently visited states, with benefit decaying exponentially into the past.
The decay factor is γλ (not just λ) because: (1) γ accounts for discounting—distant states contribute less to current value anyway; (2) λ is the additional 'eligibility' parameter controlling how far back credit flows. At λ=0, only the current state is eligible (TD(0)). At λ=1, credit flows to all visited states proportional to γ^k (MC-like).
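To see the $\lambda = 1$ limit concretely, consider a state visited once at time $k$ and assume $V$ is held fixed for the rest of the episode. Its trace at a later time $t$ is $e_t(s) = (\gamma\lambda)^{t-k}$, so its accumulated update is $$\alpha \sum_{t \ge k} (\gamma\lambda)^{t-k}\, \delta_t,$$ and at $\lambda = 1$ the TD errors telescope, making this sum equal to $\alpha\,(G_k - V(S_k))$, exactly the Monte Carlo update.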
There are two main types of eligibility traces, differing in how they handle repeated visits to the same state.
The update rule we've seen: $$e_t(s) = \gamma \lambda \cdot e_{t-1}(s) + \mathbf{1}_{s = S_t}$$
If a state is visited multiple times in quick succession, its trace accumulates—can exceed 1.
Example: State $s$ is visited at t=0, t=2, and t=3 (with $\gamma\lambda = 0.8$). The trace goes $e_0 = 1.0$, $e_1 = 0.8$, $e_2 = 0.8 \cdot 0.8 + 1 = 1.64$, $e_3 = 0.8 \cdot 1.64 + 1 = 2.31$.
Interpretation: States visited more frequently get larger updates. This makes intuitive sense: if you keep returning to a state, it is probably important.
An alternative that caps the trace at 1: $$e_t(s) = \begin{cases} 1 & \text{if } s = S_t \\ \gamma \lambda \cdot e_{t-1}(s) & \text{otherwise} \end{cases}$$
Visiting a state replaces the trace with 1, rather than adding to it.
Same example ($s$ visited at t=0, t=2, t=3; $\gamma\lambda = 0.8$): the trace goes $e_0 = 1.0$, $e_1 = 0.8$, $e_2 = 1.0$ (reset on the visit), $e_3 = 1.0$ (reset again), never exceeding 1.
Interpretation: What matters is how recently the state was visited, not how often. This can be more stable when states are revisited frequently (e.g., loops in the state space).
| Aspect | Accumulating | Replacing |
|---|---|---|
| Trace on revisit | Adds to existing trace | Resets to 1 |
| Maximum trace value | Unbounded | 1 |
| Reflects | Frequency × recency | Recency only |
| Loop behavior | Can explode in tight loops | Bounded, stable |
| Theoretical match | Forward view (exactly) | Forward view (approximately) |
| Common usage | Standard choice | When loops are common |
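A small sketch (assuming the visit pattern from the examples above: visits at t = 0, 2, 3 with $\gamma\lambda = 0.8$) contrasting the two trace types side by side:

```python
gamma_lambda = 0.8
visits = {0, 2, 3}  # assumed time steps at which state s is visited

e_acc, e_rep = 0.0, 0.0
for t in range(5):
    # Both traces decay every step
    e_acc *= gamma_lambda
    e_rep *= gamma_lambda
    if t in visits:
        e_acc += 1.0  # accumulating: add 1 on each visit (can exceed 1)
        e_rep = 1.0   # replacing: reset to 1 on each visit (capped at 1)
    print(f"t={t}  accumulating={e_acc:.2f}  replacing={e_rep:.2f}")
# accumulating: 1.00, 0.80, 1.64, 2.31, 1.85
# replacing:    1.00, 0.80, 1.00, 1.00, 0.80
```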
A third option, 'Dutch traces', modifies the update: e(s) = γλ·e(s) + 1 - α·γλ·e(s) = (1-α)·γλ·e(s) + 1. This exactly matches the forward view for linear function approximation and is used in true online TD(λ). It's theoretically cleaner but more complex to implement.
Let's formalize the complete TD(λ) algorithm for policy evaluation (prediction).
Algorithm:
```python
import numpy as np
from typing import Callable, Tuple, List

def td_lambda_prediction(
    env,
    policy: Callable[[int], int],
    num_episodes: int = 1000,
    alpha: float = 0.1,
    gamma: float = 0.99,
    lmbda: float = 0.9,  # 'lambda' is a Python keyword
    num_states: int = None,
    trace_type: str = "accumulating"  # or "replacing"
) -> Tuple[np.ndarray, List[float]]:
    """
    TD(λ) prediction using eligibility traces (backward view).

    Args:
        env: Environment with reset() and step() methods
        policy: Policy to evaluate (state -> action)
        num_episodes: Training episodes
        alpha: Learning rate
        gamma: Discount factor
        lmbda: Trace decay parameter (λ)
        num_states: Size of state space
        trace_type: "accumulating" or "replacing"

    Returns:
        V: Estimated value function
        episode_returns: Return per episode for monitoring
    """
    V = np.zeros(num_states)
    episode_returns = []

    for episode in range(num_episodes):
        # Reset eligibility traces at episode start
        e = np.zeros(num_states)
        state = env.reset()
        done = False
        episode_return = 0

        while not done:
            # Follow the policy being evaluated
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            episode_return += reward

            # Compute TD error
            if done:
                td_error = reward - V[state]  # V(terminal) = 0
            else:
                td_error = reward + gamma * V[next_state] - V[state]

            # Update eligibility for current state
            if trace_type == "accumulating":
                e[state] += 1  # Accumulate
            else:  # replacing
                e[state] = 1   # Replace

            # Update all values proportional to their eligibility
            V += alpha * td_error * e

            # Decay all traces
            e *= gamma * lmbda

            state = next_state

        episode_returns.append(episode_return)

    return V, episode_returns


def compare_td_lambda_values():
    """Compare different λ values on a random walk."""
    # Lower λ: faster initial learning, may have higher asymptotic error
    # Higher λ: slower start, often lower asymptotic error
    # λ = 0: Equivalent to TD(0)
    # λ = 1: Equivalent to every-visit MC (in expectation)
    lambda_values = [0.0, 0.5, 0.8, 0.95, 1.0]
    # ... comparison code would go here
```

Eligibility traces must be reset to 0 at the start of each episode! Continuing traces across episodes would propagate credit from one episode to another, which is incorrect for episodic tasks. This is easy to forget and causes subtle bugs.
One of the most beautiful results in RL theory: the backward view (eligibility traces) is equivalent to the forward view (λ-returns). Let's unpack this.
The λ-return at time $t$: $$G_t^\lambda = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_{t:t+n}$$
This is a weighted average of all n-step returns, with geometrically decaying weights.
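As a concrete sketch (not from the original text), here is how one might compute $G_t^\lambda$ from a list of n-step returns for an episode that terminates at time $T$; in that case the tail of the geometric weighting, $\lambda^{T-t-1}$, collapses onto the full return $G_t$:

```python
from typing import List

def lambda_return(n_step_returns: List[float], lmbda: float) -> float:
    """Weighted average of n-step returns with geometrically decaying weights.

    Assumes n_step_returns[i] holds the (i+1)-step return G_{t:t+i+1},
    with the last entry being the full Monte Carlo return G_t.
    """
    T = len(n_step_returns)
    g_lambda = 0.0
    for n in range(1, T):
        # Weight (1 - λ) λ^(n-1) on the n-step return
        g_lambda += (1 - lmbda) * lmbda ** (n - 1) * n_step_returns[n - 1]
    # The remaining weight λ^(T-1) goes to the full return
    g_lambda += lmbda ** (T - 1) * n_step_returns[-1]
    return g_lambda
```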
Forward view update (offline, at episode end): $$V(S_t) \leftarrow V(S_t) + \alpha (G_t^\lambda - V(S_t))$$
Problem: Computing $G_t^\lambda$ requires knowing all future rewards—can't update until episode ends.
The update we've described: $$V(s) \leftarrow V(s) + \alpha \delta_t e(s) \quad \text{for all } s$$
Advantage: Updates happen at every step using only past information.
Theorem: For linear function approximation (including the tabular case), the sum of all backward-view updates over an episode equals the forward-view update for each state visited, provided the updates are accumulated and applied only at the end of the episode (offline updating).
Specifically, at episode end: $$\sum_{t=0}^{T-1} \alpha \delta_t e_t(s) = \alpha \sum_{t: S_t = s} (G_t^\lambda - V(S_t))$$
Proof sketch: Show that the accumulation of TD errors weighted by traces equals the λ-return minus the initial estimate. This uses the telescoping property of TD errors.
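To make the telescoping step concrete, assume $V$ is held fixed over the episode (the offline setting of the theorem). Each n-step return can then be written as the current estimate plus a discounted sum of TD errors, $$G_{t:t+n} = V(S_t) + \sum_{k=t}^{t+n-1} \gamma^{k-t} \delta_k,$$ and substituting this into the definition of $G_t^\lambda$ and collecting the weights on each $\delta_k$ gives $$G_t^\lambda - V(S_t) = \sum_{k=t}^{T-1} (\gamma\lambda)^{k-t} \delta_k.$$ The right-hand side is exactly the total credit that the eligibility trace of $S_t$ assigns to later TD errors.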
Implication: With offline updating, the backward view is not an approximation—it computes exactly the same updates as the forward view, just incrementally. When updates are applied online at every step, the match is only approximate (though usually very close); true online TD(λ), discussed later, restores exact equivalence.
| Aspect | Forward View | Backward View |
|---|---|---|
| Computes | λ-return G_t^λ | TD errors δ_t with traces |
| Update timing | Episode end (offline) | Every step (online) |
| Information needed | All future rewards | Only past and present |
| Memory cost | Store full trajectory | O(|S|) for traces |
| Result (offline updating) | Same updates! | Same updates! |
Eligibility traces perform a seemingly magical trick: they produce updates equivalent to looking into the future, while only using past information. This is possible because the λ-return's structure—a sum of n-step returns—can be decomposed into a sum of TD errors, which can be computed incrementally.
Eligibility traces extend naturally to control algorithms. The idea: maintain traces for state-action pairs $(s, a)$ instead of just states.
Extend SARSA with eligibility traces:
$$\delta_t = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)$$ $$e(S_t, A_t) \leftarrow e(S_t, A_t) + 1 \text{ (or } = 1 \text{ for replacing)}$$ $$Q(s, a) \leftarrow Q(s, a) + \alpha \delta_t e(s, a) \quad \forall (s, a)$$ $$e(s, a) \leftarrow \gamma \lambda e(s, a) \quad \forall (s, a)$$
SARSA(λ) maintains credit assignment for state-action pairs, enabling faster learning when rewards are delayed from the actions that caused them.
```python
import numpy as np

class SARSALambdaAgent:
    """
    SARSA(λ): On-policy TD control with eligibility traces.

    Extends SARSA to update all recently visited state-action pairs,
    weighted by their eligibility traces.
    """

    def __init__(
        self,
        num_states: int,
        num_actions: int,
        alpha: float = 0.1,
        gamma: float = 0.99,
        lmbda: float = 0.9,
        epsilon: float = 0.1,
        trace_type: str = "replacing"
    ):
        self.num_states = num_states
        self.num_actions = num_actions
        self.alpha = alpha
        self.gamma = gamma
        self.lmbda = lmbda
        self.epsilon = epsilon
        self.trace_type = trace_type

        self.Q = np.zeros((num_states, num_actions))
        self.e = np.zeros((num_states, num_actions))  # Eligibility traces

    def select_action(self, state: int) -> int:
        """ε-greedy action selection."""
        if np.random.random() < self.epsilon:
            return np.random.randint(self.num_actions)
        return np.argmax(self.Q[state])

    def reset_traces(self):
        """Reset traces at start of episode."""
        self.e[:, :] = 0

    def update(
        self,
        state: int,
        action: int,
        reward: float,
        next_state: int,
        next_action: int,
        done: bool
    ):
        """SARSA(λ) update with eligibility traces."""
        # Compute TD error using SARSA target
        if done:
            td_error = reward - self.Q[state, action]
        else:
            td_error = (reward + self.gamma * self.Q[next_state, next_action]
                        - self.Q[state, action])

        # Update eligibility for current state-action
        if self.trace_type == "accumulating":
            self.e[state, action] += 1
        elif self.trace_type == "replacing":
            self.e[state, action] = 1

        # Update all Q-values proportional to eligibility
        self.Q += self.alpha * td_error * self.e

        # Decay all traces
        self.e *= self.gamma * self.lmbda


def train_sarsa_lambda(
    env,
    agent: SARSALambdaAgent,
    num_episodes: int = 1000
):
    """Training loop for SARSA(λ)."""
    episode_returns = []

    for episode in range(num_episodes):
        agent.reset_traces()  # Critical! Reset at episode start

        state = env.reset()
        action = agent.select_action(state)
        total_return = 0
        done = False

        while not done:
            next_state, reward, done, _ = env.step(action)
            next_action = agent.select_action(next_state)

            agent.update(state, action, reward, next_state, next_action, done)

            total_return += reward
            state = next_state
            action = next_action

        episode_returns.append(total_return)

    return episode_returns
```

Extending Q-learning with traces is trickier because Q-learning is off-policy. The issue: traces are computed under the behavior policy, but Q-learning targets the greedy policy.
Watkins's Q(λ): Cut traces to zero when an exploratory (non-greedy) action is taken.
$$e(s, a) = \begin{cases} \gamma \lambda e(s, a) + 1 & \text{if } (s, a) = (S_t, A_t) \text{ and } A_t = \arg\max_a Q(S_t, a) \\ 0 & \text{if } A_t \neq \arg\max_a Q(S_t, a) \text{ (cut traces)} \\ \gamma \lambda e(s, a) & \text{otherwise} \end{cases}$$
Rationale: When an exploratory action is taken, the trajectory diverges from the greedy policy that Q-learning evaluates. Continuing traces would propagate credit incorrectly.
Cutting traces at every exploratory action means most traces are zero most of the time (especially with high ε). This significantly reduces the benefits of eligibility traces. For this reason, SARSA(λ) is often preferred over Q(λ) when traces are desired. Alternative formulations like Tree-backup and Retrace address this issue.
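A minimal sketch of one such update step, following the formulation above (tabular arrays `Q` and `e` shaped like those in the SARSA(λ) agent; the function name and its in-place update style are illustrative assumptions, not from the original text):

```python
import numpy as np

def watkins_q_lambda_step(Q, e, state, action, reward, next_state, done,
                          alpha=0.1, gamma=0.99, lmbda=0.9):
    """One Watkins's Q(λ) update; traces are cut after a non-greedy action."""
    # Was the action taken greedy? (Comparing values handles ties.)
    was_greedy = Q[state, action] == np.max(Q[state])

    # Q-learning target: greedy value of the next state
    target = reward if done else reward + gamma * np.max(Q[next_state])
    td_error = target - Q[state, action]

    # Bump the trace for the current state-action pair (accumulating variant)
    e[state, action] += 1

    # Update all Q-values in proportion to their eligibility (in place)
    Q += alpha * td_error * e

    if was_greedy:
        e *= gamma * lmbda  # greedy action: decay traces as usual
    else:
        e[:] = 0.0          # exploratory action: cut all traces to zero
```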
Implementing eligibility traces efficiently requires attention to computational details.
A naive implementation updates all state(-action) pairs every step, costing $O(|S| \times |A|)$ work per step even though most traces are effectively zero.
Sparse update optimization: Most traces decay to near-zero quickly. Track only states with non-negligible traces:
```python
# Track only entries whose trace is still non-negligible
active_pairs = [(s, a) for (s, a) in traces if traces[s, a] > threshold]
for s, a in active_pairs:
    Q[s, a] += alpha * td_error * traces[s, a]
    traces[s, a] *= gamma * lmbda
```
With threshold = 0.01 and γλ = 0.8, traces stay active for about 20 steps before dropping below the threshold. This bounds the number of active pairs at any moment to roughly the 20 most recently visited, regardless of episode length.
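The "about 20 steps" figure follows directly from the decay rate: a trace started at 1 drops below the threshold once $(\gamma\lambda)^n < 0.01$, i.e. $$n > \frac{\ln 0.01}{\ln 0.8} \approx 20.6.$$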
| Parameter | Typical Range | Effect of Increasing |
|---|---|---|
| λ | 0.7–0.95 | More MC-like, lower bias, higher variance |
| α (given λ) | 0.01–0.1 (lower than TD(0)) | Faster learning but potential instability |
| γλ product | 0.6–0.9 | Longer credit assignment horizon |
λ selection heuristic: start around λ ≈ 0.9 and tune within the 0.7–0.95 range; lean toward higher λ when rewards are sparse or delayed (more MC-like), and toward lower λ when returns are noisy and variance is the bigger problem.
Eligibility traces provide the biggest benefit when: (1) rewards are sparse or delayed, (2) state transitions are not too stochastic (deterministic paths benefit most), (3) episodes are long (more room for credit propagation). In environments with dense rewards and short episodes, TD(0) may be equally effective and simpler.
The classical backward view matches the forward view exactly only under offline updating; when updates are applied online, during the episode, the match is approximate. True online TD(λ) achieves exact equivalence online.
With offline updating (accumulating updates and applying them at episode end), classical TD(λ) exactly matches the forward view for linear and tabular value functions. With online updating there is a subtle discrepancy: the forward view's λ-returns are defined with respect to a fixed $V$, while the backward view updates $V$ incrementally at every step, so later TD errors are computed with already-modified values.
True online TD(λ) uses modified Dutch traces:
$$e_t = \gamma \lambda e_{t-1} + \alpha(1 - \gamma \lambda e_{t-1}^\top \phi_t) \phi_t$$
where $\phi_t$ is the feature vector for state $S_t$.
The key difference: the trace update includes a correction term that accounts for the immediate effect of the update on the current state's value.
```python
import numpy as np

class TrueOnlineTDLambda:
    """
    True online TD(λ) - exactly equivalent to the forward view
    for linear function approximation.

    Uses Dutch traces with a correction term.
    """

    def __init__(
        self,
        num_features: int,
        alpha: float = 0.1,
        gamma: float = 0.99,
        lmbda: float = 0.9
    ):
        self.w = np.zeros(num_features)  # Weight vector
        self.alpha = alpha
        self.gamma = gamma
        self.lmbda = lmbda

    def value(self, phi: np.ndarray) -> float:
        """Compute value as w^T * phi."""
        return np.dot(self.w, phi)

    def reset_traces(self):
        """Reset traces at episode start."""
        self.e = np.zeros_like(self.w)
        self.V_old = 0.0

    def update(
        self,
        phi: np.ndarray,       # Current state features
        reward: float,
        phi_next: np.ndarray,  # Next state features (or zeros if terminal)
        done: bool
    ):
        """
        True online TD(λ) update.

        Uses Dutch traces for exact equivalence with the forward view.
        """
        V = self.value(phi)
        V_next = 0.0 if done else self.value(phi_next)

        # TD error (using current weights)
        delta = reward + self.gamma * V_next - V

        # Dutch trace update:
        # e = γλe + α(1 - γλ * e^T * φ) * φ
        e_dot_phi = np.dot(self.e, phi)
        self.e = (self.gamma * self.lmbda * self.e
                  + self.alpha * (1 - self.gamma * self.lmbda * e_dot_phi) * phi)

        # Weight update with correction term:
        # Δw = δ·e + (V - V_old)(e - α·φ)
        self.w += delta * self.e + (V - self.V_old) * (self.e - self.alpha * phi)

        # Store old value for next update
        self.V_old = V_next
```

Use true online TD(λ) when: you are using linear function approximation, exact equivalence with the forward view matters, or the step size and λ are large enough that the discrepancy of classical traces becomes noticeable.
Classical TD(λ) is often sufficient when: the problem is tabular, the step size is small (so the online/offline discrepancy is negligible), or implementation simplicity matters more than exactness.
With neural networks, eligibility traces are less common because: (1) no exact forward-view equivalence exists, (2) experience replay (common in deep RL) doesn't naturally support traces, and (3) updating traces for every network parameter is computationally costly. However, researchers have explored 'expected eligibility traces' and other adaptations for deep RL.
Eligibility traces are the elegant mechanism for assigning credit across time, unifying TD learning and Monte Carlo methods. Let's consolidate the key insights:

- Traces are an exponentially decaying memory of visited states (or state-action pairs), and each TD error updates every entry in proportion to its trace.
- λ interpolates between TD(0) (λ = 0) and Monte Carlo (λ = 1), trading bias against variance.
- Accumulating traces reflect frequency and recency; replacing traces reflect recency only and are more stable in loops.
- The backward view (traces) reproduces the forward view (λ-returns): exactly under offline updating, approximately online, and exactly online with true online TD(λ) and Dutch traces.
- Traces extend to control as SARSA(λ) and Watkins's Q(λ), which cuts traces after exploratory actions.
- Reset traces at the start of every episode, and exploit their sparsity for efficient implementation.
Module Complete: You've now mastered the core value-based methods of reinforcement learning: temporal-difference learning and TD errors, Q-learning and SARSA, their convergence properties, and eligibility traces with TD(λ).
These form the foundation for deep RL methods covered in Module 5, where function approximation extends these ideas to large-scale problems.
Congratulations! You've completed Module 3: Value-Based Methods. You now understand how agents can learn optimal behavior from experience using temporal-difference learning, from the foundational algorithms (Q-learning, SARSA) through the theoretical underpinnings (TD errors, convergence) to advanced techniques (eligibility traces, TD(λ)). This knowledge forms the bedrock for all value-based reinforcement learning.