Imagine predicting tomorrow's weather. One approach: wait until tomorrow, see what happens, then adjust your model. Another approach: make a prediction for tomorrow now, then a few hours later, with fresher information, make another prediction for the same time and compare the two. If they disagree, learn from the discrepancy immediately, without waiting for the actual outcome.
This is the essence of Temporal Difference (TD) Learning: learning from the difference between successive predictions, rather than waiting for final outcomes. It's a profound idea that bridges the immediate feedback of supervised learning with the delayed rewards of reinforcement learning.
TD learning combines the best of two worlds: like Monte Carlo methods, it learns directly from raw experience without a model of the environment; like dynamic programming, it bootstraps, updating estimates from other estimates rather than waiting for a final outcome.
The result is an algorithm family that learns faster than Monte Carlo (updates every step, not episode end) and requires no model like DP. TD methods are the foundation of modern RL success stories, from game playing to robotics.
This page provides deep theoretical foundations of TD learning: TD prediction for policy evaluation, the TD error as a learning signal, the bias-variance trade-off, convergence theory, backward vs forward views, and the connections to neuroscience that inspired the development of TD methods.
At its heart, TD learning is about improving predictions using other predictions. Let's build intuition before formalism.
Suppose you predict that state $s$ has value $V(s) = 10$ (expected future reward). From $s$, you take action $a$, receive reward $r = 2$, and transition to $s'$ where your current estimate is $V(s') = 9$.
Question: Was your prediction $V(s) = 10$ accurate?
TD's answer: Compare it to what you now believe: $$\text{New estimate for } V(s) = r + \gamma V(s') = 2 + 0.99 \times 9 = 10.91$$
You predicted 10, but your more informed estimate says 10.91. This discrepancy—the TD error—tells you to increase $V(s)$.
The key insight: your estimate of V(s') is based on ALL future experience from s' onward that you've seen so far. It encodes more information than just your direct estimates of V(s). By comparing V(s) to r + γV(s'), you're essentially asking: 'Does my prediction for s match what I'd predict if I knew the immediate reward and my prediction for s'?'
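As a quick numeric check of the example above, here is the same arithmetic as a few lines of Python (the step size $\alpha = 0.1$ is a hypothetical choice, not part of the example):

```python
# Worked example: V(s) = 10, reward r = 2, V(s') = 9, gamma = 0.99
V_s, V_s_next = 10.0, 9.0
r, gamma, alpha = 2.0, 0.99, 0.1   # alpha is an assumed step size

td_target = r + gamma * V_s_next   # 2 + 0.99 * 9 = 10.91
td_error = td_target - V_s         # 0.91 > 0: the prediction was too low
V_s = V_s + alpha * td_error       # 10.091: nudge V(s) toward the target
print(td_target, td_error, V_s)
```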
Monte Carlo: Wait until episode ends, compute actual return $G_t = r_1 + \gamma r_2 + \gamma^2 r_3 + \ldots$, then update $V(s_t) \to G_t$.
Dynamic Programming: Compute $V(s) = \sum_{a} \pi(a|s) \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) V(s') \right]$
Temporal Difference: Update $V(s_t) \to r_{t+1} + \gamma V(s_{t+1})$ after each step.
| Property | Monte Carlo | Dynamic Programming | TD Learning |
|---|---|---|---|
| Model required? | No | Yes | No |
| Bootstraps? | No | Yes | Yes |
| Works online? | No (episode end) | N/A | Yes (each step) |
| Bias | None (unbiased) | None | Yes (from V estimates) |
| Variance | High | None | Low |
| Converges to | V^π exactly | V^π exactly | V^π (under conditions) |
Before learning optimal behavior (control), we study prediction: estimating $V^\pi$ for a fixed policy $\pi$. This is the purest form of TD learning.
The simplest TD method, TD(0), updates after each step:
$$V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$$
Components:

- $\alpha$ is the step size (learning rate).
- $R_{t+1} + \gamma V(S_{t+1})$ is the TD target, a sample-based estimate of the return from $S_t$.
- $R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is the TD error $\delta_t$.

The update moves $V(S_t)$ toward the TD target by a fraction $\alpha$.
```python
import numpy as np
from typing import List, Tuple, Callable


def td_prediction(
    env,
    policy: Callable[[int], int],  # Deterministic policy: state -> action
    num_episodes: int = 1000,
    alpha: float = 0.1,
    gamma: float = 0.99,
    num_states: int = None
) -> Tuple[np.ndarray, List[List[float]]]:
    """
    TD(0) Prediction: Estimate V^π for a given policy π.

    This is pure prediction - no policy improvement, just evaluation.

    Args:
        env: Environment with reset() and step(action) methods
        policy: The fixed policy to evaluate
        num_episodes: Number of episodes of experience
        alpha: Learning rate
        gamma: Discount factor
        num_states: Size of state space (for initialization)

    Returns:
        V: Estimated value function
        td_errors_per_episode: TD errors for analysis
    """
    V = np.zeros(num_states)
    td_errors_per_episode = []

    for episode in range(num_episodes):
        state = env.reset()
        done = False
        episode_td_errors = []

        while not done:
            # Follow the fixed policy
            action = policy(state)
            next_state, reward, done, _ = env.step(action)

            # Compute TD error
            if done:
                td_target = reward  # Terminal: no future value
            else:
                td_target = reward + gamma * V[next_state]
            td_error = td_target - V[state]

            # TD(0) update: move V(state) toward td_target
            V[state] = V[state] + alpha * td_error

            episode_td_errors.append(td_error)
            state = next_state

        td_errors_per_episode.append(episode_td_errors)

    return V, td_errors_per_episode


def monte_carlo_prediction(
    env,
    policy: Callable[[int], int],
    num_episodes: int = 1000,
    alpha: float = 0.1,
    gamma: float = 0.99,
    num_states: int = None
) -> np.ndarray:
    """
    Monte Carlo Prediction for comparison with TD.

    Every-visit MC: average returns from all visits to each state.
    """
    V = np.zeros(num_states)
    returns_count = np.zeros(num_states)

    for episode in range(num_episodes):
        # Generate a complete episode
        states = []
        rewards = []
        state = env.reset()
        done = False

        while not done:
            states.append(state)
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            rewards.append(reward)
            state = next_state

        # Compute returns and update (backward through the episode)
        G = 0.0
        for t in range(len(states) - 1, -1, -1):
            G = rewards[t] + gamma * G
            s = states[t]
            returns_count[s] += 1
            # Incremental mean update
            V[s] = V[s] + (G - V[s]) / returns_count[s]

    return V


# Example: Compare TD and MC on a random walk
def compare_td_mc():
    """
    Classic comparison: 19-state random walk.
    Shows TD typically learns faster than MC.
    """
    import matplotlib.pyplot as plt

    class RandomWalk:
        """19-state random walk: start at center, terminate at either end."""

        def __init__(self, n_states=19):
            self.n_states = n_states
            self.center = n_states // 2

        def reset(self):
            self.state = self.center
            return self.state

        def step(self, action=None):
            # Random walk: 50% left, 50% right
            if np.random.random() < 0.5:
                self.state -= 1  # Left
            else:
                self.state += 1  # Right

            done = self.state == 0 or self.state == self.n_states - 1
            reward = 1 if self.state == self.n_states - 1 else (0 if not done else -1)
            return self.state, reward, done, {}

    env = RandomWalk(19)
    true_v = np.linspace(-1, 1, 19)  # True values for the random walk

    # ... training and comparison code would follow
```

TD(0) can be viewed through the lens of certainty equivalence: it behaves as if the estimated model were the true model.
Given experience $(s_t, r_{t+1}, s_{t+1})$, TD(0) updates as if:

- the observed reward $r_{t+1}$ were exactly the expected immediate reward from $s_t$, and
- the observed next state $s_{t+1}$ were the only possible successor (transition probability 1).
Over many updates, these "certain" estimates average out to the true expectations. This is why TD converges to $V^\pi$ despite treating single samples as if they were certain.
The TD error $\delta_t$ is the heart of TD learning:
$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$
This quantity has remarkable properties that explain why TD works and connect to neuroscience.
$\delta_t$ measures the discrepancy between:

- what you predicted: $V(S_t)$, and
- a better-informed one-step estimate: $R_{t+1} + \gamma V(S_{t+1})$.

If $\delta_t > 0$: things were better than expected (increase $V(S_t)$). If $\delta_t < 0$: things were worse than expected (decrease $V(S_t)$). If $\delta_t = 0$: the prediction was consistent with the one-step lookahead (no change needed).
Neuroscience research by Wolfram Schultz (1997) discovered that dopamine neurons in the brain fire in a pattern strikingly similar to TD errors. They fire when rewards are unexpectedly good (δ > 0), are inhibited when rewards are unexpectedly bad (δ < 0), and remain at baseline for expected outcomes (δ ≈ 0). This suggests the brain may implement something like TD learning!
A beautiful result: the sum of TD errors along a trajectory equals the Monte Carlo return minus the initial value estimate:
$$\sum_{k=t}^{T-1} \gamma^{k-t} \delta_k = G_t - V(S_t)$$
Proof (holding $V$ fixed over the episode):
$$
\begin{aligned}
\sum_{k=t}^{T-1} \gamma^{k-t} \delta_k
&= \sum_{k=t}^{T-1} \gamma^{k-t} \left[ R_{k+1} + \gamma V(S_{k+1}) - V(S_k) \right] \\
&= \sum_{k=t}^{T-1} \gamma^{k-t} R_{k+1} + \sum_{k=t+1}^{T} \gamma^{k-t} V(S_k) - \sum_{k=t}^{T-1} \gamma^{k-t} V(S_k) \\
&= G_t + \gamma^{T-t} V(S_T) - V(S_t) \quad \text{(telescoping sum)} \\
&= G_t - V(S_t) \quad \text{(since } V(S_T) = 0 \text{ at the terminal state)}.
\end{aligned}
$$
This reveals that the Monte Carlo error $G_t - V(S_t)$ is exactly the discounted sum of one-step TD errors: TD and MC target the same quantity, just decomposed differently.
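As a sanity check, the identity is easy to verify numerically on a random episode with a fixed value table (a minimal sketch; the rewards and values below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, T = 0.9, 6
rewards = rng.normal(size=T)              # R_1, ..., R_T
V = np.append(rng.normal(size=T), 0.0)    # V(S_0), ..., V(S_{T-1}), and V(S_T) = 0

# One-step TD errors delta_k = R_{k+1} + gamma * V(S_{k+1}) - V(S_k)
deltas = [rewards[k] + gamma * V[k + 1] - V[k] for k in range(T)]

discounted_td_sum = sum(gamma ** k * d for k, d in enumerate(deltas))
G0 = sum(gamma ** k * r for k, r in enumerate(rewards))    # Monte Carlo return G_0
assert np.isclose(discounted_td_sum, G0 - V[0])            # the identity holds
```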
The TD error has variance from:

- one random reward, $R_{t+1}$, and
- one random transition, $S_t \to S_{t+1}$ (which determines the bootstrap term $V(S_{t+1})$).

Compare to the MC error $G_t - V(S_t)$:

- it accumulates randomness from every reward and every transition until the episode ends.
This variance reduction is TD's key advantage, despite the bias from using $V(s')$ instead of the true value.
| Method | Variance Sources | Total Variance |
|---|---|---|
| TD(0) | 1 reward, 1 transition, V(s') estimate | Low |
| n-step TD | n rewards, n transitions, V(s_{t+n}) estimate | Medium |
| Monte Carlo | T-t rewards, T-t transitions | High |
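A toy simulation makes the gap concrete. In this sketch the only randomness is reward noise over a 10-step episode and the bootstrap value is a fixed (imperfect) estimate; all numbers are illustrative assumptions, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, episodes, horizon = 0.99, 10_000, 10
td_targets, mc_returns = [], []

for _ in range(episodes):
    rewards = rng.normal(loc=1.0, scale=1.0, size=horizon)   # noisy rewards
    V_next = 9.0   # fixed bootstrap estimate for the successor state
    td_targets.append(rewards[0] + gamma * V_next)            # one reward + bootstrap
    mc_returns.append(sum(gamma ** k * r for k, r in enumerate(rewards)))

# The TD target's variance comes from a single reward; the MC return
# accumulates the noise of all ten rewards.
print(np.var(td_targets), np.var(mc_returns))
```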
TD learning provably converges under appropriate conditions. The theory is more subtle than it might appear because TD uses biased estimates (bootstrapping).
Theorem: For tabular policy evaluation (estimating $V^\pi$), TD(0) converges to $V^\pi$ with probability 1 if:

1. Every state is visited infinitely often, and
2. The step sizes satisfy the Robbins-Monro conditions $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$.

Intuition: The first condition ensures every state keeps being updated. The second ensures the updates eventually become small enough to damp out noise-driven oscillations, while remaining large enough in total to reach any value.
With constant α (common in practice), TD doesn't converge in the formal sense—it oscillates around V^π indefinitely. However, the oscillations are bounded, and average performance is close to V^π. For practical purposes, this is often acceptable, especially in non-stationary environments where tracking changes is desirable.
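For reference, a minimal sketch of TD(0) with a per-state step size $\alpha_t(s) = 1/N(s)$, which satisfies both conditions above; it assumes the same `env`/`policy` interface as `td_prediction` earlier:

```python
import numpy as np

def td_prediction_decaying_alpha(env, policy, num_states, num_episodes=1000, gamma=0.99):
    """TD(0) with step size 1/N(s): sum(alpha) diverges, sum(alpha^2) converges."""
    V = np.zeros(num_states)
    visits = np.zeros(num_states)
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            next_state, reward, done, _ = env.step(policy(state))
            visits[state] += 1
            alpha = 1.0 / visits[state]                     # Robbins-Monro schedule
            target = reward if done else reward + gamma * V[next_state]
            V[state] += alpha * (target - V[state])
            state = next_state
    return V
```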
The bias in TD comes from using $V(S_{t+1})$ instead of true $V^\pi(S_{t+1})$. Why doesn't this cause systematic error?
Key insight: The bias decreases as learning progresses. As $V$ approaches $V^\pi$, the bootstrap estimate $V(S_{t+1})$ approaches $V^\pi(S_{t+1})$, and the bias vanishes.
Formally, define the Bellman operator for policy $\pi$: $$(T^\pi V)(s) = \mathbb{E}_\pi\left[ R_{t+1} + \gamma V(S_{t+1}) \mid S_t = s \right]$$
This is a $\gamma$-contraction in the max norm: $\|T^\pi V - T^\pi U\|_\infty \leq \gamma \|V - U\|_\infty$.
TD(0) approximates applying $T^\pi$ using samples. The contraction property ensures that even with noisy samples, the iteration converges.
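The contraction property can be checked numerically on a small synthetic MDP (the random $P^\pi$ and $R$ below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 5, 0.9
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)   # row-stochastic transition matrix under pi
R = rng.random(n)                   # expected immediate rewards under pi

def T_pi(V):
    # (T^pi V)(s) = E_pi[R + gamma * V(S') | S = s]
    return R + gamma * P @ V

V, U = rng.random(n), rng.random(n)
lhs = np.max(np.abs(T_pi(V) - T_pi(U)))   # ||T^pi V - T^pi U||_inf
rhs = gamma * np.max(np.abs(V - U))       # gamma * ||V - U||_inf
assert lhs <= rhs + 1e-12
```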
Which converges faster? It depends on the problem!
TD is faster when:

- the state representation is genuinely Markov, so bootstrapping from $V(s')$ exploits real structure,
- episodes are long, so waiting for complete returns delays learning,
- returns are noisy, so the low-variance TD target pays off.

MC is faster when:

- the Markov property is weak (e.g., state aliasing), so bootstrap targets are misleading,
- episodes are short and returns have low variance,
- value estimates are poorly initialized, making early bootstrap targets unreliable.
Empirical finding (Sutton, 1988): For many problems, especially those with multi-step dependencies and long episodes, TD learns correct values significantly faster than MC.
One-step TD (TD(0)) and full-return MC are endpoints of a spectrum. n-step TD fills the continuum.
The n-step return from time $t$: $$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$$
This uses $n$ actual rewards and bootstraps from step $t+n$.
n-Step TD Update: $$V(S_t) \leftarrow V(S_t) + \alpha \left[ G_{t:t+n} - V(S_t) \right]$$
As $n$ increases:

- Bias decreases: the bootstrap term $\gamma^n V(S_{t+n})$ is discounted more heavily, so errors in the value estimate matter less.
- Variance increases: more random rewards and transitions enter the return.
The optimal $n$ balances these effects and is problem-dependent.
| n | Name | Bias | Variance | Update Delay |
|---|---|---|---|---|
| 1 | TD(0) | High (V estimate) | Low | 1 step |
| 2-5 | Short n-step | Medium | Medium | n steps |
| 10-20 | Long n-step | Low | High | n steps |
| ∞ | Monte Carlo | None | Highest | Episode end |
```python
import numpy as np
from collections import deque


def n_step_td_prediction(
    env,
    policy,
    n: int = 4,
    num_episodes: int = 1000,
    alpha: float = 0.1,
    gamma: float = 0.99,
    num_states: int = None
) -> np.ndarray:
    """
    n-step TD Prediction for policy evaluation.

    Uses n actual rewards + bootstrap from V(S_{t+n}).
    Larger n = less bias, more variance.
    """
    V = np.zeros(num_states)

    for episode in range(num_episodes):
        # Sliding buffers for the n-step window
        state_buffer = deque(maxlen=n + 1)   # S_tau, ..., S_{tau+n}
        reward_buffer = deque(maxlen=n)      # R_{tau+1}, ..., R_{tau+n}

        state = env.reset()
        state_buffer.append(state)
        done = False
        t = 0
        T = float('inf')  # Episode length (unknown until we hit terminal)

        # Process the episode
        while True:
            if t < T:
                action = policy(state)
                next_state, reward, done, _ = env.step(action)
                state_buffer.append(next_state)
                reward_buffer.append(reward)
                if done:
                    T = t + 1
                state = next_state

            # Update time tau lags n - 1 steps behind the current time
            tau = t - n + 1

            if t >= T and tau >= 1:
                # Past the terminal step the deques no longer slide on append,
                # so advance the window by hand: keep state_buffer[0] = S_tau
                # and reward_buffer = [R_{tau+1}, ..., R_T]
                state_buffer.popleft()
                reward_buffer.popleft()

            if tau >= 0:
                # Compute the n-step return G_{tau:tau+n}
                G = 0.0
                # Sum rewards from tau+1 to min(tau+n, T)
                for i, r in enumerate(reward_buffer):
                    G += (gamma ** i) * r

                # Bootstrap from V(S_{tau+n}) if we have not reached terminal
                if tau + n < T:
                    G += (gamma ** n) * V[state_buffer[-1]]

                # Update V(S_tau)
                update_state = state_buffer[0]
                V[update_state] += alpha * (G - V[update_state])

            t += 1

            # Stop once every state through S_{T-1} has been updated
            if tau >= T - 1:
                break

    return V


def compare_n_step_values():
    """
    Compare different n values on a simple problem
    to visualize the bias-variance trade-off.
    """
    # Results typically show:
    # - n=1 (TD(0)): Fast initial learning, may plateau at a biased value
    # - n=medium: Often the best final performance
    # - n=large/MC: Slow early learning, lower final error
    pass
```

Empirically, n ∈ [4, 8] often works well across a variety of domains. For very sparse rewards (one reward at episode end), larger n speeds learning dramatically. For dense rewards, smaller n suffices. When in doubt, try TD(λ) instead, which automatically averages over all n values.
Rather than choosing a single $n$, why not use a weighted average of all n-step returns? This is TD(λ), one of the most elegant ideas in RL.
The λ-return $G_t^\lambda$ is a weighted average of all n-step returns:
$$G_t^\lambda = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G_{t:t+n} + \lambda^{T-t-1} G_t$$
The weights $(1-\lambda)\lambda^{n-1}$ are exponentially decaying, summing to 1.
Special cases:

- $\lambda = 0$: all weight falls on the one-step return, so $G_t^\lambda = G_{t:t+1}$ and we recover TD(0).
- $\lambda = 1$: all weight falls on the full return $G_t$, and we recover Monte Carlo.
Why exponential decay? The weight on the n-step return is $(1-\lambda)\lambda^{n-1}$: each additional step multiplies the weight by $\lambda$. For $\lambda = 0.9$, the weights fall off as shown in the table below.
This gives preference to shorter (lower variance) returns while still incorporating longer-horizon information.
| n-step | Weight | Cumulative |
|---|---|---|
| 1 | 0.10 | 0.10 |
| 2 | 0.09 | 0.19 |
| 5 | 0.066 | 0.41 |
| 10 | 0.039 | 0.65 |
| 20 | 0.014 | 0.86 |
| 50 | 0.0006 | 0.995 |
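The table above can be reproduced directly from the weight formula:

```python
lam = 0.9
for n in (1, 2, 5, 10, 20, 50):
    weight = (1 - lam) * lam ** (n - 1)   # weight on the n-step return
    cumulative = 1 - lam ** n             # total weight on returns of length <= n
    print(f"n={n:2d}  weight={weight:.4f}  cumulative={cumulative:.3f}")
```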
Computing the λ-return requires knowing future rewards (forward view). This seems impractical online! But there's an equivalent backward view using eligibility traces that updates at every step. The next page derives this mathematically and shows they're equivalent.
TD(λ) interpolates between TD(0) (low variance, high bias) and MC (high variance, no bias): intermediate values of $\lambda$ trade off the two smoothly.
Unlike fixed n-step methods, λ-returns are robust: even if the optimal n varies across states, λ-averaging tends to work reasonably everywhere.
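For completeness, here is a minimal offline (forward-view) computation of $G_t^\lambda$ from a stored episode; `rewards` and `values` are assumed to hold $R_{t+1}, \ldots, R_T$ and $V(S_{t+1}), \ldots, V(S_T)$, with $V(S_T) = 0$ at a terminal state:

```python
def lambda_return(rewards, values, gamma, lam):
    """Offline forward-view lambda-return for the state at the start of this tail."""
    T = len(rewards)
    # n-step returns G_{t:t+n} for n = 1..T (the last one is the full MC return)
    n_step_returns = []
    G = 0.0
    for n in range(1, T + 1):
        G += gamma ** (n - 1) * rewards[n - 1]                  # add gamma^{n-1} R_{t+n}
        n_step_returns.append(G + gamma ** n * values[n - 1])   # bootstrap at S_{t+n}
    # (1 - lam) * sum_{n=1}^{T-1} lam^{n-1} G_{t:t+n}  +  lam^{T-1} G_t
    out = sum((1 - lam) * lam ** (n - 1) * n_step_returns[n - 1] for n in range(1, T))
    return out + lam ** (T - 1) * n_step_returns[-1]

# lam = 0 recovers the TD(0) target, lam = 1 recovers the Monte Carlo return
print(lambda_return([1.0, 0.0, 2.0], [0.5, 0.3, 0.0], gamma=0.99, lam=0.9))
```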
So far, we've discussed online TD: updating after each step of experience. Batch TD processes a fixed dataset repeatedly until convergence, revealing theoretical properties more clearly.
Given a batch of transitions $\{(s_i, r_i, s'_i)\}_{i=1}^{N}$, batch TD(0) repeatedly sweeps through the data, applying the TD(0) update to each transition with a small step size, until the value function stops changing.
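A minimal sketch of this procedure (the `(s, r, s_next, done)` tuple format is an assumption for illustration):

```python
import numpy as np

def batch_td0(transitions, num_states, alpha=0.01, gamma=0.99, max_sweeps=10_000):
    """Repeatedly apply TD(0) updates over a fixed batch until V stops changing."""
    V = np.zeros(num_states)
    for _ in range(max_sweeps):
        V_old = V.copy()
        for s, r, s_next, done in transitions:
            target = r if done else r + gamma * V[s_next]
            V[s] += alpha * (target - V[s])
        if np.max(np.abs(V - V_old)) < 1e-10:   # reached the batch fixed point
            break
    return V
```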
Batch TD converges to a well-defined fixed point, unlike online TD which depends on visit order.
Batch TD(0) converges to the value function that would be exactly correct if the maximum likelihood MDP estimated from the data were the true model.
From the batch, estimate:

- $\hat{P}(s' \mid s)$: the fraction of observed transitions from $s$ that went to $s'$,
- $\hat{R}(s)$: the average immediate reward observed from $s$.
Then batch TD converges to $V$ satisfying: $$V(s) = \hat{R}(s) + \gamma \sum_{s'} \hat{P}(s' | s) V(s')$$
This is certainty equivalence: TD behaves as if the estimated model is truth.
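Equivalently, one can build the maximum likelihood model from the batch and solve the linear system above directly; a sketch (same assumed transition format as before, with episode ends treated as transitions to a zero-value terminal state):

```python
import numpy as np

def certainty_equivalence_values(transitions, num_states, gamma=0.99):
    """Estimate P_hat and R_hat from the batch, then solve V = R_hat + gamma * P_hat V."""
    counts = np.zeros((num_states, num_states))
    reward_sum = np.zeros(num_states)
    visits = np.zeros(num_states)
    for s, r, s_next, done in transitions:
        visits[s] += 1
        reward_sum[s] += r
        if not done:
            counts[s, s_next] += 1   # terminal transitions leave their mass on V = 0
    denom = np.maximum(visits, 1)
    P_hat = counts / denom[:, None]   # ML transition estimates
    R_hat = reward_sum / denom        # ML expected immediate rewards
    # Solve (I - gamma * P_hat) V = R_hat
    return np.linalg.solve(np.eye(num_states) - gamma * P_hat, R_hat)
```

Given the same batch, running `batch_td0` above for enough sweeps should approach these same values, which is the certainty-equivalence result.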
Batch MC simply averages observed returns for each state: V(s) = mean(G_t for visits to s). This ignores the MDP structure—each return is treated independently. Batch TD exploits transition structure, often producing better estimates from the same data, especially when data is limited.
Consider a small MDP with states A, B, and a terminal state (the classic "You are the Predictor" example from Sutton and Barto), where we observe 8 episodes: one episode goes A → B with reward 0 and then B terminates with reward 0; in six episodes B terminates with reward 1; in one episode B terminates with reward 0.

MC estimate for V(A): the only return ever observed from A is 0, so $V(A) = 0$.

TD estimate for V(A): every visit to A led to B, and the batch gives $V(B) = 6/8 = 0.75$, so batch TD gives $V(A) = 0 + \gamma \cdot 0.75 \approx 0.75$.
TD's use of the Markov structure extracts more information from limited data.
One of the most remarkable connections in computational neuroscience is between TD learning and the brain's reward system.
In the mid-1990s, Wolfram Schultz and colleagues made a striking discovery: dopamine neurons in the midbrain fire in patterns that closely match TD errors.
Experimental Setup: Monkeys learn that a light predicts juice reward after a delay.
Dopamine Response Evolution:

- Early in training: neurons fire when the juice is delivered (the reward is unexpected).
- After learning: firing shifts to the light, the earliest reliable predictor; the fully predicted juice itself evokes little extra response.
- When the predicted juice is omitted: firing dips below baseline at the time the reward was expected.
This matches TD error perfectly: positive firing for better-than-expected, negative (inhibition) for worse-than-expected, baseline for as-expected.
| Situation | Dopamine Activity | TD Error δ |
|---|---|---|
| Unexpected reward | Strong firing | Positive (r > expected) |
| Expected reward received | Baseline | Zero (r = expected) |
| Expected reward omitted | Inhibition (below baseline) | Negative (r < expected) |
| Predictive cue (learned) | Firing at cue, not reward | δ shifts to earliest predictor |
This correspondence suggests the brain may implement something like TD learning for reward-based learning. This has influenced understanding of addiction (hijacking the reward system), learning disorders, and therapeutic approaches. It also validates RL theory: nature converged on a similar solution!
Neuroscience also finds evidence for an actor-critic decomposition:

- a critic-like system (associated with the ventral striatum) that learns value predictions, and
- an actor-like system (associated with the dorsal striatum) that learns action selection,

with dopaminergic TD-error-like signals training both.
This mirrors the actor-critic RL architecture, where a critic learns $V$ (or $Q$) and TD errors train both the value function and the policy.
The parallel is not exact—brains are far more complex—but the high-level similarity is striking and has driven research in both fields.
Temporal difference learning represents a breakthrough in understanding how to learn from sequential experience. Let's consolidate the key insights:

- TD learns from the difference between successive predictions, updating after every step instead of waiting for final outcomes.
- The TD error $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is the learning signal; summed along a trajectory, TD errors recover the Monte Carlo error.
- Bootstrapping trades bias for variance: TD targets are biased by the current value estimates but far less noisy than full returns.
- TD(0) converges to $V^\pi$ under standard step-size conditions; with a constant step size it tracks $V^\pi$ within a bounded neighborhood.
- n-step returns and the λ-return span a continuum between TD(0) and Monte Carlo.
- Batch TD converges to the certainty-equivalence solution, exploiting the Markov structure of the data.
- Dopamine neuron activity closely resembles the TD error, linking TD learning to the brain's reward system.
What's next: We've established the theory of TD learning and seen it in action through Q-learning and SARSA. The next page introduces eligibility traces, the mechanism that makes TD(λ) computationally tractable and provides an elegant backward-view implementation of the forward-view λ-return.
You've mastered the theoretical foundations of temporal difference learning—the engine driving Q-learning, SARSA, and their variants. This understanding of bootstrapping, TD errors, and the bias-variance trade-off provides the conceptual foundation for all modern value-based RL.