Imagine predicting tomorrow's weather. One approach: wait until tomorrow, see what happens, then adjust your model. Another approach: make a prediction for tomorrow now, then a few hours later, with fresher information, make another prediction for the same time and compare the two. If they disagree, learn from the discrepancy immediately, without waiting for the actual outcome.
This is the essence of Temporal Difference (TD) Learning: learning from the difference between successive predictions, rather than waiting for final outcomes. It's a profound idea that bridges the immediate feedback of supervised learning with the delayed rewards of reinforcement learning.
TD learning combines the best of two worlds: like Monte Carlo methods, it learns directly from raw experience without a model of the environment; like dynamic programming, it bootstraps, updating estimates from other estimates rather than waiting for a final outcome.
The result is an algorithm family that learns faster than Monte Carlo (updates every step, not episode end) and requires no model like DP. TD methods are the foundation of modern RL success stories, from game playing to robotics.
This page provides deep theoretical foundations of TD learning: TD prediction for policy evaluation, the TD error as a learning signal, the bias-variance trade-off, convergence theory, backward vs forward views, and the connections to neuroscience that inspired the development of TD methods.
At its heart, TD learning is about improving predictions using other predictions. Let's build intuition before formalism.
Suppose you predict that state $s$ has value $V(s) = 10$ (expected future reward). From $s$, you take action $a$, receive reward $r = 2$, and transition to $s'$ where your current estimate is $V(s') = 9$.
Question: Was your prediction $V(s) = 10$ accurate?
TD's answer: Compare it to what you now believe: $$\text{New estimate for } V(s) = r + \gamma V(s') = 2 + 0.99 \times 9 = 10.91$$
You predicted 10, but your more informed estimate says 10.91. This discrepancy—the TD error—tells you to increase $V(s)$.
The key insight: your estimate of V(s') is based on ALL future experience from s' onward that you've seen so far. It encodes more information than just your direct estimates of V(s). By comparing V(s) to r + γV(s'), you're essentially asking: 'Does my prediction for s match what I'd predict if I knew the immediate reward and my prediction for s'?'
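As a quick numeric check of the example above, here is the same arithmetic as a few lines of Python (the step size $\alpha = 0.1$ is a hypothetical choice, not part of the example):

```python
# Worked example: V(s) = 10, reward r = 2, V(s') = 9, gamma = 0.99
V_s, V_s_next = 10.0, 9.0
r, gamma, alpha = 2.0, 0.99, 0.1   # alpha is an assumed step size

td_target = r + gamma * V_s_next   # 2 + 0.99 * 9 = 10.91
td_error = td_target - V_s         # 0.91 > 0: the prediction was too low
V_s = V_s + alpha * td_error       # 10.091: nudge V(s) toward the target
print(td_target, td_error, V_s)
```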
Monte Carlo: Wait until episode ends, compute actual return $G_t = r_1 + \gamma r_2 + \gamma^2 r_3 + \ldots$, then update $V(s_t) \to G_t$.
Dynamic Programming: Compute $V(s) = \sum_{a} \pi(a|s) \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) V(s') \right]$
Temporal Difference: Update $V(s_t) \to r_{t+1} + \gamma V(s_{t+1})$ after each step.
| Property | Monte Carlo | Dynamic Programming | TD Learning |
|---|---|---|---|
| Model required? | No | Yes | No |
| Bootstraps? | No | Yes | Yes |
| Works online? | No (episode end) | N/A | Yes (each step) |
| Bias | None (unbiased) | None | Yes (from V estimates) |
| Variance | High | None | Low |
| Converges to | V^π exactly | V^π exactly | V^π (under conditions) |
Before learning optimal behavior (control), we study prediction: estimating $V^\pi$ for a fixed policy $\pi$. This is the purest form of TD learning.
The simplest TD method, TD(0), updates after each step:
$$V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$$
Components:

- $\alpha$ is the step size (learning rate).
- $R_{t+1} + \gamma V(S_{t+1})$ is the TD target, a sample-based estimate of the return from $S_t$.
- $R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is the TD error $\delta_t$.

The update moves $V(S_t)$ toward the TD target by a fraction $\alpha$.
```python
import numpy as np
from typing import List, Tuple, Callable


def td_prediction(
    env,
    policy: Callable[[int], int],  # Deterministic policy: state -> action
    num_episodes: int = 1000,
    alpha: float = 0.1,
    gamma: float = 0.99,
    num_states: int = None
) -> Tuple[np.ndarray, List[List[float]]]:
    """
    TD(0) Prediction: Estimate V^π for a given policy π.

    This is pure prediction - no policy improvement, just evaluation.

    Args:
        env: Environment with reset() and step(action) methods
        policy: The fixed policy to evaluate
        num_episodes: Number of episodes of experience
        alpha: Learning rate
        gamma: Discount factor
        num_states: Size of state space (for initialization)

    Returns:
        V: Estimated value function
        td_errors_per_episode: TD errors for analysis
    """
    V = np.zeros(num_states)
    td_errors_per_episode = []

    for episode in range(num_episodes):
        state = env.reset()
        done = False
        episode_td_errors = []

        while not done:
            # Follow the fixed policy
            action = policy(state)
            next_state, reward, done, _ = env.step(action)

            # Compute TD error
            if done:
                td_target = reward  # Terminal: no future value
            else:
                td_target = reward + gamma * V[next_state]
            td_error = td_target - V[state]

            # TD(0) update: move V(state) toward td_target
            V[state] = V[state] + alpha * td_error

            episode_td_errors.append(td_error)
            state = next_state

        td_errors_per_episode.append(episode_td_errors)

    return V, td_errors_per_episode


def monte_carlo_prediction(
    env,
    policy: Callable[[int], int],
    num_episodes: int = 1000,
    alpha: float = 0.1,
    gamma: float = 0.99,
    num_states: int = None
) -> np.ndarray:
    """
    Monte Carlo Prediction for comparison with TD.

    Every-visit MC: average returns from all visits to each state.
    """
    V = np.zeros(num_states)
    returns_count = np.zeros(num_states)

    for episode in range(num_episodes):
        # Generate a complete episode
        states = []
        rewards = []
        state = env.reset()
        done = False

        while not done:
            states.append(state)
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            rewards.append(reward)
            state = next_state

        # Compute returns and update (backward through the episode)
        G = 0.0
        for t in range(len(states) - 1, -1, -1):
            G = rewards[t] + gamma * G
            s = states[t]
            returns_count[s] += 1
            # Incremental mean update
            V[s] = V[s] + (G - V[s]) / returns_count[s]

    return V


# Example: Compare TD and MC on a random walk
def compare_td_mc():
    """
    Classic comparison: 19-state random walk.
    Shows TD typically learns faster than MC.
    """
    import matplotlib.pyplot as plt

    class RandomWalk:
        """19-state random walk: start at center, terminate at either end."""

        def __init__(self, n_states=19):
            self.n_states = n_states
            self.center = n_states // 2

        def reset(self):
            self.state = self.center
            return self.state

        def step(self, action=None):
            # Random walk: 50% left, 50% right
            if np.random.random() < 0.5:
                self.state -= 1  # Left
            else:
                self.state += 1  # Right

            done = self.state == 0 or self.state == self.n_states - 1
            reward = 1 if self.state == self.n_states - 1 else (0 if not done else -1)
            return self.state, reward, done, {}

    env = RandomWalk(19)
    true_v = np.linspace(-1, 1, 19)  # True values for the random walk

    # ... training and comparison code would follow
```

TD(0) can be viewed through the lens of certainty equivalence: it behaves as if the estimated model were the true model.
Given experience $(s_t, r_{t+1}, s_{t+1})$, TD(0) updates as if:

- the observed reward $r_{t+1}$ were exactly the expected immediate reward from $s_t$, and
- the observed next state $s_{t+1}$ were the only possible successor (transition probability 1).
Over many updates, these "certain" estimates average out to the true expectations. This is why TD converges to $V^\pi$ despite treating single samples as if they were certain.
The TD error $\delta_t$ is the heart of TD learning:
$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$
This quantity has remarkable properties that explain why TD works and connect to neuroscience.
$\delta_t$ measures the discrepancy between:

- what you predicted: $V(S_t)$, and
- a better-informed one-step estimate: $R_{t+1} + \gamma V(S_{t+1})$.

If $\delta_t > 0$: things were better than expected (increase $V(S_t)$). If $\delta_t < 0$: things were worse than expected (decrease $V(S_t)$). If $\delta_t = 0$: the prediction was consistent with the one-step lookahead (no change needed).
Neuroscience research by Wolfram Schultz (1997) discovered that dopamine neurons in the brain fire in a pattern strikingly similar to TD errors. They fire when rewards are unexpectedly good (δ > 0), are inhibited when rewards are unexpectedly bad (δ < 0), and remain at baseline for expected outcomes (δ ≈ 0). This suggests the brain may implement something like TD learning!
A beautiful result: the sum of TD errors along a trajectory equals the Monte Carlo return minus the initial value estimate:
$$\sum_{k=t}^{T-1} \gamma^{k-t} \delta_k = G_t - V(S_t)$$
Proof (holding $V$ fixed over the episode):
$$
\begin{aligned}
\sum_{k=t}^{T-1} \gamma^{k-t} \delta_k
&= \sum_{k=t}^{T-1} \gamma^{k-t} \left[ R_{k+1} + \gamma V(S_{k+1}) - V(S_k) \right] \\
&= \sum_{k=t}^{T-1} \gamma^{k-t} R_{k+1} + \sum_{k=t+1}^{T} \gamma^{k-t} V(S_k) - \sum_{k=t}^{T-1} \gamma^{k-t} V(S_k) \\
&= G_t + \gamma^{T-t} V(S_T) - V(S_t) \quad \text{(telescoping sum)} \\
&= G_t - V(S_t) \quad \text{(since } V(S_T) = 0 \text{ at the terminal state)}.
\end{aligned}
$$
This reveals that the Monte Carlo error $G_t - V(S_t)$ is exactly the discounted sum of one-step TD errors: TD and MC target the same quantity, just decomposed differently.
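As a sanity check, the identity is easy to verify numerically on a random episode with a fixed value table (a minimal sketch; the rewards and values below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, T = 0.9, 6
rewards = rng.normal(size=T)              # R_1, ..., R_T
V = np.append(rng.normal(size=T), 0.0)    # V(S_0), ..., V(S_{T-1}), and V(S_T) = 0

# One-step TD errors delta_k = R_{k+1} + gamma * V(S_{k+1}) - V(S_k)
deltas = [rewards[k] + gamma * V[k + 1] - V[k] for k in range(T)]

discounted_td_sum = sum(gamma ** k * d for k, d in enumerate(deltas))
G0 = sum(gamma ** k * r for k, r in enumerate(rewards))    # Monte Carlo return G_0
assert np.isclose(discounted_td_sum, G0 - V[0])            # the identity holds
```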
The TD error has variance from:

- one random reward, $R_{t+1}$, and
- one random transition, $S_t \to S_{t+1}$ (which determines the bootstrap term $V(S_{t+1})$).

Compare to the MC error $G_t - V(S_t)$:

- it accumulates randomness from every reward and every transition until the episode ends.
This variance reduction is TD's key advantage, despite the bias from using $V(s')$ instead of the true value.
| Method | Variance Sources | Total Variance |
|---|---|---|
| TD(0) | 1 reward, 1 transition, V(s') estimate | Low |
| n-step TD | n rewards, n transitions, V(s_{t+n}) estimate | Medium |
| Monte Carlo | T-t rewards, T-t transitions | High |
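A toy simulation makes the gap concrete. In this sketch the only randomness is reward noise over a 10-step episode and the bootstrap value is a fixed (imperfect) estimate; all numbers are illustrative assumptions, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, episodes, horizon = 0.99, 10_000, 10
td_targets, mc_returns = [], []

for _ in range(episodes):
    rewards = rng.normal(loc=1.0, scale=1.0, size=horizon)   # noisy rewards
    V_next = 9.0   # fixed bootstrap estimate for the successor state
    td_targets.append(rewards[0] + gamma * V_next)            # one reward + bootstrap
    mc_returns.append(sum(gamma ** k * r for k, r in enumerate(rewards)))

# The TD target's variance comes from a single reward; the MC return
# accumulates the noise of all ten rewards.
print(np.var(td_targets), np.var(mc_returns))
```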
TD learning provably converges under appropriate conditions. The theory is more subtle than it might appear because TD uses biased estimates (bootstrapping).
Theorem: For tabular policy evaluation (estimating $V^\pi$), TD(0) converges to $V^\pi$ with probability 1 if:

1. Every state is visited infinitely often, and
2. The step sizes satisfy the Robbins-Monro conditions $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$.

Intuition: The first condition ensures every state keeps being updated. The second ensures the updates eventually become small enough to damp out noise-driven oscillations, while remaining large enough in total to reach any value.
With constant α (common in practice), TD doesn't converge in the formal sense—it oscillates around V^π indefinitely. However, the oscillations are bounded, and average performance is close to V^π. For practical purposes, this is often acceptable, especially in non-stationary environments where tracking changes is desirable.
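For reference, a minimal sketch of TD(0) with a per-state step size $\alpha_t(s) = 1/N(s)$, which satisfies both conditions above; it assumes the same `env`/`policy` interface as `td_prediction` earlier:

```python
import numpy as np

def td_prediction_decaying_alpha(env, policy, num_states, num_episodes=1000, gamma=0.99):
    """TD(0) with step size 1/N(s): sum(alpha) diverges, sum(alpha^2) converges."""
    V = np.zeros(num_states)
    visits = np.zeros(num_states)
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            next_state, reward, done, _ = env.step(policy(state))
            visits[state] += 1
            alpha = 1.0 / visits[state]                     # Robbins-Monro schedule
            target = reward if done else reward + gamma * V[next_state]
            V[state] += alpha * (target - V[state])
            state = next_state
    return V
```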
The bias in TD comes from using $V(S_{t+1})$ instead of true $V^\pi(S_{t+1})$. Why doesn't this cause systematic error?
Key insight: The bias decreases as learning progresses. As $V$ approaches $V^\pi$, the bootstrap estimate $V(S_{t+1})$ approaches $V^\pi(S_{t+1})$, and the bias vanishes.
Formally, define the Bellman operator for policy $\pi$: $$(T^\pi V)(s) = \mathbb{E}_\pi\left[ R_{t+1} + \gamma V(S_{t+1}) \mid S_t = s \right]$$
This is a $\gamma$-contraction in the max norm: $\|T^\pi V - T^\pi U\|_\infty \leq \gamma \|V - U\|_\infty$.
TD(0) approximates applying $T^\pi$ using samples. The contraction property ensures that even with noisy samples, the iteration converges.
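The contraction property can be checked numerically on a small synthetic MDP (the random $P^\pi$ and $R$ below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 5, 0.9
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)   # row-stochastic transition matrix under pi
R = rng.random(n)                   # expected immediate rewards under pi

def T_pi(V):
    # (T^pi V)(s) = E_pi[R + gamma * V(S') | S = s]
    return R + gamma * P @ V

V, U = rng.random(n), rng.random(n)
lhs = np.max(np.abs(T_pi(V) - T_pi(U)))   # ||T^pi V - T^pi U||_inf
rhs = gamma * np.max(np.abs(V - U))       # gamma * ||V - U||_inf
assert lhs <= rhs + 1e-12
```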
Which converges faster? It depends on the problem!
TD is faster when:

- the state representation is genuinely Markov, so bootstrapping from $V(s')$ exploits real structure,
- episodes are long, so waiting for complete returns delays learning,
- returns are noisy, so the low-variance TD target pays off.

MC is faster when:

- the Markov property is weak (e.g., state aliasing), so bootstrap targets are misleading,
- episodes are short and returns have low variance,
- value estimates are poorly initialized, making early bootstrap targets unreliable.
Empirical finding (Sutton, 1988): For many problems, especially those with multi-step dependencies and long episodes, TD learns correct values significantly faster than MC.
One-step TD (TD(0)) and full-return MC are endpoints of a spectrum. n-step TD fills the continuum.
The n-step return from time $t$: $$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$$
This uses $n$ actual rewards and bootstraps from step $t+n$.
n-Step TD Update: $$V(S_t) \leftarrow V(S_t) + \alpha \left[ G_{t:t+n} - V(S_t) \right]$$
As $n$ increases:

- Bias decreases: the bootstrap term $\gamma^n V(S_{t+n})$ is discounted more heavily, so errors in the value estimate matter less.
- Variance increases: more random rewards and transitions enter the return.
The optimal $n$ balances these effects and is problem-dependent.
| n | Name | Bias | Variance | Update Delay |
|---|---|---|---|---|
| 1 | TD(0) | High (V estimate) | Low | 1 step |
| 2-5 | Short n-step | Medium | Medium | n steps |
| 10-20 | Long n-step | Low | High | n steps |
| ∞ | Monte Carlo | None | Highest | Episode end |
```python
import numpy as np
from collections import deque


def n_step_td_prediction(
    env,
    policy,
    n: int = 4,
    num_episodes: int = 1000,
    alpha: float = 0.1,
    gamma: float = 0.99,
    num_states: int = None
) -> np.ndarray:
    """
    n-step TD Prediction for policy evaluation.

    Uses n actual rewards + bootstrap from V(S_{t+n}).
    Larger n = less bias, more variance.
    """
    V = np.zeros(num_states)

    for episode in range(num_episodes):
        # Sliding buffers for the n-step window
        state_buffer = deque(maxlen=n + 1)   # S_tau, ..., S_{tau+n}
        reward_buffer = deque(maxlen=n)      # R_{tau+1}, ..., R_{tau+n}

        state = env.reset()
        state_buffer.append(state)
        done = False
        t = 0
        T = float('inf')  # Episode length (unknown until we hit terminal)

        # Process the episode
        while True:
            if t < T:
                action = policy(state)
                next_state, reward, done, _ = env.step(action)
                state_buffer.append(next_state)
                reward_buffer.append(reward)
                if done:
                    T = t + 1
                state = next_state

            # Update time tau lags n - 1 steps behind the current time
            tau = t - n + 1

            if t >= T and tau >= 1:
                # Past the terminal step the deques no longer slide on append,
                # so advance the window by hand: keep state_buffer[0] = S_tau
                # and reward_buffer = [R_{tau+1}, ..., R_T]
                state_buffer.popleft()
                reward_buffer.popleft()

            if tau >= 0:
                # Compute the n-step return G_{tau:tau+n}
                G = 0.0
                # Sum rewards from tau+1 to min(tau+n, T)
                for i, r in enumerate(reward_buffer):
                    G += (gamma ** i) * r

                # Bootstrap from V(S_{tau+n}) if we have not reached terminal
                if tau + n < T:
                    G += (gamma ** n) * V[state_buffer[-1]]

                # Update V(S_tau)
                update_state = state_buffer[0]
                V[update_state] += alpha * (G - V[update_state])

            t += 1

            # Stop once every state through S_{T-1} has been updated
            if tau >= T - 1:
                break

    return V


def compare_n_step_values():
    """
    Compare different n values on a simple problem
    to visualize the bias-variance trade-off.
    """
    # Results typically show:
    # - n=1 (TD(0)): Fast initial learning, may plateau at a biased value
    # - n=medium: Often the best final performance
    # - n=large/MC: Slow early learning, lower final error
    pass
```

Empirically, n ∈ [4, 8] often works well across a variety of domains. For very sparse rewards (one reward at episode end), larger n speeds learning dramatically. For dense rewards, smaller n suffices. When in doubt, try TD(λ) instead, which automatically averages over all n values.
Rather than choosing a single $n$, why not use a weighted average of all n-step returns? This is TD(λ), one of the most elegant ideas in RL.
The λ-return $G_t^\lambda$ is a weighted average of all n-step returns:
$$G_t^\lambda = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G_{t:t+n} + \lambda^{T-t-1} G_t$$
The weights $(1-\lambda)\lambda^{n-1}$ are exponentially decaying, summing to 1.
Special cases:

- $\lambda = 0$: all weight falls on the one-step return, so $G_t^\lambda = G_{t:t+1}$ and we recover TD(0).
- $\lambda = 1$: all weight falls on the full return $G_t$, and we recover Monte Carlo.
Why exponential decay? The weight on the n-step return is $(1-\lambda)\lambda^{n-1}$: each additional step multiplies the weight by $\lambda$. For $\lambda = 0.9$, the weights fall off as shown in the table below.
This gives preference to shorter (lower variance) returns while still incorporating longer-horizon information.
| n-step | Weight | Cumulative |
|---|---|---|
| 1 | 0.10 | 0.10 |
| 2 | 0.09 | 0.19 |
| 5 | 0.066 | 0.41 |
| 10 | 0.039 | 0.65 |
| 20 | 0.014 | 0.86 |
| 50 | 0.0006 | 0.995 |
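The table above can be reproduced directly from the weight formula:

```python
lam = 0.9
for n in (1, 2, 5, 10, 20, 50):
    weight = (1 - lam) * lam ** (n - 1)   # weight on the n-step return
    cumulative = 1 - lam ** n             # total weight on returns of length <= n
    print(f"n={n:2d}  weight={weight:.4f}  cumulative={cumulative:.3f}")
```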
Computing the λ-return requires knowing future rewards (forward view). This seems impractical online! But there's an equivalent backward view using eligibility traces that updates at every step. The next page derives this mathematically and shows they're equivalent.
TD(λ) interpolates between TD(0) (low variance, high bias) and MC (high variance, no bias): intermediate values of $\lambda$ trade off the two smoothly.
Unlike fixed n-step methods, λ-returns are robust: even if the optimal n varies across states, λ-averaging tends to work reasonably everywhere.
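For completeness, here is a minimal offline (forward-view) computation of $G_t^\lambda$ from a stored episode; `rewards` and `values` are assumed to hold $R_{t+1}, \ldots, R_T$ and $V(S_{t+1}), \ldots, V(S_T)$, with $V(S_T) = 0$ at a terminal state:

```python
def lambda_return(rewards, values, gamma, lam):
    """Offline forward-view lambda-return for the state at the start of this tail."""
    T = len(rewards)
    # n-step returns G_{t:t+n} for n = 1..T (the last one is the full MC return)
    n_step_returns = []
    G = 0.0
    for n in range(1, T + 1):
        G += gamma ** (n - 1) * rewards[n - 1]                  # add gamma^{n-1} R_{t+n}
        n_step_returns.append(G + gamma ** n * values[n - 1])   # bootstrap at S_{t+n}
    # (1 - lam) * sum_{n=1}^{T-1} lam^{n-1} G_{t:t+n}  +  lam^{T-1} G_t
    out = sum((1 - lam) * lam ** (n - 1) * n_step_returns[n - 1] for n in range(1, T))
    return out + lam ** (T - 1) * n_step_returns[-1]

# lam = 0 recovers the TD(0) target, lam = 1 recovers the Monte Carlo return
print(lambda_return([1.0, 0.0, 2.0], [0.5, 0.3, 0.0], gamma=0.99, lam=0.9))
```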
So far, we've discussed online TD: updating after each step of experience. Batch TD processes a fixed dataset repeatedly until convergence, revealing theoretical properties more clearly.
Given a batch of transitions $\{(s_i, r_i, s'_i)\}_{i=1}^{N}$, batch TD(0) repeatedly sweeps through the data, applying the TD(0) update to each transition with a small step size, until the value function stops changing.
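A minimal sketch of this procedure (the `(s, r, s_next, done)` tuple format is an assumption for illustration):

```python
import numpy as np

def batch_td0(transitions, num_states, alpha=0.01, gamma=0.99, max_sweeps=10_000):
    """Repeatedly apply TD(0) updates over a fixed batch until V stops changing."""
    V = np.zeros(num_states)
    for _ in range(max_sweeps):
        V_old = V.copy()
        for s, r, s_next, done in transitions:
            target = r if done else r + gamma * V[s_next]
            V[s] += alpha * (target - V[s])
        if np.max(np.abs(V - V_old)) < 1e-10:   # reached the batch fixed point
            break
    return V
```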
Batch TD converges to a well-defined fixed point, unlike online TD which depends on visit order.
Batch TD(0) converges to the value function that would be exactly correct if the maximum likelihood MDP estimated from the data were the true model.
From the batch, estimate:

- $\hat{P}(s' \mid s)$: the fraction of observed transitions from $s$ that went to $s'$,
- $\hat{R}(s)$: the average immediate reward observed from $s$.
Then batch TD converges to $V$ satisfying: $$V(s) = \hat{R}(s) + \gamma \sum_{s'} \hat{P}(s' | s) V(s')$$
This is certainty equivalence: TD behaves as if the estimated model is truth.
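Equivalently, one can build the maximum likelihood model from the batch and solve the linear system above directly; a sketch (same assumed transition format as before, with episode ends treated as transitions to a zero-value terminal state):

```python
import numpy as np

def certainty_equivalence_values(transitions, num_states, gamma=0.99):
    """Estimate P_hat and R_hat from the batch, then solve V = R_hat + gamma * P_hat V."""
    counts = np.zeros((num_states, num_states))
    reward_sum = np.zeros(num_states)
    visits = np.zeros(num_states)
    for s, r, s_next, done in transitions:
        visits[s] += 1
        reward_sum[s] += r
        if not done:
            counts[s, s_next] += 1   # terminal transitions leave their mass on V = 0
    denom = np.maximum(visits, 1)
    P_hat = counts / denom[:, None]   # ML transition estimates
    R_hat = reward_sum / denom        # ML expected immediate rewards
    # Solve (I - gamma * P_hat) V = R_hat
    return np.linalg.solve(np.eye(num_states) - gamma * P_hat, R_hat)
```

Given the same batch, running `batch_td0` above for enough sweeps should approach these same values, which is the certainty-equivalence result.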
Batch MC simply averages observed returns for each state: V(s) = mean(G_t for visits to s). This ignores the MDP structure—each return is treated independently. Batch TD exploits transition structure, often producing better estimates from the same data, especially when data is limited.
Consider a small MDP with states A, B, and a terminal state (the classic "You are the Predictor" example from Sutton and Barto), where we observe 8 episodes: one episode goes A → B with reward 0 and then B terminates with reward 0; in six episodes B terminates with reward 1; in one episode B terminates with reward 0.

MC estimate for V(A): the only return ever observed from A is 0, so $V(A) = 0$.

TD estimate for V(A): every visit to A led to B, and the batch gives $V(B) = 6/8 = 0.75$, so batch TD gives $V(A) = 0 + \gamma \cdot 0.75 \approx 0.75$.
TD's use of the Markov structure extracts more information from limited data.
One of the most remarkable connections in computational neuroscience is between TD learning and the brain's reward system.
In the mid-1990s, Wolfram Schultz and colleagues made a striking discovery: dopamine neurons in the midbrain fire in patterns that closely match TD errors.
Experimental Setup: Monkeys learn that a light predicts juice reward after a delay.
Dopamine Response Evolution:

- Early in training: neurons fire when the juice is delivered (the reward is unexpected).
- After learning: firing shifts to the light, the earliest reliable predictor; the fully predicted juice itself evokes little extra response.
- When the predicted juice is omitted: firing dips below baseline at the time the reward was expected.
This matches TD error perfectly: positive firing for better-than-expected, negative (inhibition) for worse-than-expected, baseline for as-expected.
| Situation | Dopamine Activity | TD Error δ |
|---|---|---|
| Unexpected reward | Strong firing | Positive (r > expected) |
| Expected reward received | Baseline | Zero (r = expected) |
| Expected reward omitted | Inhibition (below baseline) | Negative (r < expected) |
| Predictive cue (learned) | Firing at cue, not reward | δ shifts to earliest predictor |
This correspondence suggests the brain may implement something like TD learning for reward-based learning. This has influenced understanding of addiction (hijacking the reward system), learning disorders, and therapeutic approaches. It also validates RL theory: nature converged on a similar solution!
Neuroscience also finds evidence for an actor-critic decomposition:

- a critic-like system (associated with the ventral striatum) that learns value predictions, and
- an actor-like system (associated with the dorsal striatum) that learns action selection,

with dopaminergic TD-error-like signals training both.
This mirrors the actor-critic RL architecture, where a critic learns $V$ (or $Q$) and TD errors train both the value function and the policy.
The parallel is not exact—brains are far more complex—but the high-level similarity is striking and has driven research in both fields.
Temporal difference learning represents a breakthrough in understanding how to learn from sequential experience. Let's consolidate the key insights:

- TD learns from the difference between successive predictions, updating after every step instead of waiting for final outcomes.
- The TD error $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is the learning signal; summed along a trajectory, TD errors recover the Monte Carlo error.
- Bootstrapping trades bias for variance: TD targets are biased by the current value estimates but far less noisy than full returns.
- TD(0) converges to $V^\pi$ under standard step-size conditions; with a constant step size it tracks $V^\pi$ within a bounded neighborhood.
- n-step returns and the λ-return span a continuum between TD(0) and Monte Carlo.
- Batch TD converges to the certainty-equivalence solution, exploiting the Markov structure of the data.
- Dopamine neuron activity closely resembles the TD error, linking TD learning to the brain's reward system.
What's next: We've established the theory of TD learning and seen it in action through Q-learning and SARSA. The next page introduces eligibility traces, the mechanism that makes TD(λ) computationally tractable and provides an elegant backward-view implementation of the forward-view λ-return.
You've mastered the theoretical foundations of temporal difference learning—the engine driving Q-learning, SARSA, and their variants. This understanding of bootstrapping, TD errors, and the bias-variance trade-off provides the conceptual foundation for all modern value-based RL.