Richard Bellman's insight, developed in the 1950s, transformed how we think about sequential decision-making: the value of a state can be decomposed into immediate reward plus discounted value of successor states.
This simple observation—that long-term value has recursive structure—underlies virtually every reinforcement learning algorithm. The Bellman equations express this recursion mathematically:
$$V(s) = R(s) + \gamma \sum_{s'} P(s'|s) V(s')$$
At first glance, this appears circular: we define $V(s)$ in terms of $V(s')$. But this is precisely the structure that enables dynamic programming. We can solve for $V$ simultaneously for all states, or iteratively refine estimates until convergence.
The Bellman equations come in two forms: the expectation equations, which describe value under a fixed policy $\pi$, and the optimality equations, which characterize the value of the optimal policy $\pi^*$.
Together, they form the mathematical backbone of RL theory.
By the end of this page, you will understand both forms of the Bellman equations in depth—their derivation, interpretation, matrix form, connection to algorithms, and why they guarantee convergence. You'll be able to set up and solve Bellman equations for concrete MDPs.
Before diving into Bellman equations, let's precisely define the value functions they describe.
State Value Function $V^\pi(s)$:
$$V^\pi(s) = \mathbb{E}_\pi\left[G_t | S_t = s\right] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k} | S_t = s\right]$$
The expected return starting from state $s$ and following policy $\pi$ thereafter.
Action Value Function $Q^\pi(s, a)$:
$$Q^\pi(s, a) = \mathbb{E}_\pi\left[G_t | S_t = s, A_t = a\right]$$
The expected return starting from state $s$, taking action $a$, then following policy $\pi$.
Relationship between V and Q:
$$V^\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s) Q^\pi(s, a)$$
$$Q^\pi(s, a) = R(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s, a) V^\pi(s')$$
These relationships are crucial—$V$ and $Q$ encode the same information differently, and we can convert between them.
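To make the conversion concrete, here is a minimal vectorized sketch for the tabular arrays used throughout this page (shapes `[n_states, n_actions, n_states]` for `P` and `[n_states, n_actions]` for `R`, `Q`, and the policy); the helper names `q_from_v` and `v_from_q` are ours, not from any library:

```python
import numpy as np

def q_from_v(V: np.ndarray, P: np.ndarray, R: np.ndarray, gamma: float) -> np.ndarray:
    """Q^π(s,a) = R(s,a) + γ Σ_s' P(s'|s,a) V^π(s')  -- needs the model (P, R)."""
    return R + gamma * np.einsum('sap,p->sa', P, V)

def v_from_q(Q: np.ndarray, policy: np.ndarray) -> np.ndarray:
    """V^π(s) = Σ_a π(a|s) Q^π(s,a)  -- needs only the policy."""
    return np.sum(policy * Q, axis=1)
```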
| Property | V(s) | Q(s, a) |
|---|---|---|
| Domain | States only | State-action pairs |
| Size (tabular) | $|\mathcal{S}|$ | $|\mathcal{S}| \times |\mathcal{A}|$ |
| First action | Averaged over $\pi(a\mid s)$ | Explicit argument $a$ |
| Model needed to act greedily? | Yes (one-step lookahead) | No ($\arg\max_a$ suffices) |
| Common use | Planning with model | Model-free learning |
The Bellman Expectation Equations describe the value of states (or state-actions) under a fixed policy $\pi$. They express that value is immediate reward plus discounted expected future value.
For the State Value Function:
$$V^\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s) \left[ R(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s, a) V^\pi(s') \right]$$
For the Action Value Function:
$$Q^\pi(s, a) = R(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s, a) \sum_{a' \in \mathcal{A}} \pi(a'|s') Q^\pi(s', a')$$
Derivation (for V):
Starting from the definition:
$$\begin{aligned} V^\pi(s) &= \mathbb{E}_\pi[G_t | S_t = s] \\ &= \mathbb{E}_\pi[R_t + \gamma G_{t+1} | S_t = s] \\ &= \mathbb{E}_\pi[R_t | S_t = s] + \gamma\, \mathbb{E}_\pi[G_{t+1} | S_t = s] \\ &= \mathbb{E}_\pi[R_t | S_t = s] + \gamma\, \mathbb{E}_\pi[V^\pi(S_{t+1}) | S_t = s] \end{aligned}$$
Expanding the expectations over $\pi$ and $P$ gives the full equation.
```python
import numpy as np
from typing import Dict, Tuple


def bellman_expectation_v(
    s: int,
    policy: np.ndarray,  # π(a|s): shape [n_states, n_actions]
    P: np.ndarray,       # P(s'|s,a): shape [n_states, n_actions, n_states]
    R: np.ndarray,       # R(s,a): shape [n_states, n_actions]
    V: np.ndarray,       # Current V estimate: shape [n_states]
    gamma: float
) -> float:
    """
    Compute Bellman expectation backup for state s.

    V^π(s) = Σ_a π(a|s) [R(s,a) + γ Σ_s' P(s'|s,a) V^π(s')]
    """
    n_actions = P.shape[1]
    value = 0.0

    for a in range(n_actions):
        prob_action = policy[s, a]
        if prob_action == 0:
            continue

        # Immediate reward
        r = R[s, a]

        # Expected future value
        future_value = gamma * np.sum(P[s, a, :] * V)

        # Add to total, weighted by action probability
        value += prob_action * (r + future_value)

    return value


def bellman_expectation_q(
    s: int,
    a: int,
    policy: np.ndarray,
    P: np.ndarray,
    R: np.ndarray,
    Q: np.ndarray,  # Current Q estimate: shape [n_states, n_actions]
    gamma: float
) -> float:
    """
    Compute Bellman expectation backup for state-action (s, a).

    Q^π(s,a) = R(s,a) + γ Σ_s' P(s'|s,a) Σ_a' π(a'|s') Q^π(s',a')
    """
    n_states, n_actions = Q.shape

    # Immediate reward
    q_value = R[s, a]

    # Expected future value
    for s_prime in range(n_states):
        prob_transition = P[s, a, s_prime]
        if prob_transition == 0:
            continue

        # Expected Q at next state under policy
        expected_next_q = np.sum(policy[s_prime, :] * Q[s_prime, :])
        q_value += gamma * prob_transition * expected_next_q

    return q_value


# ================================================
# Policy Evaluation: Solve Bellman Expectation Equations
# ================================================
def iterative_policy_evaluation(
    policy: np.ndarray,
    P: np.ndarray,
    R: np.ndarray,
    gamma: float,
    epsilon: float = 1e-6,
    max_iters: int = 1000
) -> Tuple[np.ndarray, int]:
    """
    Solve for V^π by iteratively applying Bellman backups.

    This finds the unique fixed point of the Bellman expectation equation.
    """
    n_states = P.shape[0]
    V = np.zeros(n_states)

    for iteration in range(max_iters):
        V_new = np.zeros(n_states)
        for s in range(n_states):
            V_new[s] = bellman_expectation_v(s, policy, P, R, V, gamma)

        # Check convergence
        delta = np.max(np.abs(V_new - V))
        V = V_new

        if delta < epsilon:
            return V, iteration + 1

    return V, max_iters


# ================================================
# Example: Evaluate a Random Policy
# ================================================
def demo_policy_evaluation():
    """
    Evaluate a random policy on a simple MDP.
    """
    # 4-state chain: 0 -- 1 -- 2 -- 3 (goal)
    n_states = 4
    n_actions = 2  # Left (0), Right (1)

    # Transition probabilities (deterministic)
    P = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        P[s, 0, max(0, s-1)] = 1.0            # Left
        P[s, 1, min(n_states-1, s+1)] = 1.0   # Right

    # Rewards: +1 for reaching goal (state 3)
    R = np.zeros((n_states, n_actions))
    R[2, 1] = 1.0  # Reward for entering goal from state 2

    # Random policy: 50% left, 50% right
    policy_random = np.ones((n_states, n_actions)) / n_actions

    # Optimal policy: always go right
    policy_optimal = np.zeros((n_states, n_actions))
    policy_optimal[:, 1] = 1.0  # Always action 1 (right)

    gamma = 0.9

    print("=== Policy Evaluation ===")
    print(f"MDP: 4-state chain, goal at state 3, γ={gamma}")

    V_random, iters = iterative_policy_evaluation(policy_random, P, R, gamma)
    print(f"\nRandom policy V^π: {V_random.round(3)}")
    print(f"  Converged in {iters} iterations")

    V_optimal, iters = iterative_policy_evaluation(policy_optimal, P, R, gamma)
    print(f"\nOptimal policy V^π*: {V_optimal.round(3)}")
    print(f"  Converged in {iters} iterations")

    print("\nNote: Optimal policy has higher values everywhere!")


demo_policy_evaluation()
```

For finite MDPs, the Bellman expectation equation can be written in matrix-vector form, enabling direct solution.
Let $\mathbf{P}^\pi$ be the state transition matrix under policy $\pi$:
$$(\mathbf{P}^\pi)_{ss'} = \sum_{a \in \mathcal{A}} \pi(a|s) P(s'|s, a)$$
Let $\mathbf{R}^\pi$ be the expected immediate reward vector:
$$(\mathbf{R}^\pi)_s = \sum_{a \in \mathcal{A}} \pi(a|s) R(s, a)$$
The Bellman expectation equation becomes:
$$\mathbf{V}^\pi = \mathbf{R}^\pi + \gamma \mathbf{P}^\pi \mathbf{V}^\pi$$
This is a system of linear equations! Rearranging:
$$(\mathbf{I} - \gamma \mathbf{P}^\pi) \mathbf{V}^\pi = \mathbf{R}^\pi$$
$$\mathbf{V}^\pi = (\mathbf{I} - \gamma \mathbf{P}^\pi)^{-1} \mathbf{R}^\pi$$
When $\gamma < 1$, the matrix $(\mathbf{I} - \gamma \mathbf{P}^\pi)$ is always invertible (the Neumann series converges), giving a unique solution.
```python
import numpy as np


def compute_policy_matrices(
    policy: np.ndarray,  # [n_states, n_actions]
    P: np.ndarray,       # [n_states, n_actions, n_states]
    R: np.ndarray        # [n_states, n_actions]
) -> tuple:
    """
    Compute the policy-specific transition matrix P^π and reward vector R^π.
    """
    n_states, n_actions = policy.shape

    # P^π[s, s'] = Σ_a π(a|s) P(s'|s,a)
    P_pi = np.zeros((n_states, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            P_pi[s, :] += policy[s, a] * P[s, a, :]

    # R^π[s] = Σ_a π(a|s) R(s,a)
    R_pi = np.sum(policy * R, axis=1)

    return P_pi, R_pi


def solve_bellman_direct(
    policy: np.ndarray,
    P: np.ndarray,
    R: np.ndarray,
    gamma: float
) -> np.ndarray:
    """
    Directly solve the Bellman equation using matrix inversion.

    V^π = (I - γP^π)^{-1} R^π

    Complexity: O(n³) for matrix inversion
    Best for: Small state spaces, exact solution needed
    """
    n_states = policy.shape[0]
    P_pi, R_pi = compute_policy_matrices(policy, P, R)

    # Solve: (I - γP^π) V = R^π
    A = np.eye(n_states) - gamma * P_pi
    V = np.linalg.solve(A, R_pi)

    return V


def neumann_series_solution(
    policy: np.ndarray,
    P: np.ndarray,
    R: np.ndarray,
    gamma: float,
    n_terms: int = 100
) -> np.ndarray:
    """
    Approximate solution using Neumann series expansion.

    (I - γP^π)^{-1} = I + γP^π + γ²(P^π)² + γ³(P^π)³ + ...
    V^π = R^π + γP^π R^π + γ²(P^π)² R^π + ...

    This shows the interpretation: value = sum of discounted future rewards.
    """
    P_pi, R_pi = compute_policy_matrices(policy, P, R)
    n_states = len(R_pi)

    V = np.zeros(n_states)
    P_power = np.eye(n_states)  # (P^π)^0 = I

    for k in range(n_terms):
        contribution = (gamma ** k) * (P_power @ R_pi)
        V += contribution
        P_power = P_power @ P_pi  # Update to (P^π)^{k+1}

    return V


# ================================================
# Compare Methods
# ================================================
def compare_solution_methods():
    """
    Compare iterative, direct, and Neumann series solutions.

    All should give the same answer.
    """
    # Same MDP as before
    n_states = 4
    n_actions = 2

    P = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        P[s, 0, max(0, s-1)] = 1.0
        P[s, 1, min(n_states-1, s+1)] = 1.0

    R = np.zeros((n_states, n_actions))
    R[2, 1] = 1.0

    policy = np.ones((n_states, n_actions)) / n_actions
    gamma = 0.9

    print("=== Comparing Bellman Solution Methods ===")

    # Method 1: Iterative
    # from bellman_expectation import iterative_policy_evaluation  # (defined above)
    # V_iter, _ = iterative_policy_evaluation(policy, P, R, gamma)
    # Using inline version for demonstration:
    V_iter = np.zeros(n_states)
    for _ in range(1000):
        V_new = np.zeros(n_states)
        for s in range(n_states):
            for a in range(n_actions):
                V_new[s] += policy[s, a] * (R[s, a] + gamma * np.sum(P[s, a, :] * V_iter))
        if np.max(np.abs(V_new - V_iter)) < 1e-10:
            break
        V_iter = V_new

    # Method 2: Direct (matrix inversion)
    V_direct = solve_bellman_direct(policy, P, R, gamma)

    # Method 3: Neumann series
    V_neumann = neumann_series_solution(policy, P, R, gamma, n_terms=50)

    print(f"Iterative:      {V_iter.round(6)}")
    print(f"Direct solve:   {V_direct.round(6)}")
    print(f"Neumann series: {V_neumann.round(6)}")
    print(f"\nMax difference: {np.max(np.abs(V_iter - V_direct)):.2e}")
    print("All methods agree! ✓")


compare_solution_methods()
```

Choosing among the three:

- Iterative: best for large state spaces (memory efficient; avoids storing and inverting the full matrix).
- Direct solve: best for small state spaces when an exact solution is needed immediately.
- Neumann series: illustrative (it shows that value is a sum of discounted future rewards) but rarely used in practice.
The Bellman Optimality Equations describe the value functions for the optimal policy $\pi^*$—the policy that maximizes expected return.
Optimal State Value Function:
$$V^*(s) = \max_{a \in \mathcal{A}} \left[ R(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s, a) V^*(s') \right]$$
Optimal Action Value Function:
$$Q^*(s, a) = R(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s, a) \max_{a' \in \mathcal{A}} Q^*(s', a')$$
The Crucial Difference:
Compared to expectation equations, the optimality equations replace the weighted sum $\sum_a \pi(a|s) [\ldots]$ with a max operation. Instead of averaging over actions according to a policy, we select the best action.
Relationships:
$$V^*(s) = \max_{a} Q^*(s, a)$$
$$Q^*(s, a) = R(s, a) + \gamma \sum_{s'} P(s'|s, a) V^*(s')$$
Knowing $Q^*$ immediately gives the optimal policy: $\pi^*(s) = \arg\max_a Q^*(s, a)$
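As a minimal sketch (the helper name is ours), extracting that greedy policy from a tabular `Q` array of shape `[n_states, n_actions]` is a single argmax per state, with no model required:

```python
import numpy as np

def greedy_policy_from_q(Q: np.ndarray) -> np.ndarray:
    """π*(s) = argmax_a Q*(s, a): one best-action index per state."""
    return np.argmax(Q, axis=1)
```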
```python
import numpy as np
from typing import Tuple


def bellman_optimality_v(
    s: int,
    P: np.ndarray,  # P(s'|s,a): [n_states, n_actions, n_states]
    R: np.ndarray,  # R(s,a): [n_states, n_actions]
    V: np.ndarray,  # Current V estimate: [n_states]
    gamma: float
) -> Tuple[float, int]:
    """
    Compute Bellman optimality backup for V at state s.

    V*(s) = max_a [R(s,a) + γ Σ_s' P(s'|s,a) V*(s')]

    Returns: (new_value, best_action)
    """
    n_actions = P.shape[1]
    q_values = []

    for a in range(n_actions):
        q = R[s, a] + gamma * np.sum(P[s, a, :] * V)
        q_values.append(q)

    best_action = np.argmax(q_values)
    best_value = q_values[best_action]

    return best_value, best_action


def bellman_optimality_q(
    s: int,
    a: int,
    P: np.ndarray,
    R: np.ndarray,
    Q: np.ndarray,  # Current Q estimate: [n_states, n_actions]
    gamma: float
) -> float:
    """
    Compute Bellman optimality backup for Q at (s, a).

    Q*(s,a) = R(s,a) + γ Σ_s' P(s'|s,a) max_a' Q*(s',a')
    """
    n_states = Q.shape[0]
    q_value = R[s, a]

    for s_prime in range(n_states):
        prob = P[s, a, s_prime]
        if prob > 0:
            max_next_q = np.max(Q[s_prime, :])
            q_value += gamma * prob * max_next_q

    return q_value


# ================================================
# Value Iteration: Solve Bellman Optimality Equations
# ================================================
def value_iteration(
    P: np.ndarray,
    R: np.ndarray,
    gamma: float,
    epsilon: float = 1e-6,
    max_iters: int = 1000
) -> Tuple[np.ndarray, np.ndarray, int]:
    """
    Find V* and π* using value iteration.

    Repeatedly apply Bellman optimality backups until convergence.

    Returns: (V_star, policy, iterations)
    """
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    policy = np.zeros(n_states, dtype=int)

    for iteration in range(max_iters):
        V_new = np.zeros(n_states)
        for s in range(n_states):
            V_new[s], policy[s] = bellman_optimality_v(s, P, R, V, gamma)

        # Check convergence
        delta = np.max(np.abs(V_new - V))
        V = V_new

        if delta < epsilon:
            return V, policy, iteration + 1

    return V, policy, max_iters


def q_value_iteration(
    P: np.ndarray,
    R: np.ndarray,
    gamma: float,
    epsilon: float = 1e-6,
    max_iters: int = 1000
) -> Tuple[np.ndarray, int]:
    """
    Find Q* using Q-value iteration.

    Returns: (Q_star, iterations)
    """
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))

    for iteration in range(max_iters):
        Q_new = np.zeros((n_states, n_actions))
        for s in range(n_states):
            for a in range(n_actions):
                Q_new[s, a] = bellman_optimality_q(s, a, P, R, Q, gamma)

        delta = np.max(np.abs(Q_new - Q))
        Q = Q_new

        if delta < epsilon:
            return Q, iteration + 1

    return Q, max_iters


# ================================================
# Demonstration
# ================================================
def demo_value_iteration():
    """
    Solve a simple MDP using value iteration.
    """
    # Gridworld: 3x3 grid, goal at (2,2), hole at (1,1)
    # 0 1 2
    # 3 H 5
    # 6 7 G
    n_states = 9
    n_actions = 4  # Up, Down, Left, Right
    goal = 8
    hole = 4

    # Build transition matrix (deterministic moves, stay at walls)
    P = np.zeros((n_states, n_actions, n_states))

    def next_state(s, a):
        if s == goal or s == hole:
            return s  # Terminal states
        row, col = s // 3, s % 3
        if a == 0: row = max(0, row - 1)  # Up
        if a == 1: row = min(2, row + 1)  # Down
        if a == 2: col = max(0, col - 1)  # Left
        if a == 3: col = min(2, col + 1)  # Right
        return row * 3 + col

    for s in range(n_states):
        for a in range(n_actions):
            P[s, a, next_state(s, a)] = 1.0

    # Rewards: +1 for entering the goal, -1 for entering the hole
    R = np.zeros((n_states, n_actions))
    for s in range(n_states):
        if s in (goal, hole):
            continue  # terminal states accrue no further reward
        for a in range(n_actions):
            if next_state(s, a) == goal:
                R[s, a] = 1.0
            elif next_state(s, a) == hole:
                R[s, a] = -1.0

    gamma = 0.9

    print("=== Value Iteration on 3x3 Gridworld ===")
    print("Grid: G=goal(+1), H=hole(-1)")
    print("  0 1 2")
    print("  3 H 5")
    print("  6 7 G")

    V_star, policy, iters = value_iteration(P, R, gamma)

    print(f"\nConverged in {iters} iterations")
    print(f"\nOptimal Values V*:")
    print(V_star.reshape(3, 3).round(2))

    action_names = ['↑', '↓', '←', '→']
    print(f"\nOptimal Policy π*:")
    policy_grid = [action_names[a] for a in policy]
    policy_grid[hole] = 'H'
    policy_grid[goal] = 'G'
    for i in range(3):
        print(' '.join(policy_grid[i*3:(i+1)*3]))


demo_value_iteration()
```

Why do value iteration and policy evaluation converge? The answer lies in the contraction mapping theorem.
Definition: Contraction Mapping
A function $T$ is a $\gamma$-contraction in the max-norm if for all $V_1, V_2$:
$$\|TV_1 - TV_2\|_\infty \leq \gamma \|V_1 - V_2\|_\infty$$
The function "shrinks" distances between inputs.
Bellman Operator:
Define the Bellman optimality operator $\mathcal{T}$:
$$(\mathcal{T}V)(s) = \max_a \left[ R(s, a) + \gamma \sum_{s'} P(s'|s, a) V(s') \right]$$
Theorem: The Bellman operator $\mathcal{T}$ is a $\gamma$-contraction.
Banach Fixed Point Theorem: Every contraction mapping has a unique fixed point, and repeatedly applying the mapping converges to that fixed point.
Implication: Since $\mathcal{T}$ is a $\gamma$-contraction:

- It has a unique fixed point, and that fixed point is $V^*$.
- Value iteration ($V_{k+1} = \mathcal{T}V_k$) converges to $V^*$ from any initialization.
- The error shrinks geometrically: $\|V_k - V^*\|_\infty \leq \gamma^k \|V_0 - V^*\|_\infty$.
For any two value functions $V_1, V_2$ and any state $s$:
$$|(\mathcal{T}V_1)(s) - (\mathcal{T}V_2)(s)| = \left| \max_a \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) V_1(s') \right] - \max_a \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) V_2(s') \right] \right|$$
$$\leq \max_a \, \gamma \left| \sum_{s'} P(s'|s,a) \bigl( V_1(s') - V_2(s') \bigr) \right| \leq \gamma \|V_1 - V_2\|_\infty$$
The difference of maxima over actions is bounded by $\gamma$ times the maximum difference in next-state values.
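To see the contraction numerically, here is a small self-contained check on a randomly generated MDP (the setup and names are ours, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

# Random MDP: each P[s, a, :] is a probability distribution over next states
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))

def bellman_optimality_operator(V: np.ndarray) -> np.ndarray:
    """(T V)(s) = max_a [R(s,a) + γ Σ_s' P(s'|s,a) V(s')]"""
    return np.max(R + gamma * P @ V, axis=1)

V1, V2 = rng.normal(size=n_states), rng.normal(size=n_states)
lhs = np.max(np.abs(bellman_optimality_operator(V1) - bellman_optimality_operator(V2)))
rhs = gamma * np.max(np.abs(V1 - V2))
print(f"||TV1 - TV2||_inf = {lhs:.4f}  <=  gamma * ||V1 - V2||_inf = {rhs:.4f}")
```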
| Property | Value Iteration | Policy Iteration |
|---|---|---|
| Fixed point | $V^*$ (optimal value) | $V^*$ and $\pi^*$ |
| Update type | Value updates only | Alternates policy eval & improvement |
| Convergence | Geometric $\gamma^k$ | Finite (at most $|\mathcal{A}|^{|\mathcal{S}|}$ switches) |
| Per iteration cost | $O(|\mathcal{S}|^2|\mathcal{A}|)$ | $O(|\mathcal{S}|^3)$ per evaluation |
| Better for | Cheap sweeps when exact evaluation is too costly | Few iterations when evaluation is affordable |
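Since value iteration is coded above but policy iteration is not, here is a compact sketch of policy iteration under the same tabular conventions (`P` of shape `[n_states, n_actions, n_states]`, `R` of shape `[n_states, n_actions]`); it uses exact evaluation via a linear solve, one reasonable choice among several:

```python
import numpy as np

def policy_iteration(P: np.ndarray, R: np.ndarray, gamma: float, max_iters: int = 100):
    """Alternate exact policy evaluation and greedy improvement until the policy is stable."""
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)  # arbitrary initial deterministic policy

    for _ in range(max_iters):
        # Policy evaluation: solve (I - γ P^π) V = R^π exactly
        P_pi = P[np.arange(n_states), policy, :]   # [n_states, n_states]
        R_pi = R[np.arange(n_states), policy]      # [n_states]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

        # Policy improvement: greedy one-step lookahead
        Q = R + gamma * P @ V                      # [n_states, n_actions]
        new_policy = np.argmax(Q, axis=1)

        if np.array_equal(new_policy, policy):     # stable policy -> optimal
            return V, policy
        policy = new_policy

    return V, policy
```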
Backup diagrams provide intuitive visualization of Bellman updates. They show the flow of value information from future states back to the current state.
Components:

- ◯ Open circles are states.
- ● Filled circles are actions (state-action pairs).
- Branches from a state to its actions are weighted by $\pi(a|s)$; branches from an action to next states are weighted by $P(s'|s,a)$.
Bellman Expectation Backup for V^π:
```
      ◯ s
     /|\
    ● ● ●          (actions, weighted by π(a|s))
   /|\ /|\
  ◯ ◯ ◯ ◯          (next states, weighted by P(s'|s,a))
```
Value flows from next states up through actions (averaging) to current state.
Bellman Optimality Backup for V*:

```
      ◯ s
      |
     max
     /|\
    ● ● ●          (actions, take max)
   /|\ /|\
  ◯ ◯ ◯ ◯          (next states, weighted by P(s'|s,a))
```
Value flows from next states, but we take the max over actions instead of averaging.
The Bellman equations directly inspire the major families of RL algorithms:
Dynamic Programming (known model):
| Algorithm | Updates | Based On |
|---|---|---|
| Policy Evaluation | $V^\pi$ via expectation backups | Bellman expectation for V |
| Policy Iteration | Alternates eval & improve | Both expectation & optimality |
| Value Iteration | $V^*$ via optimality backups | Bellman optimality for V |
Model-Free Learning (unknown model):
| Algorithm | Updates | Bellman Equation |
|---|---|---|
| Monte Carlo | $V(s) \leftarrow \text{avg}(G)$ | Expectation (sample-based) |
| TD(0) | $V(s) \leftarrow V(s) + \alpha[r + \gamma V(s') - V(s)]$ | Expectation (bootstrapping) |
| SARSA | $Q(s,a) \leftarrow Q + \alpha[r + \gamma Q(s',a') - Q]$ | Expectation for Q |
| Q-Learning | $Q(s,a) \leftarrow Q + \alpha[r + \gamma \max_{a'} Q(s',a') - Q]$ | Optimality for Q |
Every algorithm is either solving the Bellman equations exactly (DP) or approximating the solution through samples (RL).
```python
import numpy as np


def td0_update(V, s, r, s_prime, alpha, gamma):
    """
    TD(0) update - sample-based Bellman expectation backup.

    Target: R + γV(s')
    Update: V(s) += α [target - V(s)]
    """
    target = r + gamma * V[s_prime]
    V[s] = V[s] + alpha * (target - V[s])
    return V


def q_learning_update(Q, s, a, r, s_prime, alpha, gamma):
    """
    Q-Learning update - sample-based Bellman optimality backup.

    Target: R + γ max_a' Q(s', a')
    Update: Q(s,a) += α [target - Q(s,a)]
    """
    target = r + gamma * np.max(Q[s_prime, :])
    Q[s, a] = Q[s, a] + alpha * (target - Q[s, a])
    return Q


def sarsa_update(Q, s, a, r, s_prime, a_prime, alpha, gamma):
    """
    SARSA update - sample-based Bellman expectation backup for Q.

    Target: R + γ Q(s', a')  [using actual next action]
    Update: Q(s,a) += α [target - Q(s,a)]
    """
    target = r + gamma * Q[s_prime, a_prime]
    Q[s, a] = Q[s, a] + alpha * (target - Q[s, a])
    return Q


# ================================================
# Key Insight: The Bellman Error
# ================================================
def bellman_error_q(Q, s, a, r, s_prime, gamma):
    """
    The Bellman error (or TD error) for Q-learning.

    δ = R + γ max_a' Q(s', a') - Q(s, a)

    At convergence (Q = Q*), the expected Bellman error is zero.
    Learning minimizes (squared) Bellman error.
    """
    target = r + gamma * np.max(Q[s_prime, :])
    current = Q[s, a]
    return target - current


print("=== Algorithm Relationship to Bellman ===")
print()
print("All RL algorithms minimize Bellman error:")
print("  δ = [R + γ · (next value)] - (current value)")
print()
print("DP methods: Sweep over all states, apply exact backups")
print("MC methods: Estimate backup target with full returns")
print("TD methods: Estimate backup target with bootstrapping")
print()
print("The Bellman equation IS reinforcement learning!")
```

Understanding Bellman equations leads to practical insights:
Synchronous vs. Asynchronous Updates:
Value iteration typically uses synchronous updates—compute all new values, then replace. But asynchronous updates (update states in any order, using latest values) also converge and can be faster if updates are prioritized.
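As a sketch of the asynchronous idea (assuming the same tabular `P` and `R` arrays as the earlier examples), an in-place, Gauss-Seidel-style sweep updates `V[s]` immediately, so later states in the same sweep already see the fresh value:

```python
import numpy as np

def value_iteration_in_place(P: np.ndarray, R: np.ndarray, gamma: float,
                             epsilon: float = 1e-6, max_sweeps: int = 1000) -> np.ndarray:
    """Asynchronous (in-place) value iteration: no separate V_new array."""
    n_states = P.shape[0]
    V = np.zeros(n_states)
    for _ in range(max_sweeps):
        delta = 0.0
        for s in range(n_states):
            old = V[s]
            V[s] = np.max(R[s] + gamma * P[s] @ V)  # later states see this update immediately
            delta = max(delta, abs(V[s] - old))
        if delta < epsilon:
            break
    return V
```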
Prioritized Sweeping:
Focus updates on states whose values are likely to change the most. If a predecessor state's value changed significantly, its successors' values are now stale.
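A simplified sketch of that idea, again assuming the tabular `P` and `R` arrays from earlier (the priority-queue bookkeeping and threshold here are our own illustrative choices):

```python
import heapq
import numpy as np

def prioritized_sweeping_values(P: np.ndarray, R: np.ndarray, gamma: float,
                                n_updates: int = 1000, theta: float = 1e-5) -> np.ndarray:
    """Apply Bellman optimality backups in order of how much they would change each state."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)

    # Predecessors of s': states s with some action that can reach s'
    predecessors = [set() for _ in range(n_states)]
    for s in range(n_states):
        for a in range(n_actions):
            for s_next in np.nonzero(P[s, a])[0]:
                predecessors[s_next].add(s)

    def residual(s: int) -> float:
        return abs(np.max(R[s] + gamma * P[s] @ V) - V[s])

    # Max-priority queue via negated priorities (heapq is a min-heap); duplicates are tolerated
    queue = [(-residual(s), s) for s in range(n_states)]
    heapq.heapify(queue)

    for _ in range(n_updates):
        if not queue:
            break
        neg_priority, s = heapq.heappop(queue)
        if -neg_priority < theta:
            break
        V[s] = np.max(R[s] + gamma * P[s] @ V)      # backup the highest-priority state
        for p in predecessors[s]:                    # its predecessors' backups are now stale
            pr = residual(p)
            if pr > theta:
                heapq.heappush(queue, (-pr, p))
    return V
```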
Real-Time Dynamic Programming (RTDP):
Update only states encountered during simulation. Focuses computation on relevant parts of the state space.
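A minimal sketch of that pattern (the trajectory-sampling setup is our own illustration, assuming the tabular `P` and `R` arrays from earlier plus a known start state and terminal set):

```python
import numpy as np

def rtdp(P: np.ndarray, R: np.ndarray, gamma: float, start_state: int,
         terminal_states: set, n_trials: int = 200, max_steps: int = 100, seed: int = 0) -> np.ndarray:
    """Back up only the states visited while acting greedily w.r.t. the current estimate."""
    rng = np.random.default_rng(seed)
    V = np.zeros(P.shape[0])
    for _ in range(n_trials):
        s = start_state
        for _ in range(max_steps):
            q = R[s] + gamma * P[s] @ V            # one-step lookahead over all actions
            a = int(np.argmax(q))
            V[s] = q[a]                            # Bellman optimality backup at the visited state
            if s in terminal_states:
                break
            s = int(rng.choice(P.shape[0], p=P[s, a]))  # simulate one transition
    return V
```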
Function Approximation:
For large/continuous state spaces, we can't store tabular V or Q. Instead, we approximate: $$V(s) \approx V_\theta(s) = \theta^T \phi(s)$$
Training minimizes Bellman error: $L(\theta) = \mathbb{E}[(r + \gamma V_\theta(s') - V_\theta(s))^2]$
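A minimal semi-gradient TD(0) sketch for this linear case; the feature map `phi` and the transition sampler `sample_transition` are hypothetical placeholders you would supply:

```python
import numpy as np

def semi_gradient_td0(sample_transition, phi, n_features: int, gamma: float,
                      alpha: float = 0.05, n_steps: int = 10_000, seed: int = 0) -> np.ndarray:
    """Linear TD(0): V_θ(s) = θᵀ φ(s), nudged toward the bootstrapped target r + γ V_θ(s')."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(n_features)
    for _ in range(n_steps):
        s, r, s_next = sample_transition(rng)      # one sampled (s, r, s') transition
        v_s, v_next = theta @ phi(s), theta @ phi(s_next)
        td_error = r + gamma * v_next - v_s        # sampled Bellman (TD) error
        theta += alpha * td_error * phi(s)         # semi-gradient: the target is held fixed
    return theta
```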
Combining (1) function approximation, (2) bootstrapping (using estimates in targets), and (3) off-policy learning can cause divergence—value estimates can grow without bound. This is called the 'deadly triad' and motivates careful algorithm design in deep RL (target networks, clipping, etc.).
We have reached the mathematical heart of reinforcement learning. The Bellman equations express the recursive structure of value, enabling both exact solutions (DP) and sample-based learning (RL).
Module Complete!
With the Bellman equations mastered, you now possess the complete mathematical foundation of MDPs. The next modules will build upon this foundation, exploring value-based methods (Q-learning), policy-based methods (policy gradients), and their modern deep learning incarnations.
Congratulations! You have completed Module 2: Markov Decision Processes. You now understand the complete MDP framework—the 5-tuple formulation, state transitions, reward signals, discount factors, and the Bellman equations that tie everything together. This is the theoretical foundation that every RL algorithm builds upon.