Richard Bellman's insight, developed in the 1950s, transformed how we think about sequential decision-making: the value of a state can be decomposed into immediate reward plus discounted value of successor states.
This simple observation—that long-term value has recursive structure—underlies virtually every reinforcement learning algorithm. The Bellman equations express this recursion mathematically:
$$V(s) = R(s) + \gamma \sum_{s'} P(s'|s) V(s')$$
At first glance, this appears circular: we define $V(s)$ in terms of $V(s')$. But this is precisely the structure that enables dynamic programming. We can solve for $V$ simultaneously for all states, or iteratively refine estimates until convergence.
The Bellman equations come in two forms: the expectation equations, which describe value under a fixed policy $\pi$, and the optimality equations, which characterize the value of the optimal policy $\pi^*$.
Together, they form the mathematical backbone of RL theory.
By the end of this page, you will understand both forms of the Bellman equations in depth—their derivation, interpretation, matrix form, connection to algorithms, and why they guarantee convergence. You'll be able to set up and solve Bellman equations for concrete MDPs.
Before diving into Bellman equations, let's precisely define the value functions they describe.
State Value Function $V^\pi(s)$:
$$V^\pi(s) = \mathbb{E}_\pi\left[G_t | S_t = s\right] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k} | S_t = s\right]$$
The expected return starting from state $s$ and following policy $\pi$ thereafter.
Action Value Function $Q^\pi(s, a)$:
$$Q^\pi(s, a) = \mathbb{E}_\pi\left[G_t | S_t = s, A_t = a\right]$$
The expected return starting from state $s$, taking action $a$, then following policy $\pi$.
Relationship between V and Q:
$$V^\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s) Q^\pi(s, a)$$
$$Q^\pi(s, a) = R(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s, a) V^\pi(s')$$
These relationships are crucial—$V$ and $Q$ encode the same information differently, and we can convert between them.
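To make the conversion concrete, here is a minimal vectorized sketch for the tabular arrays used throughout this page (shapes `[n_states, n_actions, n_states]` for `P` and `[n_states, n_actions]` for `R`, `Q`, and the policy); the helper names `q_from_v` and `v_from_q` are ours, not from any library:

```python
import numpy as np

def q_from_v(V: np.ndarray, P: np.ndarray, R: np.ndarray, gamma: float) -> np.ndarray:
    """Q^π(s,a) = R(s,a) + γ Σ_s' P(s'|s,a) V^π(s')  -- needs the model (P, R)."""
    return R + gamma * np.einsum('sap,p->sa', P, V)

def v_from_q(Q: np.ndarray, policy: np.ndarray) -> np.ndarray:
    """V^π(s) = Σ_a π(a|s) Q^π(s,a)  -- needs only the policy."""
    return np.sum(policy * Q, axis=1)
```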
| Property | V(s) | Q(s, a) |
|---|---|---|
| Domain | States only | State-action pairs |
| Size (tabular) | $|\mathcal{S}|$ | $|\mathcal{S}| \times |\mathcal{A}|$ |
| First action | Averaged over $\pi(a\mid s)$ | Explicit argument $a$ |
| Model needed to act greedily? | Yes (one-step lookahead) | No ($\arg\max_a$ suffices) |
| Common use | Planning with model | Model-free learning |
The Bellman Expectation Equations describe the value of states (or state-actions) under a fixed policy $\pi$. They express that value is immediate reward plus discounted expected future value.
For the State Value Function:
$$V^\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s) \left[ R(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s, a) V^\pi(s') \right]$$
For the Action Value Function:
$$Q^\pi(s, a) = R(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s, a) \sum_{a' \in \mathcal{A}} \pi(a'|s') Q^\pi(s', a')$$
Derivation (for V):
Starting from the definition:
$$\begin{aligned} V^\pi(s) &= \mathbb{E}_\pi[G_t | S_t = s] \\ &= \mathbb{E}_\pi[R_t + \gamma G_{t+1} | S_t = s] \\ &= \mathbb{E}_\pi[R_t | S_t = s] + \gamma\, \mathbb{E}_\pi[G_{t+1} | S_t = s] \\ &= \mathbb{E}_\pi[R_t | S_t = s] + \gamma\, \mathbb{E}_\pi[V^\pi(S_{t+1}) | S_t = s] \end{aligned}$$
Expanding the expectations over $\pi$ and $P$ gives the full equation.
```python
import numpy as np
from typing import Dict, Tuple


def bellman_expectation_v(
    s: int,
    policy: np.ndarray,  # π(a|s): shape [n_states, n_actions]
    P: np.ndarray,       # P(s'|s,a): shape [n_states, n_actions, n_states]
    R: np.ndarray,       # R(s,a): shape [n_states, n_actions]
    V: np.ndarray,       # Current V estimate: shape [n_states]
    gamma: float
) -> float:
    """
    Compute Bellman expectation backup for state s.

    V^π(s) = Σ_a π(a|s) [R(s,a) + γ Σ_s' P(s'|s,a) V^π(s')]
    """
    n_actions = P.shape[1]
    value = 0.0

    for a in range(n_actions):
        prob_action = policy[s, a]
        if prob_action == 0:
            continue

        # Immediate reward
        r = R[s, a]

        # Expected future value
        future_value = gamma * np.sum(P[s, a, :] * V)

        # Add to total, weighted by action probability
        value += prob_action * (r + future_value)

    return value


def bellman_expectation_q(
    s: int,
    a: int,
    policy: np.ndarray,
    P: np.ndarray,
    R: np.ndarray,
    Q: np.ndarray,  # Current Q estimate: shape [n_states, n_actions]
    gamma: float
) -> float:
    """
    Compute Bellman expectation backup for state-action (s, a).

    Q^π(s,a) = R(s,a) + γ Σ_s' P(s'|s,a) Σ_a' π(a'|s') Q^π(s',a')
    """
    n_states, n_actions = Q.shape

    # Immediate reward
    q_value = R[s, a]

    # Expected future value
    for s_prime in range(n_states):
        prob_transition = P[s, a, s_prime]
        if prob_transition == 0:
            continue

        # Expected Q at next state under policy
        expected_next_q = np.sum(policy[s_prime, :] * Q[s_prime, :])
        q_value += gamma * prob_transition * expected_next_q

    return q_value


# ================================================
# Policy Evaluation: Solve Bellman Expectation Equations
# ================================================
def iterative_policy_evaluation(
    policy: np.ndarray,
    P: np.ndarray,
    R: np.ndarray,
    gamma: float,
    epsilon: float = 1e-6,
    max_iters: int = 1000
) -> Tuple[np.ndarray, int]:
    """
    Solve for V^π by iteratively applying Bellman backups.

    This finds the unique fixed point of the Bellman expectation equation.
    """
    n_states = P.shape[0]
    V = np.zeros(n_states)

    for iteration in range(max_iters):
        V_new = np.zeros(n_states)
        for s in range(n_states):
            V_new[s] = bellman_expectation_v(s, policy, P, R, V, gamma)

        # Check convergence
        delta = np.max(np.abs(V_new - V))
        V = V_new

        if delta < epsilon:
            return V, iteration + 1

    return V, max_iters


# ================================================
# Example: Evaluate a Random Policy
# ================================================
def demo_policy_evaluation():
    """
    Evaluate a random policy on a simple MDP.
    """
    # 4-state chain: 0 -- 1 -- 2 -- 3 (goal)
    n_states = 4
    n_actions = 2  # Left (0), Right (1)

    # Transition probabilities (deterministic)
    P = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        P[s, 0, max(0, s-1)] = 1.0            # Left
        P[s, 1, min(n_states-1, s+1)] = 1.0   # Right

    # Rewards: +1 for reaching goal (state 3)
    R = np.zeros((n_states, n_actions))
    R[2, 1] = 1.0  # Reward for entering goal from state 2

    # Random policy: 50% left, 50% right
    policy_random = np.ones((n_states, n_actions)) / n_actions

    # Optimal policy: always go right
    policy_optimal = np.zeros((n_states, n_actions))
    policy_optimal[:, 1] = 1.0  # Always action 1 (right)

    gamma = 0.9

    print("=== Policy Evaluation ===")
    print(f"MDP: 4-state chain, goal at state 3, γ={gamma}")

    V_random, iters = iterative_policy_evaluation(policy_random, P, R, gamma)
    print(f"\nRandom policy V^π: {V_random.round(3)}")
    print(f"  Converged in {iters} iterations")

    V_optimal, iters = iterative_policy_evaluation(policy_optimal, P, R, gamma)
    print(f"\nOptimal policy V^π*: {V_optimal.round(3)}")
    print(f"  Converged in {iters} iterations")

    print("\nNote: Optimal policy has higher values everywhere!")


demo_policy_evaluation()
```

For finite MDPs, the Bellman expectation equation can be written in matrix-vector form, enabling direct solution.
Let $\mathbf{P}^\pi$ be the state transition matrix under policy $\pi$:
$$(\mathbf{P}^\pi)_{ss'} = \sum_{a \in \mathcal{A}} \pi(a|s) P(s'|s, a)$$
Let $\mathbf{R}^\pi$ be the expected immediate reward vector:
$$(\mathbf{R}^\pi)_s = \sum_{a \in \mathcal{A}} \pi(a|s) R(s, a)$$
The Bellman expectation equation becomes:
$$\mathbf{V}^\pi = \mathbf{R}^\pi + \gamma \mathbf{P}^\pi \mathbf{V}^\pi$$
This is a system of linear equations! Rearranging:
$$(\mathbf{I} - \gamma \mathbf{P}^\pi) \mathbf{V}^\pi = \mathbf{R}^\pi$$
$$\mathbf{V}^\pi = (\mathbf{I} - \gamma \mathbf{P}^\pi)^{-1} \mathbf{R}^\pi$$
When $\gamma < 1$, the matrix $(\mathbf{I} - \gamma \mathbf{P}^\pi)$ is always invertible (the Neumann series converges), giving a unique solution.
```python
import numpy as np


def compute_policy_matrices(
    policy: np.ndarray,  # [n_states, n_actions]
    P: np.ndarray,       # [n_states, n_actions, n_states]
    R: np.ndarray        # [n_states, n_actions]
) -> tuple:
    """
    Compute the policy-specific transition matrix P^π and reward vector R^π.
    """
    n_states, n_actions = policy.shape

    # P^π[s, s'] = Σ_a π(a|s) P(s'|s,a)
    P_pi = np.zeros((n_states, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            P_pi[s, :] += policy[s, a] * P[s, a, :]

    # R^π[s] = Σ_a π(a|s) R(s,a)
    R_pi = np.sum(policy * R, axis=1)

    return P_pi, R_pi


def solve_bellman_direct(
    policy: np.ndarray,
    P: np.ndarray,
    R: np.ndarray,
    gamma: float
) -> np.ndarray:
    """
    Directly solve the Bellman equation using matrix inversion.

    V^π = (I - γP^π)^{-1} R^π

    Complexity: O(n³) for matrix inversion
    Best for: Small state spaces, exact solution needed
    """
    n_states = policy.shape[0]
    P_pi, R_pi = compute_policy_matrices(policy, P, R)

    # Solve: (I - γP^π) V = R^π
    A = np.eye(n_states) - gamma * P_pi
    V = np.linalg.solve(A, R_pi)

    return V


def neumann_series_solution(
    policy: np.ndarray,
    P: np.ndarray,
    R: np.ndarray,
    gamma: float,
    n_terms: int = 100
) -> np.ndarray:
    """
    Approximate solution using Neumann series expansion.

    (I - γP^π)^{-1} = I + γP^π + γ²(P^π)² + γ³(P^π)³ + ...
    V^π = R^π + γP^π R^π + γ²(P^π)² R^π + ...

    This shows the interpretation: value = sum of discounted future rewards.
    """
    P_pi, R_pi = compute_policy_matrices(policy, P, R)
    n_states = len(R_pi)

    V = np.zeros(n_states)
    P_power = np.eye(n_states)  # (P^π)^0 = I

    for k in range(n_terms):
        contribution = (gamma ** k) * (P_power @ R_pi)
        V += contribution
        P_power = P_power @ P_pi  # Update to (P^π)^{k+1}

    return V


# ================================================
# Compare Methods
# ================================================
def compare_solution_methods():
    """
    Compare iterative, direct, and Neumann series solutions.

    All should give the same answer.
    """
    # Same MDP as before
    n_states = 4
    n_actions = 2

    P = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        P[s, 0, max(0, s-1)] = 1.0
        P[s, 1, min(n_states-1, s+1)] = 1.0

    R = np.zeros((n_states, n_actions))
    R[2, 1] = 1.0

    policy = np.ones((n_states, n_actions)) / n_actions
    gamma = 0.9

    print("=== Comparing Bellman Solution Methods ===")

    # Method 1: Iterative
    # from bellman_expectation import iterative_policy_evaluation  # (defined above)
    # V_iter, _ = iterative_policy_evaluation(policy, P, R, gamma)
    # Using inline version for demonstration:
    V_iter = np.zeros(n_states)
    for _ in range(1000):
        V_new = np.zeros(n_states)
        for s in range(n_states):
            for a in range(n_actions):
                V_new[s] += policy[s, a] * (R[s, a] + gamma * np.sum(P[s, a, :] * V_iter))
        if np.max(np.abs(V_new - V_iter)) < 1e-10:
            break
        V_iter = V_new

    # Method 2: Direct (matrix inversion)
    V_direct = solve_bellman_direct(policy, P, R, gamma)

    # Method 3: Neumann series
    V_neumann = neumann_series_solution(policy, P, R, gamma, n_terms=50)

    print(f"Iterative:      {V_iter.round(6)}")
    print(f"Direct solve:   {V_direct.round(6)}")
    print(f"Neumann series: {V_neumann.round(6)}")
    print(f"\nMax difference: {np.max(np.abs(V_iter - V_direct)):.2e}")
    print("All methods agree! ✓")


compare_solution_methods()
```

Choosing among the three:

- Iterative: best for large state spaces (memory efficient; avoids storing and inverting the full matrix).
- Direct solve: best for small state spaces when an exact solution is needed immediately.
- Neumann series: illustrative (it shows that value is a sum of discounted future rewards) but rarely used in practice.
The Bellman Optimality Equations describe the value functions for the optimal policy $\pi^*$—the policy that maximizes expected return.
Optimal State Value Function:
$$V^*(s) = \max_{a \in \mathcal{A}} \left[ R(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s, a) V^*(s') \right]$$
Optimal Action Value Function:
$$Q^*(s, a) = R(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s, a) \max_{a' \in \mathcal{A}} Q^*(s', a')$$
The Crucial Difference:
Compared to expectation equations, the optimality equations replace the weighted sum $\sum_a \pi(a|s) [\ldots]$ with a max operation. Instead of averaging over actions according to a policy, we select the best action.
Relationships:
$$V^*(s) = \max_{a} Q^*(s, a)$$
$$Q^*(s, a) = R(s, a) + \gamma \sum_{s'} P(s'|s, a) V^*(s')$$
Knowing $Q^*$ immediately gives the optimal policy: $\pi^*(s) = \arg\max_a Q^*(s, a)$
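As a minimal sketch (the helper name is ours), extracting that greedy policy from a tabular `Q` array of shape `[n_states, n_actions]` is a single argmax per state, with no model required:

```python
import numpy as np

def greedy_policy_from_q(Q: np.ndarray) -> np.ndarray:
    """π*(s) = argmax_a Q*(s, a): one best-action index per state."""
    return np.argmax(Q, axis=1)
```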
```python
import numpy as np
from typing import Tuple


def bellman_optimality_v(
    s: int,
    P: np.ndarray,  # P(s'|s,a): [n_states, n_actions, n_states]
    R: np.ndarray,  # R(s,a): [n_states, n_actions]
    V: np.ndarray,  # Current V estimate: [n_states]
    gamma: float
) -> Tuple[float, int]:
    """
    Compute Bellman optimality backup for V at state s.

    V*(s) = max_a [R(s,a) + γ Σ_s' P(s'|s,a) V*(s')]

    Returns: (new_value, best_action)
    """
    n_actions = P.shape[1]
    q_values = []

    for a in range(n_actions):
        q = R[s, a] + gamma * np.sum(P[s, a, :] * V)
        q_values.append(q)

    best_action = np.argmax(q_values)
    best_value = q_values[best_action]

    return best_value, best_action


def bellman_optimality_q(
    s: int,
    a: int,
    P: np.ndarray,
    R: np.ndarray,
    Q: np.ndarray,  # Current Q estimate: [n_states, n_actions]
    gamma: float
) -> float:
    """
    Compute Bellman optimality backup for Q at (s, a).

    Q*(s,a) = R(s,a) + γ Σ_s' P(s'|s,a) max_a' Q*(s',a')
    """
    n_states = Q.shape[0]
    q_value = R[s, a]

    for s_prime in range(n_states):
        prob = P[s, a, s_prime]
        if prob > 0:
            max_next_q = np.max(Q[s_prime, :])
            q_value += gamma * prob * max_next_q

    return q_value


# ================================================
# Value Iteration: Solve Bellman Optimality Equations
# ================================================
def value_iteration(
    P: np.ndarray,
    R: np.ndarray,
    gamma: float,
    epsilon: float = 1e-6,
    max_iters: int = 1000
) -> Tuple[np.ndarray, np.ndarray, int]:
    """
    Find V* and π* using value iteration.

    Repeatedly apply Bellman optimality backups until convergence.

    Returns: (V_star, policy, iterations)
    """
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    policy = np.zeros(n_states, dtype=int)

    for iteration in range(max_iters):
        V_new = np.zeros(n_states)
        for s in range(n_states):
            V_new[s], policy[s] = bellman_optimality_v(s, P, R, V, gamma)

        # Check convergence
        delta = np.max(np.abs(V_new - V))
        V = V_new

        if delta < epsilon:
            return V, policy, iteration + 1

    return V, policy, max_iters


def q_value_iteration(
    P: np.ndarray,
    R: np.ndarray,
    gamma: float,
    epsilon: float = 1e-6,
    max_iters: int = 1000
) -> Tuple[np.ndarray, int]:
    """
    Find Q* using Q-value iteration.

    Returns: (Q_star, iterations)
    """
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))

    for iteration in range(max_iters):
        Q_new = np.zeros((n_states, n_actions))
        for s in range(n_states):
            for a in range(n_actions):
                Q_new[s, a] = bellman_optimality_q(s, a, P, R, Q, gamma)

        delta = np.max(np.abs(Q_new - Q))
        Q = Q_new

        if delta < epsilon:
            return Q, iteration + 1

    return Q, max_iters


# ================================================
# Demonstration
# ================================================
def demo_value_iteration():
    """
    Solve a simple MDP using value iteration.
    """
    # Gridworld: 3x3 grid, goal at (2,2), hole at (1,1)
    # 0 1 2
    # 3 H 5
    # 6 7 G
    n_states = 9
    n_actions = 4  # Up, Down, Left, Right
    goal = 8
    hole = 4

    # Build transition matrix (deterministic moves, stay at walls)
    P = np.zeros((n_states, n_actions, n_states))

    def next_state(s, a):
        if s == goal or s == hole:
            return s  # Terminal states
        row, col = s // 3, s % 3
        if a == 0: row = max(0, row - 1)  # Up
        if a == 1: row = min(2, row + 1)  # Down
        if a == 2: col = max(0, col - 1)  # Left
        if a == 3: col = min(2, col + 1)  # Right
        return row * 3 + col

    for s in range(n_states):
        for a in range(n_actions):
            P[s, a, next_state(s, a)] = 1.0

    # Rewards: +1 for entering the goal, -1 for entering the hole
    R = np.zeros((n_states, n_actions))
    for s in range(n_states):
        if s in (goal, hole):
            continue  # terminal states accrue no further reward
        for a in range(n_actions):
            if next_state(s, a) == goal:
                R[s, a] = 1.0
            elif next_state(s, a) == hole:
                R[s, a] = -1.0

    gamma = 0.9

    print("=== Value Iteration on 3x3 Gridworld ===")
    print("Grid: G=goal(+1), H=hole(-1)")
    print("  0 1 2")
    print("  3 H 5")
    print("  6 7 G")

    V_star, policy, iters = value_iteration(P, R, gamma)

    print(f"\nConverged in {iters} iterations")
    print(f"\nOptimal Values V*:")
    print(V_star.reshape(3, 3).round(2))

    action_names = ['↑', '↓', '←', '→']
    print(f"\nOptimal Policy π*:")
    policy_grid = [action_names[a] for a in policy]
    policy_grid[hole] = 'H'
    policy_grid[goal] = 'G'
    for i in range(3):
        print(' '.join(policy_grid[i*3:(i+1)*3]))


demo_value_iteration()
```

Why do value iteration and policy evaluation converge? The answer lies in the contraction mapping theorem.
Definition: Contraction Mapping
A function $T$ is a $\gamma$-contraction in the max-norm if for all $V_1, V_2$:
$$\|TV_1 - TV_2\|_\infty \leq \gamma \|V_1 - V_2\|_\infty$$
The function "shrinks" distances between inputs.
Bellman Operator:
Define the Bellman optimality operator $\mathcal{T}$:
$$(\mathcal{T}V)(s) = \max_a \left[ R(s, a) + \gamma \sum_{s'} P(s'|s, a) V(s') \right]$$
Theorem: The Bellman operator $\mathcal{T}$ is a $\gamma$-contraction.
Banach Fixed Point Theorem: Every contraction mapping has a unique fixed point, and repeatedly applying the mapping converges to that fixed point.
Implication: Since $\mathcal{T}$ is a $\gamma$-contraction:

- It has a unique fixed point, and that fixed point is $V^*$.
- Value iteration ($V_{k+1} = \mathcal{T}V_k$) converges to $V^*$ from any initialization.
- The error shrinks geometrically: $\|V_k - V^*\|_\infty \leq \gamma^k \|V_0 - V^*\|_\infty$.
For any two value functions $V_1, V_2$ and any state $s$:
$$|(\mathcal{T}V_1)(s) - (\mathcal{T}V_2)(s)| = \left| \max_a \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) V_1(s') \right] - \max_a \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) V_2(s') \right] \right|$$
$$\leq \max_a \, \gamma \left| \sum_{s'} P(s'|s,a) \bigl( V_1(s') - V_2(s') \bigr) \right| \leq \gamma \|V_1 - V_2\|_\infty$$
The difference of maxima over actions is bounded by $\gamma$ times the maximum difference in next-state values.
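To see the contraction numerically, here is a small self-contained check on a randomly generated MDP (the setup and names are ours, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

# Random MDP: each P[s, a, :] is a probability distribution over next states
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))

def bellman_optimality_operator(V: np.ndarray) -> np.ndarray:
    """(T V)(s) = max_a [R(s,a) + γ Σ_s' P(s'|s,a) V(s')]"""
    return np.max(R + gamma * P @ V, axis=1)

V1, V2 = rng.normal(size=n_states), rng.normal(size=n_states)
lhs = np.max(np.abs(bellman_optimality_operator(V1) - bellman_optimality_operator(V2)))
rhs = gamma * np.max(np.abs(V1 - V2))
print(f"||TV1 - TV2||_inf = {lhs:.4f}  <=  gamma * ||V1 - V2||_inf = {rhs:.4f}")
```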
| Property | Value Iteration | Policy Iteration |
|---|---|---|
| Fixed point | $V^*$ (optimal value) | $V^*$ and $\pi^*$ |
| Update type | Value updates only | Alternates policy eval & improvement |
| Convergence | Geometric $\gamma^k$ | Finite (at most $|\mathcal{A}|^{|\mathcal{S}|}$ switches) |
| Per iteration cost | $O(|\mathcal{S}|^2|\mathcal{A}|)$ | $O(|\mathcal{S}|^3)$ per evaluation |
| Better for | Cheap sweeps when exact evaluation is too costly | Few iterations when evaluation is affordable |
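Since value iteration is coded above but policy iteration is not, here is a compact sketch of policy iteration under the same tabular conventions (`P` of shape `[n_states, n_actions, n_states]`, `R` of shape `[n_states, n_actions]`); it uses exact evaluation via a linear solve, one reasonable choice among several:

```python
import numpy as np

def policy_iteration(P: np.ndarray, R: np.ndarray, gamma: float, max_iters: int = 100):
    """Alternate exact policy evaluation and greedy improvement until the policy is stable."""
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)  # arbitrary initial deterministic policy

    for _ in range(max_iters):
        # Policy evaluation: solve (I - γ P^π) V = R^π exactly
        P_pi = P[np.arange(n_states), policy, :]   # [n_states, n_states]
        R_pi = R[np.arange(n_states), policy]      # [n_states]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

        # Policy improvement: greedy one-step lookahead
        Q = R + gamma * P @ V                      # [n_states, n_actions]
        new_policy = np.argmax(Q, axis=1)

        if np.array_equal(new_policy, policy):     # stable policy -> optimal
            return V, policy
        policy = new_policy

    return V, policy
```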
Backup diagrams provide intuitive visualization of Bellman updates. They show the flow of value information from future states back to the current state.
Components:

- ◯ Open circles are states.
- ● Filled circles are actions (state-action pairs).
- Branches from a state to its actions are weighted by $\pi(a|s)$; branches from an action to next states are weighted by $P(s'|s,a)$.
Bellman Expectation Backup for V^π:
```
      ◯ s
     /|\
    ● ● ●          (actions, weighted by π(a|s))
   /|\ /|\
  ◯ ◯ ◯ ◯          (next states, weighted by P(s'|s,a))
```
Value flows from next states up through actions (averaging) to current state.
Bellman Optimality Backup for V*:

```
      ◯ s
      |
     max
     /|\
    ● ● ●          (actions, take max)
   /|\ /|\
  ◯ ◯ ◯ ◯          (next states, weighted by P(s'|s,a))
```
Value flows from next states, but we take the max over actions instead of averaging.
The Bellman equations directly inspire the major families of RL algorithms:
Dynamic Programming (known model):
| Algorithm | Updates | Based On |
|---|---|---|
| Policy Evaluation | $V^\pi$ via expectation backups | Bellman expectation for V |
| Policy Iteration | Alternates eval & improve | Both expectation & optimality |
| Value Iteration | $V^*$ via optimality backups | Bellman optimality for V |
Model-Free Learning (unknown model):
| Algorithm | Updates | Bellman Equation |
|---|---|---|
| Monte Carlo | $V(s) \leftarrow \text{avg}(G)$ | Expectation (sample-based) |
| TD(0) | $V(s) \leftarrow V(s) + \alpha[r + \gamma V(s') - V(s)]$ | Expectation (bootstrapping) |
| SARSA | $Q(s,a) \leftarrow Q + \alpha[r + \gamma Q(s',a') - Q]$ | Expectation for Q |
| Q-Learning | $Q(s,a) \leftarrow Q + \alpha[r + \gamma \max_{a'} Q(s',a') - Q]$ | Optimality for Q |
Every algorithm is either solving the Bellman equations exactly (DP) or approximating the solution through samples (RL).
```python
import numpy as np


def td0_update(V, s, r, s_prime, alpha, gamma):
    """
    TD(0) update - sample-based Bellman expectation backup.

    Target: R + γV(s')
    Update: V(s) += α [target - V(s)]
    """
    target = r + gamma * V[s_prime]
    V[s] = V[s] + alpha * (target - V[s])
    return V


def q_learning_update(Q, s, a, r, s_prime, alpha, gamma):
    """
    Q-Learning update - sample-based Bellman optimality backup.

    Target: R + γ max_a' Q(s', a')
    Update: Q(s,a) += α [target - Q(s,a)]
    """
    target = r + gamma * np.max(Q[s_prime, :])
    Q[s, a] = Q[s, a] + alpha * (target - Q[s, a])
    return Q


def sarsa_update(Q, s, a, r, s_prime, a_prime, alpha, gamma):
    """
    SARSA update - sample-based Bellman expectation backup for Q.

    Target: R + γ Q(s', a')  [using actual next action]
    Update: Q(s,a) += α [target - Q(s,a)]
    """
    target = r + gamma * Q[s_prime, a_prime]
    Q[s, a] = Q[s, a] + alpha * (target - Q[s, a])
    return Q


# ================================================
# Key Insight: The Bellman Error
# ================================================
def bellman_error_q(Q, s, a, r, s_prime, gamma):
    """
    The Bellman error (or TD error) for Q-learning.

    δ = R + γ max_a' Q(s', a') - Q(s, a)

    At convergence (Q = Q*), the expected Bellman error is zero.
    Learning minimizes (squared) Bellman error.
    """
    target = r + gamma * np.max(Q[s_prime, :])
    current = Q[s, a]
    return target - current


print("=== Algorithm Relationship to Bellman ===")
print()
print("All RL algorithms minimize Bellman error:")
print("  δ = [R + γ · (next value)] - (current value)")
print()
print("DP methods: Sweep over all states, apply exact backups")
print("MC methods: Estimate backup target with full returns")
print("TD methods: Estimate backup target with bootstrapping")
print()
print("The Bellman equation IS reinforcement learning!")
```

Understanding Bellman equations leads to practical insights:
Synchronous vs. Asynchronous Updates:
Value iteration typically uses synchronous updates—compute all new values, then replace. But asynchronous updates (update states in any order, using latest values) also converge and can be faster if updates are prioritized.
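As a sketch of the asynchronous idea (assuming the same tabular `P` and `R` arrays as the earlier examples), an in-place, Gauss-Seidel-style sweep updates `V[s]` immediately, so later states in the same sweep already see the fresh value:

```python
import numpy as np

def value_iteration_in_place(P: np.ndarray, R: np.ndarray, gamma: float,
                             epsilon: float = 1e-6, max_sweeps: int = 1000) -> np.ndarray:
    """Asynchronous (in-place) value iteration: no separate V_new array."""
    n_states = P.shape[0]
    V = np.zeros(n_states)
    for _ in range(max_sweeps):
        delta = 0.0
        for s in range(n_states):
            old = V[s]
            V[s] = np.max(R[s] + gamma * P[s] @ V)  # later states see this update immediately
            delta = max(delta, abs(V[s] - old))
        if delta < epsilon:
            break
    return V
```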
Prioritized Sweeping:
Focus updates on states whose values are likely to change the most. If a predecessor state's value changed significantly, its successors' values are now stale.
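A simplified sketch of that idea, again assuming the tabular `P` and `R` arrays from earlier (the priority-queue bookkeeping and threshold here are our own illustrative choices):

```python
import heapq
import numpy as np

def prioritized_sweeping_values(P: np.ndarray, R: np.ndarray, gamma: float,
                                n_updates: int = 1000, theta: float = 1e-5) -> np.ndarray:
    """Apply Bellman optimality backups in order of how much they would change each state."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)

    # Predecessors of s': states s with some action that can reach s'
    predecessors = [set() for _ in range(n_states)]
    for s in range(n_states):
        for a in range(n_actions):
            for s_next in np.nonzero(P[s, a])[0]:
                predecessors[s_next].add(s)

    def residual(s: int) -> float:
        return abs(np.max(R[s] + gamma * P[s] @ V) - V[s])

    # Max-priority queue via negated priorities (heapq is a min-heap); duplicates are tolerated
    queue = [(-residual(s), s) for s in range(n_states)]
    heapq.heapify(queue)

    for _ in range(n_updates):
        if not queue:
            break
        neg_priority, s = heapq.heappop(queue)
        if -neg_priority < theta:
            break
        V[s] = np.max(R[s] + gamma * P[s] @ V)      # backup the highest-priority state
        for p in predecessors[s]:                    # its predecessors' backups are now stale
            pr = residual(p)
            if pr > theta:
                heapq.heappush(queue, (-pr, p))
    return V
```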
Real-Time Dynamic Programming (RTDP):
Update only states encountered during simulation. Focuses computation on relevant parts of the state space.
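A minimal sketch of that pattern (the trajectory-sampling setup is our own illustration, assuming the tabular `P` and `R` arrays from earlier plus a known start state and terminal set):

```python
import numpy as np

def rtdp(P: np.ndarray, R: np.ndarray, gamma: float, start_state: int,
         terminal_states: set, n_trials: int = 200, max_steps: int = 100, seed: int = 0) -> np.ndarray:
    """Back up only the states visited while acting greedily w.r.t. the current estimate."""
    rng = np.random.default_rng(seed)
    V = np.zeros(P.shape[0])
    for _ in range(n_trials):
        s = start_state
        for _ in range(max_steps):
            q = R[s] + gamma * P[s] @ V            # one-step lookahead over all actions
            a = int(np.argmax(q))
            V[s] = q[a]                            # Bellman optimality backup at the visited state
            if s in terminal_states:
                break
            s = int(rng.choice(P.shape[0], p=P[s, a]))  # simulate one transition
    return V
```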
Function Approximation:
For large/continuous state spaces, we can't store tabular V or Q. Instead, we approximate: $$V(s) \approx V_\theta(s) = \theta^T \phi(s)$$
Training minimizes Bellman error: $L(\theta) = \mathbb{E}[(r + \gamma V_\theta(s') - V_\theta(s))^2]$
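A minimal semi-gradient TD(0) sketch for this linear case; the feature map `phi` and the transition sampler `sample_transition` are hypothetical placeholders you would supply:

```python
import numpy as np

def semi_gradient_td0(sample_transition, phi, n_features: int, gamma: float,
                      alpha: float = 0.05, n_steps: int = 10_000, seed: int = 0) -> np.ndarray:
    """Linear TD(0): V_θ(s) = θᵀ φ(s), nudged toward the bootstrapped target r + γ V_θ(s')."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(n_features)
    for _ in range(n_steps):
        s, r, s_next = sample_transition(rng)      # one sampled (s, r, s') transition
        v_s, v_next = theta @ phi(s), theta @ phi(s_next)
        td_error = r + gamma * v_next - v_s        # sampled Bellman (TD) error
        theta += alpha * td_error * phi(s)         # semi-gradient: the target is held fixed
    return theta
```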
Combining (1) function approximation, (2) bootstrapping (using estimates in targets), and (3) off-policy learning can cause divergence—value estimates can grow without bound. This is called the 'deadly triad' and motivates careful algorithm design in deep RL (target networks, clipping, etc.).
We have reached the mathematical heart of reinforcement learning. The Bellman equations express the recursive structure of value, enabling both exact solutions (DP) and sample-based learning (RL).
Module Complete!
With the Bellman equations mastered, you now possess the complete mathematical foundation of MDPs. The next modules will build upon this foundation, exploring value-based methods (Q-learning), policy-based methods (policy gradients), and their modern deep learning incarnations.
Congratulations! You have completed Module 2: Markov Decision Processes. You now understand the complete MDP framework—the 5-tuple formulation, state transitions, reward signals, discount factors, and the Bellman equations that tie everything together. This is the theoretical foundation that every RL algorithm builds upon.