Imagine you're designing an autonomous navigation system. Your robot stands at the entrance of an unfamiliar building, tasked with finding the shortest path to a specific room while avoiding obstacles and minimizing energy consumption. How should it decide which direction to move at each intersection? What makes one policy—a mapping from situations to actions—better than another?
Value iteration answers these questions by computing the optimal value of every possible state: the maximum expected cumulative reward an agent can achieve starting from that state and acting optimally thereafter. Once we know these optimal values, the optimal policy falls out naturally—simply choose actions that lead to states with the highest values.
This elegant algorithm, rooted in the principle of dynamic programming, was formalized by Richard Bellman in the 1950s and remains foundational to modern reinforcement learning. It provides both a theoretical cornerstone and a practical algorithm that, under the right conditions, provably converges to optimal solutions.
By the end of this page, you will master the complete theory and practice of value iteration: the Bellman optimality principle, the algorithm derivation, rigorous convergence proofs, implementation in code, computational complexity analysis, and practical considerations for applying value iteration to real problems.
Value iteration is built upon one of the most profound insights in sequential decision-making: Bellman's Principle of Optimality.
"An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision." — Richard Bellman, 1957
This principle states that optimal behavior decomposes recursively. If you're taking the optimal path from A to C through B, then the segment from B to C must also be optimal. This self-consistency is captured mathematically by the Bellman equations.
The recursive structure of optimal decisions transforms an exponentially complex search problem into a tractable computation. Instead of considering all possible action sequences (exponential in horizon), we can compute optimal values incrementally by leveraging the solution to smaller subproblems.
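To make the savings concrete, take some illustrative numbers (not from the text): $|A| = 4$ actions, a horizon of $H = 20$ decisions, and $|S| = 100$ states. Enumerating action sequences means examining

$$|A|^H = 4^{20} \approx 1.1 \times 10^{12}$$

sequences, while finite-horizon dynamic programming performs on the order of

$$H \cdot |S|^2 \cdot |A| = 20 \cdot 100^2 \cdot 4 = 8 \times 10^5$$

backup operations.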
Recall from the MDP framework that a policy $\pi: S \to A$ (or more generally $\pi: S \to \Delta(A)$ for stochastic policies) specifies how an agent selects actions. The state-value function under policy $\pi$ is:
$$V^\pi(s) = \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t R_{t+1} \mid S_0 = s \right]$$
This measures the expected cumulative discounted reward starting from state $s$ and following policy $\pi$ thereafter.
The optimal value function $V^*$ represents the best achievable value:
$$V^*(s) = \max_\pi V^\pi(s) \quad \text{for all } s \in S$$
Similarly, the optimal action-value (Q) function is:
$$Q^*(s, a) = \max_\pi Q^\pi(s, a)$$
where $Q^\pi(s, a)$ denotes the expected return when taking action $a$ in state $s$ and following $\pi$ thereafter.
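For completeness, the action-value function just described can be written in the same form as the state-value definition above:

$$Q^\pi(s, a) = \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t R_{t+1} \mid S_0 = s, A_0 = a \right]$$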
The optimal value function satisfies a remarkable self-consistency condition—the Bellman optimality equation:
$$V^*(s) = \max_{a \in A(s)} \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V^*(s') \right]$$
Let's unpack each component:
| Component | Meaning |
|---|---|
| $V^*(s)$ | Optimal value of state $s$—best achievable expected return |
| $\max_{a \in A(s)}$ | Choose the action that maximizes value (greedy selection) |
| $R(s, a)$ | Immediate reward for taking action $a$ in state $s$ |
| $\gamma$ | Discount factor, $0 \leq \gamma < 1$ |
| $P(s' \mid s, a)$ | Transition probability to state $s'$ |
| $\sum_{s'} P(s' \mid s, a) V^*(s')$ | Expected optimal future value |
The equation says: the optimal value of a state equals the immediate reward plus the discounted expected optimal value of the next state, maximized over all possible actions.
The Bellman optimality equation for Q-values is: $Q^*(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \max_{a'} Q^*(s', a')$. This form is crucial for Q-learning, which we'll cover in the next page. Note that $V^*(s) = \max_a Q^*(s, a)$.
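As a small illustration of how the two forms relate, the sketch below performs one backup of an arbitrary value function into Q-values and then maximizes over actions. The array shapes follow the conventions used in the implementation later on this page; the function names are ours, not from the text.

```python
import numpy as np

def q_from_v(P: np.ndarray, R: np.ndarray, V: np.ndarray, gamma: float) -> np.ndarray:
    """One Bellman backup: Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] * V[s']."""
    # P has shape (S, A, S), V has shape (S,); P @ V contracts the last axis -> (S, A)
    return R + gamma * P @ V

def bellman_optimality_operator(P: np.ndarray, R: np.ndarray, V: np.ndarray, gamma: float) -> np.ndarray:
    """(T* V)(s) = max_a Q(s, a). If V is already V*, this returns V* again."""
    return q_from_v(P, R, V, gamma).max(axis=1)
```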
The Bellman optimality equation defines $V^*$ implicitly—but how do we actually compute it? Value iteration is the answer: an iterative algorithm that repeatedly applies the Bellman optimality operator until convergence.
Define the Bellman optimality operator $T^*$ that maps a value function $V$ to a new value function:
$$(T^* V)(s) = \max_{a \in A(s)} \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' | s, a) V(s') \right]$$
The optimal value function $V^*$ is a fixed point of this operator: $T^* V^* = V^*$.
Value iteration repeatedly applies $T^*$ starting from an arbitrary initialization:
```python
import numpy as np
from typing import Tuple, Callable


def value_iteration(
    num_states: int,
    num_actions: int,
    transition_probs: np.ndarray,  # Shape: (S, A, S)
    rewards: np.ndarray,           # Shape: (S, A) or (S, A, S)
    gamma: float = 0.99,
    theta: float = 1e-8,
    max_iterations: int = 10000
) -> Tuple[np.ndarray, np.ndarray, list]:
    """
    Value Iteration Algorithm for finite MDPs.

    Args:
        num_states: Number of states |S|
        num_actions: Number of actions |A|
        transition_probs: P(s'|s,a) transition probability matrix
        rewards: R(s,a) or R(s,a,s') reward function
        gamma: Discount factor (0 <= gamma < 1)
        theta: Convergence threshold
        max_iterations: Maximum iterations to prevent infinite loops

    Returns:
        V: Optimal value function, shape (S,)
        policy: Optimal policy, shape (S,)
        deltas: History of max value changes per iteration
    """
    # Initialize value function arbitrarily (zeros is common)
    V = np.zeros(num_states)
    deltas = []

    # Handle different reward formats
    if rewards.ndim == 2:
        # R(s, a) format - expand to expected reward
        R_sa = rewards
    else:
        # R(s, a, s') format - compute expected reward
        R_sa = np.sum(transition_probs * rewards, axis=2)

    for iteration in range(max_iterations):
        V_new = np.zeros(num_states)

        for s in range(num_states):
            # Compute Q(s, a) for all actions
            Q_values = np.zeros(num_actions)
            for a in range(num_actions):
                # Bellman backup: R(s,a) + gamma * sum_s' P(s'|s,a) * V(s')
                expected_future_value = np.dot(transition_probs[s, a, :], V)
                Q_values[a] = R_sa[s, a] + gamma * expected_future_value

            # Max over actions (Bellman optimality)
            V_new[s] = np.max(Q_values)

        # Check for convergence
        delta = np.max(np.abs(V_new - V))
        deltas.append(delta)
        V = V_new.copy()

        if delta < theta:
            print(f"Converged in {iteration + 1} iterations (delta = {delta:.2e})")
            break
    else:
        print(f"Warning: Did not converge within {max_iterations} iterations")

    # Extract greedy policy from optimal value function
    policy = np.zeros(num_states, dtype=int)
    for s in range(num_states):
        Q_values = np.zeros(num_actions)
        for a in range(num_actions):
            expected_future_value = np.dot(transition_probs[s, a, :], V)
            Q_values[a] = R_sa[s, a] + gamma * expected_future_value
        policy[s] = np.argmax(Q_values)

    return V, policy, deltas
```

The algorithm above uses synchronous updates: all states are updated using the same previous value function $V_k$ before any $V_{k+1}$ values are used. Asynchronous variants update states one at a time, immediately using new values. Both converge, but asynchronous updates can sometimes be faster in practice.
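As a quick sanity check, here is a minimal usage sketch on a made-up two-state MDP (the transition and reward arrays are illustrative, and it assumes the `value_iteration` function above is in scope):

```python
import numpy as np

# Toy 2-state, 2-action MDP (made up for illustration).
# Action 0 stays in the current state; action 1 moves to the other state.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[1, 0, 1] = 1.0   # "stay"
P[0, 1, 1] = P[1, 1, 0] = 1.0   # "switch"
R = np.array([[0.0, 1.0],        # state 0: switching pays +1
              [0.0, 0.0]])       # state 1: no reward either way

V, policy, deltas = value_iteration(2, 2, P, R, gamma=0.9)
print(V)       # approx [5.26, 4.74]: V*(0) = 1 / (1 - gamma^2), V*(1) = gamma * V*(0)
print(policy)  # [1, 1]: switch in both states, collecting +1 every other step
```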
A critical question: does value iteration actually converge to $V^*$? The answer is yes, and the proof relies on the contraction mapping theorem.
Definition (Contraction Mapping): A mapping $T: X \to X$ on a metric space $(X, d)$ is a contraction if there exists $\gamma \in [0, 1)$ such that for all $x, y \in X$: $$d(Tx, Ty) \leq \gamma \cdot d(x, y)$$
Theorem: The Bellman optimality operator $T^*$ is a $\gamma$-contraction under the sup-norm (infinity norm): $$\|T^* V_1 - T^* V_2\|_\infty \leq \gamma \|V_1 - V_2\|_\infty$$
Proof: For any state $s$,
$$
\begin{aligned}
|(T^* V_1)(s) - (T^* V_2)(s)| &= \left| \max_a \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) V_1(s') \right] - \max_a \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) V_2(s') \right] \right| \\
&\leq \max_a \gamma \left| \sum_{s'} P(s'|s,a) (V_1(s') - V_2(s')) \right| \\
&\leq \gamma \max_a \sum_{s'} P(s'|s,a) |V_1(s') - V_2(s')| \\
&\leq \gamma \|V_1 - V_2\|_\infty
\end{aligned}
$$
The second step uses the property $|\max f - \max g| \leq \max |f - g|$.
The Banach fixed-point theorem guarantees that a contraction mapping on a complete metric space has a unique fixed point, and iterative application of the mapping converges to it from any starting point. Since the space of bounded value functions with the sup-norm is complete, value iteration converges to the unique V*.
Since $T^*$ is a $\gamma$-contraction, we have: $$\|V_k - V^*\|_\infty \leq \gamma^k \|V_0 - V^*\|_\infty$$
This gives us linear convergence with rate $\gamma$. The number of iterations needed to achieve $\epsilon$-accuracy is:
$$k \geq \frac{\log(\|V_0 - V^*\|_\infty / \epsilon)}{\log(1/\gamma)}$$
Practical implication: When $\gamma$ is close to 1 (e.g., 0.99 for long-horizon problems), convergence is slow. Each iteration multiplies the error bound by $\gamma$, so at $\gamma = 0.99$ we need $k$ with $0.99^k \leq 0.01$, i.e. $k \geq \ln(100)/\ln(1/0.99) \approx 460$ iterations, just to reduce the error by a factor of 100.
| Discount Factor γ | Approximate Iterations (for roughly a 1000× error reduction) |
|---|---|
| 0.9 | ~66 |
| 0.95 | ~133 |
| 0.99 | ~688 |
| 0.999 | ~6,905 |
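The ballpark figures in the table can be reproduced with a few lines; the sketch below assumes the target is roughly a 1000-fold error reduction:

```python
import numpy as np

def iterations_needed(gamma: float, error_reduction: float = 1000.0) -> int:
    """Smallest k with gamma**k <= 1/error_reduction, i.e. k >= log(error_reduction) / log(1/gamma)."""
    return int(np.ceil(np.log(error_reduction) / np.log(1.0 / gamma)))

for g in (0.9, 0.95, 0.99, 0.999):
    print(g, iterations_needed(g))   # prints 66, 135, 688, 6905 -- matching the table's ballpark
```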
Instead of computing $\|V_k - V^*\|_\infty$, which requires knowing $V^*$, we stop when $\|V_{k+1} - V_k\|_\infty < \theta$. The two are related by $\|V_k - V^*\|_\infty \leq \|V_{k+1} - V_k\|_\infty / (1 - \gamma)$. So for $\theta = 0.01$ and $\gamma = 0.99$, we guarantee $\|V_k - V^*\|_\infty \leq 1.0$.
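A minimal sketch of how this stopping rule turns into an accuracy guarantee (the function name is ours):

```python
def guaranteed_error_bound(delta: float, gamma: float) -> float:
    """Upper bound on ||V_k - V*||_inf once ||V_{k+1} - V_k||_inf <= delta."""
    return delta / (1.0 - gamma)

print(guaranteed_error_bound(0.01, 0.99))  # 1.0, as in the example above
```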
Let's apply value iteration to a concrete problem: the classic Gridworld environment. This example illuminates how value iteration propagates value information through the state space.
Consider a 4×4 grid with:
- 16 states; the top-left and bottom-right corners are terminal states
- Four actions available in every state: North, South, East, West
- Deterministic moves; bumping into a wall leaves the agent in place
- A reward of -1 for every step until a terminal state is reached (no discounting, $\gamma = 1$)
This is an episodic task: episodes end when reaching a terminal state.
```python
import numpy as np


class Gridworld:
    """4x4 Gridworld environment for value iteration demonstration."""

    def __init__(self, size=4):
        self.size = size
        self.num_states = size * size
        self.num_actions = 4  # N, S, E, W
        self.terminal_states = {0, self.num_states - 1}

        # Action effects: (row_delta, col_delta)
        self.action_effects = {
            0: (-1, 0),  # North
            1: (1, 0),   # South
            2: (0, 1),   # East
            3: (0, -1),  # West
        }

        # Build transition matrix P[s, a, s'] and reward R[s, a]
        self.P = np.zeros((self.num_states, self.num_actions, self.num_states))
        self.R = np.zeros((self.num_states, self.num_actions))
        self._build_dynamics()

    def _state_to_coord(self, s):
        return s // self.size, s % self.size

    def _coord_to_state(self, row, col):
        return row * self.size + col

    def _build_dynamics(self):
        for s in range(self.num_states):
            if s in self.terminal_states:
                # Terminal states: all actions lead to same state, zero reward
                for a in range(self.num_actions):
                    self.P[s, a, s] = 1.0
                    self.R[s, a] = 0.0
                continue

            row, col = self._state_to_coord(s)
            for a, (dr, dc) in self.action_effects.items():
                new_row = np.clip(row + dr, 0, self.size - 1)
                new_col = np.clip(col + dc, 0, self.size - 1)
                next_s = self._coord_to_state(new_row, new_col)
                self.P[s, a, next_s] = 1.0
                self.R[s, a] = -1.0  # -1 reward for each step


# Run value iteration on gridworld
env = Gridworld(size=4)
V, policy, deltas = value_iteration(
    env.num_states,
    env.num_actions,
    env.P,
    env.R,
    gamma=1.0,   # No discounting (episodic)
    theta=1e-10
)

# Display results
print("\nOptimal Value Function (4x4 grid):")
print(V.reshape(4, 4).round(2))
print("\nOptimal Policy (0=N, 1=S, 2=E, 3=W):")
print(policy.reshape(4, 4))
```

Let's trace value propagation through iterations:
Iteration 0 (Initial $V_0 = 0$ everywhere): every value starts at 0, including the terminals.

Iteration 1: every non-terminal state becomes $V_1(s) = -1$ (the immediate reward of $-1$ plus the zero future value from $V_0$); terminal states stay at 0.

Subsequent iterations: value information propagates outward from the terminals, one step per sweep. States adjacent to a terminal keep value $-1$ (their best action steps straight into the terminal), their neighbors settle at $-2$, and so on, until the values stop changing. The converged optimal value function is:
| | Col 0 | Col 1 | Col 2 | Col 3 |
|---|---|---|---|---|
| Row 0 | 0.0 | -1.0 | -2.0 | -3.0 |
| Row 1 | -1.0 | -2.0 | -3.0 | -2.0 |
| Row 2 | -2.0 | -3.0 | -2.0 | -1.0 |
| Row 3 | -3.0 | -2.0 | -1.0 | 0.0 |
The optimal value at each cell equals the negative of the Manhattan distance to the nearest terminal state. Cell (1,1) has value -2 because the shortest path to either terminal requires 2 steps. The optimal policy points toward the nearest terminal—beautiful emergent behavior from a simple recursive update!
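This claim can be checked programmatically; the sketch below assumes the `Gridworld` class and `value_iteration` function above are in scope and compares the computed values against negative Manhattan distances:

```python
import numpy as np

env = Gridworld(size=4)
V, policy, _ = value_iteration(env.num_states, env.num_actions, env.P, env.R,
                               gamma=1.0, theta=1e-10)

manhattan = np.zeros(env.num_states)
for s in range(env.num_states):
    row, col = s // env.size, s % env.size
    # Distance to terminal (0, 0) and to terminal (size-1, size-1); take the nearer one
    manhattan[s] = min(row + col, (env.size - 1 - row) + (env.size - 1 - col))

print(np.allclose(V, -manhattan))  # True: V*(s) = -(Manhattan distance to nearest terminal)
```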
Understanding the computational requirements of value iteration is essential for practical deployment. Let's analyze time and space complexity rigorously.
Per-iteration cost: For each state $s$, we compute: $$V_{k+1}(s) = \max_{a} \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) V_k(s') \right]$$ Each backup sums over up to $|S|$ successor states for each of $|A|$ actions, so a full sweep over all states costs $O(|S|^2 \cdot |A|)$.
Number of iterations: From convergence analysis, approximately $O\left(\frac{1}{1-\gamma} \log\frac{1}{\epsilon}\right)$ iterations.
Total time complexity: $$O\left(\frac{|S|^2 \cdot |A|}{1-\gamma} \log\frac{1}{\epsilon}\right)$$
| Aspect | Complexity | Practical Implication |
|---|---|---|
| Per-iteration time | O(|S|² · |A|) | Quadratic in states—limits scalability |
| Number of iterations | O(log(1/ε) / (1-γ)) | Slow for γ near 1 |
| Space | O(|S|) | Linear—value function storage only |
| Transition model | O(|S|² · |A|) | Must store or compute P(s'|s,a) |
Value iteration is remarkably space-efficient:
- Only the current value function must be stored: $O(|S|)$ numbers (plus another $O(|S|)$ for the synchronous copy and the extracted policy)
- The dominant memory cost is usually the transition model itself, $O(|S|^2 \cdot |A|)$ if stored densely
Sparse transitions: Many real MDPs have sparse transitions—each action leads to only a few possible next states. With sparsity factor $k$ (average number of non-zero transitions per state-action pair):
- The per-iteration cost drops from $O(|S|^2 \cdot |A|)$ to $O(|S| \cdot |A| \cdot k)$
- The transition model can be stored in $O(|S| \cdot |A| \cdot k)$ space using sparse data structures (a sketch follows below)
This makes value iteration practical for much larger state spaces when dynamics are sparse.
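One way to exploit sparsity is to store each action's transition matrix in a sparse format. The sketch below uses `scipy.sparse` and assumes dense arrays shaped as in the earlier code; it is an illustrative variant, not code from the text:

```python
import numpy as np
from scipy.sparse import csr_matrix

def sparse_sweep(P_sparse_per_action, R_sa, V, gamma):
    """One synchronous sweep using one sparse (S, S) matrix per action.

    Cost is O(|A| * nnz) instead of O(|S|^2 * |A|), where nnz is the
    number of non-zero transitions.
    """
    # Q[:, a] = R[:, a] + gamma * P_a @ V, stacked column by column
    Q = np.column_stack([R_sa[:, a] + gamma * (P_a @ V)
                         for a, P_a in enumerate(P_sparse_per_action)])
    return Q.max(axis=1)

# Example wiring, assuming a dense P of shape (S, A, S) and R_sa of shape (S, A):
# P_sparse_per_action = [csr_matrix(P[:, a, :]) for a in range(num_actions)]
# V = sparse_sweep(P_sparse_per_action, R_sa, V, gamma)
```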
The $O(|S|^2)$ per-iteration cost is the Achilles' heel of value iteration. If your state space is a product of d dimensions each with n values, |S| = n^d. For a robot with 10 joints, each discretized to 100 positions, |S| = 100^10 = 10^20. This is why we need function approximation and model-free methods for realistic problems.
Several techniques can accelerate value iteration or improve its practicality:
Instead of computing all new values before any update, use new values immediately:
```python
# Using the array names from value_iteration above (R_sa, transition_probs, gamma):
for s in range(num_states):
    # In-place (Gauss-Seidel) backup: later states immediately see the updated V[s]
    V[s] = np.max(R_sa[s, :] + gamma * transition_probs[s, :, :] @ V)
```
This in-place variant often converges faster because updates propagate immediately. The convergence guarantee still holds (asynchronous DP), though the convergence rate analysis differs.
Not all states need equal attention. Prioritized sweeping maintains a priority queue based on the magnitude of the last update (Bellman error):
- Pop the state with the largest Bellman error and apply a Bellman backup to it
- If its value changed significantly, push its predecessor states onto the queue, prioritized by how much their values could change
- Repeat until the queue is empty or an update budget is exhausted (see the implementation below)
This focuses computation where it matters most—states with large value changes that will propagate to others.
```python
import heapq
from collections import defaultdict

import numpy as np


def prioritized_value_iteration(
    num_states, num_actions, P, R,
    gamma=0.99, theta=1e-8, max_updates=1000000
):
    """
    Prioritized Sweeping variant of Value Iteration.
    Updates states in order of their Bellman error magnitude.
    """
    V = np.zeros(num_states)

    # Build predecessor model: pred[s'] = [(s, a, prob), ...]
    predecessors = defaultdict(list)
    for s in range(num_states):
        for a in range(num_actions):
            for s_next in range(num_states):
                if P[s, a, s_next] > 0:
                    predecessors[s_next].append((s, a, P[s, a, s_next]))

    # Initialize priority queue with all states
    # Priority is negative Bellman error (heapq is min-heap)
    pq = []
    bellman_errors = np.zeros(num_states)
    for s in range(num_states):
        Q_max = max(R[s, a] + gamma * np.dot(P[s, a, :], V)
                    for a in range(num_actions))
        bellman_errors[s] = abs(Q_max - V[s])
        if bellman_errors[s] > theta:
            heapq.heappush(pq, (-bellman_errors[s], s))

    in_queue = set(s for _, s in pq)
    updates = 0

    while pq and updates < max_updates:
        # Pop state with largest error
        neg_priority, s = heapq.heappop(pq)
        in_queue.discard(s)

        # Bellman update
        old_V = V[s]
        V[s] = max(R[s, a] + gamma * np.dot(P[s, a, :], V)
                   for a in range(num_actions))
        updates += 1

        # Update predecessors' priorities
        if abs(V[s] - old_V) > theta:
            for s_pred, a_pred, prob in predecessors[s]:
                # Change in predecessor's value could be up to gamma * prob * |delta|
                potential_change = gamma * prob * abs(V[s] - old_V)
                if potential_change > theta and s_pred not in in_queue:
                    heapq.heappush(pq, (-potential_change, s_pred))
                    in_queue.add(s_pred)

    return V, updates
```

An alternative algorithm, policy iteration, separates policy evaluation (computing $V^\pi$ for fixed $\pi$) from policy improvement (finding a better policy).
| Aspect | Value Iteration | Policy Iteration |
|---|---|---|
| Update | One Bellman backup per sweep | Full policy evaluation between improvements |
| Iterations | Many (slow per-sweep) | Few (expensive per-sweep) |
| Per-iteration cost | $O(|S|^2 \cdot |A|)$ | $O(|S|^3)$ for exact evaluation |
| Memory | Single value function | Value function plus explicit policy |
| When to prefer | Large problems where cheap approximate sweeps suffice | Problems where the policy stabilizes after a few improvements |
Policy iteration often converges in fewer iterations but each iteration is more expensive.
A hybrid approach: instead of full policy evaluation, perform only $k$ evaluation backups (far fewer than needed for full convergence) before each policy improvement. This combines the fast convergence of policy iteration with the cheap iterations of value iteration. Choosing $k$ dynamically based on the improvement rate can be particularly effective.
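A sketch of this idea under the array conventions used throughout this page ($P$ of shape $(S, A, S)$, $R$ of shape $(S, A)$); the function and its termination test are illustrative, not a standard API:

```python
import numpy as np

def modified_policy_iteration(P, R, gamma=0.99, k=5, theta=1e-8, max_iters=10000):
    """Sketch: alternate k truncated evaluation backups with one greedy improvement."""
    S, A = R.shape
    V = np.zeros(S)
    policy = np.zeros(S, dtype=int)
    idx = np.arange(S)

    for _ in range(max_iters):
        # Partial evaluation: k backups of V under the current policy (not to convergence)
        for _ in range(k):
            V = R[idx, policy] + gamma * P[idx, policy, :] @ V

        # Greedy improvement from the partially evaluated V
        Q = R + gamma * P @ V                 # shape (S, A)
        new_policy = Q.argmax(axis=1)

        # Stop once the policy is stable and the Bellman residual is small
        if np.array_equal(new_policy, policy) and np.max(np.abs(Q.max(axis=1) - V)) < theta:
            break
        policy = new_policy

    return V, policy
```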
Deploying value iteration in practice requires attention to implementation details that theory often glosses over.
Reward scaling: If rewards are very large or very small, numerical precision issues can arise. Normalize rewards to a reasonable range (e.g., [-1, 1] or [0, 1]).
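A minimal normalization sketch (assuming rewards live in a NumPy array `R`); note that scaling all rewards by a positive constant scales $V^*$ by the same constant and leaves the optimal policy unchanged:

```python
import numpy as np

# Illustrative: rescale rewards into [-1, 1] before running value iteration.
R_scaled = R / max(1e-12, float(np.max(np.abs(R))))
```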
Discount factor near 1: When $\gamma \approx 1$, values can grow very large. Consider using undiscounted episodic formulations where possible, or track value differences rather than absolute values.
Initialization: While any initialization converges, starting close to the true values (if you have a heuristic) speeds convergence.
Good fit:
- Small-to-moderate discrete state and action spaces (the tabular setting)
- A known, explicit model of transitions $P(s' \mid s, a)$ and rewards $R(s, a)$
- Problems where a provably optimal (or $\epsilon$-optimal) policy is required

Poor fit:
- Very large or continuous state spaces, where $O(|S|^2 \cdot |A|)$ sweeps are intractable
- Unknown dynamics that must be learned from experience (model-free settings)
- Discount factors extremely close to 1, where convergence becomes very slow
Even when value iteration isn't directly applicable, understanding it is crucial. Many advanced algorithms (fitted value iteration, DQN, model-based RL) are built on value iteration principles with function approximation. The Bellman optimality equation remains the north star of RL, regardless of how we approximate its solution.
Value iteration is the cornerstone algorithm for computing optimal value functions in Markov Decision Processes. Let's consolidate what we've learned:
- The Bellman optimality equation characterizes $V^*$ as the fixed point of the operator $T^*$
- Value iteration repeatedly applies $T^*$ and converges from any initialization, because $T^*$ is a $\gamma$-contraction in the sup-norm
- Convergence is linear with rate $\gamma$: roughly $O\left(\frac{1}{1-\gamma}\log\frac{1}{\epsilon}\right)$ iterations, each costing $O(|S|^2 \cdot |A|)$
- The optimal policy is extracted greedily from $V^*$
- Practical accelerations include in-place (asynchronous) updates, prioritized sweeping, and modified policy iteration
- Value iteration requires a full model of the MDP and scales poorly to very large state spaces
What's next: Value iteration requires a complete model of the environment—transition probabilities and rewards. But what if we don't have this model and must learn from experience? The next page introduces Q-learning, a model-free algorithm that learns optimal action-values directly from observed transitions.
You now have a comprehensive understanding of value iteration: its theoretical foundations in dynamic programming, convergence properties, practical implementation, and computational limitations. This knowledge forms the bedrock for understanding all value-based RL methods that follow.