How valuable is it to be in a particular state? How good is it to take a specific action? These questions are answered by value functions—arguably the most important concept in reinforcement learning.
Value functions compress infinite futures into single numbers. They tell us the expected cumulative reward from any situation, enabling agents to make farsighted decisions without explicitly planning every possible trajectory.
Most RL algorithms—from Q-learning to actor-critic to AlphaGo—rely fundamentally on value function estimation. Understanding value functions deeply means understanding how intelligent agents can reason about long-term consequences.
By the end of this page, you'll understand state value functions V(s), action value functions Q(s,a), their relationship via Bellman equations, methods for estimating them (Monte Carlo, TD learning), and the critical role they play in policy evaluation and improvement.
The state value function $V^\pi(s)$ answers: "Starting from state $s$ and following policy $\pi$, what return can I expect?"
Formal Definition
$$V^\pi(s) = \mathbb{E}_{\pi}\left[ G_t \,\middle|\, S_t = s \right] = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s \right]$$
The expectation is over the policy's (possibly stochastic) action choices, the environment's transition dynamics, and any randomness in the rewards.
Interpretation
$V^\pi(s)$ summarizes the future: high values indicate "good" states where following policy $\pi$ will accumulate high reward; low values indicate "bad" states where future reward is limited.
Policy Dependence
Critically, $V^\pi$ depends on the policy. The same state has different values under different policies. A state might be valuable under an expert policy but worthless under a random policy.
```python
import numpy as np
from typing import Callable, Dict, List, Tuple


def monte_carlo_state_value(
    env,
    policy: Callable,
    num_episodes: int = 1000,
    gamma: float = 0.99,
    first_visit: bool = True
) -> Dict[int, float]:
    """
    Monte Carlo estimation of V^π(s) for discrete state spaces.

    First-visit MC: Only count the first occurrence of each state per episode.
    Every-visit MC: Count every occurrence of each state.

    Args:
        env: Environment with discrete states
        policy: Function mapping state to action
        num_episodes: Number of episodes to sample
        gamma: Discount factor
        first_visit: Use first-visit MC if True, every-visit if False

    Returns:
        V: Dictionary mapping states to estimated values
    """
    # Track returns for each state
    returns = {}  # state -> list of returns

    for episode in range(num_episodes):
        # Generate episode
        states, rewards = [], []
        state = env.reset()
        done = False

        while not done:
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            states.append(state)
            rewards.append(reward)
            state = next_state

        # Compute returns working backwards
        G = 0
        for t in reversed(range(len(states))):
            G = rewards[t] + gamma * G
            s = states[t]

            # First-visit: only record the return from the earliest
            # occurrence of s in this episode
            if first_visit and s in states[:t]:
                continue

            if s not in returns:
                returns[s] = []
            returns[s].append(G)

    # Average returns for each state
    V = {s: np.mean(rs) for s, rs in returns.items()}
    return V


class TDStateValueEstimator:
    """
    Temporal Difference (TD(0)) estimation of V^π(s).

    Update rule: V(s) ← V(s) + α [r + γV(s') - V(s)]

    This is an online algorithm that updates after every step,
    using the bootstrap estimate V(s') instead of waiting for
    the true return.
    """

    def __init__(self, n_states: int, learning_rate: float = 0.1, gamma: float = 0.99):
        self.V = np.zeros(n_states)
        self.alpha = learning_rate
        self.gamma = gamma

    def update(self, state: int, reward: float, next_state: int, done: bool) -> float:
        """
        Perform one TD(0) update.

        Returns:
            td_error: The temporal difference error δ
        """
        if done:
            # Terminal state has value 0
            target = reward
        else:
            # Bootstrap from next state's value
            target = reward + self.gamma * self.V[next_state]

        # TD error: how much our prediction was off
        td_error = target - self.V[state]

        # Update value estimate
        self.V[state] += self.alpha * td_error

        return td_error

    def get_value(self, state: int) -> float:
        return self.V[state]
```

Intuitive Understanding
Think of $V^\pi(s)$ as answering: "If I'm dropped into state $s$ and must follow policy $\pi$ forever, how happy should I be?"
The discount factor $\gamma$ determines how far ahead the value function "looks". With $\gamma = 0$, only immediate reward matters. With $\gamma \approx 1$, distant future rewards are nearly as important as immediate ones.
The action value function $Q^\pi(s, a)$ answers: "Starting from state $s$, taking action $a$, and then following policy $\pi$, what return can I expect?"
Formal Definition
$$Q^\pi(s, a) = \mathbb{E}_{\pi}\left[ G_t \,\middle|\, S_t = s, A_t = a \right] = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a \right]$$
The key difference from $V$: we condition on both state AND the first action. After that first action, we follow $\pi$.
Why Q Is So Useful
Q-values enable model-free policy improvement. If we know $Q^\pi(s, a)$ for all actions, we can improve the policy by selecting:
$$\pi'(s) = \arg\max_a Q^\pi(s, a)$$
This doesn't require knowing the environment dynamics! We just need to know which action has the highest Q-value.
With $V^\pi$, improving the policy requires knowing transitions: which action leads to which next state? Q bypasses this by incorporating the action's consequence directly.
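To make this concrete, here is a minimal sketch of greedy policy improvement from a tabular Q-function; the Q-values and problem size below are invented for illustration:

```python
import numpy as np

# Hypothetical Q-table for 3 states and 2 actions: Q[s, a] ≈ Q^π(s, a)
Q = np.array([
    [1.0, 2.5],   # state 0: action 1 looks better
    [0.3, 0.1],   # state 1: action 0 looks better
    [4.0, 4.0],   # state 2: tie (argmax breaks it arbitrarily)
])

# Greedy policy improvement: π'(s) = argmax_a Q(s, a).
# No transition model P(s'|s,a) is needed -- only the Q-values.
improved_policy = Q.argmax(axis=1)
print(improved_policy)  # [1 0 0]
```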
| Aspect | State Value V^π(s) | Action Value Q^π(s,a) |
|---|---|---|
| Input | State only | State AND action |
| Size (tabular) | \|S\| entries | \|S\| × \|A\| entries |
| Policy Improvement | Needs model P(s'\|s,a) | Model-free: argmax_a Q(s,a) |
| Common algorithms | TD(0), Monte Carlo V | Q-learning, SARSA, DQN |
| Continuous actions | Easy (1D output) | Hard (infinite actions) |
| Use case | Policy evaluation | Policy improvement & control |
Relationship Between V and Q
V and Q are intimately connected:
$$V^\pi(s) = \sum_a \pi(a|s) Q^\pi(s, a) = \mathbb{E}_{a \sim \pi}[Q^\pi(s, a)]$$
The state value is the expected Q-value over the policy's action distribution.
$$Q^\pi(s, a) = \mathbb{E}_{s' \sim P}[R(s,a,s') + \gamma V^\pi(s')]$$
The action value is the expected immediate reward plus discounted next-state value.
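As a quick numeric check of the first identity, with invented numbers for a single state:

```python
import numpy as np

pi_s = np.array([0.2, 0.8])   # π(a|s) for two actions (hypothetical)
Q_s = np.array([1.0, 3.0])    # Q^π(s, a) for the same two actions (hypothetical)

# V^π(s) = Σ_a π(a|s) Q^π(s, a)
V_s = float(np.dot(pi_s, Q_s))
print(V_s)  # 0.2*1.0 + 0.8*3.0 = 2.6
```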
These relationships form the basis of actor-critic methods: the critic estimates V or Q, which guides the actor's policy improvement.
Q-learning's genius is learning Q* (optimal Q) directly, without needing to iterate through policies. By using the max over actions in the update, it approximates the optimal action-value regardless of what policy generated the experience. This enables off-policy learning and experience replay.
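For reference, a minimal tabular sketch of the Q-learning update just described; the state indices, rewards, and hyperparameters are illustrative:

```python
import numpy as np

def q_learning_update(Q: np.ndarray, s: int, a: int, r: float, s_next: int,
                      done: bool, alpha: float = 0.1, gamma: float = 0.99) -> None:
    """Off-policy update: Q(s,a) += α [r + γ max_a' Q(s',a') - Q(s,a)]."""
    target = r if done else r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

# Hypothetical 5-state, 2-action problem
Q = np.zeros((5, 2))
q_learning_update(Q, s=0, a=1, r=1.0, s_next=3, done=False)
```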
The Bellman equations express a recursive relationship between values at successive states. They're named after Richard Bellman, who pioneered dynamic programming.
The Core Insight
The value of a state can be decomposed into the immediate reward received on the next step plus the discounted value of the state that follows.
This recursive structure enables efficient computation—we don't need to enumerate all future trajectories.
Bellman Expectation Equation for V
$$V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V^\pi(s') \right]$$
In words: the value of $s$ is the expected immediate reward plus the discounted expected value of the next state, where the expectation is over both the policy's action choice and the environment's transition.
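One way to see this equation in action is iterative policy evaluation, sketched below for a tabular MDP; the arrays P[s,a,s'], R[s,a,s'], and pi[s,a] are assumed inputs rather than anything defined on this page:

```python
import numpy as np

def policy_evaluation(P: np.ndarray,    # P[s, a, s'] transition probabilities
                      R: np.ndarray,    # R[s, a, s'] rewards
                      pi: np.ndarray,   # pi[s, a] policy probabilities
                      gamma: float = 0.99,
                      tol: float = 1e-8) -> np.ndarray:
    """Estimate V^π by repeatedly applying the Bellman expectation backup."""
    V = np.zeros(P.shape[0])
    while True:
        # V_new(s) = Σ_a π(a|s) Σ_s' P(s'|s,a) [R(s,a,s') + γ V(s')]
        V_new = np.einsum('sa,sap,sap->s', pi, P, R + gamma * V[None, None, :])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```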
Bellman Expectation Equation for Q
$$Q^\pi(s, a) = \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma \sum_{a'} \pi(a'|s') Q^\pi(s', a') \right]$$
Or more compactly: $$Q^\pi(s, a) = \mathbb{E}_{s'}[R + \gamma V^\pi(s')] = \mathbb{E}_{s'}[R + \gamma \mathbb{E}_{a' \sim \pi}[Q^\pi(s', a')]]$$
Bellman Optimality Equations
For the optimal value functions, we replace expectations over the policy with maximization:
$$V^*(s) = \max_a \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V^*(s') \right]$$
$$Q^*(s, a) = \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma \max_{a'} Q^*(s', a') \right]$$
These are the fixed-point equations that $V^*$ and $Q^*$ satisfy. Finding the optimal value function is equivalent to finding the fixed point of these equations.
From Bellman Equations to Algorithms
The Bellman equation can be written as V = T^π V, where T^π is the Bellman operator. This operator is a contraction mapping (with γ < 1), meaning repeated application converges to a unique fixed point V^π. This mathematical property guarantees that algorithms like value iteration converge.
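A minimal value-iteration sketch built directly from the Bellman optimality backup; repeated application of the contraction converges to V* (the tabular P and R arrays are assumed inputs, as above):

```python
import numpy as np

def value_iteration(P: np.ndarray, R: np.ndarray,
                    gamma: float = 0.99, tol: float = 1e-8) -> np.ndarray:
    """Find V* by iterating V(s) ← max_a Σ_s' P(s'|s,a) [R(s,a,s') + γ V(s')]."""
    V = np.zeros(P.shape[0])
    while True:
        # Q_backup[s, a] = Σ_s' P(s'|s,a) [R(s,a,s') + γ V(s')]
        Q_backup = np.einsum('sap,sap->sa', P, R + gamma * V[None, None, :])
        V_new = Q_backup.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```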
The advantage function measures how much better an action is compared to the average action:
$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$$
Interpretation
A positive advantage ($A^\pi(s,a) > 0$) means action $a$ is better than the policy's average behavior in state $s$; a negative advantage means it is worse. Averaging over the policy's own action distribution gives: $$\mathbb{E}_{a \sim \pi}[A^\pi(s, a)] = 0$$
Advantages are zero-mean by construction.
Why Advantages Matter for Policy Gradients
The policy gradient theorem can be written as:
$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\pi_\theta}\left[ \nabla_\theta \log \pi_\theta(a|s) \cdot A^{\pi_\theta}(s, a) \right]$$
Using advantages instead of raw Q-values or returns centers the learning signal around zero, which reduces the variance of the gradient estimate without changing its expectation.
This is why actor-critic methods learn a value function V (the critic) and use it to compute advantages that drive the actor's update.
```python
import numpy as np
import torch
from typing import List


def compute_advantages_monte_carlo(
    rewards: List[float],
    values: List[float],
    gamma: float = 0.99
) -> np.ndarray:
    """
    Compute advantages using Monte Carlo returns minus value baseline.

    A_t = G_t - V(s_t)

    High variance but unbiased.
    """
    T = len(rewards)
    returns = np.zeros(T)

    # Compute returns backwards
    G = 0
    for t in reversed(range(T)):
        G = rewards[t] + gamma * G
        returns[t] = G

    # Advantage = return - baseline
    advantages = returns - np.array(values)
    return advantages


def compute_advantages_td(
    rewards: List[float],
    values: List[float],
    next_values: List[float],
    dones: List[bool],
    gamma: float = 0.99
) -> np.ndarray:
    """
    Compute TD(0) advantages.

    A_t = r_t + γV(s_{t+1}) - V(s_t)  (one-step advantage)

    Low variance but biased (depends on value function accuracy).
    """
    T = len(rewards)
    advantages = np.zeros(T)

    for t in range(T):
        if dones[t]:
            # Terminal: no next state value
            advantages[t] = rewards[t] - values[t]
        else:
            # TD target - current value
            advantages[t] = rewards[t] + gamma * next_values[t] - values[t]

    return advantages


def compute_gae(
    rewards: List[float],
    values: List[float],
    next_values: List[float],
    dones: List[bool],
    gamma: float = 0.99,
    gae_lambda: float = 0.95
) -> np.ndarray:
    """
    Generalized Advantage Estimation (GAE) - Schulman et al., 2016

    GAE interpolates between TD (λ=0) and Monte Carlo (λ=1).

    A^GAE = sum_{l=0}^∞ (γλ)^l δ_{t+l}

    where δ_t = r_t + γV(s_{t+1}) - V(s_t) is the TD error.

    λ controls the bias-variance tradeoff:
    - λ=0: TD(0), low variance, high bias
    - λ=1: Monte Carlo, high variance, low bias
    - λ≈0.95: Common choice, good balance

    This is the standard advantage estimator in PPO, A2C, and TRPO.
    """
    T = len(rewards)
    advantages = np.zeros(T)

    # Compute backwards for efficiency
    gae = 0
    for t in reversed(range(T)):
        if dones[t]:
            delta = rewards[t] - values[t]
            gae = delta  # Reset GAE at terminal states
        else:
            delta = rewards[t] + gamma * next_values[t] - values[t]
            gae = delta + gamma * gae_lambda * gae
        advantages[t] = gae

    return advantages


class GAEEstimator:
    """
    Production-ready GAE estimator with normalization.

    Used in PPO and other policy gradient methods.
    """

    def __init__(self, gamma: float = 0.99, gae_lambda: float = 0.95,
                 normalize: bool = True):
        self.gamma = gamma
        self.gae_lambda = gae_lambda
        self.normalize = normalize

    def __call__(self, rewards: torch.Tensor, values: torch.Tensor,
                 dones: torch.Tensor) -> torch.Tensor:
        """
        Compute GAE advantages.

        Args:
            rewards: [T] tensor of rewards
            values: [T+1] tensor of values (includes bootstrap value)
            dones: [T] tensor of done flags

        Returns:
            advantages: [T] tensor of GAE advantages
        """
        T = rewards.shape[0]
        advantages = torch.zeros(T)

        gae = 0
        for t in reversed(range(T)):
            # Handle terminal states
            non_terminal = 1.0 - dones[t].float()

            # TD error
            delta = rewards[t] + self.gamma * values[t + 1] * non_terminal - values[t]

            # GAE recursion
            gae = delta + self.gamma * self.gae_lambda * non_terminal * gae
            advantages[t] = gae

        # Normalize for stability
        if self.normalize and len(advantages) > 1:
            advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

        return advantages
```

Generalized Advantage Estimation (GAE) is used in nearly all modern policy gradient implementations. The λ parameter offers a principled way to trade off bias and variance. Setting λ ≈ 0.95 works well across most problems without tuning.
Temporal Difference (TD) learning is the workhorse of modern RL. It combines ideas from Monte Carlo (learning from experience) and dynamic programming (bootstrapping from estimates).
The TD(0) Update
$$V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]$$
The term in brackets is the TD error: $$\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$$
Why TD Works
TD uses the current estimate $V(s_{t+1})$ as a stand-in for the true future return. This is called bootstrapping. Even though this estimate is initially wrong, TD provably converges to $V^\pi$ under standard conditions.
TD(λ): Bridging TD and Monte Carlo
TD(0) uses only a one-step lookahead. Monte Carlo uses the full return. TD(λ) interpolates:
$$G_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \ldots + \gamma^{n-1} r_{t+n} + \gamma^n V(s_{t+n})$$
The λ-return is a weighted average of all n-step returns: $$G_t^\lambda = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$$
This is the same bias-variance tradeoff as GAE, but for value estimation rather than advantages.
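For concreteness, a small sketch of n-step returns and the λ-return for a finite trajectory; the function names and list-based interface are illustrative rather than a standard API:

```python
import numpy as np
from typing import List

def n_step_return(rewards: List[float], values: List[float], t: int, n: int,
                  gamma: float = 0.99) -> float:
    """G_t^(n): n discounted rewards plus a bootstrap, truncated at episode end."""
    T = len(rewards)
    end = min(t + n, T)
    G = sum(gamma ** k * rewards[t + k] for k in range(end - t))
    if end < T:  # bootstrap from V(s_{t+n}) only if the episode hasn't ended
        G += gamma ** (end - t) * values[end]
    return G

def lambda_return(rewards: List[float], values: List[float], t: int,
                  gamma: float = 0.99, lam: float = 0.95) -> float:
    """(1-λ) Σ_n λ^(n-1) G_t^(n), with the leftover weight λ^(T-t-1) on the full return."""
    T = len(rewards)
    max_n = T - t
    weights = [(1 - lam) * lam ** (n - 1) for n in range(1, max_n)] + [lam ** (max_n - 1)]
    returns = [n_step_return(rewards, values, t, n, gamma) for n in range(1, max_n + 1)]
    return float(np.dot(weights, returns))
```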
TD learning can diverge when combining: (1) Function approximation, (2) Bootstrapping, and (3) Off-policy learning. This 'deadly triad' was a major obstacle in deep RL until techniques like target networks (DQN), double Q-learning, and clipped objectives (PPO) provided stabilization.
For large or continuous state spaces, we cannot store a value for every state. Instead, we use function approximation: parameterized functions that generalize across similar states.
Linear Value Functions
The simplest approximation uses linear combination of features:
$$V_\mathbf{w}(s) = \mathbf{w}^\top \phi(s) = \sum_i w_i \phi_i(s)$$
where $\phi(s)$ is a feature vector and $\mathbf{w}$ are learned weights.
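A minimal sketch of semi-gradient TD(0) with a linear value function; the feature map φ(s) is assumed to be supplied by the caller, and the class name and defaults are illustrative:

```python
import numpy as np

class LinearTDValue:
    """Semi-gradient TD(0) with V_w(s) = wᵀ φ(s)."""

    def __init__(self, n_features: int, alpha: float = 0.01, gamma: float = 0.99):
        self.w = np.zeros(n_features)
        self.alpha = alpha
        self.gamma = gamma

    def value(self, phi_s: np.ndarray) -> float:
        return float(self.w @ phi_s)

    def update(self, phi_s: np.ndarray, reward: float,
               phi_next: np.ndarray, done: bool) -> float:
        """w += α δ φ(s); the gradient of V_w(s) with respect to w is φ(s)."""
        target = reward if done else reward + self.gamma * self.value(phi_next)
        delta = target - self.value(phi_s)   # TD error
        self.w += self.alpha * delta * phi_s
        return delta
```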
Neural Network Value Functions
Deep RL uses neural networks to represent values:
$$V_\theta(s) = \text{NeuralNet}_\theta(s)$$ $$Q_\theta(s, a) = \text{NeuralNet}_\theta(s, a)$$
Networks can capture complex, non-linear value landscapes that linear functions cannot.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Tuple


class ValueNetwork(nn.Module):
    """
    Neural network for state value function V(s).

    Takes state as input, outputs single scalar value.
    """

    def __init__(self, state_dim: int, hidden_dims: Tuple[int, ...] = (64, 64)):
        super().__init__()

        layers = []
        prev_dim = state_dim
        for hidden_dim in hidden_dims:
            layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.ReLU(),
            ])
            prev_dim = hidden_dim
        layers.append(nn.Linear(prev_dim, 1))

        self.network = nn.Sequential(*layers)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        """Return value estimate for state(s)."""
        return self.network(state).squeeze(-1)


class QNetwork(nn.Module):
    """
    Neural network for action value function Q(s,a).

    For DISCRETE actions: takes state, outputs Q-value for each action.
    """

    def __init__(self, state_dim: int, n_actions: int,
                 hidden_dims: Tuple[int, ...] = (64, 64)):
        super().__init__()

        layers = []
        prev_dim = state_dim
        for hidden_dim in hidden_dims:
            layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.ReLU(),
            ])
            prev_dim = hidden_dim
        layers.append(nn.Linear(prev_dim, n_actions))

        self.network = nn.Sequential(*layers)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        """Return Q-values for all actions."""
        return self.network(state)

    def get_value(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        """Return Q-value for specific state-action pair."""
        q_values = self.forward(state)
        return q_values.gather(1, action.unsqueeze(1)).squeeze(1)


class ContinuousQNetwork(nn.Module):
    """
    Q-network for CONTINUOUS actions.

    Takes (state, action) pair as input, outputs single Q-value.
    Used in DDPG, TD3, SAC.
    """

    def __init__(self, state_dim: int, action_dim: int,
                 hidden_dims: Tuple[int, ...] = (256, 256)):
        super().__init__()

        layers = []
        prev_dim = state_dim + action_dim  # Concatenate state and action
        for hidden_dim in hidden_dims:
            layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.ReLU(),
            ])
            prev_dim = hidden_dim
        layers.append(nn.Linear(prev_dim, 1))

        self.network = nn.Sequential(*layers)

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        """Return Q-value for state-action pair."""
        x = torch.cat([state, action], dim=-1)
        return self.network(x).squeeze(-1)


class DuelingDQN(nn.Module):
    """
    Dueling DQN architecture (Wang et al., 2016).

    Separates Q(s,a) into:
    - V(s): state value (how good is this state?)
    - A(s,a): advantage (how much better is this action than average?)

    Q(s,a) = V(s) + A(s,a) - mean_a'(A(s,a'))

    The subtraction ensures A sums to zero and V represents the true
    state value. This separation helps with credit assignment and
    learning efficiency.
    """

    def __init__(self, state_dim: int, n_actions: int, hidden_dim: int = 128):
        super().__init__()

        # Shared feature extraction
        self.features = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
        )

        # Value stream: V(s)
        self.value_stream = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )

        # Advantage stream: A(s,a)
        self.advantage_stream = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_actions)
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        features = self.features(state)
        value = self.value_stream(features)
        advantages = self.advantage_stream(features)

        # Combine: Q = V + (A - mean(A))
        # Subtracting mean ensures identifiability
        q_values = value + (advantages - advantages.mean(dim=-1, keepdim=True))
        return q_values
```

The Bias-Variance Trade-off in Approximation
Function approximation introduces approximation error. Even if we had infinite data, a linear function cannot perfectly represent a non-linear value function.
Capacity considerations: low-capacity approximators (such as linear functions) have high bias but low variance and train stably; high-capacity approximators (deep networks) have low bias but high variance and are more prone to instability.
Neural networks are high-capacity approximators, so RL typically struggles more with variance (instability) than bias. This is why regularization, target networks, and careful hyperparameter tuning are essential.
The Instability Problem
In TD learning with function approximation, the update is: $$\theta \leftarrow \theta + \alpha (r + \gamma V_\theta(s') - V_\theta(s)) \nabla_\theta V_\theta(s)$$
The target $r + \gamma V_\theta(s')$ depends on the same parameters $\theta$ we're updating. This creates a moving target problem—as we update, the target changes, potentially causing oscillation or divergence.
The Target Network Solution
Maintain a separate target network $\theta^-$ that's updated slowly:
$$\theta \leftarrow \theta + \alpha (r + \gamma V_{\theta^-}(s') - V_\theta(s)) \nabla_\theta V_\theta(s)$$
Now the target is stable: it only changes when we explicitly update $\theta^-$.
| Strategy | Update Rule | Characteristics |
|---|---|---|
| Hard Update | θ⁻ ← θ every N steps | Simple; targets change abruptly |
| Soft Update (Polyak) | θ⁻ ← τθ + (1-τ)θ⁻ every step | Smooth; τ ≈ 0.005 typical |
| EMA (Exponential Moving Average) | θ⁻ ← decay × θ⁻ + (1-decay) × θ | Same as soft update, different naming |
```python
import torch
import torch.nn as nn
from copy import deepcopy


class TargetNetworkMixin:
    """
    Mixin providing target network functionality.

    Used in DQN, DDPG, TD3, SAC, and most deep RL algorithms.
    """

    def init_target_networks(self, networks: dict):
        """
        Create target networks as copies of main networks.

        Args:
            networks: Dict of {name: network} pairs
        """
        self.target_networks = {}
        for name, network in networks.items():
            # Deep copy to create separate parameters
            self.target_networks[name] = deepcopy(network)
            # Freeze target network (no gradient computation)
            for param in self.target_networks[name].parameters():
                param.requires_grad = False

    def soft_update(self, tau: float = 0.005):
        """
        Soft (Polyak) update of target networks.

        θ_target = τ × θ + (1-τ) × θ_target

        Small τ (0.001-0.01) provides stable, slowly-moving targets.
        """
        for name, target_net in self.target_networks.items():
            main_net = getattr(self, name)
            for target_param, main_param in zip(
                target_net.parameters(), main_net.parameters()
            ):
                target_param.data.copy_(
                    tau * main_param.data + (1 - tau) * target_param.data
                )

    def hard_update(self):
        """
        Hard update: copy main network parameters to target.

        Used every N steps in original DQN.
        """
        for name, target_net in self.target_networks.items():
            main_net = getattr(self, name)
            target_net.load_state_dict(main_net.state_dict())


class DQN(nn.Module, TargetNetworkMixin):
    """
    Deep Q-Network with target network.

    Key innovations from Mnih et al., 2015:
    1. Experience replay (break correlation)
    2. Target network (stabilize targets)
    """

    def __init__(self, state_dim: int, n_actions: int, hidden_dims=(128, 128)):
        super().__init__()

        # Main Q-network
        layers = []
        prev = state_dim
        for h in hidden_dims:
            layers.extend([nn.Linear(prev, h), nn.ReLU()])
            prev = h
        layers.append(nn.Linear(prev, n_actions))
        self.q_network = nn.Sequential(*layers)

        # Create target network
        self.init_target_networks({'q_network': self.q_network})

        self.target_update_freq = 1000
        self.update_count = 0

    def get_q_values(self, state: torch.Tensor) -> torch.Tensor:
        """Get Q-values from main network."""
        return self.q_network(state)

    def get_target_q_values(self, state: torch.Tensor) -> torch.Tensor:
        """Get Q-values from target network."""
        return self.target_networks['q_network'](state)

    def update(self, states, actions, rewards, next_states, dones, gamma=0.99):
        """
        Perform one DQN update step.
        """
        # Current Q-values for taken actions
        q_values = self.get_q_values(states).gather(1, actions.unsqueeze(1))

        # Target: r + γ max_a' Q_target(s', a')
        with torch.no_grad():
            next_q_values = self.get_target_q_values(next_states)
            max_next_q = next_q_values.max(dim=1)[0]
            targets = rewards + gamma * max_next_q * (1 - dones.float())

        # MSE loss
        loss = nn.functional.mse_loss(q_values.squeeze(), targets)

        # ... optimizer step ...

        # Periodic hard update
        self.update_count += 1
        if self.update_count % self.target_update_freq == 0:
            self.hard_update()

        return loss.item()
```

Value functions serve multiple purposes across different RL algorithms:
1. Policy Improvement (Q-learning, DQN)
With $Q^*$, act greedily: $$\pi^*(s) = \arg\max_a Q^*(s, a)$$
No explicit policy network needed—the Q-function implicitly defines the policy.
2. Advantage Computation (Actor-Critic)
The critic estimates V(s), which is used to compute advantages: $$A(s, a) = r + \gamma V(s') - V(s) \approx Q(s, a) - V(s)$$
Advantages guide the actor's policy gradient updates.
3. Baseline for Variance Reduction (REINFORCE with baseline)
Subtracting V(s) from returns doesn't change the expected gradient but reduces variance: $$\nabla_\theta J = \mathbb{E}[\nabla_\theta \log \pi(a|s) (G - V(s))]$$
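A short PyTorch sketch of this baseline-corrected loss for one batch of transitions; the tensor names are illustrative, and detaching the value baseline is what keeps the gradient estimate unbiased:

```python
import torch

def reinforce_with_baseline_loss(log_probs: torch.Tensor,
                                 returns: torch.Tensor,
                                 values: torch.Tensor) -> torch.Tensor:
    """-E[log π(a|s) (G - V(s))]: the baseline shifts the signal, not the expected gradient."""
    advantages = returns - values.detach()   # no gradient flows through the baseline here
    return -(log_probs * advantages).mean()
```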
4. Bootstrapping (TD methods)
Value estimates enable learning before episode end: $$V(s) \leftarrow V(s) + \alpha[r + \gamma V(s') - V(s)]$$
| Algorithm | Value Function | Primary Use |
|---|---|---|
| Q-Learning | Q(s,a) | Policy = argmax Q |
| DQN | Q(s,a) neural net | Policy = argmax Q |
| SARSA | Q(s,a) | On-policy TD control |
| A2C/A3C | V(s) | Baseline for PG, compute advantage |
| PPO | V(s) | GAE advantage estimation |
| SAC | V(s) and Q(s,a) | Entropy-regularized value |
| DDPG/TD3 | Q(s,a) | Critic for continuous control |
| Monte Carlo | V(s) or Q(s,a) | First/every-visit estimation |
For discrete actions, Q-networks are natural: output Q(s,a) for all a, take argmax. For continuous actions, Q(s,a) requires taking (s,a) pairs as input, which is used in DDPG/TD3/SAC. Some continuous algorithms use only V (PPO, TRPO) and never explicitly compute Q.
Monitor: (1) Average value magnitude—should be reasonable given reward scale and horizon. (2) Value variance—shouldn't explode. (3) TD error—should decrease over training. (4) Correlation with Monte Carlo returns—values should predict actual returns. If these are off, check reward normalization, learning rate, and network capacity.
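A small sketch of these diagnostics computed from arrays logged during training; the function and field names are illustrative:

```python
import numpy as np

def value_function_diagnostics(values: np.ndarray,
                               td_errors: np.ndarray,
                               mc_returns: np.ndarray) -> dict:
    """Summary statistics for monitoring value-function health."""
    corr = float(np.corrcoef(values, mc_returns)[0, 1]) if len(values) > 1 else float('nan')
    return {
        'value_mean': float(values.mean()),                    # should match reward scale × horizon
        'value_std': float(values.std()),                      # should not explode
        'td_error_mean_abs': float(np.abs(td_errors).mean()),  # should shrink over training
        'value_return_corr': corr,                             # values should track actual returns
    }
```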
Value functions are the mathematical lens through which RL agents see the future. They compress the complexity of long-term consequences into tractable estimates that guide decision-making. The key takeaways: $V^\pi(s)$ measures expected return from a state under a policy, while $Q^\pi(s,a)$ also conditions on the first action and enables model-free policy improvement; the Bellman equations give both functions a recursive structure that algorithms exploit; advantages $A^\pi = Q^\pi - V^\pi$ center the learning signal for policy gradients; TD learning bootstraps from current estimates while Monte Carlo waits for complete returns; and with function approximation, stabilizers such as target networks keep learning from diverging.
Looking Ahead
With policies and value functions covered, one crucial concept remains: the Markov property. This mathematical assumption underlies all of RL's tractability. The next page examines what Markovianity means, when it holds, what happens when it doesn't, and how this foundational assumption shapes everything we've learned.
You now understand value functions—the quantitative foundation of reinforcement learning. From V and Q to Bellman equations, from TD learning to neural network approximation, value functions are how RL agents reason about the future. Next, we complete the picture with the Markov property.