How valuable is it to be in a particular state? How good is it to take a specific action? These questions are answered by value functions—arguably the most important concept in reinforcement learning.
Value functions compress infinite futures into single numbers. They tell us the expected cumulative reward from any situation, enabling agents to make farsighted decisions without explicitly planning every possible trajectory.
Most RL algorithms—from Q-learning to actor-critic to AlphaGo—rely fundamentally on value function estimation. Understanding value functions deeply means understanding how intelligent agents can reason about long-term consequences.
By the end of this page, you'll understand state value functions V(s), action value functions Q(s,a), their relationship via Bellman equations, methods for estimating them (Monte Carlo, TD learning), and the critical role they play in policy evaluation and improvement.
The state value function $V^\pi(s)$ answers: "Starting from state $s$ and following policy $\pi$, what return can I expect?"
Formal Definition
$$V^\pi(s) = \mathbb{E}_{\pi}\left[ G_t \,\middle|\, S_t = s \right] = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s \right]$$
The expectation is over the policy's (possibly stochastic) action choices, the environment's transition dynamics, and any randomness in the rewards.
Interpretation
$V^\pi(s)$ summarizes the future: high values indicate "good" states where following policy $\pi$ will accumulate high reward; low values indicate "bad" states where future reward is limited.
Policy Dependence
Critically, $V^\pi$ depends on the policy. The same state has different values under different policies. A state might be valuable under an expert policy but worthless under a random policy.
```python
import numpy as np
from typing import Callable, Dict, List, Tuple


def monte_carlo_state_value(
    env,
    policy: Callable,
    num_episodes: int = 1000,
    gamma: float = 0.99,
    first_visit: bool = True
) -> Dict[int, float]:
    """
    Monte Carlo estimation of V^π(s) for discrete state spaces.

    First-visit MC: Only count the first occurrence of each state per episode.
    Every-visit MC: Count every occurrence of each state.

    Args:
        env: Environment with discrete states
        policy: Function mapping state to action
        num_episodes: Number of episodes to sample
        gamma: Discount factor
        first_visit: Use first-visit MC if True, every-visit if False

    Returns:
        V: Dictionary mapping states to estimated values
    """
    # Track returns for each state
    returns = {}  # state -> list of returns

    for episode in range(num_episodes):
        # Generate episode
        states, rewards = [], []
        state = env.reset()
        done = False

        while not done:
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            states.append(state)
            rewards.append(reward)
            state = next_state

        # Compute returns working backwards
        G = 0
        for t in reversed(range(len(states))):
            G = rewards[t] + gamma * G
            s = states[t]

            # First-visit: only record the return from the earliest
            # occurrence of s in this episode
            if first_visit and s in states[:t]:
                continue

            if s not in returns:
                returns[s] = []
            returns[s].append(G)

    # Average returns for each state
    V = {s: np.mean(rs) for s, rs in returns.items()}
    return V


class TDStateValueEstimator:
    """
    Temporal Difference (TD(0)) estimation of V^π(s).

    Update rule: V(s) ← V(s) + α [r + γV(s') - V(s)]

    This is an online algorithm that updates after every step,
    using the bootstrap estimate V(s') instead of waiting for
    the true return.
    """

    def __init__(self, n_states: int, learning_rate: float = 0.1, gamma: float = 0.99):
        self.V = np.zeros(n_states)
        self.alpha = learning_rate
        self.gamma = gamma

    def update(self, state: int, reward: float, next_state: int, done: bool) -> float:
        """
        Perform one TD(0) update.

        Returns:
            td_error: The temporal difference error δ
        """
        if done:
            # Terminal state has value 0
            target = reward
        else:
            # Bootstrap from next state's value
            target = reward + self.gamma * self.V[next_state]

        # TD error: how much our prediction was off
        td_error = target - self.V[state]

        # Update value estimate
        self.V[state] += self.alpha * td_error

        return td_error

    def get_value(self, state: int) -> float:
        return self.V[state]
```

Intuitive Understanding
Think of $V^\pi(s)$ as answering: "If I'm dropped into state $s$ and must follow policy $\pi$ forever, how happy should I be?"
The discount factor $\gamma$ determines how far ahead the value function "looks". With $\gamma = 0$, only immediate reward matters. With $\gamma \approx 1$, distant future rewards are nearly as important as immediate ones.
The action value function $Q^\pi(s, a)$ answers: "Starting from state $s$, taking action $a$, and then following policy $\pi$, what return can I expect?"
Formal Definition
$$Q^\pi(s, a) = \mathbb{E}_{\pi}\left[ G_t \,\middle|\, S_t = s, A_t = a \right] = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a \right]$$
The key difference from $V$: we condition on both state AND the first action. After that first action, we follow $\pi$.
Why Q Is So Useful
Q-values enable model-free policy improvement. If we know $Q^\pi(s, a)$ for all actions, we can improve the policy by selecting:
$$\pi'(s) = \arg\max_a Q^\pi(s, a)$$
This doesn't require knowing the environment dynamics! We just need to know which action has the highest Q-value.
With $V^\pi$, improving the policy requires knowing transitions: which action leads to which next state? Q bypasses this by incorporating the action's consequence directly.
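To make this concrete, here is a minimal sketch of greedy policy improvement from a tabular Q-function; the Q-values and problem size below are invented for illustration:

```python
import numpy as np

# Hypothetical Q-table for 3 states and 2 actions: Q[s, a] ≈ Q^π(s, a)
Q = np.array([
    [1.0, 2.5],   # state 0: action 1 looks better
    [0.3, 0.1],   # state 1: action 0 looks better
    [4.0, 4.0],   # state 2: tie (argmax breaks it arbitrarily)
])

# Greedy policy improvement: π'(s) = argmax_a Q(s, a).
# No transition model P(s'|s,a) is needed -- only the Q-values.
improved_policy = Q.argmax(axis=1)
print(improved_policy)  # [1 0 0]
```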
| Aspect | State Value V^π(s) | Action Value Q^π(s,a) |
|---|---|---|
| Input | State only | State AND action |
| Size (tabular) | \|S\| entries | \|S\| × \|A\| entries |
| Policy Improvement | Needs model P(s'\|s,a) | Model-free: argmax_a Q(s,a) |
| Common algorithms | TD(0), Monte Carlo V | Q-learning, SARSA, DQN |
| Continuous actions | Easy (1D output) | Hard (infinite actions) |
| Use case | Policy evaluation | Policy improvement & control |
Relationship Between V and Q
V and Q are intimately connected:
$$V^\pi(s) = \sum_a \pi(a|s) Q^\pi(s, a) = \mathbb{E}_{a \sim \pi}[Q^\pi(s, a)]$$
The state value is the expected Q-value over the policy's action distribution.
$$Q^\pi(s, a) = \mathbb{E}_{s' \sim P}[R(s,a,s') + \gamma V^\pi(s')]$$
The action value is the expected immediate reward plus discounted next-state value.
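As a quick numeric check of the first identity, with invented numbers for a single state:

```python
import numpy as np

pi_s = np.array([0.2, 0.8])   # π(a|s) for two actions (hypothetical)
Q_s = np.array([1.0, 3.0])    # Q^π(s, a) for the same two actions (hypothetical)

# V^π(s) = Σ_a π(a|s) Q^π(s, a)
V_s = float(np.dot(pi_s, Q_s))
print(V_s)  # 0.2*1.0 + 0.8*3.0 = 2.6
```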
These relationships form the basis of actor-critic methods: the critic estimates V or Q, which guides the actor's policy improvement.
Q-learning's genius is learning Q* (optimal Q) directly, without needing to iterate through policies. By using the max over actions in the update, it approximates the optimal action-value regardless of what policy generated the experience. This enables off-policy learning and experience replay.
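For reference, a minimal tabular sketch of the Q-learning update just described; the state indices, rewards, and hyperparameters are illustrative:

```python
import numpy as np

def q_learning_update(Q: np.ndarray, s: int, a: int, r: float, s_next: int,
                      done: bool, alpha: float = 0.1, gamma: float = 0.99) -> None:
    """Off-policy update: Q(s,a) += α [r + γ max_a' Q(s',a') - Q(s,a)]."""
    target = r if done else r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

# Hypothetical 5-state, 2-action problem
Q = np.zeros((5, 2))
q_learning_update(Q, s=0, a=1, r=1.0, s_next=3, done=False)
```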
The Bellman equations express a recursive relationship between values at successive states. They're named after Richard Bellman, who pioneered dynamic programming.
The Core Insight
The value of a state can be decomposed into the immediate reward received on the next step plus the discounted value of the state that follows.
This recursive structure enables efficient computation—we don't need to enumerate all future trajectories.
Bellman Expectation Equation for V
$$V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V^\pi(s') \right]$$
In words: the value of $s$ is the expected immediate reward plus the discounted expected value of the next state, where the expectation is over both the policy's action choice and the environment's transition.
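One way to see this equation in action is iterative policy evaluation, sketched below for a tabular MDP; the arrays P[s,a,s'], R[s,a,s'], and pi[s,a] are assumed inputs rather than anything defined on this page:

```python
import numpy as np

def policy_evaluation(P: np.ndarray,    # P[s, a, s'] transition probabilities
                      R: np.ndarray,    # R[s, a, s'] rewards
                      pi: np.ndarray,   # pi[s, a] policy probabilities
                      gamma: float = 0.99,
                      tol: float = 1e-8) -> np.ndarray:
    """Estimate V^π by repeatedly applying the Bellman expectation backup."""
    V = np.zeros(P.shape[0])
    while True:
        # V_new(s) = Σ_a π(a|s) Σ_s' P(s'|s,a) [R(s,a,s') + γ V(s')]
        V_new = np.einsum('sa,sap,sap->s', pi, P, R + gamma * V[None, None, :])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```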
Bellman Expectation Equation for Q
$$Q^\pi(s, a) = \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma \sum_{a'} \pi(a'|s') Q^\pi(s', a') \right]$$
Or more compactly: $$Q^\pi(s, a) = \mathbb{E}_{s'}[R + \gamma V^\pi(s')] = \mathbb{E}_{s'}[R + \gamma \mathbb{E}_{a' \sim \pi}[Q^\pi(s', a')]]$$
Bellman Optimality Equations
For the optimal value functions, we replace expectations over the policy with maximization:
$$V^*(s) = \max_a \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V^*(s') \right]$$
$$Q^*(s, a) = \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma \max_{a'} Q^*(s', a') \right]$$
These are the fixed-point equations that $V^*$ and $Q^*$ satisfy. Finding the optimal value function is equivalent to finding the fixed point of these equations.
From Bellman Equations to Algorithms
The Bellman equation can be written as V = T^π V, where T^π is the Bellman operator. This operator is a contraction mapping (with γ < 1), meaning repeated application converges to a unique fixed point V^π. This mathematical property guarantees that algorithms like value iteration converge.
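A minimal value-iteration sketch built directly from the Bellman optimality backup; repeated application of the contraction converges to V* (the tabular P and R arrays are assumed inputs, as above):

```python
import numpy as np

def value_iteration(P: np.ndarray, R: np.ndarray,
                    gamma: float = 0.99, tol: float = 1e-8) -> np.ndarray:
    """Find V* by iterating V(s) ← max_a Σ_s' P(s'|s,a) [R(s,a,s') + γ V(s')]."""
    V = np.zeros(P.shape[0])
    while True:
        # Q_backup[s, a] = Σ_s' P(s'|s,a) [R(s,a,s') + γ V(s')]
        Q_backup = np.einsum('sap,sap->sa', P, R + gamma * V[None, None, :])
        V_new = Q_backup.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```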
The advantage function measures how much better an action is compared to the average action:
$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$$
Interpretation
A positive advantage ($A^\pi(s,a) > 0$) means action $a$ is better than the policy's average behavior in state $s$; a negative advantage means it is worse. Averaging over the policy's own action distribution gives: $$\mathbb{E}_{a \sim \pi}[A^\pi(s, a)] = 0$$
Advantages are zero-mean by construction.
Why Advantages Matter for Policy Gradients
The policy gradient theorem can be written as:
$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\pi_\theta}\left[ \nabla_\theta \log \pi_\theta(a|s) \cdot A^{\pi_\theta}(s, a) \right]$$
Using advantages instead of raw Q-values or returns centers the learning signal around zero, which reduces the variance of the gradient estimate without changing its expectation.
This is why actor-critic methods learn a value function V (the critic) and use it to compute advantages that drive the actor's update.
```python
import numpy as np
import torch
from typing import List


def compute_advantages_monte_carlo(
    rewards: List[float],
    values: List[float],
    gamma: float = 0.99
) -> np.ndarray:
    """
    Compute advantages using Monte Carlo returns minus value baseline.

    A_t = G_t - V(s_t)

    High variance but unbiased.
    """
    T = len(rewards)
    returns = np.zeros(T)

    # Compute returns backwards
    G = 0
    for t in reversed(range(T)):
        G = rewards[t] + gamma * G
        returns[t] = G

    # Advantage = return - baseline
    advantages = returns - np.array(values)
    return advantages


def compute_advantages_td(
    rewards: List[float],
    values: List[float],
    next_values: List[float],
    dones: List[bool],
    gamma: float = 0.99
) -> np.ndarray:
    """
    Compute TD(0) advantages.

    A_t = r_t + γV(s_{t+1}) - V(s_t)  (one-step advantage)

    Low variance but biased (depends on value function accuracy).
    """
    T = len(rewards)
    advantages = np.zeros(T)

    for t in range(T):
        if dones[t]:
            # Terminal: no next state value
            advantages[t] = rewards[t] - values[t]
        else:
            # TD target - current value
            advantages[t] = rewards[t] + gamma * next_values[t] - values[t]

    return advantages


def compute_gae(
    rewards: List[float],
    values: List[float],
    next_values: List[float],
    dones: List[bool],
    gamma: float = 0.99,
    gae_lambda: float = 0.95
) -> np.ndarray:
    """
    Generalized Advantage Estimation (GAE) - Schulman et al., 2016

    GAE interpolates between TD (λ=0) and Monte Carlo (λ=1).

    A^GAE = sum_{l=0}^∞ (γλ)^l δ_{t+l}

    where δ_t = r_t + γV(s_{t+1}) - V(s_t) is the TD error.

    λ controls the bias-variance tradeoff:
    - λ=0: TD(0), low variance, high bias
    - λ=1: Monte Carlo, high variance, low bias
    - λ≈0.95: Common choice, good balance

    This is the standard advantage estimator in PPO, A2C, and TRPO.
    """
    T = len(rewards)
    advantages = np.zeros(T)

    # Compute backwards for efficiency
    gae = 0
    for t in reversed(range(T)):
        if dones[t]:
            delta = rewards[t] - values[t]
            gae = delta  # Reset GAE at terminal states
        else:
            delta = rewards[t] + gamma * next_values[t] - values[t]
            gae = delta + gamma * gae_lambda * gae
        advantages[t] = gae

    return advantages


class GAEEstimator:
    """
    Production-ready GAE estimator with normalization.

    Used in PPO and other policy gradient methods.
    """

    def __init__(self, gamma: float = 0.99, gae_lambda: float = 0.95,
                 normalize: bool = True):
        self.gamma = gamma
        self.gae_lambda = gae_lambda
        self.normalize = normalize

    def __call__(self, rewards: torch.Tensor, values: torch.Tensor,
                 dones: torch.Tensor) -> torch.Tensor:
        """
        Compute GAE advantages.

        Args:
            rewards: [T] tensor of rewards
            values: [T+1] tensor of values (includes bootstrap value)
            dones: [T] tensor of done flags

        Returns:
            advantages: [T] tensor of GAE advantages
        """
        T = rewards.shape[0]
        advantages = torch.zeros(T)

        gae = 0
        for t in reversed(range(T)):
            # Handle terminal states
            non_terminal = 1.0 - dones[t].float()

            # TD error
            delta = rewards[t] + self.gamma * values[t + 1] * non_terminal - values[t]

            # GAE recursion
            gae = delta + self.gamma * self.gae_lambda * non_terminal * gae
            advantages[t] = gae

        # Normalize for stability
        if self.normalize and len(advantages) > 1:
            advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

        return advantages
```

Generalized Advantage Estimation (GAE) is used in nearly all modern policy gradient implementations. The λ parameter offers a principled way to trade off bias and variance. Setting λ ≈ 0.95 works well across most problems without tuning.
Temporal Difference (TD) learning is the workhorse of modern RL. It combines ideas from Monte Carlo (learning from experience) and dynamic programming (bootstrapping from estimates).
The TD(0) Update
$$V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]$$
The term in brackets is the TD error: $$\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$$
Why TD Works
TD uses the current estimate $V(s_{t+1})$ as a stand-in for the true future return. This is called bootstrapping. Even though this estimate is initially wrong, TD provably converges to $V^\pi$ under standard conditions.
TD(λ): Bridging TD and Monte Carlo
TD(0) uses only a one-step lookahead. Monte Carlo uses the full return. TD(λ) interpolates:
$$G_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \ldots + \gamma^{n-1} r_{t+n} + \gamma^n V(s_{t+n})$$
The λ-return is a weighted average of all n-step returns: $$G_t^\lambda = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$$
This is the same bias-variance tradeoff as GAE, but for value estimation rather than advantages.
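For concreteness, a small sketch of n-step returns and the λ-return for a finite trajectory; the function names and list-based interface are illustrative rather than a standard API:

```python
import numpy as np
from typing import List

def n_step_return(rewards: List[float], values: List[float], t: int, n: int,
                  gamma: float = 0.99) -> float:
    """G_t^(n): n discounted rewards plus a bootstrap, truncated at episode end."""
    T = len(rewards)
    end = min(t + n, T)
    G = sum(gamma ** k * rewards[t + k] for k in range(end - t))
    if end < T:  # bootstrap from V(s_{t+n}) only if the episode hasn't ended
        G += gamma ** (end - t) * values[end]
    return G

def lambda_return(rewards: List[float], values: List[float], t: int,
                  gamma: float = 0.99, lam: float = 0.95) -> float:
    """(1-λ) Σ_n λ^(n-1) G_t^(n), with the leftover weight λ^(T-t-1) on the full return."""
    T = len(rewards)
    max_n = T - t
    weights = [(1 - lam) * lam ** (n - 1) for n in range(1, max_n)] + [lam ** (max_n - 1)]
    returns = [n_step_return(rewards, values, t, n, gamma) for n in range(1, max_n + 1)]
    return float(np.dot(weights, returns))
```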
TD learning can diverge when combining: (1) Function approximation, (2) Bootstrapping, and (3) Off-policy learning. This 'deadly triad' was a major obstacle in deep RL until techniques like target networks (DQN), double Q-learning, and clipped objectives (PPO) provided stabilization.
For large or continuous state spaces, we cannot store a value for every state. Instead, we use function approximation: parameterized functions that generalize across similar states.
Linear Value Functions
The simplest approximation uses linear combination of features:
$$V_\mathbf{w}(s) = \mathbf{w}^\top \phi(s) = \sum_i w_i \phi_i(s)$$
where $\phi(s)$ is a feature vector and $\mathbf{w}$ are learned weights.
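A minimal sketch of semi-gradient TD(0) with a linear value function; the feature map φ(s) is assumed to be supplied by the caller, and the class name and defaults are illustrative:

```python
import numpy as np

class LinearTDValue:
    """Semi-gradient TD(0) with V_w(s) = wᵀ φ(s)."""

    def __init__(self, n_features: int, alpha: float = 0.01, gamma: float = 0.99):
        self.w = np.zeros(n_features)
        self.alpha = alpha
        self.gamma = gamma

    def value(self, phi_s: np.ndarray) -> float:
        return float(self.w @ phi_s)

    def update(self, phi_s: np.ndarray, reward: float,
               phi_next: np.ndarray, done: bool) -> float:
        """w += α δ φ(s); the gradient of V_w(s) with respect to w is φ(s)."""
        target = reward if done else reward + self.gamma * self.value(phi_next)
        delta = target - self.value(phi_s)   # TD error
        self.w += self.alpha * delta * phi_s
        return delta
```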
Neural Network Value Functions
Deep RL uses neural networks to represent values:
$$V_\theta(s) = \text{NeuralNet}_\theta(s)$$ $$Q_\theta(s, a) = \text{NeuralNet}_\theta(s, a)$$
Networks can capture complex, non-linear value landscapes that linear functions cannot.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Tuple


class ValueNetwork(nn.Module):
    """
    Neural network for state value function V(s).

    Takes state as input, outputs single scalar value.
    """

    def __init__(self, state_dim: int, hidden_dims: Tuple[int, ...] = (64, 64)):
        super().__init__()

        layers = []
        prev_dim = state_dim
        for hidden_dim in hidden_dims:
            layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.ReLU(),
            ])
            prev_dim = hidden_dim
        layers.append(nn.Linear(prev_dim, 1))

        self.network = nn.Sequential(*layers)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        """Return value estimate for state(s)."""
        return self.network(state).squeeze(-1)


class QNetwork(nn.Module):
    """
    Neural network for action value function Q(s,a).

    For DISCRETE actions: takes state, outputs Q-value for each action.
    """

    def __init__(self, state_dim: int, n_actions: int,
                 hidden_dims: Tuple[int, ...] = (64, 64)):
        super().__init__()

        layers = []
        prev_dim = state_dim
        for hidden_dim in hidden_dims:
            layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.ReLU(),
            ])
            prev_dim = hidden_dim
        layers.append(nn.Linear(prev_dim, n_actions))

        self.network = nn.Sequential(*layers)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        """Return Q-values for all actions."""
        return self.network(state)

    def get_value(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        """Return Q-value for specific state-action pair."""
        q_values = self.forward(state)
        return q_values.gather(1, action.unsqueeze(1)).squeeze(1)


class ContinuousQNetwork(nn.Module):
    """
    Q-network for CONTINUOUS actions.

    Takes (state, action) pair as input, outputs single Q-value.
    Used in DDPG, TD3, SAC.
    """

    def __init__(self, state_dim: int, action_dim: int,
                 hidden_dims: Tuple[int, ...] = (256, 256)):
        super().__init__()

        layers = []
        prev_dim = state_dim + action_dim  # Concatenate state and action
        for hidden_dim in hidden_dims:
            layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.ReLU(),
            ])
            prev_dim = hidden_dim
        layers.append(nn.Linear(prev_dim, 1))

        self.network = nn.Sequential(*layers)

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        """Return Q-value for state-action pair."""
        x = torch.cat([state, action], dim=-1)
        return self.network(x).squeeze(-1)


class DuelingDQN(nn.Module):
    """
    Dueling DQN architecture (Wang et al., 2016).

    Separates Q(s,a) into:
    - V(s): state value (how good is this state?)
    - A(s,a): advantage (how much better is this action than average?)

    Q(s,a) = V(s) + A(s,a) - mean_a'(A(s,a'))

    The subtraction ensures A sums to zero and V represents the true
    state value. This separation helps with credit assignment and
    learning efficiency.
    """

    def __init__(self, state_dim: int, n_actions: int, hidden_dim: int = 128):
        super().__init__()

        # Shared feature extraction
        self.features = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
        )

        # Value stream: V(s)
        self.value_stream = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )

        # Advantage stream: A(s,a)
        self.advantage_stream = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_actions)
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        features = self.features(state)
        value = self.value_stream(features)
        advantages = self.advantage_stream(features)

        # Combine: Q = V + (A - mean(A))
        # Subtracting mean ensures identifiability
        q_values = value + (advantages - advantages.mean(dim=-1, keepdim=True))
        return q_values
```

The Bias-Variance Trade-off in Approximation
Function approximation introduces approximation error. Even if we had infinite data, a linear function cannot perfectly represent a non-linear value function.
Capacity considerations: low-capacity approximators (such as linear functions) have high bias but low variance and train stably; high-capacity approximators (deep networks) have low bias but high variance and are more prone to instability.
Neural networks are high-capacity approximators, so RL typically struggles more with variance (instability) than bias. This is why regularization, target networks, and careful hyperparameter tuning are essential.
The Instability Problem
In TD learning with function approximation, the update is: $$\theta \leftarrow \theta + \alpha (r + \gamma V_\theta(s') - V_\theta(s)) \nabla_\theta V_\theta(s)$$
The target $r + \gamma V_\theta(s')$ depends on the same parameters $\theta$ we're updating. This creates a moving target problem—as we update, the target changes, potentially causing oscillation or divergence.
The Target Network Solution
Maintain a separate target network $\theta^-$ that's updated slowly:
$$\theta \leftarrow \theta + \alpha (r + \gamma V_{\theta^-}(s') - V_\theta(s)) \nabla_\theta V_\theta(s)$$
Now the target is stable: it only changes when we explicitly update $\theta^-$.
| Strategy | Update Rule | Characteristics |
|---|---|---|
| Hard Update | θ⁻ ← θ every N steps | Simple; targets change abruptly |
| Soft Update (Polyak) | θ⁻ ← τθ + (1-τ)θ⁻ every step | Smooth; τ ≈ 0.005 typical |
| EMA (Exponential Moving Average) | θ⁻ ← decay × θ⁻ + (1-decay) × θ | Same as soft update, different naming |
```python
import torch
import torch.nn as nn
from copy import deepcopy


class TargetNetworkMixin:
    """
    Mixin providing target network functionality.

    Used in DQN, DDPG, TD3, SAC, and most deep RL algorithms.
    """

    def init_target_networks(self, networks: dict):
        """
        Create target networks as copies of main networks.

        Args:
            networks: Dict of {name: network} pairs
        """
        self.target_networks = {}
        for name, network in networks.items():
            # Deep copy to create separate parameters
            self.target_networks[name] = deepcopy(network)
            # Freeze target network (no gradient computation)
            for param in self.target_networks[name].parameters():
                param.requires_grad = False

    def soft_update(self, tau: float = 0.005):
        """
        Soft (Polyak) update of target networks.

        θ_target = τ × θ + (1-τ) × θ_target

        Small τ (0.001-0.01) provides stable, slowly-moving targets.
        """
        for name, target_net in self.target_networks.items():
            main_net = getattr(self, name)
            for target_param, main_param in zip(
                target_net.parameters(), main_net.parameters()
            ):
                target_param.data.copy_(
                    tau * main_param.data + (1 - tau) * target_param.data
                )

    def hard_update(self):
        """
        Hard update: copy main network parameters to target.

        Used every N steps in original DQN.
        """
        for name, target_net in self.target_networks.items():
            main_net = getattr(self, name)
            target_net.load_state_dict(main_net.state_dict())


class DQN(nn.Module, TargetNetworkMixin):
    """
    Deep Q-Network with target network.

    Key innovations from Mnih et al., 2015:
    1. Experience replay (break correlation)
    2. Target network (stabilize targets)
    """

    def __init__(self, state_dim: int, n_actions: int, hidden_dims=(128, 128)):
        super().__init__()

        # Main Q-network
        layers = []
        prev = state_dim
        for h in hidden_dims:
            layers.extend([nn.Linear(prev, h), nn.ReLU()])
            prev = h
        layers.append(nn.Linear(prev, n_actions))
        self.q_network = nn.Sequential(*layers)

        # Create target network
        self.init_target_networks({'q_network': self.q_network})

        self.target_update_freq = 1000
        self.update_count = 0

    def get_q_values(self, state: torch.Tensor) -> torch.Tensor:
        """Get Q-values from main network."""
        return self.q_network(state)

    def get_target_q_values(self, state: torch.Tensor) -> torch.Tensor:
        """Get Q-values from target network."""
        return self.target_networks['q_network'](state)

    def update(self, states, actions, rewards, next_states, dones, gamma=0.99):
        """
        Perform one DQN update step.
        """
        # Current Q-values for taken actions
        q_values = self.get_q_values(states).gather(1, actions.unsqueeze(1))

        # Target: r + γ max_a' Q_target(s', a')
        with torch.no_grad():
            next_q_values = self.get_target_q_values(next_states)
            max_next_q = next_q_values.max(dim=1)[0]
            targets = rewards + gamma * max_next_q * (1 - dones.float())

        # MSE loss
        loss = nn.functional.mse_loss(q_values.squeeze(), targets)

        # ... optimizer step ...

        # Periodic hard update
        self.update_count += 1
        if self.update_count % self.target_update_freq == 0:
            self.hard_update()

        return loss.item()
```

Value functions serve multiple purposes across different RL algorithms:
1. Policy Improvement (Q-learning, DQN)
With $Q^*$, act greedily: $$\pi^*(s) = \arg\max_a Q^*(s, a)$$
No explicit policy network needed—the Q-function implicitly defines the policy.
2. Advantage Computation (Actor-Critic)
The critic estimates V(s), which is used to compute advantages: $$A(s, a) = r + \gamma V(s') - V(s) \approx Q(s, a) - V(s)$$
Advantages guide the actor's policy gradient updates.
3. Baseline for Variance Reduction (REINFORCE with baseline)
Subtracting V(s) from returns doesn't change the expected gradient but reduces variance: $$\nabla_\theta J = \mathbb{E}[\nabla_\theta \log \pi(a|s) (G - V(s))]$$
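A short PyTorch sketch of this baseline-corrected loss for one batch of transitions; the tensor names are illustrative, and detaching the value baseline is what keeps the gradient estimate unbiased:

```python
import torch

def reinforce_with_baseline_loss(log_probs: torch.Tensor,
                                 returns: torch.Tensor,
                                 values: torch.Tensor) -> torch.Tensor:
    """-E[log π(a|s) (G - V(s))]: the baseline shifts the signal, not the expected gradient."""
    advantages = returns - values.detach()   # no gradient flows through the baseline here
    return -(log_probs * advantages).mean()
```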
4. Bootstrapping (TD methods)
Value estimates enable learning before episode end: $$V(s) \leftarrow V(s) + \alpha[r + \gamma V(s') - V(s)]$$
| Algorithm | Value Function | Primary Use |
|---|---|---|
| Q-Learning | Q(s,a) | Policy = argmax Q |
| DQN | Q(s,a) neural net | Policy = argmax Q |
| SARSA | Q(s,a) | On-policy TD control |
| A2C/A3C | V(s) | Baseline for PG, compute advantage |
| PPO | V(s) | GAE advantage estimation |
| SAC | V(s) and Q(s,a) | Entropy-regularized value |
| DDPG/TD3 | Q(s,a) | Critic for continuous control |
| Monte Carlo | V(s) or Q(s,a) | First/every-visit estimation |
For discrete actions, Q-networks are natural: output Q(s,a) for all a, take argmax. For continuous actions, Q(s,a) requires taking (s,a) pairs as input, which is used in DDPG/TD3/SAC. Some continuous algorithms use only V (PPO, TRPO) and never explicitly compute Q.
Monitor: (1) Average value magnitude—should be reasonable given reward scale and horizon. (2) Value variance—shouldn't explode. (3) TD error—should decrease over training. (4) Correlation with Monte Carlo returns—values should predict actual returns. If these are off, check reward normalization, learning rate, and network capacity.
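A small sketch of these diagnostics computed from arrays logged during training; the function and field names are illustrative:

```python
import numpy as np

def value_function_diagnostics(values: np.ndarray,
                               td_errors: np.ndarray,
                               mc_returns: np.ndarray) -> dict:
    """Summary statistics for monitoring value-function health."""
    corr = float(np.corrcoef(values, mc_returns)[0, 1]) if len(values) > 1 else float('nan')
    return {
        'value_mean': float(values.mean()),                    # should match reward scale × horizon
        'value_std': float(values.std()),                      # should not explode
        'td_error_mean_abs': float(np.abs(td_errors).mean()),  # should shrink over training
        'value_return_corr': corr,                             # values should track actual returns
    }
```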
Value functions are the mathematical lens through which RL agents see the future. They compress the complexity of long-term consequences into tractable estimates that guide decision-making. The key takeaways: $V^\pi(s)$ measures expected return from a state under a policy, while $Q^\pi(s,a)$ also conditions on the first action and enables model-free policy improvement; the Bellman equations give both functions a recursive structure that algorithms exploit; advantages $A^\pi = Q^\pi - V^\pi$ center the learning signal for policy gradients; TD learning bootstraps from current estimates while Monte Carlo waits for complete returns; and with function approximation, stabilizers such as target networks keep learning from diverging.
Looking Ahead
With policies and value functions covered, one crucial concept remains: the Markov property. This mathematical assumption underlies all of RL's tractability. The next page examines what Markovianity means, when it holds, what happens when it doesn't, and how this foundational assumption shapes everything we've learned.
You now understand value functions—the quantitative foundation of reinforcement learning. From V and Q to Bellman equations, from TD learning to neural network approximation, value functions are how RL agents reason about the future. Next, we complete the picture with the Markov property.