Given a state, what should an agent do? This question—seemingly simple—is the essence of reinforcement learning. The answer is formalized as a policy: a mapping from states to actions that defines the agent's behavior.
Policies are the central object of study in RL. We evaluate policies to understand their performance. We compare policies to identify better strategies. We improve policies through learning. And we search for the optimal policy—the strategy that maximizes expected cumulative reward.
Understanding policies deeply means understanding how agents decide, how decisions can be represented, learned, and optimized. This page develops that understanding from first principles.
By the end of this page, you'll understand deterministic and stochastic policies, their mathematical formulations, practical representations using function approximators, policy evaluation methods, the concept of optimal policies, and how policy decisions propagate through time to determine long-term outcomes.
Formally, a policy is a mapping from states to actions or distributions over actions. It completely specifies the agent's behavior—given any state, the policy determines what the agent does.
Deterministic Policy
A deterministic policy $\mu: \mathcal{S} \rightarrow \mathcal{A}$ maps each state to exactly one action:
$$a = \mu(s)$$
Given state $s$, the agent always takes action $\mu(s)$. There's no randomness in action selection.
Stochastic Policy
A stochastic policy $\pi: \mathcal{S} \times \mathcal{A} \rightarrow [0, 1]$ maps each state to a probability distribution over actions:
$$a \sim \pi(\cdot|s)$$
The notation $\pi(a|s)$ denotes the probability of taking action $a$ in state $s$. For all states, probabilities must sum to one:
$$\sum_{a \in \mathcal{A}} \pi(a|s) = 1 \quad \text{(discrete)}$$ $$\int_{\mathcal{A}} \pi(a|s) \, da = 1 \quad \text{(continuous)}$$
| Aspect | Deterministic a = μ(s) | Stochastic a ~ π(·|s) |
|---|---|---|
| Output | Single action | Probability distribution |
| Exploration | No inherent exploration | Built-in randomness enables exploration |
| Differentiability | argmax breaks gradients | Sampling enables gradient estimation |
| Optimality | Optimal policy can be deterministic* | Need stochasticity for exploration during learning |
| Game Theory | Can be exploited by adversary | Randomization prevents pure exploitation |
| Common Use | DDPG, TD3 (continuous control) | PPO, A2C, SAC, Policy Gradient |
Why Stochastic Policies?
If an optimal deterministic policy always exists (as it does in standard, fully observable MDPs), why use stochastic policies at all?
Exploration: During learning, we need to try different actions to discover which are best. Stochastic policies naturally explore.
Gradient-based learning: Policy gradient methods require differentiable action selection. Stochasticity enables the log-derivative trick.
Robustness: In adversarial settings (games, security), randomized strategies can be unexploitable when deterministic ones cannot.
Handling partial observability: When the state is aliased (same observation from different true states), stochastic policies can be optimal.
Regularization: Entropy bonuses encourage stochastic policies during training, preventing premature convergence to suboptimal deterministic policies.
*The optimal policy in an MDP with full observability can always be expressed as deterministic, but we often use stochastic policies during learning and derive the deterministic greedy policy afterward.
Stochastic policies elegantly solve the exploration problem during training. Higher entropy (more uniform) distributions explore more; lower entropy (more peaked) distributions exploit more. Many algorithms explicitly control entropy to manage this trade-off.
In practice, policies are represented by parameterized functions, typically neural networks. We denote a parameterized policy as $\pi_\theta$ where $\theta$ represents the learnable parameters.
Tabular Policies
For small, discrete state and action spaces, we can store the policy explicitly:
$$\pi(a|s) = \text{Table}[s, a]$$
This requires $|\mathcal{S}| \times |\mathcal{A}|$ parameters. Infeasible for large spaces, but optimal for small problems.
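As a concrete illustration, here is a minimal sketch of tabular policies, both deterministic and stochastic; the state/action counts and probabilities are invented for illustration.

```python
import numpy as np

n_states, n_actions = 3, 2

# Deterministic tabular policy: one stored action per state, a = μ(s)
mu = np.array([1, 0, 1])

# Stochastic tabular policy: an |S| × |A| table with π(a|s) = Table[s, a]
pi_table = np.array([[0.9, 0.1],
                     [0.5, 0.5],
                     [0.2, 0.8]])
assert np.allclose(pi_table.sum(axis=1), 1.0)   # each row is a distribution

rng = np.random.default_rng(0)
state = 2
action = rng.choice(n_actions, p=pi_table[state])   # a ~ π(·|s)
```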
Linear Policies
The simplest function approximation:
$$\pi(a|s) = \text{softmax}(\phi(s)^\top \mathbf{w}_a)$$
where $\phi(s)$ is a feature vector. Limited expressiveness but interpretable.
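Below is a minimal sketch of such a linear softmax policy; the feature map `phi` and weight matrix `W` are hypothetical stand-ins invented for illustration, not part of any particular library.

```python
import numpy as np

def phi(state: np.ndarray) -> np.ndarray:
    """Hypothetical feature map: the raw state plus a bias term."""
    return np.concatenate([state, [1.0]])

rng = np.random.default_rng(0)
n_actions, feat_dim = 3, 5
W = rng.normal(scale=0.1, size=(feat_dim, n_actions))   # one weight column per action

def linear_policy(state: np.ndarray) -> np.ndarray:
    """π(·|s) = softmax(φ(s)ᵀ W), returned as a probability vector."""
    logits = phi(state) @ W
    logits -= logits.max()                     # shift for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs

state = rng.normal(size=4)
probs = linear_policy(state)
action = rng.choice(n_actions, p=probs)        # a ~ π(·|s)
```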
Neural Network Policies
Deep policies can represent complex mappings:
$$\pi_\theta(a|s) = \text{NeuralNet}_\theta(s)$$
The network architecture depends on state and action types.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributions as D
from typing import Tuple, Optional
import numpy as np


class CategoricalPolicy(nn.Module):
    """
    Stochastic policy for discrete action spaces.

    Architecture: MLP → softmax over actions
    Distribution: Categorical
    Used by: A2C, PPO, REINFORCE with discrete actions
    """

    def __init__(self, state_dim: int, n_actions: int,
                 hidden_dims: Tuple[int, ...] = (64, 64)):
        super().__init__()

        layers = []
        prev_dim = state_dim
        for hidden_dim in hidden_dims:
            layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.ReLU(),
            ])
            prev_dim = hidden_dim
        layers.append(nn.Linear(prev_dim, n_actions))

        self.network = nn.Sequential(*layers)

    def forward(self, state: torch.Tensor) -> D.Categorical:
        """Return action distribution for given state."""
        logits = self.network(state)
        return D.Categorical(logits=logits)

    def act(self, state: torch.Tensor,
            deterministic: bool = False) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Sample action and compute log probability.

        Args:
            state: Current state [batch, state_dim]
            deterministic: If True, return mode (greedy action)

        Returns:
            action: Selected action [batch]
            log_prob: Log probability of action [batch]
        """
        dist = self.forward(state)

        if deterministic:
            action = dist.probs.argmax(dim=-1)
        else:
            action = dist.sample()

        log_prob = dist.log_prob(action)
        return action, log_prob

    def evaluate(self, state: torch.Tensor,
                 action: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Evaluate log probability and entropy for given state-action pair.

        Used in PPO and A2C for computing the ratio and entropy bonus.
        """
        dist = self.forward(state)
        log_prob = dist.log_prob(action)
        entropy = dist.entropy()
        return log_prob, entropy


class GaussianPolicy(nn.Module):
    """
    Stochastic policy for continuous action spaces.

    Architecture: MLP → (mean, log_std)
    Distribution: Independent Gaussian per action dimension
    Used by: PPO, A2C, TRPO with continuous actions
    """

    def __init__(self, state_dim: int, action_dim: int,
                 hidden_dims: Tuple[int, ...] = (64, 64),
                 log_std_init: float = 0.0,
                 log_std_min: float = -20.0,
                 log_std_max: float = 2.0):
        super().__init__()
        self.action_dim = action_dim
        self.log_std_min = log_std_min
        self.log_std_max = log_std_max

        # Shared feature extractor
        layers = []
        prev_dim = state_dim
        for hidden_dim in hidden_dims:
            layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.ReLU(),
            ])
            prev_dim = hidden_dim
        self.features = nn.Sequential(*layers)

        # Separate heads for mean and log_std
        self.mean_head = nn.Linear(prev_dim, action_dim)
        self.log_std_head = nn.Linear(prev_dim, action_dim)

        # Initialize log_std to desired starting value
        nn.init.constant_(self.log_std_head.bias, log_std_init)

    def forward(self, state: torch.Tensor) -> D.Normal:
        """Return action distribution for given state."""
        features = self.features(state)
        mean = self.mean_head(features)
        log_std = self.log_std_head(features)
        log_std = torch.clamp(log_std, self.log_std_min, self.log_std_max)
        std = torch.exp(log_std)
        return D.Normal(mean, std)

    def act(self, state: torch.Tensor,
            deterministic: bool = False) -> Tuple[torch.Tensor, torch.Tensor]:
        """Sample action and compute log probability."""
        dist = self.forward(state)

        if deterministic:
            action = dist.mean
        else:
            action = dist.rsample()  # Reparameterized sampling

        # Sum log probs across action dimensions
        log_prob = dist.log_prob(action).sum(dim=-1)
        return action, log_prob

    def evaluate(self, state: torch.Tensor,
                 action: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        """Evaluate log probability and entropy."""
        dist = self.forward(state)
        log_prob = dist.log_prob(action).sum(dim=-1)
        entropy = dist.entropy().sum(dim=-1)
        return log_prob, entropy


class SquashedGaussianPolicy(nn.Module):
    """
    Squashed Gaussian policy for bounded continuous actions.

    Actions are sampled from a Gaussian, then passed through tanh to
    bound them to [-1, 1]. The log probability is corrected for the
    change of variables.

    Used by: SAC (Soft Actor-Critic)

    Key insight: tanh squashing ensures actions stay in the valid range
    while maintaining differentiability for gradient-based optimization.
    """

    def __init__(self, state_dim: int, action_dim: int,
                 hidden_dims: Tuple[int, ...] = (256, 256),
                 log_std_min: float = -20.0,
                 log_std_max: float = 2.0):
        super().__init__()
        self.log_std_min = log_std_min
        self.log_std_max = log_std_max

        # Build MLP
        layers = []
        prev_dim = state_dim
        for hidden_dim in hidden_dims:
            layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.ReLU(),
            ])
            prev_dim = hidden_dim
        self.net = nn.Sequential(*layers)

        self.mean_linear = nn.Linear(prev_dim, action_dim)
        self.log_std_linear = nn.Linear(prev_dim, action_dim)

    def forward(self, state: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        """Return mean and log_std of Gaussian (before squashing)."""
        x = self.net(state)
        mean = self.mean_linear(x)
        log_std = self.log_std_linear(x)
        log_std = torch.clamp(log_std, self.log_std_min, self.log_std_max)
        return mean, log_std

    def act(self, state: torch.Tensor,
            deterministic: bool = False) -> Tuple[torch.Tensor, torch.Tensor]:
        """Sample squashed action and compute corrected log probability."""
        mean, log_std = self.forward(state)
        std = torch.exp(log_std)
        dist = D.Normal(mean, std)

        if deterministic:
            u = mean            # Pre-squash action
        else:
            u = dist.rsample()  # Reparameterized sample

        # Squash through tanh
        action = torch.tanh(u)

        # Correct log_prob for squashing (change of variables)
        # log π(a|s) = log p(u) - log |det(da/du)|
        #            = log p(u) - sum_i log(1 - tanh^2(u_i))
        log_prob = dist.log_prob(u).sum(dim=-1)
        log_prob -= torch.log(1 - action.pow(2) + 1e-6).sum(dim=-1)

        return action, log_prob


class DeterministicPolicy(nn.Module):
    """
    Deterministic policy for continuous actions.

    Used by: DDPG, TD3

    Since action selection is deterministic, exploration must be added
    externally (e.g., OU noise, Gaussian noise).
    """

    def __init__(self, state_dim: int, action_dim: int,
                 hidden_dims: Tuple[int, ...] = (256, 256),
                 action_bound: float = 1.0):
        super().__init__()
        self.action_bound = action_bound

        layers = []
        prev_dim = state_dim
        for hidden_dim in hidden_dims:
            layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.ReLU(),
            ])
            prev_dim = hidden_dim
        layers.append(nn.Linear(prev_dim, action_dim))

        self.network = nn.Sequential(*layers)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        """Return deterministic action."""
        action = self.network(state)
        return self.action_bound * torch.tanh(action)

    def act(self, state: torch.Tensor, noise_std: float = 0.0) -> torch.Tensor:
        """Return action, optionally with exploration noise."""
        action = self.forward(state)
        if noise_std > 0:
            noise = torch.randn_like(action) * noise_std
            action = action + noise
            action = torch.clamp(action, -self.action_bound, self.action_bound)
        return action
```

Policy evaluation answers: how good is a given policy? Specifically, what is the expected return when following policy $\pi$?
The Value of a Policy
The performance or value of policy $\pi$ is the expected return when starting from the initial state distribution and following $\pi$:
$$J(\pi) = \mathbb{E}_{\tau \sim \pi}[G_0] = \mathbb{E}_{s_0 \sim p_0,\, a_t \sim \pi} \left[ \sum_{t=0}^{\infty} \gamma^t r_{t+1} \right]$$
Comparing policies is simple: $\pi$ is better than $\pi'$ if $J(\pi) > J(\pi')$.
Monte Carlo Evaluation
The simplest evaluation method: run the policy many times and average returns.
$$\hat{J}(\pi) = \frac{1}{N} \sum_{i=1}^{N} G_0^{(i)}$$
This is an unbiased estimator but has high variance—some episodes may be much longer or luckier than others.
Temporal Difference Evaluation
Instead of waiting for episode end, update value estimates incrementally using the Bellman equation. This is lower variance but introduces bias from bootstrapping.
State Value Function
The state value function $V^\pi(s)$ gives the expected return starting from state $s$ and following policy $\pi$:
$$V^\pi(s) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^t r_{t+1} \,\middle|\, s_0 = s \right]$$
This satisfies the Bellman expectation equation:
$$V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V^\pi(s') \right]$$
In words: the value of a state is the expected immediate reward plus the discounted value of the next state, averaged over the policy's action distribution and environment dynamics.
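To make the backup concrete, here is a minimal sketch of iterative policy evaluation on a small, randomly generated tabular MDP. All arrays are invented for illustration, and the expected reward $R(s,a)$ is used in place of $R(s,a,s')$ for brevity.

```python
import numpy as np

n_states, n_actions, gamma = 5, 2, 0.9
rng = np.random.default_rng(0)

# Hypothetical MDP and policy, for illustration only
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s'] = P(s'|s,a)
R = rng.normal(size=(n_states, n_actions))                        # expected reward R(s, a)
pi = rng.dirichlet(np.ones(n_actions), size=n_states)             # π(a|s)

V = np.zeros(n_states)
for _ in range(1000):
    # Bellman expectation backup:
    # V^π(s) ← Σ_a π(a|s) [ R(s,a) + γ Σ_{s'} P(s'|s,a) V^π(s') ]
    lookahead = R + gamma * np.einsum('sap,p->sa', P, V)  # one-step lookahead per (s, a)
    V_new = (pi * lookahead).sum(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:                  # stop once the backup has converged
        break
    V = V_new
```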
Action Value Function
The action value function $Q^\pi(s, a)$ gives the expected return starting from state $s$, taking action $a$, then following policy $\pi$:
$$Q^\pi(s, a) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^t r_{t+1} \,\middle|\, s_0 = s, a_0 = a \right]$$
Relation to state value: $$V^\pi(s) = \sum_a \pi(a|s) Q^\pi(s, a) = \mathbb{E}_{a \sim \pi}[Q^\pi(s, a)]$$
V(s) is sufficient for policy improvement if you know the environment dynamics (model-based). Q(s,a) enables model-free policy improvement—you can choose the best action without knowing what states actions lead to. This is why Q-learning and actor-critic methods are so popular.
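A minimal sketch of that distinction, using hypothetical arrays standing in for learned or known quantities: improving from Q needs only an argmax per state, while improving from V needs the transition model for a one-step lookahead.

```python
import numpy as np

n_states, n_actions, gamma = 4, 2, 0.99
rng = np.random.default_rng(1)

# Invented stand-ins for illustration
Q = rng.normal(size=(n_states, n_actions))                         # a learned Q^π(s, a)
V = Q.mean(axis=1)                                                 # V^π under a uniform policy
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # model: P(s'|s,a)
R = rng.normal(size=(n_states, n_actions))                         # model: expected R(s,a)

# Model-free improvement: greedy in Q, no dynamics needed
greedy_from_q = Q.argmax(axis=1)

# Model-based improvement: need P and R to do a one-step lookahead on V
lookahead = R + gamma * np.einsum('sap,p->sa', P, V)   # E[r + γ V(s') | s, a]
greedy_from_v = lookahead.argmax(axis=1)
```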
```python
import numpy as np
from typing import List, Tuple, Callable
import torch
import torch.nn as nn


def monte_carlo_evaluation(
    env,
    policy: Callable,
    num_episodes: int = 100,
    gamma: float = 0.99,
    max_steps: int = 1000
) -> Tuple[float, float, List[float]]:
    """
    Evaluate policy using Monte Carlo rollouts.

    Args:
        env: Gymnasium-style environment
        policy: Function mapping state to action
        num_episodes: Number of episodes to sample
        gamma: Discount factor
        max_steps: Maximum steps per episode

    Returns:
        mean_return: Estimated J(π)
        std_return: Standard deviation
        all_returns: List of individual episode returns
    """
    all_returns = []

    for episode in range(num_episodes):
        state, _ = env.reset()
        episode_rewards = []

        for step in range(max_steps):
            action = policy(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            episode_rewards.append(reward)

            if terminated or truncated:
                break
            state = next_state

        # Compute discounted return
        G = 0.0
        for reward in reversed(episode_rewards):
            G = reward + gamma * G
        all_returns.append(G)

    mean_return = np.mean(all_returns)
    std_return = np.std(all_returns)
    return mean_return, std_return, all_returns


class TDPolicyEvaluation:
    """
    Temporal Difference policy evaluation (TD(0)).

    Updates the value estimate after each step using:
        V(s) ← V(s) + α [r + γV(s') - V(s)]

    Lower variance than Monte Carlo but biased due to bootstrapping.
    """

    def __init__(self, state_dim: int,
                 hidden_dims: Tuple[int, ...] = (64, 64),
                 learning_rate: float = 1e-3,
                 gamma: float = 0.99):
        self.gamma = gamma

        # Value function approximator
        layers = []
        prev_dim = state_dim
        for hidden_dim in hidden_dims:
            layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.ReLU(),
            ])
            prev_dim = hidden_dim
        layers.append(nn.Linear(prev_dim, 1))

        self.value_net = nn.Sequential(*layers)
        self.optimizer = torch.optim.Adam(
            self.value_net.parameters(), lr=learning_rate
        )

    def update(self, state: np.ndarray, reward: float,
               next_state: np.ndarray, done: bool) -> float:
        """
        Perform one TD(0) update.

        Returns:
            td_error: The temporal difference error (for monitoring)
        """
        state_t = torch.FloatTensor(state).unsqueeze(0)
        next_state_t = torch.FloatTensor(next_state).unsqueeze(0)

        # Current value estimate
        value = self.value_net(state_t)

        # TD target
        with torch.no_grad():
            if done:
                target = reward
            else:
                target = reward + self.gamma * self.value_net(next_state_t)

        # TD error
        td_error = target - value

        # Update value function
        loss = td_error.pow(2)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        return td_error.item()

    def estimate_value(self, state: np.ndarray) -> float:
        """Get current value estimate for a state."""
        state_t = torch.FloatTensor(state).unsqueeze(0)
        with torch.no_grad():
            return self.value_net(state_t).item()


def evaluate_policy_with_confidence(
    env,
    policy: Callable,
    num_episodes: int = 100,
    gamma: float = 0.99,
    confidence: float = 0.95
) -> dict:
    """
    Evaluate policy with a confidence interval.

    Returns dictionary with:
        - mean: Point estimate of J(π)
        - std: Standard deviation of returns
        - ci_lower, ci_upper: Confidence interval bounds
        - n_episodes: Number of episodes used
    """
    mean, std, returns = monte_carlo_evaluation(
        env, policy, num_episodes, gamma
    )

    # Confidence interval using t-distribution
    from scipy import stats
    t_value = stats.t.ppf((1 + confidence) / 2, num_episodes - 1)
    margin = t_value * std / np.sqrt(num_episodes)

    return {
        'mean': mean,
        'std': std,
        'ci_lower': mean - margin,
        'ci_upper': mean + margin,
        'confidence': confidence,
        'n_episodes': num_episodes
    }
```

The ultimate goal of RL is to find an optimal policy $\pi^*$—one that maximizes expected return from every state.
Definition of Optimality
A policy $\pi^*$ is optimal if:
$$V^{\pi^*}(s) \geq V^{\pi}(s) \quad \forall s \in \mathcal{S}, \forall \pi$$
The optimal policy achieves the highest value in every state simultaneously. Remarkably, such a policy always exists in MDPs.
Optimal Value Functions
The optimal state value function is: $$V^*(s) = \max_\pi V^\pi(s) = V^{\pi^*}(s)$$
The optimal action value function is: $$Q^*(s, a) = \max_\pi Q^\pi(s, a)$$
These satisfy the Bellman optimality equations:
$$V^*(s) = \max_a \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V^*(s') \right]$$
$$Q^*(s, a) = \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma \max_{a'} Q^*(s', a') \right]$$
Deriving Optimal Policy from Q*
Once we have $Q^*$, the optimal policy is trivially derived:
$$\pi^*(s) = \arg\max_a Q^*(s, a)$$
This is why Q-learning is so powerful: learn $Q^*$, then act greedily with respect to it.
Key Theorems
Theorem 1 (Existence): For any finite MDP, there exists an optimal policy that is deterministic and stationary (doesn't depend on time or history beyond current state).
Theorem 2 (Uniqueness of V*, Q*): The optimal value functions are unique, though optimal policies may not be.
Theorem 3 (Policy Improvement): If we improve a policy at any state (choose an action with higher Q-value), the overall policy improves. This enables iterative policy improvement algorithms.
The Bellman optimality equation defines V* as a fixed point of the Bellman optimality operator T*: V* = T*V*. Value iteration finds this fixed point by repeated application of T*. The contraction property of T* (under discount γ < 1) guarantees convergence to the unique fixed point V*.
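A minimal sketch of value iteration on a small, randomly generated tabular MDP (arrays invented for illustration, with expected rewards $R(s,a)$): repeatedly apply the Bellman optimality backup, then read the greedy policy off $Q^*$.

```python
import numpy as np

n_states, n_actions, gamma = 5, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s'] = P(s'|s,a)
R = rng.normal(size=(n_states, n_actions))                        # expected reward R(s, a)

V = np.zeros(n_states)
for _ in range(1000):
    # Bellman optimality backup: (T* V)(s) = max_a [ R(s,a) + γ Σ_{s'} P(s'|s,a) V(s') ]
    Q = R + gamma * np.einsum('sap,p->sa', P, V)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:   # contraction ⇒ convergence to the fixed point
        break
    V = V_new

pi_star = Q.argmax(axis=1)   # greedy policy: π*(s) = argmax_a Q*(s, a)
```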
Not all policy representations are equally expressive. Understanding the hierarchy of policy classes helps choose appropriate representations.
Memoryless vs. History-Dependent
Memoryless (Markovian) policies depend only on current state: $\pi(a|s)$. Sufficient for MDPs.
History-dependent policies depend on the entire trajectory: $\pi(a|s_0, a_0, \ldots, s_t)$. Necessary for POMDPs where the current observation doesn't capture the true state.
Stationary vs. Non-Stationary
Stationary policies are the same at all time steps: $\pi(a|s)$. Optimal for infinite-horizon discounted MDPs.
Non-stationary policies vary with time: $\pi_t(a|s)$. May be needed for finite-horizon problems or changing objectives.
Reactive vs. Deliberative
Reactive policies compute actions quickly from current state—feedforward networks.
Deliberative policies may perform internal computation (planning, search) before acting—more powerful but slower.
| Class | Form | When Needed | Computation |
|---|---|---|---|
| Tabular | π[s,a] lookup table | Small discrete spaces | O(1) lookup |
| Linear | softmax(φ(s)ᵀW) | Simple problems, interpretability | O(d) per action |
| MLP | Neural network on state | Standard continuous control | O(network size) |
| CNN + MLP | Conv layers + MLP | Image observations | O(image size × depth) |
| RNN/LSTM | Recurrent over history | Partial observability | O(hidden²) per step |
| Transformer | Attention over history | Complex dependencies | O(T² × d) for T steps |
| Planning-based | Search/MCTS at runtime | Games, complex reasoning | O(branching^depth) |
The Representation-Learning Trade-off
More expressive policy classes can represent more complex behaviors, but they typically require more data, are harder to optimize, and are more sensitive to hyperparameters.
Principle: Use the simplest policy class that can express the optimal behavior. If a linear policy suffices, don't use a deep network. If an MLP works, don't use an RNN.
Architecture Design Heuristics: match the architecture to the observation and memory requirements of the task—an MLP for low-dimensional state vectors, a CNN front end for image observations, an RNN/LSTM when the task is partially observable, and attention- or planning-based policies only when the problem's dependencies demand them.
Neural networks are universal function approximators—they can represent any continuous function. But this doesn't mean gradient descent will find the optimal policy. The optimization landscape, sample efficiency, generalization, and hyperparameter sensitivity all affect whether a policy can be learned in practice.
How do we find good policies? There are two major paradigms:
Value-Based Methods (Q-Learning family): learn a value function such as $Q^*$ and derive the policy implicitly by acting greedily with respect to it.
Policy-Based Methods (Policy Gradient family): parameterize the policy directly as $\pi_\theta$ and adjust $\theta$ by gradient ascent on $J(\pi_\theta)$.
The Policy Gradient Theorem
The key insight enabling direct policy optimization:
$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t \right]$$
In words: the gradient of expected return equals the expected sum of gradients of log-probabilities, weighted by returns. Actions that led to high returns have their probabilities increased; actions leading to low returns are decreased.
```python
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from typing import List, Tuple


class SimpleREINFORCE:
    """
    REINFORCE: The simplest policy gradient algorithm.

    Intuition: Run the policy, observe what happened, increase the
    probability of actions that led to high returns, decrease the
    probability of actions that led to low returns.

    This is essentially supervised learning where the "labels"
    (good actions) are discovered through trial and error.
    """

    def __init__(self, policy: nn.Module,
                 learning_rate: float = 1e-3,
                 gamma: float = 0.99):
        self.policy = policy
        self.gamma = gamma
        self.optimizer = optim.Adam(policy.parameters(), lr=learning_rate)

    def compute_returns(self, rewards: List[float]) -> torch.Tensor:
        """Compute discounted returns for each timestep."""
        returns = []
        G = 0.0
        for reward in reversed(rewards):
            G = reward + self.gamma * G
            returns.insert(0, G)

        returns = torch.tensor(returns, dtype=torch.float32)
        # Optionally normalize for stability
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)
        return returns

    def update(self, states: List[np.ndarray],
               actions: List[int],
               rewards: List[float]) -> float:
        """
        Perform one REINFORCE update.

        The loss is: -sum_t [log π(a_t|s_t) * G_t]

        Negative because we're doing gradient ascent on expected return,
        which is gradient descent on negative expected return.
        """
        states_t = torch.FloatTensor(np.array(states))
        actions_t = torch.LongTensor(actions)
        returns = self.compute_returns(rewards)

        # Get log probabilities of taken actions
        dist = self.policy(states_t)
        log_probs = dist.log_prob(actions_t)

        # Policy gradient loss: -E[log π(a|s) * G]
        # Intuition:
        #   - G > 0: action was good, increase log_prob (decrease loss)
        #   - G < 0: action was bad, decrease log_prob (increase loss)
        loss = -(log_probs * returns).mean()

        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        return loss.item()


class ActorCriticPreview:
    """
    Actor-Critic: Combines policy gradient with a value function.

    Key insight: Use a learned value function to reduce the variance
    of the policy gradient estimate.

    Instead of G_t (high variance), use the advantage A_t = G_t - V(s_t).
    This doesn't change the expected gradient but reduces variance.
    """

    def __init__(self,
                 actor: nn.Module,    # Policy network
                 critic: nn.Module,   # Value network
                 actor_lr: float = 3e-4,
                 critic_lr: float = 1e-3,
                 gamma: float = 0.99):
        self.actor = actor
        self.critic = critic
        self.gamma = gamma
        self.actor_optimizer = optim.Adam(actor.parameters(), lr=actor_lr)
        self.critic_optimizer = optim.Adam(critic.parameters(), lr=critic_lr)

    def update(self, states: torch.Tensor,
               actions: torch.Tensor,
               rewards: torch.Tensor,
               next_states: torch.Tensor,
               dones: torch.Tensor) -> Tuple[float, float]:
        """
        Perform an actor-critic update.

        Returns:
            actor_loss: Policy gradient loss
            critic_loss: Value function MSE loss
        """
        # ===== Critic Update =====
        # TD target: r + γ * V(s') for non-terminal transitions
        with torch.no_grad():
            next_values = self.critic(next_states).squeeze()
            td_targets = rewards + self.gamma * next_values * (1 - dones)

        current_values = self.critic(states).squeeze()
        critic_loss = nn.functional.mse_loss(current_values, td_targets)

        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        self.critic_optimizer.step()

        # ===== Actor Update =====
        # Advantage: how much better was the action than average?
        with torch.no_grad():
            advantages = td_targets - current_values

        # Policy gradient with advantage
        dist = self.actor(states)
        log_probs = dist.log_prob(actions)
        actor_loss = -(log_probs * advantages).mean()

        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        self.actor_optimizer.step()

        return actor_loss.item(), critic_loss.item()
```

A subtle but important distinction: the policy used to collect experience may differ from the policy being learned.
Behavior Policy ($\beta$ or $\mu$): The policy actually interacting with the environment, generating experience data.
Target Policy ($\pi$): The policy we're trying to learn or evaluate.
On-Policy Methods
Behavior = Target: We use the current policy $\pi_\theta$ for data collection and update $\pi_\theta$ based on that data.
Off-Policy Methods
Behavior ≠ Target: We collect data with one policy but learn about a different policy.
| Aspect | On-Policy | Off-Policy |
|---|---|---|
| Data generation | Current policy | Any policy (can differ) |
| Sample reuse | Cannot reuse old data | Experience replay possible |
| Sample efficiency | Lower (fresh data needed) | Higher (reuse data) |
| Stability | More stable | Can be less stable |
| Correction needed | No | Importance sampling (for PG) |
| Can learn from demos | No | Yes |
| Examples | A2C, PPO, TRPO | DQN, DDPG, SAC |
Importance Sampling Correction
When the behavior policy $\mu$ differs from target policy $\pi$, expectations must be corrected:
$$\mathbb{E}_{a \sim \pi}[f(a)] = \mathbb{E}_{a \sim \mu}\left[ \frac{\pi(a|s)}{\mu(a|s)} f(a) \right]$$
The ratio $\rho = \pi(a|s) / \mu(a|s)$ reweights samples. High ratio: action is more likely under $\pi$ than $\mu$. Low ratio: less likely.
Problem: Importance ratios can have high variance, especially when $\pi$ and $\mu$ differ significantly.
Solutions: clip or truncate the importance ratios (as PPO's clipped surrogate effectively does), keep the behavior policy close to the target policy, or use per-decision rather than full-trajectory ratios. The sketch below illustrates the basic correction and the variance it introduces.
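The following minimal sketch shows the correction for a single state; the target and behavior distributions and the per-action quantity `f` are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 4

pi = np.array([0.70, 0.10, 0.10, 0.10])   # target policy π(·|s)
mu = np.array([0.25, 0.25, 0.25, 0.25])   # behavior policy μ(·|s)
f = np.array([1.0, 0.0, 0.5, 2.0])        # some per-action quantity f(a)

true_value = (pi * f).sum()                # E_{a~π}[f(a)], what we want to estimate

# Sample actions from μ, then reweight each sample by ρ = π(a)/μ(a)
actions = rng.choice(n_actions, size=10_000, p=mu)
rho = pi[actions] / mu[actions]
is_estimate = (rho * f[actions]).mean()    # unbiased estimate of the target expectation

# The ratio variance grows as π and μ diverge, which is the practical problem
print(true_value, is_estimate, rho.var())
```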
Q-learning is off-policy without needing importance sampling. Why? Because the Bellman update Q(s,a) ← r + γ max Q(s',a') doesn't depend on which policy selected a. The action's value doesn't change based on how we chose it. This makes Q-learning remarkably sample-efficient but limits it to learning deterministic policies.
Entropy measures the randomness of a policy's action distribution:
$$H(\pi(\cdot|s)) = -\sum_a \pi(a|s) \log \pi(a|s) \quad \text{(discrete)}$$
$$H(\pi(\cdot|s)) = -\int \pi(a|s) \log \pi(a|s) \, da \quad \text{(continuous)}$$
High entropy: uniform distribution, lots of randomness. Low entropy: peaked distribution, nearly deterministic.
Why Entropy Matters
Exploration: High entropy policies explore more, visiting diverse states.
Regularization: Entropy bonus prevents premature convergence to suboptimal deterministic policies.
Robustness: Stochastic policies are harder for adversaries to exploit and more robust to model errors.
Maximum Entropy RL: SAC and related algorithms maximize reward AND entropy simultaneously, leading to robust, multi-modal policies.
Entropy-Regularized Objectives
The standard RL objective is: $$J(\pi) = \mathbb{E}_{\pi}\left[ \sum_t \gamma^t r_t \right]$$
The entropy-regularized (maximum entropy) objective adds an entropy bonus: $$J_{MaxEnt}(\pi) = \mathbb{E}_{\pi}\left[ \sum_t \gamma^t \left( r_t + \alpha H(\pi(\cdot|s_t)) \right) \right]$$
where $\alpha$ is the temperature parameter controlling the entropy-reward trade-off.
Effects: the policy remains stochastic throughout training, exploration persists, and multiple near-optimal modes of behavior can be retained rather than collapsing to a single action; as $\alpha \to 0$ the standard objective is recovered.
Automatic Entropy Tuning (SAC)
Instead of manually tuning $\alpha$, we can learn it by constraining entropy to a target:
$$\alpha^* = \arg\min_\alpha \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi(\cdot|s)}\left[ -\alpha \log \pi(a|s) - \alpha \bar{H} \right]$$
where $\bar{H}$ is the target entropy (e.g., $-\dim(\mathcal{A})$ for continuous actions).
```python
import torch
import torch.nn as nn
from torch.distributions import Categorical, Normal


def compute_entropy_discrete(logits: torch.Tensor) -> torch.Tensor:
    """
    Compute entropy of a categorical distribution.

    H = -sum_a p(a) log p(a)

    Using logits directly for numerical stability:
    H = log(sum(exp(logits))) - sum(p * logits)
    """
    probs = torch.softmax(logits, dim=-1)
    log_probs = torch.log_softmax(logits, dim=-1)
    entropy = -(probs * log_probs).sum(dim=-1)
    return entropy


def compute_entropy_gaussian(log_std: torch.Tensor) -> torch.Tensor:
    """
    Compute entropy of a diagonal Gaussian distribution.

    For N(μ, σ²): H = 0.5 * log(2πeσ²) = 0.5 * (1 + log(2π) + 2*log_std)
    """
    return 0.5 * (1 + torch.log(2 * torch.tensor(torch.pi)) + 2 * log_std).sum(dim=-1)


class EntropyRegularizedPolicyGradient:
    """
    Policy gradient with entropy regularization.

    Objective: max E[sum_t (r_t + α H(π(·|s_t)))]

    The entropy bonus encourages exploration and prevents the policy
    from becoming too deterministic too early.
    """

    def __init__(self, policy: nn.Module,
                 learning_rate: float = 3e-4,
                 gamma: float = 0.99,
                 entropy_coef: float = 0.01):
        self.policy = policy
        self.gamma = gamma
        self.entropy_coef = entropy_coef
        self.optimizer = torch.optim.Adam(policy.parameters(), lr=learning_rate)

    def compute_loss(self, states: torch.Tensor,
                     actions: torch.Tensor,
                     returns: torch.Tensor) -> tuple:
        """
        Compute policy gradient loss with entropy bonus.

        Loss = -E[log π(a|s) * return] - α * E[H(π)]
        """
        dist = self.policy(states)
        log_probs = dist.log_prob(actions)
        entropy = dist.entropy()

        # Policy gradient term
        pg_loss = -(log_probs * returns).mean()

        # Entropy bonus (negative because we want to maximize entropy)
        entropy_loss = -entropy.mean()

        total_loss = pg_loss + self.entropy_coef * entropy_loss
        return total_loss, pg_loss.item(), entropy.mean().item()


class AutomaticEntropyTuning:
    """
    Automatic entropy coefficient tuning (as in SAC).

    Instead of a fixed α, learn α to maintain a target entropy.

    Objective: min_α E[-α * log π(a|s) - α * H_target]

    This increases α if entropy is below target (encouraging exploration)
    and decreases α if entropy is above target.
    """

    def __init__(self, action_dim: int,
                 initial_alpha: float = 1.0,
                 learning_rate: float = 3e-4):
        # Target entropy: -dim(A) is a common choice for continuous actions
        self.target_entropy = -action_dim

        # Learn log(α) for numerical stability
        self.log_alpha = torch.tensor(
            [torch.log(torch.tensor(initial_alpha))], requires_grad=True
        )
        self.optimizer = torch.optim.Adam([self.log_alpha], lr=learning_rate)

    @property
    def alpha(self) -> torch.Tensor:
        return self.log_alpha.exp()

    def update(self, log_probs: torch.Tensor) -> float:
        """
        Update α based on current policy entropy.

        Args:
            log_probs: Log probabilities of actions under the current policy

        Returns:
            alpha_loss: The α update loss
        """
        # Loss: α * (-log π - H_target)
        # If -log π > H_target (entropy above target): decrease α
        # If -log π < H_target (entropy below target): increase α
        alpha_loss = -(self.alpha * (log_probs + self.target_entropy).detach()).mean()

        self.optimizer.zero_grad()
        alpha_loss.backward()
        self.optimizer.step()

        return alpha_loss.item()
```

When a policy fails to learn: (1) Verify rewards are received and vary. (2) Check that actions are in the valid range. (3) Monitor entropy—it shouldn't collapse to zero. (4) Visualize value estimates—they should correlate with actual returns. (5) Try known-working hyperparameters from similar problems. (6) Simplify the environment to verify the algorithm works.
Policies are the central concept in reinforcement learning—they define agent behavior, and finding good policies is the goal of all RL algorithms. Let's consolidate the key insights:

- A policy maps states to actions (deterministic, $a = \mu(s)$) or to distributions over actions (stochastic, $a \sim \pi(\cdot|s)$).
- Stochastic policies enable exploration, gradient-based learning, and robustness; in fully observable MDPs an optimal deterministic policy always exists.
- Policies can be represented by tables, linear models, or neural networks; prefer the simplest class that can express the optimal behavior.
- Policy evaluation estimates $J(\pi)$ via Monte Carlo rollouts (unbiased, high variance) or temporal difference bootstrapping (lower variance, biased).
- The optimal policy maximizes value in every state and can be derived greedily from $Q^*$.
- On-policy methods learn from data generated by the current policy; off-policy methods can reuse data but may need importance-sampling corrections.
- Entropy regularization keeps policies stochastic during training and controls the exploration-exploitation trade-off.
Looking Ahead
With policies defined, we need a way to measure their quality—not just overall performance, but the value of individual states and actions. The next page explores value functions, which quantify expected future reward and form the foundation of most RL algorithms.
You now understand policies—the decision-making rules at the heart of reinforcement learning. From deterministic to stochastic, from tabular to neural, policies define what agents do. Next, we'll see how value functions help us evaluate and improve policies.