Given a state, what should an agent do? This question—seemingly simple—is the essence of reinforcement learning. The answer is formalized as a policy: a mapping from states to actions that defines the agent's behavior.
Policies are the central object of study in RL. We evaluate policies to understand their performance. We compare policies to identify better strategies. We improve policies through learning. And we search for the optimal policy—the strategy that maximizes expected cumulative reward.
Understanding policies deeply means understanding how agents decide, how decisions can be represented, learned, and optimized. This page develops that understanding from first principles.
By the end of this page, you'll understand deterministic and stochastic policies, their mathematical formulations, practical representations using function approximators, policy evaluation methods, the concept of optimal policies, and how policy decisions propagate through time to determine long-term outcomes.
Formally, a policy is a mapping from states to actions or distributions over actions. It completely specifies the agent's behavior—given any state, the policy determines what the agent does.
Deterministic Policy
A deterministic policy $\mu: \mathcal{S} \rightarrow \mathcal{A}$ maps each state to exactly one action:
$$a = \mu(s)$$
Given state $s$, the agent always takes action $\mu(s)$. There's no randomness in action selection.
Stochastic Policy
A stochastic policy $\pi: \mathcal{S} \times \mathcal{A} \rightarrow [0, 1]$ maps each state to a probability distribution over actions:
$$a \sim \pi(\cdot|s)$$
The notation $\pi(a|s)$ denotes the probability of taking action $a$ in state $s$. For all states, probabilities must sum to one:
$$\sum_{a \in \mathcal{A}} \pi(a|s) = 1 \quad \text{(discrete)}$$ $$\int_{\mathcal{A}} \pi(a|s) \, da = 1 \quad \text{(continuous)}$$
| Aspect | Deterministic a = μ(s) | Stochastic a ~ π(·|s) |
|---|---|---|
| Output | Single action | Probability distribution |
| Exploration | No inherent exploration | Built-in randomness enables exploration |
| Differentiability | argmax breaks gradients | Sampling enables gradient estimation |
| Optimality | Optimal policy can be deterministic* | Need stochasticity for exploration during learning |
| Game Theory | Can be exploited by adversary | Randomization prevents pure exploitation |
| Common Use | DDPG, TD3 (continuous control) | PPO, A2C, SAC, Policy Gradient |
Why Stochastic Policies?
If an optimal deterministic policy always exists (as it does in standard, fully observable MDPs), why use stochastic policies at all?
Exploration: During learning, we need to try different actions to discover which are best. Stochastic policies naturally explore.
Gradient-based learning: Policy gradient methods require differentiable action selection. Stochasticity enables the log-derivative trick.
Robustness: In adversarial settings (games, security), randomized strategies can be unexploitable when deterministic ones cannot.
Handling partial observability: When the state is aliased (same observation from different true states), stochastic policies can be optimal.
Regularization: Entropy bonuses encourage stochastic policies during training, preventing premature convergence to suboptimal deterministic policies.
*The optimal policy in an MDP with full observability can always be expressed as deterministic, but we often use stochastic policies during learning and derive the deterministic greedy policy afterward.
Stochastic policies elegantly solve the exploration problem during training. Higher entropy (more uniform) distributions explore more; lower entropy (more peaked) distributions exploit more. Many algorithms explicitly control entropy to manage this trade-off.
In practice, policies are represented by parameterized functions, typically neural networks. We denote a parameterized policy as $\pi_\theta$ where $\theta$ represents the learnable parameters.
Tabular Policies
For small, discrete state and action spaces, we can store the policy explicitly:
$$\pi(a|s) = \text{Table}[s, a]$$
This requires $|\mathcal{S}| \times |\mathcal{A}|$ parameters. Infeasible for large spaces, but optimal for small problems.
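As a concrete illustration, here is a minimal sketch of tabular policies, both deterministic and stochastic; the state/action counts and probabilities are invented for illustration.

```python
import numpy as np

n_states, n_actions = 3, 2

# Deterministic tabular policy: one stored action per state, a = μ(s)
mu = np.array([1, 0, 1])

# Stochastic tabular policy: an |S| × |A| table with π(a|s) = Table[s, a]
pi_table = np.array([[0.9, 0.1],
                     [0.5, 0.5],
                     [0.2, 0.8]])
assert np.allclose(pi_table.sum(axis=1), 1.0)   # each row is a distribution

rng = np.random.default_rng(0)
state = 2
action = rng.choice(n_actions, p=pi_table[state])   # a ~ π(·|s)
```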
Linear Policies
The simplest function approximation:
$$\pi(a|s) = \text{softmax}(\phi(s)^\top \mathbf{w}_a)$$
where $\phi(s)$ is a feature vector. Limited expressiveness but interpretable.
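Below is a minimal sketch of such a linear softmax policy; the feature map `phi` and weight matrix `W` are hypothetical stand-ins invented for illustration, not part of any particular library.

```python
import numpy as np

def phi(state: np.ndarray) -> np.ndarray:
    """Hypothetical feature map: the raw state plus a bias term."""
    return np.concatenate([state, [1.0]])

rng = np.random.default_rng(0)
n_actions, feat_dim = 3, 5
W = rng.normal(scale=0.1, size=(feat_dim, n_actions))   # one weight column per action

def linear_policy(state: np.ndarray) -> np.ndarray:
    """π(·|s) = softmax(φ(s)ᵀ W), returned as a probability vector."""
    logits = phi(state) @ W
    logits -= logits.max()                     # shift for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs

state = rng.normal(size=4)
probs = linear_policy(state)
action = rng.choice(n_actions, p=probs)        # a ~ π(·|s)
```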
Neural Network Policies
Deep policies can represent complex mappings:
$$\pi_\theta(a|s) = \text{NeuralNet}_\theta(s)$$
The network architecture depends on state and action types.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributions as D
from typing import Tuple, Optional
import numpy as np


class CategoricalPolicy(nn.Module):
    """
    Stochastic policy for discrete action spaces.

    Architecture: MLP → softmax over actions
    Distribution: Categorical
    Used by: A2C, PPO, REINFORCE with discrete actions
    """

    def __init__(self, state_dim: int, n_actions: int,
                 hidden_dims: Tuple[int, ...] = (64, 64)):
        super().__init__()

        layers = []
        prev_dim = state_dim
        for hidden_dim in hidden_dims:
            layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.ReLU(),
            ])
            prev_dim = hidden_dim
        layers.append(nn.Linear(prev_dim, n_actions))

        self.network = nn.Sequential(*layers)

    def forward(self, state: torch.Tensor) -> D.Categorical:
        """Return action distribution for given state."""
        logits = self.network(state)
        return D.Categorical(logits=logits)

    def act(self, state: torch.Tensor,
            deterministic: bool = False) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Sample action and compute log probability.

        Args:
            state: Current state [batch, state_dim]
            deterministic: If True, return mode (greedy action)

        Returns:
            action: Selected action [batch]
            log_prob: Log probability of action [batch]
        """
        dist = self.forward(state)

        if deterministic:
            action = dist.probs.argmax(dim=-1)
        else:
            action = dist.sample()

        log_prob = dist.log_prob(action)
        return action, log_prob

    def evaluate(self, state: torch.Tensor,
                 action: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Evaluate log probability and entropy for given state-action pair.

        Used in PPO and A2C for computing the ratio and entropy bonus.
        """
        dist = self.forward(state)
        log_prob = dist.log_prob(action)
        entropy = dist.entropy()
        return log_prob, entropy


class GaussianPolicy(nn.Module):
    """
    Stochastic policy for continuous action spaces.

    Architecture: MLP → (mean, log_std)
    Distribution: Independent Gaussian per action dimension
    Used by: PPO, A2C, TRPO with continuous actions
    """

    def __init__(self, state_dim: int, action_dim: int,
                 hidden_dims: Tuple[int, ...] = (64, 64),
                 log_std_init: float = 0.0,
                 log_std_min: float = -20.0,
                 log_std_max: float = 2.0):
        super().__init__()
        self.action_dim = action_dim
        self.log_std_min = log_std_min
        self.log_std_max = log_std_max

        # Shared feature extractor
        layers = []
        prev_dim = state_dim
        for hidden_dim in hidden_dims:
            layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.ReLU(),
            ])
            prev_dim = hidden_dim
        self.features = nn.Sequential(*layers)

        # Separate heads for mean and log_std
        self.mean_head = nn.Linear(prev_dim, action_dim)
        self.log_std_head = nn.Linear(prev_dim, action_dim)

        # Initialize log_std to desired starting value
        nn.init.constant_(self.log_std_head.bias, log_std_init)

    def forward(self, state: torch.Tensor) -> D.Normal:
        """Return action distribution for given state."""
        features = self.features(state)
        mean = self.mean_head(features)
        log_std = self.log_std_head(features)
        log_std = torch.clamp(log_std, self.log_std_min, self.log_std_max)
        std = torch.exp(log_std)
        return D.Normal(mean, std)

    def act(self, state: torch.Tensor,
            deterministic: bool = False) -> Tuple[torch.Tensor, torch.Tensor]:
        """Sample action and compute log probability."""
        dist = self.forward(state)

        if deterministic:
            action = dist.mean
        else:
            action = dist.rsample()  # Reparameterized sampling

        # Sum log probs across action dimensions
        log_prob = dist.log_prob(action).sum(dim=-1)
        return action, log_prob

    def evaluate(self, state: torch.Tensor,
                 action: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        """Evaluate log probability and entropy."""
        dist = self.forward(state)
        log_prob = dist.log_prob(action).sum(dim=-1)
        entropy = dist.entropy().sum(dim=-1)
        return log_prob, entropy


class SquashedGaussianPolicy(nn.Module):
    """
    Squashed Gaussian policy for bounded continuous actions.

    Actions are sampled from a Gaussian, then passed through tanh to
    bound them to [-1, 1]. The log probability is corrected for the
    change of variables.

    Used by: SAC (Soft Actor-Critic)

    Key insight: tanh squashing ensures actions stay in the valid range
    while maintaining differentiability for gradient-based optimization.
    """

    def __init__(self, state_dim: int, action_dim: int,
                 hidden_dims: Tuple[int, ...] = (256, 256),
                 log_std_min: float = -20.0,
                 log_std_max: float = 2.0):
        super().__init__()
        self.log_std_min = log_std_min
        self.log_std_max = log_std_max

        # Build MLP
        layers = []
        prev_dim = state_dim
        for hidden_dim in hidden_dims:
            layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.ReLU(),
            ])
            prev_dim = hidden_dim
        self.net = nn.Sequential(*layers)

        self.mean_linear = nn.Linear(prev_dim, action_dim)
        self.log_std_linear = nn.Linear(prev_dim, action_dim)

    def forward(self, state: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        """Return mean and log_std of Gaussian (before squashing)."""
        x = self.net(state)
        mean = self.mean_linear(x)
        log_std = self.log_std_linear(x)
        log_std = torch.clamp(log_std, self.log_std_min, self.log_std_max)
        return mean, log_std

    def act(self, state: torch.Tensor,
            deterministic: bool = False) -> Tuple[torch.Tensor, torch.Tensor]:
        """Sample squashed action and compute corrected log probability."""
        mean, log_std = self.forward(state)
        std = torch.exp(log_std)
        dist = D.Normal(mean, std)

        if deterministic:
            u = mean            # Pre-squash action
        else:
            u = dist.rsample()  # Reparameterized sample

        # Squash through tanh
        action = torch.tanh(u)

        # Correct log_prob for squashing (change of variables)
        # log π(a|s) = log p(u) - log |det(da/du)|
        #            = log p(u) - sum_i log(1 - tanh^2(u_i))
        log_prob = dist.log_prob(u).sum(dim=-1)
        log_prob -= torch.log(1 - action.pow(2) + 1e-6).sum(dim=-1)

        return action, log_prob


class DeterministicPolicy(nn.Module):
    """
    Deterministic policy for continuous actions.

    Used by: DDPG, TD3

    Since action selection is deterministic, exploration must be added
    externally (e.g., OU noise, Gaussian noise).
    """

    def __init__(self, state_dim: int, action_dim: int,
                 hidden_dims: Tuple[int, ...] = (256, 256),
                 action_bound: float = 1.0):
        super().__init__()
        self.action_bound = action_bound

        layers = []
        prev_dim = state_dim
        for hidden_dim in hidden_dims:
            layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.ReLU(),
            ])
            prev_dim = hidden_dim
        layers.append(nn.Linear(prev_dim, action_dim))

        self.network = nn.Sequential(*layers)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        """Return deterministic action."""
        action = self.network(state)
        return self.action_bound * torch.tanh(action)

    def act(self, state: torch.Tensor, noise_std: float = 0.0) -> torch.Tensor:
        """Return action, optionally with exploration noise."""
        action = self.forward(state)
        if noise_std > 0:
            noise = torch.randn_like(action) * noise_std
            action = action + noise
            action = torch.clamp(action, -self.action_bound, self.action_bound)
        return action
```

Policy evaluation answers: how good is a given policy? Specifically, what is the expected return when following policy $\pi$?
The Value of a Policy
The performance or value of policy $\pi$ is the expected return when starting from the initial state distribution and following $\pi$:
$$J(\pi) = \mathbb{E}_{\tau \sim \pi}[G_0] = \mathbb{E}_{s_0 \sim p_0,\, a_t \sim \pi} \left[ \sum_{t=0}^{\infty} \gamma^t r_{t+1} \right]$$
Comparing policies is simple: $\pi$ is better than $\pi'$ if $J(\pi) > J(\pi')$.
Monte Carlo Evaluation
The simplest evaluation method: run the policy many times and average returns.
$$\hat{J}(\pi) = \frac{1}{N} \sum_{i=1}^{N} G_0^{(i)}$$
This is an unbiased estimator but has high variance—some episodes may be much longer or luckier than others.
Temporal Difference Evaluation
Instead of waiting for episode end, update value estimates incrementally using the Bellman equation. This is lower variance but introduces bias from bootstrapping.
State Value Function
The state value function $V^\pi(s)$ gives the expected return starting from state $s$ and following policy $\pi$:
$$V^\pi(s) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^t r_{t+1} \,\middle|\, s_0 = s \right]$$
This satisfies the Bellman expectation equation:
$$V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V^\pi(s') \right]$$
In words: the value of a state is the expected immediate reward plus the discounted value of the next state, averaged over the policy's action distribution and environment dynamics.
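To make the backup concrete, here is a minimal sketch of iterative policy evaluation on a small, randomly generated tabular MDP. All arrays are invented for illustration, and the expected reward $R(s,a)$ is used in place of $R(s,a,s')$ for brevity.

```python
import numpy as np

n_states, n_actions, gamma = 5, 2, 0.9
rng = np.random.default_rng(0)

# Hypothetical MDP and policy, for illustration only
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s'] = P(s'|s,a)
R = rng.normal(size=(n_states, n_actions))                        # expected reward R(s, a)
pi = rng.dirichlet(np.ones(n_actions), size=n_states)             # π(a|s)

V = np.zeros(n_states)
for _ in range(1000):
    # Bellman expectation backup:
    # V^π(s) ← Σ_a π(a|s) [ R(s,a) + γ Σ_{s'} P(s'|s,a) V^π(s') ]
    lookahead = R + gamma * np.einsum('sap,p->sa', P, V)  # one-step lookahead per (s, a)
    V_new = (pi * lookahead).sum(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:                  # stop once the backup has converged
        break
    V = V_new
```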
Action Value Function
The action value function $Q^\pi(s, a)$ gives the expected return starting from state $s$, taking action $a$, then following policy $\pi$:
$$Q^\pi(s, a) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^t r_{t+1} \,\middle|\, s_0 = s, a_0 = a \right]$$
Relation to state value: $$V^\pi(s) = \sum_a \pi(a|s) Q^\pi(s, a) = \mathbb{E}_{a \sim \pi}[Q^\pi(s, a)]$$
V(s) is sufficient for policy improvement if you know the environment dynamics (model-based). Q(s,a) enables model-free policy improvement—you can choose the best action without knowing what states actions lead to. This is why Q-learning and actor-critic methods are so popular.
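A minimal sketch of that distinction, using hypothetical arrays standing in for learned or known quantities: improving from Q needs only an argmax per state, while improving from V needs the transition model for a one-step lookahead.

```python
import numpy as np

n_states, n_actions, gamma = 4, 2, 0.99
rng = np.random.default_rng(1)

# Invented stand-ins for illustration
Q = rng.normal(size=(n_states, n_actions))                         # a learned Q^π(s, a)
V = Q.mean(axis=1)                                                 # V^π under a uniform policy
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # model: P(s'|s,a)
R = rng.normal(size=(n_states, n_actions))                         # model: expected R(s,a)

# Model-free improvement: greedy in Q, no dynamics needed
greedy_from_q = Q.argmax(axis=1)

# Model-based improvement: need P and R to do a one-step lookahead on V
lookahead = R + gamma * np.einsum('sap,p->sa', P, V)   # E[r + γ V(s') | s, a]
greedy_from_v = lookahead.argmax(axis=1)
```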
```python
import numpy as np
from typing import List, Tuple, Callable
import torch
import torch.nn as nn


def monte_carlo_evaluation(
    env,
    policy: Callable,
    num_episodes: int = 100,
    gamma: float = 0.99,
    max_steps: int = 1000
) -> Tuple[float, float, List[float]]:
    """
    Evaluate policy using Monte Carlo rollouts.

    Args:
        env: Gymnasium-style environment
        policy: Function mapping state to action
        num_episodes: Number of episodes to sample
        gamma: Discount factor
        max_steps: Maximum steps per episode

    Returns:
        mean_return: Estimated J(π)
        std_return: Standard deviation
        all_returns: List of individual episode returns
    """
    all_returns = []

    for episode in range(num_episodes):
        state, _ = env.reset()
        episode_rewards = []

        for step in range(max_steps):
            action = policy(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            episode_rewards.append(reward)

            if terminated or truncated:
                break
            state = next_state

        # Compute discounted return
        G = 0.0
        for reward in reversed(episode_rewards):
            G = reward + gamma * G
        all_returns.append(G)

    mean_return = np.mean(all_returns)
    std_return = np.std(all_returns)
    return mean_return, std_return, all_returns


class TDPolicyEvaluation:
    """
    Temporal Difference policy evaluation (TD(0)).

    Updates the value estimate after each step using:
        V(s) ← V(s) + α [r + γV(s') - V(s)]

    Lower variance than Monte Carlo but biased due to bootstrapping.
    """

    def __init__(self, state_dim: int,
                 hidden_dims: Tuple[int, ...] = (64, 64),
                 learning_rate: float = 1e-3,
                 gamma: float = 0.99):
        self.gamma = gamma

        # Value function approximator
        layers = []
        prev_dim = state_dim
        for hidden_dim in hidden_dims:
            layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.ReLU(),
            ])
            prev_dim = hidden_dim
        layers.append(nn.Linear(prev_dim, 1))

        self.value_net = nn.Sequential(*layers)
        self.optimizer = torch.optim.Adam(
            self.value_net.parameters(), lr=learning_rate
        )

    def update(self, state: np.ndarray, reward: float,
               next_state: np.ndarray, done: bool) -> float:
        """
        Perform one TD(0) update.

        Returns:
            td_error: The temporal difference error (for monitoring)
        """
        state_t = torch.FloatTensor(state).unsqueeze(0)
        next_state_t = torch.FloatTensor(next_state).unsqueeze(0)

        # Current value estimate
        value = self.value_net(state_t)

        # TD target
        with torch.no_grad():
            if done:
                target = reward
            else:
                target = reward + self.gamma * self.value_net(next_state_t)

        # TD error
        td_error = target - value

        # Update value function
        loss = td_error.pow(2)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        return td_error.item()

    def estimate_value(self, state: np.ndarray) -> float:
        """Get current value estimate for a state."""
        state_t = torch.FloatTensor(state).unsqueeze(0)
        with torch.no_grad():
            return self.value_net(state_t).item()


def evaluate_policy_with_confidence(
    env,
    policy: Callable,
    num_episodes: int = 100,
    gamma: float = 0.99,
    confidence: float = 0.95
) -> dict:
    """
    Evaluate policy with a confidence interval.

    Returns dictionary with:
        - mean: Point estimate of J(π)
        - std: Standard deviation of returns
        - ci_lower, ci_upper: Confidence interval bounds
        - n_episodes: Number of episodes used
    """
    mean, std, returns = monte_carlo_evaluation(
        env, policy, num_episodes, gamma
    )

    # Confidence interval using t-distribution
    from scipy import stats
    t_value = stats.t.ppf((1 + confidence) / 2, num_episodes - 1)
    margin = t_value * std / np.sqrt(num_episodes)

    return {
        'mean': mean,
        'std': std,
        'ci_lower': mean - margin,
        'ci_upper': mean + margin,
        'confidence': confidence,
        'n_episodes': num_episodes
    }
```

The ultimate goal of RL is to find an optimal policy $\pi^*$—one that maximizes expected return from every state.
Definition of Optimality
A policy $\pi^*$ is optimal if:
$$V^{\pi^*}(s) \geq V^{\pi}(s) \quad \forall s \in \mathcal{S}, \forall \pi$$
The optimal policy achieves the highest value in every state simultaneously. Remarkably, such a policy always exists in MDPs.
Optimal Value Functions
The optimal state value function is: $$V^*(s) = \max_\pi V^\pi(s) = V^{\pi^*}(s)$$
The optimal action value function is: $$Q^*(s, a) = \max_\pi Q^\pi(s, a)$$
These satisfy the Bellman optimality equations:
$$V^*(s) = \max_a \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V^*(s') \right]$$
$$Q^*(s, a) = \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma \max_{a'} Q^*(s', a') \right]$$
Deriving Optimal Policy from Q*
Once we have $Q^*$, the optimal policy is trivially derived:
$$\pi^*(s) = \arg\max_a Q^*(s, a)$$
This is why Q-learning is so powerful: learn $Q^*$, then act greedily with respect to it.
Key Theorems
Theorem 1 (Existence): For any finite MDP, there exists an optimal policy that is deterministic and stationary (doesn't depend on time or history beyond current state).
Theorem 2 (Uniqueness of V*, Q*): The optimal value functions are unique, though optimal policies may not be.
Theorem 3 (Policy Improvement): If we improve a policy at any state (choose an action with higher Q-value), the overall policy improves. This enables iterative policy improvement algorithms.
The Bellman optimality equation defines V* as a fixed point of the Bellman optimality operator T*: V* = T*V*. Value iteration finds this fixed point by repeated application of T*. The contraction property of T* (under discount γ < 1) guarantees convergence to the unique fixed point V*.
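A minimal sketch of value iteration on a small, randomly generated tabular MDP (arrays invented for illustration, with expected rewards $R(s,a)$): repeatedly apply the Bellman optimality backup, then read the greedy policy off $Q^*$.

```python
import numpy as np

n_states, n_actions, gamma = 5, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s'] = P(s'|s,a)
R = rng.normal(size=(n_states, n_actions))                        # expected reward R(s, a)

V = np.zeros(n_states)
for _ in range(1000):
    # Bellman optimality backup: (T* V)(s) = max_a [ R(s,a) + γ Σ_{s'} P(s'|s,a) V(s') ]
    Q = R + gamma * np.einsum('sap,p->sa', P, V)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:   # contraction ⇒ convergence to the fixed point
        break
    V = V_new

pi_star = Q.argmax(axis=1)   # greedy policy: π*(s) = argmax_a Q*(s, a)
```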
Not all policy representations are equally expressive. Understanding the hierarchy of policy classes helps choose appropriate representations.
Memoryless vs. History-Dependent
Memoryless (Markovian) policies depend only on current state: $\pi(a|s)$. Sufficient for MDPs.
History-dependent policies depend on the entire trajectory: $\pi(a|s_0, a_0, \ldots, s_t)$. Necessary for POMDPs where the current observation doesn't capture the true state.
Stationary vs. Non-Stationary
Stationary policies are the same at all time steps: $\pi(a|s)$. Optimal for infinite-horizon discounted MDPs.
Non-stationary policies vary with time: $\pi_t(a|s)$. May be needed for finite-horizon problems or changing objectives.
Reactive vs. Deliberative
Reactive policies compute actions quickly from current state—feedforward networks.
Deliberative policies may perform internal computation (planning, search) before acting—more powerful but slower.
| Class | Form | When Needed | Computation |
|---|---|---|---|
| Tabular | π[s,a] lookup table | Small discrete spaces | O(1) lookup |
| Linear | softmax(φ(s)ᵀW) | Simple problems, interpretability | O(d) per action |
| MLP | Neural network on state | Standard continuous control | O(network size) |
| CNN + MLP | Conv layers + MLP | Image observations | O(image size × depth) |
| RNN/LSTM | Recurrent over history | Partial observability | O(hidden²) per step |
| Transformer | Attention over history | Complex dependencies | O(T² × d) for T steps |
| Planning-based | Search/MCTS at runtime | Games, complex reasoning | O(branching^depth) |
The Representation-Learning Trade-off
More expressive policy classes can represent more complex behaviors, but they typically require more data, are harder to optimize, and are more sensitive to hyperparameters.
Principle: Use the simplest policy class that can express the optimal behavior. If a linear policy suffices, don't use a deep network. If an MLP works, don't use an RNN.
Architecture Design Heuristics: match the architecture to the observation and memory requirements of the task—an MLP for low-dimensional state vectors, a CNN front end for image observations, an RNN/LSTM when the task is partially observable, and attention- or planning-based policies only when the problem's dependencies demand them.
Neural networks are universal function approximators—they can represent any continuous function. But this doesn't mean gradient descent will find the optimal policy. The optimization landscape, sample efficiency, generalization, and hyperparameter sensitivity all affect whether a policy can be learned in practice.
How do we find good policies? There are two major paradigms:
Value-Based Methods (Q-Learning family): learn a value function such as $Q^*$ and derive the policy implicitly by acting greedily with respect to it.
Policy-Based Methods (Policy Gradient family): parameterize the policy directly as $\pi_\theta$ and adjust $\theta$ by gradient ascent on $J(\pi_\theta)$.
The Policy Gradient Theorem
The key insight enabling direct policy optimization:
$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t \right]$$
In words: the gradient of expected return equals the expected sum of gradients of log-probabilities, weighted by returns. Actions that led to high returns have their probabilities increased; actions leading to low returns are decreased.
```python
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from typing import List, Tuple


class SimpleREINFORCE:
    """
    REINFORCE: The simplest policy gradient algorithm.

    Intuition: Run the policy, observe what happened, increase the
    probability of actions that led to high returns, decrease the
    probability of actions that led to low returns.

    This is essentially supervised learning where the "labels"
    (good actions) are discovered through trial and error.
    """

    def __init__(self, policy: nn.Module,
                 learning_rate: float = 1e-3,
                 gamma: float = 0.99):
        self.policy = policy
        self.gamma = gamma
        self.optimizer = optim.Adam(policy.parameters(), lr=learning_rate)

    def compute_returns(self, rewards: List[float]) -> torch.Tensor:
        """Compute discounted returns for each timestep."""
        returns = []
        G = 0.0
        for reward in reversed(rewards):
            G = reward + self.gamma * G
            returns.insert(0, G)

        returns = torch.tensor(returns, dtype=torch.float32)
        # Optionally normalize for stability
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)
        return returns

    def update(self, states: List[np.ndarray],
               actions: List[int],
               rewards: List[float]) -> float:
        """
        Perform one REINFORCE update.

        The loss is: -sum_t [log π(a_t|s_t) * G_t]

        Negative because we're doing gradient ascent on expected return,
        which is gradient descent on negative expected return.
        """
        states_t = torch.FloatTensor(np.array(states))
        actions_t = torch.LongTensor(actions)
        returns = self.compute_returns(rewards)

        # Get log probabilities of taken actions
        dist = self.policy(states_t)
        log_probs = dist.log_prob(actions_t)

        # Policy gradient loss: -E[log π(a|s) * G]
        # Intuition:
        #   - G > 0: action was good, increase log_prob (decrease loss)
        #   - G < 0: action was bad, decrease log_prob (increase loss)
        loss = -(log_probs * returns).mean()

        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        return loss.item()


class ActorCriticPreview:
    """
    Actor-Critic: Combines policy gradient with a value function.

    Key insight: Use a learned value function to reduce the variance
    of the policy gradient estimate.

    Instead of G_t (high variance), use the advantage A_t = G_t - V(s_t).
    This doesn't change the expected gradient but reduces variance.
    """

    def __init__(self,
                 actor: nn.Module,    # Policy network
                 critic: nn.Module,   # Value network
                 actor_lr: float = 3e-4,
                 critic_lr: float = 1e-3,
                 gamma: float = 0.99):
        self.actor = actor
        self.critic = critic
        self.gamma = gamma
        self.actor_optimizer = optim.Adam(actor.parameters(), lr=actor_lr)
        self.critic_optimizer = optim.Adam(critic.parameters(), lr=critic_lr)

    def update(self, states: torch.Tensor,
               actions: torch.Tensor,
               rewards: torch.Tensor,
               next_states: torch.Tensor,
               dones: torch.Tensor) -> Tuple[float, float]:
        """
        Perform an actor-critic update.

        Returns:
            actor_loss: Policy gradient loss
            critic_loss: Value function MSE loss
        """
        # ===== Critic Update =====
        # TD target: r + γ * V(s') for non-terminal transitions
        with torch.no_grad():
            next_values = self.critic(next_states).squeeze()
            td_targets = rewards + self.gamma * next_values * (1 - dones)

        current_values = self.critic(states).squeeze()
        critic_loss = nn.functional.mse_loss(current_values, td_targets)

        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        self.critic_optimizer.step()

        # ===== Actor Update =====
        # Advantage: how much better was the action than average?
        with torch.no_grad():
            advantages = td_targets - current_values

        # Policy gradient with advantage
        dist = self.actor(states)
        log_probs = dist.log_prob(actions)
        actor_loss = -(log_probs * advantages).mean()

        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        self.actor_optimizer.step()

        return actor_loss.item(), critic_loss.item()
```

A subtle but important distinction: the policy used to collect experience may differ from the policy being learned.
Behavior Policy ($\beta$ or $\mu$): The policy actually interacting with the environment, generating experience data.
Target Policy ($\pi$): The policy we're trying to learn or evaluate.
On-Policy Methods
Behavior = Target: We use the current policy $\pi_\theta$ for data collection and update $\pi_\theta$ based on that data.
Off-Policy Methods
Behavior ≠ Target: We collect data with one policy but learn about a different policy.
| Aspect | On-Policy | Off-Policy |
|---|---|---|
| Data generation | Current policy | Any policy (can differ) |
| Sample reuse | Cannot reuse old data | Experience replay possible |
| Sample efficiency | Lower (fresh data needed) | Higher (reuse data) |
| Stability | More stable | Can be less stable |
| Correction needed | No | Importance sampling (for PG) |
| Can learn from demos | No | Yes |
| Examples | A2C, PPO, TRPO | DQN, DDPG, SAC |
Importance Sampling Correction
When the behavior policy $\mu$ differs from target policy $\pi$, expectations must be corrected:
$$\mathbb{E}_{a \sim \pi}[f(a)] = \mathbb{E}_{a \sim \mu}\left[ \frac{\pi(a|s)}{\mu(a|s)} f(a) \right]$$
The ratio $\rho = \pi(a|s) / \mu(a|s)$ reweights samples. High ratio: action is more likely under $\pi$ than $\mu$. Low ratio: less likely.
Problem: Importance ratios can have high variance, especially when $\pi$ and $\mu$ differ significantly.
Solutions: clip or truncate the importance ratios (as PPO's clipped surrogate effectively does), keep the behavior policy close to the target policy, or use per-decision rather than full-trajectory ratios. The sketch below illustrates the basic correction and the variance it introduces.
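The following minimal sketch shows the correction for a single state; the target and behavior distributions and the per-action quantity `f` are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 4

pi = np.array([0.70, 0.10, 0.10, 0.10])   # target policy π(·|s)
mu = np.array([0.25, 0.25, 0.25, 0.25])   # behavior policy μ(·|s)
f = np.array([1.0, 0.0, 0.5, 2.0])        # some per-action quantity f(a)

true_value = (pi * f).sum()                # E_{a~π}[f(a)], what we want to estimate

# Sample actions from μ, then reweight each sample by ρ = π(a)/μ(a)
actions = rng.choice(n_actions, size=10_000, p=mu)
rho = pi[actions] / mu[actions]
is_estimate = (rho * f[actions]).mean()    # unbiased estimate of the target expectation

# The ratio variance grows as π and μ diverge, which is the practical problem
print(true_value, is_estimate, rho.var())
```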
Q-learning is off-policy without needing importance sampling. Why? Because the Bellman update Q(s,a) ← r + γ max Q(s',a') doesn't depend on which policy selected a. The action's value doesn't change based on how we chose it. This makes Q-learning remarkably sample-efficient but limits it to learning deterministic policies.
Entropy measures the randomness of a policy's action distribution:
$$H(\pi(\cdot|s)) = -\sum_a \pi(a|s) \log \pi(a|s) \quad \text{(discrete)}$$
$$H(\pi(\cdot|s)) = -\int \pi(a|s) \log \pi(a|s) \, da \quad \text{(continuous)}$$
High entropy: uniform distribution, lots of randomness. Low entropy: peaked distribution, nearly deterministic.
Why Entropy Matters
Exploration: High entropy policies explore more, visiting diverse states.
Regularization: Entropy bonus prevents premature convergence to suboptimal deterministic policies.
Robustness: Stochastic policies are harder for adversaries to exploit and more robust to model errors.
Maximum Entropy RL: SAC and related algorithms maximize reward AND entropy simultaneously, leading to robust, multi-modal policies.
Entropy-Regularized Objectives
The standard RL objective is: $$J(\pi) = \mathbb{E}_{\pi}\left[ \sum_t \gamma^t r_t \right]$$
The entropy-regularized (maximum entropy) objective adds an entropy bonus: $$J_{MaxEnt}(\pi) = \mathbb{E}_{\pi}\left[ \sum_t \gamma^t \left( r_t + \alpha H(\pi(\cdot|s_t)) \right) \right]$$
where $\alpha$ is the temperature parameter controlling the entropy-reward trade-off.
Effects: the policy remains stochastic throughout training, exploration persists, and multiple near-optimal modes of behavior can be retained rather than collapsing to a single action; as $\alpha \to 0$ the standard objective is recovered.
Automatic Entropy Tuning (SAC)
Instead of manually tuning $\alpha$, we can learn it by constraining entropy to a target:
$$\alpha^* = \arg\min_\alpha \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi(\cdot|s)}\left[ -\alpha \log \pi(a|s) - \alpha \bar{H} \right]$$
where $\bar{H}$ is the target entropy (e.g., $-\dim(\mathcal{A})$ for continuous actions).
```python
import torch
import torch.nn as nn
from torch.distributions import Categorical, Normal


def compute_entropy_discrete(logits: torch.Tensor) -> torch.Tensor:
    """
    Compute entropy of a categorical distribution.

    H = -sum_a p(a) log p(a)

    Using logits directly for numerical stability:
    H = log(sum(exp(logits))) - sum(p * logits)
    """
    probs = torch.softmax(logits, dim=-1)
    log_probs = torch.log_softmax(logits, dim=-1)
    entropy = -(probs * log_probs).sum(dim=-1)
    return entropy


def compute_entropy_gaussian(log_std: torch.Tensor) -> torch.Tensor:
    """
    Compute entropy of a diagonal Gaussian distribution.

    For N(μ, σ²): H = 0.5 * log(2πeσ²) = 0.5 * (1 + log(2π) + 2*log_std)
    """
    return 0.5 * (1 + torch.log(2 * torch.tensor(torch.pi)) + 2 * log_std).sum(dim=-1)


class EntropyRegularizedPolicyGradient:
    """
    Policy gradient with entropy regularization.

    Objective: max E[sum_t (r_t + α H(π(·|s_t)))]

    The entropy bonus encourages exploration and prevents the policy
    from becoming too deterministic too early.
    """

    def __init__(self, policy: nn.Module,
                 learning_rate: float = 3e-4,
                 gamma: float = 0.99,
                 entropy_coef: float = 0.01):
        self.policy = policy
        self.gamma = gamma
        self.entropy_coef = entropy_coef
        self.optimizer = torch.optim.Adam(policy.parameters(), lr=learning_rate)

    def compute_loss(self, states: torch.Tensor,
                     actions: torch.Tensor,
                     returns: torch.Tensor) -> tuple:
        """
        Compute policy gradient loss with entropy bonus.

        Loss = -E[log π(a|s) * return] - α * E[H(π)]
        """
        dist = self.policy(states)
        log_probs = dist.log_prob(actions)
        entropy = dist.entropy()

        # Policy gradient term
        pg_loss = -(log_probs * returns).mean()

        # Entropy bonus (negative because we want to maximize entropy)
        entropy_loss = -entropy.mean()

        total_loss = pg_loss + self.entropy_coef * entropy_loss
        return total_loss, pg_loss.item(), entropy.mean().item()


class AutomaticEntropyTuning:
    """
    Automatic entropy coefficient tuning (as in SAC).

    Instead of a fixed α, learn α to maintain a target entropy.

    Objective: min_α E[-α * log π(a|s) - α * H_target]

    This increases α if entropy is below target (encouraging exploration)
    and decreases α if entropy is above target.
    """

    def __init__(self, action_dim: int,
                 initial_alpha: float = 1.0,
                 learning_rate: float = 3e-4):
        # Target entropy: -dim(A) is a common choice for continuous actions
        self.target_entropy = -action_dim

        # Learn log(α) for numerical stability
        self.log_alpha = torch.tensor(
            [torch.log(torch.tensor(initial_alpha))], requires_grad=True
        )
        self.optimizer = torch.optim.Adam([self.log_alpha], lr=learning_rate)

    @property
    def alpha(self) -> torch.Tensor:
        return self.log_alpha.exp()

    def update(self, log_probs: torch.Tensor) -> float:
        """
        Update α based on current policy entropy.

        Args:
            log_probs: Log probabilities of actions under the current policy

        Returns:
            alpha_loss: The α update loss
        """
        # Loss: α * (-log π - H_target)
        # If -log π > H_target (entropy above target): decrease α
        # If -log π < H_target (entropy below target): increase α
        alpha_loss = -(self.alpha * (log_probs + self.target_entropy).detach()).mean()

        self.optimizer.zero_grad()
        alpha_loss.backward()
        self.optimizer.step()

        return alpha_loss.item()
```

When a policy fails to learn: (1) Verify rewards are received and vary. (2) Check that actions are in the valid range. (3) Monitor entropy—it shouldn't collapse to zero. (4) Visualize value estimates—they should correlate with actual returns. (5) Try known-working hyperparameters from similar problems. (6) Simplify the environment to verify the algorithm works.
Policies are the central concept in reinforcement learning—they define agent behavior, and finding good policies is the goal of all RL algorithms. Let's consolidate the key insights:

- A policy maps states to actions (deterministic, $a = \mu(s)$) or to distributions over actions (stochastic, $a \sim \pi(\cdot|s)$).
- Stochastic policies enable exploration, gradient-based learning, and robustness; in fully observable MDPs an optimal deterministic policy always exists.
- Policies can be represented by tables, linear models, or neural networks; prefer the simplest class that can express the optimal behavior.
- Policy evaluation estimates $J(\pi)$ via Monte Carlo rollouts (unbiased, high variance) or temporal difference bootstrapping (lower variance, biased).
- The optimal policy maximizes value in every state and can be derived greedily from $Q^*$.
- On-policy methods learn from data generated by the current policy; off-policy methods can reuse data but may need importance-sampling corrections.
- Entropy regularization keeps policies stochastic during training and controls the exploration-exploitation trade-off.
Looking Ahead
With policies defined, we need a way to measure their quality—not just overall performance, but the value of individual states and actions. The next page explores value functions, which quantify expected future reward and form the foundation of most RL algorithms.
You now understand policies—the decision-making rules at the heart of reinforcement learning. From deterministic to stochastic, from tabular to neural, policies define what agents do. Next, we'll see how value functions help us evaluate and improve policies.