While value-based methods like DQN and Rainbow transformed discrete-action reinforcement learning, a different approach was needed for problems with continuous action spaces. How do you control a robot arm with infinite possible joint angles? How do you navigate a car with continuous steering and throttle?
Policy gradient methods solve this by directly learning a policy—a mapping from states to actions—rather than learning value functions. Instead of computing Q(s, a) for every action and taking argmax, we learn a neural network that directly outputs actions (or distributions over actions).
Two algorithms dominate modern policy-based deep RL:
Proximal Policy Optimization (PPO): Developed by OpenAI in 2017, PPO achieves the stability of trust region methods without their computational complexity. It's simple to implement, robust to hyperparameters, and has become the default choice for many applications—from game-playing AI to robotic control to RLHF for language models.
Soft Actor-Critic (SAC): Developed by UC Berkeley in 2018, SAC combines off-policy learning with maximum entropy principles. It achieves exceptional sample efficiency and is particularly effective for continuous control tasks like robotics.
Together, PPO and SAC represent the state of the art in policy gradient methods, covering both on-policy (PPO) and off-policy (SAC) paradigms.
By the end of this page, you will understand the foundations of policy gradient methods, how PPO achieves stable training through clipped objectives, how SAC maximizes entropy for robust exploration, the trade-offs between on-policy and off-policy algorithms, and practical guidance on when to use each approach.
Before diving into PPO and SAC, we need to understand the policy gradient theorem—the mathematical foundation underlying all policy-based methods.
The Policy Optimization Objective
We want to find a policy π_θ that maximizes expected return:
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{\infty} \gamma^t r_t \right]$$
where τ represents a trajectory (state-action sequence) sampled under policy π_θ.
The Policy Gradient Theorem
Remarkably, we can compute the gradient of J(θ) without differentiating through the environment dynamics:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{\infty} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t \right]$$
where G_t is the return from time t. This is the foundational REINFORCE algorithm.
The Score Function Trick
The key insight is the identity: $$\nabla_\theta \pi_\theta(a|s) = \pi_\theta(a|s) \cdot \nabla_\theta \log \pi_\theta(a|s)$$
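The identity can be verified numerically with autograd. The sketch below uses an illustrative softmax policy over three actions and checks that the direct gradient of π(a) matches π(a) · ∇ log π(a):

```python
import torch

# Numerically check the score-function identity
# ∇θ π(a|s) = π(a|s) · ∇θ log π(a|s)
# for a small softmax policy (illustrative setup).
torch.manual_seed(0)
theta = torch.randn(3, requires_grad=True)

probs = torch.softmax(theta, dim=0)
a = 1  # an arbitrary action index

# Gradient of π(a) computed directly
grad_pi = torch.autograd.grad(probs[a], theta)[0]

# Gradient via the score function: π(a) · ∇ log π(a)
log_probs = torch.log_softmax(theta, dim=0)
grad_log_pi = torch.autograd.grad(log_probs[a], theta)[0]
grad_via_score = probs[a].detach() * grad_log_pi

assert torch.allclose(grad_pi, grad_via_score, atol=1e-6)
```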
This allows us to compute policy gradients using only sampled trajectories and gradients of the policy's own log-probabilities.
The environment dynamics p(s'|s,a) are typically unknown (model-free RL) or non-differentiable. The policy gradient theorem elegantly sidesteps this by treating trajectories as samples and using the score function to compute gradients only through the policy network itself.
Variance Reduction: Baselines and Advantages
Raw policy gradients have extremely high variance—the return G_t can vary wildly episode to episode. Two key techniques reduce variance:
Baseline Subtraction: Subtract a baseline b(s) from returns: $$\nabla_\theta J = \mathbb{E} \left[ \nabla_\theta \log \pi_\theta(a|s) \cdot (G_t - b(s_t)) \right]$$
This leaves the expected gradient unchanged (no bias is introduced) but reduces variance. A standard, near-optimal choice of baseline is V(s), the state value function.
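The effect is easy to see in a tiny bandit example (all numbers here are illustrative): estimate the gradient with and without a baseline and compare the sample mean and variance:

```python
import torch

torch.manual_seed(0)

# One-state bandit with a 2-action softmax policy; estimate the policy
# gradient with and without a baseline (illustrative setup).
theta = torch.zeros(2, requires_grad=True)
rewards = torch.tensor([10.0, 12.0])  # reward for each action

def grad_sample(baseline: float) -> torch.Tensor:
    probs = torch.softmax(theta, dim=0)
    a = torch.multinomial(probs, 1).item()
    logp = torch.log_softmax(theta, dim=0)[a]
    (g,) = torch.autograd.grad(logp, theta)
    return g * (rewards[a] - baseline)

no_base = torch.stack([grad_sample(0.0) for _ in range(5000)])
with_base = torch.stack([grad_sample(11.0) for _ in range(5000)])  # baseline ≈ V

# Same expected gradient (unbiased), far lower variance with the baseline
print(no_base.mean(dim=0), with_base.mean(dim=0))
print(no_base.var(dim=0).sum(), with_base.var(dim=0).sum())
```

With the baseline set near the mean reward, every sampled gradient points in nearly the same direction, which is exactly why the estimator's variance collapses.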
Advantage Function: When using V(s) as baseline, we get the advantage: $$A(s, a) = Q(s, a) - V(s)$$
The advantage tells us how much better action a is than the average action in state s. This is the foundation of actor-critic methods: an actor (the policy) selects actions, while a critic (the value function) estimates V(s) to compute advantages.
Both PPO and SAC are actor-critic algorithms, learning both a policy and a value function.
| Algorithm | On/Off-Policy | Key Innovation | Use Case |
|---|---|---|---|
| REINFORCE | On-policy | Basic policy gradient | Educational, simple tasks |
| A2C/A3C | On-policy | Actor-critic + parallelism | Moderate-scale training |
| TRPO | On-policy | Trust region constraint | High-stakes applications |
| PPO | On-policy | Clipped surrogate objective | General purpose, robust |
| DDPG | Off-policy | Deterministic policy gradient | Continuous control |
| TD3 | Off-policy | Twin critics + delayed updates | Improved DDPG |
| SAC | Off-policy | Maximum entropy + twin critics | Sample-efficient continuous |
PPO achieves the stability of trust region methods while being much simpler to implement. Its key insight: constrain policy updates implicitly through a clipped objective, eliminating the need for complex second-order optimization.
The Problem PPO Solves
Policy gradient methods face a fundamental tension: large policy updates can cause catastrophic performance collapse, while small updates make learning painfully slow.
TRPO (Trust Region Policy Optimization) solved this with explicit KL-divergence constraints, but required complex conjugate gradient optimization.
The PPO Solution: Clipped Surrogate Objective
PPO defines a ratio between the new and old policies: $$r(\theta) = \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)}$$
The clipped objective is: $$L^{\text{CLIP}}(\theta) = \mathbb{E} \left[ \min\left( r(\theta) A, \text{clip}(r(\theta), 1-\epsilon, 1+\epsilon) A \right) \right]$$
where ε (typically 0.2) controls how much the policy can change.
How Clipping Works: when the advantage is positive, clipping caps the ratio at 1+ε, so the policy gains nothing from pushing an action's probability beyond that bound; when the advantage is negative, the ratio is floored at 1−ε. Taking the min of the clipped and unclipped terms ensures we never get credit for updates that violate the trust region.
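A few hand-picked (ratio, advantage) pairs make the behavior concrete (illustrative values):

```python
import torch

# Evaluate the clipped surrogate on sample (ratio, advantage) pairs.
eps = 0.2
ratio = torch.tensor([0.5, 1.0, 1.5, 1.5])
adv = torch.tensor([1.0, 1.0, 1.0, -1.0])

surr1 = ratio * adv
surr2 = torch.clamp(ratio, 1 - eps, 1 + eps) * adv
l_clip = torch.min(surr1, surr2)

# Positive-advantage gains are capped at 1+eps (1.5 → 1.2),
# but negative-advantage penalties are NOT capped (-1.5 stays -1.5).
print(l_clip)
```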
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Normal, Categorical
import numpy as np


class PPOActorCritic(nn.Module):
    """
    Actor-Critic network for PPO.
    For continuous actions, outputs a Gaussian distribution.
    For discrete actions, outputs action probabilities.
    """
    def __init__(
        self,
        state_dim: int,
        action_dim: int,
        continuous: bool = True,
        hidden_dim: int = 256,
    ):
        super().__init__()
        self.continuous = continuous

        # Shared feature extractor (optional separate networks also work)
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
        )

        # Actor head
        if continuous:
            # Output mean; log_std is a separate parameter
            self.actor_mean = nn.Linear(hidden_dim, action_dim)
            self.actor_log_std = nn.Parameter(torch.zeros(action_dim))
        else:
            self.actor = nn.Linear(hidden_dim, action_dim)

        # Critic head: outputs V(s)
        self.critic = nn.Linear(hidden_dim, 1)

        # Initialize with smaller weights for final layers
        self._init_weights()

    def _init_weights(self):
        for module in [self.actor_mean if self.continuous else self.actor,
                       self.critic]:
            nn.init.orthogonal_(module.weight, gain=0.01)
            nn.init.zeros_(module.bias)

    def forward(self, state):
        features = self.shared(state)
        value = self.critic(features)
        if self.continuous:
            action_mean = self.actor_mean(features)
            action_std = self.actor_log_std.exp().expand_as(action_mean)
            return action_mean, action_std, value
        else:
            action_logits = self.actor(features)
            return action_logits, value

    def get_action(self, state, deterministic=False):
        """Sample action from policy."""
        with torch.no_grad():
            if self.continuous:
                mean, std, value = self.forward(state)
                if deterministic:
                    action = mean
                else:
                    dist = Normal(mean, std)
                    action = dist.sample()
                return action, value
            else:
                logits, value = self.forward(state)
                if deterministic:
                    action = logits.argmax(dim=-1)
                else:
                    dist = Categorical(logits=logits)
                    action = dist.sample()
                return action, value

    def evaluate_actions(self, states, actions):
        """Compute log probs and values for given state-action pairs."""
        if self.continuous:
            mean, std, value = self.forward(states)
            dist = Normal(mean, std)
            log_probs = dist.log_prob(actions).sum(dim=-1)
            entropy = dist.entropy().sum(dim=-1)
        else:
            logits, value = self.forward(states)
            dist = Categorical(logits=logits)
            log_probs = dist.log_prob(actions)
            entropy = dist.entropy()
        return log_probs, value.squeeze(-1), entropy


class PPOTrainer:
    """PPO training logic with clipped objective."""
    def __init__(
        self,
        actor_critic: PPOActorCritic,
        lr: float = 3e-4,
        gamma: float = 0.99,
        gae_lambda: float = 0.95,
        clip_epsilon: float = 0.2,
        value_coef: float = 0.5,
        entropy_coef: float = 0.01,
        max_grad_norm: float = 0.5,
        ppo_epochs: int = 10,
        minibatch_size: int = 64,
    ):
        self.actor_critic = actor_critic
        self.optimizer = torch.optim.Adam(actor_critic.parameters(), lr=lr)
        self.gamma = gamma
        self.gae_lambda = gae_lambda
        self.clip_epsilon = clip_epsilon
        self.value_coef = value_coef
        self.entropy_coef = entropy_coef
        self.max_grad_norm = max_grad_norm
        self.ppo_epochs = ppo_epochs
        self.minibatch_size = minibatch_size

    def compute_gae(self, rewards, values, dones, next_value):
        """
        Compute Generalized Advantage Estimation (GAE).
        GAE balances bias-variance through the lambda parameter.
        """
        advantages = []
        gae = 0
        for t in reversed(range(len(rewards))):
            if t == len(rewards) - 1:
                next_val = next_value
            else:
                next_val = values[t + 1]
            # TD error
            delta = rewards[t] + self.gamma * next_val * (1 - dones[t]) - values[t]
            # GAE accumulation
            gae = delta + self.gamma * self.gae_lambda * (1 - dones[t]) * gae
            advantages.insert(0, gae)
        advantages = torch.tensor(advantages)
        returns = advantages + torch.tensor(values)
        return advantages, returns

    def update(self, rollout_buffer):
        """Perform PPO update over multiple epochs."""
        # Unpack buffer
        states = torch.FloatTensor(rollout_buffer['states'])
        actions = torch.FloatTensor(rollout_buffer['actions'])
        old_log_probs = torch.FloatTensor(rollout_buffer['log_probs'])
        advantages = torch.FloatTensor(rollout_buffer['advantages'])
        returns = torch.FloatTensor(rollout_buffer['returns'])

        # Normalize advantages (important for stability!)
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

        # Multiple epochs over the same data
        for epoch in range(self.ppo_epochs):
            # Generate random minibatches
            indices = np.random.permutation(len(states))
            for start in range(0, len(states), self.minibatch_size):
                end = start + self.minibatch_size
                batch_indices = indices[start:end]

                batch_states = states[batch_indices]
                batch_actions = actions[batch_indices]
                batch_old_log_probs = old_log_probs[batch_indices]
                batch_advantages = advantages[batch_indices]
                batch_returns = returns[batch_indices]

                # Get current policy evaluation
                log_probs, values, entropy = self.actor_critic.evaluate_actions(
                    batch_states, batch_actions
                )

                # PPO CLIPPED OBJECTIVE
                # Compute probability ratio
                ratio = torch.exp(log_probs - batch_old_log_probs)

                # Clipped surrogate objective
                surr1 = ratio * batch_advantages
                surr2 = torch.clamp(
                    ratio, 1 - self.clip_epsilon, 1 + self.clip_epsilon
                ) * batch_advantages

                # Take the minimum (pessimistic bound)
                policy_loss = -torch.min(surr1, surr2).mean()

                # Value function loss
                value_loss = F.mse_loss(values, batch_returns)

                # Entropy bonus (encourages exploration)
                entropy_loss = -entropy.mean()

                # Combined loss
                loss = (
                    policy_loss
                    + self.value_coef * value_loss
                    + self.entropy_coef * entropy_loss
                )

                # Gradient update
                self.optimizer.zero_grad()
                loss.backward()
                nn.utils.clip_grad_norm_(
                    self.actor_critic.parameters(), self.max_grad_norm
                )
                self.optimizer.step()

        return {
            'policy_loss': policy_loss.item(),
            'value_loss': value_loss.item(),
            'entropy': -entropy_loss.item(),
        }
```

Three details are crucial for PPO performance: (1) normalize advantages within each minibatch for stable gradients; (2) use GAE (λ≈0.95) for smooth advantage estimation; (3) run multiple epochs (K≈10) over collected data for sample efficiency, but not so many that the policy drifts too far from the data-collecting policy.
PPO's simplicity is deceptive—achieving good performance requires attention to many details. This section covers the practices that separate high-performing PPO from mediocre implementations.
Generalized Advantage Estimation (GAE)
GAE provides a smooth trade-off between bias and variance in advantage estimation:
$$A_t^{\text{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}$$
where δ_t = r_t + γV(s_{t+1}) - V(s_t) is the TD error.
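In practice the infinite sum is computed with a backward recursion, gae = δ_t + γλ · gae, as in the trainer code above. A quick check with illustrative numbers confirms the recursion matches the direct sum:

```python
import numpy as np

# Check the backward GAE recursion against the direct sum
# Σ (γλ)^l δ_{t+l} on a short terminal episode (illustrative numbers).
gamma, lam = 0.99, 0.95
rewards = np.array([1.0, 0.5, 2.0, 0.0])
values = np.array([0.9, 0.7, 1.5, 0.2])
next_value = 0.0  # terminal state

deltas = rewards + gamma * np.append(values[1:], next_value) - values

# Backward recursion
adv_rec = np.zeros_like(deltas)
gae = 0.0
for t in reversed(range(len(deltas))):
    gae = deltas[t] + gamma * lam * gae
    adv_rec[t] = gae

# Direct (truncated) sum
adv_sum = np.array([
    sum((gamma * lam) ** l * deltas[t + l] for l in range(len(deltas) - t))
    for t in range(len(deltas))
])

assert np.allclose(adv_rec, adv_sum)
```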
Hyperparameter Guidelines
| Parameter | Typical Value | Notes |
|---|---|---|
| Clip epsilon (ε) | 0.1 - 0.3 | 0.2 is the default; smaller = more conservative |
| GAE λ | 0.95 - 0.99 | Controls advantage bias-variance |
| Discount γ | 0.99 | Standard; 0.995-0.999 for long-horizon tasks |
| PPO epochs (K) | 3 - 30 | More epochs = more sample efficient but risks overfitting |
| Minibatch size | 32 - 4096 | Larger batches stabilize training |
| Learning rate | 3e-4 | Adam default; often annealed |
| Entropy coefficient | 0.0 - 0.01 | Encourages exploration; task-dependent |
| Value coefficient | 0.5 - 1.0 | Relative weight of value loss |
| Max grad norm | 0.5 | Gradient clipping for stability |
Common PPO Implementation Issues
Not normalizing observations: Standardize inputs (subtract mean, divide by std) computed over collected data
Incorrect advantage normalization: Normalize advantages per minibatch, not globally
Too many epochs: If ratio r(θ) frequently hits clip boundaries, reduce epochs
Value function issues: Consider separate networks for policy and value, or larger value loss coefficient
Not using orthogonal initialization: Significantly impacts performance on some environments
Incorrect done handling: For truncated episodes, bootstrap value; for true termination, don't
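The termination-vs-truncation distinction above can be sketched in one line each (illustrative reward and value numbers):

```python
import torch

# Target computation for termination vs truncation (sketch):
# bootstrap from V(s') only when the episode was truncated (time limit),
# never when it truly terminated.
gamma = 0.99
reward = torch.tensor(1.0)
next_value = torch.tensor(5.0)  # critic's estimate V(s')

terminated_target = reward                      # true terminal: no bootstrap
truncated_target = reward + gamma * next_value  # time-limit cutoff: bootstrap

print(terminated_target.item(), truncated_target.item())
```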
Soft Actor-Critic takes a fundamentally different approach: it maximizes both expected return and policy entropy. This maximum entropy framework provides robust exploration and importantly, makes SAC an off-policy algorithm with excellent sample efficiency.
The Maximum Entropy Objective
SAC optimizes a modified objective:
$$J(\pi) = \sum_{t=0}^{\infty} \mathbb{E} \left[ r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot | s_t)) \right]$$
where H(π) = -E[log π(a|s)] is the entropy and α is the temperature parameter.
Why Entropy Matters: entropy regularization drives systematic exploration, prevents premature collapse to a deterministic policy, and keeps multiple near-optimal behaviors alive instead of committing to one too early.
Key Components of SAC: a stochastic squashed-Gaussian policy, twin Q-networks with a minimum-based target, soft (Polyak) target updates, and automatic tuning of the temperature α.
PPO is on-policy: it trains on data collected by the current policy, then discards it. SAC is off-policy: it stores data in a replay buffer and can train on experiences collected by any past policy. Off-policy algorithms typically have better sample efficiency but can be less stable. SAC addresses stability through maximum entropy, twin critics, and careful target updates.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Normal
import numpy as np
from copy import deepcopy


class GaussianPolicy(nn.Module):
    """
    SAC policy network outputting a squashed Gaussian distribution.
    Actions are sampled from a Gaussian, then squashed through tanh
    to bound them to [-1, 1].
    """
    LOG_STD_MIN = -20
    LOG_STD_MAX = 2

    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
        )
        # Separate heads for mean and log_std
        self.mean_head = nn.Linear(hidden_dim, action_dim)
        self.log_std_head = nn.Linear(hidden_dim, action_dim)

    def forward(self, state):
        features = self.net(state)
        mean = self.mean_head(features)
        log_std = self.log_std_head(features)
        log_std = torch.clamp(log_std, self.LOG_STD_MIN, self.LOG_STD_MAX)
        return mean, log_std

    def sample(self, state):
        """
        Sample action and compute log probability.
        Uses the reparameterization trick for differentiable sampling.
        Applies tanh squashing and corrects log_prob accordingly.
        """
        mean, log_std = self.forward(state)
        std = log_std.exp()

        # Reparameterization trick: sample from N(0,1), then transform
        normal = Normal(mean, std)
        x_t = normal.rsample()  # Differentiable sample

        # Squash through tanh to bound actions
        action = torch.tanh(x_t)

        # Compute log probability with squashing correction
        # log π(a|s) = log μ(u|s) - sum(log(1 - tanh²(u)))
        log_prob = normal.log_prob(x_t)
        # Squashing correction (Jacobian of the tanh transformation)
        log_prob -= torch.log(1 - action.pow(2) + 1e-6)
        log_prob = log_prob.sum(dim=-1, keepdim=True)

        return action, log_prob

    def get_action(self, state, deterministic=False):
        """Get action for environment interaction."""
        with torch.no_grad():
            mean, log_std = self.forward(state)
            if deterministic:
                return torch.tanh(mean)
            else:
                std = log_std.exp()
                normal = Normal(mean, std)
                action = torch.tanh(normal.sample())
                return action


class TwinQNetwork(nn.Module):
    """
    Twin Q-networks for SAC.
    Using two Q-networks and taking the minimum reduces
    overestimation bias compared to using a single network.
    """
    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 256):
        super().__init__()
        # Q1 network
        self.q1 = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )
        # Q2 network (same architecture, different initialization)
        self.q2 = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, state, action):
        x = torch.cat([state, action], dim=-1)
        return self.q1(x), self.q2(x)

    def q1_forward(self, state, action):
        x = torch.cat([state, action], dim=-1)
        return self.q1(x)


class SACAgent:
    """Soft Actor-Critic agent with automatic temperature tuning."""
    def __init__(
        self,
        state_dim: int,
        action_dim: int,
        device: torch.device,
        gamma: float = 0.99,
        tau: float = 0.005,
        lr: float = 3e-4,
        buffer_size: int = 1_000_000,
        batch_size: int = 256,
        learning_starts: int = 10000,
        target_entropy: float = None,
    ):
        self.device = device
        self.gamma = gamma
        self.tau = tau
        self.batch_size = batch_size
        self.learning_starts = learning_starts

        # Policy network
        self.policy = GaussianPolicy(state_dim, action_dim).to(device)

        # Twin Q-networks
        self.q_networks = TwinQNetwork(state_dim, action_dim).to(device)
        self.q_target = deepcopy(self.q_networks)
        # Freeze target networks
        for param in self.q_target.parameters():
            param.requires_grad = False

        # Automatic entropy tuning
        if target_entropy is None:
            self.target_entropy = -action_dim  # Heuristic from the paper
        else:
            self.target_entropy = target_entropy
        self.log_alpha = torch.zeros(1, requires_grad=True, device=device)
        self.alpha = self.log_alpha.exp()

        # Optimizers
        self.policy_optimizer = torch.optim.Adam(self.policy.parameters(), lr=lr)
        self.q_optimizer = torch.optim.Adam(self.q_networks.parameters(), lr=lr)
        self.alpha_optimizer = torch.optim.Adam([self.log_alpha], lr=lr)

        # Replay buffer
        self.replay_buffer = ReplayBuffer(buffer_size, state_dim, action_dim)
        self.total_steps = 0

    def select_action(self, state, deterministic=False):
        state = torch.FloatTensor(state).unsqueeze(0).to(self.device)
        action = self.policy.get_action(state, deterministic)
        return action.cpu().numpy()[0]

    def train_step(self):
        if len(self.replay_buffer) < self.learning_starts:
            return {}

        # Sample batch
        states, actions, rewards, next_states, dones = self.replay_buffer.sample(
            self.batch_size, self.device
        )

        # ========== Q-function update ==========
        with torch.no_grad():
            # Sample next actions and compute target Q
            next_actions, next_log_probs = self.policy.sample(next_states)
            # Take minimum of two Q targets (reduces overestimation)
            q1_next, q2_next = self.q_target(next_states, next_actions)
            q_next = torch.min(q1_next, q2_next)
            # Soft target: include entropy
            q_target = rewards + self.gamma * (1 - dones) * (
                q_next - self.alpha * next_log_probs
            )

        # Current Q estimates
        q1, q2 = self.q_networks(states, actions)

        # Q losses (MSE against target)
        q1_loss = F.mse_loss(q1, q_target)
        q2_loss = F.mse_loss(q2, q_target)
        q_loss = q1_loss + q2_loss

        self.q_optimizer.zero_grad()
        q_loss.backward()
        self.q_optimizer.step()

        # ========== Policy update ==========
        # Sample new actions for current states
        new_actions, log_probs = self.policy.sample(states)

        # Q-value for new actions (use Q1 only, or min of both)
        q1_new, q2_new = self.q_networks(states, new_actions)
        q_new = torch.min(q1_new, q2_new)

        # Policy loss: maximize Q - α * log_prob
        policy_loss = (self.alpha.detach() * log_probs - q_new).mean()

        self.policy_optimizer.zero_grad()
        policy_loss.backward()
        self.policy_optimizer.step()

        # ========== Temperature (alpha) update ==========
        alpha_loss = -(
            self.log_alpha * (log_probs + self.target_entropy).detach()
        ).mean()

        self.alpha_optimizer.zero_grad()
        alpha_loss.backward()
        self.alpha_optimizer.step()
        self.alpha = self.log_alpha.exp()

        # ========== Soft update of target networks ==========
        with torch.no_grad():
            for param, target_param in zip(
                self.q_networks.parameters(), self.q_target.parameters()
            ):
                target_param.data.mul_(1 - self.tau)
                target_param.data.add_(self.tau * param.data)

        self.total_steps += 1
        return {
            'q_loss': q_loss.item(),
            'policy_loss': policy_loss.item(),
            'alpha': self.alpha.item(),
            'entropy': -log_probs.mean().item(),
        }


class ReplayBuffer:
    """Simple replay buffer for SAC."""
    def __init__(self, capacity, state_dim, action_dim):
        self.capacity = capacity
        self.ptr = 0
        self.size = 0
        self.states = np.zeros((capacity, state_dim), dtype=np.float32)
        self.actions = np.zeros((capacity, action_dim), dtype=np.float32)
        self.rewards = np.zeros((capacity, 1), dtype=np.float32)
        self.next_states = np.zeros((capacity, state_dim), dtype=np.float32)
        self.dones = np.zeros((capacity, 1), dtype=np.float32)

    def push(self, state, action, reward, next_state, done):
        self.states[self.ptr] = state
        self.actions[self.ptr] = action
        self.rewards[self.ptr] = reward
        self.next_states[self.ptr] = next_state
        self.dones[self.ptr] = done
        self.ptr = (self.ptr + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size, device):
        indices = np.random.randint(0, self.size, size=batch_size)
        return (
            torch.FloatTensor(self.states[indices]).to(device),
            torch.FloatTensor(self.actions[indices]).to(device),
            torch.FloatTensor(self.rewards[indices]).to(device),
            torch.FloatTensor(self.next_states[indices]).to(device),
            torch.FloatTensor(self.dones[indices]).to(device),
        )

    def __len__(self):
        return self.size
```

SAC's success stems from several carefully designed components. Let's examine each in detail.
The Squashing Correction
SAC uses tanh to bound actions to [-1, 1]. This is crucial for physical systems where actuators have limits. But squashing affects the log probability calculation:
$$\log \pi(a|s) = \log \mu(u|s) - \sum_{i=1}^{D} \log(1 - \tanh^2(u_i))$$
where u is the pre-squashing sample and a = tanh(u) is the actual action.
Forgetting this correction is a common bug that causes poor performance.
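The manual correction can be cross-checked against PyTorch's own change-of-variables machinery. The sketch below (illustrative mean and std) compares it to `TransformedDistribution` with a `TanhTransform`; the small `1e-6` stabilizer in the manual version keeps the two within a tiny tolerance:

```python
import torch
from torch.distributions import Normal, TransformedDistribution
from torch.distributions.transforms import TanhTransform

torch.manual_seed(0)
mean = torch.tensor([0.3])
std = torch.tensor([0.8])

base = Normal(mean, std)
u = base.rsample()
a = torch.tanh(u)

# Manual: log π(a) = log μ(u) - log(1 - tanh²(u))
manual = base.log_prob(u) - torch.log(1 - a.pow(2) + 1e-6)

# Reference: PyTorch's transformed-distribution log_prob
squashed = TransformedDistribution(base, [TanhTransform(cache_size=1)])
reference = squashed.log_prob(a)

assert torch.allclose(manual, reference, atol=1e-3)
```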
Automatic Temperature Tuning
The temperature α balances reward maximization vs. entropy:
Instead of treating α as a hyperparameter, SAC learns it automatically by solving:
$$\min_\alpha \mathbb{E}_{a_t \sim \pi_t} \left[ -\alpha \log \pi_t(a_t|s_t) - \alpha \bar{\mathcal{H}} \right]$$
where H̄ is the target entropy (typically -dim(A), the negative action dimension).
This ensures the policy maintains a minimum level of entropy—exploring enough without over-exploring.
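The update direction is worth verifying mentally: when entropy falls below the target, the loss gradient pushes α up, demanding more exploration. A one-step sketch with illustrative numbers:

```python
import torch

# One gradient step of SAC's temperature loss (sketch, illustrative numbers):
# if the policy's entropy is below the target, alpha should increase.
target_entropy = -2.0  # e.g., action_dim = 2
log_alpha = torch.zeros(1, requires_grad=True)
opt = torch.optim.SGD([log_alpha], lr=0.1)

log_probs = torch.tensor([3.0])  # entropy ≈ -3.0 < target → too deterministic

alpha_loss = -(log_alpha * (log_probs + target_entropy).detach()).mean()
opt.zero_grad()
alpha_loss.backward()
opt.step()

assert log_alpha.item() > 0.0  # alpha grew, demanding more entropy
```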
Using two Q-networks and taking their minimum addresses overestimation bias (like Double DQN). This is especially important in actor-critic methods where the policy exploits any overestimated Q-values. Without twin critics, SAC tends to learn unstable, overoptimistic policies.
| Parameter | Typical Value | Notes |
|---|---|---|
| Learning rate | 3e-4 | Same for all networks |
| Discount γ | 0.99 | Standard |
| Soft update τ | 0.005 | Target network update rate |
| Batch size | 256 | Larger batches often work better |
| Buffer size | 10^6 | Large buffer for off-policy learning |
| Target entropy | -dim(A) | Automatic; rarely needs tuning |
| Hidden dimensions | 256 | Two hidden layers typically |
| Learning starts | 10,000 | Fill buffer before training |
SAC vs. DDPG/TD3
SAC builds on prior off-policy continuous control algorithms:
| Aspect | DDPG | TD3 | SAC |
|---|---|---|---|
| Policy | Deterministic | Deterministic | Stochastic |
| Exploration | Additive noise | Additive noise | Entropy in objective |
| Critics | One | Two | Two |
| Stability | Often unstable | Improved | Best |
| Sample efficiency | Good | Good | Best |
SAC's stochastic policy with entropy regularization provides more robust exploration than adding external noise to a deterministic policy. This makes SAC more reliable across tasks without extensive hyperparameter tuning.
Both PPO and SAC are excellent choices for different scenarios. Understanding their trade-offs helps you select the right tool.
Key Differences:
| Dimension | PPO | SAC |
|---|---|---|
| Data usage | On-policy (discards data) | Off-policy (reuses data) |
| Sample efficiency | Lower | Higher |
| Parallelization | Essential for speed | Less critical |
| Action spaces | Discrete and continuous | Continuous only |
| Stability | Very stable | Stable with tuning |
| Implementation | Simpler | More complex |
| Hyperparameter sensitivity | Moderate | Lower |
| Final performance | Good | Often better |
Practical Guidance
Starting a new project: Start with PPO. It's simpler and works reasonably on most tasks. If performance plateaus, consider SAC.
Robotics/continuous control: Use SAC. The sample efficiency is crucial when each environment step involves a real robot or expensive simulation.
Game playing: Use PPO. Games often have discrete actions and benefit from massive parallelization.
Research/exploration: Use both and compare. The best algorithm is task-dependent.
Production deployment: Consider SAC for its sample efficiency, but ensure thorough testing for stability.
Hybrid Approaches
Some recent work combines the benefits of both, for example by mixing on-policy policy gradient updates with off-policy replay data to gain stability and sample efficiency together.
In benchmark comparisons, SAC typically achieves ~20-40% higher final rewards on MuJoCo continuous control tasks while using ~5-10× fewer environment steps. However, PPO is more robust to hyperparameter choices and easier to scale to massive distributed training.
Both PPO and SAC have spawned numerous extensions and improvements. Understanding these advances provides insight into current research frontiers.
PPO Variants:
PPO-Penalty uses a KL penalty instead of clipping: $$L = \mathbb{E}[r(\theta) A] - \beta \cdot KL[\pi_{\theta_{old}} || \pi_\theta]$$
The coefficient β is adapted during training to target a specific KL divergence.
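The adaptation rule from the PPO paper doubles β when the measured KL overshoots the target and halves it when it undershoots (the 1.5 and 2 factors are the paper's heuristics; the function name is illustrative):

```python
def adapt_beta(beta: float, kl: float, kl_target: float) -> float:
    """Adaptive KL coefficient rule from the PPO paper (sketch)."""
    if kl > 1.5 * kl_target:
        return beta * 2.0      # policy moved too far: penalize harder
    elif kl < kl_target / 1.5:
        return beta / 2.0      # policy barely moved: relax the penalty
    return beta

assert adapt_beta(1.0, kl=0.05, kl_target=0.01) == 2.0
assert adapt_beta(1.0, kl=0.002, kl_target=0.01) == 0.5
assert adapt_beta(1.0, kl=0.01, kl_target=0.01) == 1.0
```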
IPPO (Independent PPO) trains separate PPO agents in multi-agent settings, surprisingly effective despite ignoring agent interactions.
MAPPO adapts PPO for multi-agent RL with centralized training and decentralized execution.
SAC Variants:
SAC-Discrete extends SAC to discrete action spaces using the Gumbel-Softmax trick.
SAC-AE combines SAC with autoencoders for learning from pixels.
DroQ adds dropout to Q-networks for higher replay ratios without overfitting.
| Algorithm | Base | Key Innovation | Use Case |
|---|---|---|---|
| MAPPO | PPO | Multi-agent with shared value | Cooperative games |
| IMPALA | PPO-style | V-trace correction, distributed | Large-scale training |
| GRPO | PPO | Group relative policy optimization | RLHF |
| TQC | SAC | Truncated quantile critics | Reduced overestimation |
| REDQ | SAC-style | Ensemble critics, high replay ratio | Maximum sample efficiency |
| DrQ | SAC | Image augmentation regularization | Visual control |
Scaling Deep RL
Both PPO and SAC can be scaled up:
Distributed PPO: Run hundreds of parallel environments, each collecting experience. Aggregate gradients across workers. Used by OpenAI Five, GPT-4 RLHF.
Distributed SAC: Separate actors (environment interaction) from learners (gradient computation). Apex-style architectures use prioritized replay across many actors.
Large-scale Results: distributed PPO powered OpenAI Five's Dota 2 victories and the RLHF pipelines behind modern language models, demonstrating that policy gradient methods scale to enormous compute budgets.
Open Challenges: sample efficiency still lags far behind supervised learning, exploration in sparse-reward settings remains difficult, and training stability at very large scale is an active research area.
We've covered the two most important policy gradient algorithms in modern deep RL: PPO's clipped objective for stable on-policy learning, and SAC's maximum entropy framework for sample-efficient off-policy control.
Module Conclusion
With this page, we've completed our deep dive into Deep Reinforcement Learning, spanning value-based methods (DQN and its extensions) and policy gradient methods (PPO and SAC).
This foundation covers the core algorithms powering today's most impressive RL systems—from game-playing AIs to robotic manipulation to language model alignment.
Where to Go From Here: natural next steps include model-based RL, offline RL, and multi-agent RL, each of which builds directly on the actor-critic foundations covered here.
Congratulations! You've mastered Deep Reinforcement Learning—from the DQN architecture that started the deep RL revolution to the PPO and SAC algorithms that define current practice. This knowledge equips you to understand, implement, and improve the RL systems transforming AI today.