While value-based methods like DQN and Rainbow transformed discrete-action reinforcement learning, a different approach was needed for problems with continuous action spaces. How do you control a robot arm with infinite possible joint angles? How do you navigate a car with continuous steering and throttle?
Policy gradient methods solve this by directly learning a policy—a mapping from states to actions—rather than learning value functions. Instead of computing Q(s, a) for every action and taking argmax, we learn a neural network that directly outputs actions (or distributions over actions).
Two algorithms dominate modern policy-based deep RL:
Proximal Policy Optimization (PPO): Developed by OpenAI in 2017, PPO achieves the stability of trust region methods without their computational complexity. It's simple to implement, robust to hyperparameters, and has become the default choice for many applications—from game-playing AI to robotic control to RLHF for language models.
Soft Actor-Critic (SAC): Developed by UC Berkeley in 2018, SAC combines off-policy learning with maximum entropy principles. It achieves exceptional sample efficiency and is particularly effective for continuous control tasks like robotics.
Together, PPO and SAC represent the state of the art in policy gradient methods, covering both on-policy (PPO) and off-policy (SAC) paradigms.
By the end of this page, you will understand the foundations of policy gradient methods, how PPO achieves stable training through clipped objectives, how SAC maximizes entropy for robust exploration, the trade-offs between on-policy and off-policy algorithms, and practical guidance on when to use each approach.
Before diving into PPO and SAC, we need to understand the policy gradient theorem—the mathematical foundation underlying all policy-based methods.
The Policy Optimization Objective
We want to find a policy π_θ that maximizes expected return:
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{\infty} \gamma^t r_t \right]$$
where τ represents a trajectory (state-action sequence) sampled under policy π_θ.
The Policy Gradient Theorem
Remarkably, we can compute the gradient of J(θ) without differentiating through the environment dynamics:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{\infty} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t \right]$$
where G_t is the return from time t. This is the foundational REINFORCE algorithm.
The Score Function Trick
The key insight is the identity: $$\nabla_\theta \pi_\theta(a|s) = \pi_\theta(a|s) \cdot \nabla_\theta \log \pi_\theta(a|s)$$
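The identity can be verified numerically with autograd. The sketch below uses an illustrative softmax policy over three actions and checks that the direct gradient of π(a) matches π(a) · ∇ log π(a):

```python
import torch

# Numerically check the score-function identity
# ∇θ π(a|s) = π(a|s) · ∇θ log π(a|s)
# for a small softmax policy (illustrative setup).
torch.manual_seed(0)
theta = torch.randn(3, requires_grad=True)

probs = torch.softmax(theta, dim=0)
a = 1  # an arbitrary action index

# Gradient of π(a) computed directly
grad_pi = torch.autograd.grad(probs[a], theta)[0]

# Gradient via the score function: π(a) · ∇ log π(a)
log_probs = torch.log_softmax(theta, dim=0)
grad_log_pi = torch.autograd.grad(log_probs[a], theta)[0]
grad_via_score = probs[a].detach() * grad_log_pi

assert torch.allclose(grad_pi, grad_via_score, atol=1e-6)
```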
This allows us to compute policy gradients using only sampled trajectories and gradients of the policy's own log-probabilities.
The environment dynamics p(s'|s,a) are typically unknown (model-free RL) or non-differentiable. The policy gradient theorem elegantly sidesteps this by treating trajectories as samples and using the score function to compute gradients only through the policy network itself.
Variance Reduction: Baselines and Advantages
Raw policy gradients have extremely high variance—the return G_t can vary wildly episode to episode. Two key techniques reduce variance:
Baseline Subtraction: Subtract a baseline b(s) from returns: $$\nabla_\theta J = \mathbb{E} \left[ \nabla_\theta \log \pi_\theta(a|s) \cdot (G_t - b(s_t)) \right]$$
This leaves the expected gradient unchanged (no bias is introduced) but reduces variance. A standard, near-optimal choice of baseline is V(s), the state value function.
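The effect is easy to see in a tiny bandit example (all numbers here are illustrative): estimate the gradient with and without a baseline and compare the sample mean and variance:

```python
import torch

torch.manual_seed(0)

# One-state bandit with a 2-action softmax policy; estimate the policy
# gradient with and without a baseline (illustrative setup).
theta = torch.zeros(2, requires_grad=True)
rewards = torch.tensor([10.0, 12.0])  # reward for each action

def grad_sample(baseline: float) -> torch.Tensor:
    probs = torch.softmax(theta, dim=0)
    a = torch.multinomial(probs, 1).item()
    logp = torch.log_softmax(theta, dim=0)[a]
    (g,) = torch.autograd.grad(logp, theta)
    return g * (rewards[a] - baseline)

no_base = torch.stack([grad_sample(0.0) for _ in range(5000)])
with_base = torch.stack([grad_sample(11.0) for _ in range(5000)])  # baseline ≈ V

# Same expected gradient (unbiased), far lower variance with the baseline
print(no_base.mean(dim=0), with_base.mean(dim=0))
print(no_base.var(dim=0).sum(), with_base.var(dim=0).sum())
```

With the baseline set near the mean reward, every sampled gradient points in nearly the same direction, which is exactly why the estimator's variance collapses.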
Advantage Function: When using V(s) as baseline, we get the advantage: $$A(s, a) = Q(s, a) - V(s)$$
The advantage tells us how much better action a is than the average action in state s. This is the foundation of actor-critic methods: an actor (the policy) selects actions, while a critic (the value function) estimates V(s) to compute advantages.
Both PPO and SAC are actor-critic algorithms, learning both a policy and a value function.
| Algorithm | On/Off-Policy | Key Innovation | Use Case |
|---|---|---|---|
| REINFORCE | On-policy | Basic policy gradient | Educational, simple tasks |
| A2C/A3C | On-policy | Actor-critic + parallelism | Moderate-scale training |
| TRPO | On-policy | Trust region constraint | High-stakes applications |
| PPO | On-policy | Clipped surrogate objective | General purpose, robust |
| DDPG | Off-policy | Deterministic policy gradient | Continuous control |
| TD3 | Off-policy | Twin critics + delayed updates | Improved DDPG |
| SAC | Off-policy | Maximum entropy + twin critics | Sample-efficient continuous |
PPO achieves the stability of trust region methods while being much simpler to implement. Its key insight: constrain policy updates implicitly through a clipped objective, eliminating the need for complex second-order optimization.
The Problem PPO Solves
Policy gradient methods face a fundamental tension: large policy updates can cause catastrophic performance collapse, while small updates make learning painfully slow.
TRPO (Trust Region Policy Optimization) solved this with explicit KL-divergence constraints, but required complex conjugate gradient optimization.
The PPO Solution: Clipped Surrogate Objective
PPO defines a ratio between the new and old policies: $$r(\theta) = \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)}$$
The clipped objective is: $$L^{\text{CLIP}}(\theta) = \mathbb{E} \left[ \min\left( r(\theta) A, \text{clip}(r(\theta), 1-\epsilon, 1+\epsilon) A \right) \right]$$
where ε (typically 0.2) controls how much the policy can change.
How Clipping Works: when the advantage is positive, clipping caps the ratio at 1+ε, so the policy gains nothing from pushing an action's probability beyond that bound; when the advantage is negative, the ratio is floored at 1−ε. Taking the min of the clipped and unclipped terms ensures we never get credit for updates that violate the trust region.
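A few hand-picked (ratio, advantage) pairs make the behavior concrete (illustrative values):

```python
import torch

# Evaluate the clipped surrogate on sample (ratio, advantage) pairs.
eps = 0.2
ratio = torch.tensor([0.5, 1.0, 1.5, 1.5])
adv = torch.tensor([1.0, 1.0, 1.0, -1.0])

surr1 = ratio * adv
surr2 = torch.clamp(ratio, 1 - eps, 1 + eps) * adv
l_clip = torch.min(surr1, surr2)

# Positive-advantage gains are capped at 1+eps (1.5 → 1.2),
# but negative-advantage penalties are NOT capped (-1.5 stays -1.5).
print(l_clip)
```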
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Normal, Categorical
import numpy as np


class PPOActorCritic(nn.Module):
    """
    Actor-Critic network for PPO.
    For continuous actions, outputs a Gaussian distribution.
    For discrete actions, outputs action probabilities.
    """
    def __init__(
        self,
        state_dim: int,
        action_dim: int,
        continuous: bool = True,
        hidden_dim: int = 256,
    ):
        super().__init__()
        self.continuous = continuous

        # Shared feature extractor (optional separate networks also work)
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
        )

        # Actor head
        if continuous:
            # Output mean; log_std is a separate parameter
            self.actor_mean = nn.Linear(hidden_dim, action_dim)
            self.actor_log_std = nn.Parameter(torch.zeros(action_dim))
        else:
            self.actor = nn.Linear(hidden_dim, action_dim)

        # Critic head: outputs V(s)
        self.critic = nn.Linear(hidden_dim, 1)

        # Initialize with smaller weights for final layers
        self._init_weights()

    def _init_weights(self):
        for module in [self.actor_mean if self.continuous else self.actor,
                       self.critic]:
            nn.init.orthogonal_(module.weight, gain=0.01)
            nn.init.zeros_(module.bias)

    def forward(self, state):
        features = self.shared(state)
        value = self.critic(features)
        if self.continuous:
            action_mean = self.actor_mean(features)
            action_std = self.actor_log_std.exp().expand_as(action_mean)
            return action_mean, action_std, value
        else:
            action_logits = self.actor(features)
            return action_logits, value

    def get_action(self, state, deterministic=False):
        """Sample action from policy."""
        with torch.no_grad():
            if self.continuous:
                mean, std, value = self.forward(state)
                if deterministic:
                    action = mean
                else:
                    dist = Normal(mean, std)
                    action = dist.sample()
                return action, value
            else:
                logits, value = self.forward(state)
                if deterministic:
                    action = logits.argmax(dim=-1)
                else:
                    dist = Categorical(logits=logits)
                    action = dist.sample()
                return action, value

    def evaluate_actions(self, states, actions):
        """Compute log probs and values for given state-action pairs."""
        if self.continuous:
            mean, std, value = self.forward(states)
            dist = Normal(mean, std)
            log_probs = dist.log_prob(actions).sum(dim=-1)
            entropy = dist.entropy().sum(dim=-1)
        else:
            logits, value = self.forward(states)
            dist = Categorical(logits=logits)
            log_probs = dist.log_prob(actions)
            entropy = dist.entropy()
        return log_probs, value.squeeze(-1), entropy


class PPOTrainer:
    """PPO training logic with clipped objective."""
    def __init__(
        self,
        actor_critic: PPOActorCritic,
        lr: float = 3e-4,
        gamma: float = 0.99,
        gae_lambda: float = 0.95,
        clip_epsilon: float = 0.2,
        value_coef: float = 0.5,
        entropy_coef: float = 0.01,
        max_grad_norm: float = 0.5,
        ppo_epochs: int = 10,
        minibatch_size: int = 64,
    ):
        self.actor_critic = actor_critic
        self.optimizer = torch.optim.Adam(actor_critic.parameters(), lr=lr)
        self.gamma = gamma
        self.gae_lambda = gae_lambda
        self.clip_epsilon = clip_epsilon
        self.value_coef = value_coef
        self.entropy_coef = entropy_coef
        self.max_grad_norm = max_grad_norm
        self.ppo_epochs = ppo_epochs
        self.minibatch_size = minibatch_size

    def compute_gae(self, rewards, values, dones, next_value):
        """
        Compute Generalized Advantage Estimation (GAE).
        GAE balances bias-variance through the lambda parameter.
        """
        advantages = []
        gae = 0
        for t in reversed(range(len(rewards))):
            if t == len(rewards) - 1:
                next_val = next_value
            else:
                next_val = values[t + 1]
            # TD error
            delta = rewards[t] + self.gamma * next_val * (1 - dones[t]) - values[t]
            # GAE accumulation
            gae = delta + self.gamma * self.gae_lambda * (1 - dones[t]) * gae
            advantages.insert(0, gae)
        advantages = torch.tensor(advantages)
        returns = advantages + torch.tensor(values)
        return advantages, returns

    def update(self, rollout_buffer):
        """Perform PPO update over multiple epochs."""
        # Unpack buffer
        states = torch.FloatTensor(rollout_buffer['states'])
        actions = torch.FloatTensor(rollout_buffer['actions'])
        old_log_probs = torch.FloatTensor(rollout_buffer['log_probs'])
        advantages = torch.FloatTensor(rollout_buffer['advantages'])
        returns = torch.FloatTensor(rollout_buffer['returns'])

        # Normalize advantages (important for stability!)
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

        # Multiple epochs over the same data
        for epoch in range(self.ppo_epochs):
            # Generate random minibatches
            indices = np.random.permutation(len(states))
            for start in range(0, len(states), self.minibatch_size):
                end = start + self.minibatch_size
                batch_indices = indices[start:end]

                batch_states = states[batch_indices]
                batch_actions = actions[batch_indices]
                batch_old_log_probs = old_log_probs[batch_indices]
                batch_advantages = advantages[batch_indices]
                batch_returns = returns[batch_indices]

                # Get current policy evaluation
                log_probs, values, entropy = self.actor_critic.evaluate_actions(
                    batch_states, batch_actions
                )

                # PPO CLIPPED OBJECTIVE
                # Compute probability ratio
                ratio = torch.exp(log_probs - batch_old_log_probs)

                # Clipped surrogate objective
                surr1 = ratio * batch_advantages
                surr2 = torch.clamp(
                    ratio, 1 - self.clip_epsilon, 1 + self.clip_epsilon
                ) * batch_advantages

                # Take the minimum (pessimistic bound)
                policy_loss = -torch.min(surr1, surr2).mean()

                # Value function loss
                value_loss = F.mse_loss(values, batch_returns)

                # Entropy bonus (encourages exploration)
                entropy_loss = -entropy.mean()

                # Combined loss
                loss = (
                    policy_loss
                    + self.value_coef * value_loss
                    + self.entropy_coef * entropy_loss
                )

                # Gradient update
                self.optimizer.zero_grad()
                loss.backward()
                nn.utils.clip_grad_norm_(
                    self.actor_critic.parameters(), self.max_grad_norm
                )
                self.optimizer.step()

        return {
            'policy_loss': policy_loss.item(),
            'value_loss': value_loss.item(),
            'entropy': -entropy_loss.item(),
        }
```

Three details are crucial for PPO performance: (1) normalize advantages within each minibatch for stable gradients; (2) use GAE (λ≈0.95) for smooth advantage estimation; (3) run multiple epochs (K≈10) over collected data for sample efficiency, but not so many that the policy drifts too far from the data-collecting policy.
PPO's simplicity is deceptive—achieving good performance requires attention to many details. This section covers the practices that separate high-performing PPO from mediocre implementations.
Generalized Advantage Estimation (GAE)
GAE provides a smooth trade-off between bias and variance in advantage estimation:
$$A_t^{\text{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}$$
where δ_t = r_t + γV(s_{t+1}) - V(s_t) is the TD error.
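In practice the infinite sum is computed with a backward recursion, gae = δ_t + γλ · gae, as in the trainer code above. A quick check with illustrative numbers confirms the recursion matches the direct sum:

```python
import numpy as np

# Check the backward GAE recursion against the direct sum
# Σ (γλ)^l δ_{t+l} on a short terminal episode (illustrative numbers).
gamma, lam = 0.99, 0.95
rewards = np.array([1.0, 0.5, 2.0, 0.0])
values = np.array([0.9, 0.7, 1.5, 0.2])
next_value = 0.0  # terminal state

deltas = rewards + gamma * np.append(values[1:], next_value) - values

# Backward recursion
adv_rec = np.zeros_like(deltas)
gae = 0.0
for t in reversed(range(len(deltas))):
    gae = deltas[t] + gamma * lam * gae
    adv_rec[t] = gae

# Direct (truncated) sum
adv_sum = np.array([
    sum((gamma * lam) ** l * deltas[t + l] for l in range(len(deltas) - t))
    for t in range(len(deltas))
])

assert np.allclose(adv_rec, adv_sum)
```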
Hyperparameter Guidelines
| Parameter | Typical Value | Notes |
|---|---|---|
| Clip epsilon (ε) | 0.1 - 0.3 | 0.2 is the default; smaller = more conservative |
| GAE λ | 0.95 - 0.99 | Controls advantage bias-variance |
| Discount γ | 0.99 | Standard; 0.995-0.999 for long-horizon tasks |
| PPO epochs (K) | 3 - 30 | More epochs = more sample efficient but risks overfitting |
| Minibatch size | 32 - 4096 | Larger batches stabilize training |
| Learning rate | 3e-4 | Adam default; often annealed |
| Entropy coefficient | 0.0 - 0.01 | Encourages exploration; task-dependent |
| Value coefficient | 0.5 - 1.0 | Relative weight of value loss |
| Max grad norm | 0.5 | Gradient clipping for stability |
Common PPO Implementation Issues
Not normalizing observations: Standardize inputs (subtract mean, divide by std) computed over collected data
Incorrect advantage normalization: Normalize advantages per minibatch, not globally
Too many epochs: If ratio r(θ) frequently hits clip boundaries, reduce epochs
Value function issues: Consider separate networks for policy and value, or larger value loss coefficient
Not using orthogonal initialization: Significantly impacts performance on some environments
Incorrect done handling: For truncated episodes, bootstrap value; for true termination, don't
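The termination-vs-truncation distinction above can be sketched in one line each (illustrative reward and value numbers):

```python
import torch

# Target computation for termination vs truncation (sketch):
# bootstrap from V(s') only when the episode was truncated (time limit),
# never when it truly terminated.
gamma = 0.99
reward = torch.tensor(1.0)
next_value = torch.tensor(5.0)  # critic's estimate V(s')

terminated_target = reward                      # true terminal: no bootstrap
truncated_target = reward + gamma * next_value  # time-limit cutoff: bootstrap

print(terminated_target.item(), truncated_target.item())
```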
Soft Actor-Critic takes a fundamentally different approach: it maximizes both expected return and policy entropy. This maximum entropy framework provides robust exploration and importantly, makes SAC an off-policy algorithm with excellent sample efficiency.
The Maximum Entropy Objective
SAC optimizes a modified objective:
$$J(\pi) = \sum_{t=0}^{\infty} \mathbb{E} \left[ r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot | s_t)) \right]$$
where H(π) = -E[log π(a|s)] is the entropy and α is the temperature parameter.
Why Entropy Matters: entropy regularization drives systematic exploration, prevents premature collapse to a deterministic policy, and keeps multiple near-optimal behaviors alive instead of committing to one too early.
Key Components of SAC: a stochastic squashed-Gaussian policy, twin Q-networks with a minimum-based target, soft (Polyak) target updates, and automatic tuning of the temperature α.
PPO is on-policy: it trains on data collected by the current policy, then discards it. SAC is off-policy: it stores data in a replay buffer and can train on experiences collected by any past policy. Off-policy algorithms typically have better sample efficiency but can be less stable. SAC addresses stability through maximum entropy, twin critics, and careful target updates.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Normal
import numpy as np
from copy import deepcopy


class GaussianPolicy(nn.Module):
    """
    SAC policy network outputting a squashed Gaussian distribution.
    Actions are sampled from a Gaussian, then squashed through tanh
    to bound them to [-1, 1].
    """
    LOG_STD_MIN = -20
    LOG_STD_MAX = 2

    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
        )
        # Separate heads for mean and log_std
        self.mean_head = nn.Linear(hidden_dim, action_dim)
        self.log_std_head = nn.Linear(hidden_dim, action_dim)

    def forward(self, state):
        features = self.net(state)
        mean = self.mean_head(features)
        log_std = self.log_std_head(features)
        log_std = torch.clamp(log_std, self.LOG_STD_MIN, self.LOG_STD_MAX)
        return mean, log_std

    def sample(self, state):
        """
        Sample action and compute log probability.
        Uses the reparameterization trick for differentiable sampling.
        Applies tanh squashing and corrects log_prob accordingly.
        """
        mean, log_std = self.forward(state)
        std = log_std.exp()

        # Reparameterization trick: sample from N(0,1), then transform
        normal = Normal(mean, std)
        x_t = normal.rsample()  # Differentiable sample

        # Squash through tanh to bound actions
        action = torch.tanh(x_t)

        # Compute log probability with squashing correction
        # log π(a|s) = log μ(u|s) - sum(log(1 - tanh²(u)))
        log_prob = normal.log_prob(x_t)
        # Squashing correction (Jacobian of the tanh transformation)
        log_prob -= torch.log(1 - action.pow(2) + 1e-6)
        log_prob = log_prob.sum(dim=-1, keepdim=True)

        return action, log_prob

    def get_action(self, state, deterministic=False):
        """Get action for environment interaction."""
        with torch.no_grad():
            mean, log_std = self.forward(state)
            if deterministic:
                return torch.tanh(mean)
            else:
                std = log_std.exp()
                normal = Normal(mean, std)
                action = torch.tanh(normal.sample())
                return action


class TwinQNetwork(nn.Module):
    """
    Twin Q-networks for SAC.
    Using two Q-networks and taking the minimum reduces
    overestimation bias compared to using a single network.
    """
    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 256):
        super().__init__()
        # Q1 network
        self.q1 = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )
        # Q2 network (same architecture, different initialization)
        self.q2 = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, state, action):
        x = torch.cat([state, action], dim=-1)
        return self.q1(x), self.q2(x)

    def q1_forward(self, state, action):
        x = torch.cat([state, action], dim=-1)
        return self.q1(x)


class SACAgent:
    """Soft Actor-Critic agent with automatic temperature tuning."""
    def __init__(
        self,
        state_dim: int,
        action_dim: int,
        device: torch.device,
        gamma: float = 0.99,
        tau: float = 0.005,
        lr: float = 3e-4,
        buffer_size: int = 1_000_000,
        batch_size: int = 256,
        learning_starts: int = 10000,
        target_entropy: float = None,
    ):
        self.device = device
        self.gamma = gamma
        self.tau = tau
        self.batch_size = batch_size
        self.learning_starts = learning_starts

        # Policy network
        self.policy = GaussianPolicy(state_dim, action_dim).to(device)

        # Twin Q-networks
        self.q_networks = TwinQNetwork(state_dim, action_dim).to(device)
        self.q_target = deepcopy(self.q_networks)
        # Freeze target networks
        for param in self.q_target.parameters():
            param.requires_grad = False

        # Automatic entropy tuning
        if target_entropy is None:
            self.target_entropy = -action_dim  # Heuristic from the paper
        else:
            self.target_entropy = target_entropy
        self.log_alpha = torch.zeros(1, requires_grad=True, device=device)
        self.alpha = self.log_alpha.exp()

        # Optimizers
        self.policy_optimizer = torch.optim.Adam(self.policy.parameters(), lr=lr)
        self.q_optimizer = torch.optim.Adam(self.q_networks.parameters(), lr=lr)
        self.alpha_optimizer = torch.optim.Adam([self.log_alpha], lr=lr)

        # Replay buffer
        self.replay_buffer = ReplayBuffer(buffer_size, state_dim, action_dim)
        self.total_steps = 0

    def select_action(self, state, deterministic=False):
        state = torch.FloatTensor(state).unsqueeze(0).to(self.device)
        action = self.policy.get_action(state, deterministic)
        return action.cpu().numpy()[0]

    def train_step(self):
        if len(self.replay_buffer) < self.learning_starts:
            return {}

        # Sample batch
        states, actions, rewards, next_states, dones = self.replay_buffer.sample(
            self.batch_size, self.device
        )

        # ========== Q-function update ==========
        with torch.no_grad():
            # Sample next actions and compute target Q
            next_actions, next_log_probs = self.policy.sample(next_states)
            # Take minimum of two Q targets (reduces overestimation)
            q1_next, q2_next = self.q_target(next_states, next_actions)
            q_next = torch.min(q1_next, q2_next)
            # Soft target: include entropy
            q_target = rewards + self.gamma * (1 - dones) * (
                q_next - self.alpha * next_log_probs
            )

        # Current Q estimates
        q1, q2 = self.q_networks(states, actions)

        # Q losses (MSE against target)
        q1_loss = F.mse_loss(q1, q_target)
        q2_loss = F.mse_loss(q2, q_target)
        q_loss = q1_loss + q2_loss

        self.q_optimizer.zero_grad()
        q_loss.backward()
        self.q_optimizer.step()

        # ========== Policy update ==========
        # Sample new actions for current states
        new_actions, log_probs = self.policy.sample(states)

        # Q-value for new actions (use Q1 only, or min of both)
        q1_new, q2_new = self.q_networks(states, new_actions)
        q_new = torch.min(q1_new, q2_new)

        # Policy loss: maximize Q - α * log_prob
        policy_loss = (self.alpha.detach() * log_probs - q_new).mean()

        self.policy_optimizer.zero_grad()
        policy_loss.backward()
        self.policy_optimizer.step()

        # ========== Temperature (alpha) update ==========
        alpha_loss = -(
            self.log_alpha * (log_probs + self.target_entropy).detach()
        ).mean()

        self.alpha_optimizer.zero_grad()
        alpha_loss.backward()
        self.alpha_optimizer.step()
        self.alpha = self.log_alpha.exp()

        # ========== Soft update of target networks ==========
        with torch.no_grad():
            for param, target_param in zip(
                self.q_networks.parameters(), self.q_target.parameters()
            ):
                target_param.data.mul_(1 - self.tau)
                target_param.data.add_(self.tau * param.data)

        self.total_steps += 1
        return {
            'q_loss': q_loss.item(),
            'policy_loss': policy_loss.item(),
            'alpha': self.alpha.item(),
            'entropy': -log_probs.mean().item(),
        }


class ReplayBuffer:
    """Simple replay buffer for SAC."""
    def __init__(self, capacity, state_dim, action_dim):
        self.capacity = capacity
        self.ptr = 0
        self.size = 0
        self.states = np.zeros((capacity, state_dim), dtype=np.float32)
        self.actions = np.zeros((capacity, action_dim), dtype=np.float32)
        self.rewards = np.zeros((capacity, 1), dtype=np.float32)
        self.next_states = np.zeros((capacity, state_dim), dtype=np.float32)
        self.dones = np.zeros((capacity, 1), dtype=np.float32)

    def push(self, state, action, reward, next_state, done):
        self.states[self.ptr] = state
        self.actions[self.ptr] = action
        self.rewards[self.ptr] = reward
        self.next_states[self.ptr] = next_state
        self.dones[self.ptr] = done
        self.ptr = (self.ptr + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size, device):
        indices = np.random.randint(0, self.size, size=batch_size)
        return (
            torch.FloatTensor(self.states[indices]).to(device),
            torch.FloatTensor(self.actions[indices]).to(device),
            torch.FloatTensor(self.rewards[indices]).to(device),
            torch.FloatTensor(self.next_states[indices]).to(device),
            torch.FloatTensor(self.dones[indices]).to(device),
        )

    def __len__(self):
        return self.size
```

SAC's success stems from several carefully designed components. Let's examine each in detail.
The Squashing Correction
SAC uses tanh to bound actions to [-1, 1]. This is crucial for physical systems where actuators have limits. But squashing affects the log probability calculation:
$$\log \pi(a|s) = \log \mu(u|s) - \sum_{i=1}^{D} \log(1 - \tanh^2(u_i))$$
where u is the pre-squashing sample and a = tanh(u) is the actual action.
Forgetting this correction is a common bug that causes poor performance.
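The manual correction can be cross-checked against PyTorch's own change-of-variables machinery. The sketch below (illustrative mean and std) compares it to `TransformedDistribution` with a `TanhTransform`; the small `1e-6` stabilizer in the manual version keeps the two within a tiny tolerance:

```python
import torch
from torch.distributions import Normal, TransformedDistribution
from torch.distributions.transforms import TanhTransform

torch.manual_seed(0)
mean = torch.tensor([0.3])
std = torch.tensor([0.8])

base = Normal(mean, std)
u = base.rsample()
a = torch.tanh(u)

# Manual: log π(a) = log μ(u) - log(1 - tanh²(u))
manual = base.log_prob(u) - torch.log(1 - a.pow(2) + 1e-6)

# Reference: PyTorch's transformed-distribution log_prob
squashed = TransformedDistribution(base, [TanhTransform(cache_size=1)])
reference = squashed.log_prob(a)

assert torch.allclose(manual, reference, atol=1e-3)
```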
Automatic Temperature Tuning
The temperature α balances reward maximization vs. entropy:
Instead of treating α as a hyperparameter, SAC learns it automatically by solving:
$$\min_\alpha \mathbb{E}_{a_t \sim \pi_t} \left[ -\alpha \log \pi_t(a_t|s_t) - \alpha \bar{\mathcal{H}} \right]$$
where H̄ is the target entropy (typically -dim(A), the negative action dimension).
This ensures the policy maintains a minimum level of entropy—exploring enough without over-exploring.
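The update direction is worth verifying mentally: when entropy falls below the target, the loss gradient pushes α up, demanding more exploration. A one-step sketch with illustrative numbers:

```python
import torch

# One gradient step of SAC's temperature loss (sketch, illustrative numbers):
# if the policy's entropy is below the target, alpha should increase.
target_entropy = -2.0  # e.g., action_dim = 2
log_alpha = torch.zeros(1, requires_grad=True)
opt = torch.optim.SGD([log_alpha], lr=0.1)

log_probs = torch.tensor([3.0])  # entropy ≈ -3.0 < target → too deterministic

alpha_loss = -(log_alpha * (log_probs + target_entropy).detach()).mean()
opt.zero_grad()
alpha_loss.backward()
opt.step()

assert log_alpha.item() > 0.0  # alpha grew, demanding more entropy
```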
Using two Q-networks and taking their minimum addresses overestimation bias (like Double DQN). This is especially important in actor-critic methods where the policy exploits any overestimated Q-values. Without twin critics, SAC tends to learn unstable, overoptimistic policies.
| Parameter | Typical Value | Notes |
|---|---|---|
| Learning rate | 3e-4 | Same for all networks |
| Discount γ | 0.99 | Standard |
| Soft update τ | 0.005 | Target network update rate |
| Batch size | 256 | Larger batches often work better |
| Buffer size | 10^6 | Large buffer for off-policy learning |
| Target entropy | -dim(A) | Automatic; rarely needs tuning |
| Hidden dimensions | 256 | Two hidden layers typically |
| Learning starts | 10,000 | Fill buffer before training |
SAC vs. DDPG/TD3
SAC builds on prior off-policy continuous control algorithms:
| Aspect | DDPG | TD3 | SAC |
|---|---|---|---|
| Policy | Deterministic | Deterministic | Stochastic |
| Exploration | Additive noise | Additive noise | Entropy in objective |
| Critics | One | Two | Two |
| Stability | Often unstable | Improved | Best |
| Sample efficiency | Good | Good | Best |
SAC's stochastic policy with entropy regularization provides more robust exploration than adding external noise to a deterministic policy. This makes SAC more reliable across tasks without extensive hyperparameter tuning.
Both PPO and SAC are excellent choices for different scenarios. Understanding their trade-offs helps you select the right tool.
Key Differences:
| Dimension | PPO | SAC |
|---|---|---|
| Data usage | On-policy (discards data) | Off-policy (reuses data) |
| Sample efficiency | Lower | Higher |
| Parallelization | Essential for speed | Less critical |
| Action spaces | Discrete and continuous | Continuous only |
| Stability | Very stable | Stable with tuning |
| Implementation | Simpler | More complex |
| Hyperparameter sensitivity | Moderate | Lower |
| Final performance | Good | Often better |
Practical Guidance
Starting a new project: Start with PPO. It's simpler and works reasonably on most tasks. If performance plateaus, consider SAC.
Robotics/continuous control: Use SAC. The sample efficiency is crucial when each environment step involves a real robot or expensive simulation.
Game playing: Use PPO. Games often have discrete actions and benefit from massive parallelization.
Research/exploration: Use both and compare. The best algorithm is task-dependent.
Production deployment: Consider SAC for its sample efficiency, but ensure thorough testing for stability.
Hybrid Approaches
Some recent work combines the benefits of both, for example by mixing on-policy policy gradient updates with off-policy replay data to gain stability and sample efficiency together.
In benchmark comparisons, SAC typically achieves ~20-40% higher final rewards on MuJoCo continuous control tasks while using ~5-10× fewer environment steps. However, PPO is more robust to hyperparameter choices and easier to scale to massive distributed training.
Both PPO and SAC have spawned numerous extensions and improvements. Understanding these advances provides insight into current research frontiers.
PPO Variants:
PPO-Penalty uses a KL penalty instead of clipping: $$L = \mathbb{E}[r(\theta) A] - \beta \cdot KL[\pi_{\theta_{old}} || \pi_\theta]$$
The coefficient β is adapted during training to target a specific KL divergence.
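The adaptation rule from the PPO paper doubles β when the measured KL overshoots the target and halves it when it undershoots (the 1.5 and 2 factors are the paper's heuristics; the function name is illustrative):

```python
def adapt_beta(beta: float, kl: float, kl_target: float) -> float:
    """Adaptive KL coefficient rule from the PPO paper (sketch)."""
    if kl > 1.5 * kl_target:
        return beta * 2.0      # policy moved too far: penalize harder
    elif kl < kl_target / 1.5:
        return beta / 2.0      # policy barely moved: relax the penalty
    return beta

assert adapt_beta(1.0, kl=0.05, kl_target=0.01) == 2.0
assert adapt_beta(1.0, kl=0.002, kl_target=0.01) == 0.5
assert adapt_beta(1.0, kl=0.01, kl_target=0.01) == 1.0
```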
IPPO (Independent PPO) trains separate PPO agents in multi-agent settings, surprisingly effective despite ignoring agent interactions.
MAPPO adapts PPO for multi-agent RL with centralized training and decentralized execution.
SAC Variants:
SAC-Discrete extends SAC to discrete action spaces using the Gumbel-Softmax trick.
SAC-AE combines SAC with autoencoders for learning from pixels.
DroQ adds dropout to Q-networks for higher replay ratios without overfitting.
| Algorithm | Base | Key Innovation | Use Case |
|---|---|---|---|
| MAPPO | PPO | Multi-agent with shared value | Cooperative games |
| IMPALA | PPO-style | V-trace correction, distributed | Large-scale training |
| GRPO | PPO | Group relative policy optimization | RLHF |
| TQC | SAC | Truncated quantile critics | Reduced overestimation |
| REDQ | SAC-style | Ensemble critics, high replay ratio | Maximum sample efficiency |
| DrQ | SAC | Image augmentation regularization | Visual control |
Scaling Deep RL
Both PPO and SAC can be scaled up:
Distributed PPO: Run hundreds of parallel environments, each collecting experience. Aggregate gradients across workers. Used by OpenAI Five, GPT-4 RLHF.
Distributed SAC: Separate actors (environment interaction) from learners (gradient computation). Apex-style architectures use prioritized replay across many actors.
Large-scale Results: distributed PPO powered OpenAI Five's Dota 2 victories and the RLHF pipelines behind modern language models, demonstrating that policy gradient methods scale to enormous compute budgets.
Open Challenges: sample efficiency still lags far behind supervised learning, exploration in sparse-reward settings remains difficult, and training stability at very large scale is an active research area.
We've covered the two most important policy gradient algorithms in modern deep RL: PPO's clipped objective for stable on-policy learning, and SAC's maximum entropy framework for sample-efficient off-policy control.
Module Conclusion
With this page, we've completed our deep dive into Deep Reinforcement Learning, spanning value-based methods (DQN and its extensions) and policy gradient methods (PPO and SAC).
This foundation covers the core algorithms powering today's most impressive RL systems—from game-playing AIs to robotic manipulation to language model alignment.
Where to Go From Here: natural next steps include model-based RL, offline RL, and multi-agent RL, each of which builds directly on the actor-critic foundations covered here.
Congratulations! You've mastered Deep Reinforcement Learning—from the DQN architecture that started the deep RL revolution to the PPO and SAC algorithms that define current practice. This knowledge equips you to understand, implement, and improve the RL systems transforming AI today.