We've seen that policy gradients (the 'actor') can directly optimize behavior, and that learned value functions (the 'critic' that evaluates performance) make excellent baselines. Actor-Critic methods unite these ideas into a single, powerful framework.
In Actor-Critic architectures, the actor learns the policy π_θ(a|s), while the critic learns the value function V_ω(s) or Q_ω(s,a). The critic provides low-variance advantage estimates for training the actor, while the actor's behavior generates data for training the critic. This synergy enables faster learning than either approach alone.
Actor-Critic methods form the backbone of modern deep RL. Algorithms like A3C, A2C, PPO, SAC, and TD3 are all Actor-Critic methods. Understanding their shared foundation prepares you for the entire landscape of practical policy optimization.
By the end of this page, you will understand: the Actor-Critic framework and why it combines policy and value learning, the A2C algorithm in detail, architecture choices for actor and critic networks, how to balance actor and critic updates, and common failure modes with solutions.
Actor-Critic is not a single algorithm but a framework for combining policy optimization with value estimation. The key insight is that the critic's job is to help train the actor, not to act directly.
The Two Components: the actor, which learns the policy π_θ(a|s) and selects actions, and the critic, which learns the value function V_ω(s) or Q_ω(s,a) and evaluates how good those actions are.
The Actor-Critic Loop:
```
Actor-Critic Training Loop
═══════════════════════════════════════════════════════════════

Initialize: Actor π_θ, Critic V_ω, learning rates α_π and α_V

For each episode:
    Observe initial state s

    While not done:
        1. ACTOR: Sample action a ~ π_θ(·|s)

        2. ENVIRONMENT: Execute a, observe r, s', done

        3. CRITIC EVALUATION:
           - Compute TD error: δ = r + γV_ω(s') - V_ω(s)
           - Or compute advantage: Â = estimated advantage

        4. CRITIC UPDATE (TD learning):
           ω ← ω + α_V · δ · ∇_ω V_ω(s)

        5. ACTOR UPDATE (policy gradient):
           θ ← θ + α_π · ∇_θ log π_θ(a|s) · Â

        6. s ← s'
═══════════════════════════════════════════════════════════════
```

The loop above shows online updates (after every step). In practice, batch updates (after collecting N steps or full episodes) are often preferred because: (1) gradients average over more samples, (2) GPU utilization is better with batches, (3) parallel workers can be used. A2C and PPO use batch updates.
A2C (Advantage Actor-Critic) is the synchronous, batch version of the famous A3C algorithm. It's the workhorse of policy gradient methods and an excellent starting point for understanding modern implementations.
A2C Algorithm:
```
A2C: Advantage Actor-Critic
═══════════════════════════════════════════════════════════════

Hyperparameters:
- n_steps: Number of steps to collect before updating
- n_envs: Number of parallel environments
- γ: Discount factor
- λ: GAE parameter
- c_1: Value loss coefficient
- c_2: Entropy coefficient

Initialize: Actor-Critic network with shared/separate parameters

Repeat:
    1. COLLECT ROLLOUTS
       For each of n_envs environments in parallel:
           Collect n_steps of experience (s, a, r, s', done)
       Total: n_envs × n_steps transitions

    2. COMPUTE ADVANTAGES AND RETURNS
       For each trajectory:
           Use GAE to compute advantages Â and returns R

    3. COMPUTE LOSSES
       Policy loss:  L_π = -E[log π_θ(a|s) · Â]
       Value loss:   L_V = E[(V_ω(s) - R)²]
       Entropy:      H   = E[-log π_θ(a|s)]
       Total loss:   L   = L_π + c_1·L_V - c_2·H

    4. UPDATE PARAMETERS
       Compute ∇L and update using optimizer
═══════════════════════════════════════════════════════════════
```

Complete A2C Implementation:
```python
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.distributions import Categorical
import numpy as np
from typing import List, Tuple, Dict


class ActorCriticNetwork(nn.Module):
    """
    Combined Actor-Critic network with shared feature extraction.

    Architecture:
    - Shared layers: Extract features from observations
    - Policy head (Actor): Outputs action distribution
    - Value head (Critic): Outputs state value estimate

    Sharing layers is common because:
    1. Both benefit from similar state representations
    2. Acts as implicit regularization
    3. More parameter-efficient
    """

    def __init__(
        self,
        state_dim: int,
        action_dim: int,
        hidden_dim: int = 256
    ):
        super().__init__()

        # Shared feature extraction
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU()
        )

        # Actor head: outputs logits for each action
        self.actor_head = nn.Linear(hidden_dim, action_dim)

        # Critic head: outputs single value estimate
        self.critic_head = nn.Linear(hidden_dim, 1)

        # Initialize weights
        self._initialize_weights()

    def _initialize_weights(self):
        for module in self.modules():
            if isinstance(module, nn.Linear):
                nn.init.orthogonal_(module.weight, gain=np.sqrt(2))
                nn.init.constant_(module.bias, 0)

        # Smaller init for output layers
        nn.init.orthogonal_(self.actor_head.weight, gain=0.01)
        nn.init.orthogonal_(self.critic_head.weight, gain=1.0)

    def forward(self, state: torch.Tensor) -> Tuple[Categorical, torch.Tensor]:
        """
        Forward pass returns both policy distribution and value.

        Args:
            state: Observation tensor [batch, state_dim]

        Returns:
            dist: Categorical distribution over actions
            value: State value estimates [batch, 1]
        """
        features = self.shared(state)

        # Actor
        logits = self.actor_head(features)
        dist = Categorical(logits=logits)

        # Critic
        value = self.critic_head(features)

        return dist, value

    def get_action_and_value(
        self,
        state: torch.Tensor
    ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
        """
        Get action, log probability, entropy, and value in one pass.
        Used during rollout collection.
        """
        dist, value = self.forward(state)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        entropy = dist.entropy()

        return action, log_prob, entropy, value.squeeze(-1)
```

The A2C Agent:
```python
class A2CAgent:
    """
    Advantage Actor-Critic agent with GAE and batch updates.
    """

    def __init__(
        self,
        state_dim: int,
        action_dim: int,
        lr: float = 3e-4,
        gamma: float = 0.99,
        gae_lambda: float = 0.95,
        value_coef: float = 0.5,
        entropy_coef: float = 0.01,
        max_grad_norm: float = 0.5,
        n_steps: int = 5
    ):
        self.gamma = gamma
        self.gae_lambda = gae_lambda
        self.value_coef = value_coef
        self.entropy_coef = entropy_coef
        self.max_grad_norm = max_grad_norm
        self.n_steps = n_steps

        # Network and optimizer
        self.network = ActorCriticNetwork(state_dim, action_dim)
        self.optimizer = optim.Adam(self.network.parameters(), lr=lr)

        # Rollout storage
        self.states = []
        self.actions = []
        self.rewards = []
        self.dones = []
        self.log_probs = []
        self.values = []

    def select_action(self, state: np.ndarray) -> int:
        """Select action and store transition data."""
        state_tensor = torch.FloatTensor(state).unsqueeze(0)

        with torch.no_grad():
            action, log_prob, _, value = self.network.get_action_and_value(state_tensor)

        # Store for later learning
        self.states.append(state_tensor)
        self.actions.append(action)
        self.log_probs.append(log_prob)
        self.values.append(value)

        return action.item()

    def store_transition(self, reward: float, done: bool):
        """Store reward and done flag."""
        self.rewards.append(reward)
        self.dones.append(done)

    def compute_gae(self, next_value: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Compute Generalized Advantage Estimation for the collected rollout.

        Handles both single-environment rollouts (stored values are scalars)
        and vectorized rollouts (stored values have shape [n_envs]).
        Returns flattened advantages and returns matching the stored batch order.
        """
        n_steps = len(self.values)

        # Shape everything to [n_steps, n_envs] (n_envs = 1 for a single environment)
        values = torch.stack([v.reshape(-1) for v in self.values])
        rewards = torch.as_tensor(
            np.asarray(self.rewards, dtype=np.float32)
        ).reshape(n_steps, -1)
        dones = torch.as_tensor(
            np.asarray(self.dones, dtype=np.float32)
        ).reshape(n_steps, -1)
        next_value = next_value.reshape(-1)

        advantages = torch.zeros_like(values)
        gae = torch.zeros_like(next_value)

        for t in reversed(range(n_steps)):
            # Bootstrap from next_value at the end of the rollout
            next_val = next_value if t == n_steps - 1 else values[t + 1]
            not_done = 1.0 - dones[t]

            # TD error (the bootstrap term is zeroed at episode boundaries)
            delta = rewards[t] + self.gamma * next_val * not_done - values[t]

            # GAE recursion, reset at episode boundaries
            gae = delta + self.gamma * self.gae_lambda * not_done * gae
            advantages[t] = gae

        returns = advantages + values
        return advantages.reshape(-1), returns.reshape(-1)

    def update(self, next_state: np.ndarray) -> Dict[str, float]:
        """
        Perform A2C update after collecting n_steps.
        Returns a dictionary of metrics for logging.
        """
        # Get value of next state for bootstrapping
        next_state_tensor = torch.as_tensor(next_state, dtype=torch.float32)
        with torch.no_grad():
            _, next_value = self.network(next_state_tensor)
            next_value = next_value.reshape(-1)

        # Compute advantages and returns
        advantages, returns = self.compute_gae(next_value)

        # Prepare batch data
        states = torch.cat(self.states)
        actions = torch.cat(self.actions)
        old_log_probs = torch.cat(self.log_probs)  # not reused by A2C (PPO would use these)

        # Forward pass
        dist, values = self.network(states)
        values = values.squeeze(-1)
        log_probs = dist.log_prob(actions)
        entropy = dist.entropy()

        # Normalize advantages
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

        # Losses
        # Policy loss: negative because we want to maximize
        policy_loss = -(log_probs * advantages.detach()).mean()

        # Value loss: MSE against the GAE returns
        value_loss = F.mse_loss(values, returns.detach())

        # Entropy bonus (negative because we subtract it from the total loss)
        entropy_loss = -entropy.mean()

        # Combined loss
        total_loss = (
            policy_loss
            + self.value_coef * value_loss
            + self.entropy_coef * entropy_loss
        )

        # Backpropagation
        self.optimizer.zero_grad()
        total_loss.backward()

        # Gradient clipping
        grad_norm = nn.utils.clip_grad_norm_(
            self.network.parameters(), self.max_grad_norm
        )

        self.optimizer.step()

        # Clear rollout storage
        self.states = []
        self.actions = []
        self.rewards = []
        self.dones = []
        self.log_probs = []
        self.values = []

        return {
            'policy_loss': policy_loss.item(),
            'value_loss': value_loss.item(),
            'entropy': -entropy_loss.item(),
            'grad_norm': grad_norm.item(),
            'total_loss': total_loss.item()
        }
```

Critical A2C choices: (1) Shared vs. separate networks: sharing is more parameter-efficient, but the coupling can cause instability. (2) The n_steps trade-off: more steps means lower bias but higher variance. (3) The entropy coefficient: higher values encourage exploration but slow convergence. (4) The value coefficient: 0.5 usually works well, but it can be tuned.
The critic serves multiple purposes in Actor-Critic algorithms. Understanding these roles clarifies why the critic is so important and how to design it effectively.
Role 1: Baseline for Variance Reduction
```
# Without critic (REINFORCE):
∇_θ J = E[∇_θ log π(a|s) · G_t]            # High variance

# With critic as baseline:
∇_θ J = E[∇_θ log π(a|s) · (G_t - V(s))]   # Lower variance

# The advantage A = G_t - V(s) is centered around zero
# This removes the mean from the gradient, reducing variance dramatically
```

Role 2: Bootstrapping for Online Learning
```
# Without critic (Monte Carlo):
Must wait until episode ends to compute G_t = Σ γ^k r_{t+k}
Cannot learn during the episode

# With critic (TD learning):
G_t ≈ r_t + γV(s_{t+1})   # Bootstrap with value estimate
Can update after every step or batch of steps

# This enables:
# 1. Learning from incomplete episodes
# 2. Handling continuous (non-episodic) tasks
# 3. More frequent updates = faster learning
```

Role 3: Credit Assignment via GAE
```
# GAE uses the critic at every timestep:
δ_t = r_t + γV(s_{t+1}) - V(s_t)   # TD error at each step
Â_t = Σ_k (γλ)^k δ_{t+k}           # Weighted sum of TD errors

# The critic enables temporal credit assignment:
# - Actions get credit for immediate rewards (via r_t)
# - And discounted future effects (via V(s_{t+1}) - V(s_t))

# Good critic = good credit assignment = faster actor learning
```

The critic and actor are trained together, but on different objectives. A poor critic provides bad advantage estimates, harming actor learning. A poor actor generates bad trajectories, harming critic learning. This coupling can cause instability. Solutions include: separate networks, different learning rates, and target networks (used in SAC/TD3).
A crucial design decision in Actor-Critic methods is whether to share parameters between the actor and critic. Each approach has distinct advantages.
Shared Parameters:
```python
class SharedActorCritic(nn.Module):
    """
    Actor and Critic share the feature extraction layers.

    Advantages:
    - Fewer parameters (more efficient)
    - Implicit regularization (shared representations)
    - Works well for simple environments

    Disadvantages:
    - Coupled learning dynamics
    - Conflicting gradient directions possible
    - May limit expressiveness
    """

    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()

        # SHARED layers
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU()
        )

        # Separate heads
        self.actor_head = nn.Linear(hidden_dim, action_dim)
        self.critic_head = nn.Linear(hidden_dim, 1)

    def forward(self, state):
        features = self.shared(state)  # Same features for both!
        return Categorical(logits=self.actor_head(features)), self.critic_head(features)
```

Separate Parameters:
```python
class SeparateActorCritic(nn.Module):
    """
    Actor and Critic have completely separate networks.

    Advantages:
    - Independent learning dynamics
    - No gradient interference
    - Can use different architectures/sizes
    - Better for complex environments

    Disadvantages:
    - More parameters
    - Slower training (no knowledge sharing)
    - May overfit differently
    """

    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()

        # Completely separate actor network
        self.actor = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim)
        )

        # Completely separate critic network
        self.critic = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )

    def forward(self, state):
        logits = self.actor(state)
        value = self.critic(state)
        return Categorical(logits=logits), value

    def get_policy(self, state):
        return Categorical(logits=self.actor(state))

    def get_value(self, state):
        return self.critic(state)
```

| Aspect | Shared | Separate |
|---|---|---|
| Parameter count | Lower | Higher |
| Learning stability | Can be unstable | More stable |
| Feature learning | Forced shared | Independent |
| Gradient interference | Possible | None |
| When to use | Simple tasks, limited compute | Complex tasks, critical stability |
Hybrid Approach: Partially Shared
```python
class HybridActorCritic(nn.Module):
    """
    Share early layers, separate later layers.

    Rationale:
    - Early layers: Feature extraction (useful for both)
    - Later layers: Task-specific processing (should differ)
    """

    def __init__(self, state_dim, action_dim, shared_dim=128, separate_dim=64):
        super().__init__()

        # Shared feature extraction
        self.shared_encoder = nn.Sequential(
            nn.Linear(state_dim, shared_dim),
            nn.ReLU()
        )

        # Actor-specific layers
        self.actor_layers = nn.Sequential(
            nn.Linear(shared_dim, separate_dim),
            nn.ReLU(),
            nn.Linear(separate_dim, action_dim)
        )

        # Critic-specific layers
        self.critic_layers = nn.Sequential(
            nn.Linear(shared_dim, separate_dim),
            nn.ReLU(),
            nn.Linear(separate_dim, 1)
        )

    def forward(self, state):
        shared_features = self.shared_encoder(state)
        logits = self.actor_layers(shared_features)
        value = self.critic_layers(shared_features)
        return Categorical(logits=logits), value
```

Start with shared networks for simple environments (CartPole, LunarLander). Move to separate networks for complex domains (Atari, robotics). For vision-based RL, a shared CNN encoder with separate heads often works well. Algorithms like PPO commonly use shared architectures; SAC and TD3 use separate.
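For the vision-based case mentioned above, here is a rough sketch of a shared CNN encoder with separate heads; the layer sizes and the 84×84 input are illustrative assumptions, not a prescribed architecture:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class CnnActorCritic(nn.Module):
    """Sketch: shared CNN encoder with separate actor/critic heads (sizes are illustrative)."""

    def __init__(self, in_channels: int, action_dim: int, feature_dim: int = 512):
        super().__init__()
        # Shared convolutional encoder (assumes 84x84 pixel inputs)
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, feature_dim), nn.ReLU(),
        )
        # Separate heads on top of the shared features
        self.actor_head = nn.Linear(feature_dim, action_dim)
        self.critic_head = nn.Linear(feature_dim, 1)

    def forward(self, pixels: torch.Tensor):
        features = self.encoder(pixels / 255.0)  # assumes uint8-scale pixel inputs
        return Categorical(logits=self.actor_head(features)), self.critic_head(features)
```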
The relative learning speeds of actor and critic significantly impact training stability and efficiency. If the critic learns too slowly, it provides inaccurate baselines. If the critic learns too quickly relative to the actor, the value targets become inconsistent.
Methods for Balancing:
```python
# Strategy 1: Different learning rates
actor_optimizer = optim.Adam(actor.parameters(), lr=3e-4)
critic_optimizer = optim.Adam(critic.parameters(), lr=1e-3)  # ~3x higher

# Strategy 2: Multiple critic updates
def update_step(batch):
    # Multiple critic updates first
    for _ in range(5):  # 5 critic steps
        critic_loss = compute_value_loss(critic, batch)
        critic_optimizer.zero_grad()
        critic_loss.backward()
        critic_optimizer.step()

    # Single actor update
    actor_loss = compute_policy_loss(actor, critic, batch)
    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()

# Strategy 3: Target network (for off-policy methods)
class CriticWithTarget:
    def __init__(self, state_dim, tau=0.005):
        self.critic = CriticNetwork(state_dim)
        self.target = CriticNetwork(state_dim)
        self.target.load_state_dict(self.critic.state_dict())
        self.tau = tau

    def update_target(self):
        """Soft update: target ← τ·critic + (1-τ)·target"""
        for param, target_param in zip(
            self.critic.parameters(), self.target.parameters()
        ):
            target_param.data.copy_(
                self.tau * param.data + (1 - self.tau) * target_param.data
            )
```

Critic too slow: Value loss stays high, advantages have high variance, actor training is unstable. Critic too fast: Value loss is low but actor still random, critic overfitting to current behavior. Monitor both losses during training to detect imbalance.
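As one way to act on that monitoring advice, here is a sketch of a balance check over the metrics dictionaries returned by A2CAgent.update; the thresholds are arbitrary assumptions and environment-dependent:

```python
import numpy as np

def check_actor_critic_balance(metrics_history: list, window: int = 50,
                               high_value_loss: float = 1.0,
                               near_uniform_entropy: float = 0.6) -> str:
    """
    Sketch: flag a possible actor/critic imbalance from logged A2C metrics.
    Assumes `metrics_history` is a list of dicts as returned by A2CAgent.update();
    the thresholds are illustrative only.
    """
    recent = metrics_history[-window:]
    value_loss = float(np.mean([m['value_loss'] for m in recent]))
    entropy = float(np.mean([m['entropy'] for m in recent]))

    if value_loss > high_value_loss:
        # "Critic too slow": value targets still poorly fit -> noisy advantages
        return "critic lagging: try a higher critic lr or extra critic updates"
    if entropy > near_uniform_entropy:
        # "Critic too fast" pattern: value loss is fine but the policy is still near-random
        return "actor lagging: critic may be overfitting the current (near-random) policy"
    return "losses look balanced"
```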
The original Asynchronous Advantage Actor-Critic (A3C) uses multiple workers with asynchronous gradient updates. A2C is the synchronous variant that's often preferred in practice.
A3C (Asynchronous):
```
A3C: Asynchronous Advantage Actor-Critic
═══════════════════════════════════════════════════════════════

Multiple workers, each with:
- Own copy of the environment
- Own local gradients

Global parameter server:
- Maintains master network θ_global

Worker loop:
1. Copy θ_global → θ_local
2. Collect trajectory using θ_local
3. Compute local gradients
4. Asynchronously update θ_global ← θ_global + α·∇θ_local

Key property: No synchronization barrier!
Workers update in parallel without waiting

═══════════════════════════════════════════════════════════════

A2C: Synchronous Advantage Actor-Critic
═══════════════════════════════════════════════════════════════

Multiple parallel environments, single network:
1. Collect n_steps from ALL environments
2. Combine all transitions into single batch
3. Compute single gradient update
4. Apply update to single network

Key property: All workers synchronized!
Wait for all environments before updating

═══════════════════════════════════════════════════════════════
```

| Aspect | A3C | A2C |
|---|---|---|
| Updates | Asynchronous | Synchronous |
| Gradient aggregation | Stale gradients possible | Fresh gradients always |
| GPU utilization | Poor (CPU workers) | Good (batched GPU ops) |
| Reproducibility | Non-deterministic | Deterministic (given seed) |
| Implementation | Complex (threading) | Simple (vectorized envs) |
| Performance | Similar or worse | Often better per wall-time |
```python
import gymnasium as gym
from gymnasium.vector import SyncVectorEnv, AsyncVectorEnv


def make_env(env_id: str, seed: int):
    """Factory function for creating environments."""
    def _init():
        env = gym.make(env_id)
        env.reset(seed=seed)
        return env
    return _init


def create_vectorized_envs(env_id: str, num_envs: int = 8):
    """
    Create vectorized environments for A2C.

    Vectorized environments run multiple copies in parallel,
    allowing efficient batch collection.
    """
    # Synchronous: simpler, more reliable
    env_fns = [make_env(env_id, seed=i) for i in range(num_envs)]
    envs = SyncVectorEnv(env_fns)

    # Alternative: AsyncVectorEnv for faster but less stable execution
    # envs = AsyncVectorEnv(env_fns)

    return envs


# Usage
envs = create_vectorized_envs("CartPole-v1", num_envs=8)

# Step all environments at once (the batched action space samples one action per env)
actions = envs.action_space.sample()
next_obs, rewards, terminated, truncated, infos = envs.step(actions)

# next_obs shape: (8, obs_dim) - batch of 8 observations!
```

Use A2C with vectorized environments rather than A3C. A2C is simpler to implement, easier to debug, has better GPU utilization, and achieves similar or better performance. A3C's original motivation was utilizing multi-core CPUs, but modern GPU-accelerated training favors synchronous batched updates.
Actor-Critic algorithms, while powerful, can fail in subtle ways. Understanding common failure modes helps debug training issues.
Failure Mode 1: Policy Entropy Collapse
```python
# PROBLEM: Policy becomes deterministic too quickly
# Symptom: entropy drops to near 0, exploration stops

# SOLUTION 1: Increase entropy coefficient
entropy_coef = 0.05  # Try higher values (default often 0.01)

# SOLUTION 2: Schedule entropy coefficient
def get_entropy_coef(step, max_steps):
    """Linear decay from high to low entropy."""
    initial = 0.1
    final = 0.01
    return initial - (initial - final) * (step / max_steps)

# SOLUTION 3: Use entropy as constraint (SAC-style)
# Target entropy = -action_dim (for continuous)
# Automatically adjust α to maintain target entropy
```

Failure Mode 2: Value Function Doesn't Converge
```python
# PROBLEM: Value loss stays high or oscillates
# Symptom: L_V doesn't decrease, advantages are noisy

# DIAGNOSIS: Plot value predictions vs actual returns
import matplotlib.pyplot as plt

def diagnose_value_function(critic, states, returns):
    with torch.no_grad():
        predictions = critic(states).squeeze()

    # Scatter plot
    plt.scatter(returns.numpy(), predictions.numpy(), alpha=0.3)
    plt.xlabel('Actual Returns')
    plt.ylabel('Predicted Values')
    plt.plot([returns.min(), returns.max()],
             [returns.min(), returns.max()], 'r--')  # y=x line
    plt.title('Value Function Accuracy')
    plt.show()

# If points are far from the y=x line, the critic is inaccurate

# SOLUTIONS:
# 1. Increase critic learning rate
# 2. Increase number of critic updates per actor update
# 3. Use larger critic network
# 4. Verify returns are computed correctly
# 5. Check for reward scaling issues
```

Failure Mode 3: Actor-Critic Coupling Instability
```python
# PROBLEM: Training is unstable, periodic performance drops
# Symptom: Reward oscillates, both losses spike

# DIAGNOSIS: Track gradient interference on the SHARED layers
def check_gradient_interference(network, policy_loss, value_loss) -> float:
    """
    Measure how much actor and critic gradients conflict on the shared
    parameters. High conflict (negative cosine similarity) = unstable training.

    Call this before .backward() on the combined loss.
    """
    shared_params = [p for name, p in network.named_parameters() if 'shared' in name]
    if not shared_params:
        return 0.0  # Separate networks: no shared parameters to conflict

    # Gradient of each objective w.r.t. the shared layers only
    actor_grads = torch.autograd.grad(policy_loss, shared_params, retain_graph=True)
    critic_grads = torch.autograd.grad(value_loss, shared_params, retain_graph=True)

    actor_flat = torch.cat([g.reshape(-1) for g in actor_grads])
    critic_flat = torch.cat([g.reshape(-1) for g in critic_grads])

    # Cosine similarity: 1 = same direction, -1 = opposite
    cos_sim = F.cosine_similarity(actor_flat.unsqueeze(0), critic_flat.unsqueeze(0))
    return cos_sim.item()

# SOLUTIONS:
# 1. Use separate networks for actor and critic
# 2. Reduce learning rate
# 3. Add gradient clipping
# 4. Use stop-gradient on value baseline: advantages.detach()
```

Most Actor-Critic failures trace back to learning rates. If in doubt: lower the learning rate. A common error is using lr=1e-3 (OK for supervised learning) when lr=3e-4 or even lr=1e-4 is needed for stable RL. Always start conservative and increase if learning is too slow.
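A minimal sketch of that conservative approach, assuming the ActorCriticNetwork defined earlier and illustrative numbers: start with a low learning rate and anneal it linearly, a common stabilizer in A2C/PPO implementations:

```python
# Illustrative values only; dimensions here are CartPole-like
network = ActorCriticNetwork(state_dim=4, action_dim=2)
optimizer = optim.Adam(network.parameters(), lr=3e-4)

total_updates = 10_000
scheduler = optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda update: max(1.0 - update / total_updates, 0.0)  # fraction of the base lr
)

# In the training loop, after each optimizer.step():
#     scheduler.step()
```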
Let's put everything together into a complete, production-ready A2C training loop with proper logging and monitoring.
```python
from typing import Dict
from collections import deque
import time


def train_a2c(
    env_id: str = "CartPole-v1",
    num_envs: int = 8,
    total_timesteps: int = 1_000_000,
    n_steps: int = 5,
    lr: float = 3e-4,
    gamma: float = 0.99,
    gae_lambda: float = 0.95,
    value_coef: float = 0.5,
    entropy_coef: float = 0.01,
    max_grad_norm: float = 0.5,
    log_interval: int = 10,
    seed: int = 42
) -> Dict:
    """
    Complete A2C training loop with vectorized environments.
    """
    # Set seeds for reproducibility
    torch.manual_seed(seed)
    np.random.seed(seed)

    # Create vectorized environments
    envs = create_vectorized_envs(env_id, num_envs)
    state_dim = envs.single_observation_space.shape[0]
    action_dim = envs.single_action_space.n

    # Create agent
    agent = A2CAgent(
        state_dim=state_dim,
        action_dim=action_dim,
        lr=lr,
        gamma=gamma,
        gae_lambda=gae_lambda,
        value_coef=value_coef,
        entropy_coef=entropy_coef,
        max_grad_norm=max_grad_norm,
        n_steps=n_steps
    )

    # Training tracking
    episode_rewards = deque(maxlen=100)
    episode_lengths = deque(maxlen=100)
    current_rewards = np.zeros(num_envs)
    current_lengths = np.zeros(num_envs, dtype=int)

    # Metrics
    start_time = time.time()
    global_step = 0
    num_updates = 0

    # Initial observation
    obs, _ = envs.reset(seed=seed)

    while global_step < total_timesteps:
        # Collect n_steps rollout
        for _ in range(n_steps):
            # Get actions for all environments
            obs_tensor = torch.FloatTensor(obs)
            with torch.no_grad():
                actions, log_probs, entropies, values = (
                    agent.network.get_action_and_value(obs_tensor)
                )

            # Store for learning
            agent.states.append(obs_tensor)
            agent.actions.append(actions)
            agent.log_probs.append(log_probs)
            agent.values.append(values)

            # Step environments
            next_obs, rewards, terminated, truncated, infos = envs.step(actions.numpy())
            dones = terminated | truncated

            # Store rewards and dones
            for i in range(num_envs):
                agent.rewards.append(rewards[i])
                agent.dones.append(dones[i])

            # Track episode statistics
            current_rewards += rewards
            current_lengths += 1

            for i in range(num_envs):
                if dones[i]:
                    episode_rewards.append(current_rewards[i])
                    episode_lengths.append(current_lengths[i])
                    current_rewards[i] = 0
                    current_lengths[i] = 0

            obs = next_obs
            global_step += num_envs

        # Perform A2C update
        metrics = agent.update(obs)
        num_updates += 1

        # Logging
        if num_updates % log_interval == 0 and len(episode_rewards) > 0:
            elapsed = time.time() - start_time
            fps = global_step / elapsed
            print(f"Step: {global_step:8d} | "
                  f"Episodes: {len(episode_rewards):4d} | "
                  f"Mean Reward: {np.mean(episode_rewards):7.2f} | "
                  f"Policy Loss: {metrics['policy_loss']:7.4f} | "
                  f"Value Loss: {metrics['value_loss']:7.4f} | "
                  f"Entropy: {metrics['entropy']:5.3f} | "
                  f"FPS: {fps:.0f}")

    envs.close()

    return {
        'episode_rewards': list(episode_rewards),
        'final_mean_reward': np.mean(episode_rewards),
        'agent': agent
    }


if __name__ == "__main__":
    results = train_a2c(
        env_id="CartPole-v1",
        total_timesteps=500_000,
        num_envs=8
    )
    print(f"\nTraining complete! Final mean reward: {results['final_mean_reward']:.2f}")
```

On CartPole-v1, A2C should achieve near-optimal performance (~500 reward) within 100-200k steps. Key hyperparameters: lr=3e-4, n_steps=5-10, entropy_coef=0.01. If reward plateaus below 200, try increasing the entropy coefficient or reducing the learning rate.
We've covered Actor-Critic methods comprehensively. Let's consolidate the key insights: the actor learns the policy while the critic supplies low-variance advantage estimates; A2C combines this framework with GAE, parallel environments, and batch updates; architecture choices (shared vs. separate networks) and the balance between actor and critic learning largely determine stability; and most failures trace back to entropy collapse, an inaccurate critic, or learning rates that are too high.
What's next:
We've now assembled the complete Actor-Critic framework. The final piece is understanding the Advantage Function more deeply—how to interpret it, compute it efficiently, and why it's central to understanding policy quality. This understanding will prepare you for advanced algorithms like PPO, which introduces trust-region constraints on policy updates.
You now understand Actor-Critic methods—the foundation of modern policy optimization. A2C combines policy gradients with learned value functions for efficient, stable learning. This sets the stage for understanding the advantage function in depth, completing our treatment of policy-based methods.