We've seen that policy gradients (the 'actor') can directly optimize behavior, and that learned value functions (the 'critic' that evaluates performance) make excellent baselines. Actor-Critic methods unite these ideas into a single, powerful framework.
In Actor-Critic architectures, the actor learns the policy π_θ(a|s), while the critic learns the value function V_ω(s) or Q_ω(s,a). The critic provides low-variance advantage estimates for training the actor, while the actor's behavior generates data for training the critic. This synergy enables faster learning than either approach alone.
Actor-Critic methods form the backbone of modern deep RL. Algorithms like A3C, A2C, PPO, SAC, and TD3 are all Actor-Critic methods. Understanding their shared foundation prepares you for the entire landscape of practical policy optimization.
By the end of this page, you will understand: the Actor-Critic framework and why it combines policy and value learning, the A2C algorithm in detail, architecture choices for actor and critic networks, how to balance actor and critic updates, and common failure modes with solutions.
Actor-Critic is not a single algorithm but a framework for combining policy optimization with value estimation. The key insight is that the critic's job is to help train the actor, not to act directly.
The Two Components: the actor, which learns the policy π_θ(a|s) and selects actions, and the critic, which learns the value function V_ω(s) or Q_ω(s,a) and evaluates how good those actions are.
The Actor-Critic Loop:
```
Actor-Critic Training Loop
═══════════════════════════════════════════════════════════════

Initialize: Actor π_θ, Critic V_ω, learning rates α_π and α_V

For each episode:
    Observe initial state s

    While not done:
        1. ACTOR: Sample action a ~ π_θ(·|s)

        2. ENVIRONMENT: Execute a, observe r, s', done

        3. CRITIC EVALUATION:
           - Compute TD error: δ = r + γV_ω(s') - V_ω(s)
           - Or compute advantage: Â = estimated advantage

        4. CRITIC UPDATE (TD learning):
           ω ← ω + α_V · δ · ∇_ω V_ω(s)

        5. ACTOR UPDATE (policy gradient):
           θ ← θ + α_π · ∇_θ log π_θ(a|s) · Â

        6. s ← s'
═══════════════════════════════════════════════════════════════
```

The loop above shows online updates (after every step). In practice, batch updates (after collecting N steps or full episodes) are often preferred because: (1) gradients average over more samples, (2) GPU utilization is better with batches, (3) parallel workers can be used. A2C and PPO use batch updates.
A2C (Advantage Actor-Critic) is the synchronous, batch version of the famous A3C algorithm. It's the workhorse of policy gradient methods and an excellent starting point for understanding modern implementations.
A2C Algorithm:
```
A2C: Advantage Actor-Critic
═══════════════════════════════════════════════════════════════

Hyperparameters:
- n_steps: Number of steps to collect before updating
- n_envs: Number of parallel environments
- γ: Discount factor
- λ: GAE parameter
- c_1: Value loss coefficient
- c_2: Entropy coefficient

Initialize: Actor-Critic network with shared/separate parameters

Repeat:
    1. COLLECT ROLLOUTS
       For each of n_envs environments in parallel:
           Collect n_steps of experience (s, a, r, s', done)
       Total: n_envs × n_steps transitions

    2. COMPUTE ADVANTAGES AND RETURNS
       For each trajectory:
           Use GAE to compute advantages Â and returns R

    3. COMPUTE LOSSES
       Policy loss:  L_π = -E[log π_θ(a|s) · Â]
       Value loss:   L_V = E[(V_ω(s) - R)²]
       Entropy:      H   = E[-log π_θ(a|s)]
       Total loss:   L   = L_π + c_1·L_V - c_2·H

    4. UPDATE PARAMETERS
       Compute ∇L and update using optimizer
═══════════════════════════════════════════════════════════════
```

Complete A2C Implementation:
```python
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.distributions import Categorical
import numpy as np
from typing import List, Tuple, Dict


class ActorCriticNetwork(nn.Module):
    """
    Combined Actor-Critic network with shared feature extraction.

    Architecture:
    - Shared layers: Extract features from observations
    - Policy head (Actor): Outputs action distribution
    - Value head (Critic): Outputs state value estimate

    Sharing layers is common because:
    1. Both benefit from similar state representations
    2. Acts as implicit regularization
    3. More parameter-efficient
    """

    def __init__(
        self,
        state_dim: int,
        action_dim: int,
        hidden_dim: int = 256
    ):
        super().__init__()

        # Shared feature extraction
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU()
        )

        # Actor head: outputs logits for each action
        self.actor_head = nn.Linear(hidden_dim, action_dim)

        # Critic head: outputs single value estimate
        self.critic_head = nn.Linear(hidden_dim, 1)

        # Initialize weights
        self._initialize_weights()

    def _initialize_weights(self):
        for module in self.modules():
            if isinstance(module, nn.Linear):
                nn.init.orthogonal_(module.weight, gain=np.sqrt(2))
                nn.init.constant_(module.bias, 0)

        # Smaller init for output layers
        nn.init.orthogonal_(self.actor_head.weight, gain=0.01)
        nn.init.orthogonal_(self.critic_head.weight, gain=1.0)

    def forward(self, state: torch.Tensor) -> Tuple[Categorical, torch.Tensor]:
        """
        Forward pass returns both policy distribution and value.

        Args:
            state: Observation tensor [batch, state_dim]

        Returns:
            dist: Categorical distribution over actions
            value: State value estimates [batch, 1]
        """
        features = self.shared(state)

        # Actor
        logits = self.actor_head(features)
        dist = Categorical(logits=logits)

        # Critic
        value = self.critic_head(features)

        return dist, value

    def get_action_and_value(
        self,
        state: torch.Tensor
    ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
        """
        Get action, log probability, entropy, and value in one pass.
        Used during rollout collection.
        """
        dist, value = self.forward(state)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        entropy = dist.entropy()

        return action, log_prob, entropy, value.squeeze(-1)
```

The A2C Agent:
```python
class A2CAgent:
    """
    Advantage Actor-Critic agent with GAE and batch updates.
    """

    def __init__(
        self,
        state_dim: int,
        action_dim: int,
        lr: float = 3e-4,
        gamma: float = 0.99,
        gae_lambda: float = 0.95,
        value_coef: float = 0.5,
        entropy_coef: float = 0.01,
        max_grad_norm: float = 0.5,
        n_steps: int = 5
    ):
        self.gamma = gamma
        self.gae_lambda = gae_lambda
        self.value_coef = value_coef
        self.entropy_coef = entropy_coef
        self.max_grad_norm = max_grad_norm
        self.n_steps = n_steps

        # Network and optimizer
        self.network = ActorCriticNetwork(state_dim, action_dim)
        self.optimizer = optim.Adam(self.network.parameters(), lr=lr)

        # Rollout storage
        self.states = []
        self.actions = []
        self.rewards = []
        self.dones = []
        self.log_probs = []
        self.values = []

    def select_action(self, state: np.ndarray) -> int:
        """Select action and store transition data."""
        state_tensor = torch.FloatTensor(state).unsqueeze(0)

        with torch.no_grad():
            action, log_prob, _, value = self.network.get_action_and_value(state_tensor)

        # Store for later learning
        self.states.append(state_tensor)
        self.actions.append(action)
        self.log_probs.append(log_prob)
        self.values.append(value)

        return action.item()

    def store_transition(self, reward: float, done: bool):
        """Store reward and done flag."""
        self.rewards.append(reward)
        self.dones.append(done)

    def compute_gae(self, next_value: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Compute Generalized Advantage Estimation for the collected rollout.

        Handles both single-environment rollouts (stored values are scalars)
        and vectorized rollouts (stored values have shape [n_envs]).
        Returns flattened advantages and returns matching the stored batch order.
        """
        n_steps = len(self.values)

        # Shape everything to [n_steps, n_envs] (n_envs = 1 for a single environment)
        values = torch.stack([v.reshape(-1) for v in self.values])
        rewards = torch.as_tensor(
            np.asarray(self.rewards, dtype=np.float32)
        ).reshape(n_steps, -1)
        dones = torch.as_tensor(
            np.asarray(self.dones, dtype=np.float32)
        ).reshape(n_steps, -1)
        next_value = next_value.reshape(-1)

        advantages = torch.zeros_like(values)
        gae = torch.zeros_like(next_value)

        for t in reversed(range(n_steps)):
            # Bootstrap from next_value at the end of the rollout
            next_val = next_value if t == n_steps - 1 else values[t + 1]
            not_done = 1.0 - dones[t]

            # TD error (the bootstrap term is zeroed at episode boundaries)
            delta = rewards[t] + self.gamma * next_val * not_done - values[t]

            # GAE recursion, reset at episode boundaries
            gae = delta + self.gamma * self.gae_lambda * not_done * gae
            advantages[t] = gae

        returns = advantages + values
        return advantages.reshape(-1), returns.reshape(-1)

    def update(self, next_state: np.ndarray) -> Dict[str, float]:
        """
        Perform A2C update after collecting n_steps.
        Returns a dictionary of metrics for logging.
        """
        # Get value of next state for bootstrapping
        next_state_tensor = torch.as_tensor(next_state, dtype=torch.float32)
        with torch.no_grad():
            _, next_value = self.network(next_state_tensor)
            next_value = next_value.reshape(-1)

        # Compute advantages and returns
        advantages, returns = self.compute_gae(next_value)

        # Prepare batch data
        states = torch.cat(self.states)
        actions = torch.cat(self.actions)
        old_log_probs = torch.cat(self.log_probs)  # not reused by A2C (PPO would use these)

        # Forward pass
        dist, values = self.network(states)
        values = values.squeeze(-1)
        log_probs = dist.log_prob(actions)
        entropy = dist.entropy()

        # Normalize advantages
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

        # Losses
        # Policy loss: negative because we want to maximize
        policy_loss = -(log_probs * advantages.detach()).mean()

        # Value loss: MSE against the GAE returns
        value_loss = F.mse_loss(values, returns.detach())

        # Entropy bonus (negative because we subtract it from the total loss)
        entropy_loss = -entropy.mean()

        # Combined loss
        total_loss = (
            policy_loss
            + self.value_coef * value_loss
            + self.entropy_coef * entropy_loss
        )

        # Backpropagation
        self.optimizer.zero_grad()
        total_loss.backward()

        # Gradient clipping
        grad_norm = nn.utils.clip_grad_norm_(
            self.network.parameters(), self.max_grad_norm
        )

        self.optimizer.step()

        # Clear rollout storage
        self.states = []
        self.actions = []
        self.rewards = []
        self.dones = []
        self.log_probs = []
        self.values = []

        return {
            'policy_loss': policy_loss.item(),
            'value_loss': value_loss.item(),
            'entropy': -entropy_loss.item(),
            'grad_norm': grad_norm.item(),
            'total_loss': total_loss.item()
        }
```

Critical A2C choices: (1) Shared vs. separate networks: sharing is more parameter-efficient, but the coupling can cause instability. (2) The n_steps trade-off: more steps means lower bias but higher variance. (3) The entropy coefficient: higher values encourage exploration but slow convergence. (4) The value coefficient: 0.5 usually works well, but it can be tuned.
The critic serves multiple purposes in Actor-Critic algorithms. Understanding these roles clarifies why the critic is so important and how to design it effectively.
Role 1: Baseline for Variance Reduction
```
# Without critic (REINFORCE):
∇_θ J = E[∇_θ log π(a|s) · G_t]            # High variance

# With critic as baseline:
∇_θ J = E[∇_θ log π(a|s) · (G_t - V(s))]   # Lower variance

# The advantage A = G_t - V(s) is centered around zero
# This removes the mean from the gradient, reducing variance dramatically
```

Role 2: Bootstrapping for Online Learning
```
# Without critic (Monte Carlo):
Must wait until episode ends to compute G_t = Σ γ^k r_{t+k}
Cannot learn during the episode

# With critic (TD learning):
G_t ≈ r_t + γV(s_{t+1})   # Bootstrap with value estimate
Can update after every step or batch of steps

# This enables:
# 1. Learning from incomplete episodes
# 2. Handling continuous (non-episodic) tasks
# 3. More frequent updates = faster learning
```

Role 3: Credit Assignment via GAE
```
# GAE uses the critic at every timestep:
δ_t = r_t + γV(s_{t+1}) - V(s_t)   # TD error at each step
Â_t = Σ_k (γλ)^k δ_{t+k}           # Weighted sum of TD errors

# The critic enables temporal credit assignment:
# - Actions get credit for immediate rewards (via r_t)
# - And discounted future effects (via V(s_{t+1}) - V(s_t))

# Good critic = good credit assignment = faster actor learning
```

The critic and actor are trained together, but on different objectives. A poor critic provides bad advantage estimates, harming actor learning. A poor actor generates bad trajectories, harming critic learning. This coupling can cause instability. Solutions include: separate networks, different learning rates, and target networks (used in SAC/TD3).
A crucial design decision in Actor-Critic methods is whether to share parameters between the actor and critic. Each approach has distinct advantages.
Shared Parameters:
```python
class SharedActorCritic(nn.Module):
    """
    Actor and Critic share the feature extraction layers.

    Advantages:
    - Fewer parameters (more efficient)
    - Implicit regularization (shared representations)
    - Works well for simple environments

    Disadvantages:
    - Coupled learning dynamics
    - Conflicting gradient directions possible
    - May limit expressiveness
    """

    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()

        # SHARED layers
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU()
        )

        # Separate heads
        self.actor_head = nn.Linear(hidden_dim, action_dim)
        self.critic_head = nn.Linear(hidden_dim, 1)

    def forward(self, state):
        features = self.shared(state)  # Same features for both!
        return Categorical(logits=self.actor_head(features)), self.critic_head(features)
```

Separate Parameters:
```python
class SeparateActorCritic(nn.Module):
    """
    Actor and Critic have completely separate networks.

    Advantages:
    - Independent learning dynamics
    - No gradient interference
    - Can use different architectures/sizes
    - Better for complex environments

    Disadvantages:
    - More parameters
    - Slower training (no knowledge sharing)
    - May overfit differently
    """

    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()

        # Completely separate actor network
        self.actor = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim)
        )

        # Completely separate critic network
        self.critic = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )

    def forward(self, state):
        logits = self.actor(state)
        value = self.critic(state)
        return Categorical(logits=logits), value

    def get_policy(self, state):
        return Categorical(logits=self.actor(state))

    def get_value(self, state):
        return self.critic(state)
```

| Aspect | Shared | Separate |
|---|---|---|
| Parameter count | Lower | Higher |
| Learning stability | Can be unstable | More stable |
| Feature learning | Forced shared | Independent |
| Gradient interference | Possible | None |
| When to use | Simple tasks, limited compute | Complex tasks, critical stability |
Hybrid Approach: Partially Shared
```python
class HybridActorCritic(nn.Module):
    """
    Share early layers, separate later layers.

    Rationale:
    - Early layers: Feature extraction (useful for both)
    - Later layers: Task-specific processing (should differ)
    """

    def __init__(self, state_dim, action_dim, shared_dim=128, separate_dim=64):
        super().__init__()

        # Shared feature extraction
        self.shared_encoder = nn.Sequential(
            nn.Linear(state_dim, shared_dim),
            nn.ReLU()
        )

        # Actor-specific layers
        self.actor_layers = nn.Sequential(
            nn.Linear(shared_dim, separate_dim),
            nn.ReLU(),
            nn.Linear(separate_dim, action_dim)
        )

        # Critic-specific layers
        self.critic_layers = nn.Sequential(
            nn.Linear(shared_dim, separate_dim),
            nn.ReLU(),
            nn.Linear(separate_dim, 1)
        )

    def forward(self, state):
        shared_features = self.shared_encoder(state)
        logits = self.actor_layers(shared_features)
        value = self.critic_layers(shared_features)
        return Categorical(logits=logits), value
```

Start with shared networks for simple environments (CartPole, LunarLander). Move to separate networks for complex domains (Atari, robotics). For vision-based RL, a shared CNN encoder with separate heads often works well. Algorithms like PPO commonly use shared architectures; SAC and TD3 use separate.
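For the vision-based case mentioned above, here is a rough sketch of a shared CNN encoder with separate heads; the layer sizes and the 84×84 input are illustrative assumptions, not a prescribed architecture:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class CnnActorCritic(nn.Module):
    """Sketch: shared CNN encoder with separate actor/critic heads (sizes are illustrative)."""

    def __init__(self, in_channels: int, action_dim: int, feature_dim: int = 512):
        super().__init__()
        # Shared convolutional encoder (assumes 84x84 pixel inputs)
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, feature_dim), nn.ReLU(),
        )
        # Separate heads on top of the shared features
        self.actor_head = nn.Linear(feature_dim, action_dim)
        self.critic_head = nn.Linear(feature_dim, 1)

    def forward(self, pixels: torch.Tensor):
        features = self.encoder(pixels / 255.0)  # assumes uint8-scale pixel inputs
        return Categorical(logits=self.actor_head(features)), self.critic_head(features)
```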
The relative learning speeds of actor and critic significantly impact training stability and efficiency. If the critic learns too slowly, it provides inaccurate baselines. If the critic learns too quickly relative to the actor, the value targets become inconsistent.
Methods for Balancing:
```python
# Strategy 1: Different learning rates
actor_optimizer = optim.Adam(actor.parameters(), lr=3e-4)
critic_optimizer = optim.Adam(critic.parameters(), lr=1e-3)  # ~3x higher

# Strategy 2: Multiple critic updates
def update_step(batch):
    # Multiple critic updates first
    for _ in range(5):  # 5 critic steps
        critic_loss = compute_value_loss(critic, batch)
        critic_optimizer.zero_grad()
        critic_loss.backward()
        critic_optimizer.step()

    # Single actor update
    actor_loss = compute_policy_loss(actor, critic, batch)
    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()

# Strategy 3: Target network (for off-policy methods)
class CriticWithTarget:
    def __init__(self, state_dim, tau=0.005):
        self.critic = CriticNetwork(state_dim)
        self.target = CriticNetwork(state_dim)
        self.target.load_state_dict(self.critic.state_dict())
        self.tau = tau

    def update_target(self):
        """Soft update: target ← τ·critic + (1-τ)·target"""
        for param, target_param in zip(
            self.critic.parameters(), self.target.parameters()
        ):
            target_param.data.copy_(
                self.tau * param.data + (1 - self.tau) * target_param.data
            )
```

Critic too slow: Value loss stays high, advantages have high variance, actor training is unstable. Critic too fast: Value loss is low but actor still random, critic overfitting to current behavior. Monitor both losses during training to detect imbalance.
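As one way to act on that monitoring advice, here is a sketch of a balance check over the metrics dictionaries returned by A2CAgent.update; the thresholds are arbitrary assumptions and environment-dependent:

```python
import numpy as np

def check_actor_critic_balance(metrics_history: list, window: int = 50,
                               high_value_loss: float = 1.0,
                               near_uniform_entropy: float = 0.6) -> str:
    """
    Sketch: flag a possible actor/critic imbalance from logged A2C metrics.
    Assumes `metrics_history` is a list of dicts as returned by A2CAgent.update();
    the thresholds are illustrative only.
    """
    recent = metrics_history[-window:]
    value_loss = float(np.mean([m['value_loss'] for m in recent]))
    entropy = float(np.mean([m['entropy'] for m in recent]))

    if value_loss > high_value_loss:
        # "Critic too slow": value targets still poorly fit -> noisy advantages
        return "critic lagging: try a higher critic lr or extra critic updates"
    if entropy > near_uniform_entropy:
        # "Critic too fast" pattern: value loss is fine but the policy is still near-random
        return "actor lagging: critic may be overfitting the current (near-random) policy"
    return "losses look balanced"
```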
The original Asynchronous Advantage Actor-Critic (A3C) uses multiple workers with asynchronous gradient updates. A2C is the synchronous variant that's often preferred in practice.
A3C (Asynchronous):
```
A3C: Asynchronous Advantage Actor-Critic
═══════════════════════════════════════════════════════════════

Multiple workers, each with:
- Own copy of the environment
- Own local gradients

Global parameter server:
- Maintains master network θ_global

Worker loop:
1. Copy θ_global → θ_local
2. Collect trajectory using θ_local
3. Compute local gradients
4. Asynchronously update θ_global ← θ_global + α·∇θ_local

Key property: No synchronization barrier!
Workers update in parallel without waiting

═══════════════════════════════════════════════════════════════

A2C: Synchronous Advantage Actor-Critic
═══════════════════════════════════════════════════════════════

Multiple parallel environments, single network:
1. Collect n_steps from ALL environments
2. Combine all transitions into single batch
3. Compute single gradient update
4. Apply update to single network

Key property: All workers synchronized!
Wait for all environments before updating

═══════════════════════════════════════════════════════════════
```

| Aspect | A3C | A2C |
|---|---|---|
| Updates | Asynchronous | Synchronous |
| Gradient aggregation | Stale gradients possible | Fresh gradients always |
| GPU utilization | Poor (CPU workers) | Good (batched GPU ops) |
| Reproducibility | Non-deterministic | Deterministic (given seed) |
| Implementation | Complex (threading) | Simple (vectorized envs) |
| Performance | Similar or worse | Often better per wall-time |
```python
import gymnasium as gym
from gymnasium.vector import SyncVectorEnv, AsyncVectorEnv


def make_env(env_id: str, seed: int):
    """Factory function for creating environments."""
    def _init():
        env = gym.make(env_id)
        env.reset(seed=seed)
        return env
    return _init


def create_vectorized_envs(env_id: str, num_envs: int = 8):
    """
    Create vectorized environments for A2C.

    Vectorized environments run multiple copies in parallel,
    allowing efficient batch collection.
    """
    # Synchronous: simpler, more reliable
    env_fns = [make_env(env_id, seed=i) for i in range(num_envs)]
    envs = SyncVectorEnv(env_fns)

    # Alternative: AsyncVectorEnv for faster but less stable execution
    # envs = AsyncVectorEnv(env_fns)

    return envs


# Usage
envs = create_vectorized_envs("CartPole-v1", num_envs=8)

# Step all environments at once (the batched action space samples one action per env)
actions = envs.action_space.sample()
next_obs, rewards, terminated, truncated, infos = envs.step(actions)

# next_obs shape: (8, obs_dim) - batch of 8 observations!
```

Use A2C with vectorized environments rather than A3C. A2C is simpler to implement, easier to debug, has better GPU utilization, and achieves similar or better performance. A3C's original motivation was utilizing multi-core CPUs, but modern GPU-accelerated training favors synchronous batched updates.
Actor-Critic algorithms, while powerful, can fail in subtle ways. Understanding common failure modes helps debug training issues.
Failure Mode 1: Policy Entropy Collapse
```python
# PROBLEM: Policy becomes deterministic too quickly
# Symptom: entropy drops to near 0, exploration stops

# SOLUTION 1: Increase entropy coefficient
entropy_coef = 0.05  # Try higher values (default often 0.01)

# SOLUTION 2: Schedule entropy coefficient
def get_entropy_coef(step, max_steps):
    """Linear decay from high to low entropy."""
    initial = 0.1
    final = 0.01
    return initial - (initial - final) * (step / max_steps)

# SOLUTION 3: Use entropy as constraint (SAC-style)
# Target entropy = -action_dim (for continuous)
# Automatically adjust α to maintain target entropy
```

Failure Mode 2: Value Function Doesn't Converge
```python
# PROBLEM: Value loss stays high or oscillates
# Symptom: L_V doesn't decrease, advantages are noisy

# DIAGNOSIS: Plot value predictions vs actual returns
import matplotlib.pyplot as plt

def diagnose_value_function(critic, states, returns):
    with torch.no_grad():
        predictions = critic(states).squeeze()

    # Scatter plot
    plt.scatter(returns.numpy(), predictions.numpy(), alpha=0.3)
    plt.xlabel('Actual Returns')
    plt.ylabel('Predicted Values')
    plt.plot([returns.min(), returns.max()],
             [returns.min(), returns.max()], 'r--')  # y=x line
    plt.title('Value Function Accuracy')
    plt.show()

# If points are far from the y=x line, the critic is inaccurate

# SOLUTIONS:
# 1. Increase critic learning rate
# 2. Increase number of critic updates per actor update
# 3. Use larger critic network
# 4. Verify returns are computed correctly
# 5. Check for reward scaling issues
```

Failure Mode 3: Actor-Critic Coupling Instability
```python
# PROBLEM: Training is unstable, periodic performance drops
# Symptom: Reward oscillates, both losses spike

# DIAGNOSIS: Track gradient interference on the SHARED layers
def check_gradient_interference(network, policy_loss, value_loss) -> float:
    """
    Measure how much actor and critic gradients conflict on the shared
    parameters. High conflict (negative cosine similarity) = unstable training.

    Call this before .backward() on the combined loss.
    """
    shared_params = [p for name, p in network.named_parameters() if 'shared' in name]
    if not shared_params:
        return 0.0  # Separate networks: no shared parameters to conflict

    # Gradient of each objective w.r.t. the shared layers only
    actor_grads = torch.autograd.grad(policy_loss, shared_params, retain_graph=True)
    critic_grads = torch.autograd.grad(value_loss, shared_params, retain_graph=True)

    actor_flat = torch.cat([g.reshape(-1) for g in actor_grads])
    critic_flat = torch.cat([g.reshape(-1) for g in critic_grads])

    # Cosine similarity: 1 = same direction, -1 = opposite
    cos_sim = F.cosine_similarity(actor_flat.unsqueeze(0), critic_flat.unsqueeze(0))
    return cos_sim.item()

# SOLUTIONS:
# 1. Use separate networks for actor and critic
# 2. Reduce learning rate
# 3. Add gradient clipping
# 4. Use stop-gradient on value baseline: advantages.detach()
```

Most Actor-Critic failures trace back to learning rates. If in doubt: lower the learning rate. A common error is using lr=1e-3 (OK for supervised learning) when lr=3e-4 or even lr=1e-4 is needed for stable RL. Always start conservative and increase if learning is too slow.
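A minimal sketch of that conservative approach, assuming the ActorCriticNetwork defined earlier and illustrative numbers: start with a low learning rate and anneal it linearly, a common stabilizer in A2C/PPO implementations:

```python
# Illustrative values only; dimensions here are CartPole-like
network = ActorCriticNetwork(state_dim=4, action_dim=2)
optimizer = optim.Adam(network.parameters(), lr=3e-4)

total_updates = 10_000
scheduler = optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda update: max(1.0 - update / total_updates, 0.0)  # fraction of the base lr
)

# In the training loop, after each optimizer.step():
#     scheduler.step()
```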
Let's put everything together into a complete, production-ready A2C training loop with proper logging and monitoring.
```python
from typing import Dict
from collections import deque
import time


def train_a2c(
    env_id: str = "CartPole-v1",
    num_envs: int = 8,
    total_timesteps: int = 1_000_000,
    n_steps: int = 5,
    lr: float = 3e-4,
    gamma: float = 0.99,
    gae_lambda: float = 0.95,
    value_coef: float = 0.5,
    entropy_coef: float = 0.01,
    max_grad_norm: float = 0.5,
    log_interval: int = 10,
    seed: int = 42
) -> Dict:
    """
    Complete A2C training loop with vectorized environments.
    """
    # Set seeds for reproducibility
    torch.manual_seed(seed)
    np.random.seed(seed)

    # Create vectorized environments
    envs = create_vectorized_envs(env_id, num_envs)
    state_dim = envs.single_observation_space.shape[0]
    action_dim = envs.single_action_space.n

    # Create agent
    agent = A2CAgent(
        state_dim=state_dim,
        action_dim=action_dim,
        lr=lr,
        gamma=gamma,
        gae_lambda=gae_lambda,
        value_coef=value_coef,
        entropy_coef=entropy_coef,
        max_grad_norm=max_grad_norm,
        n_steps=n_steps
    )

    # Training tracking
    episode_rewards = deque(maxlen=100)
    episode_lengths = deque(maxlen=100)
    current_rewards = np.zeros(num_envs)
    current_lengths = np.zeros(num_envs, dtype=int)

    # Metrics
    start_time = time.time()
    global_step = 0
    num_updates = 0

    # Initial observation
    obs, _ = envs.reset(seed=seed)

    while global_step < total_timesteps:
        # Collect n_steps rollout
        for _ in range(n_steps):
            # Get actions for all environments
            obs_tensor = torch.FloatTensor(obs)
            with torch.no_grad():
                actions, log_probs, entropies, values = (
                    agent.network.get_action_and_value(obs_tensor)
                )

            # Store for learning
            agent.states.append(obs_tensor)
            agent.actions.append(actions)
            agent.log_probs.append(log_probs)
            agent.values.append(values)

            # Step environments
            next_obs, rewards, terminated, truncated, infos = envs.step(actions.numpy())
            dones = terminated | truncated

            # Store rewards and dones
            for i in range(num_envs):
                agent.rewards.append(rewards[i])
                agent.dones.append(dones[i])

            # Track episode statistics
            current_rewards += rewards
            current_lengths += 1

            for i in range(num_envs):
                if dones[i]:
                    episode_rewards.append(current_rewards[i])
                    episode_lengths.append(current_lengths[i])
                    current_rewards[i] = 0
                    current_lengths[i] = 0

            obs = next_obs
            global_step += num_envs

        # Perform A2C update
        metrics = agent.update(obs)
        num_updates += 1

        # Logging
        if num_updates % log_interval == 0 and len(episode_rewards) > 0:
            elapsed = time.time() - start_time
            fps = global_step / elapsed
            print(f"Step: {global_step:8d} | "
                  f"Episodes: {len(episode_rewards):4d} | "
                  f"Mean Reward: {np.mean(episode_rewards):7.2f} | "
                  f"Policy Loss: {metrics['policy_loss']:7.4f} | "
                  f"Value Loss: {metrics['value_loss']:7.4f} | "
                  f"Entropy: {metrics['entropy']:5.3f} | "
                  f"FPS: {fps:.0f}")

    envs.close()

    return {
        'episode_rewards': list(episode_rewards),
        'final_mean_reward': np.mean(episode_rewards),
        'agent': agent
    }


if __name__ == "__main__":
    results = train_a2c(
        env_id="CartPole-v1",
        total_timesteps=500_000,
        num_envs=8
    )
    print(f"\nTraining complete! Final mean reward: {results['final_mean_reward']:.2f}")
```

On CartPole-v1, A2C should achieve near-optimal performance (~500 reward) within 100-200k steps. Key hyperparameters: lr=3e-4, n_steps=5-10, entropy_coef=0.01. If reward plateaus below 200, try increasing the entropy coefficient or reducing the learning rate.
We've covered Actor-Critic methods comprehensively. Let's consolidate the key insights: the actor learns the policy while the critic supplies low-variance advantage estimates; A2C combines this framework with GAE, parallel environments, and batch updates; architecture choices (shared vs. separate networks) and the balance between actor and critic learning largely determine stability; and most failures trace back to entropy collapse, an inaccurate critic, or learning rates that are too high.
What's next:
We've now assembled the complete Actor-Critic framework. The final piece is understanding the Advantage Function more deeply—how to interpret it, compute it efficiently, and why it's central to understanding policy quality. This understanding will prepare you for advanced algorithms like PPO, which introduces trust-region constraints on policy updates.
You now understand Actor-Critic methods—the foundation of modern policy optimization. A2C combines policy gradients with learned value functions for efficient, stable learning. This sets the stage for understanding the advantage function in depth, completing our treatment of policy-based methods.