Multi-task learning introduces optimization challenges that don't exist in single-task settings. When training on multiple objectives simultaneously, we face conflicting gradients, imbalanced task scales, varying convergence rates, and the fundamental problem of finding a single solution that performs well across all tasks.
This page provides a comprehensive treatment of MTL optimization: the core challenges, their mathematical characterization, and the state-of-the-art techniques developed to address them. Mastering these concepts is essential for building effective MTL systems in practice.
By the end of this page, you will understand: (1) gradient conflict and interference, (2) task balancing strategies, (3) multi-objective optimization perspectives, (4) advanced gradient manipulation techniques, and (5) practical optimization recipes for MTL.
The fundamental optimization challenge in MTL is gradient conflict: when gradients from different tasks point in different (or opposite) directions in parameter space.
Mathematical Characterization:
For tasks $T_1, ..., T_k$ with losses $\mathcal{L}_1, ..., \mathcal{L}_k$, the gradients with respect to shared parameters $\theta$ are:
$$g_t = \nabla_\theta \mathcal{L}_t, \quad t \in \{1, \dots, k\}$$
The combined gradient in naive MTL is: $$g = \sum_{t=1}^{k} \lambda_t g_t$$
Conflict occurs when: $$g_i \cdot g_j < 0 \quad \text{(tasks disagree on update direction)}$$
In severe cases, the combined gradient $g$ may have negative projection onto some task gradients, meaning the update hurts that task.
Gradient conflict leads to the 'seesaw effect': improving one task degrades another. Training oscillates between favoring different tasks without reliably improving all. This is a key symptom of optimization difficulties in MTL.
Quantifying Conflict:
Gradient Cosine Similarity: $$\cos(g_i, g_j) = \frac{g_i \cdot g_j}{||g_i|| \cdot ||g_j||}$$ Negative values indicate conflict.
Gradient Agreement Ratio: Fraction of parameters where gradients agree on sign.
Conflict Intensity: $$C = \sum_{i<j} \max(0, -g_i \cdot g_j)$$ Measures total magnitude of conflicting gradients.
Sources of Gradient Conflict:
Conflict typically arises from genuinely competing task objectives, from large differences in loss scale and gradient magnitude, and from limited shared capacity that forces tasks to compete for the same parameters. The utility below computes per-task gradients and the conflict statistics defined above.
```python
import torch
from typing import Dict, Tuple


def analyze_gradient_conflict(
    model: torch.nn.Module,
    task_batches: Dict[str, Tuple[torch.Tensor, torch.Tensor]],
    loss_fns: Dict[str, torch.nn.Module]
) -> Dict[str, float]:
    """
    Comprehensive analysis of gradient conflicts in MTL.
    """
    task_names = list(task_batches.keys())
    gradients = {}

    # Compute per-task gradients
    for task, (x, y) in task_batches.items():
        model.zero_grad()
        pred = model(x, task)
        loss = loss_fns[task](pred, y)
        loss.backward()
        grad = torch.cat([
            p.grad.flatten()
            for p in model.parameters()
            if p.grad is not None
        ])
        gradients[task] = grad.detach().clone()

    results = {}

    # Pairwise cosine similarities
    n_conflicts = 0
    total_pairs = 0
    total_cos = 0
    for i, t1 in enumerate(task_names):
        for t2 in task_names[i + 1:]:
            g1, g2 = gradients[t1], gradients[t2]
            cos = torch.dot(g1, g2) / (g1.norm() * g2.norm() + 1e-8)
            total_cos += cos.item()
            total_pairs += 1
            if cos < 0:
                n_conflicts += 1
            results[f'cos_{t1}_{t2}'] = cos.item()

    results['conflict_rate'] = n_conflicts / max(total_pairs, 1)
    results['avg_cosine'] = total_cos / max(total_pairs, 1)

    # Combined gradient analysis
    combined = sum(gradients.values())
    for task in task_names:
        # Projection of combined gradient onto task gradient
        proj = torch.dot(combined, gradients[task])
        proj = proj / (gradients[task].norm() + 1e-8)
        results[f'combined_proj_{task}'] = proj.item()
        # If negative, the combined update hurts this task
        results[f'hurts_{task}'] = proj.item() < 0

    return results
```

Tasks often have different loss scales, learning dynamics, and difficulty levels. Without careful balancing, some tasks dominate training while others are neglected.
Static Weighting:
The simplest approach assigns fixed weights $\lambda_t$ to each task: $$\mathcal{L} = \sum_t \lambda_t \mathcal{L}_t$$
Weights can be chosen by grid search against validation metrics, by domain knowledge about task priority, or by normalizing for differences in loss scale.
Uncertainty weighting (Kendall et al., 2018) learns a task-specific homoscedastic uncertainty $\sigma_t$ during training. Tasks with higher uncertainty (harder to predict confidently) receive lower weight, while a $\log \sigma_t$ term keeps the uncertainties from growing without bound. This provides principled automatic balancing with minimal hyperparameters.
Dynamic Weighting:
More sophisticated methods adapt weights during training:
1. GradNorm (Chen et al., 2018): Balance gradient norms across tasks:
$$\tilde{w}_t(i) \leftarrow \tilde{w}_t(i-1) \cdot \left(\frac{r_t(i)}{\bar{r}(i)}\right)^\alpha$$
where $r_t$ is the relative inverse training rate of task $t$.
2. Dynamic Weight Averaging (DWA): $$\lambda_t(i) = \frac{\exp(w_t(i-1)/T)}{\sum_j \exp(w_j(i-1)/T)}$$ where $w_t(i-1) = \mathcal{L}_t(i-1)/\mathcal{L}_t(i-2)$ measures training speed (larger values indicate slower progress); a minimal sketch follows this list.
3. Gradient-Based Meta-Learning: Treat task weights as learnable parameters optimized on validation set.
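As a quick illustration of DWA (item 2 above), here is a minimal sketch, assuming per-task loss histories are tracked across epochs; the helper name `dwa_weights` and the temperature default are our own choices, not from a specific library.

```python
import torch
from typing import Dict, List


def dwa_weights(
    loss_history: Dict[str, List[float]],
    temperature: float = 2.0,
) -> Dict[str, float]:
    """Dynamic Weight Averaging: weight tasks by recent training speed.

    w_t(i-1) = L_t(i-1) / L_t(i-2) is large when task t's loss is falling
    slowly, so slower tasks receive larger weight via the softmax.
    """
    tasks = list(loss_history.keys())
    # Need at least two recorded epochs per task; otherwise use uniform weights
    if any(len(loss_history[t]) < 2 for t in tasks):
        return {t: 1.0 for t in tasks}

    ratios = torch.tensor([
        loss_history[t][-1] / (loss_history[t][-2] + 1e-8) for t in tasks
    ])
    # Softmax over w_t / T, scaled so the weights sum to the number of tasks
    weights = torch.softmax(ratios / temperature, dim=0) * len(tasks)
    return {t: weights[i].item() for i, t in enumerate(tasks)}
```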
```python
import torch
import torch.nn as nn
from typing import Dict


class UncertaintyWeighting(nn.Module):
    """
    Homoscedastic uncertainty weighting for MTL.
    Learns task-specific log variances to balance losses.
    """
    def __init__(self, task_names: list):
        super().__init__()
        # Log variance for numerical stability
        self.log_vars = nn.ParameterDict({
            task: nn.Parameter(torch.zeros(1))
            for task in task_names
        })

    def forward(
        self,
        task_losses: Dict[str, torch.Tensor]
    ) -> torch.Tensor:
        """
        Compute uncertainty-weighted combined loss.
        L = sum_t (1/(2*sigma_t^2)) * L_t + log(sigma_t)
        """
        total_loss = 0
        for task, loss in task_losses.items():
            log_var = self.log_vars[task]
            # Precision weighting with regularization
            precision = torch.exp(-log_var)
            total_loss += precision * loss + log_var
        return total_loss


class GradNormBalancer:
    """
    GradNorm: Gradient normalization for balanced MTL.
    """
    def __init__(
        self,
        model: nn.Module,
        task_names: list,
        alpha: float = 1.5
    ):
        self.model = model
        self.task_names = task_names
        self.alpha = alpha
        # Task weights (learnable)
        self.weights = {t: 1.0 for t in task_names}
        self.initial_losses = None

    def update_weights(
        self,
        task_losses: Dict[str, float],
        shared_params: list
    ):
        """Update task weights based on gradient norms."""
        if self.initial_losses is None:
            self.initial_losses = task_losses.copy()
            return

        # Loss ratios (training speed indicators)
        loss_ratios = {
            t: task_losses[t] / (self.initial_losses[t] + 1e-8)
            for t in self.task_names
        }
        mean_ratio = sum(loss_ratios.values()) / len(loss_ratios)

        # Relative inverse training rates
        inv_rates = {
            t: (loss_ratios[t] / (mean_ratio + 1e-8)) ** self.alpha
            for t in self.task_names
        }

        # Target: balance gradient norms
        # Adjust weights to achieve balanced rates
        for t in self.task_names:
            self.weights[t] *= inv_rates[t]

        # Renormalize so weights sum to the number of tasks
        total = sum(self.weights.values())
        for t in self.task_names:
            self.weights[t] /= total
            self.weights[t] *= len(self.task_names)

    def get_weighted_loss(
        self,
        task_losses: Dict[str, torch.Tensor]
    ) -> torch.Tensor:
        return sum(
            self.weights[t] * loss
            for t, loss in task_losses.items()
        )
```

MTL can be viewed as multi-objective optimization (MOO), where we seek solutions that are optimal across multiple objectives simultaneously.
Pareto Optimality:
A solution $\theta^*$ is Pareto optimal if no other solution improves at least one task without worsening any other:
$$\nexists\, \theta : \ \mathcal{L}_t(\theta) \leq \mathcal{L}_t(\theta^*) \ \forall t \ \text{ and } \ \mathcal{L}_j(\theta) < \mathcal{L}_j(\theta^*) \text{ for some } j$$
The set of all Pareto optimal solutions forms the Pareto front.
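To make the definition concrete, here is a small helper, a sketch of our own (the name `pareto_dominates` is not from any particular library), that checks whether one candidate's per-task losses dominate another's.

```python
from typing import Sequence


def pareto_dominates(losses_a: Sequence[float], losses_b: Sequence[float]) -> bool:
    """True if A dominates B: no worse on every task, strictly better on at least one."""
    assert len(losses_a) == len(losses_b)
    no_worse = all(a <= b for a, b in zip(losses_a, losses_b))
    strictly_better = any(a < b for a, b in zip(losses_a, losses_b))
    return no_worse and strictly_better


# theta_1 dominates theta_2; theta_1 and theta_3 trade off and neither dominates
print(pareto_dominates([0.5, 0.8], [0.6, 0.9]))  # True
print(pareto_dominates([0.5, 0.8], [0.4, 1.2]))  # False (a trade-off)
```

A solution is Pareto optimal exactly when no candidate dominates it in this sense.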
Multiple Gradient Descent Algorithm (MGDA):
MGDA finds a gradient direction that improves all tasks (if one exists):
$$\min_{d} ||d||^2 \quad \text{s.t.} \quad d = \sum_t \alpha_t g_t, \quad \alpha_t \geq 0, \quad \sum_t \alpha_t = 1$$
The solution $d$ is the minimum-norm element of the convex hull of task gradients.
| Method | Approach | Key Property |
|---|---|---|
| MGDA | Min-norm in gradient convex hull | Guaranteed Pareto improvement |
| CAGrad | Maximize worst-case task improvement | Conflict-averse |
| Nash-MTL | Find Nash equilibrium | Fair to all tasks |
| Pareto-MTL | Explore Pareto front | Diverse solutions |
| IMTL-G | Closed-form scaling for equal gradient projections | Impartial across tasks |
CAGrad (Conflict-Averse Gradient Descent):
Finds update direction within a bounded region around the average gradient:
$$d = \arg\max_{\|u - g_{\text{avg}}\| \leq c\,\|g_{\text{avg}}\|} \; \min_t \; g_t \cdot u$$
Maximizes minimum task improvement within a trust region.
Nash-MTL:
Formulates MTL as a bargaining game and finds the Nash equilibrium, ensuring no task can improve without cooperation.
Modern MTL research has developed sophisticated techniques to manipulate gradients for better optimization.
1. PCGrad (Projecting Conflicting Gradients):
When $g_i \cdot g_j < 0$, project away the conflicting component: $$g_i' = g_i - \frac{g_i \cdot g_j}{||g_j||^2} g_j$$
Removes the component of $g_i$ that conflicts with $g_j$.
2. Gradient Vaccine:
Similar to PCGrad but applies softer projection based on conflict severity.
3. Gradient Surgery:
Only modifies gradients when conflicts actually cause harm, preserving beneficial interactions.
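PCGrad is not covered by the code further below, so here is a minimal sketch of its projection step, applying the formula from item 1 above. It assumes each task gradient has already been flattened into a single vector, and it visits the other tasks in random order, as in the original method.

```python
import random
from typing import List

import torch


def pcgrad(gradients: List[torch.Tensor]) -> List[torch.Tensor]:
    """Project each task gradient away from the gradients it conflicts with.

    Whenever g_i . g_j < 0, the component of g_i along g_j is removed;
    non-conflicting pairs are left untouched.
    """
    projected = [g.clone() for g in gradients]
    for i, g_i in enumerate(projected):
        others = [j for j in range(len(gradients)) if j != i]
        random.shuffle(others)  # random order, as in the original paper
        for j in others:
            g_j = gradients[j]
            dot = torch.dot(g_i, g_j)
            if dot < 0:  # conflict: remove the component of g_i along g_j
                g_i -= dot / (g_j.norm() ** 2 + 1e-8) * g_j
    return projected
```

The final update direction is then typically the sum (or mean) of the projected gradients, applied to the shared parameters.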
```python
import torch
from typing import List


def mgda_direction(gradients: List[torch.Tensor]) -> torch.Tensor:
    """
    MGDA: Find minimum-norm direction in convex hull of gradients.
    Uses Frank-Wolfe optimization.
    """
    n_tasks = len(gradients)
    # Stack gradients: [n_tasks, n_params]
    G = torch.stack([g.flatten() for g in gradients])

    # Frank-Wolfe to find the min-norm point
    # Initialize with uniform weights
    alpha = torch.ones(n_tasks) / n_tasks

    for it in range(20):  # FW iterations
        d = G.T @ alpha  # Current direction
        # Linear minimization oracle: task with smallest inner product
        inner_products = G @ d
        min_task = torch.argmin(inner_products)
        # Frank-Wolfe step
        gamma = 2.0 / (2 + it)  # Step size
        alpha_new = (1 - gamma) * alpha
        alpha_new[min_task] += gamma
        alpha = alpha_new

    return G.T @ alpha


def cagrad_direction(
    gradients: List[torch.Tensor],
    c: float = 0.5
) -> torch.Tensor:
    """
    CAGrad: Conflict-Averse Gradient Descent.
    Finds direction maximizing worst-case improvement.
    """
    G = torch.stack([g.flatten() for g in gradients])
    n_tasks = len(gradients)

    # Average gradient and trust-region radius scale
    g_avg = G.mean(dim=0)
    g_avg_norm = g_avg.norm() + 1e-8

    # Solve the constrained optimization approximately:
    # maximize the minimum task projection within a region around g_avg
    if n_tasks == 2:
        g0, g1 = G[0], G[1]
        if torch.dot(g0, g1) >= 0:
            return g_avg
        # Conflicting pair: blend based on magnitudes
        cos_angle = torch.dot(g0, g1) / (g0.norm() * g1.norm())
        if cos_angle < -0.99:
            return torch.zeros_like(g_avg)
        w0 = g1.norm() / (g0.norm() + g1.norm())
        w1 = 1 - w0
        return w0 * g0 + w1 * g1

    # General case: iterative refinement
    d = g_avg.clone()
    for _ in range(10):
        projections = G @ d
        min_idx = torch.argmin(projections)
        # Move toward improving the worst-off task
        d = d + 0.1 * (G[min_idx] - d.dot(G[min_idx]) * d / (d.norm() ** 2 + 1e-8))
        # Project back to the trust region
        diff = d - g_avg
        if diff.norm() > c * g_avg_norm:
            d = g_avg + c * g_avg_norm * diff / diff.norm()

    return d
```

Based on extensive empirical research, here are practical recommendations for MTL optimization:
Simple approaches (uniform weights, uncertainty weighting) work well for related tasks. Use gradient manipulation (PCGrad, CAGrad) when you observe persistent gradient conflicts or the seesaw effect. For many tasks with varying relationships, consider learned weighting or MOO methods.
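To tie these recommendations together, here is a sketch of one MTL training step that supports both a plain weighted sum and PCGrad-style gradient surgery on conflicts. It is illustrative only: `mtl_training_step`, the `model(x, task)` calling convention, and the reuse of the `pcgrad` helper sketched earlier are our own assumptions, and a production implementation would distinguish shared from task-specific parameters and batch the backward passes more efficiently.

```python
import torch
from typing import Dict, Tuple


def mtl_training_step(
    model: torch.nn.Module,
    optimizer: torch.optim.Optimizer,
    task_batches: Dict[str, Tuple[torch.Tensor, torch.Tensor]],
    loss_fns: Dict[str, torch.nn.Module],
    task_weights: Dict[str, float],
    use_pcgrad: bool = False,
) -> Dict[str, float]:
    """One optimization step over all tasks.

    Default: one backward pass on the weighted sum of task losses.
    With use_pcgrad=True: per-task backward passes, conflict projection,
    then a manual write-back of the combined gradient.
    """
    optimizer.zero_grad()
    losses = {}

    if not use_pcgrad:
        total = 0.0
        for task, (x, y) in task_batches.items():
            raw_loss = loss_fns[task](model(x, task), y)
            losses[task] = raw_loss.item()
            total = total + task_weights[task] * raw_loss
        total.backward()
    else:
        params = [p for p in model.parameters() if p.requires_grad]
        per_task_grads = []
        for task, (x, y) in task_batches.items():
            model.zero_grad()
            raw_loss = loss_fns[task](model(x, task), y)
            losses[task] = raw_loss.item()
            (task_weights[task] * raw_loss).backward()
            # torch.cat copies, so later zero_grad calls cannot clobber this
            per_task_grads.append(torch.cat([
                (p.grad if p.grad is not None else torch.zeros_like(p)).flatten()
                for p in params
            ]))
        # Project conflicts away (pcgrad sketched earlier), then recombine
        combined = sum(pcgrad(per_task_grads))
        offset = 0
        for p in params:
            n = p.numel()
            p.grad = combined[offset:offset + n].view_as(p).clone()
            offset += n

    optimizer.step()
    return losses
```

In practice, start with the simple branch plus uncertainty weighting, and switch on gradient surgery only if the conflict diagnostics above show persistent negative cosine similarities.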
You now understand the optimization challenges in MTL and techniques to address them. The final page covers When MTL Helps—practical guidelines for when multi-task learning provides benefits over single-task alternatives.