The success of multi-task learning fundamentally depends on the relationships between tasks. When tasks are related, sharing information improves learning. When tasks are unrelated or conflicting, sharing can hurt performance through negative transfer.
This page explores how to conceptualize, measure, and leverage task relationships. We examine theoretical frameworks for task relatedness, practical methods for quantifying similarity, and techniques for using task structure to guide MTL architecture design.
By the end of this page, you will understand: (1) theoretical frameworks for task relatedness, (2) methods for measuring task similarity, (3) task clustering and grouping strategies, (4) how to leverage task relationships in architecture design, and (5) techniques for handling task conflicts.
Several theoretical frameworks formalize the notion of task relatedness, each capturing different aspects of how tasks can share structure.
1. Shared Hypothesis Class:
Tasks are related if they share optimal hypotheses from a common class. Formally, tasks $T_1, ..., T_k$ are related if there exists a hypothesis class $\mathcal{H}$ such that the optimal hypothesis for each task lies in $\mathcal{H}$:
$$h_t^* \in \mathcal{H}, \quad \forall t \in \{1, ..., k\}$$
2. Shared Representation:
Tasks are related if they share an optimal representation function. Each task's optimal predictor can be decomposed as:
$$f_t^* = g_t \circ h^*$$
where $h^*$ is the shared representation and $g_t$ are task-specific heads; a code sketch of this decomposition appears after this list of frameworks.
3. Task Covariance:
In the Bayesian view, task parameters $\theta_t$ are drawn from a prior distribution. Related tasks have correlated parameters:
$$\text{Cov}(\theta_i, \theta_j) > 0 \text{ for related tasks}$$
Tasks exist on a spectrum of relatedness. Two tasks might share low-level features but diverge at higher levels, or vice versa. Understanding the granularity and nature of task relationships is crucial for effective MTL design.
4. Transfer Distance:
A more operational definition measures relatedness through transfer performance. The transfer distance from task $i$ to task $j$ is:
$$d(T_i \to T_j) = \mathcal{L}_{T_j}(h_{T_i}) - \mathcal{L}_{T_j}(h_{T_j}^*)$$
where $h_{T_i}$ is trained on $T_i$ and evaluated on $T_j$. Small transfer distance indicates high relatedness.
5. Gradient Alignment:
During MTL training, relatedness manifests in gradient dynamics. Tasks are related if their gradients align:
$$\cos(g_i, g_j) = \frac{g_i \cdot g_j}{||g_i|| \cdot ||g_j||} > 0$$
Positive alignment indicates tasks agree on optimization direction.
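Before turning to measurement, it helps to ground the shared-representation view (framework 2 above) in code: it corresponds to the standard hard-parameter-sharing architecture with a shared trunk $h$ and task-specific heads $g_t$. The sketch below is a minimal illustration; the layer sizes, task names, and output dimensions are assumptions for the example, not part of any particular method.

```python
import torch
import torch.nn as nn

class SharedRepresentationMTL(nn.Module):
    """f_t = g_t ∘ h: one shared trunk h, one head g_t per task."""

    def __init__(self, input_dim: int, hidden_dim: int, task_output_dims: dict):
        super().__init__()
        # Shared representation h (framework 2)
        self.trunk = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
        )
        # Task-specific heads g_t
        self.heads = nn.ModuleDict({
            task: nn.Linear(hidden_dim, out_dim)
            for task, out_dim in task_output_dims.items()
        })

    def forward(self, x: torch.Tensor, task: str) -> torch.Tensor:
        return self.heads[task](self.trunk(x))

# Illustrative usage with hypothetical task names and dimensions
model = SharedRepresentationMTL(
    input_dim=64, hidden_dim=128,
    task_output_dims={'task_a': 10, 'task_b': 3},
)
logits = model(torch.randn(8, 64), task='task_a')  # shape: (8, 10)
```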
Before training an MTL system, we often want to estimate task similarity to guide architecture decisions. Several practical approaches exist:
Data-Based Measures:
Input Distribution Similarity: Compare the tasks' input distributions, for example with maximum mean discrepancy (MMD) or the accuracy of a domain classifier; tasks drawn from similar input domains are more likely to share useful low-level features.
Label Correlation: When tasks are defined over the same examples, measure the correlation or co-occurrence between their labels (a small sketch of these data-based measures follows this list).
Transfer-Based Measures:
Probe Networks: Train a lightweight model on one task and evaluate or fine-tune it on another; strong transfer indicates high relatedness.
Representation Similarity: Compare the representations learned by single-task models, for example with CKA or CCA, as in the code below.
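As a concrete illustration of the data-based measures, the sketch below uses a crude input-distribution proxy (the gap between per-task feature means, a stand-in for fuller measures such as MMD) and Pearson correlation between labels on shared examples. The function names and synthetic data are illustrative assumptions.

```python
import numpy as np

def input_distribution_similarity(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Crude distribution similarity: gap between per-task feature means,
    mapped into (0, 1]. feats_* are (n_samples, d) arrays of input features."""
    gap = np.linalg.norm(feats_a.mean(axis=0) - feats_b.mean(axis=0))
    return float(1.0 / (1.0 + gap))

def label_correlation(labels_a: np.ndarray, labels_b: np.ndarray) -> float:
    """Pearson correlation between two tasks' labels on the same examples."""
    return float(np.corrcoef(labels_a, labels_b)[0, 1])

# Illustrative usage with synthetic data
rng = np.random.default_rng(0)
feats_a = rng.normal(0.0, 1.0, size=(500, 16))
feats_b = rng.normal(0.5, 1.0, size=(500, 16))
y_a = rng.normal(size=500)
y_b = 0.8 * y_a + 0.2 * rng.normal(size=500)  # labels built to be correlated

print(input_distribution_similarity(feats_a, feats_b))
print(label_correlation(y_a, y_b))  # strongly positive by construction
```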
```python
import torch
import numpy as np
from typing import Dict, Tuple


def compute_task_affinity_matrix(
    task_models: Dict[str, torch.nn.Module],
    task_dataloaders: Dict[str, torch.utils.data.DataLoader],
    metric: str = 'accuracy'
) -> np.ndarray:
    """
    Compute pairwise task affinity through transfer evaluation.

    Returns:
        Affinity matrix A where A[i, j] = performance of model i on task j
    """
    task_names = list(task_models.keys())
    n_tasks = len(task_names)
    affinity = np.zeros((n_tasks, n_tasks))

    for i, source_task in enumerate(task_names):
        model = task_models[source_task]
        model.eval()
        for j, target_task in enumerate(task_names):
            loader = task_dataloaders[target_task]
            correct = 0
            total = 0
            with torch.no_grad():
                for x, y in loader:
                    pred = model(x).argmax(dim=1)
                    correct += (pred == y).sum().item()
                    total += len(y)
            affinity[i, j] = correct / total

    return affinity


def compute_representation_similarity_cka(
    model1: torch.nn.Module,
    model2: torch.nn.Module,
    dataloader: torch.utils.data.DataLoader
) -> float:
    """
    Compute CKA (Centered Kernel Alignment) similarity between
    representations learned by two models.
    """
    model1.eval()
    model2.eval()
    reps1, reps2 = [], []
    with torch.no_grad():
        for x, _ in dataloader:
            reps1.append(model1.get_representation(x).cpu())
            reps2.append(model2.get_representation(x).cpu())

    X = torch.cat(reps1, dim=0).numpy()
    Y = torch.cat(reps2, dim=0).numpy()

    # Compute CKA
    def centering(K):
        n = K.shape[0]
        unit = np.ones([n, n])
        H = np.eye(n) - unit / n
        return H @ K @ H

    def linear_kernel(X):
        return X @ X.T

    K_X = centering(linear_kernel(X))
    K_Y = centering(linear_kernel(Y))

    hsic = np.sum(K_X * K_Y)
    norm = np.sqrt(np.sum(K_X * K_X) * np.sum(K_Y * K_Y))
    return hsic / norm if norm > 0 else 0.0


def compute_gradient_similarity(
    model: torch.nn.Module,
    task_batches: Dict[str, Tuple[torch.Tensor, torch.Tensor]],
    loss_fn: torch.nn.Module
) -> Dict[Tuple[str, str], float]:
    """
    Compute pairwise gradient cosine similarity between tasks.
    """
    task_names = list(task_batches.keys())
    gradients = {}

    for task, (x, y) in task_batches.items():
        model.zero_grad()
        pred = model(x, task)
        loss = loss_fn(pred, y)
        loss.backward()
        # Collect gradients from shared parameters
        grad = torch.cat([
            p.grad.flatten()
            for p in model.get_shared_params()
            if p.grad is not None
        ])
        gradients[task] = grad.detach()

    similarities = {}
    for i, task_i in enumerate(task_names):
        for task_j in task_names[i + 1:]:
            g_i, g_j = gradients[task_i], gradients[task_j]
            cos_sim = torch.dot(g_i, g_j) / (g_i.norm() * g_j.norm())
            similarities[(task_i, task_j)] = cos_sim.item()

    return similarities
```

Gradient-Based Measures (During Training):
Gradient Cosine Similarity: $$\cos(g_i, g_j) = \frac{\nabla_{\theta} \mathcal{L}_i \cdot \nabla_{\theta} \mathcal{L}_j}{||\nabla_{\theta} \mathcal{L}_i|| \cdot ||\nabla_{\theta} \mathcal{L}_j||}$$
Gradient Conflict: Count how often gradients point in opposite directions for each parameter.
Task Affinity via Training: Measure how training on one task affects validation loss on others.
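Measuring task affinity via training can be sketched as a lookahead probe: take one gradient step on task $i$ alone and check whether task $j$'s loss improves, in the spirit of lookahead-based inter-task affinity scores used for task grouping. The sketch below assumes a multi-task `model(x, task)` interface (as in the gradient-similarity code above) and a generic `loss_fn`; both are placeholders.

```python
import copy
import torch

def lookahead_affinity(model, loss_fn, batch_i, batch_j,
                       task_i: str, task_j: str, lr: float = 1e-2) -> float:
    """Relative change in task_j's loss after one SGD step on task_i alone.
    Positive values suggest task_i's update helps task_j."""
    x_i, y_i = batch_i
    x_j, y_j = batch_j

    # Task_j's loss before the update
    with torch.no_grad():
        loss_j_before = loss_fn(model(x_j, task_j), y_j).item()

    # One manual SGD step on a copy of the model, using task_i only
    probe = copy.deepcopy(model)
    probe.zero_grad()
    loss_fn(probe(x_i, task_i), y_i).backward()
    with torch.no_grad():
        for p in probe.parameters():
            if p.grad is not None:
                p -= lr * p.grad

    # Task_j's loss after the update
    with torch.no_grad():
        loss_j_after = loss_fn(probe(x_j, task_j), y_j).item()

    return 1.0 - loss_j_after / (loss_j_before + 1e-8)
```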
When dealing with many tasks, not all should share parameters equally. Task clustering groups similar tasks to share parameters within clusters while maintaining separation between clusters.
Clustering Strategies:
Pre-determined Clustering: Group tasks by domain knowledge, e.g., all detection tasks in one cluster and all segmentation tasks in another.
Data-Driven Clustering: Cluster tasks from measured affinities, such as the transfer-based or gradient-based similarities above.
Learned Clustering: Learn task embeddings or gating variables jointly with the model so that sharing patterns emerge during training.
| Approach | When to Use | Pros/Cons |
|---|---|---|
| Domain knowledge | Clear task categories exist | Simple, interpretable / May miss subtle relationships |
| Transfer-based | Can afford pre-training | Captures true transfer / Computationally expensive |
| Gradient-based | During MTL training | Dynamic, adapts / Noisy, varies during training |
| Learned embeddings | Many tasks, complex relationships | Flexible / Adds complexity, may overfit |
import numpy as npfrom sklearn.cluster import AgglomerativeClustering, SpectralClusteringfrom scipy.cluster.hierarchy import dendrogram, linkageimport matplotlib.pyplot as plt def cluster_tasks_hierarchical( affinity_matrix: np.ndarray, task_names: list, n_clusters: int = None, distance_threshold: float = None) -> dict: """ Hierarchical clustering of tasks based on affinity. """ # Convert affinity to distance distance_matrix = 1 - (affinity_matrix + affinity_matrix.T) / 2 np.fill_diagonal(distance_matrix, 0) # Perform clustering clustering = AgglomerativeClustering( n_clusters=n_clusters, distance_threshold=distance_threshold, metric='precomputed', linkage='average' ) labels = clustering.fit_predict(distance_matrix) # Group tasks by cluster clusters = {} for task, label in zip(task_names, labels): if label not in clusters: clusters[label] = [] clusters[label].append(task) return clusters def visualize_task_relationships( affinity_matrix: np.ndarray, task_names: list): """Visualize task relationships with dendrogram and heatmap.""" fig, axes = plt.subplots(1, 2, figsize=(14, 5)) # Heatmap im = axes[0].imshow(affinity_matrix, cmap='RdYlGn') axes[0].set_xticks(range(len(task_names))) axes[0].set_yticks(range(len(task_names))) axes[0].set_xticklabels(task_names, rotation=45, ha='right') axes[0].set_yticklabels(task_names) axes[0].set_title('Task Affinity Matrix') plt.colorbar(im, ax=axes[0]) # Dendrogram distance = 1 - (affinity_matrix + affinity_matrix.T) / 2 condensed = distance[np.triu_indices(len(task_names), k=1)] Z = linkage(condensed, method='average') dendrogram(Z, labels=task_names, ax=axes[1]) axes[1].set_title('Task Hierarchy') axes[1].set_ylabel('Distance') plt.tight_layout() plt.savefig('task_relationships.png', dpi=150) plt.close()Once task relationships are understood, they should inform architecture design:
1. Hierarchical Sharing:
If tasks form a hierarchy (e.g., coarse classification → fine classification), share earlier layers for all tasks, with later layers shared within subtrees:
```
[Shared Base] → [Cluster-1 Layers] → Task-1a, Task-1b
              → [Cluster-2 Layers] → Task-2a, Task-2b
```
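A minimal sketch of this hierarchical layout is shown below, assuming two illustrative clusters of two tasks each; the module sizes, task names, and cluster assignment are placeholders rather than a prescribed architecture.

```python
import torch
import torch.nn as nn

class HierarchicalSharingMTL(nn.Module):
    """Shared base -> per-cluster layers -> per-task heads."""

    def __init__(self, input_dim: int, hidden_dim: int,
                 cluster_tasks: dict, task_output_dims: dict):
        super().__init__()
        self.task_to_cluster = {
            task: cluster
            for cluster, tasks in cluster_tasks.items() for task in tasks
        }
        # Shared base, used by every task
        self.base = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        # Cluster-specific middle layers, shared within each cluster
        self.cluster_layers = nn.ModuleDict({
            cluster: nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU())
            for cluster in cluster_tasks
        })
        # Task-specific heads
        self.heads = nn.ModuleDict({
            task: nn.Linear(hidden_dim, out_dim)
            for task, out_dim in task_output_dims.items()
        })

    def forward(self, x: torch.Tensor, task: str) -> torch.Tensor:
        h = self.base(x)
        h = self.cluster_layers[self.task_to_cluster[task]](h)
        return self.heads[task](h)

# Illustrative grouping matching the diagram above
model = HierarchicalSharingMTL(
    input_dim=64, hidden_dim=128,
    cluster_tasks={'cluster_1': ['task_1a', 'task_1b'],
                   'cluster_2': ['task_2a', 'task_2b']},
    task_output_dims={'task_1a': 10, 'task_1b': 5, 'task_2a': 2, 'task_2b': 3},
)
out = model(torch.randn(4, 64), task='task_2a')  # shape: (4, 2)
```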
2. Asymmetric Sharing:
If transfer is asymmetric (task A helps B but not vice versa), use an auxiliary task design: treat A as an auxiliary task whose down-weighted loss shapes the shared representation used for B, rather than forcing both tasks to share symmetrically. A minimal sketch of this weighting follows.
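A minimal sketch of such an objective, assuming `aux_weight` is a hand-tuned hyperparameter and the per-task losses are placeholders:

```python
import torch

def asymmetric_mtl_loss(loss_target: torch.Tensor,
                        loss_aux: torch.Tensor,
                        aux_weight: float = 0.3) -> torch.Tensor:
    """The target task (B) drives training; the auxiliary task (A) only
    nudges the shared representation through a down-weighted loss term."""
    return loss_target + aux_weight * loss_aux

# Illustrative usage with placeholder scalar losses
loss_b = torch.tensor(1.2, requires_grad=True)  # target task B
loss_a = torch.tensor(0.8, requires_grad=True)  # auxiliary task A
total = asymmetric_mtl_loss(loss_b, loss_a)
total.backward()
```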
3. Task-Conditional Computation:
Use task identity to modulate shared computation, for example by feeding a learned task embedding into shared layers that scale, shift, or gate intermediate features, as sketched below.
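A minimal sketch of task-conditional modulation, assuming a FiLM-style design in which a learned task embedding produces a per-task scale and shift applied to shared features; the dimensions and task names are illustrative.

```python
import torch
import torch.nn as nn

class TaskConditionedLayer(nn.Module):
    """Shared linear layer whose output is scaled and shifted per task."""

    def __init__(self, in_dim: int, out_dim: int, task_names: list):
        super().__init__()
        self.shared = nn.Linear(in_dim, out_dim)
        self.task_index = {t: i for i, t in enumerate(task_names)}
        # One (scale, shift) pair per task, read from a task embedding table
        self.task_embedding = nn.Embedding(len(task_names), 2 * out_dim)

    def forward(self, x: torch.Tensor, task: str) -> torch.Tensor:
        h = torch.relu(self.shared(x))
        idx = torch.tensor(self.task_index[task], device=x.device)
        scale, shift = self.task_embedding(idx).chunk(2)
        return scale * h + shift

# Illustrative usage
layer = TaskConditionedLayer(in_dim=32, out_dim=64,
                             task_names=['depth', 'segmentation'])
y = layer(torch.randn(4, 32), task='depth')  # shape: (4, 64)
```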
For complex task sets, consider automated architecture search methods that learn optimal sharing patterns. Approaches like AutoML-Zero and neural architecture search can discover effective MTL structures that might not be obvious from task similarity measures alone.
When tasks have conflicting requirements, their gradients interfere during training, leading to negative transfer. Several techniques address this:
Gradient Manipulation:
GradNorm: Dynamically adjust task weights to balance gradient magnitudes
PCGrad (Projecting Conflicting Gradients): When $g_i \cdot g_j < 0$, project $g_i$ onto the normal plane of $g_j$: $$g_i' = g_i - \frac{g_i \cdot g_j}{||g_j||^2} g_j$$
CAGrad (Conflict-Averse Gradient Descent): Find gradient direction in the cone of task gradients that maximizes worst-case improvement
Architecture Solutions:
Task-Specific Adapters: Add small task-specific modules to the shared architecture (a sketch follows this list)
Mixture of Experts: Route different inputs/tasks to specialized sub-networks
Modular Networks: Compose task-specific paths through shared modules
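As an example of the adapter idea from the list above, the sketch below inserts a small residual bottleneck per task on top of a shared backbone; the bottleneck size and residual form are common choices but are assumptions here, not a fixed specification.

```python
import torch
import torch.nn as nn

class TaskAdapter(nn.Module):
    """Small residual bottleneck applied to shared features for one task."""

    def __init__(self, dim: int, bottleneck: int = 16):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the shared representation intact
        return h + self.up(torch.relu(self.down(h)))

class SharedBackboneWithAdapters(nn.Module):
    """Shared backbone, per-task adapter, per-task head."""

    def __init__(self, input_dim: int, hidden_dim: int, task_output_dims: dict):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.adapters = nn.ModuleDict({
            t: TaskAdapter(hidden_dim) for t in task_output_dims
        })
        self.heads = nn.ModuleDict({
            t: nn.Linear(hidden_dim, d) for t, d in task_output_dims.items()
        })

    def forward(self, x: torch.Tensor, task: str) -> torch.Tensor:
        h = self.backbone(x)
        return self.heads[task](self.adapters[task](h))

# Illustrative usage
model = SharedBackboneWithAdapters(64, 128, {'task_a': 10, 'task_b': 2})
out = model(torch.randn(4, 64), task='task_b')  # shape: (4, 2)
```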
```python
import torch
from typing import Dict


def pcgrad_update(
    task_gradients: Dict[str, torch.Tensor]
) -> torch.Tensor:
    """
    PCGrad: project conflicting gradients, then average the results.
    """
    task_names = list(task_gradients.keys())
    modified_grads = {t: g.clone() for t, g in task_gradients.items()}

    for i, task_i in enumerate(task_names):
        g_i = modified_grads[task_i]
        for task_j in task_names:
            if task_i == task_j:
                continue
            g_j = task_gradients[task_j]
            # Check for conflict
            dot = torch.dot(g_i.flatten(), g_j.flatten())
            if dot < 0:
                # Project g_i onto the normal plane of g_j
                g_i = g_i - (dot / (g_j.norm() ** 2 + 1e-8)) * g_j
        modified_grads[task_i] = g_i

    # Average modified gradients
    final_grad = torch.stack(list(modified_grads.values())).mean(dim=0)
    return final_grad


def gradnorm_weights(
    task_losses: Dict[str, torch.Tensor],
    initial_losses: Dict[str, float],
    current_weights: Dict[str, torch.Tensor],
    alpha: float = 1.5
) -> Dict[str, torch.Tensor]:
    """
    GradNorm-style balanced task weights (simplified: weighted losses
    stand in for per-task gradient norms).
    """
    # Compute loss ratios
    loss_ratios = {
        t: task_losses[t] / initial_losses[t]
        for t in task_losses
    }
    mean_ratio = sum(loss_ratios.values()) / len(loss_ratios)

    # Compute relative inverse training rates
    inv_rates = {
        t: (loss_ratios[t] / mean_ratio) ** alpha
        for t in loss_ratios
    }

    # Target gradient norms
    mean_grad_norm = sum(
        current_weights[t] * task_losses[t]
        for t in task_losses
    ).item()

    new_weights = {}
    for t in task_losses:
        target = mean_grad_norm * inv_rates[t]
        new_weights[t] = current_weights[t] * target

    # Normalize weights
    weight_sum = sum(new_weights.values())
    return {t: w / weight_sum for t, w in new_weights.items()}
```

Understanding task relationships is essential for effective MTL. Next, we explore Optimization Challenges: the practical difficulties of training MTL systems and techniques to overcome them.