Having established why shared representations are valuable, we now examine how to implement sharing in practice. Multi-task learning architectures fall into two fundamental paradigms: hard parameter sharing and soft parameter sharing. The two represent different philosophies about how tasks should interact during learning.
Hard parameter sharing enforces identical parameters for shared components across all tasks—the same weights process all task data. Soft parameter sharing allows each task to maintain its own parameters but encourages similarity through regularization. Understanding the tradeoffs between these approaches is essential for designing effective MTL systems.
By the end of this page, you will understand: (1) hard parameter sharing architecture and its properties, (2) soft parameter sharing mechanisms, (3) theoretical tradeoffs between the paradigms, (4) when to use each approach, and (5) hybrid architectures that combine both.
Hard parameter sharing is the most common and historically dominant approach to multi-task learning. In this paradigm, tasks share a common set of hidden layers (the encoder), with task-specific output layers (heads) branching from the shared representation.
Architecture:
```
Input → [Shared Layers] → Shared Representation → [Task-1 Head] → Output-1
                                                → [Task-2 Head] → Output-2
                                                → [Task-N Head] → Output-N
```
All tasks use exactly the same weights for the shared layers. This creates a strong inductive bias: the representation must be useful for all tasks simultaneously.
Hard parameter sharing provides strong regularization. By forcing the model to find representations that work for all tasks, it significantly reduces the risk of overfitting. Theoretical analysis shows that the risk of overfitting shared parameters decreases with the number of tasks.
Mathematical Formulation:
Let $\theta_{\text{shared}}$ denote the shared parameters, $\theta_t$ the task-specific parameters for task $t$, and $\lambda_t \geq 0$ the weight assigned to task $t$'s loss. The hard sharing objective is:
$$\min_{\theta_{\text{shared}}, \{\theta_t\}} \sum_{t=1}^{T} \lambda_t \mathcal{L}_t(\theta_{\text{shared}}, \theta_t)$$
The shared parameters receive gradients from all tasks:
$$\nabla_{\theta_{\text{shared}}} = \sum_{t=1}^{T} \lambda_t \nabla_{\theta_{\text{shared}}} \mathcal{L}_t$$
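To make the gradient accumulation concrete, here is a minimal training-step sketch. The names `model`, `task_batches`, `loss_fns`, and `task_weights` are illustrative, assuming a model with a `forward(x, task)` interface like the implementation shown later on this page:

```python
import torch

def hard_sharing_step(model, task_batches, loss_fns, task_weights, optimizer):
    """One optimization step: every task loss backpropagates through the
    same shared encoder, so its gradient is the weighted sum over tasks."""
    optimizer.zero_grad()
    total_loss = 0.0
    for task, (x, y) in task_batches.items():
        pred = model(x, task)   # shared encoder followed by the task-specific head
        total_loss = total_loss + task_weights[task] * loss_fns[task](pred, y)
    total_loss.backward()       # shared parameters accumulate gradients from all tasks
    optimizer.step()
    return float(total_loss.detach())
```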
Advantages: fewer parameters than training separate models, strong regularization because a single representation must serve every task, and a simple implementation.
Disadvantages: limited task flexibility, higher risk of negative transfer when tasks need different representations, and potential gradient interference in the shared layers.
```python
import torch
import torch.nn as nn
from typing import Dict, List

class HardParameterSharingMTL(nn.Module):
    """
    Hard parameter sharing MTL architecture.
    All tasks share the same encoder parameters.
    """
    def __init__(
        self,
        input_dim: int,
        shared_hidden_dims: List[int],
        task_configs: Dict[str, Dict]
    ):
        super().__init__()
        # Build shared encoder
        layers = []
        prev_dim = input_dim
        for hidden_dim in shared_hidden_dims:
            layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.LayerNorm(hidden_dim),
                nn.GELU(),
                nn.Dropout(0.1)
            ])
            prev_dim = hidden_dim
        self.shared_encoder = nn.Sequential(*layers)
        self.representation_dim = prev_dim

        # Task-specific heads
        self.task_heads = nn.ModuleDict()
        for task_name, config in task_configs.items():
            self.task_heads[task_name] = nn.Sequential(
                nn.Linear(self.representation_dim, config['hidden_dim']),
                nn.GELU(),
                nn.Dropout(0.1),
                nn.Linear(config['hidden_dim'], config['output_dim'])
            )

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        """Get shared representation."""
        return self.shared_encoder(x)

    def forward(self, x: torch.Tensor, task: str) -> torch.Tensor:
        """Forward pass for specific task."""
        h = self.encode(x)
        return self.task_heads[task](h)

    def get_shared_params(self):
        """Return shared parameters for analysis."""
        return list(self.shared_encoder.parameters())

    def get_task_params(self, task: str):
        """Return task-specific parameters."""
        return list(self.task_heads[task].parameters())
```

Soft parameter sharing takes a different approach: each task has its own set of parameters, but these parameters are encouraged to be similar through regularization. This provides more flexibility than hard sharing while still enabling knowledge transfer.
Architecture:
```
Input → [Task-1 Encoder θ₁] → [Task-1 Head] → Output-1
      → [Task-2 Encoder θ₂] → [Task-2 Head] → Output-2
      → [Task-N Encoder θₙ] → [Task-N Head] → Output-N
```
With regularization: ||θ₁ - θ₂||² + ||θ₁ - θ₃||² + ...
Mathematical Formulation:
Each task $t$ has its own parameters $\theta_t$. The objective includes a regularization term encouraging parameter similarity:
$$\min_{\{\theta_t\}} \sum_{t=1}^{T} \mathcal{L}_t(\theta_t) + \lambda \sum_{i < j} \Omega(\theta_i, \theta_j)$$
Common regularization choices:
L2 Regularization: $$\Omega(\theta_i, \theta_j) = ||\theta_i - \theta_j||_2^2$$
Trace Norm Regularization: $$\Omega(\Theta) = ||\Theta||_{\text{trace}}$$ where $\Theta$ is the matrix of stacked task parameters.
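As a rough sketch of the trace-norm variant, one can stack the flattened per-task encoder parameters into a matrix and penalize its nuclear norm; the helper below is illustrative, not part of any specific library, and in practice it would typically be applied to selected layers rather than all parameters:

```python
import torch
import torch.nn as nn

def trace_norm_penalty(task_encoders: nn.ModuleDict) -> torch.Tensor:
    """Nuclear (trace) norm of the matrix whose rows are the flattened
    parameters of each task's encoder; a small value encourages the task
    parameters to lie in a shared low-rank subspace."""
    rows = [
        torch.cat([p.reshape(-1) for p in encoder.parameters()])
        for encoder in task_encoders.values()
    ]
    theta = torch.stack(rows)  # [num_tasks, num_params]
    return torch.linalg.matrix_norm(theta, ord='nuc')
```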
Advantages: high task flexibility (each encoder can diverge where its task requires), sharing strength that is tunable through $\lambda$, and lower risk of negative transfer.
Disadvantages: a larger parameter count (one encoder per task), an extra hyperparameter $\lambda$ to tune, and moderately higher implementation and training cost.
```python
import torch
import torch.nn as nn
from typing import Dict, List

class SoftParameterSharingMTL(nn.Module):
    """
    Soft parameter sharing MTL architecture.
    Each task has own encoder, regularized to be similar.
    """
    def __init__(
        self,
        input_dim: int,
        hidden_dims: List[int],
        task_configs: Dict[str, Dict],
        sharing_penalty: float = 0.01
    ):
        super().__init__()
        self.sharing_penalty = sharing_penalty
        self.task_names = list(task_configs.keys())

        # Separate encoder per task
        self.task_encoders = nn.ModuleDict()
        for task_name in self.task_names:
            layers = []
            prev_dim = input_dim
            for hidden_dim in hidden_dims:
                layers.extend([
                    nn.Linear(prev_dim, hidden_dim),
                    nn.LayerNorm(hidden_dim),
                    nn.GELU()
                ])
                prev_dim = hidden_dim
            self.task_encoders[task_name] = nn.Sequential(*layers)
        self.representation_dim = prev_dim

        # Task-specific heads
        self.task_heads = nn.ModuleDict()
        for task_name, config in task_configs.items():
            self.task_heads[task_name] = nn.Linear(
                self.representation_dim, config['output_dim']
            )

    def forward(self, x: torch.Tensor, task: str) -> torch.Tensor:
        h = self.task_encoders[task](x)
        return self.task_heads[task](h)

    def compute_sharing_loss(self) -> torch.Tensor:
        """Compute L2 regularization between task encoders."""
        loss = 0.0
        encoder_params = {
            task: list(enc.parameters())
            for task, enc in self.task_encoders.items()
        }
        for i, task_i in enumerate(self.task_names):
            for task_j in self.task_names[i+1:]:
                for p_i, p_j in zip(
                    encoder_params[task_i], encoder_params[task_j]
                ):
                    loss += torch.sum((p_i - p_j) ** 2)
        return self.sharing_penalty * loss
```

The choice between hard and soft parameter sharing involves fundamental tradeoffs in the bias-variance spectrum and robustness to task heterogeneity.
Generalization Bounds:
For hard parameter sharing with $T$ tasks and $n$ samples per task: $$\text{Error} \leq \hat{\mathcal{L}} + \mathcal{O}\left(\sqrt{\frac{C_{\text{shared}}}{Tn} + \frac{C_{\text{head}}}{n}}\right)$$
The shared complexity $C_{\text{shared}}$ is amortized across $T$ tasks.
For soft parameter sharing: $$\text{Error} \leq \hat{\mathcal{L}} + \mathcal{O}\left(\sqrt{\frac{C_{\text{encoder}}}{n}}\right) + \lambda \cdot \text{(parameter divergence)}$$
Each task bears its own encoder complexity, but the regularization term controls divergence.
| Criterion | Hard Sharing | Soft Sharing |
|---|---|---|
| Parameter count | Lower (shared encoder) | Higher (encoder per task) |
| Regularization strength | Very strong (enforced identity) | Tunable via λ |
| Task flexibility | Low (same representation) | High (can diverge) |
| Negative transfer risk | Higher (forced sharing) | Lower (can adapt) |
| Gradient dynamics | Potential interference | Independent per task |
| Implementation complexity | Simple | Moderate |
| Best for | Related tasks, limited data | Diverse tasks, ample data |
Hard sharing provides maximum sharing but minimum flexibility. Soft sharing provides tunable sharing through the regularization coefficient. As λ→∞ in soft sharing, it approaches hard sharing behavior. As λ→0, tasks become independent.
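A minimal training-step sketch for the soft-sharing case makes this continuum concrete. It assumes the `SoftParameterSharingMTL` module above; `task_batches` and `loss_fns` are illustrative names:

```python
import torch

def soft_sharing_step(model, task_batches, loss_fns, optimizer):
    """One optimization step for soft sharing: per-task losses plus the
    pairwise parameter-similarity penalty computed by the model."""
    optimizer.zero_grad()
    total_loss = 0.0
    for task, (x, y) in task_batches.items():
        pred = model(x, task)  # task-specific encoder and head
        total_loss = total_loss + loss_fns[task](pred, y)
    # model.sharing_penalty plays the role of lambda: 0 gives independent
    # per-task training, a very large value pushes the encoders toward
    # identical weights, approaching hard sharing.
    total_loss = total_loss + model.compute_sharing_loss()
    total_loss.backward()
    optimizer.step()
    return float(total_loss.detach())
```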
In practice, start with hard parameter sharing due to its simplicity and strong regularization. Switch to soft sharing only if you observe negative transfer or if the tasks clearly benefit from different representations. Use validation performance to guide the decision.
Modern MTL systems often combine elements of both paradigms, creating hybrid architectures that balance sharing benefits with task-specific flexibility.
Cross-Stitch Networks: Learn linear combinations of task-specific features at each layer: $$h_t^{(l+1)} = \sum_{t'} \alpha_{t,t'}^{(l)} \tilde{h}_{t'}^{(l+1)}$$ where $\alpha$ values are learned, allowing adaptive sharing.
Sluice Networks: Extend cross-stitch with subspace selection, allowing tasks to share different subspaces of representations.
NDDR-CNN (Neural Discriminative Dimensionality Reduction): Keeps a separate feature stream per task and fuses them layer by layer, concatenating the task features and applying 1×1 convolutions with batch normalization so the network learns how much to share at each layer.
Progressive Networks: Freeze task columns as they're trained, using lateral connections to transfer knowledge to new tasks.
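Before the cross-stitch implementation below, here is a minimal sketch of the progressive-network idea under simplifying assumptions (equal hidden sizes across columns; the class and argument names are illustrative): a new task column has its own layers plus lateral adapters that read activations from previously trained, frozen columns.

```python
import torch
import torch.nn as nn

class ProgressiveColumn(nn.Module):
    """One column of a progressive network (sketch). Earlier columns are
    trained first and frozen; this column adds lateral adapters that map
    their layer-1 activations into its own layer-2 computation."""
    def __init__(self, input_dim: int, hidden_dim: int,
                 output_dim: int, num_prev_columns: int):
        super().__init__()
        self.layer1 = nn.Linear(input_dim, hidden_dim)
        self.layer2 = nn.Linear(hidden_dim, hidden_dim)
        # One lateral adapter per frozen column (assumes matching hidden sizes).
        self.laterals = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(num_prev_columns)]
        )
        self.head = nn.Linear(hidden_dim, output_dim)

    def forward(self, x: torch.Tensor, prev_h1: list) -> torch.Tensor:
        # prev_h1: layer-1 activations from the frozen columns, computed
        # externally under torch.no_grad() so their weights stay fixed.
        h1 = torch.relu(self.layer1(x))
        h2 = self.layer2(h1) + sum(
            adapter(h) for adapter, h in zip(self.laterals, prev_h1)
        )
        return self.head(torch.relu(h2))
```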
```python
import torch
import torch.nn as nn

class CrossStitchUnit(nn.Module):
    """Cross-stitch unit for adaptive feature sharing."""
    def __init__(self, num_tasks: int):
        super().__init__()
        # Initialize with identity (no mixing initially)
        self.alpha = nn.Parameter(torch.eye(num_tasks))

    def forward(self, task_features: list) -> list:
        """
        Mix features across tasks.
        task_features: List of [batch, features] tensors
        """
        stacked = torch.stack(task_features, dim=1)  # [B, T, F]
        mixed = torch.einsum('ij,bjf->bif', self.alpha, stacked)
        return [mixed[:, i] for i in range(len(task_features))]

class CrossStitchMTL(nn.Module):
    """MTL with cross-stitch units for learned sharing."""
    def __init__(self, input_dim, hidden_dims, task_configs):
        super().__init__()
        self.task_names = list(task_configs.keys())
        num_tasks = len(self.task_names)

        # Per-task encoders with cross-stitch between layers
        self.layers = nn.ModuleList()
        self.cross_stitches = nn.ModuleList()
        prev_dim = input_dim
        for hidden_dim in hidden_dims:
            # Task-specific layers
            task_layers = nn.ModuleDict({
                task: nn.Sequential(
                    nn.Linear(prev_dim, hidden_dim),
                    nn.LayerNorm(hidden_dim),
                    nn.GELU()
                )
                for task in self.task_names
            })
            self.layers.append(task_layers)
            self.cross_stitches.append(CrossStitchUnit(num_tasks))
            prev_dim = hidden_dim

        # Task heads
        self.heads = nn.ModuleDict({
            task: nn.Linear(prev_dim, config['output_dim'])
            for task, config in task_configs.items()
        })

    def forward(self, x, task):
        features = {t: x for t in self.task_names}
        for task_layers, cross_stitch in zip(
            self.layers, self.cross_stitches
        ):
            # Apply task-specific transformations
            features = {
                t: task_layers[t](features[t])
                for t in self.task_names
            }
            # Cross-stitch mixing
            mixed = cross_stitch([features[t] for t in self.task_names])
            features = dict(zip(self.task_names, mixed))
        return self.heads[task](features[task])
```

You now understand the two fundamental paradigms for parameter sharing in MTL. Next, we'll explore Task Relationships: how to measure and leverage the structure of how tasks relate to each other.