Having established why shared representations are valuable, we now examine how to implement sharing in practice. Multi-task learning architectures fall into two fundamental paradigms: hard parameter sharing and soft parameter sharing. The two represent different philosophies about how tasks should interact during learning.
Hard parameter sharing enforces identical parameters for shared components across all tasks—the same weights process all task data. Soft parameter sharing allows each task to maintain its own parameters but encourages similarity through regularization. Understanding the tradeoffs between these approaches is essential for designing effective MTL systems.
By the end of this page, you will understand: (1) hard parameter sharing architecture and its properties, (2) soft parameter sharing mechanisms, (3) theoretical tradeoffs between the paradigms, (4) when to use each approach, and (5) hybrid architectures that combine both.
Hard parameter sharing is the most common and historically dominant approach to multi-task learning. In this paradigm, tasks share a common set of hidden layers (the encoder), with task-specific output layers (heads) branching from the shared representation.
Architecture:
```
Input → [Shared Layers] → Shared Representation → [Task-1 Head] → Output-1
                                                → [Task-2 Head] → Output-2
                                                → [Task-N Head] → Output-N
```
All tasks use exactly the same weights for the shared layers. This creates a strong inductive bias: the representation must be useful for all tasks simultaneously.
Hard parameter sharing provides strong regularization. By forcing the model to find representations that work for all tasks, it significantly reduces the risk of overfitting. Theoretical analysis shows that the risk of overfitting shared parameters decreases with the number of tasks.
Mathematical Formulation:
Let $\theta_{\text{shared}}$ denote the shared parameters, $\theta_t$ the task-specific parameters for task $t$, and $\lambda_t \geq 0$ the weight assigned to task $t$'s loss. The hard sharing objective is:
$$\min_{\theta_{\text{shared}}, \{\theta_t\}} \sum_{t=1}^{T} \lambda_t \mathcal{L}_t(\theta_{\text{shared}}, \theta_t)$$
The shared parameters receive gradients from all tasks:
$$\nabla_{\theta_{\text{shared}}} = \sum_{t=1}^{T} \lambda_t \nabla_{\theta_{\text{shared}}} \mathcal{L}_t$$
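To make the gradient accumulation concrete, here is a minimal training-step sketch. The names `model`, `task_batches`, `loss_fns`, and `task_weights` are illustrative, assuming a model with a `forward(x, task)` interface like the implementation shown later on this page:

```python
import torch

def hard_sharing_step(model, task_batches, loss_fns, task_weights, optimizer):
    """One optimization step: every task loss backpropagates through the
    same shared encoder, so its gradient is the weighted sum over tasks."""
    optimizer.zero_grad()
    total_loss = 0.0
    for task, (x, y) in task_batches.items():
        pred = model(x, task)   # shared encoder followed by the task-specific head
        total_loss = total_loss + task_weights[task] * loss_fns[task](pred, y)
    total_loss.backward()       # shared parameters accumulate gradients from all tasks
    optimizer.step()
    return float(total_loss.detach())
```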
Advantages: fewer parameters than training separate models, strong regularization because a single representation must serve every task, and a simple implementation.
Disadvantages: limited task flexibility, higher risk of negative transfer when tasks need different representations, and potential gradient interference in the shared layers.
```python
import torch
import torch.nn as nn
from typing import Dict, List

class HardParameterSharingMTL(nn.Module):
    """
    Hard parameter sharing MTL architecture.
    All tasks share the same encoder parameters.
    """
    def __init__(
        self,
        input_dim: int,
        shared_hidden_dims: List[int],
        task_configs: Dict[str, Dict]
    ):
        super().__init__()
        # Build shared encoder
        layers = []
        prev_dim = input_dim
        for hidden_dim in shared_hidden_dims:
            layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.LayerNorm(hidden_dim),
                nn.GELU(),
                nn.Dropout(0.1)
            ])
            prev_dim = hidden_dim
        self.shared_encoder = nn.Sequential(*layers)
        self.representation_dim = prev_dim

        # Task-specific heads
        self.task_heads = nn.ModuleDict()
        for task_name, config in task_configs.items():
            self.task_heads[task_name] = nn.Sequential(
                nn.Linear(self.representation_dim, config['hidden_dim']),
                nn.GELU(),
                nn.Dropout(0.1),
                nn.Linear(config['hidden_dim'], config['output_dim'])
            )

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        """Get shared representation."""
        return self.shared_encoder(x)

    def forward(self, x: torch.Tensor, task: str) -> torch.Tensor:
        """Forward pass for specific task."""
        h = self.encode(x)
        return self.task_heads[task](h)

    def get_shared_params(self):
        """Return shared parameters for analysis."""
        return list(self.shared_encoder.parameters())

    def get_task_params(self, task: str):
        """Return task-specific parameters."""
        return list(self.task_heads[task].parameters())
```

Soft parameter sharing takes a different approach: each task has its own set of parameters, but these parameters are encouraged to be similar through regularization. This provides more flexibility than hard sharing while still enabling knowledge transfer.
Architecture:
```
Input → [Task-1 Encoder θ₁] → [Task-1 Head] → Output-1
      → [Task-2 Encoder θ₂] → [Task-2 Head] → Output-2
      → [Task-N Encoder θₙ] → [Task-N Head] → Output-N
```
With regularization: ||θ₁ - θ₂||² + ||θ₁ - θ₃||² + ...
Mathematical Formulation:
Each task $t$ has its own parameters $\theta_t$. The objective includes a regularization term encouraging parameter similarity:
$$\min_{\{\theta_t\}} \sum_{t=1}^{T} \mathcal{L}_t(\theta_t) + \lambda \sum_{i < j} \Omega(\theta_i, \theta_j)$$
Common regularization choices:
L2 Regularization: $$\Omega(\theta_i, \theta_j) = ||\theta_i - \theta_j||_2^2$$
Trace Norm Regularization: $$\Omega(\Theta) = ||\Theta||_{\text{trace}}$$ where $\Theta$ is the matrix of stacked task parameters.
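As a rough sketch of the trace-norm variant, one can stack the flattened per-task encoder parameters into a matrix and penalize its nuclear norm; the helper below is illustrative, not part of any specific library, and in practice it would typically be applied to selected layers rather than all parameters:

```python
import torch
import torch.nn as nn

def trace_norm_penalty(task_encoders: nn.ModuleDict) -> torch.Tensor:
    """Nuclear (trace) norm of the matrix whose rows are the flattened
    parameters of each task's encoder; a small value encourages the task
    parameters to lie in a shared low-rank subspace."""
    rows = [
        torch.cat([p.reshape(-1) for p in encoder.parameters()])
        for encoder in task_encoders.values()
    ]
    theta = torch.stack(rows)  # [num_tasks, num_params]
    return torch.linalg.matrix_norm(theta, ord='nuc')
```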
Advantages: high task flexibility (each encoder can diverge where its task requires), sharing strength that is tunable through $\lambda$, and lower risk of negative transfer.
Disadvantages: a larger parameter count (one encoder per task), an extra hyperparameter $\lambda$ to tune, and moderately higher implementation and training cost.
```python
import torch
import torch.nn as nn
from typing import Dict, List

class SoftParameterSharingMTL(nn.Module):
    """
    Soft parameter sharing MTL architecture.
    Each task has own encoder, regularized to be similar.
    """
    def __init__(
        self,
        input_dim: int,
        hidden_dims: List[int],
        task_configs: Dict[str, Dict],
        sharing_penalty: float = 0.01
    ):
        super().__init__()
        self.sharing_penalty = sharing_penalty
        self.task_names = list(task_configs.keys())

        # Separate encoder per task
        self.task_encoders = nn.ModuleDict()
        for task_name in self.task_names:
            layers = []
            prev_dim = input_dim
            for hidden_dim in hidden_dims:
                layers.extend([
                    nn.Linear(prev_dim, hidden_dim),
                    nn.LayerNorm(hidden_dim),
                    nn.GELU()
                ])
                prev_dim = hidden_dim
            self.task_encoders[task_name] = nn.Sequential(*layers)
        self.representation_dim = prev_dim

        # Task-specific heads
        self.task_heads = nn.ModuleDict()
        for task_name, config in task_configs.items():
            self.task_heads[task_name] = nn.Linear(
                self.representation_dim, config['output_dim']
            )

    def forward(self, x: torch.Tensor, task: str) -> torch.Tensor:
        h = self.task_encoders[task](x)
        return self.task_heads[task](h)

    def compute_sharing_loss(self) -> torch.Tensor:
        """Compute L2 regularization between task encoders."""
        loss = 0.0
        encoder_params = {
            task: list(enc.parameters())
            for task, enc in self.task_encoders.items()
        }
        for i, task_i in enumerate(self.task_names):
            for task_j in self.task_names[i+1:]:
                for p_i, p_j in zip(
                    encoder_params[task_i], encoder_params[task_j]
                ):
                    loss += torch.sum((p_i - p_j) ** 2)
        return self.sharing_penalty * loss
```

The choice between hard and soft parameter sharing involves fundamental tradeoffs in the bias-variance spectrum and robustness to task heterogeneity.
Generalization Bounds:
For hard parameter sharing with $T$ tasks and $n$ samples per task: $$\text{Error} \leq \hat{\mathcal{L}} + \mathcal{O}\left(\sqrt{\frac{C_{\text{shared}}}{Tn} + \frac{C_{\text{head}}}{n}}\right)$$
The shared complexity $C_{\text{shared}}$ is amortized across $T$ tasks.
For soft parameter sharing: $$\text{Error} \leq \hat{\mathcal{L}} + \mathcal{O}\left(\sqrt{\frac{C_{\text{encoder}}}{n}}\right) + \lambda \cdot \text{(parameter divergence)}$$
Each task bears its own encoder complexity, but the regularization term controls divergence.
| Criterion | Hard Sharing | Soft Sharing |
|---|---|---|
| Parameter count | Lower (shared encoder) | Higher (encoder per task) |
| Regularization strength | Very strong (enforced identity) | Tunable via λ |
| Task flexibility | Low (same representation) | High (can diverge) |
| Negative transfer risk | Higher (forced sharing) | Lower (can adapt) |
| Gradient dynamics | Potential interference | Independent per task |
| Implementation complexity | Simple | Moderate |
| Best for | Related tasks, limited data | Diverse tasks, ample data |
Hard sharing provides maximum sharing but minimum flexibility. Soft sharing provides tunable sharing through the regularization coefficient. As λ→∞ in soft sharing, it approaches hard sharing behavior. As λ→0, tasks become independent.
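A minimal training-step sketch for the soft-sharing case makes this continuum concrete. It assumes the `SoftParameterSharingMTL` module above; `task_batches` and `loss_fns` are illustrative names:

```python
import torch

def soft_sharing_step(model, task_batches, loss_fns, optimizer):
    """One optimization step for soft sharing: per-task losses plus the
    pairwise parameter-similarity penalty computed by the model."""
    optimizer.zero_grad()
    total_loss = 0.0
    for task, (x, y) in task_batches.items():
        pred = model(x, task)  # task-specific encoder and head
        total_loss = total_loss + loss_fns[task](pred, y)
    # model.sharing_penalty plays the role of lambda: 0 gives independent
    # per-task training, a very large value pushes the encoders toward
    # identical weights, approaching hard sharing.
    total_loss = total_loss + model.compute_sharing_loss()
    total_loss.backward()
    optimizer.step()
    return float(total_loss.detach())
```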
In practice, start with hard parameter sharing due to its simplicity and strong regularization. Switch to soft sharing only if you observe negative transfer or if the tasks clearly benefit from different representations. Use validation performance to guide the decision.
Modern MTL systems often combine elements of both paradigms, creating hybrid architectures that balance sharing benefits with task-specific flexibility.
Cross-Stitch Networks: Learn linear combinations of task-specific features at each layer: $$h_t^{(l+1)} = \sum_{t'} \alpha_{t,t'}^{(l)} \tilde{h}_{t'}^{(l+1)}$$ where $\alpha$ values are learned, allowing adaptive sharing.
Sluice Networks: Extend cross-stitch with subspace selection, allowing tasks to share different subspaces of representations.
NDDR-CNN (Neural Discriminative Dimensionality Reduction): Keeps a separate feature stream per task and fuses them layer by layer, concatenating the task features and applying 1×1 convolutions with batch normalization so the network learns how much to share at each layer.
Progressive Networks: Freeze task columns as they're trained, using lateral connections to transfer knowledge to new tasks.
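Before the cross-stitch implementation below, here is a minimal sketch of the progressive-network idea under simplifying assumptions (equal hidden sizes across columns; the class and argument names are illustrative): a new task column has its own layers plus lateral adapters that read activations from previously trained, frozen columns.

```python
import torch
import torch.nn as nn

class ProgressiveColumn(nn.Module):
    """One column of a progressive network (sketch). Earlier columns are
    trained first and frozen; this column adds lateral adapters that map
    their layer-1 activations into its own layer-2 computation."""
    def __init__(self, input_dim: int, hidden_dim: int,
                 output_dim: int, num_prev_columns: int):
        super().__init__()
        self.layer1 = nn.Linear(input_dim, hidden_dim)
        self.layer2 = nn.Linear(hidden_dim, hidden_dim)
        # One lateral adapter per frozen column (assumes matching hidden sizes).
        self.laterals = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(num_prev_columns)]
        )
        self.head = nn.Linear(hidden_dim, output_dim)

    def forward(self, x: torch.Tensor, prev_h1: list) -> torch.Tensor:
        # prev_h1: layer-1 activations from the frozen columns, computed
        # externally under torch.no_grad() so their weights stay fixed.
        h1 = torch.relu(self.layer1(x))
        h2 = self.layer2(h1) + sum(
            adapter(h) for adapter, h in zip(self.laterals, prev_h1)
        )
        return self.head(torch.relu(h2))
```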
```python
import torch
import torch.nn as nn

class CrossStitchUnit(nn.Module):
    """Cross-stitch unit for adaptive feature sharing."""
    def __init__(self, num_tasks: int):
        super().__init__()
        # Initialize with identity (no mixing initially)
        self.alpha = nn.Parameter(torch.eye(num_tasks))

    def forward(self, task_features: list) -> list:
        """
        Mix features across tasks.
        task_features: List of [batch, features] tensors
        """
        stacked = torch.stack(task_features, dim=1)  # [B, T, F]
        mixed = torch.einsum('ij,bjf->bif', self.alpha, stacked)
        return [mixed[:, i] for i in range(len(task_features))]

class CrossStitchMTL(nn.Module):
    """MTL with cross-stitch units for learned sharing."""
    def __init__(self, input_dim, hidden_dims, task_configs):
        super().__init__()
        self.task_names = list(task_configs.keys())
        num_tasks = len(self.task_names)

        # Per-task encoders with cross-stitch between layers
        self.layers = nn.ModuleList()
        self.cross_stitches = nn.ModuleList()
        prev_dim = input_dim
        for hidden_dim in hidden_dims:
            # Task-specific layers
            task_layers = nn.ModuleDict({
                task: nn.Sequential(
                    nn.Linear(prev_dim, hidden_dim),
                    nn.LayerNorm(hidden_dim),
                    nn.GELU()
                )
                for task in self.task_names
            })
            self.layers.append(task_layers)
            self.cross_stitches.append(CrossStitchUnit(num_tasks))
            prev_dim = hidden_dim

        # Task heads
        self.heads = nn.ModuleDict({
            task: nn.Linear(prev_dim, config['output_dim'])
            for task, config in task_configs.items()
        })

    def forward(self, x, task):
        features = {t: x for t in self.task_names}
        for task_layers, cross_stitch in zip(
            self.layers, self.cross_stitches
        ):
            # Apply task-specific transformations
            features = {
                t: task_layers[t](features[t])
                for t in self.task_names
            }
            # Cross-stitch mixing
            mixed = cross_stitch([features[t] for t in self.task_names])
            features = dict(zip(self.task_names, mixed))
        return self.heads[task](features[task])
```

You now understand the two fundamental paradigms for parameter sharing in MTL. Next, we'll explore Task Relationships: how to measure and leverage the structure of how tasks relate to each other.