Residual connections, introduced by He et al. (2015) for computer vision, are arguably the most important architectural innovation enabling very deep neural networks. Without them, training networks with hundreds or thousands of layers would be practically impossible: gradients degrade as they propagate through long chains of transformations.
In the Transformer architecture, residual connections appear at every sublayer—around self-attention, cross-attention (in decoders), and feed-forward networks. They enable the 6, 12, 24, or even 100+ layer architectures that power modern language models.
This page provides a comprehensive examination of residual connections: why they're necessary, how they work mathematically, their interaction with other components, and advanced variants used in state-of-the-art models.
Before residual connections, training networks beyond ~20 layers was extremely difficult. ResNet demonstrated 152-layer networks in 2015. Today, large language models like GPT-4 likely use hundreds of Transformer layers, made possible by residual connections.
Before understanding the solution, we must understand the problem that residual connections solve.
Naive Deep Networks
In a traditional deep network without residual connections, each layer transforms its input:
$$h_{l+1} = f_l(h_l)$$
For a network with $L$ layers: $$h_L = f_L(f_{L-1}(...f_1(h_0)...))$$
The Observed Degradation
Empirically, deeper networks (without residual connections) show unexpected behavior: beyond a certain depth, adding layers increases not only test error but also training error. This is not overfitting—the deeper network fails even to fit the training data as well as its shallower counterpart.
This is the degradation problem: deeper networks have more expressive power but are harder to train.
Why Does This Happen?
Consider the gradient flow through a deep network:
$$\frac{\partial L}{\partial h_l} = \frac{\partial L}{\partial h_L} \cdot \frac{\partial h_L}{\partial h_{L-1}} \cdot ... \cdot \frac{\partial h_{l+1}}{\partial h_l}$$
This is a product of many Jacobian matrices. If each Jacobian has norm slightly below 1, the product shrinks exponentially with depth and gradients vanish; if slightly above 1, the product grows exponentially and gradients explode. Either way, early layers receive an unusable training signal.
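The exponential effect of repeated Jacobian multiplication can be seen with a scalar stand-in (a sketch, not a full matrix analysis): treat each Jacobian as a single number slightly below or above 1.

```python
# Scalar stand-in for a product of layer Jacobians: even mild
# per-layer shrinkage or growth compounds exponentially with depth.
depth = 50

shrink = 0.9 ** depth  # each "Jacobian" scales gradients by 0.9 -> vanishing
grow = 1.1 ** depth    # each "Jacobian" scales gradients by 1.1 -> exploding

print(f"0.9^{depth} = {shrink:.2e}")
print(f"1.1^{depth} = {grow:.2e}")
```

At 50 layers the gradient is already attenuated by two orders of magnitude or amplified by more than a hundredfold.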
The Theoretical Insight
The key insight from He et al.: if a shallower network can achieve a certain error, a deeper network should be able to achieve at least the same error by having the additional layers learn the identity function.
But learning the identity function is surprisingly hard for a stack of nonlinear layers! The network must precisely arrange weights to output its input unchanged—a highly specific configuration in a vast parameter space.
The Solution: What if we made the identity mapping the default behavior of each layer?
If additional layers default to identity, they can only help (by learning useful transformations) and need not hurt (at worst, they pass the input through unchanged). This gives an informal monotonicity argument: a deeper network should be able to do at least as well as a shallower one.
The residual connection implements the identity-default principle with elegant simplicity.
Basic Formulation
Instead of learning a direct transformation $h_{l+1} = f_l(h_l)$, we learn a residual:
$$h_{l+1} = h_l + f_l(h_l)$$
where $f_l$ is called the residual function—it learns what to add to the input.
If the optimal transformation is close to identity, $f_l$ needs only to learn small adjustments. If the optimal transformation is identity exactly, $f_l = 0$ is a simple solution.
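This identity-default property is easy to verify (a minimal PyTorch sketch): if the residual branch's output layer is zeroed, the block reproduces its input exactly.

```python
import torch
import torch.nn as nn

dim = 16
branch = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

# Zero the final projection so f(h) = 0, hence h' = h + f(h) = h
nn.init.zeros_(branch[-1].weight)
nn.init.zeros_(branch[-1].bias)

h = torch.randn(4, dim)
h_next = h + branch(h)            # residual update
print(torch.allclose(h_next, h))  # True: the block defaults to identity
```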
Transformer-Specific Formulation
In Transformers, residual connections wrap both attention and FFN sublayers:
$$\text{After Attention}: h' = h + \text{MultiHeadAttention}(h)$$ $$\text{After FFN}: h'' = h' + \text{FFN}(h')$$
With layer normalization (Pre-LN style):
$$h' = h + \text{MultiHeadAttention}(\text{LayerNorm}(h))$$ $$h'' = h' + \text{FFN}(\text{LayerNorm}(h'))$$
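The Pre-LN wiring above can be sketched as a single Transformer layer in PyTorch (a simplified sketch: self-attention only, and the dimensions are illustrative):

```python
import torch
import torch.nn as nn

class PreLNLayer(nn.Module):
    """One Transformer layer with Pre-LN residual connections."""
    def __init__(self, d_model: int = 64, n_heads: int = 4, d_ff: int = 256):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h' = h + MultiHeadAttention(LayerNorm(h))
        n = self.norm1(h)
        h = h + self.attn(n, n, n, need_weights=False)[0]
        # h'' = h' + FFN(LayerNorm(h'))
        return h + self.ffn(self.norm2(h))

x = torch.randn(2, 10, 64)
out = PreLNLayer()(x)
print(out.shape)  # torch.Size([2, 10, 64])
```

Note that the skip path (`h + ...`) never passes through LayerNorm—only the branch input is normalized.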
Gradient Flow Analysis
The gradient through a residual block:
$$\frac{\partial h_{l+1}}{\partial h_l} = \frac{\partial (h_l + f_l(h_l))}{\partial h_l} = I + \frac{\partial f_l}{\partial h_l}$$
The Jacobian of each block is the identity plus the residual branch's Jacobian. This turns a product of potentially vanishing or exploding Jacobians into:
$$\frac{\partial L}{\partial h_l} = \frac{\partial L}{\partial h_L} \prod_{k=l}^{L-1} \left(I + \frac{\partial f_k}{\partial h_k}\right)$$
Expanding: $$= \frac{\partial L}{\partial h_L} \left(I + \sum_{k=l}^{L-1} \frac{\partial f_k}{\partial h_k} + \text{higher-order terms}\right)$$
The "Gradient Highway"
The identity term $I$ provides a direct path for gradients to flow from loss to any layer, unimpeded by intervening transformations. Each layer adds its contribution, but the baseline gradient is preserved.
For very deep networks, this means the gradient at layer 1 always contains the term $\frac{\partial L}{\partial h_L}$ directly—the final layer's gradient reaches all the way back through the identity path, regardless of what the intermediate layers do.
```python
import torch
import torch.nn as nn


class PlainBlock(nn.Module):
    """Block WITHOUT residual connection."""

    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim * 2),
            nn.ReLU(),
            nn.Linear(dim * 2, dim)
        )

    def forward(self, x):
        return self.net(x)


class ResidualBlock(nn.Module):
    """Block WITH residual connection."""

    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim * 2),
            nn.ReLU(),
            nn.Linear(dim * 2, dim)
        )

    def forward(self, x):
        return x + self.net(x)  # Key difference: + x


def analyze_gradient_flow():
    """Compare gradient magnitudes in deep networks with/without skip connections."""
    dim = 256
    depths = [4, 8, 16, 32, 64]
    results_plain = []
    results_residual = []

    for depth in depths:
        plain_layers = nn.ModuleList([PlainBlock(dim) for _ in range(depth)])
        res_layers = nn.ModuleList([ResidualBlock(dim) for _ in range(depth)])

        x = torch.randn(4, dim, requires_grad=True)

        # Forward through plain network
        h_plain = x.clone()
        for layer in plain_layers:
            h_plain = layer(h_plain)
        h_plain.sum().backward()
        results_plain.append(x.grad.norm().item())
        x.grad.zero_()

        # Forward through residual network
        h_res = x.clone()
        for layer in res_layers:
            h_res = layer(h_res)
        h_res.sum().backward()
        results_residual.append(x.grad.norm().item())

    print("Gradient Magnitude at Input Layer")
    print("-" * 50)
    print(f"{'Depth':<10} {'Plain':>15} {'Residual':>15}")
    print("-" * 50)
    for d, p, r in zip(depths, results_plain, results_residual):
        print(f"{d:<10} {p:>15.6f} {r:>15.6f}")

    return depths, results_plain, results_residual


def analyze_jacobian_spectrum():
    """Analyze singular value distribution of layer Jacobians."""
    dim = 64
    block_plain = PlainBlock(dim)
    block_residual = ResidualBlock(dim)

    x = torch.randn(1, dim, requires_grad=True)

    # Compute Jacobian for plain block
    jacobian_plain = torch.autograd.functional.jacobian(
        lambda inp: block_plain(inp).squeeze(0), x
    ).squeeze()

    # Compute Jacobian for residual block
    jacobian_residual = torch.autograd.functional.jacobian(
        lambda inp: block_residual(inp).squeeze(0), x
    ).squeeze()

    # Singular values
    sv_plain = torch.linalg.svdvals(jacobian_plain).numpy()
    sv_residual = torch.linalg.svdvals(jacobian_residual).numpy()

    print("Jacobian Singular Value Analysis")
    print("-" * 50)
    print(f"{'Metric':<20} {'Plain':>15} {'Residual':>15}")
    print("-" * 50)
    print(f"{'Max singular value':<20} {sv_plain.max():>15.4f} {sv_residual.max():>15.4f}")
    print(f"{'Min singular value':<20} {sv_plain.min():>15.4f} {sv_residual.min():>15.4f}")
    print(f"{'Condition number':<20} {sv_plain.max()/sv_plain.min():>15.4f} {sv_residual.max()/sv_residual.min():>15.4f}")

    # Residual Jacobian = I + J_f, so its singular values are shifted toward 1
    print("(Residual block has identity + residual, so singular values cluster around 1)")


# Run analyses
analyze_gradient_flow()
analyze_jacobian_spectrum()
```

Residual connections and layer normalization work together as a team. Their interplay is crucial for stable training.
The Combined System
In a Transformer layer, we have:
These components interact differently depending on their arrangement:
Post-LN (Original Transformer)
$$h' = \text{LayerNorm}(h + f(h))$$
The layer normalization operates on the sum of input and residual. This means activations stay normalized after every sublayer, but the identity path itself passes through LayerNorm: no term in the layer's Jacobian is a clean identity, and in practice Post-LN Transformers require careful learning-rate warmup to train stably.
Pre-LN (Modern Standard)
$$h' = h + f(\text{LayerNorm}(h))$$
The layer normalization only affects the residual branch: the skip path carries $h$ forward untouched, so the identity term in the Jacobian is preserved exactly. Training is typically stable without warmup, and a final LayerNorm is usually applied after the last layer to keep outputs well-scaled.
Why Pre-LN Provides Better Gradient Flow
In Pre-LN:
$$\frac{\partial h'}{\partial h} = I + \frac{\partial f(\text{LN}(h))}{\partial h}$$
The identity matrix $I$ is preserved—gradients can flow through unchanged.
In Post-LN:
$$\frac{\partial h'}{\partial h} = \frac{\partial \text{LN}(h + f(h))}{\partial h}$$
This is the Jacobian of LayerNorm applied to the sum. While not catastrophic, it distorts the gradient and doesn't preserve the identity path.
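The difference in gradient flow can be measured directly (a small experiment with FFN-only sublayers; exact numbers vary with initialization, but the contrast between the two arrangements shows up in the input-gradient norms as depth grows):

```python
import torch
import torch.nn as nn

def make_stack(d, depth, pre_ln):
    class Block(nn.Module):
        def __init__(self):
            super().__init__()
            self.norm = nn.LayerNorm(d)
            self.ffn = nn.Sequential(
                nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d)
            )
        def forward(self, h):
            if pre_ln:
                return h + self.ffn(self.norm(h))  # Pre-LN: identity preserved
            return self.norm(h + self.ffn(h))      # Post-LN: sum is normalized
    return nn.Sequential(*[Block() for _ in range(depth)])

torch.manual_seed(0)
d, depth = 64, 24
for pre_ln in (True, False):
    x = torch.randn(4, d, requires_grad=True)
    make_stack(d, depth, pre_ln)(x).sum().backward()
    style = "Pre-LN " if pre_ln else "Post-LN"
    print(f"{style} input-gradient norm: {x.grad.norm().item():.4f}")
```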
Practical Implications
Some architectures combine benefits: use Pre-LN for training stability but add Post-LN to certain layers for quality. Others use 'sandwich' normalization (norm before AND after sublayer). The optimal choice is still actively researched.
For very deep networks, even residual connections may not be sufficient for stable training. Various residual scaling strategies have been developed.
The Scale Accumulation Problem
With standard residual connections, each layer adds to the representation:
$$h_L = h_0 + \sum_{l=0}^{L-1} f_l(h_l)$$
If each $f_l$ contributes roughly independent variance, the total variance of $h_L$ grows roughly linearly with $L$. For a 96-layer network, activations at the final layers can become very large.
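The growth can be simulated directly (a sketch using independent random residual branches, so the covariance terms vanish in expectation):

```python
import torch

torch.manual_seed(0)
d = 512
h = torch.randn(1000, d)  # unit-variance input

for depth in (12, 48, 96):
    out = h.clone()
    for _ in range(depth):
        out = out + 0.5 * torch.randn_like(out)  # branch with Var = 0.25
    # Var[h_L] ≈ 1 + depth * 0.25 when the branches are independent
    print(f"depth {depth:3d}: variance ≈ {out.var().item():.2f}")
```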
Fixup Initialization
Zhang et al. (2019) proposed initializing the last layer of each residual branch to zero (or near-zero). This makes $f(h) \approx 0$ at initialization, so $h' \approx h$.
$$W_{\text{last}}^{(l)} = 0 \quad \text{at initialization}$$
Scaled Residual Connections
Scale the residual by a factor that decreases with depth:
$$h' = h + \alpha_l \cdot f(h)$$
Common schemes: a fixed constant such as $\alpha = 0.1$; depth-dependent decay $\alpha = 1/\sqrt{L}$; or a learned per-layer scalar (as in ReZero, discussed later).
DeepNet Scaling
Microsoft's DeepNet (Wang et al., 2022) uses specific scaling for 1000+ layer Transformers:
$$h' = \text{LayerNorm}(\alpha \cdot h + f(h))$$
where the skip path is up-weighted by $\alpha = (2N)^{1/4}$ for an encoder with $N$ layers, and the residual-branch weights are scaled down by $\beta = (8N)^{-1/4}$ at initialization (the DeepNorm rule).
This stabilizes both forward and backward passes for extreme depth.
```python
import torch
import torch.nn as nn
import math


class ScaledResidualBlock(nn.Module):
    """Residual block with configurable scaling strategies."""

    def __init__(
        self,
        d_model: int,
        d_ff: int,
        layer_idx: int,
        total_layers: int,
        scaling_type: str = "none"
    ):
        super().__init__()
        self.layer_norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model)
        )
        self.scaling_type = scaling_type
        self.layer_idx = layer_idx
        self.total_layers = total_layers

        # Compute scaling factor based on strategy
        if scaling_type == "none":
            self.alpha = 1.0
        elif scaling_type == "fixed":
            self.alpha = 0.1
        elif scaling_type == "sqrt_depth":
            self.alpha = 1.0 / math.sqrt(total_layers)
        elif scaling_type == "deepnet":
            # Simplified depth-damped scale; see DeepNetBlock for DeepNorm
            self.alpha = (2 * total_layers) ** (-0.25)
        elif scaling_type == "fixup":
            self.alpha = 1.0
        elif scaling_type == "learned":
            self.alpha = nn.Parameter(torch.ones(1))
        else:
            raise ValueError(f"Unknown scaling type: {scaling_type}")

        # Fixup-style init: zero the final layer so each block starts as identity
        if scaling_type in ["fixup", "deepnet"]:
            nn.init.zeros_(self.ffn[-1].weight)
            nn.init.zeros_(self.ffn[-1].bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pre-LN residual with scaling
        residual = self.ffn(self.layer_norm(x))
        return x + self.alpha * residual


class DeepNetBlock(nn.Module):
    """
    Block following DeepNorm (Wang et al., 2022) for 1000+ layer Transformers:
    the skip path is up-weighted by alpha inside a Post-LN, and the residual
    branch weights are scaled down by beta at initialization.
    """

    def __init__(self, d_model: int, d_ff: int, total_layers: int):
        super().__init__()
        # DeepNorm constants for an encoder with N = total_layers
        self.alpha = (2 * total_layers) ** 0.25
        beta = (8 * total_layers) ** (-0.25)

        self.layer_norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model)
        )
        # Scale residual-branch weights down by beta at initialization
        for linear in (self.ffn[0], self.ffn[2]):
            nn.init.xavier_normal_(linear.weight, gain=beta)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h' = LayerNorm(alpha * h + f(h))
        residual = self.ffn(x)
        return self.layer_norm(self.alpha * x + residual)


def compare_scaling_strategies():
    """Analyze activation magnitudes across strategies."""
    d_model = 512
    d_ff = 2048
    depths = [12, 48, 96]
    batch_size = 2
    seq_len = 64

    scaling_types = ["none", "fixed", "sqrt_depth", "deepnet"]

    print("Activation Magnitude at Final Layer")
    print("-" * 60)
    print(f"{'Depth':<10}", end="")
    for st in scaling_types:
        print(f"{st:>12}", end="")
    print()
    print("-" * 60)

    for depth in depths:
        print(f"{depth:<10}", end="")
        for scaling_type in scaling_types:
            layers = nn.ModuleList([
                ScaledResidualBlock(d_model, d_ff, i, depth, scaling_type)
                for i in range(depth)
            ])
            x = torch.randn(batch_size, seq_len, d_model)
            with torch.no_grad():
                h = x
                for layer in layers:
                    h = layer(h)
            magnitude = h.norm().item() / (batch_size * seq_len * d_model) ** 0.5
            print(f"{magnitude:>12.2f}", end="")
        print()


def analyze_variance_growth():
    """Track variance through layers with different scaling."""
    d_model = 256
    d_ff = 1024
    depth = 48

    x = torch.randn(8, 32, d_model)

    for scaling in ["none", "sqrt_depth"]:
        layers = nn.ModuleList([
            ScaledResidualBlock(d_model, d_ff, i, depth, scaling)
            for i in range(depth)
        ])
        variances = [x.var().item()]
        h = x
        with torch.no_grad():
            for layer in layers:
                h = layer(h)
                variances.append(h.var().item())

        print(f"Variance growth ({scaling}):")
        print(f"  Layer 0: {variances[0]:.4f}")
        print(f"  Layer {depth//4}: {variances[depth//4]:.4f}")
        print(f"  Layer {depth//2}: {variances[depth//2]:.4f}")
        print(f"  Layer {depth}: {variances[-1]:.4f}")
        print(f"  Growth factor: {variances[-1]/variances[0]:.2f}x")


compare_scaling_strategies()
analyze_variance_growth()
```

| Strategy | Scale Factor | Use Case | Notes |
|---|---|---|---|
| None | α = 1 | Shallow networks (≤24 layers) | Standard residual connection |
| Fixed | α = 0.1 | Medium depth networks | Simple and often effective |
| √Depth | α = 1/√L | Deep networks | Theoretical basis from signal propagation |
| Learned | α = parameter | Various | Flexibility but adds parameters |
| DeepNet | α = (2N)^(1/4) on the skip path | Very deep (100+ layers) | State-of-the-art for extreme depth |
Understanding why residual connections work requires diving into theoretical analysis of gradient flow and optimization landscapes.
Signal Propagation Theory
Consider a deep residual network at initialization. For gradients to flow effectively:
$$\mathbb{E}\left[\left|\frac{\partial L}{\partial h_l}\right|^2\right] \approx \mathbb{E}\left[\left|\frac{\partial L}{\partial h_L}\right|^2\right]$$
The expected gradient magnitude should be roughly constant across layers.
For a residual block $h' = h + f(h)$:
$$\text{Var}[h'] = \text{Var}[h] + \text{Var}[f(h)] + 2\text{Cov}[h, f(h)]$$
With careful initialization, $\text{Var}[f(h)]$ is small, and $h'$ has similar variance to $h$.
The Unrolled Gradient View
Residual networks can be viewed as an ensemble. The gradient at layer $l$:
$$\frac{\partial L}{\partial h_l} = \frac{\partial L}{\partial h_L} \prod_{k=l}^{L-1} \left(I + \frac{\partial f_k}{\partial h_k}\right)$$
Expanding the product:
$$= \frac{\partial L}{\partial h_L} \left(I + \sum_k J_k + \sum_{k<m} J_k J_m + ...\right)$$
This is a sum over all possible "paths" through the network:
Ensemble Interpretation
Veit et al. (2016) showed that residual networks behave like an ensemble of shallower networks. A ResNet with 54 residual blocks effectively contains paths of length 0 to 54, and the majority of gradient contributions come from paths of length roughly 10-30.
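Veit et al.'s path-length argument can be reproduced with a simple binomial count (a sketch: a network of L residual blocks contains C(L, k) paths of length k, and if each traversed block attenuates gradient magnitude by an assumed factor r < 1, the effective contribution of length-k paths is C(L, k)·r^k):

```python
from math import comb

L, r = 54, 0.6  # r: assumed per-block gradient attenuation (illustrative)

weights = [comb(L, k) * r ** k for k in range(L + 1)]
total = sum(weights)

# Mode of the effective path-length distribution
mode = max(range(L + 1), key=lambda k: weights[k])
mass_10_30 = sum(weights[10:31]) / total

print(f"most heavily weighted path length: {mode}")
print(f"fraction of gradient mass on paths of length 10-30: {mass_10_30:.2f}")
```

With this attenuation, nearly all of the gradient mass concentrates on moderately short paths, even though paths of the full depth exist.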
Loss Landscape Smoothness
Li et al. (2018) visualized loss landscapes and showed that networks with skip connections have dramatically smoother, more convex-looking landscapes, while removing the skip connections makes the landscape chaotic and filled with sharp minima.
The smoothness comes from the gradient highway—even if local curvature is high, there's always a path for gradient signal.
Think of a residual network as implicitly containing 2^L subnetworks (all possible combinations of including or skipping each residual block). During training, gradient flows through all paths simultaneously. At inference, the full ensemble is evaluated. This explains both trainability and generalization benefits.
Dynamical Mean-Field Theory Analysis
Recent work applies tools from physics (mean-field theory) to analyze infinitely deep residual networks:
At infinite depth, with residual scaling $\alpha = 1/\sqrt{L}$, the forward dynamics converge to a well-defined continuous-depth limit rather than exploding or collapsing.
This leads to Neural ODEs (Chen et al., 2018):
$$\frac{dh}{dt} = f(h, t)$$
where $t$ is a continuous depth parameter. The standard residual network is an Euler discretization of this ODE.
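The Euler correspondence is easy to verify numerically: integrating $dh/dt = f(h)$ with step size $\Delta t = 1/L$ is exactly an $L$-block residual network with scaled residuals. A sketch with $f(h) = -h$, whose exact solution is $h(1) = h(0)\,e^{-1}$:

```python
import math

h0 = 2.0
f = lambda h: -h  # simple residual function with a known ODE solution

for L in (4, 16, 64, 256):
    h = h0
    for _ in range(L):
        h = h + (1.0 / L) * f(h)  # one residual block = one Euler step
    print(f"L = {L:3d}: h(1) ≈ {h:.5f}")

print(f"exact    : {h0 * math.exp(-1):.5f}")
```

As the network gets deeper (smaller step size), the residual stack converges to the exact ODE solution.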
The basic residual connection has spawned many variations, each addressing specific challenges.
Dense Connections (DenseNet)
Instead of connecting only to the previous layer, connect to all previous layers:
$$h_l = f_l([h_0, h_1, ..., h_{l-1}])$$
where $[...]$ denotes concatenation. This provides maximum gradient flow but increases memory usage.
Highway Networks
Add a learned gate that interpolates between identity and transformation:
$$h' = g(h) \odot f(h) + (1 - g(h)) \odot h$$
where $g(h) = \sigma(W_g h + b_g)$ is a learned gate. This was a precursor to residual connections.
ReZero
Initialize the residual branch to output exactly zero, using a single learned scalar:
$$h' = h + \alpha \cdot f(h)$$
where $\alpha = 0$ at initialization. The network starts as identity; nonzero $\alpha$ is learned.
Pre-Activation Residual Blocks
Move batch/layer normalization and activation before the convolutional/linear layers:
$$h' = h + W_2 \cdot \sigma(\text{Norm}(W_1 \cdot \sigma(\text{Norm}(h))))$$
This is the Pre-LN pattern applied to the residual block itself, not just around it.
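A pre-activation residual block following this pattern might look like the sketch below (the LayerNorm and GELU choices are assumptions, since the equation above leaves Norm and σ generic):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PreActBlock(nn.Module):
    """Pre-activation residual block: Norm and activation precede each linear."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.w1 = nn.Linear(d_model, d_ff)
        self.norm2 = nn.LayerNorm(d_ff)
        self.w2 = nn.Linear(d_ff, d_model)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h' = h + W2 · σ(Norm(W1 · σ(Norm(h))))
        z = F.gelu(self.norm1(h))
        z = F.gelu(self.norm2(self.w1(z)))
        return h + self.w2(z)

x = torch.randn(2, 8, 64)
out = PreActBlock(64, 256)(x)
print(out.shape)  # torch.Size([2, 8, 64])
```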
```python
import torch
import torch.nn as nn


class HighwayBlock(nn.Module):
    """
    Highway network block: learned gating between identity and transform.

    h' = g(h) * f(h) + (1 - g(h)) * h

    When g→1: behaves like a standard transform
    When g→0: behaves like identity (pass-through)
    """

    def __init__(self, d_model: int):
        super().__init__()
        # Transform
        self.transform = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.ReLU()
        )
        # Gate (initialized to favor identity)
        self.gate = nn.Linear(d_model, d_model)
        nn.init.constant_(self.gate.bias, -2.0)  # sigmoid(-2) ≈ 0.12

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        transform = self.transform(x)
        gate = torch.sigmoid(self.gate(x))
        return gate * transform + (1 - gate) * x


class ReZeroBlock(nn.Module):
    """
    ReZero: residual connection with a learned scalar initialized to 0.

    h' = h + alpha * f(h)   where alpha starts at 0

    Enables training very deep networks without warmup.
    """

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.layer_norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model)
        )
        # The key ingredient: starts at 0
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = self.ffn(self.layer_norm(x))
        return x + self.alpha * residual


class DenseBlock(nn.Module):
    """
    DenseNet-style block: concatenate outputs from all previous sublayers.

    Warning: memory intensive as sequence length and depth increase.
    """

    def __init__(self, d_model: int, growth_rate: int, num_layers: int):
        super().__init__()
        self.layers = nn.ModuleList()
        current_dim = d_model
        for i in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.LayerNorm(current_dim),
                nn.Linear(current_dim, growth_rate),
                nn.ReLU()
            ))
            current_dim += growth_rate
        self.final_dim = current_dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = [x]
        for layer in self.layers:
            concat = torch.cat(features, dim=-1)
            features.append(layer(concat))
        return torch.cat(features, dim=-1)


def compare_residual_variants():
    """Compare different residual connection styles."""
    d_model = 256
    d_ff = 512
    batch, seq = 2, 8
    x = torch.randn(batch, seq, d_model)

    # Standard residual
    class StandardRes(nn.Module):
        def __init__(self):
            super().__init__()
            self.ffn = nn.Sequential(
                nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
            )

        def forward(self, x):
            return x + self.ffn(x)

    blocks = {
        'Standard': StandardRes(),
        'Highway': HighwayBlock(d_model),
        'ReZero': ReZeroBlock(d_model, d_ff),
    }

    print("Residual Variant Comparison")
    print("-" * 50)
    for name, block in blocks.items():
        out = block(x)
        # How much does the residual change the input?
        diff = (out - x).norm() / x.norm()
        params = sum(p.numel() for p in block.parameters())
        print(f"{name:15s} | Relative change: {diff.item():.4f} | Params: {params:,}")

    # At initialization, ReZero should produce zero change
    print("(ReZero has alpha=0 at init, so no change initially)")


compare_residual_variants()
```

Residual connections are the architectural foundation enabling deep Transformers. Let's consolidate the key insights: residual connections make identity the default behavior of each layer; the identity term in every block's Jacobian creates a gradient highway from the loss to every layer; Pre-LN placement preserves that highway while Post-LN does not; and for extreme depth, scaling strategies such as Fixup, ReZero, and DeepNet keep activations and gradients under control.
Looking Ahead
We've now examined all the core components of a Transformer layer: attention mechanisms (from previous modules), layer normalization, feed-forward networks, and residual connections. In the next page, we'll bring everything together to understand the full architecture as an integrated system.
You now understand why residual connections are essential for deep network training, how they provide gradient highways, their interaction with layer normalization, and various scaling strategies for extreme depth. Next, we'll synthesize all components into the full Transformer architecture.