Residual connections, introduced by He et al. (2015) for computer vision, are arguably the most important architectural innovation enabling very deep neural networks. Without them, training networks with hundreds or thousands of layers would be practically impossible: gradients degrade as they propagate through long chains of transformations.
In the Transformer architecture, residual connections appear at every sublayer—around self-attention, cross-attention (in decoders), and feed-forward networks. They enable the 6, 12, 24, or even 100+ layer architectures that power modern language models.
This page provides a comprehensive examination of residual connections: why they're necessary, how they work mathematically, their interaction with other components, and advanced variants used in state-of-the-art models.
Before residual connections, training networks beyond ~20 layers was extremely difficult. ResNet demonstrated 152-layer networks in 2015. Today, large language models like GPT-4 likely use hundreds of Transformer layers, made possible by residual connections.
Before understanding the solution, we must understand the problem that residual connections solve.
Naive Deep Networks
In a traditional deep network without residual connections, each layer transforms its input:
$$h_{l+1} = f_l(h_l)$$
For a network with $L$ layers: $$h_L = f_L(f_{L-1}(...f_1(h_0)...))$$
The Observed Degradation
Empirically, deeper networks (without residual connections) show unexpected behavior: beyond a certain depth, adding layers increases not only test error but also training error. This is not overfitting—the deeper network fails even to fit the training data as well as its shallower counterpart.
This is the degradation problem: deeper networks have more expressive power but are harder to train.
Why Does This Happen?
Consider the gradient flow through a deep network:
$$\frac{\partial L}{\partial h_l} = \frac{\partial L}{\partial h_L} \cdot \frac{\partial h_L}{\partial h_{L-1}} \cdot ... \cdot \frac{\partial h_{l+1}}{\partial h_l}$$
This is a product of many Jacobian matrices. If each Jacobian has norm slightly below 1, the product shrinks exponentially with depth and gradients vanish; if slightly above 1, the product grows exponentially and gradients explode. Either way, early layers receive an unusable training signal.
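The exponential effect of repeated Jacobian multiplication can be seen with a scalar stand-in (a sketch, not a full matrix analysis): treat each Jacobian as a single number slightly below or above 1.

```python
# Scalar stand-in for a product of layer Jacobians: even mild
# per-layer shrinkage or growth compounds exponentially with depth.
depth = 50

shrink = 0.9 ** depth  # each "Jacobian" scales gradients by 0.9 -> vanishing
grow = 1.1 ** depth    # each "Jacobian" scales gradients by 1.1 -> exploding

print(f"0.9^{depth} = {shrink:.2e}")
print(f"1.1^{depth} = {grow:.2e}")
```

At 50 layers the gradient is already attenuated by two orders of magnitude or amplified by more than a hundredfold.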
The Theoretical Insight
The key insight from He et al.: if a shallower network can achieve a certain error, a deeper network should be able to achieve at least the same error by having the additional layers learn the identity function.
But learning the identity function is surprisingly hard for a stack of nonlinear layers! The network must precisely arrange weights to output its input unchanged—a highly specific configuration in a vast parameter space.
The Solution: What if we made the identity mapping the default behavior of each layer?
If additional layers default to identity, they can only help (by learning useful transformations) and need not hurt (at worst, they pass the input through unchanged). This gives an informal monotonicity argument: a deeper network should be able to do at least as well as a shallower one.
The residual connection implements the identity-default principle with elegant simplicity.
Basic Formulation
Instead of learning a direct transformation $h_{l+1} = f_l(h_l)$, we learn a residual:
$$h_{l+1} = h_l + f_l(h_l)$$
where $f_l$ is called the residual function—it learns what to add to the input.
If the optimal transformation is close to identity, $f_l$ needs only to learn small adjustments. If the optimal transformation is identity exactly, $f_l = 0$ is a simple solution.
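This identity-default property is easy to verify (a minimal PyTorch sketch): if the residual branch's output layer is zeroed, the block reproduces its input exactly.

```python
import torch
import torch.nn as nn

dim = 16
branch = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

# Zero the final projection so f(h) = 0, hence h' = h + f(h) = h
nn.init.zeros_(branch[-1].weight)
nn.init.zeros_(branch[-1].bias)

h = torch.randn(4, dim)
h_next = h + branch(h)            # residual update
print(torch.allclose(h_next, h))  # True: the block defaults to identity
```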
Transformer-Specific Formulation
In Transformers, residual connections wrap both attention and FFN sublayers:
$$\text{After Attention}: h' = h + \text{MultiHeadAttention}(h)$$ $$\text{After FFN}: h'' = h' + \text{FFN}(h')$$
With layer normalization (Pre-LN style):
$$h' = h + \text{MultiHeadAttention}(\text{LayerNorm}(h))$$ $$h'' = h' + \text{FFN}(\text{LayerNorm}(h'))$$
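The Pre-LN wiring above can be sketched as a single Transformer layer in PyTorch (a simplified sketch: self-attention only, and the dimensions are illustrative):

```python
import torch
import torch.nn as nn

class PreLNLayer(nn.Module):
    """One Transformer layer with Pre-LN residual connections."""
    def __init__(self, d_model: int = 64, n_heads: int = 4, d_ff: int = 256):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h' = h + MultiHeadAttention(LayerNorm(h))
        n = self.norm1(h)
        h = h + self.attn(n, n, n, need_weights=False)[0]
        # h'' = h' + FFN(LayerNorm(h'))
        return h + self.ffn(self.norm2(h))

x = torch.randn(2, 10, 64)
out = PreLNLayer()(x)
print(out.shape)  # torch.Size([2, 10, 64])
```

Note that the skip path (`h + ...`) never passes through LayerNorm—only the branch input is normalized.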
Gradient Flow Analysis
The gradient through a residual block:
$$\frac{\partial h_{l+1}}{\partial h_l} = \frac{\partial (h_l + f_l(h_l))}{\partial h_l} = I + \frac{\partial f_l}{\partial h_l}$$
The Jacobian of each block is the identity plus the residual branch's Jacobian. This turns a product of potentially vanishing or exploding Jacobians into:
$$\frac{\partial L}{\partial h_l} = \frac{\partial L}{\partial h_L} \prod_{k=l}^{L-1} \left(I + \frac{\partial f_k}{\partial h_k}\right)$$
Expanding: $$= \frac{\partial L}{\partial h_L} \left(I + \sum_{k=l}^{L-1} \frac{\partial f_k}{\partial h_k} + \text{higher-order terms}\right)$$
The "Gradient Highway"
The identity term $I$ provides a direct path for gradients to flow from loss to any layer, unimpeded by intervening transformations. Each layer adds its contribution, but the baseline gradient is preserved.
For very deep networks, this means the gradient at layer 1 always contains the term $\frac{\partial L}{\partial h_L}$ directly—the final layer's gradient reaches all the way back through the identity path, regardless of what the intermediate layers do.
```python
import torch
import torch.nn as nn


class PlainBlock(nn.Module):
    """Block WITHOUT residual connection."""

    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim * 2),
            nn.ReLU(),
            nn.Linear(dim * 2, dim)
        )

    def forward(self, x):
        return self.net(x)


class ResidualBlock(nn.Module):
    """Block WITH residual connection."""

    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim * 2),
            nn.ReLU(),
            nn.Linear(dim * 2, dim)
        )

    def forward(self, x):
        return x + self.net(x)  # Key difference: + x


def analyze_gradient_flow():
    """Compare gradient magnitudes in deep networks with/without skip connections."""
    dim = 256
    depths = [4, 8, 16, 32, 64]
    results_plain = []
    results_residual = []

    for depth in depths:
        plain_layers = nn.ModuleList([PlainBlock(dim) for _ in range(depth)])
        res_layers = nn.ModuleList([ResidualBlock(dim) for _ in range(depth)])

        x = torch.randn(4, dim, requires_grad=True)

        # Forward through plain network
        h_plain = x.clone()
        for layer in plain_layers:
            h_plain = layer(h_plain)
        h_plain.sum().backward()
        results_plain.append(x.grad.norm().item())
        x.grad.zero_()

        # Forward through residual network
        h_res = x.clone()
        for layer in res_layers:
            h_res = layer(h_res)
        h_res.sum().backward()
        results_residual.append(x.grad.norm().item())

    print("Gradient Magnitude at Input Layer")
    print("-" * 50)
    print(f"{'Depth':<10} {'Plain':>15} {'Residual':>15}")
    print("-" * 50)
    for d, p, r in zip(depths, results_plain, results_residual):
        print(f"{d:<10} {p:>15.6f} {r:>15.6f}")

    return depths, results_plain, results_residual


def analyze_jacobian_spectrum():
    """Analyze singular value distribution of layer Jacobians."""
    dim = 64
    block_plain = PlainBlock(dim)
    block_residual = ResidualBlock(dim)

    x = torch.randn(1, dim, requires_grad=True)

    # Compute Jacobian for plain block
    jacobian_plain = torch.autograd.functional.jacobian(
        lambda inp: block_plain(inp).squeeze(0), x
    ).squeeze()

    # Compute Jacobian for residual block
    jacobian_residual = torch.autograd.functional.jacobian(
        lambda inp: block_residual(inp).squeeze(0), x
    ).squeeze()

    # Singular values
    sv_plain = torch.linalg.svdvals(jacobian_plain).numpy()
    sv_residual = torch.linalg.svdvals(jacobian_residual).numpy()

    print("Jacobian Singular Value Analysis")
    print("-" * 50)
    print(f"{'Metric':<20} {'Plain':>15} {'Residual':>15}")
    print("-" * 50)
    print(f"{'Max singular value':<20} {sv_plain.max():>15.4f} {sv_residual.max():>15.4f}")
    print(f"{'Min singular value':<20} {sv_plain.min():>15.4f} {sv_residual.min():>15.4f}")
    print(f"{'Condition number':<20} {sv_plain.max()/sv_plain.min():>15.4f} {sv_residual.max()/sv_residual.min():>15.4f}")

    # Residual Jacobian = I + J_f, so its singular values are shifted toward 1
    print("(Residual block has identity + residual, so singular values cluster around 1)")


# Run analyses
analyze_gradient_flow()
analyze_jacobian_spectrum()
```

Residual connections and layer normalization work together as a team. Their interplay is crucial for stable training.
The Combined System
In a Transformer layer, we have:
These components interact differently depending on their arrangement:
Post-LN (Original Transformer)
$$h' = \text{LayerNorm}(h + f(h))$$
The layer normalization operates on the sum of input and residual. This means activations stay normalized after every sublayer, but the identity path itself passes through LayerNorm: no term in the layer's Jacobian is a clean identity, and in practice Post-LN Transformers require careful learning-rate warmup to train stably.
Pre-LN (Modern Standard)
$$h' = h + f(\text{LayerNorm}(h))$$
The layer normalization only affects the residual branch: the skip path carries $h$ forward untouched, so the identity term in the Jacobian is preserved exactly. Training is typically stable without warmup, and a final LayerNorm is usually applied after the last layer to keep outputs well-scaled.
Why Pre-LN Provides Better Gradient Flow
In Pre-LN:
$$\frac{\partial h'}{\partial h} = I + \frac{\partial f(\text{LN}(h))}{\partial h}$$
The identity matrix $I$ is preserved—gradients can flow through unchanged.
In Post-LN:
$$\frac{\partial h'}{\partial h} = \frac{\partial \text{LN}(h + f(h))}{\partial h}$$
This is the Jacobian of LayerNorm applied to the sum. While not catastrophic, it distorts the gradient and doesn't preserve the identity path.
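The difference in gradient flow can be measured directly (a small experiment with FFN-only sublayers; exact numbers vary with initialization, but the contrast between the two arrangements shows up in the input-gradient norms as depth grows):

```python
import torch
import torch.nn as nn

def make_stack(d, depth, pre_ln):
    class Block(nn.Module):
        def __init__(self):
            super().__init__()
            self.norm = nn.LayerNorm(d)
            self.ffn = nn.Sequential(
                nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d)
            )
        def forward(self, h):
            if pre_ln:
                return h + self.ffn(self.norm(h))  # Pre-LN: identity preserved
            return self.norm(h + self.ffn(h))      # Post-LN: sum is normalized
    return nn.Sequential(*[Block() for _ in range(depth)])

torch.manual_seed(0)
d, depth = 64, 24
for pre_ln in (True, False):
    x = torch.randn(4, d, requires_grad=True)
    make_stack(d, depth, pre_ln)(x).sum().backward()
    style = "Pre-LN " if pre_ln else "Post-LN"
    print(f"{style} input-gradient norm: {x.grad.norm().item():.4f}")
```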
Practical Implications
Some architectures combine benefits: use Pre-LN for training stability but add Post-LN to certain layers for quality. Others use 'sandwich' normalization (norm before AND after sublayer). The optimal choice is still actively researched.
For very deep networks, even residual connections may not be sufficient for stable training. Various residual scaling strategies have been developed.
The Scale Accumulation Problem
With standard residual connections, each layer adds to the representation:
$$h_L = h_0 + \sum_{l=0}^{L-1} f_l(h_l)$$
If each $f_l$ contributes roughly independent variance, the total variance of $h_L$ grows roughly linearly with $L$. For a 96-layer network, activations at the final layers can become very large.
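The growth can be simulated directly (a sketch using independent random residual branches, so the covariance terms vanish in expectation):

```python
import torch

torch.manual_seed(0)
d = 512
h = torch.randn(1000, d)  # unit-variance input

for depth in (12, 48, 96):
    out = h.clone()
    for _ in range(depth):
        out = out + 0.5 * torch.randn_like(out)  # branch with Var = 0.25
    # Var[h_L] ≈ 1 + depth * 0.25 when the branches are independent
    print(f"depth {depth:3d}: variance ≈ {out.var().item():.2f}")
```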
Fixup Initialization
Zhang et al. (2019) proposed initializing the last layer of each residual branch to zero (or near-zero). This makes $f(h) \approx 0$ at initialization, so $h' \approx h$.
$$W_{\text{last}}^{(l)} = 0 \quad \text{at initialization}$$
Scaled Residual Connections
Scale the residual by a factor that decreases with depth:
$$h' = h + \alpha_l \cdot f(h)$$
Common schemes: a fixed constant such as $\alpha = 0.1$; depth-dependent decay $\alpha = 1/\sqrt{L}$; or a learned per-layer scalar (as in ReZero, discussed later).
DeepNet Scaling
Microsoft's DeepNet (Wang et al., 2022) uses specific scaling for 1000+ layer Transformers:
$$h' = \text{LayerNorm}(\alpha \cdot h + f(h))$$
where the skip path is up-weighted by $\alpha = (2N)^{1/4}$ for an encoder with $N$ layers, and the residual-branch weights are scaled down by $\beta = (8N)^{-1/4}$ at initialization (the DeepNorm rule).
This stabilizes both forward and backward passes for extreme depth.
```python
import torch
import torch.nn as nn
import math


class ScaledResidualBlock(nn.Module):
    """Residual block with configurable scaling strategies."""

    def __init__(
        self,
        d_model: int,
        d_ff: int,
        layer_idx: int,
        total_layers: int,
        scaling_type: str = "none"
    ):
        super().__init__()
        self.layer_norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model)
        )
        self.scaling_type = scaling_type
        self.layer_idx = layer_idx
        self.total_layers = total_layers

        # Compute scaling factor based on strategy
        if scaling_type == "none":
            self.alpha = 1.0
        elif scaling_type == "fixed":
            self.alpha = 0.1
        elif scaling_type == "sqrt_depth":
            self.alpha = 1.0 / math.sqrt(total_layers)
        elif scaling_type == "deepnet":
            # Simplified depth-damped scale; see DeepNetBlock for DeepNorm
            self.alpha = (2 * total_layers) ** (-0.25)
        elif scaling_type == "fixup":
            self.alpha = 1.0
        elif scaling_type == "learned":
            self.alpha = nn.Parameter(torch.ones(1))
        else:
            raise ValueError(f"Unknown scaling type: {scaling_type}")

        # Fixup-style init: zero the final layer so each block starts as identity
        if scaling_type in ["fixup", "deepnet"]:
            nn.init.zeros_(self.ffn[-1].weight)
            nn.init.zeros_(self.ffn[-1].bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pre-LN residual with scaling
        residual = self.ffn(self.layer_norm(x))
        return x + self.alpha * residual


class DeepNetBlock(nn.Module):
    """
    Block following DeepNorm (Wang et al., 2022) for 1000+ layer Transformers:
    the skip path is up-weighted by alpha inside a Post-LN, and the residual
    branch weights are scaled down by beta at initialization.
    """

    def __init__(self, d_model: int, d_ff: int, total_layers: int):
        super().__init__()
        # DeepNorm constants for an encoder with N = total_layers
        self.alpha = (2 * total_layers) ** 0.25
        beta = (8 * total_layers) ** (-0.25)

        self.layer_norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model)
        )
        # Scale residual-branch weights down by beta at initialization
        for linear in (self.ffn[0], self.ffn[2]):
            nn.init.xavier_normal_(linear.weight, gain=beta)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h' = LayerNorm(alpha * h + f(h))
        residual = self.ffn(x)
        return self.layer_norm(self.alpha * x + residual)


def compare_scaling_strategies():
    """Analyze activation magnitudes across strategies."""
    d_model = 512
    d_ff = 2048
    depths = [12, 48, 96]
    batch_size = 2
    seq_len = 64

    scaling_types = ["none", "fixed", "sqrt_depth", "deepnet"]

    print("Activation Magnitude at Final Layer")
    print("-" * 60)
    print(f"{'Depth':<10}", end="")
    for st in scaling_types:
        print(f"{st:>12}", end="")
    print()
    print("-" * 60)

    for depth in depths:
        print(f"{depth:<10}", end="")
        for scaling_type in scaling_types:
            layers = nn.ModuleList([
                ScaledResidualBlock(d_model, d_ff, i, depth, scaling_type)
                for i in range(depth)
            ])
            x = torch.randn(batch_size, seq_len, d_model)
            with torch.no_grad():
                h = x
                for layer in layers:
                    h = layer(h)
            magnitude = h.norm().item() / (batch_size * seq_len * d_model) ** 0.5
            print(f"{magnitude:>12.2f}", end="")
        print()


def analyze_variance_growth():
    """Track variance through layers with different scaling."""
    d_model = 256
    d_ff = 1024
    depth = 48

    x = torch.randn(8, 32, d_model)

    for scaling in ["none", "sqrt_depth"]:
        layers = nn.ModuleList([
            ScaledResidualBlock(d_model, d_ff, i, depth, scaling)
            for i in range(depth)
        ])
        variances = [x.var().item()]
        h = x
        with torch.no_grad():
            for layer in layers:
                h = layer(h)
                variances.append(h.var().item())

        print(f"Variance growth ({scaling}):")
        print(f"  Layer 0: {variances[0]:.4f}")
        print(f"  Layer {depth//4}: {variances[depth//4]:.4f}")
        print(f"  Layer {depth//2}: {variances[depth//2]:.4f}")
        print(f"  Layer {depth}: {variances[-1]:.4f}")
        print(f"  Growth factor: {variances[-1]/variances[0]:.2f}x")


compare_scaling_strategies()
analyze_variance_growth()
```

| Strategy | Scale Factor | Use Case | Notes |
|---|---|---|---|
| None | α = 1 | Shallow networks (≤24 layers) | Standard residual connection |
| Fixed | α = 0.1 | Medium depth networks | Simple and often effective |
| √Depth | α = 1/√L | Deep networks | Theoretical basis from signal propagation |
| Learned | α = parameter | Various | Flexibility but adds parameters |
| DeepNet | α = (2N)^(1/4) on the skip path | Very deep (100+ layers) | State-of-the-art for extreme depth |
Understanding why residual connections work requires diving into theoretical analysis of gradient flow and optimization landscapes.
Signal Propagation Theory
Consider a deep residual network at initialization. For gradients to flow effectively:
$$\mathbb{E}\left[\left|\frac{\partial L}{\partial h_l}\right|^2\right] \approx \mathbb{E}\left[\left|\frac{\partial L}{\partial h_L}\right|^2\right]$$
The expected gradient magnitude should be roughly constant across layers.
For a residual block $h' = h + f(h)$:
$$\text{Var}[h'] = \text{Var}[h] + \text{Var}[f(h)] + 2\text{Cov}[h, f(h)]$$
With careful initialization, $\text{Var}[f(h)]$ is small, and $h'$ has similar variance to $h$.
The Unrolled Gradient View
Residual networks can be viewed as an ensemble. The gradient at layer $l$:
$$\frac{\partial L}{\partial h_l} = \frac{\partial L}{\partial h_L} \prod_{k=l}^{L-1} \left(I + \frac{\partial f_k}{\partial h_k}\right)$$
Expanding the product:
$$= \frac{\partial L}{\partial h_L} \left(I + \sum_k J_k + \sum_{k<m} J_k J_m + ...\right)$$
This is a sum over all possible "paths" through the network:
Ensemble Interpretation
Veit et al. (2016) showed that residual networks behave like an ensemble of shallower networks. A ResNet with 54 residual blocks effectively contains paths of length 0 to 54, and the majority of gradient contributions come from paths of length roughly 10-30.
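Veit et al.'s path-length argument can be reproduced with a simple binomial count (a sketch: a network of L residual blocks contains C(L, k) paths of length k, and if each traversed block attenuates gradient magnitude by an assumed factor r < 1, the effective contribution of length-k paths is C(L, k)·r^k):

```python
from math import comb

L, r = 54, 0.6  # r: assumed per-block gradient attenuation (illustrative)

weights = [comb(L, k) * r ** k for k in range(L + 1)]
total = sum(weights)

# Mode of the effective path-length distribution
mode = max(range(L + 1), key=lambda k: weights[k])
mass_10_30 = sum(weights[10:31]) / total

print(f"most heavily weighted path length: {mode}")
print(f"fraction of gradient mass on paths of length 10-30: {mass_10_30:.2f}")
```

With this attenuation, nearly all of the gradient mass concentrates on moderately short paths, even though paths of the full depth exist.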
Loss Landscape Smoothness
Li et al. (2018) visualized loss landscapes and showed that networks with skip connections have dramatically smoother, more convex-looking landscapes, while removing the skip connections makes the landscape chaotic and filled with sharp minima.
The smoothness comes from the gradient highway—even if local curvature is high, there's always a path for gradient signal.
Think of a residual network as implicitly containing 2^L subnetworks (all possible combinations of including or skipping each residual block). During training, gradient flows through all paths simultaneously. At inference, the full ensemble is evaluated. This explains both trainability and generalization benefits.
Dynamical Mean-Field Theory Analysis
Recent work applies tools from physics (mean-field theory) to analyze infinitely deep residual networks:
At infinite depth, with residual scaling $\alpha = 1/\sqrt{L}$, the forward dynamics converge to a well-defined continuous-depth limit rather than exploding or collapsing.
This leads to Neural ODEs (Chen et al., 2018):
$$\frac{dh}{dt} = f(h, t)$$
where $t$ is a continuous depth parameter. The standard residual network is an Euler discretization of this ODE.
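The Euler correspondence is easy to verify numerically: integrating $dh/dt = f(h)$ with step size $\Delta t = 1/L$ is exactly an $L$-block residual network with scaled residuals. A sketch with $f(h) = -h$, whose exact solution is $h(1) = h(0)\,e^{-1}$:

```python
import math

h0 = 2.0
f = lambda h: -h  # simple residual function with a known ODE solution

for L in (4, 16, 64, 256):
    h = h0
    for _ in range(L):
        h = h + (1.0 / L) * f(h)  # one residual block = one Euler step
    print(f"L = {L:3d}: h(1) ≈ {h:.5f}")

print(f"exact    : {h0 * math.exp(-1):.5f}")
```

As the network gets deeper (smaller step size), the residual stack converges to the exact ODE solution.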
The basic residual connection has spawned many variations, each addressing specific challenges.
Dense Connections (DenseNet)
Instead of connecting only to the previous layer, connect to all previous layers:
$$h_l = f_l([h_0, h_1, ..., h_{l-1}])$$
where $[...]$ denotes concatenation. This provides maximum gradient flow but increases memory usage.
Highway Networks
Add a learned gate that interpolates between identity and transformation:
$$h' = g(h) \odot f(h) + (1 - g(h)) \odot h$$
where $g(h) = \sigma(W_g h + b_g)$ is a learned gate. This was a precursor to residual connections.
ReZero
Initialize the residual branch to output exactly zero, using a single learned scalar:
$$h' = h + \alpha \cdot f(h)$$
where $\alpha = 0$ at initialization. The network starts as identity; nonzero $\alpha$ is learned.
Pre-Activation Residual Blocks
Move batch/layer normalization and activation before the convolutional/linear layers:
$$h' = h + W_2 \cdot \sigma(\text{Norm}(W_1 \cdot \sigma(\text{Norm}(h))))$$
This is the Pre-LN pattern applied to the residual block itself, not just around it.
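A pre-activation residual block following this pattern might look like the sketch below (the LayerNorm and GELU choices are assumptions, since the equation above leaves Norm and σ generic):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PreActBlock(nn.Module):
    """Pre-activation residual block: Norm and activation precede each linear."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.w1 = nn.Linear(d_model, d_ff)
        self.norm2 = nn.LayerNorm(d_ff)
        self.w2 = nn.Linear(d_ff, d_model)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h' = h + W2 · σ(Norm(W1 · σ(Norm(h))))
        z = F.gelu(self.norm1(h))
        z = F.gelu(self.norm2(self.w1(z)))
        return h + self.w2(z)

x = torch.randn(2, 8, 64)
out = PreActBlock(64, 256)(x)
print(out.shape)  # torch.Size([2, 8, 64])
```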
```python
import torch
import torch.nn as nn


class HighwayBlock(nn.Module):
    """
    Highway network block: learned gating between identity and transform.

    h' = g(h) * f(h) + (1 - g(h)) * h

    When g→1: behaves like a standard transform
    When g→0: behaves like identity (pass-through)
    """

    def __init__(self, d_model: int):
        super().__init__()
        # Transform
        self.transform = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.ReLU()
        )
        # Gate (initialized to favor identity)
        self.gate = nn.Linear(d_model, d_model)
        nn.init.constant_(self.gate.bias, -2.0)  # sigmoid(-2) ≈ 0.12

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        transform = self.transform(x)
        gate = torch.sigmoid(self.gate(x))
        return gate * transform + (1 - gate) * x


class ReZeroBlock(nn.Module):
    """
    ReZero: residual connection with a learned scalar initialized to 0.

    h' = h + alpha * f(h)   where alpha starts at 0

    Enables training very deep networks without warmup.
    """

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.layer_norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model)
        )
        # The key ingredient: starts at 0
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = self.ffn(self.layer_norm(x))
        return x + self.alpha * residual


class DenseBlock(nn.Module):
    """
    DenseNet-style block: concatenate outputs from all previous sublayers.

    Warning: memory intensive as sequence length and depth increase.
    """

    def __init__(self, d_model: int, growth_rate: int, num_layers: int):
        super().__init__()
        self.layers = nn.ModuleList()
        current_dim = d_model
        for i in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.LayerNorm(current_dim),
                nn.Linear(current_dim, growth_rate),
                nn.ReLU()
            ))
            current_dim += growth_rate
        self.final_dim = current_dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = [x]
        for layer in self.layers:
            concat = torch.cat(features, dim=-1)
            features.append(layer(concat))
        return torch.cat(features, dim=-1)


def compare_residual_variants():
    """Compare different residual connection styles."""
    d_model = 256
    d_ff = 512
    batch, seq = 2, 8
    x = torch.randn(batch, seq, d_model)

    # Standard residual
    class StandardRes(nn.Module):
        def __init__(self):
            super().__init__()
            self.ffn = nn.Sequential(
                nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
            )

        def forward(self, x):
            return x + self.ffn(x)

    blocks = {
        'Standard': StandardRes(),
        'Highway': HighwayBlock(d_model),
        'ReZero': ReZeroBlock(d_model, d_ff),
    }

    print("Residual Variant Comparison")
    print("-" * 50)
    for name, block in blocks.items():
        out = block(x)
        # How much does the residual change the input?
        diff = (out - x).norm() / x.norm()
        params = sum(p.numel() for p in block.parameters())
        print(f"{name:15s} | Relative change: {diff.item():.4f} | Params: {params:,}")

    # At initialization, ReZero should produce zero change
    print("(ReZero has alpha=0 at init, so no change initially)")


compare_residual_variants()
```

Residual connections are the architectural foundation enabling deep Transformers. Let's consolidate the key insights: residual connections make identity the default behavior of each layer; the identity term in every block's Jacobian creates a gradient highway from the loss to every layer; Pre-LN placement preserves that highway while Post-LN does not; and for extreme depth, scaling strategies such as Fixup, ReZero, and DeepNet keep activations and gradients under control.
Looking Ahead
We've now examined all the core components of a Transformer layer: attention mechanisms (from previous modules), layer normalization, feed-forward networks, and residual connections. In the next page, we'll bring everything together to understand the full architecture as an integrated system.
You now understand why residual connections are essential for deep network training, how they provide gradient highways, their interaction with layer normalization, and various scaling strategies for extreme depth. Next, we'll synthesize all components into the full Transformer architecture.