Standard dropout randomly zeros neuron activations. But why stop there? The core insight—introducing structured noise to prevent overfitting—can be applied to virtually any component of a neural network.
Researchers have developed numerous "dropout variants," each dropping a different part of the network: individual weights, whole feature maps, contiguous spatial blocks, entire layers or paths, attention connections, and token embeddings.
Each variant addresses specific challenges in different architectures. Understanding when and why to use each one is crucial for effective deep learning practice.
This page covers: (1) DropConnect and its differences from standard dropout; (2) Spatial dropout for convolutional networks; (3) DropBlock for better regularization in CNNs; (4) Stochastic depth and DropPath for skip-connected networks; and (5) Specialized dropout variants for transformers and other architectures.
DropConnect extends dropout by applying the mask to weights rather than activations. Instead of zeroing neurons, we zero individual connections.
Mathematical Formulation:
Standard dropout: $$\mathbf{y} = (\mathbf{x} \odot \mathbf{m}) \mathbf{W}$$
DropConnect: $$\mathbf{y} = \mathbf{x} (\mathbf{W} \odot \mathbf{M})$$
where M is a mask matrix of the same shape as W, with each entry independently sampled from Bernoulli(1-p).
Key Differences:
Mask size: Dropout mask has size equal to input dimension; DropConnect mask has size equal to weight matrix (often much larger)
Expressiveness: DropConnect creates more possible sub-networks (2^(d_in × d_out) vs 2^d_in)
Computational cost: DropConnect requires generating and applying a larger mask
Inference: DropConnect's inference approximation is more complex
```python
import numpy as np
from typing import Tuple


class DropConnectLinear:
    """
    DropConnect linear layer: drops weights instead of activations.

    Creates a larger and more diverse ensemble of sub-networks
    compared to standard dropout.
    """

    def __init__(
        self,
        in_features: int,
        out_features: int,
        drop_prob: float = 0.5
    ):
        """
        Initialize DropConnect layer.

        Args:
            in_features: Input dimension
            out_features: Output dimension
            drop_prob: Probability of dropping each weight
        """
        self.in_features = in_features
        self.out_features = out_features
        self.p = drop_prob

        # Initialize weights
        self.W = np.random.randn(in_features, out_features) * np.sqrt(2.0 / in_features)
        self.b = np.zeros(out_features)

        # Cache for backward pass
        self.mask = None
        self.training = True

    def forward(self, x: np.ndarray) -> np.ndarray:
        """
        Forward pass with DropConnect.

        During training:
        - Generate mask for weight matrix
        - Apply mask: W_masked = W ⊙ M / (1-p)
        - Compute y = x @ W_masked + b

        During inference:
        - Use full weights (inverted scaling already applied)
        """
        if not self.training:
            return x @ self.W + self.b

        # Generate mask for weight matrix
        self.mask = np.random.binomial(1, 1 - self.p, size=self.W.shape)

        # Apply mask with inverted scaling
        W_masked = self.W * self.mask / (1 - self.p)

        return x @ W_masked + self.b

    def backward(
        self, x: np.ndarray, grad_output: np.ndarray
    ) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
        """
        Backward pass for DropConnect.

        Returns:
            grad_input: Gradient w.r.t. input
            grad_W: Gradient w.r.t. weights
            grad_b: Gradient w.r.t. bias
        """
        if not self.training or self.mask is None:
            grad_input = grad_output @ self.W.T
            grad_W = x.T @ grad_output
        else:
            W_masked = self.W * self.mask / (1 - self.p)
            grad_input = grad_output @ W_masked.T
            grad_W = (x.T @ grad_output) * self.mask / (1 - self.p)

        grad_b = grad_output.sum(axis=0)
        return grad_input, grad_W, grad_b


def compare_dropout_dropconnect():
    """Compare dropout and DropConnect regularization."""
    np.random.seed(42)

    print("Dropout vs DropConnect Comparison")
    print("=" * 60)

    in_dim, out_dim = 100, 50
    batch_size = 32
    x = np.random.randn(batch_size, in_dim)

    # Standard dropout layer
    W = np.random.randn(in_dim, out_dim) * 0.1

    # Dropout: mask on input
    dropout_outputs = []
    for _ in range(1000):
        mask = np.random.binomial(1, 0.5, size=(batch_size, in_dim)) / 0.5
        output = (x * mask) @ W
        dropout_outputs.append(output)
    dropout_outputs = np.stack(dropout_outputs)

    # DropConnect: mask on weights
    dropconnect_outputs = []
    for _ in range(1000):
        mask = np.random.binomial(1, 0.5, size=W.shape) / 0.5
        output = x @ (W * mask)
        dropconnect_outputs.append(output)
    dropconnect_outputs = np.stack(dropconnect_outputs)

    # Compare statistics
    print("\nStatistics across 1000 forward passes:")
    print("-" * 60)
    print(f"{'Metric':<25} {'Dropout':<18} {'DropConnect':<18}")
    print("-" * 60)
    print(f"{'Output mean':<25} "
          f"{dropout_outputs.mean():<18.4f} "
          f"{dropconnect_outputs.mean():<18.4f}")
    print(f"{'Output std per sample':<25} "
          f"{dropout_outputs.std(axis=0).mean():<18.4f} "
          f"{dropconnect_outputs.std(axis=0).mean():<18.4f}")
    print(f"{'Variance across samples':<25} "
          f"{dropout_outputs.var(axis=(0, 2)).mean():<18.4f} "
          f"{dropconnect_outputs.var(axis=(0, 2)).mean():<18.4f}")

    # Ensemble size (2^5000 overflows a float, so report DropConnect's
    # count via its base-10 logarithm instead of formatting it directly)
    dropout_ensemble = 2.0 ** in_dim
    dropconnect_log10 = in_dim * out_dim * np.log10(2)
    print("\nTheoretical ensemble sizes:")
    print(f"  Dropout:     2^{in_dim} = {dropout_ensemble:.2e}")
    print(f"  DropConnect: 2^{in_dim * out_dim} ≈ 10^{dropconnect_log10:.0f}")

    print("\n✓ DropConnect creates exponentially more sub-networks")
    print("  but is computationally more expensive")


compare_dropout_dropconnect()
```

| Aspect | Dropout | DropConnect |
|---|---|---|
| Mask location | Activations | Weights |
| Mask size | O(d) | O(d_in × d_out) |
| Sub-networks | 2^d neurons | 2^(d_in × d_out) connections |
| Memory cost | Lower | Higher |
| Inference | Simple (use full network) | Complex (Gaussian approximation) |
| Practical use | Universal | Less common; mainly research |
Despite its theoretical advantages (larger ensemble), DropConnect is rarely used in practice. The computational overhead and complex inference approximation usually don't justify the marginal improvements over standard dropout. However, it remains important conceptually, showing that dropout is not unique to activations.
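To make the inference issue concrete, here is a minimal sketch of the moment-matching (Gaussian) approximation, written to match the inverted-scaling convention used in `DropConnectLinear` above; the Monte Carlo sample count and the ReLU nonlinearity are illustrative assumptions, not a prescription from the original paper.

```python
import numpy as np


def dropconnect_gaussian_inference(x, W, b, drop_prob=0.5, n_samples=100, seed=0):
    """Approximate DropConnect inference via moment matching.

    Each pre-activation u_j = sum_i x_i * W_ij * M_ij / (1 - p) is a sum of
    many independently masked terms, so it is roughly Gaussian. We match its
    mean and variance, sample pre-activations, and average the activations.
    """
    rng = np.random.default_rng(seed)
    keep = 1.0 - drop_prob
    mean = x @ W + b                                   # E[u] under inverted scaling
    var = (drop_prob / keep) * ((x ** 2) @ (W ** 2))   # Var[u] under inverted scaling
    std = np.sqrt(var)

    # Monte Carlo average of the nonlinearity over the Gaussian pre-activation
    samples = [np.maximum(0.0, mean + std * rng.standard_normal(mean.shape))
               for _ in range(n_samples)]
    return np.mean(samples, axis=0)


# Toy usage with random weights (hypothetical shapes)
x = np.random.randn(4, 16)
W = np.random.randn(16, 8) * 0.1
y = dropconnect_gaussian_inference(x, W, b=np.zeros(8))
print(y.shape)  # (4, 8)
```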
Standard dropout drops individual activations independently. For convolutional networks, this creates a problem: adjacent pixels in feature maps are highly correlated. Dropping one pixel has little effect when its neighbors carry similar information.
Spatial Dropout addresses this by dropping entire feature maps (channels) instead of individual pixels.
Mathematical Formulation:
For a feature map of shape (batch, channels, height, width), a single mask value is sampled per channel (and per sample): $$y_{b,c,h,w} = \frac{m_{b,c}}{1-p}\, x_{b,c,h,w}, \qquad m_{b,c} \sim \text{Bernoulli}(1-p)$$
The mask is broadcast across the spatial dimensions, so entire feature maps are either kept or dropped.
Why it works:
Spatial dropout forces the network to not rely on any single feature map. It must learn redundant representations across multiple channels. This is a much stronger constraint than standard dropout, which can be "worked around" by the network learning similar features in adjacent pixels.
```python
import numpy as np
from typing import Tuple


class SpatialDropout2D:
    """
    Spatial Dropout for 2D convolutional feature maps.

    Drops entire feature maps (channels) rather than individual
    pixels. This is more effective for convolutional networks
    because adjacent pixels are highly correlated.
    """

    def __init__(self, drop_prob: float = 0.5):
        """
        Initialize spatial dropout.

        Args:
            drop_prob: Probability of dropping each feature map
        """
        self.p = drop_prob
        self.mask = None
        self.training = True

    def forward(self, x: np.ndarray) -> np.ndarray:
        """
        Apply spatial dropout.

        Args:
            x: Feature maps of shape (batch, channels, height, width)

        Returns:
            Feature maps with some channels zeroed
        """
        if not self.training:
            return x

        batch_size, num_channels, height, width = x.shape

        # Mask shape: (batch, channels, 1, 1)
        # One binary value per feature map per sample
        self.mask = np.random.binomial(
            1, 1 - self.p,
            size=(batch_size, num_channels, 1, 1)
        ) / (1 - self.p)

        # Broadcast mask across spatial dimensions
        return x * self.mask

    def backward(self, grad_output: np.ndarray) -> np.ndarray:
        """Backward pass."""
        if not self.training or self.mask is None:
            return grad_output
        return grad_output * self.mask


class StandardDropout2D:
    """Standard dropout for comparison (drops individual pixels)."""

    def __init__(self, drop_prob: float = 0.5):
        self.p = drop_prob
        self.mask = None
        self.training = True

    def forward(self, x: np.ndarray) -> np.ndarray:
        if not self.training:
            return x
        self.mask = np.random.binomial(1, 1 - self.p, size=x.shape) / (1 - self.p)
        return x * self.mask


def compare_standard_vs_spatial():
    """Compare standard and spatial dropout on conv features."""
    np.random.seed(42)

    print("Standard Dropout vs Spatial Dropout")
    print("=" * 60)

    # Simulated feature maps
    batch_size = 4
    channels = 64
    height, width = 14, 14

    x = np.random.randn(batch_size, channels, height, width)

    # Create dropout layers
    standard = StandardDropout2D(drop_prob=0.5)
    spatial = SpatialDropout2D(drop_prob=0.5)

    # Apply dropout
    standard_out = standard.forward(x)
    spatial_out = spatial.forward(x)

    print("\nInput shape:", x.shape)
    print("-" * 60)

    # Standard dropout: count zeros per channel
    std_zeros_per_channel = []
    for c in range(channels):
        zeros = np.sum(standard_out[:, c] == 0)
        std_zeros_per_channel.append(zeros)

    print("\nStandard Dropout:")
    print(f"  Total zeros: {np.sum(standard_out == 0)}")
    print(f"  Zeros per channel: mean={np.mean(std_zeros_per_channel):.1f}, "
          f"std={np.std(std_zeros_per_channel):.1f}")

    # Spatial dropout: channels are all-or-nothing
    spatial_zeros_per_channel = []
    for c in range(channels):
        zeros = np.sum(spatial_out[:, c] == 0)
        spatial_zeros_per_channel.append(zeros)

    print("\nSpatial Dropout:")
    print(f"  Total zeros: {np.sum(spatial_out == 0)}")
    print(f"  Zeros per channel: {set(spatial_zeros_per_channel)}")  # Only multiples of H*W

    # Count fully dropped channels
    fully_dropped = sum(1 for z in spatial_zeros_per_channel
                        if z == batch_size * height * width)
    print(f"  Fully dropped channels: {fully_dropped} / {channels}")

    # Visualize a single sample
    print("\nVisualization (first sample, first 8 channels):")
    print("  Standard: per-pixel variation in each channel")
    print("  Spatial: entire channels are kept or dropped")
    for c in range(8):
        std_zeros = np.sum(standard_out[0, c] == 0)
        spa_zeros = np.sum(spatial_out[0, c] == 0)
        std_status = f"{std_zeros}/{height*width} zeros"
        spa_status = "DROPPED" if spa_zeros == height * width else "KEPT"
        print(f"  Channel {c}: Standard: {std_status}, Spatial: {spa_status}")


def demonstrate_correlation_robustness():
    """Show why spatial dropout is better for correlated features."""
    np.random.seed(42)

    print("\n" + "=" * 60)
    print("Redundancy in Convolutional Features")
    print("=" * 60)

    # Simulate highly correlated feature maps (smooth gradients)
    batch_size, channels = 1, 1
    size = 8

    # Create a smooth feature (simulating real conv features)
    y, x = np.meshgrid(np.linspace(-1, 1, size), np.linspace(-1, 1, size))
    feature = np.exp(-(x**2 + y**2) / 0.5)  # Gaussian blob
    feature = feature.reshape(1, 1, size, size)

    print("\nOriginal smooth feature (Gaussian blob):")
    print_feature(feature[0, 0])

    # Standard dropout
    standard = StandardDropout2D(0.5)
    standard_out = standard.forward(feature.copy())

    print("\nAfter standard dropout (50%):")
    print_feature(standard_out[0, 0])

    # The issue: even with 50% dropout, pattern is still recognizable
    # because adjacent pixels carry similar information
    print("\nProblem: Information leaks through adjacent pixels!")
    print("Spatial dropout forces learning of truly redundant features.")


def print_feature(feature: np.ndarray):
    """Print a 2D feature map as ASCII."""
    for row in feature:
        line = ""
        for val in row:
            if val == 0:
                line += " ."
            elif val < 0.3:
                line += " ░"
            elif val < 0.6:
                line += " ▒"
            else:
                line += " █"
        print(line)


compare_standard_vs_spatial()
demonstrate_correlation_robustness()
```

Use Spatial Dropout in early convolutional layers where activations are highly spatially correlated. Standard dropout becomes more appropriate in later layers and fully-connected layers where activations are less correlated. Many modern CNN architectures use batch normalization instead, which provides similar regularization effects.
DropBlock (Ghiasi et al., 2018) takes spatial dropout further by dropping contiguous regions of feature maps instead of entire channels or random pixels.
The Key Insight:
While spatial dropout is effective, it might be too aggressive—dropping an entire feature map removes all spatial information from that channel. DropBlock finds a middle ground: drop contiguous rectangular regions within feature maps.
Algorithm: (1) sample a mask of "block centers," dropping each position independently with probability γ; (2) expand every dropped center into a block_size × block_size region of zeros; (3) multiply the feature map by the resulting mask and rescale by count/count_ones so the expected activation magnitude is preserved.
Computing γ:
To achieve an effective dropout probability of $p$, set: $$\gamma = \frac{p \cdot \text{feat\_size}^2}{\text{block\_size}^2 \cdot (\text{feat\_size} - \text{block\_size} + 1)^2}$$
This accounts for block overlap and edge effects.
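For example, with a target drop probability $p = 0.1$, a $14 \times 14$ feature map, and block size $7$: $\gamma = \frac{0.1 \cdot 14^2}{7^2 \cdot 8^2} = \frac{19.6}{3136} \approx 0.006$. Block centers are sampled far more sparsely than the target drop rate because each sampled center wipes out 49 activations.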
```python
import numpy as np


class DropBlock2D:
    """
    DropBlock regularization for convolutional networks.

    Drops contiguous regions (blocks) of feature maps, providing
    stronger regularization than random dropout while being less
    aggressive than spatial dropout.
    """

    def __init__(
        self,
        block_size: int = 7,
        drop_prob: float = 0.1
    ):
        """
        Initialize DropBlock.

        Args:
            block_size: Size of contiguous region to drop
            drop_prob: Target probability of dropping each activation
        """
        self.block_size = block_size
        self.drop_prob = drop_prob
        self.mask = None
        self.training = True

    def _compute_gamma(self, feat_size: int) -> float:
        """
        Compute sampling probability gamma to achieve target drop_prob.

        Accounts for block size and valid placement positions.
        """
        # Number of valid block positions
        valid_positions = feat_size - self.block_size + 1
        if valid_positions <= 0:
            return 0.0

        # Gamma formula from the paper
        gamma = (
            self.drop_prob * (feat_size ** 2)
            / (self.block_size ** 2 * valid_positions ** 2)
        )
        return min(gamma, 1.0)

    def forward(self, x: np.ndarray) -> np.ndarray:
        """
        Apply DropBlock.

        Args:
            x: Feature maps of shape (batch, channels, height, width)

        Returns:
            Feature maps with blocks dropped
        """
        if not self.training or self.drop_prob == 0:
            return x

        batch_size, channels, height, width = x.shape

        # Compute gamma for this feature map size
        gamma = self._compute_gamma(height)

        # Sample block centers
        mask = np.random.binomial(1, 1 - gamma,
                                  size=(batch_size, channels, height, width))

        # Expand each sampled zero into a block
        # (equivalent to max-pooling the inverted mask)
        block_mask = np.ones_like(mask)
        for i in range(batch_size):
            for c in range(channels):
                block_mask[i, c] = self._expand_blocks(mask[i, c])

        # Compute scaling factor
        count = block_mask.size
        count_ones = block_mask.sum()
        if count_ones == 0:
            return x * 0
        scale = count / count_ones

        self.mask = block_mask * scale
        return x * self.mask

    def _expand_blocks(self, mask: np.ndarray) -> np.ndarray:
        """Expand zeros in mask to block_size blocks."""
        height, width = mask.shape
        block_mask = mask.copy()

        # Find zero positions (block centers)
        zero_positions = np.argwhere(mask == 0)

        for pos in zero_positions:
            y, x = pos
            # Expand to block
            y_start = max(0, y - self.block_size // 2)
            y_end = min(height, y + self.block_size // 2 + 1)
            x_start = max(0, x - self.block_size // 2)
            x_end = min(width, x + self.block_size // 2 + 1)
            block_mask[y_start:y_end, x_start:x_end] = 0

        return block_mask


def demonstrate_dropblock():
    """Visualize DropBlock behavior."""
    np.random.seed(42)

    print("DropBlock Demonstration")
    print("=" * 60)

    # Create feature map
    height, width = 14, 14
    x = np.ones((1, 1, height, width))

    dropblock = DropBlock2D(block_size=5, drop_prob=0.3)
    output = dropblock.forward(x)

    print(f"\nFeature map size: {height}x{width}")
    print(f"Block size: {dropblock.block_size}")
    print(f"Target drop probability: {dropblock.drop_prob}")

    # Count dropped activations
    dropped = np.sum(output == 0)
    total = output.size
    actual_drop_rate = dropped / total
    print(f"\nActual drop rate: {actual_drop_rate:.1%}")

    print("\nVisualization (█ = kept, · = dropped):")
    mask = (output[0, 0] > 0).astype(int)
    for row in mask:
        line = "  "
        for val in row:
            line += "█ " if val else "· "
        print(line)

    print("\nNotice: Dropped regions are contiguous blocks, not random pixels")


def compare_dropout_variants():
    """Compare different dropout variants on conv features."""
    np.random.seed(42)

    print("\n" + "=" * 60)
    print("Comparison: Standard vs Spatial vs DropBlock")
    print("=" * 60)

    batch_size, channels = 4, 32
    height, width = 14, 14
    drop_prob = 0.2

    x = np.random.randn(batch_size, channels, height, width)

    # Apply each variant
    # Standard dropout
    std_mask = np.random.binomial(1, 1 - drop_prob, size=x.shape) / (1 - drop_prob)
    std_out = x * std_mask

    # Spatial dropout
    spatial_mask = np.random.binomial(
        1, 1 - drop_prob,
        size=(batch_size, channels, 1, 1)
    ) / (1 - drop_prob)
    spatial_out = x * spatial_mask

    # DropBlock
    dropblock = DropBlock2D(block_size=5, drop_prob=drop_prob)
    dropblock_out = dropblock.forward(x)

    # Statistics
    print(f"\nDrop probability: {drop_prob}")
    print(f"Feature map shape: {x.shape}")
    print()
    print(f"{'Variant':<15} {'Zeros (%)':<12} {'Channels affected':<20}")
    print("-" * 55)

    # Standard: count zeros
    std_zeros = np.sum(std_out == 0) / std_out.size
    std_channels = np.sum(np.any(std_out == 0, axis=(2, 3))) / (batch_size * channels)
    print(f"{'Standard':<15} {std_zeros:<12.1%} {std_channels:<20.1%}")

    # Spatial: count zeros
    spatial_zeros = np.sum(spatial_out == 0) / spatial_out.size
    spatial_channels = np.sum(np.all(spatial_out == 0, axis=(2, 3))) / (batch_size * channels)
    print(f"{'Spatial':<15} {spatial_zeros:<12.1%} {spatial_channels:<20.1%} (fully dropped)")

    # DropBlock: count zeros
    db_zeros = np.sum(dropblock_out == 0) / dropblock_out.size
    db_partial = np.sum(
        np.any(dropblock_out == 0, axis=(2, 3))
        & ~np.all(dropblock_out == 0, axis=(2, 3))
    ) / (batch_size * channels)
    print(f"{'DropBlock':<15} {db_zeros:<12.1%} {db_partial:<20.1%} (partially dropped)")

    print("\nKey insight:")
    print("  Standard:  All channels affected, random pixels")
    print("  Spatial:   Some channels fully dropped, others untouched")
    print("  DropBlock: Many channels partially affected (blocks removed)")


demonstrate_dropblock()
compare_dropout_variants()
```

DropBlock is particularly effective in ResNets and similar architectures. The original paper recommends block_size=7 for 14×14 feature maps and linearly scheduling drop_prob from 0 to the target value during training. DropBlock implementations are available in many deep learning libraries.
As networks grow deeper, a natural question arises: can we drop entire layers during training? Stochastic Depth and DropPath do exactly this.
Stochastic Depth (Huang et al., 2016):
For networks with residual connections, randomly skip entire residual blocks during training:
$$\mathbf{H}_l = \text{ReLU}\big(b_l \cdot f_l(\mathbf{H}_{l-1}) + \mathbf{H}_{l-1}\big)$$
where $b_l \sim \text{Bernoulli}(p_l)$ determines whether block $l$ is active.
Linear Decay Rule:
The survival probability pₗ typically follows a linear decay: $$p_l = 1 - \frac{l}{L}(1 - p_L)$$
Early layers (more important) have higher survival probability; deeper layers (often redundant) are dropped more frequently.
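For example, with $L = 20$ blocks and $p_L = 0.5$ (the values used in the demonstration below), block 10 survives with probability $p_{10} = 1 - \frac{10}{20}(1 - 0.5) = 0.75$, while the final block survives only half the time.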
DropPath (Larsson et al., 2017):
Generalizes stochastic depth to any path in a network with multiple parallel paths. Each path is independently dropped with some probability.
```python
import numpy as np
from typing import Callable, List


class StochasticDepthBlock:
    """
    Residual block with stochastic depth.

    During training, the block is randomly skipped (identity shortcut only)
    and, when active, its residual branch is rescaled by 1/survival_prob.
    During inference, the block is always applied with no extra scaling.
    """

    def __init__(
        self,
        in_features: int,
        survival_prob: float = 0.8,
        transform_fn: Callable = None
    ):
        """
        Initialize stochastic depth block.

        Args:
            in_features: Input/output dimension
            survival_prob: Probability of keeping this block during training
            transform_fn: Optional transformation function (default: linear + ReLU)
        """
        self.survival_prob = survival_prob
        self.in_features = in_features

        # Default transformation
        if transform_fn is None:
            self.W = np.random.randn(in_features, in_features) * 0.01
            self.b = np.zeros(in_features)
            self.transform = lambda x: np.maximum(0, x @ self.W + self.b)
        else:
            self.transform = transform_fn

        self.training = True

    def forward(self, x: np.ndarray) -> np.ndarray:
        """
        Forward pass with stochastic depth.

        Training:  Skip block with probability (1 - survival_prob);
                   when active, rescale the branch by 1/survival_prob
        Inference: Always apply the block (scaling handled at training time)
        """
        if self.training:
            # Sample whether to keep this block
            if np.random.random() < self.survival_prob:
                # Block is active: apply transformation + residual
                return x + self.transform(x) / self.survival_prob
            else:
                # Block is skipped: identity
                return x
        else:
            # Inference: always apply (inverted scaling was used in training)
            return x + self.transform(x)


class StochasticDepthNetwork:
    """
    Network with stochastic depth across residual blocks.

    Uses linear decay: early blocks have high survival prob,
    later blocks have lower survival prob.
    """

    def __init__(
        self,
        input_dim: int,
        num_blocks: int,
        p_final: float = 0.5  # Survival prob for the last block
    ):
        """
        Initialize network.

        Args:
            input_dim: Input dimension
            num_blocks: Number of residual blocks
            p_final: Survival probability for the last block
        """
        self.blocks = []
        for l in range(num_blocks):
            # Linear decay: p_l = 1 - (l/L) * (1 - p_L)
            survival_prob = 1 - (l / num_blocks) * (1 - p_final)
            block = StochasticDepthBlock(
                in_features=input_dim,
                survival_prob=survival_prob
            )
            self.blocks.append(block)

        self.training = True

    def forward(self, x: np.ndarray) -> np.ndarray:
        """Forward pass through all blocks."""
        for block in self.blocks:
            block.training = self.training
            x = block.forward(x)
        return x

    def expected_depth(self) -> float:
        """Expected number of active blocks during training."""
        return sum(b.survival_prob for b in self.blocks)


class DropPath:
    """
    Drop entire paths in multi-path networks.

    Used in architectures like NASNet, EfficientNet, Vision Transformers.
    """

    def __init__(self, drop_prob: float = 0.0):
        """
        Initialize DropPath.

        Args:
            drop_prob: Probability of dropping the path
        """
        self.drop_prob = drop_prob
        self.training = True

    def forward(self, x: np.ndarray) -> np.ndarray:
        """
        Forward pass.

        During training: drop the whole path per sample with probability
        drop_prob, rescaling kept samples by 1/(1 - drop_prob).
        During inference: return the input unchanged.
        """
        if not self.training or self.drop_prob == 0:
            return x

        # Sample keep probability per sample in batch
        keep_prob = 1 - self.drop_prob

        # Random mask: one value per sample
        shape = (x.shape[0],) + (1,) * (x.ndim - 1)
        mask = np.random.binomial(1, keep_prob, size=shape) / keep_prob

        return x * mask


def demonstrate_stochastic_depth():
    """Demonstrate stochastic depth behavior."""
    np.random.seed(42)

    print("Stochastic Depth Demonstration")
    print("=" * 60)

    input_dim = 64
    num_blocks = 20
    p_final = 0.5

    network = StochasticDepthNetwork(input_dim, num_blocks, p_final)

    print(f"\nNetwork: {num_blocks} residual blocks")
    print(f"Final block survival probability: {p_final}")
    print("\nSurvival probabilities (linear decay):")
    for i, block in enumerate(network.blocks):
        prob = block.survival_prob
        bar = "█" * int(prob * 30) + "░" * int((1 - prob) * 30)
        print(f"  Block {i+1:2d}: {bar} {prob:.2f}")

    print(f"\nExpected depth: {network.expected_depth():.1f} / {num_blocks} blocks")

    # Run forward pass multiple times
    x = np.random.randn(32, input_dim)

    print("\nTraining mode - blocks skipped vary each pass:")
    network.training = True
    active_counts = []
    for _ in range(5):
        # Count active blocks (those that modify the output)
        active = 0
        h = x.copy()
        for block in network.blocks:
            h_new = block.forward(h)
            if not np.allclose(h_new, h):  # Block was active if output changed
                active += 1
            h = h_new
        active_counts.append(active)
    print(f"  Active blocks per pass: {active_counts} "
          f"(expected ≈ {network.expected_depth():.1f})")

    # Test output consistency in eval mode
    print("\nInference mode - output is consistent:")
    network.training = False
    outputs = [network.forward(x) for _ in range(3)]
    all_equal = all(np.allclose(outputs[0], out) for out in outputs)
    print(f"  All outputs identical: {all_equal}")


def demonstrate_droppath():
    """Demonstrate DropPath in multi-path architecture."""
    np.random.seed(42)

    print("\n" + "=" * 60)
    print("DropPath in Multi-Path Architecture")
    print("=" * 60)

    # Simulate a cell with multiple paths
    input_dim = 64
    batch_size = 8
    x = np.random.randn(batch_size, input_dim)

    # Multiple parallel paths with different drop probabilities
    path_drops = [0.0, 0.1, 0.2, 0.3]  # Increasing drop probability
    paths = [DropPath(p) for p in path_drops]

    print("\nMulti-path cell with 4 paths:")
    print("  Path 0: drop_prob = 0.0 (never dropped)")
    print("  Path 1: drop_prob = 0.1")
    print("  Path 2: drop_prob = 0.2")
    print("  Path 3: drop_prob = 0.3")

    # Training: some paths may be dropped (per sample)
    print("\nTraining mode (10 forward passes):")
    all_active = []
    for _ in range(10):
        outputs = [path.forward(x) for path in paths]
        # Fraction of samples for which each path is active (not zeroed)
        active = [1 - np.mean(np.all(out == 0, axis=1)) for out in outputs]
        all_active.append(active)

    active_rates = np.mean(all_active, axis=0)
    for i, rate in enumerate(active_rates):
        expected = 1 - path_drops[i]
        print(f"  Path {i}: active {rate:.0%} (expected {expected:.0%})")


demonstrate_stochastic_depth()
demonstrate_droppath()
```

| Benefit | Explanation |
|---|---|
| Training speedup | Fewer computations per iteration (skipped blocks) |
| Regularization | Implicit ensemble over different depths |
| Gradient flow | Shorter paths for gradients when deep blocks are skipped |
| Training stability | Reduces vanishing gradients in very deep networks |
| Memory savings | Skipped blocks don't need activation caching |
Stochastic depth (often called DropPath) is widely used in modern architectures including EfficientNet, Vision Transformers, and Swin Transformers. It's particularly effective for very deep networks (>100 layers) where it provides both regularization and training stability.
Different architecture families have developed specialized dropout variants tailored to their unique characteristics.
Attention Dropout (Transformers):
Transformers apply dropout in multiple places: on the attention weights after the softmax, on the outputs of the attention and feed-forward projections, sometimes on token embeddings, and (as DropPath) on entire residual branches.
Attention dropout specifically drops attention connections, forcing the model to not rely too heavily on any single token relationship.
Embedding Dropout:
For language models, dropping entire token embeddings forces the model to learn robust representations that don't depend on any single word. This is particularly effective for regularizing embedding layers.
RNN Dropout Variants:
Standard dropout doesn't work well for RNNs because resampling the mask at every time step breaks temporal consistency. Variants include variational dropout, which reuses a single mask across all time steps, and zoneout, which stochastically preserves the previous hidden state instead of zeroing activations.
```python
import numpy as np


class AttentionDropout:
    """
    Dropout for attention mechanisms.

    Applied to attention weights after softmax, before
    multiplying with values.
    """

    def __init__(self, drop_prob: float = 0.1):
        self.drop_prob = drop_prob
        self.training = True

    def forward(self, attention_weights: np.ndarray) -> np.ndarray:
        """
        Apply dropout to attention weights.

        Args:
            attention_weights: Softmax-normalized attention of shape
                (batch, heads, seq_len, seq_len)

        Returns:
            Attention weights with dropout applied
        """
        if not self.training or self.drop_prob == 0:
            return attention_weights

        # Generate mask
        mask = np.random.binomial(
            1, 1 - self.drop_prob,
            size=attention_weights.shape
        ) / (1 - self.drop_prob)

        dropped = attention_weights * mask

        # Note: Unlike standard dropout, we might want to renormalize
        # so attention weights still sum to 1:
        # dropped = dropped / dropped.sum(axis=-1, keepdims=True)

        return dropped


class EmbeddingDropout:
    """
    Dropout that drops entire word embeddings.

    More effective than element-wise dropout for embeddings
    because it forces the model to not depend on specific words.
    """

    def __init__(self, drop_prob: float = 0.1):
        self.drop_prob = drop_prob
        self.training = True

    def forward(self, embeddings: np.ndarray) -> np.ndarray:
        """
        Apply embedding dropout.

        Args:
            embeddings: Token embeddings of shape (batch, seq_len, embed_dim)

        Returns:
            Embeddings with some tokens zeroed
        """
        if not self.training or self.drop_prob == 0:
            return embeddings

        batch_size, seq_len, embed_dim = embeddings.shape

        # Mask shape: (batch, seq_len, 1) - same for all embedding dimensions
        mask = np.random.binomial(
            1, 1 - self.drop_prob,
            size=(batch_size, seq_len, 1)
        ) / (1 - self.drop_prob)

        return embeddings * mask


class VariationalRNNDropout:
    """
    Variational dropout for RNNs.

    Uses the same dropout mask at every time step,
    maintaining temporal consistency.
    """

    def __init__(self, drop_prob: float = 0.3, input_dropout: bool = True):
        self.drop_prob = drop_prob
        self.input_dropout = input_dropout
        self.mask = None
        self.training = True

    def reset_mask(self, batch_size: int, hidden_size: int):
        """Generate mask once per sequence."""
        if not self.training or self.drop_prob == 0:
            self.mask = None
            return
        self.mask = np.random.binomial(
            1, 1 - self.drop_prob,
            size=(batch_size, hidden_size)
        ) / (1 - self.drop_prob)

    def forward(self, hidden: np.ndarray) -> np.ndarray:
        """Apply the same mask across all time steps."""
        if self.mask is None:
            return hidden
        return hidden * self.mask


class Zoneout:
    """
    Zoneout for RNNs: randomly keep previous hidden state.

    Instead of zeroing activations, keeps the previous time step's
    hidden state. This creates skip connections in time.
    """

    def __init__(self, zoneout_prob: float = 0.1):
        self.zoneout_prob = zoneout_prob
        self.training = True

    def forward(
        self,
        h_new: np.ndarray,
        h_prev: np.ndarray
    ) -> np.ndarray:
        """
        Apply zoneout.

        Args:
            h_new: New hidden state
            h_prev: Previous hidden state

        Returns:
            Mixed hidden state
        """
        if not self.training or self.zoneout_prob == 0:
            return h_new

        # For each element: keep old with prob zoneout_prob
        mask = np.random.binomial(
            1, 1 - self.zoneout_prob,
            size=h_new.shape
        )

        return mask * h_new + (1 - mask) * h_prev


def demonstrate_transformer_dropout():
    """Show dropout positions in a transformer block."""
    print("Dropout in Transformer Architecture")
    print("=" * 60)
    print("""
    Transformer block (dropout positions marked):

        Input X
          |
        Multi-Head Attention
          softmax(Q K^T / sqrt(d))  -> [ATTENTION DROPOUT]
          weighted sum of V
          output projection         -> [DROPOUT]
          |
        Add & LayerNorm
          |
        Feed-Forward (Linear -> GELU -> Linear)
                                    -> [DROPOUT]
          |
        Add & LayerNorm
          |
        Output
    """)
    print("Typical dropout rates:")
    print("  - Attention dropout: 0.1")
    print("  - Hidden/residual dropout: 0.1")
    print("  - DropPath (for residual): 0.0-0.3 (linear increase)")


def demonstrate_rnn_dropout():
    """Compare RNN dropout variants."""
    np.random.seed(42)

    print("\n" + "=" * 60)
    print("RNN Dropout Variants Comparison")
    print("=" * 60)

    batch_size = 4
    hidden_size = 64
    seq_len = 10

    # Simulate hidden states across time
    hidden_states = np.random.randn(seq_len, batch_size, hidden_size)

    print("\n1. Standard Dropout (different mask each time step):")
    standard_masks = [
        np.random.binomial(1, 0.7, (batch_size, hidden_size))
        for _ in range(seq_len)
    ]
    # These are different, breaking temporal consistency
    print(f"   Mask at t=0 == Mask at t=1: "
          f"{np.allclose(standard_masks[0], standard_masks[1])}")

    print("\n2. Variational Dropout (same mask all time steps):")
    variational = VariationalRNNDropout(drop_prob=0.3)
    variational.reset_mask(batch_size, hidden_size)
    # Same mask used throughout the sequence
    outputs = [variational.forward(hidden_states[t]) for t in range(seq_len)]
    mask_consistent = variational.mask is not None
    print(f"   Same mask used for all {seq_len} time steps: {mask_consistent}")

    print("\n3. Zoneout (keep previous state with probability):")
    zoneout = Zoneout(zoneout_prob=0.15)
    h_prev = np.zeros((batch_size, hidden_size))
    kept_from_prev = []
    for t in range(seq_len):
        h_new = hidden_states[t]
        h_out = zoneout.forward(h_new, h_prev)
        # Count elements kept from the previous state
        if t > 0:
            kept = np.sum(np.isclose(h_out, h_prev))
            kept_from_prev.append(kept / h_out.size)
        h_prev = h_out

    avg_kept = np.mean(kept_from_prev)
    print(f"   Average fraction kept from previous: {avg_kept:.1%} (target: 15%)")
    print("   Creates 'skip connections in time'")


demonstrate_transformer_dropout()
demonstrate_rnn_dropout()
```

| Architecture | Variant | Where Applied | Typical Rate |
|---|---|---|---|
| Transformers | Attention dropout | After softmax attention | 0.1 |
| Transformers | Hidden dropout | After projections/FFN | 0.1 |
| Transformers | DropPath | Skip connection paths | 0.0-0.3 |
| RNNs/LSTMs | Variational dropout | Input/hidden (same mask) | 0.2-0.5 |
| RNNs/LSTMs | Zoneout | Hidden state transitions | 0.1-0.15 |
| Embeddings | Embedding dropout | Entire token vectors | 0.1-0.2 |
With so many dropout variants, how do you choose? In practice, the choice follows the architecture: spatial variants for convolutional networks, DropPath for deep residual networks, variational dropout or zoneout for RNNs, and attention plus hidden dropout for transformers.
In modern architectures (ResNets, Transformers), the combination of (1) DropPath for residual connections, (2) standard dropout after projections, and (3) attention dropout in transformers has become the de facto standard. Batch/Layer normalization often reduces the need for aggressive dropout rates.
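As a minimal sketch of how these pieces compose in a single residual branch (function names, rates, and the omission of layer normalization are illustrative assumptions, not a specific library's API); attention dropout would sit inside the branch itself, as in the AttentionDropout class above:

```python
import numpy as np


def dropout(x, p, training, rng):
    """Standard inverted dropout (element-wise)."""
    if not training or p == 0:
        return x
    mask = rng.binomial(1, 1 - p, size=x.shape) / (1 - p)
    return x * mask


def drop_path(x, p, training, rng):
    """Drop the entire residual branch per sample (stochastic depth)."""
    if not training or p == 0:
        return x
    shape = (x.shape[0],) + (1,) * (x.ndim - 1)
    mask = rng.binomial(1, 1 - p, size=shape) / (1 - p)
    return x * mask


def residual_sublayer(x, branch, hidden_drop=0.1, path_drop=0.1,
                      training=True, rng=None):
    """x + DropPath(Dropout(branch(x))); layer norm omitted for brevity."""
    rng = rng or np.random.default_rng(0)
    out = branch(x)                                  # attention or FFN branch
    out = dropout(out, hidden_drop, training, rng)   # hidden dropout
    out = drop_path(out, path_drop, training, rng)   # stochastic depth
    return x + out


# Toy usage: a feed-forward branch on a batch of 4 token vectors
rng = np.random.default_rng(0)
W1 = rng.standard_normal((64, 128)) * 0.05
W2 = rng.standard_normal((128, 64)) * 0.05
ffn = lambda h: np.maximum(0, h @ W1) @ W2
x = rng.standard_normal((4, 64))
print(residual_sublayer(x, ffn, training=True, rng=rng).shape)  # (4, 64)
```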
The dropout concept extends far beyond randomly zeroing neuron activations. To consolidate: DropConnect masks weights instead of activations; spatial dropout and DropBlock remove entire channels or contiguous regions from convolutional feature maps; stochastic depth and DropPath skip whole residual blocks or paths; and attention, embedding, variational, and zoneout dropout adapt the same idea to transformers and recurrent networks.
Module Complete:
You've now completed the comprehensive study of dropout regularization in deep learning. You understand the core algorithm, its ensemble and Bayesian interpretations, and the specialized variants that adapt it to convolutional, residual, recurrent, and transformer architectures.
Dropout and its variants remain among the most important regularization techniques in deep learning, and mastering them is essential for training robust, generalizing neural networks.
Congratulations! You've completed the Dropout module. You now have a deep understanding of dropout regularization—from the basic algorithm through Bayesian interpretations to specialized variants for every major architecture type. This knowledge will serve you in designing, training, and debugging neural networks across all domains of deep learning.