In 2019, Jonathan Frankle and Michael Carbin published a paper that would reshape our understanding of neural network training. The Lottery Ticket Hypothesis makes a remarkable claim: within a randomly initialized dense neural network, there exists a sparse subnetwork (a 'winning ticket') that, when trained in isolation from the same initialization, can match the test accuracy of the full network.
This discovery has profound implications for network compression, our understanding of overparameterization, and the fundamental nature of neural network training.
By the end of this page, you will understand the formal statement and implications of the Lottery Ticket Hypothesis, the experimental methodology used to identify winning tickets, why random initialization matters for lottery tickets, extensions including linear mode connectivity and pruning at initialization, and connections to implicit regularization and network compression.
The core insight:
The lottery metaphor is apt: training a neural network is like buying a lottery ticket. A dense network is like buying many tickets—one of them is likely to win. The 'winning ticket' is a specific sparse subnetwork, defined by both its structure (which connections exist) and its initialization (what the initial weights are).
$$\text{Dense Network} = \text{Many Lottery Tickets (subnetworks)}$$ $$\text{Training} = \text{Finding which ticket wins}$$ $$\text{Winning Ticket} = \text{Sparse subnetwork + original initialization}$$
The hypothesis suggests that the role of overparameterization isn't to have more capacity, but to increase the probability of containing a good sparse subnetwork among the exponentially many possible subnetworks.
Let us state the Lottery Ticket Hypothesis precisely and understand its components.
Formal statement:
A randomly initialized, dense neural network N with parameters θ₀ ∈ ℝ^d contains a subnetwork N' ⊂ N with mask m ∈ {0,1}^d such that, when trained in isolation from the initialization m ⊙ θ₀, the subnetwork reaches test accuracy a' in t' training iterations satisfying:
$$a' \geq a, \qquad t' \leq t, \qquad \|m\|_0 \ll d$$
where a and t are the test accuracy and training iteration count of the dense network, and m ⊙ θ represents element-wise multiplication (applying the mask to the weights).
The hypothesis specifically requires the original initialization θ₀. A sparse subnetwork with the right structure but random re-initialization typically trains poorly. This reveals that trainability depends not just on network architecture but on the specific initial weights.
What makes a winning ticket 'win':
| Property | Description |
|---|---|
| Structure (mask m) | Which weights are present; the subnetwork's topology |
| Initialization (θ₀) | The specific starting values of retained weights |
| Joint property | Both structure AND initialization matter |
Why this is surprising:
Sparse networks are hard to train from scratch: Randomly initialized sparse networks typically underperform dense ones
Structure alone is insufficient: The same sparse structure with different initialization fails
Initialization alone is insufficient: The masked initialization only works with the correct sparse structure
The winning ticket is a joint property of structure and initialization that emerges from the dense network's training process.
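To make the joint dependence concrete, the standard controls can be set up explicitly. The sketch below is a hypothetical experimental harness, not the original paper's code: it builds the three arms one would train in isolation with an identical recipe, and the hypothesis predicts that only the first matches dense accuracy.

```python
import numpy as np

def make_experiment_arms(theta_0, mask, seed=0):
    """Build the three standard lottery-ticket control experiments.

    theta_0: dict of initial weight arrays (the original initialization)
    mask:    dict of {0,1} arrays with matching shapes (the ticket structure)
    """
    rng = np.random.default_rng(seed)

    # 1. Winning ticket: original structure AND original initialization
    winning_ticket = {k: mask[k] * theta_0[k] for k in theta_0}

    # 2. Same structure, fresh random re-initialization (typically trains poorly)
    random_reinit = {
        k: mask[k] * rng.standard_normal(v.shape) * (v.std() + 1e-12)
        for k, v in theta_0.items()
    }

    # 3. Original initialization, random structure at the same per-layer sparsity
    random_structure = {}
    for k, v in theta_0.items():
        keep_fraction = mask[k].mean()
        random_mask = (rng.random(v.shape) < keep_fraction).astype(float)
        random_structure[k] = random_mask * v

    return {
        "winning_ticket": winning_ticket,
        "random_reinit": random_reinit,
        "random_structure": random_structure,
    }
```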
The search space perspective:
For a network with d parameters, there are 2^d possible subnetworks. A sparse subnetwork with k parameters (k << d) is found among these 2^d possibilities. The probability that a random sparse subnetwork is a winning ticket is negligible:
$$P(\text{random sparse subnetwork is winning}) \approx 0$$
Yet training the dense network identifies one. This suggests training performs an implicit architecture search, discovering not just good weights but good sparsity patterns.
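A quick back-of-the-envelope calculation (the parameter counts below are arbitrary illustrative choices) shows how hopeless a blind search over masks would be:

```python
from math import comb, log10

d = 1000   # total parameters; tiny by deep-learning standards
k = 100    # parameters kept in a 90%-sparse subnetwork

num_subnetworks = 2 ** d       # all possible binary masks
num_k_sparse = comb(d, k)      # masks keeping exactly k of the d weights

print(f"All subnetworks:  2^{d} ~ 10^{int(log10(num_subnetworks))}")
print(f"k-sparse masks:   C({d},{k}) ~ 10^{int(log10(num_k_sparse))}")
print(f"P(random k-sparse mask is one particular ticket) ~ 10^-{int(log10(num_k_sparse))}")
```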
Connection to overparameterization:
Overparameterization serves a dual purpose: it enlarges both the pool of candidate sparse structures and the pool of initializations available to those structures. Large networks may succeed not because all their parameters are needed, but because having many parameters makes it likely that some sparse subset, combined with its initialization, forms an effective learner.
The original Lottery Ticket paper identified winning tickets using Iterative Magnitude Pruning (IMP)—a simple yet powerful algorithm that alternates between training and pruning.
The IMP Algorithm:
The key insight: we use the training process to identify which weights matter (the pruning mask), but then we reset to the original initialization before the next training round.
```python
import numpy as np
from copy import deepcopy


class IterativeMagnitudePruning:
    """
    Implementation of Iterative Magnitude Pruning to find lottery tickets.
    This is a simplified demonstration showing the core algorithm.
    """

    def __init__(
        self,
        initial_weights: dict,
        prune_percent: float = 20.0,
        num_rounds: int = 10
    ):
        """
        Args:
            initial_weights: Dictionary of initial weight arrays
            prune_percent: Percentage of remaining weights to prune each round
            num_rounds: Number of prune-train cycles
        """
        self.theta_0 = deepcopy(initial_weights)  # Save original init
        self.prune_percent = prune_percent
        self.num_rounds = num_rounds
        # Initialize mask (all ones = all weights present)
        self.mask = {k: np.ones_like(v) for k, v in initial_weights.items()}

    def get_current_weights(self):
        """Apply mask to original initialization."""
        return {k: self.mask[k] * self.theta_0[k] for k in self.theta_0}

    def get_current_sparsity(self):
        """Calculate current sparsity percentage."""
        total_params = sum(m.size for m in self.mask.values())
        nonzero_params = sum(np.sum(m != 0) for m in self.mask.values())
        return 100.0 * (1 - nonzero_params / total_params)

    def prune_round(self, trained_weights: dict):
        """
        Perform one round of magnitude pruning.

        Args:
            trained_weights: Weights after training (used to determine magnitudes)

        Returns:
            Updated mask after pruning
        """
        # Collect all weight magnitudes (only for currently unmasked weights)
        all_magnitudes = []
        for k in trained_weights:
            masked_weights = self.mask[k] * trained_weights[k]
            magnitudes = np.abs(masked_weights[self.mask[k] != 0])
            all_magnitudes.extend(magnitudes)
        all_magnitudes = np.array(all_magnitudes)

        # Find threshold for pruning
        num_to_prune = int(len(all_magnitudes) * self.prune_percent / 100)
        if num_to_prune == 0:
            return self.mask
        threshold = np.sort(all_magnitudes)[num_to_prune]

        # Apply pruning to mask
        for k in trained_weights:
            masked_weights = self.mask[k] * trained_weights[k]
            prune_mask = np.abs(masked_weights) > threshold
            self.mask[k] = self.mask[k] * prune_mask.astype(float)
        return self.mask

    def run_imp(self, train_function):
        """
        Run the full IMP algorithm.

        Args:
            train_function: Function that takes weights and returns trained weights

        Returns:
            Final mask defining the winning ticket
        """
        results = []
        for round_idx in range(self.num_rounds):
            # Get current masked initialization
            current_weights = self.get_current_weights()

            # Train the network
            trained_weights = train_function(current_weights)

            # Record results before pruning
            sparsity = self.get_current_sparsity()
            results.append({
                'round': round_idx,
                'sparsity': sparsity,
                'weights': deepcopy(current_weights),
            })
            print(f"Round {round_idx}: Sparsity = {sparsity:.1f}%")

            # Prune and reset to original init (key step!)
            self.prune_round(trained_weights)
        return self.mask, results


def demonstrate_imp():
    """Demonstrate IMP algorithm with a toy example."""
    np.random.seed(42)

    # Create a simple "network" with random weights
    initial_weights = {
        'layer1': np.random.randn(100, 50) * 0.1,
        'layer2': np.random.randn(50, 10) * 0.1,
    }
    total_params = sum(w.size for w in initial_weights.values())
    print(f"Total parameters: {total_params}")
    print()

    # Mock training function (in reality, this would train the network)
    def mock_train(weights):
        """Simulate training by scaling weights randomly."""
        trained = {}
        for k, v in weights.items():
            # Simulate: some weights grow important, others stay small
            importance = np.random.exponential(0.5, v.shape)
            trained[k] = v * importance
        return trained

    # Run IMP
    imp = IterativeMagnitudePruning(
        initial_weights=initial_weights,
        prune_percent=20.0,
        num_rounds=8
    )
    print("Running Iterative Magnitude Pruning")
    print("=" * 50)
    final_mask, results = imp.run_imp(mock_train)
    print()
    print("Final winning ticket structure identified!")
    final_sparsity = imp.get_current_sparsity()
    print(f"Final sparsity: {final_sparsity:.1f}%")
    print(f"Remaining parameters: {int(total_params * (100 - final_sparsity) / 100)}")


demonstrate_imp()
```

Why IMP works:
Magnitude as importance proxy: Weights that are large after training are likely important for the learned function
Iterative refinement: Gradual pruning (20% at a time) is more effective than one-shot pruning to extreme sparsity
Initialization reset: Returning to θ₀ tests whether the structure alone (with original init) suffices, not whether the trained weights can be pruned
Computational cost:
IMP is expensive—it requires multiple full training runs. For n rounds of pruning:
$$\text{Total Training Cost} = n \times \text{Cost of One Training Run}$$
This has motivated research into more efficient methods for finding winning tickets.
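As a rough illustration of this trade-off, the short sketch below tabulates remaining density and cumulative training cost for the 20%-per-round schedule used above (one full training run per round is a simplifying assumption):

```python
prune_percent = 20.0  # fraction of surviving weights removed each round

print(f"{'round':>5} {'density %':>10} {'sparsity %':>11} {'training runs':>14}")
density = 100.0
for n in range(1, 11):
    density *= 1 - prune_percent / 100  # each round keeps 80% of what remains
    print(f"{n:>5} {density:>10.1f} {100 - density:>11.1f} {n:>14}")

# After 10 rounds: roughly 10.7% density (89.3% sparsity) for the price of
# 10 full training runs, which is why cheaper alternatives are attractive.
```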
For larger networks, resetting to iteration 0 often fails. The 'rewinding' variant resets to an early iteration (e.g., iteration k where k is 0.1% to 1% of total training). This early point provides enough training signal for initialization while still enabling ticket identification.
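A minimal sketch of the rewinding variant, assuming hypothetical `train` and `prune` callables (placeholders for a real training loop and magnitude pruning): the only change from the original procedure is that surviving weights are reset to the early checkpoint θ_k rather than to θ₀.

```python
import numpy as np

def imp_with_rewinding(theta_0, train, prune, num_rounds=5, rewind_iter=500):
    """IMP with weight rewinding (sketch, not the original authors' code).

    train(weights, record_at): trains the masked network, returning
        (final_weights, snapshot_of_weights_at_iteration_record_at)
    prune(mask, trained_weights): returns an updated {0,1} mask
    """
    mask = {k: np.ones_like(v) for k, v in theta_0.items()}

    # The first (dense) run also records the early checkpoint theta_k
    trained, theta_k = train(theta_0, record_at=rewind_iter)

    for _ in range(num_rounds):
        mask = prune(mask, trained)
        # Key difference from resetting to iteration 0:
        # surviving weights are rewound to theta_k, not theta_0
        rewound = {k: mask[k] * theta_k[k] for k in theta_k}
        trained, _ = train(rewound, record_at=rewind_iter)

    return mask, theta_k
```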
The Lottery Ticket Hypothesis has spawned significant theoretical work trying to explain why winning tickets exist and what they tell us about neural network training.
The pruning as architecture search view:
Training a dense network and pruning can be seen as implicit architecture search. The training process evaluates which weights (and hence, which connections) are important, while pruning crystallizes this into an explicit sparse architecture.
$$\text{Dense Training} + \text{Pruning} \approx \text{Architecture Search}$$
This perspective connects lottery tickets to Neural Architecture Search (NAS), but with a key difference: instead of searching over discrete architecture choices, we search over the 2^d possible binary masks.
The linear mode connectivity hypothesis:
Research has shown that winning tickets often lie in the same loss basin as the fully trained dense network. Specifically, two solutions are called linearly mode connected if the loss along the straight line between their weights never rises appreciably above the loss at the endpoints; winning tickets tend to be found precisely in regimes where training is stable in this sense.
This suggests the dense network and winning ticket converge to the same functional solution, just expressed with different (dense vs. sparse) parameterizations.
A stronger claim, the Strong Lottery Ticket Hypothesis: a randomly initialized network contains a subnetwork that achieves good accuracy without any training, simply by selecting the right weights. This has been proven for sufficiently wide networks, connecting to the theory of random features and the Neural Tangent Kernel.
Mathematical formalization:
Let f(x; θ, m) be the function computed by the network with parameters θ and mask m. The Lottery Ticket Hypothesis claims:
$$\exists m^* \text{ sparse}: \quad \min_\theta L(f(\cdot; \theta, m^*)) \approx \min_\theta L(f(\cdot; \theta, \mathbf{1}))$$
where 1 is the all-ones mask (dense network). Moreover, the minimization over θ starting from m^* ⊙ θ₀ succeeds, while starting from random initialization typically fails.
Why random sparse initialization fails:
A randomly initialized sparse network has the right structure but the wrong weights: its initialization is no longer matched to the mask, so the structure-initialization pairing that made the ticket trainable is lost, and the sparse network typically trains to lower accuracy. The table below summarizes the main variants of the lottery ticket procedure and when they succeed or fail.
| Variant | Reset Point | When It Works | When It Fails |
|---|---|---|---|
| Original (reset to θ₀) | Iteration 0 | Small networks, MNIST, CIFAR-10 | Large networks (ResNet-50+) |
| Rewinding | Early iteration k | Larger networks, ImageNet | Very aggressive sparsity |
| Learning Rate Rewinding | Iteration k, reset LR schedule | Matches rewinding performance | No significant failures known |
| Strong LTH (no training) | N/A | Sufficiently overparameterized | Practical networks |
The Lottery Ticket Hypothesis reshapes our understanding of what happens during neural network training.
Training as ticket selection:
Rather than viewing training as 'learning the weights,' we can view it as 'identifying the winning ticket.' The dense network contains exponentially many possible sparse subnetworks; training selects which one 'wins.'
This perspective explains several phenomena:
Why overparameterization helps: More parameters = more possible tickets = higher chance of a winner
Why initialization matters: The winning ticket's effectiveness depends on its specific initialization
Why pruning works: Training already identified the important weights; we're just removing the unimportant ones
The Lottery Ticket Hypothesis offers a resolution to the overparameterization paradox: large networks generalize well not despite having many parameters, but because having many parameters increases the probability of containing effective sparse subnetworks. The network uses its parameters as a 'search space' for finding good architectures.
Implications for regularization:
The lottery ticket perspective connects to implicit regularization:
SGD selects tickets: The stochastic optimization process biases toward certain tickets (perhaps simpler ones that are found more easily)
Architecture limits ticket pool: Network architecture determines which tickets are possible; this is architectural regularization
Sparsity is implicit: The winning ticket is sparse even though we never explicitly regularized for sparsity
Practical implications:
| Finding | Practical Implication |
|---|---|
| Dense networks contain sparse winners | Dense training + pruning is effective |
| Initialization matters | Keep original init when pruning; don't re-initialize |
| Gradual pruning outperforms one-shot | Use iterative pruning schedules |
| Early weights predict winners | May be possible to prune early in training |
A natural question: can we find winning tickets before training, avoiding the expensive train-prune-reset cycle? This has led to research on pruning at initialization.
The SNIP algorithm (Single-shot Network Pruning):
SNIP identifies important connections based on their gradient-signal magnitudes at initialization:
$$s_j = \left|\frac{\partial L}{\partial c_j}\right|_{c=\mathbf{1}}$$
where c_j ∈ {0,1} indicates whether connection j exists. Because the loss depends on c_j only through the product c_j w_j, this sensitivity equals |w_j · ∂L/∂w_j|, the weight-gradient product computed in practice. Connections with high sensitivity to removal are kept.
Key insight: Even before training, gradients reveal which connections are important for the loss function on the training data.
GraSP extends SNIP by preserving gradient flow rather than just gradient magnitude. It keeps connections that, when removed, would most reduce the gradient signal to remaining parameters. This leads to better trainability of the resulting sparse network.
Comparison of pruning-at-init methods:
| Method | Criterion | Computational Cost | Performance |
|---|---|---|---|
| Random | None | Minimal | Poor |
| SNIP | Connection sensitivity | One backward pass | Moderate |
| GraSP | Gradient flow preservation | One backward pass | Good |
| SynFlow | Synaptic saliency | Few forward passes | Good |
| IMP | Post-training magnitude | Full training × rounds | Best |
The gap to IMP:
While pruning-at-init methods are much faster, they typically underperform IMP, especially at high sparsity.
This suggests that some training signal is necessary to identify the best tickets.
```python
import numpy as np


def snip_pruning(weights, gradients, target_sparsity=0.9):
    """
    Simplified SNIP (Single-shot Network Pruning) implementation.

    SNIP prunes based on connection sensitivity: how much the loss
    changes when a connection is removed.

    Args:
        weights: Dictionary of weight arrays
        gradients: Dictionary of gradient arrays (from one forward-backward pass)
        target_sparsity: Fraction of weights to prune (0.9 = 90% pruned)

    Returns:
        Pruning mask
    """
    # Compute connection sensitivity: |weight * gradient|
    # This approximates the change in loss from removing the connection
    sensitivities = {}
    all_scores = []
    for layer_name in weights:
        w = weights[layer_name]
        g = gradients[layer_name]
        # Connection sensitivity score
        sensitivity = np.abs(w * g)
        sensitivities[layer_name] = sensitivity
        all_scores.extend(sensitivity.flatten())

    # Find threshold for target sparsity
    all_scores = np.array(all_scores)
    threshold = np.percentile(all_scores, target_sparsity * 100)

    # Create masks
    masks = {}
    for layer_name, sens in sensitivities.items():
        masks[layer_name] = (sens > threshold).astype(float)

    # Report statistics
    total_params = sum(w.size for w in weights.values())
    remaining_params = sum(np.sum(m) for m in masks.values())
    actual_sparsity = 1 - remaining_params / total_params
    print(f"SNIP Pruning Results:")
    print(f"  Target sparsity: {target_sparsity * 100:.1f}%")
    print(f"  Actual sparsity: {actual_sparsity * 100:.1f}%")
    print(f"  Remaining parameters: {int(remaining_params)} / {total_params}")

    return masks


def demonstrate_snip():
    """Demonstrate SNIP on a toy example."""
    np.random.seed(42)

    # Simulate network weights at initialization
    weights = {
        'layer1': np.random.randn(784, 256) * 0.01,
        'layer2': np.random.randn(256, 128) * 0.01,
        'layer3': np.random.randn(128, 10) * 0.01,
    }

    # Simulate gradients from one forward-backward pass
    # In practice, these come from backprop on a mini-batch
    gradients = {
        'layer1': np.random.randn(784, 256) * np.random.exponential(0.1, (784, 256)),
        'layer2': np.random.randn(256, 128) * np.random.exponential(0.1, (256, 128)),
        'layer3': np.random.randn(128, 10) * np.random.exponential(0.1, (128, 10)),
    }

    print("=" * 50)
    print("SNIP: Single-shot Network Pruning")
    print("=" * 50)
    print()

    for sparsity in [0.5, 0.8, 0.9, 0.95]:
        print(f"\nSparsity level: {sparsity * 100}%")
        masks = snip_pruning(weights, gradients, target_sparsity=sparsity)

        # Show per-layer statistics
        for layer_name in weights:
            layer_remaining = np.sum(masks[layer_name])
            layer_total = masks[layer_name].size
            print(f"  {layer_name}: {int(layer_remaining)}/{layer_total} "
                  f"({100*layer_remaining/layer_total:.1f}% remaining)")


demonstrate_snip()
```

The Lottery Ticket Hypothesis has inspired numerous applications and extensions beyond its original scope.
Network compression:
The most direct application is efficient model deployment: a winning ticket matches the dense network's accuracy with a small fraction of its parameters, cutting storage, memory footprint, and inference cost. This is particularly valuable for edge devices, mobile phones, and embedded systems where memory and computation are constrained.
Winning tickets transfer across datasets to some extent. A ticket found on ImageNet often works (with retraining) on other vision tasks. This suggests tickets capture general-purpose visual processing structures, not task-specific solutions.
Extensions to different domains:
| Domain | Finding | Implication |
|---|---|---|
| NLP (Transformers) | Tickets exist but need rewinding | LLMs are compressible post-training |
| Reinforcement Learning | Tickets transfer across tasks | Reusable RL agent structures |
| GANs | Both generator and discriminator have tickets | Efficient generative models |
| Graph Neural Networks | Tickets exist with structural sparsity | Efficient graph processing |
Structured lottery tickets:
Beyond unstructured (individual weight) sparsity, research has explored structured tickets in which whole units are pruned: entire neurons, convolutional channels or filters, attention heads, or even layers.
Structured sparsity is hardware-friendly but typically requires higher density for equivalent accuracy.
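For contrast with the element-wise masks used so far, here is a minimal sketch of structured (output-channel) magnitude pruning for a convolutional weight tensor; the tensor shape and keep ratio are arbitrary illustrative choices:

```python
import numpy as np

def prune_channels_by_norm(conv_weight, keep_ratio=0.5):
    """Structured pruning: remove whole output channels with the smallest L2 norm.

    conv_weight: array of shape (out_channels, in_channels, kh, kw)
    Returns a channel mask of shape (out_channels,) and the pruned tensor.
    """
    out_channels = conv_weight.shape[0]
    # One score per output channel: the L2 norm of all its weights
    channel_norms = np.sqrt((conv_weight ** 2).reshape(out_channels, -1).sum(axis=1))

    num_keep = max(1, int(round(keep_ratio * out_channels)))
    keep_idx = np.argsort(channel_norms)[-num_keep:]          # largest-norm channels
    channel_mask = np.zeros(out_channels)
    channel_mask[keep_idx] = 1.0

    pruned = conv_weight * channel_mask[:, None, None, None]  # zero whole channels
    return channel_mask, pruned

# Example: a random 64-channel conv layer pruned to 50% of its channels
w = np.random.randn(64, 32, 3, 3) * 0.1
mask, w_pruned = prune_channels_by_norm(w, keep_ratio=0.5)
print(f"Channels kept: {int(mask.sum())}/{len(mask)}")
```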
Lottery tickets and neural architecture search:
There's a deep connection between lottery tickets and NAS: finding a winning ticket is a search over the 2^d binary masks of a fixed dense 'supernetwork', with training supplying the signal for which connections to keep, much as NAS searches over candidate architectures within a larger search space.
This view unifies two successful paradigms: large-scale training + pruning (lottery tickets) and architecture search (NAS).
Lottery tickets in pre-training:
For pre-trained models (like BERT, GPT): winning tickets can be identified in the pre-trained weights themselves, with the pre-trained checkpoint playing the role of the initialization, and the sparse subnetworks found during fine-tuning often transfer across downstream tasks.
This enables deploying compact versions of large language models while retaining most of their capability.
The Lottery Ticket Hypothesis connects deeply to the theme of implicit regularization, revealing how training naturally discovers effective sparse structures.
Implicit sparsity through training:
Even without explicit sparsity regularization (like L1), training induces a form of implicit sparsity: many weights end up with small magnitudes and contribute little to the learned function, while a minority grow large and carry most of it.
The lottery ticket is a crystallization of this implicit sparsity—it makes explicit the sparse structure that was latent in the dense solution.
SGD doesn't just find good weights—it implicitly selects which sparse subnetwork 'wins.' The stochasticity of SGD may help explore different tickets early in training, while convergence later 'commits' to one. This is analogous to simulated annealing: explore broadly, then exploit locally.
Lottery tickets and flat minima:
Winning tickets tend to occupy flat regions of the loss landscape: their trained solutions sit in basins where small perturbations of the remaining weights change the loss only slightly.
This connects to SGD's implicit bias toward flat minima: winning tickets may be 'easy' for SGD to train precisely because they lead to flat solutions.
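One way to probe such a claim empirically is to measure how much the loss rises under small random weight perturbations; flatter solutions rise less. A minimal sketch of such a probe, where `loss_fn` is a placeholder for evaluating a real model on held-out data:

```python
import numpy as np

def flatness_probe(loss_fn, theta, radius=0.01, num_samples=20, seed=0):
    """Average loss increase under random perturbations of relative size `radius`.

    loss_fn: callable mapping a dict of weight arrays to a scalar loss (placeholder)
    theta:   dict of trained weight arrays (e.g., a trained winning ticket)
    Lower return values indicate a flatter neighbourhood.
    """
    rng = np.random.default_rng(seed)
    base_loss = loss_fn(theta)
    increases = []
    for _ in range(num_samples):
        perturbed = {}
        for k, v in theta.items():
            # Per-layer perturbation scaled so its norm is about radius * ||v||
            noise = rng.standard_normal(v.shape)
            scale = radius * np.linalg.norm(v) / (np.linalg.norm(noise) + 1e-12)
            perturbed[k] = v + scale * noise
        increases.append(loss_fn(perturbed) - base_loss)
    return float(np.mean(increases))
```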
Lottery tickets and generalization:
Why do winning tickets generalize well?
| Regularization Source | Mechanism in Lottery Tickets |
|---|---|
| SGD stochasticity | Explores many tickets; settles on trainable ones |
| Architecture | Defines which tickets are possible |
| Initialization | Provides 'good starting points' for some tickets |
| Sparsity (implicit) | Training creates magnitude disparity; tickets crystallize this |
| Early stopping | May select simpler tickets by halting before complex patterns learned |
The broader picture:
The Lottery Ticket Hypothesis reveals that neural network training is doing more than gradient descent on a fixed architecture. It's simultaneously optimizing the weights and selecting which sparse subnetwork those weights will ultimately express, an implicit architecture search running alongside learning.
This unified view positions overparameterization as an exploration strategy: have many options (parameters/connections) to increase the chance of discovering a good solution. The 'waste' of unused parameters isn't waste—it's the search cost for finding winning tickets.
The Lottery Ticket Hypothesis has fundamentally changed our understanding of neural network training, overparameterization, and the implicit structures discovered during learning.
You now understand the Lottery Ticket Hypothesis—a landmark discovery connecting overparameterization, architecture search, and implicit regularization. Next, we'll explore benign overfitting—another surprising phenomenon where models that perfectly memorize training data can still generalize well.