In 2019, Jonathan Frankle and Michael Carbin published a paper that would reshape our understanding of neural network training. The Lottery Ticket Hypothesis makes a remarkable claim: within a randomly initialized dense neural network, there exists a sparse subnetwork (a 'winning ticket') that, when trained in isolation from the same initialization, can match the test accuracy of the full network.
This discovery has profound implications for network compression, our understanding of overparameterization, and the fundamental nature of neural network training.
By the end of this page, you will understand the formal statement and implications of the Lottery Ticket Hypothesis, the experimental methodology used to identify winning tickets, why random initialization matters for lottery tickets, extensions including linear mode connectivity and pruning at initialization, and connections to implicit regularization and network compression.
The core insight:
The lottery metaphor is apt: training a neural network is like buying a lottery ticket. A dense network is like buying many tickets—one of them is likely to win. The 'winning ticket' is a specific sparse subnetwork, defined by both its structure (which connections exist) and its initialization (what the initial weights are).
$$\text{Dense Network} = \text{Many Lottery Tickets (subnetworks)}$$ $$\text{Training} = \text{Finding which ticket wins}$$ $$\text{Winning Ticket} = \text{Sparse subnetwork + original initialization}$$
The hypothesis suggests that the role of overparameterization isn't to have more capacity, but to increase the probability of containing a good sparse subnetwork among the exponentially many possible subnetworks.
Let us state the Lottery Ticket Hypothesis precisely and understand its components.
Formal statement:
A randomly initialized, dense neural network N with parameters θ₀ ∈ ℝ^d contains a subnetwork N' ⊂ N with mask m ∈ {0,1}^d such that, when trained in isolation from the initialization m ⊙ θ₀, the subnetwork reaches test accuracy a' in t' training iterations satisfying:
$$a' \geq a, \qquad t' \leq t, \qquad \|m\|_0 \ll d$$
where a and t are the test accuracy and training iteration count of the dense network, and m ⊙ θ represents element-wise multiplication (applying the mask to the weights).
The hypothesis specifically requires the original initialization θ₀. A sparse subnetwork with the right structure but random re-initialization typically trains poorly. This reveals that trainability depends not just on network architecture but on the specific initial weights.
What makes a winning ticket 'win':
| Property | Description |
|---|---|
| Structure (mask m) | Which weights are present; the subnetwork's topology |
| Initialization (θ₀) | The specific starting values of retained weights |
| Joint property | Both structure AND initialization matter |
Why this is surprising:
Sparse networks are hard to train from scratch: Randomly initialized sparse networks typically underperform dense ones
Structure alone is insufficient: The same sparse structure with different initialization fails
Initialization alone is insufficient: The masked initialization only works with the correct sparse structure
The winning ticket is a joint property of structure and initialization that emerges from the dense network's training process.
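To make the joint dependence concrete, the standard controls can be set up explicitly. The sketch below is a hypothetical experimental harness, not the original paper's code: it builds the three arms one would train in isolation with an identical recipe, and the hypothesis predicts that only the first matches dense accuracy.

```python
import numpy as np

def make_experiment_arms(theta_0, mask, seed=0):
    """Build the three standard lottery-ticket control experiments.

    theta_0: dict of initial weight arrays (the original initialization)
    mask:    dict of {0,1} arrays with matching shapes (the ticket structure)
    """
    rng = np.random.default_rng(seed)

    # 1. Winning ticket: original structure AND original initialization
    winning_ticket = {k: mask[k] * theta_0[k] for k in theta_0}

    # 2. Same structure, fresh random re-initialization (typically trains poorly)
    random_reinit = {
        k: mask[k] * rng.standard_normal(v.shape) * (v.std() + 1e-12)
        for k, v in theta_0.items()
    }

    # 3. Original initialization, random structure at the same per-layer sparsity
    random_structure = {}
    for k, v in theta_0.items():
        keep_fraction = mask[k].mean()
        random_mask = (rng.random(v.shape) < keep_fraction).astype(float)
        random_structure[k] = random_mask * v

    return {
        "winning_ticket": winning_ticket,
        "random_reinit": random_reinit,
        "random_structure": random_structure,
    }
```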
The search space perspective:
For a network with d parameters, there are 2^d possible subnetworks. A sparse subnetwork with k parameters (k << d) is found among these 2^d possibilities. The probability that a random sparse subnetwork is a winning ticket is negligible:
$$P(\text{random sparse subnetwork is winning}) \approx 0$$
Yet training the dense network identifies one. This suggests training performs an implicit architecture search, discovering not just good weights but good sparsity patterns.
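A quick back-of-the-envelope calculation (the parameter counts below are arbitrary illustrative choices) shows how hopeless a blind search over masks would be:

```python
from math import comb, log10

d = 1000   # total parameters; tiny by deep-learning standards
k = 100    # parameters kept in a 90%-sparse subnetwork

num_subnetworks = 2 ** d       # all possible binary masks
num_k_sparse = comb(d, k)      # masks keeping exactly k of the d weights

print(f"All subnetworks:  2^{d} ~ 10^{int(log10(num_subnetworks))}")
print(f"k-sparse masks:   C({d},{k}) ~ 10^{int(log10(num_k_sparse))}")
print(f"P(random k-sparse mask is one particular ticket) ~ 10^-{int(log10(num_k_sparse))}")
```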
Connection to overparameterization:
Overparameterization serves a dual purpose: it enlarges both the pool of candidate sparse structures and the pool of initializations available to those structures. Large networks may succeed not because all their parameters are needed, but because having many parameters makes it likely that some sparse subset, combined with its initialization, forms an effective learner.
The original Lottery Ticket paper identified winning tickets using Iterative Magnitude Pruning (IMP)—a simple yet powerful algorithm that alternates between training and pruning.
The IMP Algorithm:
The key insight: we use the training process to identify which weights matter (the pruning mask), but then we reset to the original initialization before the next training round.
```python
import numpy as np
from copy import deepcopy


class IterativeMagnitudePruning:
    """
    Implementation of Iterative Magnitude Pruning to find lottery tickets.
    This is a simplified demonstration showing the core algorithm.
    """

    def __init__(
        self,
        initial_weights: dict,
        prune_percent: float = 20.0,
        num_rounds: int = 10
    ):
        """
        Args:
            initial_weights: Dictionary of initial weight arrays
            prune_percent: Percentage of remaining weights to prune each round
            num_rounds: Number of prune-train cycles
        """
        self.theta_0 = deepcopy(initial_weights)  # Save original init
        self.prune_percent = prune_percent
        self.num_rounds = num_rounds
        # Initialize mask (all ones = all weights present)
        self.mask = {k: np.ones_like(v) for k, v in initial_weights.items()}

    def get_current_weights(self):
        """Apply mask to original initialization."""
        return {k: self.mask[k] * self.theta_0[k] for k in self.theta_0}

    def get_current_sparsity(self):
        """Calculate current sparsity percentage."""
        total_params = sum(m.size for m in self.mask.values())
        nonzero_params = sum(np.sum(m != 0) for m in self.mask.values())
        return 100.0 * (1 - nonzero_params / total_params)

    def prune_round(self, trained_weights: dict):
        """
        Perform one round of magnitude pruning.

        Args:
            trained_weights: Weights after training (used to determine magnitudes)

        Returns:
            Updated mask after pruning
        """
        # Collect all weight magnitudes (only for currently unmasked weights)
        all_magnitudes = []
        for k in trained_weights:
            masked_weights = self.mask[k] * trained_weights[k]
            magnitudes = np.abs(masked_weights[self.mask[k] != 0])
            all_magnitudes.extend(magnitudes)
        all_magnitudes = np.array(all_magnitudes)

        # Find threshold for pruning
        num_to_prune = int(len(all_magnitudes) * self.prune_percent / 100)
        if num_to_prune == 0:
            return self.mask
        threshold = np.sort(all_magnitudes)[num_to_prune]

        # Apply pruning to mask
        for k in trained_weights:
            masked_weights = self.mask[k] * trained_weights[k]
            prune_mask = np.abs(masked_weights) > threshold
            self.mask[k] = self.mask[k] * prune_mask.astype(float)
        return self.mask

    def run_imp(self, train_function):
        """
        Run the full IMP algorithm.

        Args:
            train_function: Function that takes weights and returns trained weights

        Returns:
            Final mask defining the winning ticket
        """
        results = []
        for round_idx in range(self.num_rounds):
            # Get current masked initialization
            current_weights = self.get_current_weights()

            # Train the network
            trained_weights = train_function(current_weights)

            # Record results before pruning
            sparsity = self.get_current_sparsity()
            results.append({
                'round': round_idx,
                'sparsity': sparsity,
                'weights': deepcopy(current_weights),
            })
            print(f"Round {round_idx}: Sparsity = {sparsity:.1f}%")

            # Prune and reset to original init (key step!)
            self.prune_round(trained_weights)
        return self.mask, results


def demonstrate_imp():
    """Demonstrate IMP algorithm with a toy example."""
    np.random.seed(42)

    # Create a simple "network" with random weights
    initial_weights = {
        'layer1': np.random.randn(100, 50) * 0.1,
        'layer2': np.random.randn(50, 10) * 0.1,
    }
    total_params = sum(w.size for w in initial_weights.values())
    print(f"Total parameters: {total_params}")
    print()

    # Mock training function (in reality, this would train the network)
    def mock_train(weights):
        """Simulate training by scaling weights randomly."""
        trained = {}
        for k, v in weights.items():
            # Simulate: some weights grow important, others stay small
            importance = np.random.exponential(0.5, v.shape)
            trained[k] = v * importance
        return trained

    # Run IMP
    imp = IterativeMagnitudePruning(
        initial_weights=initial_weights,
        prune_percent=20.0,
        num_rounds=8
    )
    print("Running Iterative Magnitude Pruning")
    print("=" * 50)
    final_mask, results = imp.run_imp(mock_train)
    print()
    print("Final winning ticket structure identified!")
    final_sparsity = imp.get_current_sparsity()
    print(f"Final sparsity: {final_sparsity:.1f}%")
    print(f"Remaining parameters: {int(total_params * (100 - final_sparsity) / 100)}")


demonstrate_imp()
```

Why IMP works:
Magnitude as importance proxy: Weights that are large after training are likely important for the learned function
Iterative refinement: Gradual pruning (20% at a time) is more effective than one-shot pruning to extreme sparsity
Initialization reset: Returning to θ₀ tests whether the structure alone (with original init) suffices, not whether the trained weights can be pruned
Computational cost:
IMP is expensive—it requires multiple full training runs. For n rounds of pruning:
$$\text{Total Training Cost} = n \times \text{Cost of One Training Run}$$
This has motivated research into more efficient methods for finding winning tickets.
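As a rough illustration of this trade-off, the short sketch below tabulates remaining density and cumulative training cost for the 20%-per-round schedule used above (one full training run per round is a simplifying assumption):

```python
prune_percent = 20.0  # fraction of surviving weights removed each round

print(f"{'round':>5} {'density %':>10} {'sparsity %':>11} {'training runs':>14}")
density = 100.0
for n in range(1, 11):
    density *= 1 - prune_percent / 100  # each round keeps 80% of what remains
    print(f"{n:>5} {density:>10.1f} {100 - density:>11.1f} {n:>14}")

# After 10 rounds: roughly 10.7% density (89.3% sparsity) for the price of
# 10 full training runs, which is why cheaper alternatives are attractive.
```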
For larger networks, resetting to iteration 0 often fails. The 'rewinding' variant resets to an early iteration (e.g., iteration k where k is 0.1% to 1% of total training). This early point provides enough training signal for initialization while still enabling ticket identification.
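A minimal sketch of the rewinding variant, assuming hypothetical `train` and `prune` callables (placeholders for a real training loop and magnitude pruning): the only change from the original procedure is that surviving weights are reset to the early checkpoint θ_k rather than to θ₀.

```python
import numpy as np

def imp_with_rewinding(theta_0, train, prune, num_rounds=5, rewind_iter=500):
    """IMP with weight rewinding (sketch, not the original authors' code).

    train(weights, record_at): trains the masked network, returning
        (final_weights, snapshot_of_weights_at_iteration_record_at)
    prune(mask, trained_weights): returns an updated {0,1} mask
    """
    mask = {k: np.ones_like(v) for k, v in theta_0.items()}

    # The first (dense) run also records the early checkpoint theta_k
    trained, theta_k = train(theta_0, record_at=rewind_iter)

    for _ in range(num_rounds):
        mask = prune(mask, trained)
        # Key difference from resetting to iteration 0:
        # surviving weights are rewound to theta_k, not theta_0
        rewound = {k: mask[k] * theta_k[k] for k in theta_k}
        trained, _ = train(rewound, record_at=rewind_iter)

    return mask, theta_k
```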
The Lottery Ticket Hypothesis has spawned significant theoretical work trying to explain why winning tickets exist and what they tell us about neural network training.
The pruning as architecture search view:
Training a dense network and pruning can be seen as implicit architecture search. The training process evaluates which weights (and hence, which connections) are important, while pruning crystallizes this into an explicit sparse architecture.
$$\text{Dense Training} + \text{Pruning} \approx \text{Architecture Search}$$
This perspective connects lottery tickets to Neural Architecture Search (NAS), but with a key difference: instead of searching over discrete architecture choices, we search over the 2^d possible binary masks.
The linear mode connectivity hypothesis:
Research has shown that winning tickets often lie in the same loss basin as the fully trained dense network. Specifically, two solutions are called linearly mode connected if the loss along the straight line between their weights never rises appreciably above the loss at the endpoints; winning tickets tend to be found precisely in regimes where training is stable in this sense.
This suggests the dense network and winning ticket converge to the same functional solution, just expressed with different (dense vs. sparse) parameterizations.
A stronger claim, the Strong Lottery Ticket Hypothesis: a randomly initialized network contains a subnetwork that achieves good accuracy without any training, simply by selecting the right weights. This has been proven for sufficiently wide networks, connecting to the theory of random features and the Neural Tangent Kernel.
Mathematical formalization:
Let f(x; θ, m) be the function computed by the network with parameters θ and mask m. The Lottery Ticket Hypothesis claims:
$$\exists m^* \text{ sparse}: \quad \min_\theta L(f(\cdot; \theta, m^*)) \approx \min_\theta L(f(\cdot; \theta, \mathbf{1}))$$
where 1 is the all-ones mask (dense network). Moreover, the minimization over θ starting from m^* ⊙ θ₀ succeeds, while starting from random initialization typically fails.
Why random sparse initialization fails:
A randomly initialized sparse network has the right structure but the wrong weights: its initialization is no longer matched to the mask, so the structure-initialization pairing that made the ticket trainable is lost, and the sparse network typically trains to lower accuracy. The table below summarizes the main variants of the lottery ticket procedure and when they succeed or fail.
| Variant | Reset Point | When It Works | When It Fails |
|---|---|---|---|
| Original (reset to θ₀) | Iteration 0 | Small networks, MNIST, CIFAR-10 | Large networks (ResNet-50+) |
| Rewinding | Early iteration k | Larger networks, ImageNet | Very aggressive sparsity |
| Learning Rate Rewinding | Iteration k, reset LR schedule | Matches rewinding performance | No significant failures known |
| Strong LTH (no training) | N/A | Sufficiently overparameterized | Practical networks |
The Lottery Ticket Hypothesis reshapes our understanding of what happens during neural network training.
Training as ticket selection:
Rather than viewing training as 'learning the weights,' we can view it as 'identifying the winning ticket.' The dense network contains exponentially many possible sparse subnetworks; training selects which one 'wins.'
This perspective explains several phenomena:
Why overparameterization helps: More parameters = more possible tickets = higher chance of a winner
Why initialization matters: The winning ticket's effectiveness depends on its specific initialization
Why pruning works: Training already identified the important weights; we're just removing the unimportant ones
The Lottery Ticket Hypothesis offers a resolution to the overparameterization paradox: large networks generalize well not despite having many parameters, but because having many parameters increases the probability of containing effective sparse subnetworks. The network uses its parameters as a 'search space' for finding good architectures.
Implications for regularization:
The lottery ticket perspective connects to implicit regularization:
SGD selects tickets: The stochastic optimization process biases toward certain tickets (perhaps simpler ones that are found more easily)
Architecture limits ticket pool: Network architecture determines which tickets are possible; this is architectural regularization
Sparsity is implicit: The winning ticket is sparse even though we never explicitly regularized for sparsity
Practical implications:
| Finding | Practical Implication |
|---|---|
| Dense networks contain sparse winners | Dense training + pruning is effective |
| Initialization matters | Keep original init when pruning; don't re-initialize |
| Gradual pruning outperforms one-shot | Use iterative pruning schedules |
| Early weights predict winners | May be possible to prune early in training |
A natural question: can we find winning tickets before training, avoiding the expensive train-prune-reset cycle? This has led to research on pruning at initialization.
The SNIP algorithm (Single-shot Network Pruning):
SNIP identifies important connections based on their gradient-signal magnitudes at initialization:
$$s_j = \left|\frac{\partial L}{\partial c_j}\right|_{c=\mathbf{1}}$$
where c_j ∈ {0,1} indicates whether connection j exists. Because the loss depends on c_j only through the product c_j w_j, this sensitivity equals |w_j · ∂L/∂w_j|, the weight-gradient product computed in practice. Connections with high sensitivity to removal are kept.
Key insight: Even before training, gradients reveal which connections are important for the loss function on the training data.
GraSP extends SNIP by preserving gradient flow rather than just gradient magnitude. It keeps connections that, when removed, would most reduce the gradient signal to remaining parameters. This leads to better trainability of the resulting sparse network.
Comparison of pruning-at-init methods:
| Method | Criterion | Computational Cost | Performance |
|---|---|---|---|
| Random | None | Minimal | Poor |
| SNIP | Connection sensitivity | One backward pass | Moderate |
| GraSP | Gradient flow preservation | One backward pass | Good |
| SynFlow | Synaptic saliency | Few forward passes | Good |
| IMP | Post-training magnitude | Full training × rounds | Best |
The gap to IMP:
While pruning-at-init methods are much faster, they typically underperform IMP, especially at high sparsity.
This suggests that some training signal is necessary to identify the best tickets.
```python
import numpy as np


def snip_pruning(weights, gradients, target_sparsity=0.9):
    """
    Simplified SNIP (Single-shot Network Pruning) implementation.

    SNIP prunes based on connection sensitivity: how much the loss
    changes when a connection is removed.

    Args:
        weights: Dictionary of weight arrays
        gradients: Dictionary of gradient arrays (from one forward-backward pass)
        target_sparsity: Fraction of weights to prune (0.9 = 90% pruned)

    Returns:
        Pruning mask
    """
    # Compute connection sensitivity: |weight * gradient|
    # This approximates the change in loss from removing the connection
    sensitivities = {}
    all_scores = []
    for layer_name in weights:
        w = weights[layer_name]
        g = gradients[layer_name]
        # Connection sensitivity score
        sensitivity = np.abs(w * g)
        sensitivities[layer_name] = sensitivity
        all_scores.extend(sensitivity.flatten())

    # Find threshold for target sparsity
    all_scores = np.array(all_scores)
    threshold = np.percentile(all_scores, target_sparsity * 100)

    # Create masks
    masks = {}
    for layer_name, sens in sensitivities.items():
        masks[layer_name] = (sens > threshold).astype(float)

    # Report statistics
    total_params = sum(w.size for w in weights.values())
    remaining_params = sum(np.sum(m) for m in masks.values())
    actual_sparsity = 1 - remaining_params / total_params
    print(f"SNIP Pruning Results:")
    print(f"  Target sparsity: {target_sparsity * 100:.1f}%")
    print(f"  Actual sparsity: {actual_sparsity * 100:.1f}%")
    print(f"  Remaining parameters: {int(remaining_params)} / {total_params}")

    return masks


def demonstrate_snip():
    """Demonstrate SNIP on a toy example."""
    np.random.seed(42)

    # Simulate network weights at initialization
    weights = {
        'layer1': np.random.randn(784, 256) * 0.01,
        'layer2': np.random.randn(256, 128) * 0.01,
        'layer3': np.random.randn(128, 10) * 0.01,
    }

    # Simulate gradients from one forward-backward pass
    # In practice, these come from backprop on a mini-batch
    gradients = {
        'layer1': np.random.randn(784, 256) * np.random.exponential(0.1, (784, 256)),
        'layer2': np.random.randn(256, 128) * np.random.exponential(0.1, (256, 128)),
        'layer3': np.random.randn(128, 10) * np.random.exponential(0.1, (128, 10)),
    }

    print("=" * 50)
    print("SNIP: Single-shot Network Pruning")
    print("=" * 50)
    print()

    for sparsity in [0.5, 0.8, 0.9, 0.95]:
        print(f"\nSparsity level: {sparsity * 100}%")
        masks = snip_pruning(weights, gradients, target_sparsity=sparsity)

        # Show per-layer statistics
        for layer_name in weights:
            layer_remaining = np.sum(masks[layer_name])
            layer_total = masks[layer_name].size
            print(f"  {layer_name}: {int(layer_remaining)}/{layer_total} "
                  f"({100*layer_remaining/layer_total:.1f}% remaining)")


demonstrate_snip()
```

The Lottery Ticket Hypothesis has inspired numerous applications and extensions beyond its original scope.
Network compression:
The most direct application is efficient model deployment: a winning ticket matches the dense network's accuracy with a small fraction of its parameters, cutting storage, memory footprint, and inference cost. This is particularly valuable for edge devices, mobile phones, and embedded systems where memory and computation are constrained.
Winning tickets transfer across datasets to some extent. A ticket found on ImageNet often works (with retraining) on other vision tasks. This suggests tickets capture general-purpose visual processing structures, not task-specific solutions.
Extensions to different domains:
| Domain | Finding | Implication |
|---|---|---|
| NLP (Transformers) | Tickets exist but need rewinding | LLMs are compressible post-training |
| Reinforcement Learning | Tickets transfer across tasks | Reusable RL agent structures |
| GANs | Both generator and discriminator have tickets | Efficient generative models |
| Graph Neural Networks | Tickets exist with structural sparsity | Efficient graph processing |
Structured lottery tickets:
Beyond unstructured (individual weight) sparsity, research has explored structured tickets in which whole units are pruned: entire neurons, convolutional channels or filters, attention heads, or even layers.
Structured sparsity is hardware-friendly but typically requires higher density for equivalent accuracy.
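For contrast with the element-wise masks used so far, here is a minimal sketch of structured (output-channel) magnitude pruning for a convolutional weight tensor; the tensor shape and keep ratio are arbitrary illustrative choices:

```python
import numpy as np

def prune_channels_by_norm(conv_weight, keep_ratio=0.5):
    """Structured pruning: remove whole output channels with the smallest L2 norm.

    conv_weight: array of shape (out_channels, in_channels, kh, kw)
    Returns a channel mask of shape (out_channels,) and the pruned tensor.
    """
    out_channels = conv_weight.shape[0]
    # One score per output channel: the L2 norm of all its weights
    channel_norms = np.sqrt((conv_weight ** 2).reshape(out_channels, -1).sum(axis=1))

    num_keep = max(1, int(round(keep_ratio * out_channels)))
    keep_idx = np.argsort(channel_norms)[-num_keep:]          # largest-norm channels
    channel_mask = np.zeros(out_channels)
    channel_mask[keep_idx] = 1.0

    pruned = conv_weight * channel_mask[:, None, None, None]  # zero whole channels
    return channel_mask, pruned

# Example: a random 64-channel conv layer pruned to 50% of its channels
w = np.random.randn(64, 32, 3, 3) * 0.1
mask, w_pruned = prune_channels_by_norm(w, keep_ratio=0.5)
print(f"Channels kept: {int(mask.sum())}/{len(mask)}")
```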
Lottery tickets and neural architecture search:
There's a deep connection between lottery tickets and NAS: finding a winning ticket is a search over the 2^d binary masks of a fixed dense 'supernetwork', with training supplying the signal for which connections to keep, much as NAS searches over candidate architectures within a larger search space.
This view unifies two successful paradigms: large-scale training + pruning (lottery tickets) and architecture search (NAS).
Lottery tickets in pre-training:
For pre-trained models (like BERT, GPT): winning tickets can be identified in the pre-trained weights themselves, with the pre-trained checkpoint playing the role of the initialization, and the sparse subnetworks found during fine-tuning often transfer across downstream tasks.
This enables deploying compact versions of large language models while retaining most of their capability.
The Lottery Ticket Hypothesis connects deeply to the theme of implicit regularization, revealing how training naturally discovers effective sparse structures.
Implicit sparsity through training:
Even without explicit sparsity regularization (like L1), training induces a form of implicit sparsity: many weights end up with small magnitudes and contribute little to the learned function, while a minority grow large and carry most of it.
The lottery ticket is a crystallization of this implicit sparsity—it makes explicit the sparse structure that was latent in the dense solution.
SGD doesn't just find good weights—it implicitly selects which sparse subnetwork 'wins.' The stochasticity of SGD may help explore different tickets early in training, while convergence later 'commits' to one. This is analogous to simulated annealing: explore broadly, then exploit locally.
Lottery tickets and flat minima:
Winning tickets tend to occupy flat regions of the loss landscape: their trained solutions sit in basins where small perturbations of the remaining weights change the loss only slightly.
This connects to SGD's implicit bias toward flat minima: winning tickets may be 'easy' for SGD to train precisely because they lead to flat solutions.
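One way to probe such a claim empirically is to measure how much the loss rises under small random weight perturbations; flatter solutions rise less. A minimal sketch of such a probe, where `loss_fn` is a placeholder for evaluating a real model on held-out data:

```python
import numpy as np

def flatness_probe(loss_fn, theta, radius=0.01, num_samples=20, seed=0):
    """Average loss increase under random perturbations of relative size `radius`.

    loss_fn: callable mapping a dict of weight arrays to a scalar loss (placeholder)
    theta:   dict of trained weight arrays (e.g., a trained winning ticket)
    Lower return values indicate a flatter neighbourhood.
    """
    rng = np.random.default_rng(seed)
    base_loss = loss_fn(theta)
    increases = []
    for _ in range(num_samples):
        perturbed = {}
        for k, v in theta.items():
            # Per-layer perturbation scaled so its norm is about radius * ||v||
            noise = rng.standard_normal(v.shape)
            scale = radius * np.linalg.norm(v) / (np.linalg.norm(noise) + 1e-12)
            perturbed[k] = v + scale * noise
        increases.append(loss_fn(perturbed) - base_loss)
    return float(np.mean(increases))
```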
Lottery tickets and generalization:
Why do winning tickets generalize well?
| Regularization Source | Mechanism in Lottery Tickets |
|---|---|
| SGD stochasticity | Explores many tickets; settles on trainable ones |
| Architecture | Defines which tickets are possible |
| Initialization | Provides 'good starting points' for some tickets |
| Sparsity (implicit) | Training creates magnitude disparity; tickets crystallize this |
| Early stopping | May select simpler tickets by halting before complex patterns learned |
The broader picture:
The Lottery Ticket Hypothesis reveals that neural network training is doing more than gradient descent on a fixed architecture. It's simultaneously optimizing the weights and selecting which sparse subnetwork those weights will ultimately express, an implicit architecture search running alongside learning.
This unified view positions overparameterization as an exploration strategy: have many options (parameters/connections) to increase the chance of discovering a good solution. The 'waste' of unused parameters isn't waste—it's the search cost for finding winning tickets.
The Lottery Ticket Hypothesis has fundamentally changed our understanding of neural network training, overparameterization, and the implicit structures discovered during learning.
You now understand the Lottery Ticket Hypothesis—a landmark discovery connecting overparameterization, architecture search, and implicit regularization. Next, we'll explore benign overfitting—another surprising phenomenon where models that perfectly memorize training data can still generalize well.