Early NAS methods faced a fundamental efficiency problem: each candidate architecture required training from scratch. With training taking hours to days and search spaces containing billions of architectures, exhaustive evaluation was impossible.
Weight sharing (also called one-shot NAS) solved this by training a single supernet that contains all architectures in the search space as subnetworks. Different architectures share weights in the supernet, allowing performance estimation without separate training.
This innovation reduced NAS cost from thousands of GPU-days to hours—a 10,000x improvement that democratized neural architecture search.
Understand the supernet paradigm, how weight sharing works, the training and search phases of one-shot NAS, the coupling effects that limit weight sharing, and techniques to improve supernet quality.
A supernet (or one-shot model) is a single over-parameterized network that encodes the entire search space. Every architecture in the search space corresponds to a subnetwork (or subnet) of the supernet.
Key Insight:
If architectures share structure (same layer positions, operation choices), they can share weights. The supernet contains weights for all possible operations at all positions. A specific architecture uses a subset of these weights.
Mathematical Formulation:
Let $W$ be the supernet weights. For architecture $a$, the subnet weights are: $$w_a = M_a \odot W$$
where $M_a$ is a binary mask selecting relevant weights for architecture $a$.
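As a concrete check of this masking formula, here is a tiny illustrative sketch (using NumPy for brevity; the three-operation edge and its weight shapes are made up for the example):

```python
import numpy as np

# Supernet weights W for one edge: 3 candidate operations,
# each with a (2, 2) weight tensor
W = np.arange(12, dtype=float).reshape(3, 2, 2)

# Architecture a selects operation index 1 on this edge, so its
# binary mask M_a keeps op 1's weights and zeros out the rest
M_a = np.zeros_like(W)
M_a[1] = 1.0

# w_a = M_a ⊙ W: elementwise product, only op 1's weights survive
w_a = M_a * W
```

The subnet never owns separate parameters; `w_a` is just a view of which supernet weights are active for this architecture.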
Evaluation without retraining:
Once the supernet is trained, any architecture can be evaluated by selecting its subnetwork, inheriting the corresponding supernet weights, and running inference on validation data. No additional training is required; evaluation takes seconds instead of hours.
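A minimal sketch of such an evaluation routine (the helper name and signature are assumptions for illustration; it presumes a supernet whose forward pass takes an `(inputs, architecture)` pair, as in the code elsewhere in this section):

```python
import torch


def evaluate_architecture(supernet, architecture, val_loader, max_batches=None):
    """Estimate one subnet's accuracy using inherited supernet weights."""
    supernet.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for batch_idx, (inputs, targets) in enumerate(val_loader):
            if max_batches is not None and batch_idx >= max_batches:
                break
            # The forward pass activates only the selected operations;
            # no weights are updated
            outputs = supernet(inputs, architecture)
            correct += (outputs.argmax(dim=1) == targets).sum().item()
            total += targets.size(0)
    return correct / max(total, 1)
```

Capping `max_batches` trades a little estimation accuracy for even faster evaluation, which matters when scoring thousands of candidates.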
```python
import random

import torch
import torch.nn as nn


class SupernetCell(nn.Module):
    """A cell containing all possible operations.

    Architectures are evaluated by selecting specific paths.
    """

    def __init__(self, channels, operations, num_nodes=4):
        super().__init__()
        self.num_nodes = num_nodes
        self.num_ops = len(operations)

        # Create all possible operations for all edges
        self.edges = nn.ModuleDict()
        for i in range(num_nodes):
            for j in range(2 + i):  # 2 cell inputs + previous nodes
                edge_ops = nn.ModuleList([
                    self._build_op(op, channels) for op in operations
                ])
                self.edges[f'{j}->{2 + i}'] = edge_ops

        # Project the concatenated node outputs back to `channels`
        # so cells can be stacked
        self.project = nn.Conv2d(num_nodes * channels, channels, 1, bias=False)

    def _build_op(self, op_name, channels):
        """Map an operation name to a module (illustrative subset;
        extend with the search space's real operations)."""
        if op_name == 'skip_connect':
            return nn.Identity()
        if op_name == 'conv_3x3':
            return nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        raise ValueError(f'unknown operation: {op_name}')

    def forward(self, s0, s1, architecture):
        """Forward pass for a specific architecture.

        architecture: dict mapping f'node_{i}' to a tuple
        (input1, op1_index, input2, op2_index)
        """
        states = [s0, s1]
        for i in range(self.num_nodes):
            # Get inputs and operations for this node from the architecture
            inp1, op1, inp2, op2 = architecture[f'node_{i}']

            # Apply the selected operations from the supernet
            out1 = self.edges[f'{inp1}->{2 + i}'][op1](states[inp1])
            out2 = self.edges[f'{inp2}->{2 + i}'][op2](states[inp2])
            states.append(out1 + out2)

        return self.project(torch.cat(states[2:], dim=1))


class Supernet(nn.Module):
    """Complete supernet for one-shot NAS."""

    def __init__(self, config):
        super().__init__()
        self.num_nodes = config.num_nodes
        self.num_ops = len(config.operations)
        self.stem = nn.Conv2d(3, config.channels, 3, padding=1, bias=False)
        self.cells = nn.ModuleList([
            SupernetCell(config.channels, config.operations, config.num_nodes)
            for _ in range(config.num_cells)
        ])
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(config.channels, config.num_classes),
        )

    def forward(self, x, architecture):
        """Forward with a specific architecture."""
        s0 = s1 = self.stem(x)
        for cell in self.cells:
            s0, s1 = s1, cell(s0, s1, architecture)
        return self.classifier(s1)

    def sample_random_architecture(self):
        """Sample a random subnet from the supernet."""
        arch = {}
        for i in range(self.num_nodes):
            valid_inputs = list(range(2 + i))
            arch[f'node_{i}'] = (
                random.choice(valid_inputs),          # input 1
                random.randint(0, self.num_ops - 1),  # op 1
                random.choice(valid_inputs),          # input 2
                random.randint(0, self.num_ops - 1),  # op 2
            )
        return arch
```

Supernet training presents a unique challenge: we want all subnets to inherit good weights, but we can't train all exponentially many of them explicitly.
Path Sampling Training:
The standard approach samples random paths (architectures) during training:
Over training, all operations receive updates proportional to their sampling probability.
```python
import torch.nn.functional as F


def train_supernet_epoch(supernet, train_loader, optimizer):
    """Train the supernet with random path sampling."""
    supernet.train()

    for inputs, targets in train_loader:
        # Sample a random architecture for this batch
        architecture = supernet.sample_random_architecture()

        # Forward with this architecture
        outputs = supernet(inputs, architecture)
        loss = F.cross_entropy(outputs, targets)

        # Backward updates only the active weights
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


def train_supernet_fairnas(supernet, train_loader, optimizer):
    """FairNAS-style training: average the loss over several sampled
    architectures per batch so operations are trained more evenly."""
    supernet.train()

    for inputs, targets in train_loader:
        total_loss = 0

        # Sample multiple architectures to improve coverage
        num_samples = 4
        for _ in range(num_samples):
            architecture = supernet.sample_random_architecture()
            outputs = supernet(inputs, architecture)
            total_loss += F.cross_entropy(outputs, targets)

        avg_loss = total_loss / num_samples
        optimizer.zero_grad()
        avg_loss.backward()
        optimizer.step()
```

Once trained, the supernet enables fast architecture evaluation. The search phase applies any search strategy (random, evolution, RL) using supernet performance as a proxy for standalone performance.
Two-Stage Search Pipeline:
```python
import random


def search_with_supernet(supernet, val_loader, search_space,
                         search_budget=1000):
    """Search for the best architecture using a trained supernet."""
    supernet.eval()
    best_arch = None
    best_acc = 0.0

    for _ in range(search_budget):
        # Sample an architecture
        arch = search_space.sample_random()

        # Evaluate using the supernet (fast!)
        acc = evaluate_architecture(supernet, arch, val_loader)

        if acc > best_acc:
            best_acc = acc
            best_arch = arch

    return best_arch, best_acc


def evolutionary_search_supernet(supernet, val_loader, search_space,
                                 population_size=50, generations=20):
    """Evolutionary search using the supernet for fitness evaluation."""
    # Initialize the population
    population = [
        search_space.sample_random() for _ in range(population_size)
    ]

    best_arch, best_fitness = None, 0.0
    for gen in range(generations):
        # Evaluate all individuals using the supernet
        fitness = [
            evaluate_architecture(supernet, arch, val_loader)
            for arch in population
        ]

        # Track the best architecture seen across generations
        # (offspring are only evaluated in the next generation)
        gen_best = max(zip(population, fitness), key=lambda x: x[1])
        if gen_best[1] > best_fitness:
            best_arch, best_fitness = gen_best

        # Select the top-k as parents
        top_k = 10
        sorted_pop = sorted(
            zip(population, fitness), key=lambda x: x[1], reverse=True
        )
        parents = [p[0] for p in sorted_pop[:top_k]]

        # Generate offspring through mutation (mutate is a helper
        # provided by the search space)
        offspring = []
        while len(offspring) < population_size - top_k:
            parent = random.choice(parents)
            child = mutate(parent, search_space)
            offspring.append(child)

        population = parents + offspring

    # Return the best architecture found over the whole search
    return best_arch, best_fitness
```

Weight sharing has a fundamental limitation: weight coupling. When architectures share weights, the optimal weights for one architecture may not be optimal for another.
The Problem:
Consider two architectures $a_1$ and $a_2$ that share some weights $w_s$. Trained standalone, each would reach its own optimum: $$w_s^{*} = \arg\min_{w} \mathcal{L}(a_1, w), \qquad w_s^{**} = \arg\min_{w} \mathcal{L}(a_2, w)$$
If $w_s^{*} \neq w_s^{**}$, the shared supernet weights can't be optimal for both.
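A deliberately simplified scalar example (made up for illustration, not from any real search space) makes the compromise concrete:

```python
# One shared scalar weight w; two architectures with different optima.
def loss_a1(w):
    return (w - 1.0) ** 2   # a1's loss, minimized at w = +1


def loss_a2(w):
    return (w + 1.0) ** 2   # a2's loss, minimized at w = -1


# Uniform path sampling effectively minimizes the average of the two
# losses, whose minimum is w = 0: optimal for neither architecture.
w_shared = 0.0
gap_a1 = loss_a1(w_shared) - loss_a1(+1.0)  # excess loss vs standalone
gap_a2 = loss_a2(w_shared) - loss_a2(-1.0)
```

Both subnets pay a nonzero penalty (`gap_a1 = gap_a2 = 1.0` here) relative to standalone training, which is exactly why supernet accuracy is a biased estimate.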
Consequences:
| Study | Search Space | Kendall τ | Finding |
|---|---|---|---|
| Yu et al. 2020 | DARTS space | ~0.2 | Low correlation; significant mismatch |
| Zela et al. 2020 | NAS-Bench-201 | ~0.5-0.7 | Moderate correlation; depends on training |
| Yang et al. 2021 | Various | Varies | Correlation degrades with space size |
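Kendall τ itself is straightforward to compute. A small self-contained sketch (the five accuracy pairs are hypothetical, chosen only to illustrate a moderate correlation; ties are ignored for simplicity):

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall rank correlation: (concordant - discordant) / total pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1      # pair ranked the same way in both lists
        elif s < 0:
            discordant += 1      # pair ranked oppositely
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs


# Hypothetical accuracies for five architectures
supernet_acc = [0.62, 0.58, 0.65, 0.60, 0.55]    # proxy ranking
standalone_acc = [0.93, 0.94, 0.95, 0.91, 0.90]  # ground-truth ranking

tau = kendall_tau(supernet_acc, standalone_acc)  # 0.6 for this data
```

A τ of 1.0 means the supernet ranks architectures exactly as standalone training would; values around 0.2, as reported for the DARTS space, mean the proxy ranking is only weakly informative.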
Weight coupling means supernet evaluation is a proxy, not ground truth. The best architecture by supernet evaluation may not be best when trained standalone. Always validate top candidates with full training.
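One common guard against this proxy gap is to retrain only the supernet's top-ranked candidates from scratch and pick the winner by true accuracy. A minimal sketch (the `train_from_scratch` callback is a hypothetical helper assumed to fully train an architecture and return its validation accuracy):

```python
def rerank_top_candidates(candidates, supernet_scores,
                          train_from_scratch, top_k=5):
    """Re-rank the supernet's top-k candidates by standalone accuracy."""
    ranked = sorted(zip(candidates, supernet_scores),
                    key=lambda pair: pair[1], reverse=True)
    finalists = [arch for arch, _ in ranked[:top_k]]

    # Full training costs hours per architecture, so keep top_k small
    true_accs = [train_from_scratch(arch) for arch in finalists]

    best = max(range(len(finalists)), key=lambda i: true_accs[i])
    return finalists[best], true_accs[best]
```

This keeps the expensive step bounded: the supernet filters billions of architectures down to a handful, and full training only has to break ties among those.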
Researchers have developed techniques to mitigate weight coupling and improve supernet reliability:
```python
import torch.nn.functional as F


def sandwich_training_step(supernet, inputs, targets, optimizer):
    """Sandwich training: train the smallest, largest, and random subnets.

    Improves supernet quality for extreme architectures.
    """
    total_loss = 0

    # 1. Train the smallest subnet
    smallest_arch = supernet.get_smallest_architecture()
    out = supernet(inputs, smallest_arch)
    total_loss += F.cross_entropy(out, targets)

    # 2. Train the largest subnet
    largest_arch = supernet.get_largest_architecture()
    out = supernet(inputs, largest_arch)
    total_loss += F.cross_entropy(out, targets)

    # 3. Train random subnets (in between)
    for _ in range(2):
        random_arch = supernet.sample_random_architecture()
        out = supernet(inputs, random_arch)
        total_loss += F.cross_entropy(out, targets)

    # Average over the four sampled subnets and update
    optimizer.zero_grad()
    (total_loss / 4).backward()
    optimizer.step()
```

ENAS (Efficient Neural Architecture Search) introduced weight sharing to NAS, reducing search cost from thousands of GPU-days to roughly 0.5 GPU-days.
Key Ideas:
- Every candidate architecture is a subgraph of one large computational graph (DAG), so all candidates share weights.
- An LSTM controller samples architectures and is trained with REINFORCE, using the sampled child model's validation accuracy as the reward.
- Training alternates between updating the shared weights on sampled architectures and updating the controller.
Innovation: By sharing weights, ENAS amortizes the cost of training across all evaluated architectures.
You now understand weight sharing—the technique that made NAS practical. Next, we'll explore efficient NAS methods including zero-cost proxies and other acceleration techniques.