Even with weight sharing reducing NAS from thousands of GPU-days to hours, further efficiency gains remain valuable: the faster a search runs, the cheaper it is to repeat for new datasets, tasks, and hardware targets.
This page explores the frontier of efficient NAS: zero-cost proxies that evaluate architectures without training, early stopping methods that eliminate poor candidates quickly, performance predictors that estimate final accuracy from partial training, and practical systems that combine these techniques.
Master efficient NAS techniques: zero-cost proxies based on network properties, learning curve extrapolation, predictor-based evaluation, and practical efficient NAS systems like Once-for-All and BigNAS.
Zero-cost proxies estimate architecture quality from network properties computed at initialization—before any training. If reliable, they enable evaluating millions of architectures in minutes.
The Idea:
Certain properties of randomly-initialized networks correlate with trained performance. By measuring these properties, we can rank architectures without training.
Common Zero-Cost Proxies:
| Proxy | Measures | Intuition | Compute Cost |
|---|---|---|---|
| synflow | Synaptic flow (gradient × weight) | Trainability; gradient signal preservation | 1 forward + backward |
| grasp | Gradient × Hessian | Trainability; gradient flow quality | 2 forward + backward |
| snip | Sensitivity-based saliency | Parameter importance | 1 forward + backward |
| jacob_cov | Jacobian covariance | Input-output correlation | Multiple forward passes |
| fisher | Fisher information | Information about labels | 1 forward + backward |
| #params | Parameter count | Capacity (crude baseline) | Instantaneous |
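The parameter count in the last row is the crudest of these proxies, but it costs essentially nothing to compute and makes a useful sanity-check baseline when evaluating fancier proxies. A minimal sketch for a standard PyTorch module:

```python
def compute_num_params(model):
    # Crude capacity proxy: total number of trainable parameters
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```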
```python
import torch
import torch.nn as nn
import numpy as np


def compute_synflow(model, input_shape, device='cuda'):
    """
    SynFlow: measures the gradient × weight product at initialization.
    Higher values indicate better trainability.
    """
    model.to(device)
    model.eval()

    # Create dummy input (all ones for SynFlow)
    x = torch.ones(1, *input_shape, device=device)

    # Make all parameters positive for SynFlow
    @torch.no_grad()
    def linearize(model):
        signs = {}
        for name, param in model.named_parameters():
            signs[name] = param.sign()
            param.abs_()
        return signs

    @torch.no_grad()
    def restore(model, signs):
        for name, param in model.named_parameters():
            param.mul_(signs[name])

    signs = linearize(model)

    # Forward pass with the all-ones input
    model.zero_grad()
    output = model(x)
    if isinstance(output, tuple):
        output = output[0]
    torch.sum(output).backward()

    # SynFlow score: sum of |gradient × param|
    score = 0.0
    for param in model.parameters():
        if param.grad is not None:
            score += (param.grad * param).abs().sum().item()

    restore(model, signs)
    return score


def compute_jacob_cov(model, data_loader, num_batches=5, device='cuda'):
    """
    Jacobian covariance: measures correlation between inputs and outputs.
    Higher correlation suggests better feature propagation.
    """
    model.to(device)
    model.eval()

    jacobians = []
    for i, (x, _) in enumerate(data_loader):
        if i >= num_batches:
            break
        x = x.to(device).requires_grad_(True)
        y = model(x)

        # Compute the Jacobian dy/dx, one output dimension at a time
        for j in range(y.shape[1]):
            model.zero_grad()
            y[:, j].sum().backward(retain_graph=True)
            # Clone before zeroing so the stored gradient is not overwritten
            jacobians.append(x.grad.detach().clone().view(x.shape[0], -1))
            x.grad.zero_()

    # Stack and compute the (uncentered) covariance
    J = torch.cat(jacobians, dim=0)

    # Score: trace of the covariance, normalized by the number of rows
    return torch.trace(J.T @ J).item() / J.shape[0]


def zcp_ranking(architectures, model_builder, proxy_fn, **kwargs):
    """
    Rank architectures using a zero-cost proxy.
    """
    scores = []
    for arch in architectures:
        model = model_builder(arch)
        score = proxy_fn(model, **kwargs)
        scores.append((arch, score))

    # Sort by score (higher is better for most proxies)
    return sorted(scores, key=lambda x: x[1], reverse=True)
```

Zero-cost proxies have limited accuracy: they explain roughly 50-70% of the performance variance on typical benchmarks. They are best used for initial filtering (eliminating clearly bad architectures) rather than final selection; combine them with other methods for the best results.
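As a hedged illustration of that filtering workflow, the sketch below assumes helper functions `sample_random_arch` and `build_model` that are not defined on this page:

```python
# Rank a large random pool with SynFlow before any training, then keep a
# shortlist for real evaluation (sample_random_arch and build_model are assumed).
pool = [sample_random_arch() for _ in range(1000)]
ranked = zcp_ranking(pool, build_model, compute_synflow, input_shape=(3, 32, 32))
shortlist = [arch for arch, _ in ranked[:50]]  # train only the top 5%
```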
Rather than training to completion, we can stop unpromising architectures early and extrapolate final performance from partial learning curves.
Successive Halving (SHA):
A simple but effective multi-fidelity approach: train every candidate for a small budget, keep only the best 1/eta fraction, multiply the budget by eta, and repeat until a single candidate remains.
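For concreteness, here is the schedule produced by the defaults used in the implementation further below (81 candidates, reduction factor 3, budgets growing from 1 to 81 epochs):

```python
# Worked SHA schedule: 81 candidates, eta = 3, budget grows from 1 to 81 epochs
n, budget, eta = 81, 1, 3
while n > 1:
    print(f"{n:2d} candidates x {budget:2d} epochs -> keep {max(1, n // eta)}")
    n, budget = max(1, n // eta), budget * eta
# 81x1 -> keep 27, 27x3 -> keep 9, 9x9 -> keep 3, 3x27 -> keep 1
```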
Hyperband:
Combines multiple SHA brackets that trade off the number of starting configurations against the initial per-candidate budget, hedging against choosing a single, possibly wrong, level of early-stopping aggressiveness.
```python
import random

import numpy as np


def successive_halving(
    architectures,
    evaluate_fn,
    initial_budget=1,   # Initial epochs per architecture
    max_budget=81,      # Maximum epochs
    eta=3               # Reduction factor
):
    """
    Successive Halving for architecture selection.

    Args:
        architectures: List of candidate architectures
        evaluate_fn(arch, budget): Train for 'budget' epochs, return accuracy
        initial_budget: Starting training budget
        max_budget: Maximum training budget
        eta: Reduction factor (keep 1/eta best each round)

    Returns:
        (best_arch, epochs_trained, accuracy)
    """
    # Current candidates with their accumulated training
    candidates = [(arch, 0) for arch in architectures]
    current_budget = initial_budget
    best = None

    while len(candidates) > 1 and current_budget <= max_budget:
        print(f"Round: {len(candidates)} candidates, {current_budget} epochs each")

        # Evaluate all remaining candidates
        results = []
        for arch, trained_epochs in candidates:
            # Train for additional epochs (up to current_budget total)
            acc = evaluate_fn(arch, budget=current_budget)
            results.append((arch, current_budget, acc))

        # Sort by accuracy, keep the top 1/eta
        results.sort(key=lambda x: x[2], reverse=True)
        n_keep = max(1, len(results) // eta)
        best = results[0]
        candidates = [(r[0], r[1]) for r in results[:n_keep]]
        current_budget *= eta

    if best is None:  # Only a single candidate was given
        arch = candidates[0][0]
        best = (arch, current_budget, evaluate_fn(arch, budget=current_budget))

    return best  # (best_arch, epochs_trained, accuracy)


def hyperband(architectures, evaluate_fn, max_budget=81, eta=3):
    """
    Hyperband: multiple successive halving brackets with different trade-offs
    between the number of configs and the starting budget.
    """
    s_max = int(np.log(max_budget) / np.log(eta))
    B = (s_max + 1) * max_budget  # Total budget

    best_arch = None
    best_acc = 0

    for s in range(s_max, -1, -1):
        # Number of configs for this bracket
        n = int(np.ceil(B / max_budget / (s + 1) * eta ** s))
        # Initial budget for this bracket
        r = max(1, int(max_budget * eta ** (-s)))

        # Sample n architectures
        archs = random.sample(architectures, min(n, len(architectures)))

        # Run successive halving on this bracket
        winner = successive_halving(archs, evaluate_fn, r, max_budget, eta)
        if winner[2] > best_acc:
            best_acc = winner[2]
            best_arch = winner[0]

    return best_arch, best_acc
```

Performance predictors are ML models trained to predict architecture accuracy from architectural features. Once trained on previously evaluated architectures, they enable instant prediction for new candidates.
Predictor Types: options range from simple regressors over hand-crafted encodings (random forests, gradient-boosted trees, MLPs) to graph neural networks that operate directly on the architecture DAG; the code below sketches one of each.
Building a Predictor:
```python
import torch
import torch.nn as nn
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def encode_architecture(arch, num_nodes, num_ops):
    """
    Encode an architecture as a feature vector.
    Simple approach: one-hot encoding of operations and connections.
    """
    features = []
    for node in arch.nodes:
        # One-hot for input selections
        inp1_onehot = [0] * (2 + num_nodes)
        inp2_onehot = [0] * (2 + num_nodes)
        inp1_onehot[node['inp1']] = 1
        inp2_onehot[node['inp2']] = 1

        # One-hot for operation selections
        op1_onehot = [0] * num_ops
        op2_onehot = [0] * num_ops
        op1_onehot[node['op1']] = 1
        op2_onehot[node['op2']] = 1

        features.extend(inp1_onehot + inp2_onehot + op1_onehot + op2_onehot)

    return np.array(features)


class ArchPredictor:
    """
    Predictor for architecture performance.
    Uses a Random Forest for interpretability and few-shot performance.
    """
    def __init__(self, num_nodes, num_ops):
        self.num_nodes = num_nodes
        self.num_ops = num_ops
        self.model = RandomForestRegressor(
            n_estimators=100,
            max_depth=10,
            random_state=42
        )
        self.trained = False

    def _encode(self, arch):
        return encode_architecture(arch, self.num_nodes, self.num_ops)

    def fit(self, architectures, accuracies):
        """Train the predictor on evaluated architectures."""
        X = np.array([self._encode(a) for a in architectures])
        y = np.array(accuracies)
        self.model.fit(X, y)
        self.trained = True

    def predict(self, architecture):
        """Predict accuracy for a new architecture."""
        if not self.trained:
            raise ValueError("Predictor not trained")
        x = self._encode(architecture).reshape(1, -1)
        return self.model.predict(x)[0]

    def predict_batch(self, architectures):
        """Predict for multiple architectures."""
        X = np.array([self._encode(a) for a in architectures])
        return self.model.predict(X)


class GNNPredictor(nn.Module):
    """
    Graph Neural Network predictor for architectures.
    Treats the architecture as a graph and predicts performance.
    """
    def __init__(self, num_ops, hidden_dim=64):
        super().__init__()
        self.op_embed = nn.Embedding(num_ops, hidden_dim)

        # Message passing layers
        self.conv1 = nn.Linear(hidden_dim, hidden_dim)
        self.conv2 = nn.Linear(hidden_dim, hidden_dim)

        # Readout
        self.readout = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )

    def forward(self, adj, ops):
        """
        adj: Adjacency matrix [N, N]
        ops: Operation indices per node [N]
        """
        x = self.op_embed(ops)  # [N, hidden]

        # Message passing
        x = torch.relu(self.conv1(adj @ x))
        x = torch.relu(self.conv2(adj @ x))

        # Global pooling
        x = x.mean(dim=0)  # [hidden]
        return self.readout(x)  # [1]
```

Combine predictors with acquisition functions for smart sampling: evaluate architectures the predictor is uncertain about (exploration) or that it predicts will perform well (exploitation). This is essentially Bayesian optimization with a learned surrogate.
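A hedged sketch of that predictor-guided loop using a UCB-style acquisition is shown below. The names `candidate_pool` and `train_and_eval`, and the search-space sizes passed to `ArchPredictor`, are assumptions for illustration; the predictor is the Random Forest class defined above.

```python
import random

import numpy as np

predictor = ArchPredictor(num_nodes=4, num_ops=5)  # hypothetical search-space sizes

def ucb_scores(predictor, archs, beta=1.0):
    # The spread of per-tree predictions gives a cheap uncertainty estimate
    X = np.array([predictor._encode(a) for a in archs])
    per_tree = np.stack([tree.predict(X) for tree in predictor.model.estimators_])
    return per_tree.mean(axis=0) + beta * per_tree.std(axis=0)  # exploit + explore

evaluated, accuracies = [], []
for step in range(20):
    if len(evaluated) < 5:
        arch = random.choice(candidate_pool)      # cold start: random sampling
    else:
        predictor.fit(evaluated, accuracies)      # refit on everything seen so far
        scores = ucb_scores(predictor, candidate_pool)
        arch = candidate_pool[int(np.argmax(scores))]
    acc = train_and_eval(arch)                    # expensive true evaluation
    evaluated.append(arch)
    accuracies.append(acc)
    candidate_pool.remove(arch)
```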
Once-for-All (OFA) networks take weight sharing to its logical conclusion: train one network, deploy many architectural variants without any additional training.
Key Innovation: Progressive Shrinking
Rather than training all subnets equally (problematic because their weights are coupled), OFA trains progressively: first the full network at maximum depth, width, and kernel size, then with elastic kernel sizes, then elastic depth, and finally elastic width, so that smaller subnets are fine-tuned without degrading the larger ones.
This produces a single model supporting 10^19+ architectural configurations.
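That count can be sanity-checked with a quick back-of-the-envelope calculation, assuming an OFA-style MobileNet space with 5 units, per-unit depth in {2, 3, 4}, and 3 kernel sizes times 3 width-expansion ratios per layer:

```python
# Each layer has 3 kernel sizes x 3 expansion ratios = 9 choices; each unit
# can use 2, 3, or 4 of its layers; the network has 5 such units.
per_unit = sum(9 ** depth for depth in (2, 3, 4))  # 7,371 subnets per unit
total = per_unit ** 5
print(f"{total:.1e}")                              # ~2.2e19 configurations
```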
```python
import random

import torch
import torch.nn as nn


class OFAMobileNet(nn.Module):
    """
    Once-for-All style network with elastic depth/width/kernel.
    (ElasticBlock and the classifier head are assumed to be defined elsewhere.)
    """
    def __init__(self, max_depth=4, max_width=1.0, kernel_sizes=[3, 5, 7]):
        super().__init__()
        self.max_depth = max_depth
        self.max_width = max_width
        self.kernel_sizes = kernel_sizes

        # Build blocks that support all kernel sizes
        self.blocks = nn.ModuleList([
            ElasticBlock(
                in_channels=...,
                out_channels=...,
                kernel_sizes=kernel_sizes
            )
            for _ in range(max_depth)
        ])

    def forward(self, x, config):
        """
        config: dict with 'depth', 'width', 'kernel_sizes'
        """
        depth = config['depth']
        width = config['width']
        kernels = config['kernel_sizes']

        for i in range(depth):
            x = self.blocks[i](x, width=width, kernel=kernels[i])

        return self.classifier(x)

    def sample_subnet(self, constraint=None, elastic=('kernel', 'depth', 'width')):
        """
        Sample a random subnet, optionally respecting a hardware constraint.
        Dimensions not listed in `elastic` stay at their maximum setting.
        """
        if constraint:
            # Use a predictor to find an architecture meeting the constraint
            return self.constrained_sample(constraint)

        return {
            'depth': random.randint(2, self.max_depth) if 'depth' in elastic else self.max_depth,
            'width': random.choice([0.5, 0.75, 1.0]) if 'width' in elastic else 1.0,
            'kernel_sizes': [
                random.choice(self.kernel_sizes) if 'kernel' in elastic else max(self.kernel_sizes)
                for _ in range(self.max_depth)
            ]
        }


def progressive_shrinking_training(ofa_net, train_loader, epochs_per_stage):
    """
    Progressive shrinking: train the largest network first, then shrink.
    """
    optimizer = torch.optim.SGD(ofa_net.parameters(), lr=0.01)

    # Stage 1: Train the full network
    print("Stage 1: Full network")
    full_config = {'depth': 4, 'width': 1.0, 'kernel_sizes': [7, 7, 7, 7]}
    for epoch in range(epochs_per_stage):
        train_epoch(ofa_net, train_loader, optimizer, full_config)

    # Stage 2: Add elastic kernel
    print("Stage 2: Elastic kernel")
    for epoch in range(epochs_per_stage):
        config = ofa_net.sample_subnet(elastic=('kernel',))
        train_epoch(ofa_net, train_loader, optimizer, config)

    # Stage 3: Add elastic depth
    print("Stage 3: Elastic depth")
    for epoch in range(epochs_per_stage):
        config = ofa_net.sample_subnet(elastic=('kernel', 'depth'))
        train_epoch(ofa_net, train_loader, optimizer, config)

    # Stage 4: Add elastic width
    print("Stage 4: Elastic width (full elasticity)")
    for epoch in range(epochs_per_stage):
        config = ofa_net.sample_subnet(elastic=('kernel', 'depth', 'width'))
        train_epoch(ofa_net, train_loader, optimizer, config)
```

OFA enables deployment to thousands of different hardware targets (phones, tablets, edge devices) from a single trained network. Search takes minutes since no training is needed: just evaluate subnets on the validation set.
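A hedged sketch of that deployment-time search follows; the helpers `measure_latency` and `evaluate_subnet`, and the latency budget, are assumptions rather than part of the code above:

```python
# Pick the most accurate subnet that meets a latency budget, with no training.
LATENCY_BUDGET_MS = 25.0
best_cfg, best_acc = None, 0.0
for _ in range(500):
    cfg = ofa_net.sample_subnet()
    if measure_latency(cfg) > LATENCY_BUDGET_MS:      # reject configs over budget
        continue
    acc = evaluate_subnet(ofa_net, cfg, val_loader)   # forward passes only
    if acc > best_acc:
        best_cfg, best_acc = cfg, acc
```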
Modern efficient NAS doesn't just optimize accuracy—it optimizes for hardware constraints: latency, energy, memory, or FLOPs on specific target devices.
Multi-Objective Formulation:
$$\max_{a \in \mathcal{A}} \text{Accuracy}(a)$$ $$\text{s.t. } \text{Latency}(a) \leq L_{max}$$
Or Pareto optimization: $$\max_{a} (\text{Accuracy}(a), -\text{Latency}(a))$$
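One common way to fold the constraint into a single objective is a soft-constraint reward in the style of MnasNet; a minimal sketch, with illustrative target and exponent values:

```python
def multi_objective_reward(accuracy, latency_ms, target_ms=80.0, w=-0.07):
    # Accuracy scaled by a latency penalty: models slower than the target are
    # discounted, faster ones receive a mild bonus.
    return accuracy * (latency_ms / target_ms) ** w
```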
Approaches:
| Method | Approach | Advantages |
|---|---|---|
| ProxylessNAS | Differentiable latency with learnable thresholds | Direct hardware optimization |
| FBNet | Latency lookup table in search | Fast, accurate for specific HW |
| MnasNet | Multi-objective RL with Pareto reward | Flexible constraint handling |
| AttentionNAS | Attention-based architecture sampling | Efficient multi-objective |
Latency Modeling:
Direct hardware measurement is slow. Common alternatives: per-operator latency lookup tables built from offline measurements, learned latency predictors trained on measured architectures, and FLOPs or parameter counts as crude surrogates.
Key insight: FLOPs don't perfectly correlate with latency. Memory access patterns, parallelism, and hardware-specific optimizations matter.
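A hedged sketch of the lookup-table approach; the table entries, keyed here by operator type, input resolution, and channel count, are hypothetical measurements:

```python
# Per-operator latencies measured once on the target device (values made up)
OP_LATENCY_MS = {
    ('mbconv_k3', 112, 24): 1.8,
    ('mbconv_k5', 112, 24): 3.1,
    ('mbconv_k7', 56, 48): 2.4,
}

def estimate_latency(arch):
    # Assumes sequential execution: total latency is the sum over layers
    return sum(OP_LATENCY_MS[(layer.op, layer.resolution, layer.channels)]
               for layer in arch.layers)
```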
Several libraries and systems make NAS accessible to practitioners:
| System | Key Features | Best For |
|---|---|---|
| AutoKeras | User-friendly; automatic; Keras-based | Beginners; tabular/image/text |
| NNI (Microsoft) | Multiple NAS algorithms; distributed | Research; customizable search |
| Auto-PyTorch | Automated deep learning; HPO + NAS | End-to-end AutoML |
| NATS-Bench | Benchmarking; topology + size search | Fair algorithm comparison |
| AutoGluon | Production-ready; ensemble + NAS | Production deployment |
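To give a sense of what these tools look like in practice, here is a minimal AutoKeras-style flow; this is a sketch, and exact arguments vary between versions (`x_train`, `y_train`, and the test arrays are assumed to be loaded already):

```python
import autokeras as ak

# Search over image classifiers; each trial trains a candidate architecture
clf = ak.ImageClassifier(max_trials=10, overwrite=True)
clf.fit(x_train, y_train, epochs=5)

# Export the best architecture found as a regular Keras model
model = clf.export_model()
print(clf.evaluate(x_test, y_test))
```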
You have now mastered Neural Architecture Search: from foundational concepts through search space design, search strategies, weight sharing, and efficient methods. These techniques represent the frontier of automated machine learning, enabling the discovery of architectures that match or exceed human-designed networks.