For decades, designing neural network architectures was an exclusively human endeavor—an art form requiring deep intuition, extensive experimentation, and often years of specialized expertise. The breakthrough architectures that defined modern deep learning—AlexNet, ResNet, Inception, Transformer—were products of human creativity, trial and error, and domain insight.
Neural Architecture Search (NAS) fundamentally challenges this paradigm. Rather than relying on human intuition to craft architectures, NAS treats architecture design itself as a learning problem: given a dataset and computational budget, can we automatically discover neural network architectures that match or exceed human-designed ones?
The answer, remarkably, is yes. NAS algorithms have discovered architectures that outperform hand-crafted designs on image classification (NASNet, EfficientNet), object detection (NAS-FPN), language modeling (Evolved Transformer), and numerous other tasks. More importantly, these automatically discovered architectures often reveal design principles that humans hadn't considered.
By the end of this page, you will understand the complete NAS framework: how to define architecture search spaces, the major families of search strategies (reinforcement learning, evolutionary, gradient-based), the critical role of weight sharing and one-shot methods, and practical considerations for deploying NAS in real-world scenarios. You'll gain the conceptual foundation to both apply existing NAS methods and reason about their trade-offs.
Neural Architecture Search can be formally defined as a bi-level optimization problem. Understanding this formulation is essential because it reveals the computational challenges that drive all NAS research.
The Outer Optimization (Architecture Search):
We seek an architecture α from a search space A that minimizes validation loss:
α* = argmin_{α ∈ A} L_val(w*(α), α)
where L_val is the validation loss and w*(α) represents the optimal weights for architecture α.
The Inner Optimization (Weight Training):
For each candidate architecture α, we must find optimal weights:
w*(α) = argmin_w L_train(w, α)
This bi-level structure creates the fundamental computational challenge: evaluating even a single architecture requires training it to convergence (or near-convergence), which for modern networks can take hours to days on powerful hardware.
Early NAS methods evaluated thousands of candidate architectures, each trained from scratch. The original NASNet paper used 500 GPUs for 4 days—approximately 48,000 GPU-hours—to discover a single architecture. This computational expense motivated the development of weight-sharing methods that reduce search costs by 1000× or more.
The Three Pillars of NAS:
Every NAS method must address three interconnected components:
Search Space: What architectures are possible? The search space defines the set of candidate architectures, encoding prior knowledge about what constitutes a reasonable network.
Search Strategy: How do we explore the space? The search strategy determines how we sample, evaluate, and navigate among candidate architectures.
Performance Estimation Strategy: How do we evaluate candidates efficiently? Since full training is prohibitively expensive, we need methods to estimate architecture quality quickly.
These three components interact in complex ways. A larger search space increases expressivity but requires more sophisticated search strategies. Faster performance estimation enables exploring more candidates but may introduce ranking errors. The art of NAS lies in balancing these trade-offs.
| Component | Design Choices | Trade-offs |
|---|---|---|
| Search Space | Cell-based vs macro architecture; operation types; connectivity patterns | Larger spaces → more potential but harder search; smaller spaces → faster but may miss optimal designs |
| Search Strategy | Random search, RL, evolutionary, gradient-based, Bayesian | Sample efficiency vs computational cost; exploration vs exploitation balance |
| Performance Estimation | Full training, early stopping, weight sharing, zero-cost proxies | Speed vs accuracy; ranking correlation with true performance |
The search space is perhaps the most critical component of NAS. A well-designed search space encodes human knowledge about what makes good architectures while remaining expressive enough to discover novel designs. Poor search space design can doom a NAS method regardless of how sophisticated its search strategy is.
Macro Architecture Search:
The earliest NAS approaches searched for complete network architectures directly. This macro search paradigm offers maximum flexibility but suffers from combinatorial explosion. For a network with L layers, each choosing from O operations with C connectivity options, the search space contains roughly O^L × C^L architectures. For reasonable values (L=20, O=10, C=5), this yields approximately 10^34 possible architectures—far too large for any tractable search.
Cell-Based Search:
The breakthrough insight that made modern NAS practical was cell-based search: instead of searching for entire networks, search for a small computational cell that is then repeated to form the full architecture. This approach, pioneered by NASNet, reduces the search space exponentially while leveraging the human insight that successful architectures often use repeated motifs.
Defining the Operation Set:
The choice of candidate operations fundamentally shapes what the search can discover. Common operation sets include:
- Standard and depthwise-separable convolutions at several kernel sizes (e.g., 3×3, 5×5)
- Dilated convolutions for larger receptive fields
- Max and average pooling
- Identity (skip) connections
- A zero operation that effectively removes an edge
The operation set represents a trade-off between expressivity and search difficulty. Including more operations increases the chance of discovering novel designs but makes the search harder. Modern practice typically uses 7-15 carefully chosen operations that cover the space of commonly successful motifs.
Cell-based search spaces embed a strong inductive bias: the assumption that repeated local structures are sufficient for good performance. This bias is remarkably well-aligned with successful CNN and Transformer architectures, which heavily rely on repeated blocks. By restricting the search to cells, we trade some expressivity for massive computational savings—typically reducing search space size by a factor of 10^20 or more.
DAG-Based Cell Representation:
Within the cell-based paradigm, cells are typically represented as directed acyclic graphs (DAGs). The standard formulation uses:
- Two input nodes carrying the outputs of the previous two cells
- N intermediate nodes, each applying candidate operations to the outputs of earlier nodes and summing the results
- An output node formed by concatenating the intermediate nodes' outputs
For a cell with N intermediate nodes and O candidate operations, intermediate node j can draw input from the two cell inputs and all earlier intermediate nodes, giving N(N+3)/2 possible edges; the search space size is therefore roughly O^(N(N+3)/2), which for typical values (N=4, O=8) gives roughly 10^13 cells—large but tractable with modern search methods.
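As a sanity check on these counts, a few lines of arithmetic reproduce both estimates (the values mirror the examples above):

```python
# Back-of-envelope search space sizes for the running examples.
O_macro, L, C = 10, 20, 5
print(f"macro space: {O_macro**L * C**L:.1e} architectures")  # ~1.0e+34

O_cell, N = 8, 4
num_edges = sum(range(2, N + 2))  # node j has j incoming edges: 2+3+4+5 = 14
print(f"cell space:  {O_cell**num_edges:.1e} cells")          # ~4.4e+12
```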
Beyond Cell-Based Search:
Recent work has pushed beyond simple cell-based search in several directions:
- Hierarchical search spaces that compose learned motifs at multiple levels of granularity
- Chain-structured macro spaces with per-stage choices of block type, width, and depth (as in MnasNet and Once-For-All)
- Jointly searching cell structure and macro layout, allowing different stages of the network to use different cells
The original breakthrough in Neural Architecture Search came from framing the problem as reinforcement learning. In this formulation, a controller (typically an RNN or Transformer) generates architecture descriptions, which are then evaluated on held-out data. The controller is trained to maximize expected validation accuracy as a reward signal.
The RL Framework:
The controller outputs a sequence of tokens that define the architecture. For a cell-based search, this typically includes:
- Which previous nodes serve as inputs to each new node
- Which operation to apply to each selected input
- How to combine the results (in NASNet-style spaces, e.g., addition or concatenation)
```python
# Simplified RL-based NAS Controller (conceptual)
import torch
import torch.nn as nn

class NASController(nn.Module):
    """
    RNN-based controller that generates architecture descriptions.
    Each forward pass produces a complete architecture specification.
    """
    def __init__(self, num_operations, num_nodes, hidden_size=100):
        super().__init__()
        self.num_operations = num_operations
        self.num_nodes = num_nodes
        self.hidden_size = hidden_size
        # LSTM controller
        self.lstm = nn.LSTMCell(hidden_size, hidden_size)
        # Embeddings for previous decisions
        self.op_embedding = nn.Embedding(num_operations, hidden_size)
        self.node_embedding = nn.Embedding(num_nodes + 2, hidden_size)  # +2 for cell inputs
        # Output heads for the two decision types
        self.op_classifier = nn.Linear(hidden_size, num_operations)
        self.node_selector = nn.Linear(hidden_size, num_nodes + 2)

    def forward(self, batch_size=1):
        """Generate architecture decisions plus log probabilities and entropies."""
        architectures = []
        log_probs = []
        entropies = []
        # Initialize hidden state and start token
        h = torch.zeros(batch_size, self.hidden_size)
        c = torch.zeros(batch_size, self.hidden_size)
        inputs = torch.zeros(batch_size, self.hidden_size)

        for node_idx in range(self.num_nodes):
            # For each intermediate node, select 2 input nodes and 2 operations
            node_decisions = []
            for edge_idx in range(2):  # each node has 2 inputs
                # --- Select an input node ---
                h, c = self.lstm(inputs, (h, c))
                node_logits = self.node_selector(h)
                # Mask future nodes (can only connect to previous nodes)
                mask = torch.ones_like(node_logits) * -1e9
                mask[:, :node_idx + 2] = 0  # +2 for the 2 cell inputs
                node_logits = node_logits + mask
                node_dist = torch.distributions.Categorical(
                    torch.softmax(node_logits, dim=-1))
                selected_node = node_dist.sample()
                log_probs.append(node_dist.log_prob(selected_node))
                entropies.append(node_dist.entropy())

                # --- Select an operation for that input ---
                inputs = self.node_embedding(selected_node)
                h, c = self.lstm(inputs, (h, c))
                op_dist = torch.distributions.Categorical(
                    torch.softmax(self.op_classifier(h), dim=-1))
                selected_op = op_dist.sample()
                log_probs.append(op_dist.log_prob(selected_op))
                entropies.append(op_dist.entropy())

                # .item() assumes batch_size == 1
                node_decisions.append((selected_node.item(), selected_op.item()))
                inputs = self.op_embedding(selected_op)
            architectures.append(node_decisions)

        return architectures, torch.stack(log_probs), torch.stack(entropies)


def train_controller(controller, optimizer, reward, baseline, log_probs, entropies):
    """
    REINFORCE update for the controller, using the log probabilities recorded
    when the architecture was sampled (re-sampling here would decouple the
    gradient from the reward). Baseline subtraction reduces variance.
    """
    advantage = reward - baseline
    # Policy gradient loss with entropy regularization
    policy_loss = -(log_probs * advantage).mean()
    entropy_bonus = -0.01 * entropies.mean()  # encourage exploration
    loss = policy_loss + entropy_bonus
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Policy Gradient Optimization:
The controller is trained using policy gradient methods, typically REINFORCE with baseline subtraction. The objective is to maximize expected reward:
J(θ) = E_{α~π_θ}[R(α)]
where π_θ is the controller policy, α is the sampled architecture, and R(α) is its validation accuracy. The gradient is estimated as:
∇_θ J(θ) ≈ (1/n) Σ_i (R(α_i) - b) ∇_θ log π_θ(α_i)
where b is a baseline (often an exponential moving average of past rewards) used for variance reduction.
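Tying these pieces together, the outer search loop might look like the following sketch; `build_train_and_evaluate` is a hypothetical helper that decodes the sampled tokens, trains the resulting network, and returns its validation accuracy:

```python
# Hypothetical outer loop for RL-based NAS with an EMA baseline.
import torch

controller = NASController(num_operations=8, num_nodes=4)
optimizer = torch.optim.Adam(controller.parameters(), lr=3.5e-4)

baseline, ema_decay = 0.0, 0.95
for step in range(1000):
    # Sample one architecture and keep its log-probs for the update
    arch_tokens, log_probs, entropies = controller(batch_size=1)
    reward = build_train_and_evaluate(arch_tokens)  # assumed: returns val accuracy
    # Exponential moving average of past rewards as the baseline b
    baseline = ema_decay * baseline + (1 - ema_decay) * reward
    train_controller(controller, optimizer, reward, baseline,
                     log_probs, entropies)
```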
Challenges with RL-Based NAS:
While pioneering, RL-based NAS faces significant challenges:
- Extreme sample inefficiency: every reward requires training a candidate network, so a search consumes thousands of full training runs
- High-variance REINFORCE gradients, only partially tamed by baseline subtraction
- A sparse, delayed reward signal: the controller receives feedback only after each candidate finishes training
The NAS paper by Zoph & Le (2017) demonstrated that automated architecture search could match human-designed architectures, sparking the modern NAS field. While later methods (evolutionary, gradient-based) have largely superseded RL-NAS due to better sample efficiency, the RL formulation remains influential and continues to be refined with advances like PPO, actor-critic methods, and improved exploration strategies.
Evolutionary algorithms (EAs) offer a natural framework for architecture search: maintain a population of architectures, evaluate their fitness, select the best, and generate new candidates through mutation and crossover. This paradigm has proven remarkably effective, often matching or exceeding RL-based methods while being conceptually simpler.
The Evolutionary NAS Framework:
The basic loop instantiates the standard evolutionary recipe for architectures:
1. Initialize a population of (typically random) architectures and evaluate each one's fitness (validation accuracy)
2. Select parents, usually via tournament selection
3. Create offspring by mutating (and occasionally recombining) parents
4. Train and evaluate the offspring, then update the population and repeat
Regularized Evolution (AmoebaNet):
One of the most successful evolutionary NAS methods is Regularized Evolution, which introduced a simple but powerful modification: instead of removing the worst individual from the population, remove the oldest. This aging mechanism:
- Prevents a few early high-fitness individuals from dominating the population indefinitely
- Acts as a regularizer: lineages survive only by repeatedly producing good offspring
- Maintains exploration throughout the entire search
The algorithm maintains a queue of individuals. Parents are selected via tournament selection (randomly sample k individuals, choose the fittest). Each offspring is created by mutating a parent, trained, evaluated, and added to the queue. The oldest individual is then removed.
Tournament Selection:
Tournament selection with aging offers an excellent exploration-exploitation balance:
t = random_sample(population, k) # Sample k individuals
parent = argmax_fitness(t) # Select fittest in tournament
offspring = mutate(parent) # Create offspring
population.append(offspring) # Add to population
population.pop_oldest() # Remove oldest individual
Small tournament sizes (k=2-5) favor exploration; larger sizes favor exploitation of known good architectures.
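Putting tournament selection and aging together, a minimal runnable sketch of the full loop follows; `random_architecture`, `mutate`, and `train_and_evaluate` are hypothetical stand-ins for a real search space and training pipeline:

```python
# Regularized (aging) evolution: evict the OLDEST individual, not the worst.
import random
from collections import deque

def regularized_evolution(random_architecture, mutate, train_and_evaluate,
                          population_size=100, cycles=1000, k=5):
    population = deque()  # oldest individual sits at the left end
    history = []          # every individual ever evaluated

    # Seed the population with random architectures
    while len(population) < population_size:
        arch = random_architecture()
        ind = (arch, train_and_evaluate(arch))
        population.append(ind)
        history.append(ind)

    for _ in range(cycles):
        tournament = random.sample(list(population), k)
        parent = max(tournament, key=lambda ind: ind[1])  # fittest in tournament
        child_arch = mutate(parent[0])
        child = (child_arch, train_and_evaluate(child_arch))
        population.append(child)   # add offspring
        population.popleft()       # aging: remove the oldest individual
        history.append(child)

    # Return the best architecture ever seen, not just the final population
    return max(history, key=lambda ind: ind[1])
```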
Unlike biological evolution, neural architectures don't have 'building blocks' that combine well. Swapping half a cell from one architecture with half from another typically produces a poor hybrid, because the operations were optimized to work together. Most successful evolutionary NAS methods use mutation only, or use crossover at coarser granularity (e.g., swapping entire cells rather than operations).
The computational expense of RL and evolutionary methods stems from treating architecture selection as a discrete optimization problem—each candidate must be trained separately. Differentiable Architecture Search (DARTS) revolutionized NAS by relaxing this discrete decision, enabling gradient-based optimization of architectures.
The Key Insight:
Instead of selecting a single operation for each edge, DARTS maintains a mixture of all candidate operations, weighted by architecture parameters α. The output of an edge is:
output(x) = Σ_{o ∈ O} [exp(α_o) / Σ_{o' ∈ O} exp(α_{o'})] · o(x)
where the sum runs over all operations o in the candidate set. This softmax relaxation makes the architecture continuously differentiable with respect to α, enabling gradient descent.
Bi-Level Optimization:
DARTS solves the bi-level optimization problem by alternating:
1. Update the network weights w by descending ∇_w L_train(w, α) on a training batch
2. Update the architecture parameters α by descending ∇_α L_val(w, α) on a validation batch
This alternating optimization approximates the true bi-level problem and can be executed efficiently on a single GPU.
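In code, a first-order version of this alternation might look like the sketch below; `model`, `arch_params`, and the two data loaders are assumed to exist, and `weight_parameters()` is a conventional DARTS-style helper that returns the network weights excluding the architecture parameters:

```python
# Alternating updates: alpha on validation batches, w on training batches.
import torch
import torch.nn.functional as F

w_optimizer = torch.optim.SGD(model.weight_parameters(), lr=0.025, momentum=0.9)
a_optimizer = torch.optim.Adam(arch_params, lr=3e-4, weight_decay=1e-3)

for (x_train, y_train), (x_val, y_val) in zip(train_loader, val_loader):
    # Step 1: update architecture parameters alpha on VALIDATION data
    a_optimizer.zero_grad()
    F.cross_entropy(model(x_val), y_val).backward()
    a_optimizer.step()

    # Step 2: update network weights w on TRAINING data
    w_optimizer.zero_grad()
    F.cross_entropy(model(x_train), y_train).backward()
    w_optimizer.step()
```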
```python
# DARTS-style Differentiable Architecture Search (conceptual)
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOperation(nn.Module):
    """
    A mixed operation that combines multiple candidate operations
    with learned architecture weights (softmax over alphas).
    """
    def __init__(self, channels, operations_dict):
        super().__init__()
        self.operations = nn.ModuleList([
            op_class(channels) for op_class in operations_dict.values()
        ])
        self.n_ops = len(self.operations)

    def forward(self, x, weights):
        """
        Forward pass with architecture weights.
        weights: softmax-normalized architecture parameters for this edge
        """
        return sum(w * op(x) for w, op in zip(weights, self.operations))


class DARTSCell(nn.Module):
    """A DARTS cell with continuous architecture parameters."""
    def __init__(self, channels, num_nodes, operations):
        super().__init__()
        self.num_nodes = num_nodes
        # Create mixed operations for each possible edge
        # Edge (i, j) connects node i to node j where i < j
        self.edges = nn.ModuleDict()
        for i in range(num_nodes + 2):           # +2 for input nodes
            for j in range(2, num_nodes + 2):    # intermediate nodes only
                if i < j:
                    edge_key = f"edge_{i}_{j}"
                    self.edges[edge_key] = MixedOperation(channels, operations)
        # Architecture parameters (one per edge per operation)
        self.num_edges = len(self.edges)
        self.n_ops = len(operations)

    def forward(self, s0, s1, arch_weights):
        """
        Forward pass through the cell.
        s0, s1: inputs from the previous two cells
        arch_weights: architecture parameters of shape (num_edges, n_ops)
        """
        states = [s0, s1]
        edge_idx = 0
        for j in range(2, self.num_nodes + 2):
            # Sum contributions from all previous nodes
            node_input = 0
            for i in range(j):
                edge_key = f"edge_{i}_{j}"
                if edge_key in self.edges:
                    weights = F.softmax(arch_weights[edge_idx], dim=0)
                    node_input = node_input + self.edges[edge_key](states[i], weights)
                    edge_idx += 1
            states.append(node_input)
        # Concatenate intermediate nodes as output
        return torch.cat(states[2:], dim=1)


class DARTSArchitect:
    """Handles architecture parameter optimization using validation loss."""
    def __init__(self, model, arch_params, arch_lr=3e-4, arch_weight_decay=1e-3):
        self.model = model
        self.arch_params = arch_params
        self.optimizer = torch.optim.Adam(
            arch_params, lr=arch_lr, weight_decay=arch_weight_decay
        )

    def step(self, val_data, val_target, network_optimizer):
        """
        Single step of architecture optimization.
        Uses the first-order approximation by default.
        """
        self.optimizer.zero_grad()
        # Forward pass on validation data
        output = self.model(val_data)
        loss = F.cross_entropy(output, val_target)
        # Backward pass updates architecture parameters
        loss.backward()
        self.optimizer.step()
        return loss.item()


def discretize_architecture(arch_weights, topk=2):
    """
    Convert continuous architecture weights to a discrete architecture.
    For each node, keep the top-k operations with the highest weights.
    """
    discrete_arch = []
    for node_weights in arch_weights:
        # Get indices of the top-k operations
        _, top_indices = torch.topk(node_weights, topk)
        discrete_arch.append(top_indices.tolist())
    return discrete_arch
```

The Discretization Gap:
After DARTS optimization, the continuous architecture must be discretized to obtain a final architecture (typically by keeping the highest-weighted operation per edge). This creates a discretization gap: the continuous architecture's performance doesn't perfectly predict the discrete architecture's performance.
This gap can be significant. An operation might have high weight during continuous search because it works well in combination with others, but fail when used alone. Research has identified several causes:
- The softmax weights often remain diffuse, so the retained argmax operation is not clearly dominant over the discarded ones
- Operations co-adapt during the continuous phase and lose their supporting context after pruning
- Parameter-free operations such as skip connections can accumulate weight for optimization reasons rather than architectural merit (see the collapse discussion below)
Second-Order Optimization:
The original DARTS used first-order approximation for computational efficiency. Second-order methods (computing the Hessian ∂²L/∂α∂w) can improve architecture quality but are expensive. Approximations like computing only diagonal Hessian elements offer a middle ground.
A well-known failure mode of DARTS is 'collapse' to architectures dominated by skip connections and parameter-free operations. This occurs because these operations have the smallest validation loss gradient—not because they lead to the best final architectures. Numerous DARTS variants (DARTS+, P-DARTS, PC-DARTS, R-DARTS) address this through regularization, progressive pruning, or partial channel connections.
The fundamental bottleneck in NAS is the cost of evaluating architectures—each evaluation traditionally required training from scratch. Weight sharing (also called one-shot NAS) addresses this by training a single supernet that contains all possible architectures in the search space as subnetworks. Architectures can then be evaluated by inheriting weights from the supernet, eliminating the need for retraining.
The Supernet Paradigm:
A supernet is an over-parameterized network where:
- Every architecture in the search space is a subnetwork (a path or subgraph) of the supernet
- Subnetworks share weights wherever their operations overlap
- The supernet is trained once, and candidate architectures inherit its weights for evaluation
This reduces the cost of evaluating an architecture from hours (full training) to seconds (forward pass on validation data), enabling exploration of thousands to millions of candidates.
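Concretely, evaluation reduces to forward passes with inherited weights; in the sketch below, `supernet(x, arch)` is assumed to route computation through the sampled architecture's path:

```python
# Evaluating a candidate with inherited supernet weights: no retraining.
import torch

def estimate_accuracy(supernet, arch, val_loader):
    supernet.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for x, y in val_loader:
            pred = supernet(x, arch).argmax(dim=1)
            correct += (pred == y).sum().item()
            total += y.numel()
    return correct / total
```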
| Method | Architectures Evaluated | Cost per Evaluation | Total Search Cost (GPU-days) |
|---|---|---|---|
| NASNet (RL) | ~20,000 | Full training (~4 GPU-hours) | ~3,000-3,500 |
| AmoebaNet (Evolution) | ~27,000 | Full training (~4 GPU-hours) | ~3,150 |
| DARTS (Gradient) | 1 (continuous) | Search + Final train | ~1-4 |
| One-Shot (Weight Sharing) | ~10,000+ | Forward pass (~seconds) | ~0.5-2 |
Training Strategies for Supernets:
How the supernet is trained dramatically affects the quality of inherited weights. Common strategies include:
1. Path Sampling (Single-Path One-Shot)
For each training step, sample a single path (architecture) through the supernet and only update its weights. This ensures each subnetwork receives dedicated training signal:
for x, y in dataloader:
    arch = sample_architecture()   # Random architecture
    output = supernet(x, arch)     # Forward only the sampled path
    loss = criterion(output, y)
    optimizer.zero_grad()
    loss.backward()                # Gradients flow only through the sampled path
    optimizer.step()
2. Uniform Sampling
Operations are sampled uniformly, ensuring all parts of the supernet receive equal training. This prevents popular operations from dominating gradient updates.
3. Fairness Constraints
Explicitly enforce that each operation receives similar training by tracking operation counts and adjusting sampling probabilities.
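A small sketch of such fairness-adjusted sampling; the operation names and the inverse-count weighting are illustrative choices, not a fixed recipe:

```python
# Fairness-aware operation sampling: operations that have been trained
# less often are sampled with proportionally higher probability.
import random

candidate_ops = ["conv3x3", "conv5x5", "max_pool", "skip_connect"]
op_counts = {op: 0 for op in candidate_ops}

def sample_op_fairly():
    # Weight each operation inversely to how often it has been sampled
    weights = [1.0 / (1 + op_counts[op]) for op in candidate_ops]
    op = random.choices(candidate_ops, weights=weights, k=1)[0]
    op_counts[op] += 1
    return op
```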
Weight-sharing methods don't need supernet performance to match stand-alone training—they only need the ranking of architectures to be preserved. An architecture that performs well with inherited weights should also perform well when fully trained. Research shows this ranking correlation is typically 0.6-0.9, sufficient for effective architecture search.
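One way to check this in practice is to fully train a handful of architectures and correlate their stand-alone accuracies with the supernet's estimates; the numbers below are purely illustrative:

```python
# Measuring the ranking fidelity of a weight-sharing supernet.
from scipy.stats import kendalltau, spearmanr

supernet_acc   = [0.71, 0.68, 0.74, 0.65, 0.72]  # inherited-weight estimates
standalone_acc = [0.93, 0.91, 0.94, 0.90, 0.92]  # ground truth (full training)

rho, _ = spearmanr(supernet_acc, standalone_acc)
tau, _ = kendalltau(supernet_acc, standalone_acc)
print(f"Spearman rho = {rho:.2f}, Kendall tau = {tau:.2f}")
```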
Challenges with Weight Sharing:
Weight Coupling: Operations sharing the same feature maps may learn coupled representations that don't transfer well to stand-alone training
Unfair Comparison: Some operations (e.g., skip connections) may be easier to train than others (e.g., large convolutions), biasing evaluations
Capacity Interference: With limited supernet capacity, operations compete for representational power, and some may not be adequately trained
Memory Requirements: The supernet must simultaneously hold all candidate operations, requiring significant GPU memory
Improving Weight Sharing Quality:
Multiple techniques improve the reliability of weight-sharing evaluations:
- Strict single-path sampling, so each training step updates exactly one subnetwork
- Fairness-aware sampling so every operation receives comparable training (see the sketch above)
- Briefly fine-tuning inherited weights before measuring validation accuracy
- Few-shot NAS: splitting the search space across several smaller supernets to reduce interference
Beyond weight sharing, researchers have developed numerous strategies to estimate architecture quality without full training. These performance estimation strategies trade off estimation accuracy for computational speed, and their effectiveness varies by domain and search space.
Lower-Fidelity Proxies:
The simplest approach is to lower the fidelity of training itself:
- Fewer epochs: stop early and rank candidates by intermediate accuracy
- Smaller datasets: train on a subset of the data or at reduced input resolution
- Reduced model sizes: search with fewer channels or layers, then scale the final architecture back up
These proxies are widely used but can introduce ranking errors—architectures that learn quickly may not ultimately be the best.
Learning Curve Prediction:
Rather than training to completion, observe early training behavior and predict final performance. This approach includes:
- Fitting parametric curves (e.g., power laws) to partial learning curves and extrapolating to convergence
- Training a surrogate regressor on (architecture, partial curve) → final accuracy pairs from past runs
- Terminating unpromising runs early based on the predicted final performance
The Speed-Accuracy Trade-off:
All performance estimation strategies trade accuracy for speed. The optimal strategy depends on:
- The total search budget relative to the cost of one full training run
- How strongly the proxy's ranking correlates with true final performance on your search space
- Whether you need the single best architecture or merely a reliably good one
Recent work shows that combining multiple inexpensive proxies often outperforms any single proxy. For example, using an ensemble of zero-cost proxies followed by short-duration training for top candidates provides excellent speed-accuracy balance.
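A sketch of that two-stage pattern; `score_zero_cost()` and `short_train_eval()` are hypothetical helpers standing in for a zero-cost proxy and a brief (e.g., few-epoch) training run:

```python
# Two-stage performance estimation: filter cheaply, then spend real
# compute only on the survivors.
def two_stage_search(candidates, keep_top=10):
    # Stage 1: zero-cost proxy, roughly one forward/backward pass each
    ranked = sorted(candidates, key=score_zero_cost, reverse=True)
    survivors = ranked[:keep_top]
    # Stage 2: short training runs on the survivors only
    return max(survivors, key=short_train_eval)
```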
Deploying NAS in real-world scenarios requires careful attention to practical concerns that go beyond algorithmic sophistication. The gap between research papers and production deployment can be substantial.
Reproducibility and Variance:
NAS results are notoriously difficult to reproduce. Sources of variance include:
- Random seeds affecting weight initialization, data ordering, and architecture sampling
- Hyperparameters of the search method itself, which are rarely tuned as carefully as those of the final model
- Nondeterminism in GPU kernels and data pipelines
- The stochastic gap between proxy (e.g., supernet) rankings and stand-alone performance
Best practices:
- Run searches with several seeds and report mean and variance, not just the best run
- Compare against a random-search baseline within the same search space
- Report the full search cost, including failed runs and tuning of the search method itself
- Release code, seeds, and the exact search space definition
Hardware-Aware NAS:
Modern NAS increasingly incorporates hardware constraints directly into the search objective. This hardware-aware NAS optimizes not just accuracy but also:
- Measured latency on the target device (often via lookup tables or learned latency predictors)
- Energy consumption and memory footprint
- Model size and FLOPs
The objective becomes multi-objective:
max_α [Accuracy(α) - λ × Latency(α)]
where λ controls the trade-off. Methods like MnasNet, Once-For-All, and EfficientNet explicitly incorporate hardware metrics.
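As a toy illustration of this scalarized objective (all accuracy and latency values invented), the score simply penalizes measured latency:

```python
# Scalarized hardware-aware objective from the formula above.
def hw_aware_score(accuracy, latency_ms, lam=0.002):
    return accuracy - lam * latency_ms

candidates = [("arch_a", 0.942, 18.0), ("arch_b", 0.951, 35.0), ("arch_c", 0.938, 11.0)]
best = max(candidates, key=lambda c: hw_aware_score(c[1], c[2]))
print(f"selected: {best[0]}")  # arch_c wins under this lambda despite lower accuracy
```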
Transferability:
Architectures discovered on one task/dataset often transfer to others, but with caveats:
- Cells found on small proxy datasets (e.g., CIFAR-10) usually transfer to larger ones (e.g., ImageNet) when scaled in depth and width, as NASNet demonstrated
- Transfer degrades across modalities: a cell tuned for image classification is not automatically good for detection or language tasks
- Hardware-aware architectures are tuned to a specific device and may lose their latency advantage elsewhere
Published NAS papers often underreport total computational cost. The true cost includes: hyperparameter tuning of the search method itself, preliminary experiments to design the search space, failed search runs, and extensive ablation studies. Industrial deployments should budget 5-10× the 'official' search cost for end-to-end development.
Neural Architecture Search represents a paradigm shift in deep learning: from human-designed architectures to automated discovery. While the field has matured significantly since its inception, the fundamental trade-offs—between search space expressivity, sample efficiency, and evaluation accuracy—remain central challenges.
Looking Ahead:
The next pages in this module will explore related advanced HPO topics, beginning with meta-learning for hyperparameter optimization.
NAS represents just one facet of the broader AutoML vision: fully automated machine learning pipelines that adapt to data and constraints with minimal human intervention.
You now understand the complete framework of Neural Architecture Search: the problem formulation, search space design, major search strategies (RL, evolutionary, gradient-based), weight sharing methods, and practical considerations. Next, we'll explore how meta-learning can accelerate hyperparameter optimization by learning from past experiments.