Every deep learning practitioner has faced this fundamental challenge: given a problem, what neural network architecture should I use?
This question has no simple answer. The space of possible architectures is effectively infinite—the number of layers, neurons per layer, activation functions, skip connections, normalization strategies, pooling operations, and countless other design choices combine combinatorially. Expert practitioners spend years developing intuition about which architectural patterns work for which problems, and even then, their choices are often sub-optimal.
Consider what designing a neural network truly involves:
Historically, these decisions were made through a combination of theoretical intuition, empirical trial-and-error, and creative inspiration. LeNet's architecture emerged from an understanding of visual processing. ResNet's skip connections came from observing gradient flow problems. The Transformer attention mechanism was designed based on theoretical analysis of sequence modeling.
But what if machines could design these architectures themselves?
By the end of this page, you will understand the fundamental motivation for Neural Architecture Search (NAS), its core components and taxonomy, the historical evolution of the field, and the key trade-offs that make NAS both powerful and challenging. This foundation prepares you for the technical depth in subsequent pages.
Neural Architecture Search (NAS) is the process of automating the design of neural network architectures. Rather than having human experts manually craft network structures, NAS algorithms explore the space of possible architectures to discover ones that perform well on given tasks.
At its core, NAS treats architecture design as an optimization problem:
$$\text{architecture}^* = \arg\max_{a \in \mathcal{A}} \text{Performance}(a, D)$$
Where $\mathcal{A}$ is the search space of possible architectures, $D$ is the dataset, and Performance measures how well architecture $a$ performs when trained on $D$.
This formulation reveals the three fundamental components of any NAS system:
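The optimization view above can be sketched end-to-end in a few lines. The toy search space and scoring function below are purely illustrative stand-ins for real architectures and real training — in practice, `performance` would train a network on $D$ and return validation accuracy:

```python
import itertools

# Toy search space A: every combination of depth, width, and activation
# (all choices here are illustrative, not from any real NAS library)
search_space = list(itertools.product(
    [2, 4, 8],           # number of layers
    [16, 32, 64],        # neurons per layer
    ['relu', 'tanh'],    # activation function
))

def performance(arch):
    """Stand-in for 'train arch on D, measure validation accuracy'.
    A real evaluation would train a network; this is a fixed toy score."""
    depth, width, act = arch
    return 0.5 + 0.01 * depth + 0.001 * width + (0.05 if act == 'relu' else 0.0)

# architecture* = argmax_{a in A} Performance(a, D)
best = max(search_space, key=performance)
print(best)  # → (8, 64, 'relu'): the toy optimum
```

Even this toy example exposes the three pillars: a search space (`search_space`), a search strategy (exhaustive `max`), and a performance estimator (`performance`). Real NAS differs only in that none of the three can be this cheap.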
Why NAS is fundamentally difficult:
The challenge of NAS stems from several compounding factors:
Exponentially large search space: Even a modest search space often contains $10^{10}$ to $10^{20}$ possible architectures. Exhaustive search is impossible.
Expensive evaluation: Training a neural network to convergence can take hours to days on modern hardware. Evaluating even a fraction of the search space is computationally prohibitive.
Non-differentiable and discrete: Unlike weight optimization, architecture choices are discrete (number of layers, connection patterns), making gradient-based optimization non-trivial.
Multi-objective nature: Good architectures must balance accuracy, inference speed, memory consumption, training efficiency, and often deployment constraints.
Generalization uncertainty: An architecture found to work well on one dataset may not transfer to others, and overfitting the search to a validation set is a real risk.
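The scale of the first factor can be made concrete with a quick back-of-the-envelope count (the layer and operation counts below are illustrative, not from any specific search space):

```python
# Illustrative combinatorial count: a 20-layer chain-structured network
# where each layer independently picks one of 8 operations and one of
# 4 widths yields (8 * 4)^20 candidate architectures.
ops_per_layer = 8
widths_per_layer = 4
layers = 20

num_architectures = (ops_per_layer * widths_per_layer) ** layers
print(f"{num_architectures:.2e}")  # ~1.27e30 — beyond even the 10^20 figure
```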
These challenges explain why NAS remained largely theoretical until computational resources became sufficient to make it practical—and why efficient NAS methods continue to be an active research area.
While related, NAS differs from hyperparameter optimization (HPO). HPO optimizes continuous or categorical parameters of a fixed architecture (learning rate, regularization strength), while NAS changes the architecture structure itself. NAS can be viewed as a form of HPO where the 'hyperparameters' define the network topology. The Combined Algorithm Selection and Hyperparameter optimization (CASH) problem unifies both perspectives.
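To make the CASH perspective concrete, a joint search space can simply mix both kinds of dimensions. All names and values below are illustrative, not from any real AutoML library:

```python
# A joint (CASH-style) search space: architectural (NAS) choices and
# classic hyperparameters (HPO) side by side.
joint_search_space = {
    # NAS-style (topology) dimensions
    'num_layers':    [2, 4, 8, 16],
    'layer_type':    ['conv_3x3', 'conv_5x5', 'depthwise_sep'],
    'skip_connect':  [True, False],
    # HPO-style (fixed-topology) dimensions
    'learning_rate': [1e-4, 1e-3, 1e-2],
    'weight_decay':  [0.0, 1e-4],
}

# Size of the joint space: product of the per-dimension choices
size = 1
for choices in joint_search_space.values():
    size *= len(choices)
print(size)  # → 4 * 3 * 2 * 3 * 2 = 144
```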
Understanding NAS requires appreciating its diverse landscape. Different NAS methods make fundamentally different choices across each pillar, leading to vastly different computational costs, searched architectures, and practical applicability.
Search Space Taxonomy:
Search spaces can be categorized by their granularity and structure:
| Search Space Type | Description | Examples | Trade-offs |
|---|---|---|---|
| Macro Search | Searches over high-level architecture structure: number of layers, layer types, global connections | NASNet, AmoebaNet chain-structured search | Flexible but expensive; large space to explore |
| Cell-based Search | Searches for a repeating cell/block structure that is stacked to form the full network | NASNet cells, DARTS cells, EfficientNet | Reduces search space; assumes repetitive structure |
| Hierarchical Search | Multi-level search: cells at low level, stacking patterns at high level | Hierarchical NAS, Auto-DeepLab | Balances flexibility and tractability |
| Unbounded Search | No fixed template; discovers completely novel topologies | Some evolutionary methods, NEAT | Maximum flexibility; hardest to optimize |
Search Strategy Taxonomy:
Search strategies define how we navigate the architecture space:
| Strategy | Core Idea | Advantages | Disadvantages |
|---|---|---|---|
| Random Search | Sample architectures uniformly at random | Simple baseline; surprisingly competitive | No learning; wasteful sampling |
| Reinforcement Learning | Train a controller to generate architectures; reward = validation performance | Can learn complex patterns; flexible | Sample inefficient; high variance |
| Evolutionary Algorithms | Maintain population; mutate and select based on fitness | Parallel exploration; robust to local optima | Requires many evaluations; tuning sensitivity |
| Gradient-based (DARTS) | Continuous relaxation of discrete choices; use gradients | Extremely efficient; orders of magnitude faster | Can collapse; relaxation gap issues |
| Bayesian Optimization | Build surrogate model of architecture-performance mapping | Sample efficient; principled uncertainty | Scaling to high dimensions; surrogate design |
Performance Estimation Taxonomy:
Since fully training each candidate is expensive, various acceleration strategies exist:
Every performance estimation strategy trades fidelity for efficiency. The ideal method provides estimates that correctly rank architectures (high rank correlation with true performance) while being fast enough to evaluate many candidates. This trade-off is central to practical NAS—and choosing the right acceleration strategy often matters more than the search algorithm itself.
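Rank correlation between a cheap proxy and full training can be checked directly. The sketch below computes Spearman's rho from scratch; the architecture scores are made up for illustration:

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation (assumes no ties, for brevity)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical scores for 5 architectures
true_acc  = [0.91, 0.88, 0.93, 0.85, 0.90]  # full training
proxy_acc = [0.70, 0.66, 0.71, 0.60, 0.69]  # e.g., a 10-epoch proxy

# The proxy underestimates accuracy everywhere, yet ranks all five
# architectures identically — which is all a search algorithm needs.
print(spearman_rho(true_acc, proxy_acc))  # → 1.0
```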
Neural Architecture Search has evolved dramatically over the past decade, transitioning from computationally impractical academic curiosity to a standard tool in the deep learning practitioner's arsenal. Understanding this evolution illuminates both the core challenges and the ingenious solutions developed to address them.
The Pre-NAS Era (1980s-2010s):
Before modern NAS, architecture design was entirely manual:
The First Wave: Brute-Force NAS (2016-2017):
The modern NAS era began when Google's research demonstrated that architectural search could produce networks rivaling or exceeding human-designed ones:
These seminal works proved that automated architecture design could compete with expert-designed networks. However, the computational cost was staggering:
Such costs restricted NAS to organizations with exceptional computational resources.
The Second Wave: Efficient NAS (2018-2019):
Recognizing the computational barrier, researchers developed dramatically more efficient methods:
| Method | Year | Key Innovation | Approximate Cost |
|---|---|---|---|
| ENAS | 2018 | Weight sharing across architectures in the search space | 0.5 GPU-days |
| DARTS | 2019 | Continuous relaxation enabling gradient-based search | 1 GPU-day |
| ProxylessNAS | 2019 | Direct search on target task/hardware; memory-efficient | 4 GPU-days on ImageNet |
| Once-for-All | 2020 | Train one network supporting many sub-networks | Search: minutes; Training: once |
This represents a 10,000x to 100,000x reduction in computational cost—transforming NAS from industrial-scale research to something achievable on a single GPU.
The Third Wave: Practical NAS (2020-Present):
Current NAS research focuses on:
The field has matured to the point where NAS-discovered architectures underlie many production systems: EfficientNet, which emerged from NAS-like design processes, powers image classification across Google products; MobileNetV3 uses NAS for mobile optimization; and various transformer architectures incorporate NAS-discovered design elements.
What once required Google-scale resources can now be done on a single workstation. Libraries like AutoKeras, NNI, and AutoGluon provide practical NAS capabilities to individual practitioners. The computational barrier that once restricted NAS to elite research labs has largely fallen.
NAS has produced some of the most important neural network architectures of the last five years. Understanding these success stories illustrates both the power of automated design and the role of careful search space engineering.
EfficientNet: Compound Scaling from NAS
EfficientNet (Tan & Le, 2019) used NAS to discover an efficient base architecture, then introduced compound scaling—simultaneously scaling depth, width, and resolution with fixed ratios. The result was a family of models that achieved state-of-the-art accuracy at every computational budget:
The key insight: NAS found a base structure optimized for efficiency, and systematic scaling amplified that efficiency advantage.
```python
# EfficientNet Compound Scaling Principle
# NAS found optimal base architecture (B0)
# Compound scaling extends it systematically

def efficientnet_compound_scaling(phi):
    """
    Compound scaling coefficients discovered through grid search.
    phi controls the overall scale factor:
      EfficientNet-B0: phi=0
      EfficientNet-B1: phi=0.5
      EfficientNet-B7: phi=6.0
    """
    alpha = 1.2   # depth coefficient (NAS-discovered)
    beta = 1.1    # width coefficient (NAS-discovered)
    gamma = 1.15  # resolution coefficient (NAS-discovered)

    # Compound scaling formula:
    #   depth      = alpha^phi
    #   width      = beta^phi
    #   resolution = gamma^phi
    # Constraint: alpha * beta^2 * gamma^2 ≈ 2
    depth_mult = alpha ** phi
    width_mult = beta ** phi
    resolution_mult = gamma ** phi

    return {
        'depth_multiplier': depth_mult,
        'width_multiplier': width_mult,
        'resolution': int(224 * resolution_mult)
    }

# EfficientNet variants
efficientnet_configs = {
    'B0': efficientnet_compound_scaling(0),
    'B1': efficientnet_compound_scaling(0.5),
    'B2': efficientnet_compound_scaling(1.0),
    'B3': efficientnet_compound_scaling(2.0),
    'B4': efficientnet_compound_scaling(3.0),
    'B5': efficientnet_compound_scaling(4.0),
    'B6': efficientnet_compound_scaling(5.0),
    'B7': efficientnet_compound_scaling(6.0),
}

for name, config in efficientnet_configs.items():
    print(f"EfficientNet-{name}: {config}")
```

MobileNetV3: Hardware-Aware NAS
MobileNetV3 combined NAS with human design expertise for mobile deployment:
The result: 3.2% better accuracy than MobileNetV2 with 25% lower latency—demonstrating that NAS and human expertise can be complementary.
NAS for Transformers:
Recent work has applied NAS to transformer architectures:
These applications show NAS extending beyond CNNs to sequence models and other domains.
Despite NAS's impressive achievements, human-designed architectures remain dominant in many applications. Understanding why illuminates both the current limitations of NAS and the directions for improvement.
The Search Space Bootstrap Problem:
NAS still requires humans to design the search space. The choices of:
...profoundly affect what architectures can be discovered. In some sense, NAS finds the best architecture within a space that humans have defined. The creative "leaps" like attention mechanisms or residual connections that fundamentally changed deep learning have come from human insight, not search.
The Interpretability Gap:
Human-designed architectures often embody interpretable principles:
NAS-discovered architectures can be opaque. Why does this particular combination of operations work? The lack of interpretability makes it harder to understand failure modes, extend designs, or debug issues.
The Reproducibility Concern:
NAS research has faced reproducibility challenges:
Recent work has established more rigorous NAS benchmarks (NAS-Bench-101, NAS-Bench-201, NAS-Bench-360) that enable fair comparison and reproducibility.
The Efficiency Paradox:
Some of the most efficient NAS-discovered architectures could have been found by well-tuned random search, raising questions about whether the sophisticated search algorithms provide value beyond computing more evaluations. This has led to important work on understanding when and why NAS outperforms baselines.
The trajectory suggests a future where NAS and human design are deeply integrated. Humans contribute high-level structural innovations and interpretable principles; NAS handles detailed optimization and hardware-specific tuning. The most successful architectures—like EfficientNet—already blend both paradigms.
To understand NAS algorithms deeply, we need a formal framework that precisely defines the problem. Let us develop this formalization step by step.
Architecture Representation:
An architecture $a$ can be represented as a directed acyclic graph (DAG):
For a cell-based search space, we define:
An architecture is then specified by the adjacency structure and operation assignments.
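As a concrete illustration of this DAG representation (the node layout and operation names below are hypothetical, not any published search space):

```python
import numpy as np

# A tiny 5-node cell as a DAG: node 0 is the input, node 4 the output.
# adjacency[i, j] = 1 means node i feeds node j; a strictly
# upper-triangular matrix guarantees acyclicity.
adjacency = np.array([
    [0, 1, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
])

# One label per node (endpoints labeled 'input'/'output')
operations = ['input', 'conv_3x3', 'conv_1x1', 'max_pool', 'output']

# Acyclicity check: only strictly upper-triangular entries allowed
assert np.array_equal(adjacency, np.triu(adjacency, k=1))

# Flatten into a single encoding vector, as surrogate-model NAS methods do
encoding = adjacency[np.triu_indices(5, k=1)]
print(encoding.tolist())  # the 10 upper-triangular entries
```

An architecture is then a point in this discrete encoding space; together with the operation assignments, the vector fully specifies the cell.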
The Optimization Problem:
The NAS objective can be formalized as a bi-level optimization:
$$\min_{\alpha \in \mathcal{A}} \mathcal{L}_{val}(w^*(\alpha), \alpha)$$
subject to:
$$w^*(\alpha) = \arg\min_w \mathcal{L}_{train}(w, \alpha)$$
Where:
This bi-level structure is fundamental: the architecture selection depends on optimal weights, but finding optimal weights requires fixing the architecture.
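A toy continuous analogue shows how alternating updates approximate this bi-level structure. This is an illustration of the idea with scalar quadratics, not a real NAS implementation:

```python
# Toy bi-level problem, scalar weights w and "architecture" alpha:
#   inner:  L_train(w, alpha) = (w - alpha)^2   =>  w*(alpha) = alpha
#   outer:  L_val(w)          = (w - 3)^2       =>  optimum at alpha = 3
# Since dw*/dalpha = 1 in this toy, the hypergradient is
#   dL_val/dalpha = 2 * (w*(alpha) - 3) * dw*/dalpha
w, alpha = 0.0, 0.0
lr = 0.1
for _ in range(500):
    # Inner loop: a few gradient steps on L_train approximate w*(alpha)
    for _ in range(5):
        w -= lr * 2 * (w - alpha)
    # Outer step: hypergradient descent on alpha (dw*/dalpha = 1 here)
    alpha -= lr * 2 * (w - 3) * 1.0

print(round(w, 2), round(alpha, 2))  # both approach 3.0
```

DARTS-style methods follow exactly this pattern, except that the inner problem is network training, the outer variables are continuous architecture weights, and the hypergradient must be approximated rather than computed exactly.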
"""Neural Architecture Search: Formal Components This module illustrates the core abstractions of NASin a clear, educational implementation.""" from abc import ABC, abstractmethodfrom dataclasses import dataclassfrom typing import List, Dict, Any, Callable, Tupleimport numpy as np # =============================================================# SEARCH SPACE: Defines what architectures can be expressed# ============================================================= @dataclassclass Operation: """A single operation in the search space""" name: str # e.g., 'conv_3x3', 'skip_connect', 'max_pool' params: Dict # Operation-specific parameters flops: Callable # Function to compute FLOPs given input shape class SearchSpace(ABC): """Abstract base class for NAS search spaces""" @abstractmethod def sample_random_architecture(self) -> 'Architecture': """Sample a random architecture uniformly""" pass @abstractmethod def get_neighbors(self, arch: 'Architecture') -> List['Architecture']: """Get neighboring architectures (for local search)""" pass @abstractmethod def encode(self, arch: 'Architecture') -> np.ndarray: """Encode architecture to vector (for ML-based methods)""" pass class CellSearchSpace(SearchSpace): """ Cell-based search space (NASNet-style) The cell is a DAG with: - 2 input nodes (from previous cells) - N intermediate nodes - Each intermediate node selects 2 inputs and 2 operations """ def __init__( self, operations: List[Operation], num_intermediate_nodes: int = 4 ): self.operations = operations self.num_intermediate_nodes = num_intermediate_nodes def sample_random_architecture(self) -> 'CellArchitecture': """ Sample architecture by randomly selecting: - Input connections for each intermediate node - Operations for each selected input """ nodes = [] for i in range(self.num_intermediate_nodes): # Available inputs: 2 cell inputs + previous intermediate nodes available_inputs = 2 + i # Select 2 inputs for this node input1 = np.random.randint(0, available_inputs) input2 
= np.random.randint(0, available_inputs) # Select operations for each input op1 = np.random.randint(0, len(self.operations)) op2 = np.random.randint(0, len(self.operations)) nodes.append({ 'inputs': (input1, input2), 'operations': (op1, op2) }) return CellArchitecture(nodes, self.operations) # =============================================================# PERFORMANCE ESTIMATION: Evaluate architecture quality# ============================================================= class PerformanceEstimator(ABC): """Abstract base class for performance estimation strategies""" @abstractmethod def estimate( self, architecture: 'Architecture', dataset: Any ) -> float: """Estimate performance of architecture on dataset""" pass @property @abstractmethod def cost(self) -> float: """Approximate computational cost per evaluation""" pass class FullTrainingEstimator(PerformanceEstimator): """Fully train and evaluate each architecture (expensive but accurate)""" def __init__(self, epochs: int = 200, batch_size: int = 64): self.epochs = epochs self.batch_size = batch_size def estimate(self, architecture, dataset) -> float: # Build network from architecture # Train for full epochs # Return validation accuracy pass # Implementation depends on training framework @property def cost(self) -> float: return self.epochs # Relative cost unit class WeightSharingEstimator(PerformanceEstimator): """Use weight sharing supernet for fast evaluation""" def __init__(self, supernet: 'SuperNet'): self.supernet = supernet def estimate(self, architecture, dataset) -> float: # Extract subnet weights from supernet # Evaluate on validation set # No additional training needed pass @property def cost(self) -> float: return 0.001 # Orders of magnitude cheaper # =============================================================# SEARCH STRATEGY: Navigate the search space# ============================================================= class SearchStrategy(ABC): """Abstract base class for NAS search strategies""" 
@abstractmethod def search( self, search_space: SearchSpace, estimator: PerformanceEstimator, dataset: Any, budget: int ) -> 'Architecture': """ Search for the best architecture. Args: search_space: Space of possible architectures estimator: Method to evaluate architectures dataset: Data for evaluation budget: Computational budget (number of evaluations) Returns: Best architecture found """ pass class RandomSearch(SearchStrategy): """Simple random search baseline""" def search(self, search_space, estimator, dataset, budget): best_arch = None best_perf = float('-inf') for _ in range(budget): arch = search_space.sample_random_architecture() perf = estimator.estimate(arch, dataset) if perf > best_perf: best_perf = perf best_arch = arch return best_arch class EvolutionarySearch(SearchStrategy): """Regularized evolution for NAS (AmoebaNet-style)""" def __init__( self, population_size: int = 50, tournament_size: int = 10, mutation_rate: float = 1.0 ): self.population_size = population_size self.tournament_size = tournament_size self.mutation_rate = mutation_rate def search(self, search_space, estimator, dataset, budget): # Initialize population with random architectures population = [] for _ in range(self.population_size): arch = search_space.sample_random_architecture() perf = estimator.estimate(arch, dataset) population.append((arch, perf)) evaluations = self.population_size while evaluations < budget: # Tournament selection tournament = np.random.choice( len(population), self.tournament_size, replace=False ) parent = max( [population[i] for i in tournament], key=lambda x: x[1] )[0] # Mutation: random neighbor neighbors = search_space.get_neighbors(parent) child = np.random.choice(neighbors) # Evaluate child child_perf = estimator.estimate(child, dataset) evaluations += 1 # Add to population, remove oldest population.append((child, child_perf)) population.pop(0) # Remove oldest (regularized evolution) # Return best found return max(population, key=lambda x: x[1])[0]Key 
Theoretical Insights:
Discrete vs Continuous: The architecture space is fundamentally discrete, but continuous relaxations (like DARTS) enable gradient-based optimization at the cost of approximation error.
Exploration vs Exploitation: Search strategies must balance exploring diverse architectures (to avoid local optima) with exploiting promising regions (to refine good candidates).
Sample Complexity: The number of architecture evaluations needed scales with the difficulty of the search space. Simpler spaces require fewer evaluations.
Performance Estimation Bias: Cheaper estimation methods may incorrectly rank architectures, leading search astray. The fidelity-efficiency trade-off is crucial.
Rigorous comparison of NAS methods requires standardized benchmarks that enable reproducible evaluation. The development of NAS benchmarks represents a methodological advance that has brought scientific rigor to the field.
The Pre-Benchmark Problem:
Before standardized benchmarks, NAS papers were difficult to compare:
Tabular Benchmarks:
The solution: pre-compute all architecture evaluations and store them in lookup tables. Researchers can then simulate any search algorithm without actual training:
| Benchmark | Search Space Size | Evaluations Stored | Key Features |
|---|---|---|---|
| NAS-Bench-101 | 423,624 architectures | Full training curves (108 epochs) | First tabular benchmark; cell-based on CIFAR-10 |
| NAS-Bench-201 | 15,625 architectures | 3 datasets × multiple seeds | Smaller but multi-dataset; CIFAR-10/100, ImageNet-16 |
| NAS-Bench-301 | ~10^18 architectures | Surrogate model predictions | DARTS space; uses surrogate instead of exhaustive |
| TransNAS-Bench-101 | 4,096 architectures (cell-level space) | 7 diverse vision tasks | Cross-task transferability focus; also includes a macro-level space |
"""Using NAS-Bench-201 for reproducible NAS research NAS-Bench-201 provides pre-computed results for 15,625architectures across multiple datasets and seeds.""" # Install: pip install nas-bench-201 from nas_201_api import NASBench201API # Load benchmark APIapi = NASBench201API( 'NAS-Bench-201-v1_1-096897.pth', verbose=False) # Get number of architecturesprint(f"Total architectures: {len(api)}") # 15625 # Query a specific architecture by indexarch_index = 1234arch_str = api.arch(arch_index)print(f"Architecture {arch_index}: {arch_str}") # Get performance on different datasetsdatasets = ['cifar10-valid', 'cifar100', 'ImageNet16-120'] for dataset in datasets: info = api.query_by_index(arch_index, dataset) # info contains: # - train_acc_1 through train_acc_200: training accuracy per epoch # - test_acc: final test accuracy # - train_loss, test_loss: losses # - train_time: time per epoch print(f" {dataset}: {info.get_metrics()['accuracy']:.2f}%") # Simulate NAS: Random Search on benchmarkimport random def random_search_on_benchmark(api, dataset, num_samples): """ Random search using benchmark for evaluation. This is VERY fast since we query pre-computed results instead of actually training networks. """ best_acc = 0 best_arch = None for _ in range(num_samples): # Sample random architecture index arch_idx = random.randint(0, len(api) - 1) # Query pre-computed performance (instant!) 
info = api.query_by_index(arch_idx, dataset) acc = info.get_metrics()['accuracy'] if acc > best_acc: best_acc = acc best_arch = api.arch(arch_idx) return best_arch, best_acc # Run random searchbest_arch, best_acc = random_search_on_benchmark( api, 'cifar10-valid', num_samples=1000)print(f"Best found: {best_acc:.2f}% - {best_arch}") # Compare against optimal# We can find the actual best architecture since space is enumerableall_accs = [ api.query_by_index(i, 'cifar10-valid').get_metrics()['accuracy'] for i in range(len(api))]optimal_idx = max(range(len(api)), key=lambda i: all_accs[i])optimal_acc = all_accs[optimal_idx]print(f"Optimal: {optimal_acc:.2f}%")print(f"Gap: {optimal_acc - best_acc:.2f}%")What Benchmarks Revealed:
Standardized benchmarks led to several important findings:
Random Search is Competitive: Well-tuned random search often matches or exceeds sophisticated NAS algorithms. This raised the bar for claiming algorithmic improvements.
Performance Estimation Dominates: The choice of performance estimation strategy matters more than the search algorithm for many problems.
Reproducibility Issues: Many published NAS algorithms showed high variance across random seeds, reducing the significance of small accuracy gains.
Weight Sharing Correlation: One-shot methods' rankings don't perfectly correlate with standalone training, limiting their reliability.
These insights have pushed the field toward more rigorous experimental practices and clearer understanding of what actually matters in NAS.
Benchmarks have limitations: they cover only specific search spaces, datasets, and training protocols. Performance on NAS-Bench-101 doesn't guarantee real-world applicability. Additionally, optimizing for benchmark performance may not translate to practical improvements. Use benchmarks for fair comparison, not as the ultimate goal.
This page has established the conceptual foundations of Neural Architecture Search. Let us consolidate the key points before diving deeper into specific methods:
What's next:
With the foundations established, the following pages will dive into the technical details:
Each page builds on this foundation to develop complete mastery of Neural Architecture Search.
You now understand the fundamental motivation, taxonomy, history, and formalization of Neural Architecture Search. This conceptual foundation prepares you for the technical depth of subsequent pages, where you'll master the algorithms and techniques that power modern NAS systems.