Early NAS methods faced a fundamental efficiency problem: each candidate architecture required training from scratch. With training taking hours to days and search spaces containing billions of architectures, exhaustive evaluation was impossible.
Weight sharing (also called one-shot NAS) solved this by training a single supernet that contains all architectures in the search space as subnetworks. Different architectures share weights in the supernet, allowing performance estimation without separate training.
This innovation reduced NAS cost from thousands of GPU-days to hours—a 10,000x improvement that democratized neural architecture search.
Understand the supernet paradigm, how weight sharing works, the training and search phases of one-shot NAS, the coupling effects that limit weight sharing, and techniques to improve supernet quality.
A supernet (or one-shot model) is a single over-parameterized network that encodes the entire search space. Every architecture in the search space corresponds to a subnetwork (or subnet) of the supernet.
Key Insight:
If architectures share structure (same layer positions, operation choices), they can share weights. The supernet contains weights for all possible operations at all positions. A specific architecture uses a subset of these weights.
Mathematical Formulation:
Let $W$ be the supernet weights. For architecture $a$, the subnet weights are: $$w_a = M_a \odot W$$
where $M_a$ is a binary mask selecting relevant weights for architecture $a$.
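As a concrete check of this masking formula, here is a tiny illustrative sketch (using NumPy for brevity; the three-operation edge and its weight shapes are made up for the example):

```python
import numpy as np

# Supernet weights W for one edge: 3 candidate operations,
# each with a (2, 2) weight tensor
W = np.arange(12, dtype=float).reshape(3, 2, 2)

# Architecture a selects operation index 1 on this edge, so its
# binary mask M_a keeps op 1's weights and zeros out the rest
M_a = np.zeros_like(W)
M_a[1] = 1.0

# w_a = M_a ⊙ W: elementwise product, only op 1's weights survive
w_a = M_a * W
```

The subnet never owns separate parameters; `w_a` is just a view of which supernet weights are active for this architecture.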
Evaluation without retraining:
Once the supernet is trained, any architecture can be evaluated by selecting its subnetwork, inheriting the corresponding supernet weights, and running inference on validation data. No additional training is required; evaluation takes seconds instead of hours.
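A minimal sketch of such an evaluation routine (the helper name and signature are assumptions for illustration; it presumes a supernet whose forward pass takes an `(inputs, architecture)` pair, as in the code elsewhere in this section):

```python
import torch


def evaluate_architecture(supernet, architecture, val_loader, max_batches=None):
    """Estimate one subnet's accuracy using inherited supernet weights."""
    supernet.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for batch_idx, (inputs, targets) in enumerate(val_loader):
            if max_batches is not None and batch_idx >= max_batches:
                break
            # The forward pass activates only the selected operations;
            # no weights are updated
            outputs = supernet(inputs, architecture)
            correct += (outputs.argmax(dim=1) == targets).sum().item()
            total += targets.size(0)
    return correct / max(total, 1)
```

Capping `max_batches` trades a little estimation accuracy for even faster evaluation, which matters when scoring thousands of candidates.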
```python
import random

import torch
import torch.nn as nn


class SupernetCell(nn.Module):
    """A cell containing all possible operations.

    Architectures are evaluated by selecting specific paths.
    """

    def __init__(self, channels, operations, num_nodes=4):
        super().__init__()
        self.num_nodes = num_nodes
        self.num_ops = len(operations)

        # Create all possible operations for all edges
        self.edges = nn.ModuleDict()
        for i in range(num_nodes):
            for j in range(2 + i):  # 2 cell inputs + previous nodes
                edge_ops = nn.ModuleList([
                    self._build_op(op, channels) for op in operations
                ])
                self.edges[f'{j}->{2 + i}'] = edge_ops

        # Project the concatenated node outputs back to `channels`
        # so cells can be stacked
        self.project = nn.Conv2d(num_nodes * channels, channels, 1, bias=False)

    def _build_op(self, op_name, channels):
        """Map an operation name to a module (illustrative subset;
        extend with the search space's real operations)."""
        if op_name == 'skip_connect':
            return nn.Identity()
        if op_name == 'conv_3x3':
            return nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        raise ValueError(f'unknown operation: {op_name}')

    def forward(self, s0, s1, architecture):
        """Forward pass for a specific architecture.

        architecture: dict mapping f'node_{i}' to a tuple
        (input1, op1_index, input2, op2_index)
        """
        states = [s0, s1]
        for i in range(self.num_nodes):
            # Get inputs and operations for this node from the architecture
            inp1, op1, inp2, op2 = architecture[f'node_{i}']

            # Apply the selected operations from the supernet
            out1 = self.edges[f'{inp1}->{2 + i}'][op1](states[inp1])
            out2 = self.edges[f'{inp2}->{2 + i}'][op2](states[inp2])
            states.append(out1 + out2)

        return self.project(torch.cat(states[2:], dim=1))


class Supernet(nn.Module):
    """Complete supernet for one-shot NAS."""

    def __init__(self, config):
        super().__init__()
        self.num_nodes = config.num_nodes
        self.num_ops = len(config.operations)
        self.stem = nn.Conv2d(3, config.channels, 3, padding=1, bias=False)
        self.cells = nn.ModuleList([
            SupernetCell(config.channels, config.operations, config.num_nodes)
            for _ in range(config.num_cells)
        ])
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(config.channels, config.num_classes),
        )

    def forward(self, x, architecture):
        """Forward with a specific architecture."""
        s0 = s1 = self.stem(x)
        for cell in self.cells:
            s0, s1 = s1, cell(s0, s1, architecture)
        return self.classifier(s1)

    def sample_random_architecture(self):
        """Sample a random subnet from the supernet."""
        arch = {}
        for i in range(self.num_nodes):
            valid_inputs = list(range(2 + i))
            arch[f'node_{i}'] = (
                random.choice(valid_inputs),          # input 1
                random.randint(0, self.num_ops - 1),  # op 1
                random.choice(valid_inputs),          # input 2
                random.randint(0, self.num_ops - 1),  # op 2
            )
        return arch
```

Supernet training presents a unique challenge: we want all subnets to inherit good weights, but we can't train all exponentially many of them explicitly.
Path Sampling Training:
The standard approach samples random paths (architectures) during training:
Over training, all operations receive updates proportional to their sampling probability.
```python
import torch.nn.functional as F


def train_supernet_epoch(supernet, train_loader, optimizer):
    """Train the supernet with random path sampling."""
    supernet.train()

    for inputs, targets in train_loader:
        # Sample a random architecture for this batch
        architecture = supernet.sample_random_architecture()

        # Forward with this architecture
        outputs = supernet(inputs, architecture)
        loss = F.cross_entropy(outputs, targets)

        # Backward updates only the active weights
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


def train_supernet_fairnas(supernet, train_loader, optimizer):
    """FairNAS-style training: average the loss over several sampled
    architectures per batch so operations are trained more evenly."""
    supernet.train()

    for inputs, targets in train_loader:
        total_loss = 0

        # Sample multiple architectures to improve coverage
        num_samples = 4
        for _ in range(num_samples):
            architecture = supernet.sample_random_architecture()
            outputs = supernet(inputs, architecture)
            total_loss += F.cross_entropy(outputs, targets)

        avg_loss = total_loss / num_samples
        optimizer.zero_grad()
        avg_loss.backward()
        optimizer.step()
```

Once trained, the supernet enables fast architecture evaluation. The search phase applies any search strategy (random, evolution, RL) using supernet performance as a proxy for standalone performance.
Two-Stage Search Pipeline:
```python
import random


def search_with_supernet(supernet, val_loader, search_space,
                         search_budget=1000):
    """Search for the best architecture using a trained supernet."""
    supernet.eval()
    best_arch = None
    best_acc = 0.0

    for _ in range(search_budget):
        # Sample an architecture
        arch = search_space.sample_random()

        # Evaluate using the supernet (fast!)
        acc = evaluate_architecture(supernet, arch, val_loader)

        if acc > best_acc:
            best_acc = acc
            best_arch = arch

    return best_arch, best_acc


def evolutionary_search_supernet(supernet, val_loader, search_space,
                                 population_size=50, generations=20):
    """Evolutionary search using the supernet for fitness evaluation."""
    # Initialize the population
    population = [
        search_space.sample_random() for _ in range(population_size)
    ]

    best_arch, best_fitness = None, 0.0
    for gen in range(generations):
        # Evaluate all individuals using the supernet
        fitness = [
            evaluate_architecture(supernet, arch, val_loader)
            for arch in population
        ]

        # Track the best architecture seen across generations
        # (offspring are only evaluated in the next generation)
        gen_best = max(zip(population, fitness), key=lambda x: x[1])
        if gen_best[1] > best_fitness:
            best_arch, best_fitness = gen_best

        # Select the top-k as parents
        top_k = 10
        sorted_pop = sorted(
            zip(population, fitness), key=lambda x: x[1], reverse=True
        )
        parents = [p[0] for p in sorted_pop[:top_k]]

        # Generate offspring through mutation (mutate is a helper
        # provided by the search space)
        offspring = []
        while len(offspring) < population_size - top_k:
            parent = random.choice(parents)
            child = mutate(parent, search_space)
            offspring.append(child)

        population = parents + offspring

    # Return the best architecture found over the whole search
    return best_arch, best_fitness
```

Weight sharing has a fundamental limitation: weight coupling. When architectures share weights, the optimal weights for one architecture may not be optimal for another.
The Problem:
Consider two architectures $a_1$ and $a_2$ that share some weights $w_s$. Trained standalone, each would reach its own optimum: $$w_s^{*} = \arg\min_{w} \mathcal{L}(a_1, w), \qquad w_s^{**} = \arg\min_{w} \mathcal{L}(a_2, w)$$
If $w_s^{*} \neq w_s^{**}$, the shared supernet weights can't be optimal for both.
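A deliberately simplified scalar example (made up for illustration, not from any real search space) makes the compromise concrete:

```python
# One shared scalar weight w; two architectures with different optima.
def loss_a1(w):
    return (w - 1.0) ** 2   # a1's loss, minimized at w = +1


def loss_a2(w):
    return (w + 1.0) ** 2   # a2's loss, minimized at w = -1


# Uniform path sampling effectively minimizes the average of the two
# losses, whose minimum is w = 0: optimal for neither architecture.
w_shared = 0.0
gap_a1 = loss_a1(w_shared) - loss_a1(+1.0)  # excess loss vs standalone
gap_a2 = loss_a2(w_shared) - loss_a2(-1.0)
```

Both subnets pay a nonzero penalty (`gap_a1 = gap_a2 = 1.0` here) relative to standalone training, which is exactly why supernet accuracy is a biased estimate.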
Consequences:
| Study | Search Space | Kendall τ | Finding |
|---|---|---|---|
| Yu et al. 2020 | DARTS space | ~0.2 | Low correlation; significant mismatch |
| Zela et al. 2020 | NAS-Bench-201 | ~0.5-0.7 | Moderate correlation; depends on training |
| Yang et al. 2021 | Various | Varies | Correlation degrades with space size |
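Kendall τ itself is straightforward to compute. A small self-contained sketch (the five accuracy pairs are hypothetical, chosen only to illustrate a moderate correlation; ties are ignored for simplicity):

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall rank correlation: (concordant - discordant) / total pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1      # pair ranked the same way in both lists
        elif s < 0:
            discordant += 1      # pair ranked oppositely
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs


# Hypothetical accuracies for five architectures
supernet_acc = [0.62, 0.58, 0.65, 0.60, 0.55]    # proxy ranking
standalone_acc = [0.93, 0.94, 0.95, 0.91, 0.90]  # ground-truth ranking

tau = kendall_tau(supernet_acc, standalone_acc)  # 0.6 for this data
```

A τ of 1.0 means the supernet ranks architectures exactly as standalone training would; values around 0.2, as reported for the DARTS space, mean the proxy ranking is only weakly informative.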
Weight coupling means supernet evaluation is a proxy, not ground truth. The best architecture by supernet evaluation may not be best when trained standalone. Always validate top candidates with full training.
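One common guard against this proxy gap is to retrain only the supernet's top-ranked candidates from scratch and pick the winner by true accuracy. A minimal sketch (the `train_from_scratch` callback is a hypothetical helper assumed to fully train an architecture and return its validation accuracy):

```python
def rerank_top_candidates(candidates, supernet_scores,
                          train_from_scratch, top_k=5):
    """Re-rank the supernet's top-k candidates by standalone accuracy."""
    ranked = sorted(zip(candidates, supernet_scores),
                    key=lambda pair: pair[1], reverse=True)
    finalists = [arch for arch, _ in ranked[:top_k]]

    # Full training costs hours per architecture, so keep top_k small
    true_accs = [train_from_scratch(arch) for arch in finalists]

    best = max(range(len(finalists)), key=lambda i: true_accs[i])
    return finalists[best], true_accs[best]
```

This keeps the expensive step bounded: the supernet filters billions of architectures down to a handful, and full training only has to break ties among those.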
Researchers have developed techniques to mitigate weight coupling and improve supernet reliability:
```python
import torch.nn.functional as F


def sandwich_training_step(supernet, inputs, targets, optimizer):
    """Sandwich training: train the smallest, largest, and random subnets.

    Improves supernet quality for extreme architectures.
    """
    total_loss = 0

    # 1. Train the smallest subnet
    smallest_arch = supernet.get_smallest_architecture()
    out = supernet(inputs, smallest_arch)
    total_loss += F.cross_entropy(out, targets)

    # 2. Train the largest subnet
    largest_arch = supernet.get_largest_architecture()
    out = supernet(inputs, largest_arch)
    total_loss += F.cross_entropy(out, targets)

    # 3. Train random subnets (in between)
    for _ in range(2):
        random_arch = supernet.sample_random_architecture()
        out = supernet(inputs, random_arch)
        total_loss += F.cross_entropy(out, targets)

    # Average over the four sampled subnets and update
    optimizer.zero_grad()
    (total_loss / 4).backward()
    optimizer.step()
```

ENAS (Efficient Neural Architecture Search) introduced weight sharing to NAS, reducing search cost from thousands of GPU-days to roughly 0.5 GPU-days.
Key Ideas:
- Every candidate architecture is a subgraph of one large computational graph (DAG), so all candidates share weights.
- An LSTM controller samples architectures and is trained with REINFORCE, using the sampled child model's validation accuracy as the reward.
- Training alternates between updating the shared weights on sampled architectures and updating the controller.
Innovation: By sharing weights, ENAS amortizes the cost of training across all evaluated architectures.
You now understand weight sharing—the technique that made NAS practical. Next, we'll explore efficient NAS methods including zero-cost proxies and other acceleration techniques.