For years, the deep learning community scaled convolutional neural networks in an ad-hoc fashion. Want better accuracy? Make the network deeper. Still not good enough? Add more channels. Need even more performance? Use higher resolution inputs. Each dimension was tuned independently, often requiring expensive grid searches and leading to suboptimal efficiency.
EfficientNet (Tan & Le, 2019) fundamentally changed this paradigm by introducing compound scaling—a principled method for uniformly scaling depth, width, and resolution simultaneously using a fixed set of scaling coefficients. The result was a family of models that achieved state-of-the-art accuracy while being significantly more parameter-efficient and computationally cheaper than previous architectures.
By the end of this page, you will understand the theoretical foundations of compound scaling, analyze the EfficientNet-B0 baseline architecture discovered through neural architecture search, explore the complete EfficientNet family (B0-B7 and beyond), and appreciate why this work represents a paradigm shift in how we design and scale CNNs.
Before EfficientNet, researchers scaled neural networks along three primary dimensions, typically one at a time:
Depth Scaling: Adding more layers to the network. ResNet demonstrated that depth could be increased dramatically (from 18 to 152+ layers) with skip connections. Deeper networks can capture more complex hierarchical features, but they eventually suffer from diminishing returns and training difficulties.
Width Scaling: Increasing the number of channels (filters) in each layer. Wider networks can capture more fine-grained features at each level. WideResNet showed that making networks wider could sometimes outperform making them deeper, but excessive width creates memory bottlenecks.
Resolution Scaling: Using higher resolution input images. Higher resolution provides more spatial detail, potentially improving accuracy, especially for fine-grained classification. However, computational cost scales quadratically with resolution (since convolutions operate over height × width).
| Dimension | Method | Benefits | Limitations |
|---|---|---|---|
| Depth | Add more layers | Richer hierarchical features, proven effective (ResNet) | Vanishing gradients, diminishing returns, increased latency |
| Width | More channels per layer | Fine-grained features, easier to train than very deep nets | Memory intensive, captures less hierarchy |
| Resolution | Higher input size | More spatial detail, better for fine-grained tasks | Quadratic compute cost, memory explosion |
The fundamental problem: Each scaling dimension has diminishing returns when applied in isolation. Making a network twice as deep doesn't double its accuracy—it might improve accuracy by a few percentage points but significantly increases computation. Similarly, doubling width or resolution yields diminishing improvements.
More critically, scaling only one dimension creates imbalanced networks. A very deep but narrow network cannot capture fine-grained features well. A very wide but shallow network misses hierarchical abstractions. A high-resolution input on a shallow network wastes the additional pixels because the receptive field cannot integrate enough context.
The key insight: Different scaling dimensions are not independent—they interact with each other. A deeper network may need higher resolution to leverage its larger receptive field. A wider network benefits from more layers to transform its additional channels into meaningful features.
Before EfficientNet, finding the optimal combination of depth, width, and resolution required expensive grid searches over an enormous hyperparameter space. With each dimension having dozens of possible values, exhaustively searching all combinations was computationally infeasible. Researchers relied on intuition and incremental experimentation, often leading to inefficient architectures.
EfficientNet's core contribution is the compound scaling method, which scales all three dimensions uniformly using a single compound coefficient φ (phi). Rather than tuning depth, width, and resolution independently, the method uses fixed ratios between them, determined by a small grid search on the baseline network.
The Compound Scaling Formula:
Given a compound coefficient φ, the scaling is defined as:

depth: d = α^φ
width: w = β^φ
resolution: r = γ^φ

where α, β, and γ are constants determined by a small grid search, subject to α · β² · γ² ≈ 2 with α ≥ 1, β ≥ 1, γ ≥ 1.
The constraint α · β² · γ² ≈ 2 ensures that total FLOPs scale approximately as 2^φ. The quadratic terms for β and γ reflect how convolution cost grows: FLOPs scale linearly with depth (more layers), but quadratically with width (both input and output channel counts grow) and quadratically with resolution (spatial positions grow as height × width).
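As a quick sanity check on that reasoning, the sketch below (an illustrative calculation, not from the paper) counts multiply-accumulates for a single convolutional layer and shows that doubling width or resolution roughly quadruples its cost, while doubling depth only doubles it.

```python
# Illustrative FLOP count for one conv layer: H * W * C_in * C_out * k * k
def conv_flops(h, w, c_in, c_out, k=3):
    return h * w * c_in * c_out * k * k

base = conv_flops(56, 56, 64, 64)

# Width scaling multiplies both input and output channels -> ~quadratic cost
wide = conv_flops(56, 56, 128, 128)

# Resolution scaling multiplies both spatial dimensions -> ~quadratic cost
hires = conv_flops(112, 112, 64, 64)

# Depth scaling just repeats layers -> linear cost
deep = 2 * base

print(f"2x width:      {wide / base:.1f}x FLOPs")   # ~4.0x
print(f"2x resolution: {hires / base:.1f}x FLOPs")  # ~4.0x
print(f"2x depth:      {deep / base:.1f}x FLOPs")   # 2.0x
```

The next listing applies these relationships to the actual EfficientNet coefficients.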
```python
# Compound Scaling Coefficients for EfficientNet
# Determined via grid search on EfficientNet-B0

# Base scaling coefficients
ALPHA = 1.2   # Depth scaling base
BETA = 1.1    # Width scaling base
GAMMA = 1.15  # Resolution scaling base

# Verify constraint: α · β² · γ² ≈ 2
constraint = ALPHA * (BETA ** 2) * (GAMMA ** 2)
print(f"Constraint value: {constraint:.3f}")  # Should be ≈ 2.0

def compute_scaling(phi: float) -> dict:
    """
    Compute scaling factors for a given compound coefficient phi.

    Args:
        phi: Compound scaling coefficient (e.g., 0, 1, 2, ...)

    Returns:
        Dictionary with depth, width, and resolution multipliers
    """
    depth_multiplier = ALPHA ** phi
    width_multiplier = BETA ** phi
    resolution_multiplier = GAMMA ** phi

    # Approximate FLOPs scaling
    flops_scaling = (ALPHA * (BETA ** 2) * (GAMMA ** 2)) ** phi

    return {
        "depth": depth_multiplier,
        "width": width_multiplier,
        "resolution": resolution_multiplier,
        "approx_flops": flops_scaling
    }

# EfficientNet family scaling
for phi in range(8):  # B0 through B7
    scaling = compute_scaling(phi)
    print(f"EfficientNet-B{phi}: depth={scaling['depth']:.2f}x, "
          f"width={scaling['width']:.2f}x, res={scaling['resolution']:.2f}x, "
          f"~{scaling['approx_flops']:.1f}x FLOPs")
```

Why This Works:
The compound scaling method succeeds because it respects the interdependence of scaling dimensions:
Depth and Resolution Synergy: Deeper networks have larger receptive fields, which can only be utilized with sufficient input resolution. Scaling both together ensures the network can "see" enough context to leverage its depth.
Width and Depth Balance: More channels require more layers to transform them into useful features. Scaling width and depth together prevents bottlenecks where information cannot flow through an imbalanced architecture.
Resolution and Width Coupling: Higher resolution inputs produce more spatial locations, each requiring processing by the convolutional filters. More channels per layer help capture the additional information present in higher-resolution inputs.
Empirical results confirm this intuition: compound scaling consistently outperforms single-dimension scaling at equivalent FLOP budgets.
The α, β, γ values are determined by a small grid search on the baseline network (EfficientNet-B0) with φ=1. This one-time search fixes the ratios, and subsequent models (B1-B7) are created simply by varying φ. This dramatically reduces the hyperparameter search space from three independent dimensions to a single scalar.
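A minimal sketch of that one-time search is shown below. In the actual method each candidate (α, β, γ) is evaluated by training the scaled network at φ=1; here a hypothetical placeholder `evaluate_accuracy` stands in for that step, and candidates are simply filtered by the α · β² · γ² ≈ 2 constraint.

```python
import itertools

def candidate_grid(step=0.05, tol=0.1):
    """Enumerate (alpha, beta, gamma) >= 1 whose FLOP factor is ~2 at phi=1."""
    values = [round(1.0 + i * step, 2) for i in range(0, 11)]  # 1.00 .. 1.50
    for alpha, beta, gamma in itertools.product(values, repeat=3):
        flop_factor = alpha * beta**2 * gamma**2
        if abs(flop_factor - 2.0) <= tol:
            yield alpha, beta, gamma

def evaluate_accuracy(alpha, beta, gamma):
    # Placeholder: in the real search, the B0 baseline is scaled by
    # (alpha, beta, gamma) with phi=1, trained, and validated on ImageNet.
    raise NotImplementedError

candidates = list(candidate_grid())
print(f"{len(candidates)} candidate triples satisfy the constraint")
print((1.2, 1.1, 1.15) in candidates)  # the published values are among them
# best = max(candidates, key=lambda c: evaluate_accuracy(*c))
```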
While compound scaling is elegant, its effectiveness depends critically on the quality of the baseline architecture being scaled. Rather than starting from a hand-designed architecture like ResNet or VGG, EfficientNet uses Neural Architecture Search (NAS) to discover an optimal baseline.
The baseline, EfficientNet-B0, was discovered using a multi-objective NAS approach (specifically, a variant of MnasNet) that optimized for both accuracy and FLOP efficiency. The search space included decisions about which convolutional operators to use, kernel sizes, the number of layers per stage, and per-stage channel widths. The resulting architecture is summarized below:
| Stage | Operator | Resolution | Channels | Layers |
|---|---|---|---|---|
| 1 | Conv3×3 | 224×224 | 32 | 1 |
| 2 | MBConv1, k3×3 | 112×112 | 16 | 1 |
| 3 | MBConv6, k3×3 | 112×112 | 24 | 2 |
| 4 | MBConv6, k5×5 | 56×56 | 40 | 2 |
| 5 | MBConv6, k3×3 | 28×28 | 80 | 3 |
| 6 | MBConv6, k5×5 | 14×14 | 112 | 3 |
| 7 | MBConv6, k5×5 | 14×14 | 192 | 4 |
| 8 | MBConv6, k3×3 | 7×7 | 320 | 1 |
| 9 | Conv1×1 + Pool + FC | 7×7→1×1 | 1280 | 1 |
Understanding MBConv (Mobile Inverted Bottleneck Convolution):
The core building block of EfficientNet is the MBConv layer, which originated from MobileNetV2. MBConv implements an inverted residual structure: (1) a 1×1 convolution expands the channel count by the expansion ratio (e.g., 6× for MBConv6); (2) a k×k depthwise convolution processes each expanded channel spatially; (3) a squeeze-and-excitation block reweights channels; (4) a 1×1 linear projection reduces back to the output width. A residual connection is added when the stride is 1 and input and output channels match.
This design is highly efficient because the spatial convolution is depthwise, so each channel is processed independently and its cost stays low even on the expanded representation, while the comparatively cheap 1×1 convolutions handle channel mixing.
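To make that concrete, the rough count below (an illustrative estimate with assumed shapes, not figures from the paper) compares the depthwise 3×3 plus 1×1 projection against a single regular 3×3 convolution performing the same channel transform.

```python
# Rough multiply-accumulate counts, illustrative shapes only
h, w = 56, 56
c = 24                  # input/output channels
expanded = 6 * c        # MBConv6 expansion

# MBConv spatial path: depthwise 3x3 on expanded channels + 1x1 projection
depthwise = h * w * expanded * 3 * 3   # each channel convolved independently
pointwise = h * w * expanded * c       # 1x1 projection back to c channels
mbconv_spatial = depthwise + pointwise

# A regular 3x3 conv doing the same expanded -> c transform in one shot
regular = h * w * expanded * c * 3 * 3

print(f"MBConv spatial path: {mbconv_spatial / 1e6:.1f} MFLOPs")
print(f"Regular 3x3 conv:    {regular / 1e6:.1f} MFLOPs")
print(f"Savings:             {regular / mbconv_spatial:.1f}x fewer FLOPs")
```

The full MBConv block, including squeeze-and-excitation, is implemented below.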
```python
import torch
import torch.nn as nn

class SqueezeExcitation(nn.Module):
    """
    Squeeze-and-Excitation block for channel attention.

    Computes channel-wise importance weights by:
    1. Global average pooling to get channel statistics
    2. Two FC layers to compute attention weights
    3. Sigmoid activation to get scaling factors
    """
    def __init__(self, in_channels: int, reduced_dim: int):
        super().__init__()
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                  # Global pooling: (B, C, H, W) -> (B, C, 1, 1)
            nn.Conv2d(in_channels, reduced_dim, 1),   # Reduction
            nn.SiLU(inplace=True),                    # Swish activation
            nn.Conv2d(reduced_dim, in_channels, 1),   # Expansion
            nn.Sigmoid()                              # Attention weights in [0, 1]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.se(x)  # Channel-wise scaling

class MBConv(nn.Module):
    """
    Mobile Inverted Bottleneck Convolution (MBConv) block.

    Structure:
        Input -> Expand (1x1) -> Depthwise (kxk) -> SE -> Project (1x1) -> Output
                 (+ residual connection if stride == 1 and in_ch == out_ch)

    Args:
        in_channels: Input channel count
        out_channels: Output channel count
        kernel_size: Spatial kernel size for depthwise conv
        stride: Stride for depthwise conv (for downsampling)
        expand_ratio: Channel expansion factor (1 for MBConv1, 6 for MBConv6)
        se_ratio: Squeeze-Excitation reduction ratio (typically 0.25)
    """
    def __init__(
        self,
        in_channels: int,
        out_channels: int,
        kernel_size: int = 3,
        stride: int = 1,
        expand_ratio: int = 6,
        se_ratio: float = 0.25
    ):
        super().__init__()
        self.stride = stride
        self.use_residual = (stride == 1) and (in_channels == out_channels)

        # Expanded channel dimension
        expanded_channels = in_channels * expand_ratio

        layers = []

        # 1. Expansion phase (skip if expand_ratio == 1)
        if expand_ratio != 1:
            layers.extend([
                nn.Conv2d(in_channels, expanded_channels, 1, bias=False),
                nn.BatchNorm2d(expanded_channels),
                nn.SiLU(inplace=True)  # Swish activation
            ])

        # 2. Depthwise convolution
        layers.extend([
            nn.Conv2d(
                expanded_channels, expanded_channels, kernel_size,
                stride=stride, padding=kernel_size // 2,
                groups=expanded_channels, bias=False  # groups=channels for depthwise
            ),
            nn.BatchNorm2d(expanded_channels),
            nn.SiLU(inplace=True)
        ])

        # 3. Squeeze-and-Excitation
        reduced_dim = max(1, int(in_channels * se_ratio))
        layers.append(SqueezeExcitation(expanded_channels, reduced_dim))

        # 4. Projection (linear, no activation)
        layers.extend([
            nn.Conv2d(expanded_channels, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels)
        ])

        self.block = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.block(x)
        if self.use_residual:
            out = out + x  # Residual connection
        return out
```

EfficientNet uses the Swish activation (also called SiLU: Sigmoid Linear Unit), defined as f(x) = x · σ(x). Swish is smooth, non-monotonic, and has been shown to outperform ReLU on deep networks. It was discovered through automated search and has become standard in modern architectures.
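As a small illustration, the snippet below defines Swish directly from its formula and checks that it matches PyTorch's built-in `nn.SiLU`.

```python
import torch
import torch.nn as nn

def swish(x: torch.Tensor) -> torch.Tensor:
    """Swish / SiLU: f(x) = x * sigmoid(x)."""
    return x * torch.sigmoid(x)

x = torch.linspace(-4, 4, steps=9)
print(swish(x))
print(torch.allclose(swish(x), nn.SiLU()(x)))  # True: SiLU is the same function
```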
With the baseline architecture (B0) and compound scaling coefficients (α, β, γ) fixed, creating the EfficientNet family is remarkably simple: just vary φ from 0 to 7 (and beyond for later variants).
Each model in the family offers a different accuracy-efficiency tradeoff, allowing practitioners to select the appropriate model for their computational budget:
| Model | Input Resolution | Parameters | FLOPs | Top-1 Accuracy (ImageNet) |
|---|---|---|---|---|
| EfficientNet-B0 | 224×224 | 5.3M | 0.39B | 77.3% |
| EfficientNet-B1 | 240×240 | 7.8M | 0.70B | 79.2% |
| EfficientNet-B2 | 260×260 | 9.2M | 1.0B | 80.3% |
| EfficientNet-B3 | 300×300 | 12M | 1.8B | 81.7% |
| EfficientNet-B4 | 380×380 | 19M | 4.2B | 83.0% |
| EfficientNet-B5 | 456×456 | 30M | 9.9B | 83.7% |
| EfficientNet-B6 | 528×528 | 43M | 19B | 84.2% |
| EfficientNet-B7 | 600×600 | 66M | 37B | 84.4% |
Analyzing the Scaling Results:
Efficiency Gains: EfficientNet-B0 matches or exceeds ResNet-50's accuracy with roughly 5× fewer parameters (5.3M vs. about 26M). EfficientNet-B7 matches the accuracy of the best published models (at the time) while using 8.4× fewer parameters.
Accuracy Scaling: Each step from B0 to B7 provides meaningful accuracy improvements, demonstrating that compound scaling continues to be effective across a wide range of model sizes.
Diminishing Returns at Scale: The jump from B0 to B1 provides ~2% accuracy improvement for ~2× FLOPs. The jump from B6 to B7 provides only ~0.2% improvement for ~2× FLOPs. This reflects the general principle that accuracy improvements become harder at higher accuracy levels.
Resolution Scaling: Input resolution grows substantially across the family. B7 uses 600×600 inputs, more than 7× as many pixels as B0's 224×224. This has significant implications for data augmentation and inference preprocessing.
```python
import torch
import torch.nn as nn
from typing import List, Tuple
from dataclasses import dataclass

@dataclass
class EfficientNetConfig:
    """Configuration for EfficientNet model family."""
    # (expand_ratio, channels, layers, kernel, stride)
    stage_configs: List[Tuple[int, int, int, int, int]] = None
    # Compound scaling parameters
    width_mult: float = 1.0
    depth_mult: float = 1.0
    resolution: int = 224
    dropout_rate: float = 0.2

    def __post_init__(self):
        if self.stage_configs is None:
            # EfficientNet-B0 baseline configuration
            # (expand_ratio, out_channels, num_layers, kernel_size, stride)
            self.stage_configs = [
                (1, 16, 1, 3, 1),    # Stage 2: MBConv1
                (6, 24, 2, 3, 2),    # Stage 3: MBConv6
                (6, 40, 2, 5, 2),    # Stage 4: MBConv6
                (6, 80, 3, 3, 2),    # Stage 5: MBConv6
                (6, 112, 3, 5, 1),   # Stage 6: MBConv6
                (6, 192, 4, 5, 2),   # Stage 7: MBConv6
                (6, 320, 1, 3, 1),   # Stage 8: MBConv6
            ]

# Compound scaling coefficients
EFFICIENTNET_PARAMS = {
    # (phi, resolution, dropout)
    'b0': (0, 224, 0.2),
    'b1': (0.5, 240, 0.2),
    'b2': (1, 260, 0.3),
    'b3': (2, 300, 0.3),
    'b4': (3, 380, 0.4),
    'b5': (4, 456, 0.4),
    'b6': (5, 528, 0.5),
    'b7': (6, 600, 0.5),
}

def get_efficientnet_config(model_name: str) -> EfficientNetConfig:
    """
    Get configuration for a specific EfficientNet variant.
    Applies compound scaling coefficients to the B0 baseline.
    """
    if model_name not in EFFICIENTNET_PARAMS:
        raise ValueError(f"Unknown model: {model_name}")

    phi, resolution, dropout = EFFICIENTNET_PARAMS[model_name]

    # Compute scaling multipliers
    # α = 1.2, β = 1.1, γ = 1.15
    alpha, beta, gamma = 1.2, 1.1, 1.15
    depth_mult = alpha ** phi
    width_mult = beta ** phi

    return EfficientNetConfig(
        width_mult=width_mult,
        depth_mult=depth_mult,
        resolution=resolution,
        dropout_rate=dropout
    )

class EfficientNet(nn.Module):
    """
    EfficientNet model implementation.

    Builds the network from configuration, applying compound scaling
    to depth (layer counts) and width (channel counts).
    """
    def __init__(self, config: EfficientNetConfig, num_classes: int = 1000):
        super().__init__()
        self.config = config

        # Scale helper functions
        def scale_depth(d: int) -> int:
            return max(1, int(d * config.depth_mult + 0.5))

        def scale_width(w: int) -> int:
            # Round to nearest multiple of 8 for efficiency
            w = int(w * config.width_mult + 0.5)
            return max(8, (w + 4) // 8 * 8)

        # Stem: Initial conv layer
        stem_channels = scale_width(32)
        self.stem = nn.Sequential(
            nn.Conv2d(3, stem_channels, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(stem_channels),
            nn.SiLU(inplace=True)
        )

        # Build MBConv stages
        stages = []
        in_channels = stem_channels
        for expand, out_ch, layers, kernel, stride in config.stage_configs:
            out_channels = scale_width(out_ch)
            num_layers = scale_depth(layers)
            stage = self._build_stage(
                in_channels, out_channels, num_layers, expand, kernel, stride
            )
            stages.append(stage)
            in_channels = out_channels
        self.stages = nn.Sequential(*stages)

        # Head: Final conv, pooling, and classifier
        head_channels = scale_width(1280)
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, head_channels, 1, bias=False),
            nn.BatchNorm2d(head_channels),
            nn.SiLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Dropout(config.dropout_rate),
            nn.Linear(head_channels, num_classes)
        )

    def _build_stage(
        self, in_ch: int, out_ch: int, num_layers: int,
        expand: int, kernel: int, stride: int
    ) -> nn.Sequential:
        """Build a stage of MBConv blocks."""
        layers = []
        for i in range(num_layers):
            layers.append(MBConv(
                in_ch if i == 0 else out_ch,
                out_ch,
                kernel_size=kernel,
                stride=stride if i == 0 else 1,  # Only first layer can downsample
                expand_ratio=expand
            ))
        return nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.stem(x)
        x = self.stages(x)
        x = self.head(x)
        return x

# Factory functions
def efficientnet_b0(num_classes: int = 1000) -> EfficientNet:
    return EfficientNet(get_efficientnet_config('b0'), num_classes)

def efficientnet_b3(num_classes: int = 1000) -> EfficientNet:
    return EfficientNet(get_efficientnet_config('b3'), num_classes)

def efficientnet_b7(num_classes: int = 1000) -> EfficientNet:
    return EfficientNet(get_efficientnet_config('b7'), num_classes)
```

For most applications, EfficientNet-B0 through B4 offer the best tradeoffs. B0 is excellent for edge deployment and fast inference. B3-B4 provide a sweet spot for server-side inference where accuracy matters but latency is still important. B5-B7 are primarily useful when pushing for maximum accuracy regardless of cost.
While EfficientNet optimized for inference efficiency (parameters and FLOPs), EfficientNetV2 (Tan & Le, 2021) additionally optimizes for training efficiency—how fast the model can be trained on modern accelerators.
Key Insights from EfficientNetV2:
Training Bottleneck Analysis: Large image sizes (512×512+) create memory bottlenecks and slow training. Depthwise convolutions have lower hardware utilization than regular convolutions on modern GPUs/TPUs.
Progressive Resizing: Start training with smaller images and progressively increase resolution. This dramatically speeds up early training epochs when the model is still learning basic features.
Fused MBConv: Replace depthwise-separable convolutions with regular convolutions in early stages where they're actually faster on modern hardware (due to better parallelization).
Smaller Expansion Ratios: Use smaller expansion ratios (e.g., 4× instead of 6×) to reduce memory access overhead without sacrificing accuracy.
| Aspect | EfficientNet (V1) | EfficientNetV2 |
|---|---|---|
| Primary Optimization | Inference FLOPs/params | Training speed + inference |
| Image Size Strategy | Fixed resolution training | Progressive resizing |
| Early Stage Blocks | Always MBConv (depthwise) | Fused-MBConv (regular conv) |
| Expansion Ratios | Mostly 6× | Smaller (1-4×) in early layers |
| Training Speed | Baseline | ~3-4× faster for same accuracy |
Fused-MBConv Block:
In standard MBConv, a 3×3 depthwise convolution handles spatial processing. On paper this uses far fewer FLOPs than a regular convolution, but on modern accelerators (GPUs/TPUs) depthwise convolutions achieve low hardware utilization, so the overhead of many small operations can outweigh the theoretical savings.
Fused-MBConv replaces the 1×1 expansion + depthwise convolution pair with a single regular k×k convolution that expands and processes spatially in one step, followed by the usual squeeze-and-excitation block and 1×1 projection.
This is used in early stages where feature maps are large (more spatial positions to parallelize over), and the added FLOPs are offset by better hardware utilization.
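The rough count below (an illustrative estimate with assumed early-stage shapes, not figures from the paper) shows the FLOP increase that Fused-MBConv accepts in exchange for better hardware utilization.

```python
# Illustrative FLOP comparison for one early-stage block (assumed shapes)
h, w, c, expand = 112, 112, 24, 4
e = c * expand

# Standard MBConv: 1x1 expand + 3x3 depthwise + 1x1 project
mbconv = (h * w * c * e) + (h * w * e * 3 * 3) + (h * w * e * c)

# Fused-MBConv: single regular 3x3 expand + 1x1 project
fused = (h * w * c * e * 3 * 3) + (h * w * e * c)

print(f"MBConv:       {mbconv / 1e6:.0f} MFLOPs")
print(f"Fused-MBConv: {fused / 1e6:.0f} MFLOPs ({fused / mbconv:.1f}x more)")
# Despite roughly 4x more FLOPs here, Fused-MBConv is often faster in
# wall-clock time on GPUs/TPUs because one dense convolution parallelizes
# much better than many small depthwise operations.
```

Both the Fused-MBConv block and a progressive resizing scheduler are sketched in the listing below.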
```python
import torch
import torch.nn as nn

class FusedMBConv(nn.Module):
    """
    Fused Mobile Inverted Bottleneck Convolution.

    Replaces the Expand + Depthwise pattern with a single regular
    convolution for better hardware utilization.

    Structure:
        Input -> Fused Expand+Spatial (3x3) -> SE -> Project (1x1) -> Output
                 (+ residual connection if stride == 1 and in_ch == out_ch)
    """
    def __init__(
        self,
        in_channels: int,
        out_channels: int,
        kernel_size: int = 3,
        stride: int = 1,
        expand_ratio: int = 4,  # Smaller than standard MBConv
        se_ratio: float = 0.25
    ):
        super().__init__()
        self.stride = stride
        self.use_residual = (stride == 1) and (in_channels == out_channels)

        expanded_channels = in_channels * expand_ratio
        layers = []

        # 1. Fused expansion + spatial convolution
        layers.extend([
            nn.Conv2d(
                in_channels, expanded_channels, kernel_size,
                stride=stride, padding=kernel_size // 2, bias=False
            ),
            nn.BatchNorm2d(expanded_channels),
            nn.SiLU(inplace=True)
        ])

        # 2. Squeeze-and-Excitation
        reduced_dim = max(1, int(in_channels * se_ratio))
        layers.append(SqueezeExcitation(expanded_channels, reduced_dim))

        # 3. Projection (linear, no activation)
        layers.extend([
            nn.Conv2d(expanded_channels, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels)
        ])

        self.block = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.block(x)
        if self.use_residual:
            out = out + x
        return out

class ProgressiveResizing:
    """
    Progressive resizing scheduler for training.

    Starts with small images and progressively increases
    resolution throughout training.
    """
    def __init__(
        self,
        initial_size: int = 128,
        final_size: int = 380,
        total_epochs: int = 350,
        ramp_epochs: int = 270  # When to reach final size
    ):
        self.initial_size = initial_size
        self.final_size = final_size
        self.total_epochs = total_epochs
        self.ramp_epochs = ramp_epochs

    def get_size(self, epoch: int) -> int:
        """Get image size for current epoch."""
        if epoch >= self.ramp_epochs:
            return self.final_size
        # Linear interpolation
        progress = epoch / self.ramp_epochs
        size = self.initial_size + (self.final_size - self.initial_size) * progress
        # Round to multiple of 32 for efficient processing
        return int(size // 32 * 32)

    def get_regularization_strength(self, epoch: int) -> float:
        """
        Scale regularization with image size.

        Smaller images need less regularization (less chance of overfitting);
        larger images need more.
        """
        size = self.get_size(epoch)
        # Linearly scale dropout, mixup, etc.
        scale = (size - self.initial_size) / (self.final_size - self.initial_size)
        return scale
```

Use EfficientNetV2 when training from scratch or fine-tuning with full training runs. The training speedup (3-4×) significantly reduces cloud costs and iteration time. Use EfficientNetV1 when using pre-trained weights with minimal fine-tuning, where training efficiency matters less than model availability and compatibility.
Deploying EfficientNet in production requires understanding several practical aspects beyond raw accuracy numbers:
Memory and Batch Size:
Larger EfficientNet variants (B4+) have significant memory requirements, especially during training. The compound scaling of resolution means B7 processes ~7× more pixels than B0. Combined with increased depth and width, GPU memory can become a bottleneck.
| Model | Input Size | Approx. GPU Memory (training, typical batch size) |
|---|---|---|
| EfficientNet-B0 | 224×224 | ~4 GB |
| EfficientNet-B1 | 240×240 | ~5 GB |
| EfficientNet-B2 | 260×260 | ~6 GB |
| EfficientNet-B3 | 300×300 | ~8 GB |
| EfficientNet-B4 | 380×380 | ~12 GB |
| EfficientNet-B5 | 456×456 | ~20 GB |
| EfficientNet-B6 | 528×528 | ~32 GB |
| EfficientNet-B7 | 600×600 | ~48 GB+ |
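When a larger variant does not fit in GPU memory, two standard PyTorch techniques, mixed-precision training and gradient checkpointing, can substantially reduce activation memory. The sketch below is a generic training-step pattern under those assumptions, not an EfficientNet-specific API.

```python
import torch
from torch.cuda.amp import autocast, GradScaler

def train_step(model, images, labels, optimizer, scaler: GradScaler, criterion):
    """One mixed-precision training step; activations are kept in FP16 where safe."""
    optimizer.zero_grad(set_to_none=True)
    with autocast():                      # FP16/BF16 forward pass
        logits = model(images)
        loss = criterion(logits, labels)
    scaler.scale(loss).backward()         # scaled gradients to avoid FP16 underflow
    scaler.step(optimizer)
    scaler.update()
    return loss.item()

# Gradient checkpointing trades compute for memory by recomputing
# activations during the backward pass, e.g. applied per stage:
# from torch.utils.checkpoint import checkpoint
# x = checkpoint(stage, x)   # instead of x = stage(x)
```

Beyond memory, correct preprocessing matters just as much; the listing below covers variant-specific transforms and an inference wrapper.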
```python
import torch
import torchvision.transforms as T
from PIL import Image

# EfficientNet-specific preprocessing
# Note: Resolution must match model variant!
EFFICIENTNET_CONFIGS = {
    'b0': {'size': 224, 'crop': 224},
    'b1': {'size': 240, 'crop': 240},
    'b2': {'size': 260, 'crop': 260},
    'b3': {'size': 300, 'crop': 300},
    'b4': {'size': 380, 'crop': 380},
    'b5': {'size': 456, 'crop': 456},
    'b6': {'size': 528, 'crop': 528},
    'b7': {'size': 600, 'crop': 600},
}

def get_efficientnet_transforms(variant: str = 'b0', training: bool = False):
    """Get appropriate transforms for EfficientNet variant."""
    config = EFFICIENTNET_CONFIGS[variant]

    # ImageNet normalization (used during training)
    normalize = T.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    )

    if training:
        # Training: random resize crop + augmentation
        return T.Compose([
            T.RandomResizedCrop(config['crop']),
            T.RandomHorizontalFlip(),
            T.AutoAugment(T.AutoAugmentPolicy.IMAGENET),
            T.ToTensor(),
            normalize,
        ])
    else:
        # Inference: deterministic resize + center crop
        # Resize to slightly larger, then crop to match training
        resize_size = int(config['size'] / 0.875)  # ~= size / (224/256)
        return T.Compose([
            T.Resize(resize_size, interpolation=T.InterpolationMode.BICUBIC),
            T.CenterCrop(config['crop']),
            T.ToTensor(),
            normalize,
        ])

class EfficientNetPredictor:
    """Optimized EfficientNet inference wrapper."""

    def __init__(self, model, variant: str = 'b0', device: str = 'cuda'):
        self.device = torch.device(device)
        self.model = model.to(self.device).eval()
        self.transforms = get_efficientnet_transforms(variant, training=False)

        # Enable inference optimizations
        if device == 'cuda':
            self.model = self.model.half()  # FP16 inference
            torch.backends.cudnn.benchmark = True

    @torch.inference_mode()
    def predict(self, image: Image.Image) -> torch.Tensor:
        """Run inference on a single image."""
        x = self.transforms(image).unsqueeze(0)  # Add batch dimension
        x = x.to(self.device)
        if self.device.type == 'cuda':
            x = x.half()

        logits = self.model(x)
        probabilities = torch.softmax(logits, dim=1)
        return probabilities

    @torch.inference_mode()
    def predict_batch(self, images: list) -> torch.Tensor:
        """Run inference on a batch of images."""
        tensors = [self.transforms(img) for img in images]
        batch = torch.stack(tensors).to(self.device)
        if self.device.type == 'cuda':
            batch = batch.half()

        logits = self.model(batch)
        probabilities = torch.softmax(logits, dim=1)
        return probabilities

# TensorRT optimization example (pseudocode)
def optimize_for_tensorrt(model, variant: str = 'b0'):
    """
    Convert EfficientNet to TensorRT for maximum inference speed.
    This can provide 2-4x speedup on NVIDIA GPUs.
    """
    import torch_tensorrt

    config = EFFICIENTNET_CONFIGS[variant]

    # Define input specification
    input_spec = torch_tensorrt.Input(
        shape=(1, 3, config['crop'], config['crop']),
        dtype=torch.half
    )

    # Compile with TensorRT
    optimized = torch_tensorrt.compile(
        model,
        inputs=[input_spec],
        enabled_precisions={torch.half},
        workspace_size=1 << 30,  # 1 GB workspace
    )
    return optimized
```

A common deployment error is using incorrect input resolution. EfficientNet-B3 trained on 300×300 will have degraded accuracy when given 224×224 inputs (or vice versa). Always match your inference preprocessing to the exact resolution used during training.
EfficientNet represents a paradigm shift in CNN design, moving from ad-hoc scaling to principled, unified scaling. The key insights: a single compound coefficient φ scales depth, width, and resolution together under a roughly fixed FLOP budget; the scaled family rests on a strong NAS-discovered baseline (B0) built from MBConv blocks with squeeze-and-excitation and Swish; the resulting B0-B7 family spans a wide range of accuracy-efficiency tradeoffs; and EfficientNetV2 extends the recipe with training-aware choices such as Fused-MBConv and progressive resizing.
What's Next:
EfficientNet optimized for accuracy and efficiency within the CNN paradigm. In the next page, we'll explore MobileNet and ShuffleNet—architectures designed specifically for edge deployment with extreme computational constraints. These architectures take efficiency even further, enabling neural networks to run on mobile phones and embedded devices.
You now understand EfficientNet's compound scaling method, the NAS-derived B0 baseline, the complete model family from B0-B7, and the training-aware improvements in EfficientNetV2. This knowledge provides the foundation for understanding modern CNN efficiency and the transition from hand-designed to search-based architectures.