For years, the deep learning community scaled convolutional neural networks in an ad-hoc fashion. Want better accuracy? Make the network deeper. Still not good enough? Add more channels. Need even more performance? Use higher resolution inputs. Each dimension was tuned independently, often requiring expensive grid searches and leading to suboptimal efficiency.
EfficientNet (Tan & Le, 2019) fundamentally changed this paradigm by introducing compound scaling—a principled method for uniformly scaling depth, width, and resolution simultaneously using a fixed set of scaling coefficients. The result was a family of models that achieved state-of-the-art accuracy while being significantly more parameter-efficient and computationally cheaper than previous architectures.
By the end of this page, you will understand the theoretical foundations of compound scaling, analyze the EfficientNet-B0 baseline architecture discovered through neural architecture search, explore the complete EfficientNet family (B0-B7 and beyond), and appreciate why this work represents a paradigm shift in how we design and scale CNNs.
Before EfficientNet, researchers scaled neural networks along three primary dimensions, typically one at a time:
Depth Scaling: Adding more layers to the network. ResNet demonstrated that depth could be increased dramatically (from 18 to 152+ layers) with skip connections. Deeper networks can capture more complex hierarchical features, but they eventually suffer from diminishing returns and training difficulties.
Width Scaling: Increasing the number of channels (filters) in each layer. Wider networks can capture more fine-grained features at each level. WideResNet showed that making networks wider could sometimes outperform making them deeper, but excessive width creates memory bottlenecks.
Resolution Scaling: Using higher resolution input images. Higher resolution provides more spatial detail, potentially improving accuracy, especially for fine-grained classification. However, computational cost scales quadratically with resolution (since convolutions operate over height × width).
| Dimension | Method | Benefits | Limitations |
|---|---|---|---|
| Depth | Add more layers | Richer hierarchical features, proven effective (ResNet) | Vanishing gradients, diminishing returns, increased latency |
| Width | More channels per layer | Fine-grained features, easier to train than very deep nets | Memory intensive, captures less hierarchy |
| Resolution | Higher input size | More spatial detail, better for fine-grained tasks | Quadratic compute cost, memory explosion |
The fundamental problem: Each scaling dimension has diminishing returns when applied in isolation. Making a network twice as deep doesn't double its accuracy—it might improve accuracy by a few percentage points but significantly increases computation. Similarly, doubling width or resolution yields diminishing improvements.
More critically, scaling only one dimension creates imbalanced networks. A very deep but narrow network cannot capture fine-grained features well. A very wide but shallow network misses hierarchical abstractions. A high-resolution input on a shallow network wastes the additional pixels because the receptive field cannot integrate enough context.
The key insight: Different scaling dimensions are not independent—they interact with each other. A deeper network may need higher resolution to leverage its larger receptive field. A wider network benefits from more layers to transform its additional channels into meaningful features.
Before EfficientNet, finding the optimal combination of depth, width, and resolution required expensive grid searches over an enormous hyperparameter space. With each dimension having dozens of possible values, exhaustively searching all combinations was computationally infeasible. Researchers relied on intuition and incremental experimentation, often leading to inefficient architectures.
EfficientNet's core contribution is the compound scaling method, which scales all three dimensions uniformly using a single compound coefficient φ (phi). Rather than tuning depth, width, and resolution independently, the method uses fixed ratios between them, determined by a small grid search on the baseline network.
The Compound Scaling Formula:
Given a compound coefficient φ, the scaling is defined as:

depth: d = α^φ
width: w = β^φ
resolution: r = γ^φ

where α, β, and γ are constants determined by a small grid search, subject to α · β² · γ² ≈ 2 with α ≥ 1, β ≥ 1, γ ≥ 1.
The constraint α · β² · γ² ≈ 2 ensures that total FLOPs scale approximately as 2^φ. The quadratic terms for β and γ reflect how convolution cost grows: FLOPs scale linearly with depth (more layers), but quadratically with width (both input and output channel counts grow) and quadratically with resolution (spatial positions grow as height × width).
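As a quick sanity check on that reasoning, the sketch below (an illustrative calculation, not from the paper) counts multiply-accumulates for a single convolutional layer and shows that doubling width or resolution roughly quadruples its cost, while doubling depth only doubles it.

```python
# Illustrative FLOP count for one conv layer: H * W * C_in * C_out * k * k
def conv_flops(h, w, c_in, c_out, k=3):
    return h * w * c_in * c_out * k * k

base = conv_flops(56, 56, 64, 64)

# Width scaling multiplies both input and output channels -> ~quadratic cost
wide = conv_flops(56, 56, 128, 128)

# Resolution scaling multiplies both spatial dimensions -> ~quadratic cost
hires = conv_flops(112, 112, 64, 64)

# Depth scaling just repeats layers -> linear cost
deep = 2 * base

print(f"2x width:      {wide / base:.1f}x FLOPs")   # ~4.0x
print(f"2x resolution: {hires / base:.1f}x FLOPs")  # ~4.0x
print(f"2x depth:      {deep / base:.1f}x FLOPs")   # 2.0x
```

The next listing applies these relationships to the actual EfficientNet coefficients.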
```python
# Compound Scaling Coefficients for EfficientNet
# Determined via grid search on EfficientNet-B0

# Base scaling coefficients
ALPHA = 1.2   # Depth scaling base
BETA = 1.1    # Width scaling base
GAMMA = 1.15  # Resolution scaling base

# Verify constraint: α · β² · γ² ≈ 2
constraint = ALPHA * (BETA ** 2) * (GAMMA ** 2)
print(f"Constraint value: {constraint:.3f}")  # Should be ≈ 2.0

def compute_scaling(phi: float) -> dict:
    """
    Compute scaling factors for a given compound coefficient phi.

    Args:
        phi: Compound scaling coefficient (e.g., 0, 1, 2, ...)

    Returns:
        Dictionary with depth, width, and resolution multipliers
    """
    depth_multiplier = ALPHA ** phi
    width_multiplier = BETA ** phi
    resolution_multiplier = GAMMA ** phi

    # Approximate FLOPs scaling
    flops_scaling = (ALPHA * (BETA ** 2) * (GAMMA ** 2)) ** phi

    return {
        "depth": depth_multiplier,
        "width": width_multiplier,
        "resolution": resolution_multiplier,
        "approx_flops": flops_scaling
    }

# EfficientNet family scaling
for phi in range(8):  # B0 through B7
    scaling = compute_scaling(phi)
    print(f"EfficientNet-B{phi}: depth={scaling['depth']:.2f}x, "
          f"width={scaling['width']:.2f}x, res={scaling['resolution']:.2f}x, "
          f"~{scaling['approx_flops']:.1f}x FLOPs")
```

Why This Works:
The compound scaling method succeeds because it respects the interdependence of scaling dimensions:
Depth and Resolution Synergy: Deeper networks have larger receptive fields, which can only be utilized with sufficient input resolution. Scaling both together ensures the network can "see" enough context to leverage its depth.
Width and Depth Balance: More channels require more layers to transform them into useful features. Scaling width and depth together prevents bottlenecks where information cannot flow through an imbalanced architecture.
Resolution and Width Coupling: Higher resolution inputs produce more spatial locations, each requiring processing by the convolutional filters. More channels per layer help capture the additional information present in higher-resolution inputs.
Empirical results confirm this intuition: compound scaling consistently outperforms single-dimension scaling at equivalent FLOP budgets.
The α, β, γ values are determined by a small grid search on the baseline network (EfficientNet-B0) with φ=1. This one-time search fixes the ratios, and subsequent models (B1-B7) are created simply by varying φ. This dramatically reduces the hyperparameter search space from three independent dimensions to a single scalar.
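A minimal sketch of that one-time search is shown below. In the actual method each candidate (α, β, γ) is evaluated by training the scaled network at φ=1; here a hypothetical placeholder `evaluate_accuracy` stands in for that step, and candidates are simply filtered by the α · β² · γ² ≈ 2 constraint.

```python
import itertools

def candidate_grid(step=0.05, tol=0.1):
    """Enumerate (alpha, beta, gamma) >= 1 whose FLOP factor is ~2 at phi=1."""
    values = [round(1.0 + i * step, 2) for i in range(0, 11)]  # 1.00 .. 1.50
    for alpha, beta, gamma in itertools.product(values, repeat=3):
        flop_factor = alpha * beta**2 * gamma**2
        if abs(flop_factor - 2.0) <= tol:
            yield alpha, beta, gamma

def evaluate_accuracy(alpha, beta, gamma):
    # Placeholder: in the real search, the B0 baseline is scaled by
    # (alpha, beta, gamma) with phi=1, trained, and validated on ImageNet.
    raise NotImplementedError

candidates = list(candidate_grid())
print(f"{len(candidates)} candidate triples satisfy the constraint")
print((1.2, 1.1, 1.15) in candidates)  # the published values are among them
# best = max(candidates, key=lambda c: evaluate_accuracy(*c))
```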
While compound scaling is elegant, its effectiveness depends critically on the quality of the baseline architecture being scaled. Rather than starting from a hand-designed architecture like ResNet or VGG, EfficientNet uses Neural Architecture Search (NAS) to discover an optimal baseline.
The baseline, EfficientNet-B0, was discovered using a multi-objective NAS approach (specifically, a variant of MnasNet) that optimized for both accuracy and FLOP efficiency. The search space included decisions about which convolutional operators to use, kernel sizes, the number of layers per stage, and per-stage channel widths. The resulting architecture is summarized below:
| Stage | Operator | Resolution | Channels | Layers |
|---|---|---|---|---|
| 1 | Conv3×3 | 224×224 | 32 | 1 |
| 2 | MBConv1, k3×3 | 112×112 | 16 | 1 |
| 3 | MBConv6, k3×3 | 112×112 | 24 | 2 |
| 4 | MBConv6, k5×5 | 56×56 | 40 | 2 |
| 5 | MBConv6, k3×3 | 28×28 | 80 | 3 |
| 6 | MBConv6, k5×5 | 14×14 | 112 | 3 |
| 7 | MBConv6, k5×5 | 14×14 | 192 | 4 |
| 8 | MBConv6, k3×3 | 7×7 | 320 | 1 |
| 9 | Conv1×1 + Pool + FC | 7×7→1×1 | 1280 | 1 |
Understanding MBConv (Mobile Inverted Bottleneck Convolution):
The core building block of EfficientNet is the MBConv layer, which originated from MobileNetV2. MBConv implements an inverted residual structure: (1) a 1×1 convolution expands the channel count by the expansion ratio (e.g., 6× for MBConv6); (2) a k×k depthwise convolution processes each expanded channel spatially; (3) a squeeze-and-excitation block reweights channels; (4) a 1×1 linear projection reduces back to the output width. A residual connection is added when the stride is 1 and input and output channels match.
This design is highly efficient because the spatial convolution is depthwise, so each channel is processed independently and its cost stays low even on the expanded representation, while the comparatively cheap 1×1 convolutions handle channel mixing.
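To make that concrete, the rough count below (an illustrative estimate with assumed shapes, not figures from the paper) compares the depthwise 3×3 plus 1×1 projection against a single regular 3×3 convolution performing the same channel transform.

```python
# Rough multiply-accumulate counts, illustrative shapes only
h, w = 56, 56
c = 24                  # input/output channels
expanded = 6 * c        # MBConv6 expansion

# MBConv spatial path: depthwise 3x3 on expanded channels + 1x1 projection
depthwise = h * w * expanded * 3 * 3   # each channel convolved independently
pointwise = h * w * expanded * c       # 1x1 projection back to c channels
mbconv_spatial = depthwise + pointwise

# A regular 3x3 conv doing the same expanded -> c transform in one shot
regular = h * w * expanded * c * 3 * 3

print(f"MBConv spatial path: {mbconv_spatial / 1e6:.1f} MFLOPs")
print(f"Regular 3x3 conv:    {regular / 1e6:.1f} MFLOPs")
print(f"Savings:             {regular / mbconv_spatial:.1f}x fewer FLOPs")
```

The full MBConv block, including squeeze-and-excitation, is implemented below.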
```python
import torch
import torch.nn as nn

class SqueezeExcitation(nn.Module):
    """
    Squeeze-and-Excitation block for channel attention.

    Computes channel-wise importance weights by:
    1. Global average pooling to get channel statistics
    2. Two FC layers to compute attention weights
    3. Sigmoid activation to get scaling factors
    """
    def __init__(self, in_channels: int, reduced_dim: int):
        super().__init__()
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                  # Global pooling: (B, C, H, W) -> (B, C, 1, 1)
            nn.Conv2d(in_channels, reduced_dim, 1),   # Reduction
            nn.SiLU(inplace=True),                    # Swish activation
            nn.Conv2d(reduced_dim, in_channels, 1),   # Expansion
            nn.Sigmoid()                              # Attention weights in [0, 1]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.se(x)  # Channel-wise scaling

class MBConv(nn.Module):
    """
    Mobile Inverted Bottleneck Convolution (MBConv) block.

    Structure:
        Input -> Expand (1x1) -> Depthwise (kxk) -> SE -> Project (1x1) -> Output
                 (+ residual connection if stride == 1 and in_ch == out_ch)

    Args:
        in_channels: Input channel count
        out_channels: Output channel count
        kernel_size: Spatial kernel size for depthwise conv
        stride: Stride for depthwise conv (for downsampling)
        expand_ratio: Channel expansion factor (1 for MBConv1, 6 for MBConv6)
        se_ratio: Squeeze-Excitation reduction ratio (typically 0.25)
    """
    def __init__(
        self,
        in_channels: int,
        out_channels: int,
        kernel_size: int = 3,
        stride: int = 1,
        expand_ratio: int = 6,
        se_ratio: float = 0.25
    ):
        super().__init__()
        self.stride = stride
        self.use_residual = (stride == 1) and (in_channels == out_channels)

        # Expanded channel dimension
        expanded_channels = in_channels * expand_ratio

        layers = []

        # 1. Expansion phase (skip if expand_ratio == 1)
        if expand_ratio != 1:
            layers.extend([
                nn.Conv2d(in_channels, expanded_channels, 1, bias=False),
                nn.BatchNorm2d(expanded_channels),
                nn.SiLU(inplace=True)  # Swish activation
            ])

        # 2. Depthwise convolution
        layers.extend([
            nn.Conv2d(
                expanded_channels, expanded_channels, kernel_size,
                stride=stride, padding=kernel_size // 2,
                groups=expanded_channels, bias=False  # groups=channels for depthwise
            ),
            nn.BatchNorm2d(expanded_channels),
            nn.SiLU(inplace=True)
        ])

        # 3. Squeeze-and-Excitation
        reduced_dim = max(1, int(in_channels * se_ratio))
        layers.append(SqueezeExcitation(expanded_channels, reduced_dim))

        # 4. Projection (linear, no activation)
        layers.extend([
            nn.Conv2d(expanded_channels, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels)
        ])

        self.block = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.block(x)
        if self.use_residual:
            out = out + x  # Residual connection
        return out
```

EfficientNet uses the Swish activation (also called SiLU: Sigmoid Linear Unit), defined as f(x) = x · σ(x). Swish is smooth, non-monotonic, and has been shown to outperform ReLU on deep networks. It was discovered through automated search and has become standard in modern architectures.
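As a small illustration, the snippet below defines Swish directly from its formula and checks that it matches PyTorch's built-in `nn.SiLU`.

```python
import torch
import torch.nn as nn

def swish(x: torch.Tensor) -> torch.Tensor:
    """Swish / SiLU: f(x) = x * sigmoid(x)."""
    return x * torch.sigmoid(x)

x = torch.linspace(-4, 4, steps=9)
print(swish(x))
print(torch.allclose(swish(x), nn.SiLU()(x)))  # True: SiLU is the same function
```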
With the baseline architecture (B0) and compound scaling coefficients (α, β, γ) fixed, creating the EfficientNet family is remarkably simple: just vary φ from 0 to 7 (and beyond for later variants).
Each model in the family offers a different accuracy-efficiency tradeoff, allowing practitioners to select the appropriate model for their computational budget:
| Model | Input Resolution | Parameters | FLOPs | Top-1 Accuracy (ImageNet) |
|---|---|---|---|---|
| EfficientNet-B0 | 224×224 | 5.3M | 0.39B | 77.3% |
| EfficientNet-B1 | 240×240 | 7.8M | 0.70B | 79.2% |
| EfficientNet-B2 | 260×260 | 9.2M | 1.0B | 80.3% |
| EfficientNet-B3 | 300×300 | 12M | 1.8B | 81.7% |
| EfficientNet-B4 | 380×380 | 19M | 4.2B | 83.0% |
| EfficientNet-B5 | 456×456 | 30M | 9.9B | 83.7% |
| EfficientNet-B6 | 528×528 | 43M | 19B | 84.2% |
| EfficientNet-B7 | 600×600 | 66M | 37B | 84.4% |
Analyzing the Scaling Results:
Efficiency Gains: EfficientNet-B0 matches or exceeds ResNet-50's accuracy with roughly 5× fewer parameters (5.3M vs. about 26M). EfficientNet-B7 matches the accuracy of the best published models (at the time) while using 8.4× fewer parameters.
Accuracy Scaling: Each step from B0 to B7 provides meaningful accuracy improvements, demonstrating that compound scaling continues to be effective across a wide range of model sizes.
Diminishing Returns at Scale: The jump from B0 to B1 provides ~2% accuracy improvement for ~2× FLOPs. The jump from B6 to B7 provides only ~0.2% improvement for ~2× FLOPs. This reflects the general principle that accuracy improvements become harder at higher accuracy levels.
Resolution Scaling: Input resolution grows substantially across the family. B7 uses 600×600 inputs, more than 7× as many pixels as B0's 224×224. This has significant implications for data augmentation and inference preprocessing.
```python
import torch
import torch.nn as nn
from typing import List, Tuple
from dataclasses import dataclass

@dataclass
class EfficientNetConfig:
    """Configuration for EfficientNet model family."""
    # (expand_ratio, channels, layers, kernel, stride)
    stage_configs: List[Tuple[int, int, int, int, int]] = None
    # Compound scaling parameters
    width_mult: float = 1.0
    depth_mult: float = 1.0
    resolution: int = 224
    dropout_rate: float = 0.2

    def __post_init__(self):
        if self.stage_configs is None:
            # EfficientNet-B0 baseline configuration
            # (expand_ratio, out_channels, num_layers, kernel_size, stride)
            self.stage_configs = [
                (1, 16, 1, 3, 1),    # Stage 2: MBConv1
                (6, 24, 2, 3, 2),    # Stage 3: MBConv6
                (6, 40, 2, 5, 2),    # Stage 4: MBConv6
                (6, 80, 3, 3, 2),    # Stage 5: MBConv6
                (6, 112, 3, 5, 1),   # Stage 6: MBConv6
                (6, 192, 4, 5, 2),   # Stage 7: MBConv6
                (6, 320, 1, 3, 1),   # Stage 8: MBConv6
            ]

# Compound scaling coefficients
EFFICIENTNET_PARAMS = {
    # (phi, resolution, dropout)
    'b0': (0, 224, 0.2),
    'b1': (0.5, 240, 0.2),
    'b2': (1, 260, 0.3),
    'b3': (2, 300, 0.3),
    'b4': (3, 380, 0.4),
    'b5': (4, 456, 0.4),
    'b6': (5, 528, 0.5),
    'b7': (6, 600, 0.5),
}

def get_efficientnet_config(model_name: str) -> EfficientNetConfig:
    """
    Get configuration for a specific EfficientNet variant.
    Applies compound scaling coefficients to the B0 baseline.
    """
    if model_name not in EFFICIENTNET_PARAMS:
        raise ValueError(f"Unknown model: {model_name}")

    phi, resolution, dropout = EFFICIENTNET_PARAMS[model_name]

    # Compute scaling multipliers
    # α = 1.2, β = 1.1, γ = 1.15
    alpha, beta, gamma = 1.2, 1.1, 1.15
    depth_mult = alpha ** phi
    width_mult = beta ** phi

    return EfficientNetConfig(
        width_mult=width_mult,
        depth_mult=depth_mult,
        resolution=resolution,
        dropout_rate=dropout
    )

class EfficientNet(nn.Module):
    """
    EfficientNet model implementation.

    Builds the network from configuration, applying compound scaling
    to depth (layer counts) and width (channel counts).
    """
    def __init__(self, config: EfficientNetConfig, num_classes: int = 1000):
        super().__init__()
        self.config = config

        # Scale helper functions
        def scale_depth(d: int) -> int:
            return max(1, int(d * config.depth_mult + 0.5))

        def scale_width(w: int) -> int:
            # Round to nearest multiple of 8 for efficiency
            w = int(w * config.width_mult + 0.5)
            return max(8, (w + 4) // 8 * 8)

        # Stem: Initial conv layer
        stem_channels = scale_width(32)
        self.stem = nn.Sequential(
            nn.Conv2d(3, stem_channels, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(stem_channels),
            nn.SiLU(inplace=True)
        )

        # Build MBConv stages
        stages = []
        in_channels = stem_channels
        for expand, out_ch, layers, kernel, stride in config.stage_configs:
            out_channels = scale_width(out_ch)
            num_layers = scale_depth(layers)
            stage = self._build_stage(
                in_channels, out_channels, num_layers, expand, kernel, stride
            )
            stages.append(stage)
            in_channels = out_channels
        self.stages = nn.Sequential(*stages)

        # Head: Final conv, pooling, and classifier
        head_channels = scale_width(1280)
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, head_channels, 1, bias=False),
            nn.BatchNorm2d(head_channels),
            nn.SiLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Dropout(config.dropout_rate),
            nn.Linear(head_channels, num_classes)
        )

    def _build_stage(
        self, in_ch: int, out_ch: int, num_layers: int,
        expand: int, kernel: int, stride: int
    ) -> nn.Sequential:
        """Build a stage of MBConv blocks."""
        layers = []
        for i in range(num_layers):
            layers.append(MBConv(
                in_ch if i == 0 else out_ch,
                out_ch,
                kernel_size=kernel,
                stride=stride if i == 0 else 1,  # Only first layer can downsample
                expand_ratio=expand
            ))
        return nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.stem(x)
        x = self.stages(x)
        x = self.head(x)
        return x

# Factory functions
def efficientnet_b0(num_classes: int = 1000) -> EfficientNet:
    return EfficientNet(get_efficientnet_config('b0'), num_classes)

def efficientnet_b3(num_classes: int = 1000) -> EfficientNet:
    return EfficientNet(get_efficientnet_config('b3'), num_classes)

def efficientnet_b7(num_classes: int = 1000) -> EfficientNet:
    return EfficientNet(get_efficientnet_config('b7'), num_classes)
```

For most applications, EfficientNet-B0 through B4 offer the best tradeoffs. B0 is excellent for edge deployment and fast inference. B3-B4 provide a sweet spot for server-side inference where accuracy matters but latency is still important. B5-B7 are primarily useful when pushing for maximum accuracy regardless of cost.
While EfficientNet optimized for inference efficiency (parameters and FLOPs), EfficientNetV2 (Tan & Le, 2021) additionally optimizes for training efficiency—how fast the model can be trained on modern accelerators.
Key Insights from EfficientNetV2:
Training Bottleneck Analysis: Large image sizes (512×512+) create memory bottlenecks and slow training. Depthwise convolutions have lower hardware utilization than regular convolutions on modern GPUs/TPUs.
Progressive Resizing: Start training with smaller images and progressively increase resolution. This dramatically speeds up early training epochs when the model is still learning basic features.
Fused MBConv: Replace depthwise-separable convolutions with regular convolutions in early stages where they're actually faster on modern hardware (due to better parallelization).
Smaller Expansion Ratios: Use smaller expansion ratios (e.g., 4× instead of 6×) to reduce memory access overhead without sacrificing accuracy.
| Aspect | EfficientNet (V1) | EfficientNetV2 |
|---|---|---|
| Primary Optimization | Inference FLOPs/params | Training speed + inference |
| Image Size Strategy | Fixed resolution training | Progressive resizing |
| Early Stage Blocks | Always MBConv (depthwise) | Fused-MBConv (regular conv) |
| Expansion Ratios | Mostly 6× | Smaller (1-4×) in early layers |
| Training Speed | Baseline | ~3-4× faster for same accuracy |
Fused-MBConv Block:
In standard MBConv, a 3×3 depthwise convolution handles spatial processing. On paper this uses far fewer FLOPs than a regular convolution, but on modern accelerators (GPUs/TPUs) depthwise convolutions achieve low hardware utilization, so the overhead of many small operations can outweigh the theoretical savings.
Fused-MBConv replaces the 1×1 expansion + depthwise convolution pair with a single regular k×k convolution that expands and processes spatially in one step, followed by the usual squeeze-and-excitation block and 1×1 projection.
This is used in early stages where feature maps are large (more spatial positions to parallelize over), and the added FLOPs are offset by better hardware utilization.
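The rough count below (an illustrative estimate with assumed early-stage shapes, not figures from the paper) shows the FLOP increase that Fused-MBConv accepts in exchange for better hardware utilization.

```python
# Illustrative FLOP comparison for one early-stage block (assumed shapes)
h, w, c, expand = 112, 112, 24, 4
e = c * expand

# Standard MBConv: 1x1 expand + 3x3 depthwise + 1x1 project
mbconv = (h * w * c * e) + (h * w * e * 3 * 3) + (h * w * e * c)

# Fused-MBConv: single regular 3x3 expand + 1x1 project
fused = (h * w * c * e * 3 * 3) + (h * w * e * c)

print(f"MBConv:       {mbconv / 1e6:.0f} MFLOPs")
print(f"Fused-MBConv: {fused / 1e6:.0f} MFLOPs ({fused / mbconv:.1f}x more)")
# Despite roughly 4x more FLOPs here, Fused-MBConv is often faster in
# wall-clock time on GPUs/TPUs because one dense convolution parallelizes
# much better than many small depthwise operations.
```

Both the Fused-MBConv block and a progressive resizing scheduler are sketched in the listing below.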
```python
import torch
import torch.nn as nn

class FusedMBConv(nn.Module):
    """
    Fused Mobile Inverted Bottleneck Convolution.

    Replaces the Expand + Depthwise pattern with a single regular
    convolution for better hardware utilization.

    Structure:
        Input -> Fused Expand+Spatial (3x3) -> SE -> Project (1x1) -> Output
                 (+ residual connection if stride == 1 and in_ch == out_ch)
    """
    def __init__(
        self,
        in_channels: int,
        out_channels: int,
        kernel_size: int = 3,
        stride: int = 1,
        expand_ratio: int = 4,  # Smaller than standard MBConv
        se_ratio: float = 0.25
    ):
        super().__init__()
        self.stride = stride
        self.use_residual = (stride == 1) and (in_channels == out_channels)

        expanded_channels = in_channels * expand_ratio
        layers = []

        # 1. Fused expansion + spatial convolution
        layers.extend([
            nn.Conv2d(
                in_channels, expanded_channels, kernel_size,
                stride=stride, padding=kernel_size // 2, bias=False
            ),
            nn.BatchNorm2d(expanded_channels),
            nn.SiLU(inplace=True)
        ])

        # 2. Squeeze-and-Excitation
        reduced_dim = max(1, int(in_channels * se_ratio))
        layers.append(SqueezeExcitation(expanded_channels, reduced_dim))

        # 3. Projection (linear, no activation)
        layers.extend([
            nn.Conv2d(expanded_channels, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels)
        ])

        self.block = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.block(x)
        if self.use_residual:
            out = out + x
        return out

class ProgressiveResizing:
    """
    Progressive resizing scheduler for training.

    Starts with small images and progressively increases
    resolution throughout training.
    """
    def __init__(
        self,
        initial_size: int = 128,
        final_size: int = 380,
        total_epochs: int = 350,
        ramp_epochs: int = 270  # When to reach final size
    ):
        self.initial_size = initial_size
        self.final_size = final_size
        self.total_epochs = total_epochs
        self.ramp_epochs = ramp_epochs

    def get_size(self, epoch: int) -> int:
        """Get image size for current epoch."""
        if epoch >= self.ramp_epochs:
            return self.final_size
        # Linear interpolation
        progress = epoch / self.ramp_epochs
        size = self.initial_size + (self.final_size - self.initial_size) * progress
        # Round to multiple of 32 for efficient processing
        return int(size // 32 * 32)

    def get_regularization_strength(self, epoch: int) -> float:
        """
        Scale regularization with image size.

        Smaller images need less regularization (less chance of overfitting);
        larger images need more.
        """
        size = self.get_size(epoch)
        # Linearly scale dropout, mixup, etc.
        scale = (size - self.initial_size) / (self.final_size - self.initial_size)
        return scale
```

Use EfficientNetV2 when training from scratch or fine-tuning with full training runs. The training speedup (3-4×) significantly reduces cloud costs and iteration time. Use EfficientNetV1 when using pre-trained weights with minimal fine-tuning, where training efficiency matters less than model availability and compatibility.
Deploying EfficientNet in production requires understanding several practical aspects beyond raw accuracy numbers:
Memory and Batch Size:
Larger EfficientNet variants (B4+) have significant memory requirements, especially during training. The compound scaling of resolution means B7 processes ~7× more pixels than B0. Combined with increased depth and width, GPU memory can become a bottleneck.
| Model | Input Size | Approx. GPU Memory (training, typical batch size) |
|---|---|---|
| EfficientNet-B0 | 224×224 | ~4 GB |
| EfficientNet-B1 | 240×240 | ~5 GB |
| EfficientNet-B2 | 260×260 | ~6 GB |
| EfficientNet-B3 | 300×300 | ~8 GB |
| EfficientNet-B4 | 380×380 | ~12 GB |
| EfficientNet-B5 | 456×456 | ~20 GB |
| EfficientNet-B6 | 528×528 | ~32 GB |
| EfficientNet-B7 | 600×600 | ~48 GB+ |
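When a larger variant does not fit in GPU memory, two standard PyTorch techniques, mixed-precision training and gradient checkpointing, can substantially reduce activation memory. The sketch below is a generic training-step pattern under those assumptions, not an EfficientNet-specific API.

```python
import torch
from torch.cuda.amp import autocast, GradScaler

def train_step(model, images, labels, optimizer, scaler: GradScaler, criterion):
    """One mixed-precision training step; activations are kept in FP16 where safe."""
    optimizer.zero_grad(set_to_none=True)
    with autocast():                      # FP16/BF16 forward pass
        logits = model(images)
        loss = criterion(logits, labels)
    scaler.scale(loss).backward()         # scaled gradients to avoid FP16 underflow
    scaler.step(optimizer)
    scaler.update()
    return loss.item()

# Gradient checkpointing trades compute for memory by recomputing
# activations during the backward pass, e.g. applied per stage:
# from torch.utils.checkpoint import checkpoint
# x = checkpoint(stage, x)   # instead of x = stage(x)
```

Beyond memory, correct preprocessing matters just as much; the listing below covers variant-specific transforms and an inference wrapper.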
```python
import torch
import torchvision.transforms as T
from PIL import Image

# EfficientNet-specific preprocessing
# Note: Resolution must match model variant!
EFFICIENTNET_CONFIGS = {
    'b0': {'size': 224, 'crop': 224},
    'b1': {'size': 240, 'crop': 240},
    'b2': {'size': 260, 'crop': 260},
    'b3': {'size': 300, 'crop': 300},
    'b4': {'size': 380, 'crop': 380},
    'b5': {'size': 456, 'crop': 456},
    'b6': {'size': 528, 'crop': 528},
    'b7': {'size': 600, 'crop': 600},
}

def get_efficientnet_transforms(variant: str = 'b0', training: bool = False):
    """Get appropriate transforms for EfficientNet variant."""
    config = EFFICIENTNET_CONFIGS[variant]

    # ImageNet normalization (used during training)
    normalize = T.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    )

    if training:
        # Training: random resize crop + augmentation
        return T.Compose([
            T.RandomResizedCrop(config['crop']),
            T.RandomHorizontalFlip(),
            T.AutoAugment(T.AutoAugmentPolicy.IMAGENET),
            T.ToTensor(),
            normalize,
        ])
    else:
        # Inference: deterministic resize + center crop
        # Resize to slightly larger, then crop to match training
        resize_size = int(config['size'] / 0.875)  # ~= size / (224/256)
        return T.Compose([
            T.Resize(resize_size, interpolation=T.InterpolationMode.BICUBIC),
            T.CenterCrop(config['crop']),
            T.ToTensor(),
            normalize,
        ])

class EfficientNetPredictor:
    """Optimized EfficientNet inference wrapper."""

    def __init__(self, model, variant: str = 'b0', device: str = 'cuda'):
        self.device = torch.device(device)
        self.model = model.to(self.device).eval()
        self.transforms = get_efficientnet_transforms(variant, training=False)

        # Enable inference optimizations
        if device == 'cuda':
            self.model = self.model.half()  # FP16 inference
            torch.backends.cudnn.benchmark = True

    @torch.inference_mode()
    def predict(self, image: Image.Image) -> torch.Tensor:
        """Run inference on a single image."""
        x = self.transforms(image).unsqueeze(0)  # Add batch dimension
        x = x.to(self.device)
        if self.device.type == 'cuda':
            x = x.half()

        logits = self.model(x)
        probabilities = torch.softmax(logits, dim=1)
        return probabilities

    @torch.inference_mode()
    def predict_batch(self, images: list) -> torch.Tensor:
        """Run inference on a batch of images."""
        tensors = [self.transforms(img) for img in images]
        batch = torch.stack(tensors).to(self.device)
        if self.device.type == 'cuda':
            batch = batch.half()

        logits = self.model(batch)
        probabilities = torch.softmax(logits, dim=1)
        return probabilities

# TensorRT optimization example (pseudocode)
def optimize_for_tensorrt(model, variant: str = 'b0'):
    """
    Convert EfficientNet to TensorRT for maximum inference speed.
    This can provide 2-4x speedup on NVIDIA GPUs.
    """
    import torch_tensorrt

    config = EFFICIENTNET_CONFIGS[variant]

    # Define input specification
    input_spec = torch_tensorrt.Input(
        shape=(1, 3, config['crop'], config['crop']),
        dtype=torch.half
    )

    # Compile with TensorRT
    optimized = torch_tensorrt.compile(
        model,
        inputs=[input_spec],
        enabled_precisions={torch.half},
        workspace_size=1 << 30,  # 1 GB workspace
    )
    return optimized
```

A common deployment error is using incorrect input resolution. EfficientNet-B3 trained on 300×300 will have degraded accuracy when given 224×224 inputs (or vice versa). Always match your inference preprocessing to the exact resolution used during training.
EfficientNet represents a paradigm shift in CNN design, moving from ad-hoc scaling to principled, unified scaling. The key insights: a single compound coefficient φ scales depth, width, and resolution together under a roughly fixed FLOP budget; the scaled family rests on a strong NAS-discovered baseline (B0) built from MBConv blocks with squeeze-and-excitation and Swish; the resulting B0-B7 family spans a wide range of accuracy-efficiency tradeoffs; and EfficientNetV2 extends the recipe with training-aware choices such as Fused-MBConv and progressive resizing.
What's Next:
EfficientNet optimized for accuracy and efficiency within the CNN paradigm. In the next page, we'll explore MobileNet and ShuffleNet—architectures designed specifically for edge deployment with extreme computational constraints. These architectures take efficiency even further, enabling neural networks to run on mobile phones and embedded devices.
You now understand EfficientNet's compound scaling method, the NAS-derived B0 baseline, the complete model family from B0-B7, and the training-aware improvements in EfficientNetV2. This knowledge provides the foundation for understanding modern CNN efficiency and the transition from hand-designed to search-based architectures.