Deep neural networks are insatiable consumers of data. Modern vision models routinely train on millions—or billions—of images, yet even these vast datasets cannot capture the infinite variability of the real world. A model trained on perfectly centered, well-lit photographs will struggle with tilted, shadowed, or partially occluded inputs. This fundamental mismatch between training distribution and deployment reality is one of the central challenges in deep learning.
Data augmentation offers an elegant solution: instead of collecting more data, we systematically transform existing samples to simulate the diversity the model will encounter in practice. By applying geometric transformations, color modifications, and synthetic corruptions during training, we effectively expand our dataset by orders of magnitude—teaching the model to recognize objects regardless of their position, orientation, lighting, or context.
This isn't merely a practical trick; it's a principled approach with deep connections to invariance, equivariance, and the geometry of learned representations. Understanding augmentation thoroughly is essential for any practitioner building robust, deployable deep learning systems.
By the end of this page, you will understand the theoretical foundations of data augmentation, master the full taxonomy of image transformations (geometric, photometric, and synthetic), implement augmentations correctly with proper mathematical formulations, and develop intuition for which augmentations benefit which tasks.
Before examining specific augmentation techniques, we must understand why augmentation works from a theoretical perspective. This understanding guides principled application rather than blind recipe-following.
At its core, data augmentation enforces invariance—the property that a model's output should not change under certain transformations of its input. When we train a dog classifier, we want p(dog|x) to remain stable whether the input image shows the dog in the center, corner, rotated, or under different lighting. Mathematically, we desire:
$$f(T(x)) = f(x) \quad \forall T \in \mathcal{T}$$
where $\mathcal{T}$ is a family of transformations under which predictions should be invariant.
From an optimization standpoint, augmentation acts as a regularizer. Consider the expected risk we wish to minimize:
$$R(\theta) = \mathbb{E}_{(x,y) \sim p_{data}}[L(f_\theta(x), y)]$$
With augmentation, we instead minimize:
$$R_{aug}(\theta) = \mathbb{E}_{T \sim p_T}\,\mathbb{E}_{(x,y) \sim p_{data}}[L(f_\theta(T(x)), y)]$$
This modified objective smooths the loss landscape and reduces the model's sensitivity to input perturbations, directly improving generalization.
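In code, minimizing this augmented objective amounts to sampling a fresh transformation for every example at every step before the forward pass. A minimal sketch, assuming `model`, `loss_fn`, `optimizer`, and a list of transform callables are defined elsewhere:

```python
import random
import torch

def augmented_training_step(model, loss_fn, optimizer, batch, transforms):
    """One step of minimizing R_aug: sample a random T per example,
    apply it to the input, and keep the label unchanged."""
    images, labels = batch
    augmented = torch.stack([random.choice(transforms)(img) for img in images])
    optimizer.zero_grad()
    loss = loss_fn(model(augmented), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```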
Data augmentation can be formally understood as Vicinal Risk Minimization (VRM), where we replace the empirical distribution with a "vicinal" distribution that spreads probability mass around each training point. Each augmentation defines a vicinity function that determines how probability is allocated around observed samples.
Augmentation can be understood as expanding the support of our training distribution. Given limited samples $\{x_1, ..., x_n\}$ from the true data distribution $p_{data}(x)$, augmentation creates an augmented distribution:
$$p_{aug}(x) = \frac{1}{|\mathcal{T}|}\sum_{T \in \mathcal{T}} p_{emp}(T^{-1}(x))$$
where $p_{emp}$ is the empirical distribution over training samples. Effective augmentation ensures $p_{aug}$ better approximates $p_{data}$ by filling in regions of input space that would otherwise have zero training density.
Not all tasks require invariance. In semantic segmentation, we want pixel-wise predictions to transform along with the input—this is equivariance:
$$f(T(x)) = T(f(x)) \quad \text{(Equivariance)}$$
For segmentation, when we horizontally flip an image, the predicted mask should also flip horizontally. This distinction is crucial: the same transformation may appear in augmentation pipelines for both tasks, but with different label handling strategies.
| Task | Desired Property | Label Transformation | Example |
|---|---|---|---|
| Image Classification | Invariance | Labels unchanged | Rotated cat → still labeled 'cat' |
| Object Detection | Equivariance | Bounding boxes transform | Flipped image → flipped box coordinates |
| Semantic Segmentation | Equivariance | Masks transform spatially | Rotated image → rotated mask |
| Pose Estimation | Equivariance | Keypoints transform | Scaled image → scaled keypoint locations |
| Image Captioning | Invariance | Captions unchanged | Color-jittered image → same caption |
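A minimal sketch of the label-handling distinction in the table above, for a horizontal flip, assuming a `(C, H, W)` image tensor and an `(H, W)` segmentation mask:

```python
import torch

def hflip_classification(image: torch.Tensor, label: int):
    # Invariance: the input changes, the label does not.
    return torch.flip(image, dims=[-1]), label

def hflip_segmentation(image: torch.Tensor, mask: torch.Tensor):
    # Equivariance: the same spatial transform is applied to the mask.
    return torch.flip(image, dims=[-1]), torch.flip(mask, dims=[-1])
```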
Geometric transformations modify the spatial structure of images—their position, scale, rotation, and shape. These are among the most fundamental and universally applicable augmentations.
Random cropping extracts rectangular regions from random locations, forcing the model to recognize objects even when partially visible or off-center. This is perhaps the single most important augmentation for image classification.
Mathematical Formulation: Given an image $I$ of size $H \times W$, a random crop selects coordinates $(h_1, w_1)$ and extracts a region of size $h \times w$:
$$I_{crop}[i,j] = I[h_1 + i, w_1 + j] \quad \text{for } i \in [0, h), j \in [0, w)$$
where:

- $(h_1, w_1)$ is the randomly sampled top-left corner of the crop
- $h \times w$ is the crop size, which may itself be sampled (as in the random resized crop below)
ResNet-style random resized crop is particularly effective: randomly select a crop with area ratio in $[0.08, 1.0]$ and aspect ratio in $[3/4, 4/3]$, then resize to target dimensions.
```python
import numpy as np
from PIL import Image


def random_resized_crop(
    image: Image.Image,
    target_size: tuple = (224, 224),
    scale: tuple = (0.08, 1.0),
    ratio: tuple = (3/4, 4/3)
) -> Image.Image:
    """
    Random resized crop as used in ResNet training.

    Parameters:
    -----------
    image : PIL Image
        Input image to crop and resize
    target_size : tuple
        Final (height, width) after resizing
    scale : tuple
        Range of crop area relative to original (min, max)
    ratio : tuple
        Range of aspect ratios (min, max)

    Returns:
    --------
    Cropped and resized PIL Image
    """
    width, height = image.size
    area = width * height

    for _ in range(10):  # Try multiple times
        # Sample target area and aspect ratio
        target_area = np.random.uniform(scale[0], scale[1]) * area
        log_ratio = np.log(ratio)
        aspect_ratio = np.exp(np.random.uniform(log_ratio[0], log_ratio[1]))

        # Compute crop dimensions
        crop_width = int(round(np.sqrt(target_area * aspect_ratio)))
        crop_height = int(round(np.sqrt(target_area / aspect_ratio)))

        # Check if valid crop is possible
        if 0 < crop_width <= width and 0 < crop_height <= height:
            # Random position
            x = np.random.randint(0, width - crop_width + 1)
            y = np.random.randint(0, height - crop_height + 1)

            # Crop and resize (PIL's resize expects (width, height))
            return image.crop((x, y, x + crop_width, y + crop_height)).resize(
                (target_size[1], target_size[0]), Image.BILINEAR
            )

    # Fallback: center crop matching the target aspect ratio, then resize
    scale_factor = min(width / target_size[1], height / target_size[0])
    crop_width = int(target_size[1] * scale_factor)
    crop_height = int(target_size[0] * scale_factor)
    x = (width - crop_width) // 2
    y = (height - crop_height) // 2
    return image.crop((x, y, x + crop_width, y + crop_height)).resize(
        (target_size[1], target_size[0]), Image.BILINEAR
    )
```

Horizontal flipping reflects the image across the vertical axis. For most natural scenes and objects, horizontal flip is semantically valid—a flipped dog is still a dog.
$$I_{flip}[i,j] = I[i, W-1-j]$$
Vertical flipping reflects across the horizontal axis and is appropriate for aerial imagery, medical scans, or satellite photos where orientation is arbitrary:
$$I_{flip}[i,j] = I[H-1-i, j]$$
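The two index formulas translate directly into array slicing. A minimal NumPy sketch for an `(H, W)` or `(H, W, C)` array:

```python
import numpy as np

def hflip(image: np.ndarray) -> np.ndarray:
    # I_flip[i, j] = I[i, W-1-j]
    return image[:, ::-1].copy()

def vflip(image: np.ndarray) -> np.ndarray:
    # I_flip[i, j] = I[H-1-i, j]
    return image[::-1, :].copy()
```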
Not all tasks tolerate flipping. Text recognition would fail with horizontally flipped text. Digit recognition cannot flip '6' and '9' interchangeably. Medical imaging may have left/right significance. Always validate that augmentations preserve the task's semantic structure.
Random rotation rotates the image by an angle $\theta$ sampled from a specified range. The transformation matrix for 2D rotation around the center $(c_x, c_y)$ is:
$$\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} x - c_x \\ y - c_y \end{bmatrix} + \begin{bmatrix} c_x \\ c_y \end{bmatrix}$$
Rotation introduces empty pixels at corners that must be handled via padding (constant, reflection, or wrap-around).
Common rotation ranges:

- Natural images (classification, detection): small angles, typically ±10° to ±15°
- Aerial, satellite, and microscopy imagery: the full ±180°, since orientation is arbitrary
- Digits and text: small angles only, to avoid label ambiguity (e.g., '6' vs. '9')
More general transformations combine rotation, translation, scaling, and shearing:
Affine transformation (preserves parallel lines): $$\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} t_x \\ t_y \end{bmatrix}$$
Perspective transformation (simulates viewpoint changes): $$x' = \frac{a_{11}x + a_{12}y + a_{13}}{a_{31}x + a_{32}y + a_{33}}, \quad y' = \frac{a_{21}x + a_{22}y + a_{23}}{a_{31}x + a_{32}y + a_{33}}$$
Perspective transforms are particularly valuable for document analysis, OCR, and scene understanding where viewpoint varies.
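A minimal sketch of a random perspective warp, assuming OpenCV is available; the corner-jitter parameterization and the `max_shift` value are illustrative choices, not a standard recipe:

```python
import cv2
import numpy as np

def random_perspective(image: np.ndarray, max_shift: float = 0.1) -> np.ndarray:
    """Warp an H x W x C image by randomly jittering its four corners,
    approximating a viewpoint change."""
    h, w = image.shape[:2]
    src = np.float32([[0, 0], [w - 1, 0], [w - 1, h - 1], [0, h - 1]])
    jitter = np.random.uniform(-max_shift, max_shift, size=(4, 2)) * [w, h]
    dst = (src + jitter).astype(np.float32)
    M = cv2.getPerspectiveTransform(src, dst)  # solves for a_11 ... a_33
    return cv2.warpPerspective(image, M, (w, h), borderMode=cv2.BORDER_REFLECT)
```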
```python
import torch
import torch.nn.functional as F
import numpy as np
from typing import Tuple


def get_affine_matrix(
    angle: float = 0,
    translate: Tuple[float, float] = (0, 0),
    scale: float = 1.0,
    shear: Tuple[float, float] = (0, 0)
) -> torch.Tensor:
    """
    Compute affine transformation matrix combining rotation,
    translation, scaling, and shearing.

    Parameters:
    -----------
    angle : float
        Rotation angle in degrees
    translate : tuple
        Translation as fraction of image size (tx, ty)
    scale : float
        Isotropic scaling factor
    shear : tuple
        Shear factors (shear_x, shear_y) in degrees

    Returns:
    --------
    torch.Tensor of shape (2, 3): Affine transformation matrix
    """
    # Convert to radians
    angle_rad = np.deg2rad(angle)
    shear_x = np.deg2rad(shear[0])
    shear_y = np.deg2rad(shear[1])

    # Rotation matrix
    cos_a, sin_a = np.cos(angle_rad), np.sin(angle_rad)
    R = np.array([
        [cos_a, -sin_a],
        [sin_a, cos_a]
    ])

    # Shear matrix
    S = np.array([
        [1, np.tan(shear_x)],
        [np.tan(shear_y), 1]
    ])

    # Combined linear transformation
    M = scale * R @ S

    # Full affine matrix with translation
    affine = np.array([
        [M[0, 0], M[0, 1], translate[0]],
        [M[1, 0], M[1, 1], translate[1]]
    ])

    return torch.tensor(affine, dtype=torch.float32)


def apply_affine_transform(
    image: torch.Tensor,
    affine_matrix: torch.Tensor,
    mode: str = 'bilinear',
    padding_mode: str = 'reflection'
) -> torch.Tensor:
    """
    Apply affine transformation to image tensor.

    Parameters:
    -----------
    image : torch.Tensor
        Image tensor of shape (C, H, W) or (B, C, H, W)
    affine_matrix : torch.Tensor
        Affine matrix of shape (2, 3)
    mode : str
        Interpolation mode: 'bilinear' or 'nearest'
    padding_mode : str
        Padding for out-of-bounds: 'zeros', 'border', 'reflection'

    Returns:
    --------
    Transformed image tensor
    """
    # Ensure batch dimension
    if image.dim() == 3:
        image = image.unsqueeze(0)
        squeeze = True
    else:
        squeeze = False

    B, C, H, W = image.shape

    # Expand affine matrix for batch
    theta = affine_matrix.unsqueeze(0).expand(B, -1, -1)

    # Create sampling grid
    grid = F.affine_grid(theta, image.size(), align_corners=False)

    # Apply transformation
    output = F.grid_sample(image, grid, mode=mode,
                           padding_mode=padding_mode, align_corners=False)

    if squeeze:
        output = output.squeeze(0)

    return output
```

Elastic deformations apply spatially-varying displacements to simulate non-rigid distortions. These are particularly effective for handwritten text, biological cells, or any domain with natural shape variability.
The displacement field $(\Delta x, \Delta y)$ at each pixel is generated by:
$$x' = x + \alpha \cdot (G_\sigma * \epsilon_x)(x,y)$$ $$y' = y + \alpha \cdot (G_\sigma * \epsilon_y)(x,y)$$
where $\epsilon_x, \epsilon_y \sim \mathcal{N}(0,1)$ and $G_\sigma$ is a Gaussian kernel.
Elastic transforms set state-of-the-art MNIST digit recognition results when introduced (Simard et al., 2003) and remain valuable for medical imaging (cell segmentation, tissue deformation) and document analysis.
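A minimal sketch of the displacement-field construction above for a single-channel image, using SciPy for the Gaussian smoothing and the resampling; the `alpha` and `sigma` defaults are illustrative, not a canonical setting:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_transform(image: np.ndarray, alpha: float = 34.0, sigma: float = 4.0) -> np.ndarray:
    """Elastic deformation of a 2D grayscale image: smooth a random
    displacement field with a Gaussian, scale by alpha, and resample."""
    h, w = image.shape
    dx = alpha * gaussian_filter(np.random.standard_normal((h, w)), sigma)
    dy = alpha * gaussian_filter(np.random.standard_normal((h, w)), sigma)
    y, x = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')
    coords = np.stack([y + dy, x + dx])  # target sampling coordinates
    return map_coordinates(image, coords, order=1, mode='reflect')
```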
Photometric transformations modify pixel intensities without changing spatial structure. These simulate variations in lighting, camera settings, and environmental conditions.
Color jittering randomly perturbs brightness, contrast, saturation, and hue. This is essential for robustness to lighting variations.
Brightness and contrast adjustment: Scales all pixel values by a factor and shifts them by an offset: $$I' = \alpha \cdot I + \beta$$
where $\alpha$ controls contrast and $\beta$ controls brightness.
Saturation adjustment: Modifies color intensity in HSV or HSL space: $$S' = \text{clip}(S \cdot \gamma, 0, 1)$$
Hue shift: Rotates the hue channel: $$H' = (H + \Delta H) \mod 360$$
```python
import torch
import numpy as np


def rgb_to_hsv(rgb: torch.Tensor) -> torch.Tensor:
    """
    Convert RGB image to HSV color space.

    Parameters:
    -----------
    rgb : torch.Tensor
        RGB image tensor of shape (C=3, H, W) with values in [0, 1]

    Returns:
    --------
    HSV tensor of shape (C=3, H, W)
    """
    r, g, b = rgb[0], rgb[1], rgb[2]

    max_val, max_idx = rgb.max(dim=0)
    min_val = rgb.min(dim=0)[0]
    diff = max_val - min_val

    # Value
    v = max_val

    # Saturation
    s = torch.where(max_val > 0, diff / max_val, torch.zeros_like(max_val))

    # Hue
    h = torch.zeros_like(max_val)
    mask = diff > 0

    # When max is R
    mask_r = mask & (max_idx == 0)
    h[mask_r] = (60 * (g[mask_r] - b[mask_r]) / diff[mask_r]) % 360

    # When max is G
    mask_g = mask & (max_idx == 1)
    h[mask_g] = 60 * (b[mask_g] - r[mask_g]) / diff[mask_g] + 120

    # When max is B
    mask_b = mask & (max_idx == 2)
    h[mask_b] = 60 * (r[mask_b] - g[mask_b]) / diff[mask_b] + 240

    h = h / 360  # Normalize to [0, 1]

    return torch.stack([h, s, v])


def color_jitter(
    image: torch.Tensor,
    brightness: float = 0.2,
    contrast: float = 0.2,
    saturation: float = 0.2,
    hue: float = 0.05
) -> torch.Tensor:
    """
    Apply random color jittering to image.

    Parameters:
    -----------
    image : torch.Tensor
        RGB image of shape (C=3, H, W) with values in [0, 1]
    brightness : float
        Max brightness adjustment factor
    contrast : float
        Max contrast adjustment factor
    saturation : float
        Max saturation adjustment factor
    hue : float
        Max hue shift (as fraction of full circle)

    Returns:
    --------
    Color-jittered image tensor
    """
    # Random order of transformations
    transforms = []

    # Brightness: I' = I + delta
    if brightness > 0:
        def adjust_brightness(img):
            delta = torch.empty(1).uniform_(-brightness, brightness)
            return (img + delta).clamp(0, 1)
        transforms.append(adjust_brightness)

    # Contrast: I' = mean + alpha * (I - mean)
    if contrast > 0:
        def adjust_contrast(img):
            alpha = torch.empty(1).uniform_(1 - contrast, 1 + contrast)
            mean = img.mean(dim=(1, 2), keepdim=True)
            return (mean + alpha * (img - mean)).clamp(0, 1)
        transforms.append(adjust_contrast)

    # Saturation: blend the image with its grayscale version
    if saturation > 0:
        def adjust_saturation(img):
            factor = torch.empty(1).uniform_(1 - saturation, 1 + saturation)
            gray = 0.2989 * img[0] + 0.5870 * img[1] + 0.1140 * img[2]
            return (factor * img + (1 - factor) * gray.unsqueeze(0)).clamp(0, 1)
        transforms.append(adjust_saturation)

    # Hue: rotate H channel in HSV
    if hue > 0:
        def adjust_hue(img):
            delta = torch.empty(1).uniform_(-hue, hue)
            hsv = rgb_to_hsv(img)
            hsv[0] = (hsv[0] + delta) % 1.0
            return hsv_to_rgb(hsv)  # Assumes an hsv_to_rgb inverse is defined
        transforms.append(adjust_hue)

    # Apply in random order
    np.random.shuffle(transforms)
    for transform in transforms:
        image = transform(image)

    return image
```

Random grayscale occasionally converts images to grayscale, forcing the model to rely on shape and texture rather than color:
$$I_{gray} = 0.2989 \cdot R + 0.5870 \cdot G + 0.1140 \cdot B$$
These coefficients approximate human luminance perception. Random grayscale with probability 10-20% is a component of many successful training recipes (e.g., SimCLR, BYOL).
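A minimal sketch; the helper names are chosen to match the pipeline example later on this page (an assumption, not a fixed API):

```python
import torch

def to_grayscale(image: torch.Tensor) -> torch.Tensor:
    """Replace an RGB (3, H, W) tensor by its luminance, replicated across channels."""
    gray = 0.2989 * image[0] + 0.5870 * image[1] + 0.1140 * image[2]
    return gray.unsqueeze(0).expand(3, -1, -1).clone()

def random_grayscale(image: torch.Tensor, p: float = 0.2) -> torch.Tensor:
    """Convert to grayscale with probability p; use to_grayscale directly
    when an outer pipeline already gates on an application probability."""
    return to_grayscale(image) if torch.rand(1).item() < p else image
```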
Gaussian blur applies a Gaussian smoothing kernel to simulate focus variations or low-resolution inputs:
$$G(x,y) = \frac{1}{2\pi\sigma^2}e^{-\frac{x^2+y^2}{2\sigma^2}}$$
$$I_{blur} = I * G$$
Random blur with kernel sizes 3-23px and $\sigma \in [0.1, 2.0]$ is standard in contrastive and self-supervised learning.
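A minimal helper built on torchvision's `GaussianBlur` transform; the name `gaussian_blur` and the kernel size of 23 for 224-pixel inputs are assumptions made to match the pipeline example later:

```python
import torch
from torchvision import transforms

# Sigma is re-sampled uniformly from [0.1, 2.0] on every call.
_blur = transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0))

def gaussian_blur(image: torch.Tensor) -> torch.Tensor:
    """Apply Gaussian blur with a randomly sampled sigma to a (C, H, W) tensor."""
    return _blur(image)
```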
Additive Gaussian noise simulates sensor noise in cameras:
$$I_{noisy} = I + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2)$$
Typical $\sigma$ values range from 0.01 to 0.1 for normalized [0,1] images. Gaussian noise is particularly important for:

- Low-light, high-ISO, or otherwise noisy sensor domains
- Medical and scientific imaging, where acquisition noise is intrinsic
- Improving robustness to small input perturbations
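A minimal sketch for images normalized to [0, 1]:

```python
import torch

def add_gaussian_noise(image: torch.Tensor, sigma: float = 0.05) -> torch.Tensor:
    """Additive Gaussian noise; sigma = 0.05 sits in the typical 0.01-0.1 range."""
    noise = torch.randn_like(image) * sigma
    return (image + noise).clamp(0.0, 1.0)
```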
While not strictly augmentation, proper normalization is essential:
Per-channel normalization: $$I_{norm} = \frac{I - \mu}{\sigma}$$
ImageNet statistics ($\mu = [0.485, 0.456, 0.406]$, $\sigma = [0.229, 0.224, 0.225]$) are widely used even for non-ImageNet tasks when using pretrained backbones.
Apply normalization after all other augmentations, not before. Photometric augmentations like color jitter expect pixel values in [0,1] or [0,255], not standardized values with mean 0. Incorrect ordering is a common source of degraded performance.
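A minimal normalization helper; the name `imagenet_normalize` matches the one assumed in the pipeline example below:

```python
import torch

IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

def imagenet_normalize(image: torch.Tensor) -> torch.Tensor:
    """Per-channel standardization of a (3, H, W) tensor in [0, 1];
    applied as the final step of the pipeline."""
    return (image - IMAGENET_MEAN) / IMAGENET_STD
```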
Erasing augmentations randomly remove or corrupt rectangular regions, forcing models to recognize objects from partial information. This family of techniques addresses occlusion robustness—a critical real-world challenge.
Random Erasing (closely related to Cutout) masks a rectangular region with random noise or a constant value:
$$I_{erased}[i,j] = \begin{cases} v & \text{if } (i,j) \in R_{erase} \\ I[i,j] & \text{otherwise} \end{cases}$$
where $R_{erase}$ is a randomly positioned rectangle and $v$ is either:

- per-pixel random noise,
- a constant value (zero or another fixed value), or
- the per-channel ImageNet mean (see the implementation below).
```python
import torch
import numpy as np
from typing import Tuple


class RandomErasing:
    """
    Random Erasing augmentation as described in:
    'Random Erasing Data Augmentation' (Zhong et al., 2020)

    Randomly masks rectangular regions with noise or constant values,
    forcing the model to learn from partial object views.
    """

    def __init__(
        self,
        probability: float = 0.5,
        scale: Tuple[float, float] = (0.02, 0.33),
        ratio: Tuple[float, float] = (0.3, 3.3),
        value: str = 'random',  # 'random', 'zero', 'mean', or float
        inplace: bool = False
    ):
        """
        Parameters:
        -----------
        probability : float
            Probability of applying random erasing
        scale : tuple
            Range of proportion of erased area vs input image
        ratio : tuple
            Range of aspect ratio of erased area
        value : str or float
            Fill value for erased region:
            - 'random': random uniform noise
            - 'zero': constant 0
            - 'mean': ImageNet mean values
            - float: constant value
        """
        self.probability = probability
        self.scale = scale
        self.ratio = ratio
        self.value = value
        self.inplace = inplace

        # ImageNet mean for 'mean' fill option
        self.imagenet_mean = torch.tensor([0.485, 0.456, 0.406])

    def __call__(self, image: torch.Tensor) -> torch.Tensor:
        """
        Apply random erasing to image tensor.

        Parameters:
        -----------
        image : torch.Tensor
            Image of shape (C, H, W) with values in [0, 1]

        Returns:
        --------
        Image with randomly erased region
        """
        if np.random.random() > self.probability:
            return image

        if not self.inplace:
            image = image.clone()

        C, H, W = image.shape
        area = H * W

        for _ in range(10):  # Maximum attempts
            # Sample target area and aspect ratio
            target_area = np.random.uniform(self.scale[0], self.scale[1]) * area
            aspect_ratio = np.random.uniform(self.ratio[0], self.ratio[1])

            # Compute erase dimensions
            h = int(round(np.sqrt(target_area * aspect_ratio)))
            w = int(round(np.sqrt(target_area / aspect_ratio)))

            if h < H and w < W:
                # Random position
                i = np.random.randint(0, H - h + 1)
                j = np.random.randint(0, W - w + 1)

                # Generate fill value
                if self.value == 'random':
                    fill = torch.rand(C, h, w)
                elif self.value == 'zero':
                    fill = torch.zeros(C, h, w)
                elif self.value == 'mean':
                    fill = self.imagenet_mean.view(C, 1, 1).expand(C, h, w)
                else:
                    fill = torch.full((C, h, w), self.value)

                # Apply erasing
                image[:, i:i+h, j:j+w] = fill
                return image

        return image
```

Cutout is a simpler variant that always uses a square mask with a fixed side length (typically 16-64 pixels for 32x32 CIFAR images, or scaled proportionally for larger images). Unlike Random Erasing, Cutout:

- is applied to every image rather than with a sampled probability
- uses a fixed-size square mask instead of sampling the area and aspect ratio
- fills the masked region with zeros rather than random noise
- allows the mask to extend partially beyond the image border
Cutout is particularly effective for small-scale image classification (CIFAR-10/100) and has largely been superseded by more flexible techniques for ImageNet-scale training.
GridMask takes structured erasing further by removing multiple rectangular regions arranged in a grid pattern:
$$R_{erase} = \bigcup_{i,j} \{(x,y) : i \cdot d \leq x < i \cdot d + l,\; j \cdot d \leq y < j \cdot d + l\}$$
where $d$ is the grid spacing and $l$ is the unit mask size. GridMask provides more uniform coverage than random erasing and can be scheduled to increase difficulty during training.
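A minimal sketch of the masking pattern defined above; the offset randomization, mask rotation, and training-time scheduling of the full GridMask method are omitted:

```python
import torch

def grid_mask(image: torch.Tensor, d: int = 32, ratio: float = 0.5) -> torch.Tensor:
    """Zero out an l x l square in every d x d grid cell of a (C, H, W) float
    tensor, where l = ratio * d."""
    C, H, W = image.shape
    l = max(1, int(d * ratio))
    mask = torch.ones(H, W)
    for i in range(0, H, d):
        for j in range(0, W, d):
            mask[i:i + l, j:j + l] = 0.0
    return image * mask.unsqueeze(0)
```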
For object detection, erasing augmentations require careful handling. If an erasing rectangle fully covers a bounding box, that ground truth should be removed. Partial occlusions require deciding whether to keep, remove, or clip the bounding box—choices that affect training dynamics.
Individual augmentations are typically composed into augmentation pipelines—sequences of transformations applied to each training sample. Effective composition requires understanding transformation interactions and computational constraints.
The standard approach applies transformations in sequence:
$$T_{pipeline} = T_n \circ T_{n-1} \circ \cdots \circ T_1$$
Critical ordering principles:

- Apply geometric transforms (crop, flip, rotate) first, so later operations work on the final spatial layout.
- Apply photometric transforms on un-normalized pixel values in [0, 1] or [0, 255].
- Apply erasing after photometric transforms so the fill values match the final pixel statistics.
- Apply normalization last, always (see the warning earlier on this page).
```python
import torch
from typing import List
import numpy as np


class AugmentationPipeline:
    """
    Composable augmentation pipeline with proper ordering
    and probabilistic application.

    Ensures geometric transforms are applied before photometric,
    and normalization is always applied last.
    """

    def __init__(self, transforms: List[dict]):
        """
        Parameters:
        -----------
        transforms : list of dict
            Each dict contains:
            - 'fn': callable augmentation function
            - 'prob': probability of applying (default 1.0)
            - 'stage': 'geometric', 'photometric', 'erase', or 'normalize'
        """
        # Sort by stage to ensure correct ordering
        stage_order = {
            'geometric': 0,
            'photometric': 1,
            'erase': 2,
            'normalize': 3
        }
        self.transforms = sorted(
            transforms,
            key=lambda x: stage_order.get(x.get('stage', 'photometric'), 1)
        )

    def __call__(self, image: torch.Tensor) -> torch.Tensor:
        """
        Apply augmentation pipeline to image.

        Parameters:
        -----------
        image : torch.Tensor
            Input image of shape (C, H, W)

        Returns:
        --------
        Augmented image tensor
        """
        for transform in self.transforms:
            fn = transform['fn']
            prob = transform.get('prob', 1.0)

            if np.random.random() < prob:
                image = fn(image)

        return image


# Example: Standard ImageNet training pipeline
def create_imagenet_pipeline(
    crop_size: int = 224,
    hflip_prob: float = 0.5,
    jitter_strengths: tuple = (0.4, 0.4, 0.4, 0.1),
    random_erase_prob: float = 0.25
):
    """
    Create standard ImageNet training augmentation pipeline.
    Based on ResNet training recipe with modern additions.

    Assumes the helpers defined earlier on this page are in scope
    (random_resized_crop, color_jitter, random_grayscale, gaussian_blur,
    RandomErasing, imagenet_normalize). Note that random_resized_crop
    operates on PIL images, so in practice the crop is applied before
    conversion to a tensor.
    """
    transforms = [
        # Stage 1: Geometric transforms
        {
            'fn': lambda x: random_resized_crop(x, (crop_size, crop_size)),
            'prob': 1.0,
            'stage': 'geometric'
        },
        {
            'fn': lambda x: torch.flip(x, dims=[-1]),
            'prob': hflip_prob,
            'stage': 'geometric'
        },
        # Stage 2: Photometric transforms
        {
            'fn': lambda x: color_jitter(
                x,
                brightness=jitter_strengths[0],
                contrast=jitter_strengths[1],
                saturation=jitter_strengths[2],
                hue=jitter_strengths[3]
            ),
            'prob': 0.8,
            'stage': 'photometric'
        },
        {
            'fn': random_grayscale,
            'prob': 0.2,
            'stage': 'photometric'
        },
        {
            'fn': gaussian_blur,
            'prob': 0.1,
            'stage': 'photometric'
        },
        # Stage 3: Erasing
        {
            'fn': RandomErasing(probability=1.0),
            'prob': random_erase_prob,
            'stage': 'erase'
        },
        # Stage 4: Normalization (always apply)
        {
            'fn': imagenet_normalize,
            'prob': 1.0,
            'stage': 'normalize'
        },
    ]

    return AugmentationPipeline(transforms)
```

Beyond individual transform probabilities, we can modulate the overall intensity of augmentation:
- Augmentation magnitude: how strong each individual transformation is
- Augmentation probability: whether each transformation is applied
- Augmentation count: how many transformations are applied per sample
RandAugment (discussed later) controls these with just two hyperparameters (N, M), dramatically simplifying the search space.
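As a rough sketch of that two-knob interface (the op list, the 0-30 magnitude scale, and the sampling-without-replacement choice here are simplifications, not the official implementation):

```python
import random
import torch

def randaugment_sketch(image: torch.Tensor, ops: list, n: int = 2, m: int = 9) -> torch.Tensor:
    """Apply N randomly chosen ops, each at a magnitude derived from M.
    `ops` is a list of callables taking (image, magnitude in [0, 1])."""
    magnitude = m / 30.0  # magnitudes are commonly defined on a 0-30 scale
    for op in random.sample(ops, k=min(n, len(ops))):
        image = op(image, magnitude)
    return image
```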
For reproducible training, augmentations must be deterministically controlled:
```python
import random
import numpy as np
import torch

# Set global seed for reproducibility
def set_augmentation_seed(seed: int):
    np.random.seed(seed)
    torch.manual_seed(seed)
    random.seed(seed)

# Per-sample seeding for DataLoader workers in distributed training
def worker_init_fn(worker_id: int):
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)
```
Note that augmentation randomness is intentional—the same image should receive different augmentations across epochs. What should be reproducible is the sequence of random augmentations for debugging and comparison.
Different vision tasks require different augmentation strategies. What works for classification may harm detection; what helps segmentation may break pose estimation. Understanding task constraints is essential for effective augmentation.
Goal: Invariance to geometric and photometric variations
Standard recipe (EfficientNet/ResNet-style training):

- Random resized crop to the training resolution (area scale 0.08-1.0, aspect ratio 3/4-4/3)
- Horizontal flip with probability 0.5
- Color jitter (brightness/contrast/saturation around 0.4, hue around 0.1)
- Random erasing with probability around 0.25
- Per-channel normalization with dataset statistics
- Optionally, a learned policy such as AutoAugment or RandAugment layered on top (see the table below)
Fine-grained classification (birds, cars, flowers): Use weaker augmentation to preserve diagnostic details. Reduce crop scale range and color jitter intensity.
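As a hedged illustration, the pipeline builder defined earlier can be re-parameterized with milder settings; the exact values below are illustrative, and narrowing the crop scale range would additionally require passing a larger minimum scale (e.g., 0.5) to random_resized_crop, which that builder hardcodes:

```python
# Illustrative weaker settings for fine-grained tasks (values are not canonical).
fine_grained_pipeline = create_imagenet_pipeline(
    crop_size=224,
    hflip_prob=0.5,
    jitter_strengths=(0.1, 0.1, 0.1, 0.02),  # reduced from (0.4, 0.4, 0.4, 0.1)
    random_erase_prob=0.0,                   # keep diagnostic parts intact
)
```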
| Task | Geometric | Photometric | Erasing | Special |
|---|---|---|---|---|
| Classification | Strong random crop, flip | Strong color jitter | Random erasing 25% | RandAugment or AutoAugment |
| Object Detection | Moderate crop, flip, scale | Moderate jitter | With box validation | Mosaic, Copy-Paste |
| Segmentation | Crop, flip, rotate with mask | Moderate jitter | Rare | Preserve class balance in crops |
| Pose Estimation | Crop, flip, rotate with keypoints | Light jitter | None | Half-body crop |
| OCR/Text | Perspective, elastic | Minimal | None | Font variation, background variation |
| Medical Imaging | Rotation, elastic, flip (if appropriate) | Intensity normalization | Rare | Domain-specific physics simulation |
Goal: Equivariance for bounding boxes, invariance for class labels
Critical considerations:

- Bounding boxes must receive the same geometric transform as the image (flips, crops, scaling); photometric transforms leave them unchanged.
- Boxes that fall mostly outside a crop should be dropped, typically using a minimum visible-area ratio.
- Boxes partially cut by a crop must be clipped to the crop boundaries, as in the box-handling code later in this section.
- Erasing that fully covers a box should remove its ground truth (see the warning above).
YOLO Mosaic augmentation: Combines 4 images into one, providing:

- Four images' worth of objects, scales, and backgrounds in a single training sample
- Objects at unusual positions and in novel contexts, including many small objects near tile boundaries
- Greater effective diversity per batch, reducing the reliance on very large batch sizes
Goal: Equivariance for pixel-wise labels
Critical considerations:

- The mask must undergo exactly the same geometric transform (same sampled parameters) as the image.
- Masks must be resampled with nearest-neighbor interpolation; bilinear interpolation would invent fractional class labels.
- Photometric transforms apply to the image only, never to the mask.
- Random crops should preserve class balance, since rare classes are easily cropped out entirely.
```python
import torch
import numpy as np
from typing import Tuple, List


def flip_boxes_horizontal(
    boxes: torch.Tensor,
    image_width: int
) -> torch.Tensor:
    """
    Flip bounding boxes horizontally.

    Parameters:
    -----------
    boxes : torch.Tensor
        Bounding boxes of shape (N, 4) in format [x1, y1, x2, y2]
    image_width : int
        Width of the image

    Returns:
    --------
    Flipped bounding boxes
    """
    flipped = boxes.clone()
    flipped[:, 0] = image_width - boxes[:, 2]  # new x1 = W - old x2
    flipped[:, 2] = image_width - boxes[:, 0]  # new x2 = W - old x1
    return flipped


def crop_boxes(
    boxes: torch.Tensor,
    labels: torch.Tensor,
    crop_x1: int,
    crop_y1: int,
    crop_x2: int,
    crop_y2: int,
    min_visible_ratio: float = 0.3
) -> Tuple[torch.Tensor, torch.Tensor]:
    """
    Adjust bounding boxes after crop operation.

    Removes boxes that are mostly outside crop region,
    clips remaining boxes to crop boundaries.

    Parameters:
    -----------
    boxes : torch.Tensor
        Boxes in [x1, y1, x2, y2] format, shape (N, 4)
    labels : torch.Tensor
        Class labels, shape (N,)
    crop_x1, crop_y1, crop_x2, crop_y2 : int
        Crop region boundaries
    min_visible_ratio : float
        Minimum fraction of box area that must remain visible

    Returns:
    --------
    Clipped boxes and corresponding labels
    """
    # Original box areas
    original_areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])

    # Clip boxes to crop region
    clipped = boxes.clone()
    clipped[:, 0] = clipped[:, 0].clamp(min=crop_x1) - crop_x1
    clipped[:, 1] = clipped[:, 1].clamp(min=crop_y1) - crop_y1
    clipped[:, 2] = clipped[:, 2].clamp(max=crop_x2) - crop_x1
    clipped[:, 3] = clipped[:, 3].clamp(max=crop_y2) - crop_y1

    # Compute clipped areas
    clipped_widths = (clipped[:, 2] - clipped[:, 0]).clamp(min=0)
    clipped_heights = (clipped[:, 3] - clipped[:, 1]).clamp(min=0)
    clipped_areas = clipped_widths * clipped_heights

    # Keep boxes with sufficient visible area
    visible_ratios = clipped_areas / (original_areas + 1e-6)
    valid_mask = (
        (visible_ratios >= min_visible_ratio) &
        (clipped_widths > 0) &
        (clipped_heights > 0)
    )

    return clipped[valid_mask], labels[valid_mask]


def mosaic_augmentation(
    images: List[torch.Tensor],
    boxes_list: List[torch.Tensor],
    labels_list: List[torch.Tensor],
    output_size: int = 640
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    """
    YOLO-style Mosaic augmentation combining 4 images.

    Creates a single training sample from 4 different images,
    increasing object diversity and reducing batch size requirements.
    """
    assert len(images) == 4, "Mosaic requires exactly 4 images"

    s = output_size
    yc, xc = np.random.randint(s // 2, 3 * s // 2, 2)  # Mosaic center

    mosaic_img = torch.zeros(3, s * 2, s * 2)
    all_boxes = []
    all_labels = []

    # Placement positions: top-left, top-right, bottom-left, bottom-right
    for i, (img, boxes, labels) in enumerate(zip(images, boxes_list, labels_list)):
        C, h, w = img.shape

        if i == 0:  # Top-left
            x1a, y1a, x2a, y2a = max(xc - w, 0), max(yc - h, 0), xc, yc
            x1b, y1b, x2b, y2b = w - (x2a - x1a), h - (y2a - y1a), w, h
        elif i == 1:  # Top-right
            x1a, y1a, x2a, y2a = xc, max(yc - h, 0), min(xc + w, s * 2), yc
            x1b, y1b, x2b, y2b = 0, h - (y2a - y1a), min(w, x2a - x1a), h
        elif i == 2:  # Bottom-left
            x1a, y1a, x2a, y2a = max(xc - w, 0), yc, xc, min(s * 2, yc + h)
            x1b, y1b, x2b, y2b = w - (x2a - x1a), 0, w, min(y2a - y1a, h)
        else:  # Bottom-right
            x1a, y1a, x2a, y2a = xc, yc, min(xc + w, s * 2), min(s * 2, yc + h)
            x1b, y1b, x2b, y2b = 0, 0, min(w, x2a - x1a), min(y2a - y1a, h)

        # Place image tile
        mosaic_img[:, y1a:y2a, x1a:x2a] = img[:, y1b:y2b, x1b:x2b]

        # Adjust boxes for placement offset
        offset_x = x1a - x1b
        offset_y = y1a - y1b

        if len(boxes) > 0:
            adjusted_boxes = boxes.clone()
            adjusted_boxes[:, [0, 2]] += offset_x
            adjusted_boxes[:, [1, 3]] += offset_y
            all_boxes.append(adjusted_boxes)
            all_labels.append(labels)

    # Center crop to output size
    mosaic_img = mosaic_img[:, s//2:s//2+s, s//2:s//2+s]

    if all_boxes:
        all_boxes = torch.cat(all_boxes)
        all_labels = torch.cat(all_labels)

        # Shift into the cropped coordinate frame, then clip to the output region
        all_boxes[:, [0, 2]] -= s // 2
        all_boxes[:, [1, 3]] -= s // 2
        all_boxes, all_labels = crop_boxes(
            all_boxes, all_labels, 0, 0, s, s
        )
    else:
        all_boxes = torch.empty(0, 4)
        all_labels = torch.empty(0)

    return mosaic_img, all_boxes, all_labels
```

We've established the comprehensive foundation for understanding image augmentations—the theoretical basis, mathematical formulations, and practical implementation of the techniques that form the backbone of modern deep learning training pipelines.
What's Next:
Now that we've established the foundations of individual image augmentations, we'll explore Mixup and CutMix—revolutionary techniques that don't just transform individual images, but combine multiple training samples. These mixing strategies create synthetic training examples with interpolated labels, providing a fundamentally different form of regularization with deep connections to label smoothing and ensembling.
You now understand the theoretical motivation, mathematical formulations, and practical implementation of image augmentation. This foundation prepares you for advanced mixing strategies, learned augmentation policies, and test-time augmentation techniques covered in subsequent pages.