Deep neural networks are insatiable consumers of data. Modern vision models routinely train on millions—or billions—of images, yet even these vast datasets cannot capture the infinite variability of the real world. A model trained on perfectly centered, well-lit photographs will struggle with tilted, shadowed, or partially occluded inputs. This fundamental mismatch between training distribution and deployment reality is one of the central challenges in deep learning.
Data augmentation offers an elegant solution: instead of collecting more data, we systematically transform existing samples to simulate the diversity the model will encounter in practice. By applying geometric transformations, color modifications, and synthetic corruptions during training, we effectively expand our dataset by orders of magnitude—teaching the model to recognize objects regardless of their position, orientation, lighting, or context.
This isn't merely a practical trick; it's a principled approach with deep connections to invariance, equivariance, and the geometry of learned representations. Understanding augmentation thoroughly is essential for any practitioner building robust, deployable deep learning systems.
By the end of this page, you will understand the theoretical foundations of data augmentation, master the full taxonomy of image transformations (geometric, photometric, and synthetic), implement augmentations correctly with proper mathematical formulations, and develop intuition for which augmentations benefit which tasks.
Before examining specific augmentation techniques, we must understand why augmentation works from a theoretical perspective. This understanding guides principled application rather than blind recipe-following.
At its core, data augmentation enforces invariance—the property that a model's output should not change under certain transformations of its input. When we train a dog classifier, we want p(dog|x) to remain stable whether the input image shows the dog in the center, corner, rotated, or under different lighting. Mathematically, we desire:
$$f(T(x)) = f(x) \quad \forall T \in \mathcal{T}$$
where $\mathcal{T}$ is a family of transformations under which predictions should be invariant.
From an optimization standpoint, augmentation acts as a regularizer. Consider the expected risk we wish to minimize:
$$R(\theta) = \mathbb{E}_{(x,y) \sim p_{data}}[L(f_\theta(x), y)]$$
With augmentation, we instead minimize:
$$R_{aug}(\theta) = \mathbb{E}_{T \sim p_T}\,\mathbb{E}_{(x,y) \sim p_{data}}[L(f_\theta(T(x)), y)]$$
This modified objective smooths the loss landscape and reduces the model's sensitivity to input perturbations, directly improving generalization.
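In code, minimizing this augmented objective amounts to sampling a fresh transformation for every example at every step before the forward pass. A minimal sketch, assuming `model`, `loss_fn`, `optimizer`, and a list of transform callables are defined elsewhere:

```python
import random
import torch

def augmented_training_step(model, loss_fn, optimizer, batch, transforms):
    """One step of minimizing R_aug: sample a random T per example,
    apply it to the input, and keep the label unchanged."""
    images, labels = batch
    augmented = torch.stack([random.choice(transforms)(img) for img in images])
    optimizer.zero_grad()
    loss = loss_fn(model(augmented), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```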
Data augmentation can be formally understood as Vicinal Risk Minimization (VRM), where we replace the empirical distribution with a "vicinal" distribution that spreads probability mass around each training point. Each augmentation defines a vicinity function that determines how probability is allocated around observed samples.
Augmentation can be understood as expanding the support of our training distribution. Given limited samples $\{x_1, ..., x_n\}$ from the true data distribution $p_{data}(x)$, augmentation creates an augmented distribution:
$$p_{aug}(x) = \frac{1}{|\mathcal{T}|}\sum_{T \in \mathcal{T}} p_{emp}(T^{-1}(x))$$
where $p_{emp}$ is the empirical distribution over training samples. Effective augmentation ensures $p_{aug}$ better approximates $p_{data}$ by filling in regions of input space that would otherwise have zero training density.
Not all tasks require invariance. In semantic segmentation, we want pixel-wise predictions to transform along with the input—this is equivariance:
$$f(T(x)) = T(f(x)) \quad \text{(Equivariance)}$$
For segmentation, when we horizontally flip an image, the predicted mask should also flip horizontally. This distinction is crucial: the same transformation may appear in augmentation pipelines for both tasks, but with different label handling strategies.
| Task | Desired Property | Label Transformation | Example |
|---|---|---|---|
| Image Classification | Invariance | Labels unchanged | Rotated cat → still labeled 'cat' |
| Object Detection | Equivariance | Bounding boxes transform | Flipped image → flipped box coordinates |
| Semantic Segmentation | Equivariance | Masks transform spatially | Rotated image → rotated mask |
| Pose Estimation | Equivariance | Keypoints transform | Scaled image → scaled keypoint locations |
| Image Captioning | Invariance | Captions unchanged | Color-jittered image → same caption |
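A minimal sketch of the label-handling distinction in the table above, for a horizontal flip, assuming a `(C, H, W)` image tensor and an `(H, W)` segmentation mask:

```python
import torch

def hflip_classification(image: torch.Tensor, label: int):
    # Invariance: the input changes, the label does not.
    return torch.flip(image, dims=[-1]), label

def hflip_segmentation(image: torch.Tensor, mask: torch.Tensor):
    # Equivariance: the same spatial transform is applied to the mask.
    return torch.flip(image, dims=[-1]), torch.flip(mask, dims=[-1])
```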
Geometric transformations modify the spatial structure of images—their position, scale, rotation, and shape. These are among the most fundamental and universally applicable augmentations.
Random cropping extracts rectangular regions from random locations, forcing the model to recognize objects even when partially visible or off-center. This is perhaps the single most important augmentation for image classification.
Mathematical Formulation: Given an image $I$ of size $H \times W$, a random crop selects coordinates $(h_1, w_1)$ and extracts a region of size $h \times w$:
$$I_{crop}[i,j] = I[h_1 + i, w_1 + j] \quad \text{for } i \in [0, h), j \in [0, w)$$
where:

- $(h_1, w_1)$ is the randomly sampled top-left corner of the crop
- $h \times w$ is the crop size, which may itself be sampled (as in the random resized crop below)
ResNet-style random resized crop is particularly effective: randomly select a crop with area ratio in $[0.08, 1.0]$ and aspect ratio in $[3/4, 4/3]$, then resize to target dimensions.
```python
import numpy as np
from PIL import Image


def random_resized_crop(
    image: Image.Image,
    target_size: tuple = (224, 224),
    scale: tuple = (0.08, 1.0),
    ratio: tuple = (3/4, 4/3)
) -> Image.Image:
    """
    Random resized crop as used in ResNet training.

    Parameters:
    -----------
    image : PIL Image
        Input image to crop and resize
    target_size : tuple
        Final (height, width) after resizing
    scale : tuple
        Range of crop area relative to original (min, max)
    ratio : tuple
        Range of aspect ratios (min, max)

    Returns:
    --------
    Cropped and resized PIL Image
    """
    width, height = image.size
    area = width * height

    for _ in range(10):  # Try multiple times
        # Sample target area and aspect ratio
        target_area = np.random.uniform(scale[0], scale[1]) * area
        log_ratio = np.log(ratio)
        aspect_ratio = np.exp(np.random.uniform(log_ratio[0], log_ratio[1]))

        # Compute crop dimensions
        crop_width = int(round(np.sqrt(target_area * aspect_ratio)))
        crop_height = int(round(np.sqrt(target_area / aspect_ratio)))

        # Check if valid crop is possible
        if 0 < crop_width <= width and 0 < crop_height <= height:
            # Random position
            x = np.random.randint(0, width - crop_width + 1)
            y = np.random.randint(0, height - crop_height + 1)

            # Crop and resize (PIL's resize expects (width, height))
            return image.crop((x, y, x + crop_width, y + crop_height)).resize(
                (target_size[1], target_size[0]), Image.BILINEAR
            )

    # Fallback: center crop matching the target aspect ratio, then resize
    scale_factor = min(width / target_size[1], height / target_size[0])
    crop_width = int(target_size[1] * scale_factor)
    crop_height = int(target_size[0] * scale_factor)
    x = (width - crop_width) // 2
    y = (height - crop_height) // 2
    return image.crop((x, y, x + crop_width, y + crop_height)).resize(
        (target_size[1], target_size[0]), Image.BILINEAR
    )
```

Horizontal flipping reflects the image across the vertical axis. For most natural scenes and objects, horizontal flip is semantically valid—a flipped dog is still a dog.
$$I_{flip}[i,j] = I[i, W-1-j]$$
Vertical flipping reflects across the horizontal axis and is appropriate for aerial imagery, medical scans, or satellite photos where orientation is arbitrary:
$$I_{flip}[i,j] = I[H-1-i, j]$$
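The two index formulas translate directly into array slicing. A minimal NumPy sketch for an `(H, W)` or `(H, W, C)` array:

```python
import numpy as np

def hflip(image: np.ndarray) -> np.ndarray:
    # I_flip[i, j] = I[i, W-1-j]
    return image[:, ::-1].copy()

def vflip(image: np.ndarray) -> np.ndarray:
    # I_flip[i, j] = I[H-1-i, j]
    return image[::-1, :].copy()
```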
Not all tasks tolerate flipping. Text recognition would fail with horizontally flipped text. Digit recognition cannot flip '6' and '9' interchangeably. Medical imaging may have left/right significance. Always validate that augmentations preserve the task's semantic structure.
Random rotation rotates the image by an angle $\theta$ sampled from a specified range. The transformation matrix for 2D rotation around the center $(c_x, c_y)$ is:
$$\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} x - c_x \\ y - c_y \end{bmatrix} + \begin{bmatrix} c_x \\ c_y \end{bmatrix}$$
Rotation introduces empty pixels at corners that must be handled via padding (constant, reflection, or wrap-around).
Common rotation ranges:

- Natural images (classification, detection): small angles, typically ±10° to ±15°
- Aerial, satellite, and microscopy imagery: the full ±180°, since orientation is arbitrary
- Digits and text: small angles only, to avoid label ambiguity (e.g., '6' vs. '9')
More general transformations combine rotation, translation, scaling, and shearing:
Affine transformation (preserves parallel lines): $$\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} t_x \\ t_y \end{bmatrix}$$
Perspective transformation (simulates viewpoint changes): $$x' = \frac{a_{11}x + a_{12}y + a_{13}}{a_{31}x + a_{32}y + a_{33}}, \quad y' = \frac{a_{21}x + a_{22}y + a_{23}}{a_{31}x + a_{32}y + a_{33}}$$
Perspective transforms are particularly valuable for document analysis, OCR, and scene understanding where viewpoint varies.
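A minimal sketch of a random perspective warp, assuming OpenCV is available; the corner-jitter parameterization and the `max_shift` value are illustrative choices, not a standard recipe:

```python
import cv2
import numpy as np

def random_perspective(image: np.ndarray, max_shift: float = 0.1) -> np.ndarray:
    """Warp an H x W x C image by randomly jittering its four corners,
    approximating a viewpoint change."""
    h, w = image.shape[:2]
    src = np.float32([[0, 0], [w - 1, 0], [w - 1, h - 1], [0, h - 1]])
    jitter = np.random.uniform(-max_shift, max_shift, size=(4, 2)) * [w, h]
    dst = (src + jitter).astype(np.float32)
    M = cv2.getPerspectiveTransform(src, dst)  # solves for a_11 ... a_33
    return cv2.warpPerspective(image, M, (w, h), borderMode=cv2.BORDER_REFLECT)
```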
```python
import torch
import torch.nn.functional as F
import numpy as np
from typing import Tuple


def get_affine_matrix(
    angle: float = 0,
    translate: Tuple[float, float] = (0, 0),
    scale: float = 1.0,
    shear: Tuple[float, float] = (0, 0)
) -> torch.Tensor:
    """
    Compute affine transformation matrix combining rotation,
    translation, scaling, and shearing.

    Parameters:
    -----------
    angle : float
        Rotation angle in degrees
    translate : tuple
        Translation as fraction of image size (tx, ty)
    scale : float
        Isotropic scaling factor
    shear : tuple
        Shear factors (shear_x, shear_y) in degrees

    Returns:
    --------
    torch.Tensor of shape (2, 3): Affine transformation matrix
    """
    # Convert to radians
    angle_rad = np.deg2rad(angle)
    shear_x = np.deg2rad(shear[0])
    shear_y = np.deg2rad(shear[1])

    # Rotation matrix
    cos_a, sin_a = np.cos(angle_rad), np.sin(angle_rad)
    R = np.array([
        [cos_a, -sin_a],
        [sin_a, cos_a]
    ])

    # Shear matrix
    S = np.array([
        [1, np.tan(shear_x)],
        [np.tan(shear_y), 1]
    ])

    # Combined linear transformation
    M = scale * R @ S

    # Full affine matrix with translation
    affine = np.array([
        [M[0, 0], M[0, 1], translate[0]],
        [M[1, 0], M[1, 1], translate[1]]
    ])

    return torch.tensor(affine, dtype=torch.float32)


def apply_affine_transform(
    image: torch.Tensor,
    affine_matrix: torch.Tensor,
    mode: str = 'bilinear',
    padding_mode: str = 'reflection'
) -> torch.Tensor:
    """
    Apply affine transformation to image tensor.

    Parameters:
    -----------
    image : torch.Tensor
        Image tensor of shape (C, H, W) or (B, C, H, W)
    affine_matrix : torch.Tensor
        Affine matrix of shape (2, 3)
    mode : str
        Interpolation mode: 'bilinear' or 'nearest'
    padding_mode : str
        Padding for out-of-bounds: 'zeros', 'border', 'reflection'

    Returns:
    --------
    Transformed image tensor
    """
    # Ensure batch dimension
    if image.dim() == 3:
        image = image.unsqueeze(0)
        squeeze = True
    else:
        squeeze = False

    B, C, H, W = image.shape

    # Expand affine matrix for batch
    theta = affine_matrix.unsqueeze(0).expand(B, -1, -1)

    # Create sampling grid
    grid = F.affine_grid(theta, image.size(), align_corners=False)

    # Apply transformation
    output = F.grid_sample(image, grid, mode=mode,
                           padding_mode=padding_mode, align_corners=False)

    if squeeze:
        output = output.squeeze(0)

    return output
```

Elastic deformations apply spatially-varying displacements to simulate non-rigid distortions. These are particularly effective for handwritten text, biological cells, or any domain with natural shape variability.
The displacement field $(\Delta x, \Delta y)$ at each pixel is generated by:
$$x' = x + \alpha \cdot (G_\sigma * \epsilon_x)(x,y)$$ $$y' = y + \alpha \cdot (G_\sigma * \epsilon_y)(x,y)$$
where $\epsilon_x, \epsilon_y \sim \mathcal{N}(0,1)$ and $G_\sigma$ is a Gaussian kernel.
Elastic transforms set state-of-the-art MNIST digit recognition results when introduced (Simard et al., 2003) and remain valuable for medical imaging (cell segmentation, tissue deformation) and document analysis.
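A minimal sketch of the displacement-field construction above for a single-channel image, using SciPy for the Gaussian smoothing and the resampling; the `alpha` and `sigma` defaults are illustrative, not a canonical setting:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_transform(image: np.ndarray, alpha: float = 34.0, sigma: float = 4.0) -> np.ndarray:
    """Elastic deformation of a 2D grayscale image: smooth a random
    displacement field with a Gaussian, scale by alpha, and resample."""
    h, w = image.shape
    dx = alpha * gaussian_filter(np.random.standard_normal((h, w)), sigma)
    dy = alpha * gaussian_filter(np.random.standard_normal((h, w)), sigma)
    y, x = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')
    coords = np.stack([y + dy, x + dx])  # target sampling coordinates
    return map_coordinates(image, coords, order=1, mode='reflect')
```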
Photometric transformations modify pixel intensities without changing spatial structure. These simulate variations in lighting, camera settings, and environmental conditions.
Color jittering randomly perturbs brightness, contrast, saturation, and hue. This is essential for robustness to lighting variations.
Brightness and contrast adjustment: Scales all pixel values by a factor and shifts them by an offset: $$I' = \alpha \cdot I + \beta$$
where $\alpha$ controls contrast and $\beta$ controls brightness.
Saturation adjustment: Modifies color intensity in HSV or HSL space: $$S' = \text{clip}(S \cdot \gamma, 0, 1)$$
Hue shift: Rotates the hue channel: $$H' = (H + \Delta H) \mod 360$$
```python
import torch
import numpy as np


def rgb_to_hsv(rgb: torch.Tensor) -> torch.Tensor:
    """
    Convert RGB image to HSV color space.

    Parameters:
    -----------
    rgb : torch.Tensor
        RGB image tensor of shape (C=3, H, W) with values in [0, 1]

    Returns:
    --------
    HSV tensor of shape (C=3, H, W)
    """
    r, g, b = rgb[0], rgb[1], rgb[2]

    max_val, max_idx = rgb.max(dim=0)
    min_val = rgb.min(dim=0)[0]
    diff = max_val - min_val

    # Value
    v = max_val

    # Saturation
    s = torch.where(max_val > 0, diff / max_val, torch.zeros_like(max_val))

    # Hue
    h = torch.zeros_like(max_val)
    mask = diff > 0

    # When max is R
    mask_r = mask & (max_idx == 0)
    h[mask_r] = (60 * (g[mask_r] - b[mask_r]) / diff[mask_r]) % 360

    # When max is G
    mask_g = mask & (max_idx == 1)
    h[mask_g] = 60 * (b[mask_g] - r[mask_g]) / diff[mask_g] + 120

    # When max is B
    mask_b = mask & (max_idx == 2)
    h[mask_b] = 60 * (r[mask_b] - g[mask_b]) / diff[mask_b] + 240

    h = h / 360  # Normalize to [0, 1]

    return torch.stack([h, s, v])


def color_jitter(
    image: torch.Tensor,
    brightness: float = 0.2,
    contrast: float = 0.2,
    saturation: float = 0.2,
    hue: float = 0.05
) -> torch.Tensor:
    """
    Apply random color jittering to image.

    Parameters:
    -----------
    image : torch.Tensor
        RGB image of shape (C=3, H, W) with values in [0, 1]
    brightness : float
        Max brightness adjustment factor
    contrast : float
        Max contrast adjustment factor
    saturation : float
        Max saturation adjustment factor
    hue : float
        Max hue shift (as fraction of full circle)

    Returns:
    --------
    Color-jittered image tensor
    """
    # Random order of transformations
    transforms = []

    # Brightness: I' = I + delta
    if brightness > 0:
        def adjust_brightness(img):
            delta = torch.empty(1).uniform_(-brightness, brightness)
            return (img + delta).clamp(0, 1)
        transforms.append(adjust_brightness)

    # Contrast: I' = mean + alpha * (I - mean)
    if contrast > 0:
        def adjust_contrast(img):
            alpha = torch.empty(1).uniform_(1 - contrast, 1 + contrast)
            mean = img.mean(dim=(1, 2), keepdim=True)
            return (mean + alpha * (img - mean)).clamp(0, 1)
        transforms.append(adjust_contrast)

    # Saturation: blend the image with its grayscale version
    if saturation > 0:
        def adjust_saturation(img):
            factor = torch.empty(1).uniform_(1 - saturation, 1 + saturation)
            gray = 0.2989 * img[0] + 0.5870 * img[1] + 0.1140 * img[2]
            return (factor * img + (1 - factor) * gray.unsqueeze(0)).clamp(0, 1)
        transforms.append(adjust_saturation)

    # Hue: rotate H channel in HSV
    if hue > 0:
        def adjust_hue(img):
            delta = torch.empty(1).uniform_(-hue, hue)
            hsv = rgb_to_hsv(img)
            hsv[0] = (hsv[0] + delta) % 1.0
            return hsv_to_rgb(hsv)  # Assumes an hsv_to_rgb inverse is defined
        transforms.append(adjust_hue)

    # Apply in random order
    np.random.shuffle(transforms)
    for transform in transforms:
        image = transform(image)

    return image
```

Random grayscale occasionally converts images to grayscale, forcing the model to rely on shape and texture rather than color:
$$I_{gray} = 0.2989 \cdot R + 0.5870 \cdot G + 0.1140 \cdot B$$
These coefficients approximate human luminance perception. Random grayscale with probability 10-20% is a component of many successful training recipes (e.g., SimCLR, BYOL).
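A minimal sketch; the helper names are chosen to match the pipeline example later on this page (an assumption, not a fixed API):

```python
import torch

def to_grayscale(image: torch.Tensor) -> torch.Tensor:
    """Replace an RGB (3, H, W) tensor by its luminance, replicated across channels."""
    gray = 0.2989 * image[0] + 0.5870 * image[1] + 0.1140 * image[2]
    return gray.unsqueeze(0).expand(3, -1, -1).clone()

def random_grayscale(image: torch.Tensor, p: float = 0.2) -> torch.Tensor:
    """Convert to grayscale with probability p; use to_grayscale directly
    when an outer pipeline already gates on an application probability."""
    return to_grayscale(image) if torch.rand(1).item() < p else image
```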
Gaussian blur applies a Gaussian smoothing kernel to simulate focus variations or low-resolution inputs:
$$G(x,y) = \frac{1}{2\pi\sigma^2}e^{-\frac{x^2+y^2}{2\sigma^2}}$$
$$I_{blur} = I * G$$
Random blur with kernel sizes 3-23px and $\sigma \in [0.1, 2.0]$ is standard in contrastive and self-supervised learning.
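A minimal helper built on torchvision's `GaussianBlur` transform; the name `gaussian_blur` and the kernel size of 23 for 224-pixel inputs are assumptions made to match the pipeline example later:

```python
import torch
from torchvision import transforms

# Sigma is re-sampled uniformly from [0.1, 2.0] on every call.
_blur = transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0))

def gaussian_blur(image: torch.Tensor) -> torch.Tensor:
    """Apply Gaussian blur with a randomly sampled sigma to a (C, H, W) tensor."""
    return _blur(image)
```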
Additive Gaussian noise simulates sensor noise in cameras:
$$I_{noisy} = I + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2)$$
Typical $\sigma$ values range from 0.01 to 0.1 for normalized [0,1] images. Gaussian noise is particularly important for:

- Low-light, high-ISO, or otherwise noisy sensor domains
- Medical and scientific imaging, where acquisition noise is intrinsic
- Improving robustness to small input perturbations
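A minimal sketch for images normalized to [0, 1]:

```python
import torch

def add_gaussian_noise(image: torch.Tensor, sigma: float = 0.05) -> torch.Tensor:
    """Additive Gaussian noise; sigma = 0.05 sits in the typical 0.01-0.1 range."""
    noise = torch.randn_like(image) * sigma
    return (image + noise).clamp(0.0, 1.0)
```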
While not strictly augmentation, proper normalization is essential:
Per-channel normalization: $$I_{norm} = \frac{I - \mu}{\sigma}$$
ImageNet statistics ($\mu = [0.485, 0.456, 0.406]$, $\sigma = [0.229, 0.224, 0.225]$) are widely used even for non-ImageNet tasks when using pretrained backbones.
Apply normalization after all other augmentations, not before. Photometric augmentations like color jitter expect pixel values in [0,1] or [0,255], not standardized values with mean 0. Incorrect ordering is a common source of degraded performance.
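A minimal normalization helper; the name `imagenet_normalize` matches the one assumed in the pipeline example below:

```python
import torch

IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

def imagenet_normalize(image: torch.Tensor) -> torch.Tensor:
    """Per-channel standardization of a (3, H, W) tensor in [0, 1];
    applied as the final step of the pipeline."""
    return (image - IMAGENET_MEAN) / IMAGENET_STD
```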
Erasing augmentations randomly remove or corrupt rectangular regions, forcing models to recognize objects from partial information. This family of techniques addresses occlusion robustness—a critical real-world challenge.
Random Erasing (closely related to Cutout) masks a rectangular region with random noise or a constant value:
$$I_{erased}[i,j] = \begin{cases} v & \text{if } (i,j) \in R_{erase} \\ I[i,j] & \text{otherwise} \end{cases}$$
where $R_{erase}$ is a randomly positioned rectangle and $v$ is either:

- per-pixel random noise,
- a constant value (zero or another fixed value), or
- the per-channel ImageNet mean (see the implementation below).
```python
import torch
import numpy as np
from typing import Tuple


class RandomErasing:
    """
    Random Erasing augmentation as described in:
    'Random Erasing Data Augmentation' (Zhong et al., 2020)

    Randomly masks rectangular regions with noise or constant values,
    forcing the model to learn from partial object views.
    """

    def __init__(
        self,
        probability: float = 0.5,
        scale: Tuple[float, float] = (0.02, 0.33),
        ratio: Tuple[float, float] = (0.3, 3.3),
        value: str = 'random',  # 'random', 'zero', 'mean', or float
        inplace: bool = False
    ):
        """
        Parameters:
        -----------
        probability : float
            Probability of applying random erasing
        scale : tuple
            Range of proportion of erased area vs input image
        ratio : tuple
            Range of aspect ratio of erased area
        value : str or float
            Fill value for erased region:
            - 'random': random uniform noise
            - 'zero': constant 0
            - 'mean': ImageNet mean values
            - float: constant value
        """
        self.probability = probability
        self.scale = scale
        self.ratio = ratio
        self.value = value
        self.inplace = inplace

        # ImageNet mean for 'mean' fill option
        self.imagenet_mean = torch.tensor([0.485, 0.456, 0.406])

    def __call__(self, image: torch.Tensor) -> torch.Tensor:
        """
        Apply random erasing to image tensor.

        Parameters:
        -----------
        image : torch.Tensor
            Image of shape (C, H, W) with values in [0, 1]

        Returns:
        --------
        Image with randomly erased region
        """
        if np.random.random() > self.probability:
            return image

        if not self.inplace:
            image = image.clone()

        C, H, W = image.shape
        area = H * W

        for _ in range(10):  # Maximum attempts
            # Sample target area and aspect ratio
            target_area = np.random.uniform(self.scale[0], self.scale[1]) * area
            aspect_ratio = np.random.uniform(self.ratio[0], self.ratio[1])

            # Compute erase dimensions
            h = int(round(np.sqrt(target_area * aspect_ratio)))
            w = int(round(np.sqrt(target_area / aspect_ratio)))

            if h < H and w < W:
                # Random position
                i = np.random.randint(0, H - h + 1)
                j = np.random.randint(0, W - w + 1)

                # Generate fill value
                if self.value == 'random':
                    fill = torch.rand(C, h, w)
                elif self.value == 'zero':
                    fill = torch.zeros(C, h, w)
                elif self.value == 'mean':
                    fill = self.imagenet_mean.view(C, 1, 1).expand(C, h, w)
                else:
                    fill = torch.full((C, h, w), self.value)

                # Apply erasing
                image[:, i:i+h, j:j+w] = fill
                return image

        return image
```

Cutout is a simpler variant that always uses a square mask with a fixed side length (typically 16-64 pixels for 32x32 CIFAR images, or scaled proportionally for larger images). Unlike Random Erasing, Cutout:

- is applied to every image rather than with a sampled probability
- uses a fixed-size square mask instead of sampling the area and aspect ratio
- fills the masked region with zeros rather than random noise
- allows the mask to extend partially beyond the image border
Cutout is particularly effective for small-scale image classification (CIFAR-10/100) and has largely been superseded by more flexible techniques for ImageNet-scale training.
GridMask takes structured erasing further by removing multiple rectangular regions arranged in a grid pattern:
$$R_{erase} = \bigcup_{i,j} \{(x,y) : i \cdot d \leq x < i \cdot d + l,\; j \cdot d \leq y < j \cdot d + l\}$$
where $d$ is the grid spacing and $l$ is the unit mask size. GridMask provides more uniform coverage than random erasing and can be scheduled to increase difficulty during training.
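A minimal sketch of the masking pattern defined above; the offset randomization, mask rotation, and training-time scheduling of the full GridMask method are omitted:

```python
import torch

def grid_mask(image: torch.Tensor, d: int = 32, ratio: float = 0.5) -> torch.Tensor:
    """Zero out an l x l square in every d x d grid cell of a (C, H, W) float
    tensor, where l = ratio * d."""
    C, H, W = image.shape
    l = max(1, int(d * ratio))
    mask = torch.ones(H, W)
    for i in range(0, H, d):
        for j in range(0, W, d):
            mask[i:i + l, j:j + l] = 0.0
    return image * mask.unsqueeze(0)
```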
For object detection, erasing augmentations require careful handling. If an erasing rectangle fully covers a bounding box, that ground truth should be removed. Partial occlusions require deciding whether to keep, remove, or clip the bounding box—choices that affect training dynamics.
Individual augmentations are typically composed into augmentation pipelines—sequences of transformations applied to each training sample. Effective composition requires understanding transformation interactions and computational constraints.
The standard approach applies transformations in sequence:
$$T_{pipeline} = T_n \circ T_{n-1} \circ \cdots \circ T_1$$
Critical ordering principles:

- Apply geometric transforms (crop, flip, rotate) first, so later operations work on the final spatial layout.
- Apply photometric transforms on un-normalized pixel values in [0, 1] or [0, 255].
- Apply erasing after photometric transforms so the fill values match the final pixel statistics.
- Apply normalization last, always (see the warning earlier on this page).
```python
import torch
from typing import List
import numpy as np


class AugmentationPipeline:
    """
    Composable augmentation pipeline with proper ordering
    and probabilistic application.

    Ensures geometric transforms are applied before photometric,
    and normalization is always applied last.
    """

    def __init__(self, transforms: List[dict]):
        """
        Parameters:
        -----------
        transforms : list of dict
            Each dict contains:
            - 'fn': callable augmentation function
            - 'prob': probability of applying (default 1.0)
            - 'stage': 'geometric', 'photometric', 'erase', or 'normalize'
        """
        # Sort by stage to ensure correct ordering
        stage_order = {
            'geometric': 0,
            'photometric': 1,
            'erase': 2,
            'normalize': 3
        }
        self.transforms = sorted(
            transforms,
            key=lambda x: stage_order.get(x.get('stage', 'photometric'), 1)
        )

    def __call__(self, image: torch.Tensor) -> torch.Tensor:
        """
        Apply augmentation pipeline to image.

        Parameters:
        -----------
        image : torch.Tensor
            Input image of shape (C, H, W)

        Returns:
        --------
        Augmented image tensor
        """
        for transform in self.transforms:
            fn = transform['fn']
            prob = transform.get('prob', 1.0)

            if np.random.random() < prob:
                image = fn(image)

        return image


# Example: Standard ImageNet training pipeline
def create_imagenet_pipeline(
    crop_size: int = 224,
    hflip_prob: float = 0.5,
    jitter_strengths: tuple = (0.4, 0.4, 0.4, 0.1),
    random_erase_prob: float = 0.25
):
    """
    Create standard ImageNet training augmentation pipeline.
    Based on ResNet training recipe with modern additions.

    Assumes the helpers defined earlier on this page are in scope
    (random_resized_crop, color_jitter, random_grayscale, gaussian_blur,
    RandomErasing, imagenet_normalize). Note that random_resized_crop
    operates on PIL images, so in practice the crop is applied before
    conversion to a tensor.
    """
    transforms = [
        # Stage 1: Geometric transforms
        {
            'fn': lambda x: random_resized_crop(x, (crop_size, crop_size)),
            'prob': 1.0,
            'stage': 'geometric'
        },
        {
            'fn': lambda x: torch.flip(x, dims=[-1]),
            'prob': hflip_prob,
            'stage': 'geometric'
        },
        # Stage 2: Photometric transforms
        {
            'fn': lambda x: color_jitter(
                x,
                brightness=jitter_strengths[0],
                contrast=jitter_strengths[1],
                saturation=jitter_strengths[2],
                hue=jitter_strengths[3]
            ),
            'prob': 0.8,
            'stage': 'photometric'
        },
        {
            'fn': random_grayscale,
            'prob': 0.2,
            'stage': 'photometric'
        },
        {
            'fn': gaussian_blur,
            'prob': 0.1,
            'stage': 'photometric'
        },
        # Stage 3: Erasing
        {
            'fn': RandomErasing(probability=1.0),
            'prob': random_erase_prob,
            'stage': 'erase'
        },
        # Stage 4: Normalization (always apply)
        {
            'fn': imagenet_normalize,
            'prob': 1.0,
            'stage': 'normalize'
        },
    ]

    return AugmentationPipeline(transforms)
```

Beyond individual transform probabilities, we can modulate the overall intensity of augmentation:
- Augmentation magnitude: how strong each individual transformation is
- Augmentation probability: whether each transformation is applied
- Augmentation count: how many transformations are applied per sample
RandAugment (discussed later) controls these with just two hyperparameters (N, M), dramatically simplifying the search space.
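As a rough sketch of that two-knob interface (the op list, the 0-30 magnitude scale, and the sampling-without-replacement choice here are simplifications, not the official implementation):

```python
import random
import torch

def randaugment_sketch(image: torch.Tensor, ops: list, n: int = 2, m: int = 9) -> torch.Tensor:
    """Apply N randomly chosen ops, each at a magnitude derived from M.
    `ops` is a list of callables taking (image, magnitude in [0, 1])."""
    magnitude = m / 30.0  # magnitudes are commonly defined on a 0-30 scale
    for op in random.sample(ops, k=min(n, len(ops))):
        image = op(image, magnitude)
    return image
```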
For reproducible training, augmentations must be deterministically controlled:
```python
import random
import numpy as np
import torch

# Set global seed for reproducibility
def set_augmentation_seed(seed: int):
    np.random.seed(seed)
    torch.manual_seed(seed)
    random.seed(seed)

# Per-sample seeding for DataLoader workers in distributed training
def worker_init_fn(worker_id: int):
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)
```
Note that augmentation randomness is intentional—the same image should receive different augmentations across epochs. What should be reproducible is the sequence of random augmentations for debugging and comparison.
Different vision tasks require different augmentation strategies. What works for classification may harm detection; what helps segmentation may break pose estimation. Understanding task constraints is essential for effective augmentation.
Goal: Invariance to geometric and photometric variations
Standard recipe (EfficientNet/ResNet-style training):

- Random resized crop to the training resolution (area scale 0.08-1.0, aspect ratio 3/4-4/3)
- Horizontal flip with probability 0.5
- Color jitter (brightness/contrast/saturation around 0.4, hue around 0.1)
- Random erasing with probability around 0.25
- Per-channel normalization with dataset statistics
- Optionally, a learned policy such as AutoAugment or RandAugment layered on top (see the table below)
Fine-grained classification (birds, cars, flowers): Use weaker augmentation to preserve diagnostic details. Reduce crop scale range and color jitter intensity.
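As a hedged illustration, the pipeline builder defined earlier can be re-parameterized with milder settings; the exact values below are illustrative, and narrowing the crop scale range would additionally require passing a larger minimum scale (e.g., 0.5) to random_resized_crop, which that builder hardcodes:

```python
# Illustrative weaker settings for fine-grained tasks (values are not canonical).
fine_grained_pipeline = create_imagenet_pipeline(
    crop_size=224,
    hflip_prob=0.5,
    jitter_strengths=(0.1, 0.1, 0.1, 0.02),  # reduced from (0.4, 0.4, 0.4, 0.1)
    random_erase_prob=0.0,                   # keep diagnostic parts intact
)
```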
| Task | Geometric | Photometric | Erasing | Special |
|---|---|---|---|---|
| Classification | Strong random crop, flip | Strong color jitter | Random erasing 25% | RandAugment or AutoAugment |
| Object Detection | Moderate crop, flip, scale | Moderate jitter | With box validation | Mosaic, Copy-Paste |
| Segmentation | Crop, flip, rotate with mask | Moderate jitter | Rare | Preserve class balance in crops |
| Pose Estimation | Crop, flip, rotate with keypoints | Light jitter | None | Half-body crop |
| OCR/Text | Perspective, elastic | Minimal | None | Font variation, background variation |
| Medical Imaging | Rotation, elastic, flip (if appropriate) | Intensity normalization | Rare | Domain-specific physics simulation |
Goal: Equivariance for bounding boxes, invariance for class labels
Critical considerations:

- Bounding boxes must receive the same geometric transform as the image (flips, crops, scaling); photometric transforms leave them unchanged.
- Boxes that fall mostly outside a crop should be dropped, typically using a minimum visible-area ratio.
- Boxes partially cut by a crop must be clipped to the crop boundaries, as in the box-handling code later in this section.
- Erasing that fully covers a box should remove its ground truth (see the warning above).
YOLO Mosaic augmentation: Combines 4 images into one, providing:

- Four images' worth of objects, scales, and backgrounds in a single training sample
- Objects at unusual positions and in novel contexts, including many small objects near tile boundaries
- Greater effective diversity per batch, reducing the reliance on very large batch sizes
Goal: Equivariance for pixel-wise labels
Critical considerations:

- The mask must undergo exactly the same geometric transform (same sampled parameters) as the image.
- Masks must be resampled with nearest-neighbor interpolation; bilinear interpolation would invent fractional class labels.
- Photometric transforms apply to the image only, never to the mask.
- Random crops should preserve class balance, since rare classes are easily cropped out entirely.
```python
import torch
import numpy as np
from typing import Tuple, List


def flip_boxes_horizontal(
    boxes: torch.Tensor,
    image_width: int
) -> torch.Tensor:
    """
    Flip bounding boxes horizontally.

    Parameters:
    -----------
    boxes : torch.Tensor
        Bounding boxes of shape (N, 4) in format [x1, y1, x2, y2]
    image_width : int
        Width of the image

    Returns:
    --------
    Flipped bounding boxes
    """
    flipped = boxes.clone()
    flipped[:, 0] = image_width - boxes[:, 2]  # new x1 = W - old x2
    flipped[:, 2] = image_width - boxes[:, 0]  # new x2 = W - old x1
    return flipped


def crop_boxes(
    boxes: torch.Tensor,
    labels: torch.Tensor,
    crop_x1: int,
    crop_y1: int,
    crop_x2: int,
    crop_y2: int,
    min_visible_ratio: float = 0.3
) -> Tuple[torch.Tensor, torch.Tensor]:
    """
    Adjust bounding boxes after crop operation.

    Removes boxes that are mostly outside crop region,
    clips remaining boxes to crop boundaries.

    Parameters:
    -----------
    boxes : torch.Tensor
        Boxes in [x1, y1, x2, y2] format, shape (N, 4)
    labels : torch.Tensor
        Class labels, shape (N,)
    crop_x1, crop_y1, crop_x2, crop_y2 : int
        Crop region boundaries
    min_visible_ratio : float
        Minimum fraction of box area that must remain visible

    Returns:
    --------
    Clipped boxes and corresponding labels
    """
    # Original box areas
    original_areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])

    # Clip boxes to crop region
    clipped = boxes.clone()
    clipped[:, 0] = clipped[:, 0].clamp(min=crop_x1) - crop_x1
    clipped[:, 1] = clipped[:, 1].clamp(min=crop_y1) - crop_y1
    clipped[:, 2] = clipped[:, 2].clamp(max=crop_x2) - crop_x1
    clipped[:, 3] = clipped[:, 3].clamp(max=crop_y2) - crop_y1

    # Compute clipped areas
    clipped_widths = (clipped[:, 2] - clipped[:, 0]).clamp(min=0)
    clipped_heights = (clipped[:, 3] - clipped[:, 1]).clamp(min=0)
    clipped_areas = clipped_widths * clipped_heights

    # Keep boxes with sufficient visible area
    visible_ratios = clipped_areas / (original_areas + 1e-6)
    valid_mask = (
        (visible_ratios >= min_visible_ratio) &
        (clipped_widths > 0) &
        (clipped_heights > 0)
    )

    return clipped[valid_mask], labels[valid_mask]


def mosaic_augmentation(
    images: List[torch.Tensor],
    boxes_list: List[torch.Tensor],
    labels_list: List[torch.Tensor],
    output_size: int = 640
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    """
    YOLO-style Mosaic augmentation combining 4 images.

    Creates a single training sample from 4 different images,
    increasing object diversity and reducing batch size requirements.
    """
    assert len(images) == 4, "Mosaic requires exactly 4 images"

    s = output_size
    yc, xc = np.random.randint(s // 2, 3 * s // 2, 2)  # Mosaic center

    mosaic_img = torch.zeros(3, s * 2, s * 2)
    all_boxes = []
    all_labels = []

    # Placement positions: top-left, top-right, bottom-left, bottom-right
    for i, (img, boxes, labels) in enumerate(zip(images, boxes_list, labels_list)):
        C, h, w = img.shape

        if i == 0:  # Top-left
            x1a, y1a, x2a, y2a = max(xc - w, 0), max(yc - h, 0), xc, yc
            x1b, y1b, x2b, y2b = w - (x2a - x1a), h - (y2a - y1a), w, h
        elif i == 1:  # Top-right
            x1a, y1a, x2a, y2a = xc, max(yc - h, 0), min(xc + w, s * 2), yc
            x1b, y1b, x2b, y2b = 0, h - (y2a - y1a), min(w, x2a - x1a), h
        elif i == 2:  # Bottom-left
            x1a, y1a, x2a, y2a = max(xc - w, 0), yc, xc, min(s * 2, yc + h)
            x1b, y1b, x2b, y2b = w - (x2a - x1a), 0, w, min(y2a - y1a, h)
        else:  # Bottom-right
            x1a, y1a, x2a, y2a = xc, yc, min(xc + w, s * 2), min(s * 2, yc + h)
            x1b, y1b, x2b, y2b = 0, 0, min(w, x2a - x1a), min(y2a - y1a, h)

        # Place image tile
        mosaic_img[:, y1a:y2a, x1a:x2a] = img[:, y1b:y2b, x1b:x2b]

        # Adjust boxes for placement offset
        offset_x = x1a - x1b
        offset_y = y1a - y1b

        if len(boxes) > 0:
            adjusted_boxes = boxes.clone()
            adjusted_boxes[:, [0, 2]] += offset_x
            adjusted_boxes[:, [1, 3]] += offset_y
            all_boxes.append(adjusted_boxes)
            all_labels.append(labels)

    # Center crop to output size
    mosaic_img = mosaic_img[:, s//2:s//2+s, s//2:s//2+s]

    if all_boxes:
        all_boxes = torch.cat(all_boxes)
        all_labels = torch.cat(all_labels)

        # Shift into the cropped coordinate frame, then clip to the output region
        all_boxes[:, [0, 2]] -= s // 2
        all_boxes[:, [1, 3]] -= s // 2
        all_boxes, all_labels = crop_boxes(
            all_boxes, all_labels, 0, 0, s, s
        )
    else:
        all_boxes = torch.empty(0, 4)
        all_labels = torch.empty(0)

    return mosaic_img, all_boxes, all_labels
```

We've established the comprehensive foundation for understanding image augmentations—the theoretical basis, mathematical formulations, and practical implementation of the techniques that form the backbone of modern deep learning training pipelines.
What's Next:
Now that we've established the foundations of individual image augmentations, we'll explore Mixup and CutMix—revolutionary techniques that don't just transform individual images, but combine multiple training samples. These mixing strategies create synthetic training examples with interpolated labels, providing a fundamentally different form of regularization with deep connections to label smoothing and ensembling.
You now understand the theoretical motivation, mathematical formulations, and practical implementation of image augmentation. This foundation prepares you for advanced mixing strategies, learned augmentation policies, and test-time augmentation techniques covered in subsequent pages.