Channel Group Normalization is a powerful normalization technique designed to address the limitations of batch normalization, particularly when batch sizes are small or variable. Instead of normalizing across the entire batch, this method partitions channels into smaller groups and computes normalization statistics independently within each group.
Given a 4D input tensor X with shape (B, C, H, W)—where B represents batch size, C is the number of channels, H is height, and W is width—the normalization proceeds as follows:
Step 1: Channel Partitioning
The C channels are divided into G groups, where each group contains C/G channels. For this division to be valid, C must be evenly divisible by G.
Step 2: Group-wise Statistics Computation
For each sample in the batch and each channel group, compute the mean and variance across all spatial positions (H × W) and all channels within that group:
$$\mu_{b,g} = \frac{1}{(C/G) \cdot H \cdot W} \sum_{c \in \text{group}_g} \sum_{h,w} X_{b,c,h,w}$$

$$\sigma^2_{b,g} = \frac{1}{(C/G) \cdot H \cdot W} \sum_{c \in \text{group}_g} \sum_{h,w} (X_{b,c,h,w} - \mu_{b,g})^2$$
Step 3: Normalization
Each element is normalized using its group's statistics:
$$\hat{X}_{b,c,h,w} = \frac{X_{b,c,h,w} - \mu_{b,g}}{\sqrt{\sigma^2_{b,g} + \epsilon}}$$
Step 4: Scale and Shift (Affine Transformation)
Apply learnable per-channel scale (gamma) and shift (beta) parameters:
$$Y_{b,c,h,w} = \gamma_c \cdot \hat{X}_{b,c,h,w} + \beta_c$$
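The four steps above can be sketched directly in NumPy by reshaping the channel axis into (G, C/G) so that the group statistics fall out of a single reduction. This is a minimal sketch, not a reference solution; the epsilon value 1e-5 is an assumption, since the problem gives the symbol but not its value.

```python
import numpy as np

def group_norm(X, gamma, beta, num_groups, eps=1e-5):
    """Channel group normalization for a (B, C, H, W) tensor (eps assumed)."""
    B, C, H, W = X.shape
    assert C % num_groups == 0, "C must be evenly divisible by G"
    # Step 1: partition the C channels into G groups of C/G channels
    Xg = X.reshape(B, num_groups, C // num_groups, H, W)
    # Step 2: per-sample, per-group mean and variance over (C/G, H, W)
    mu = Xg.mean(axis=(2, 3, 4), keepdims=True)
    var = Xg.var(axis=(2, 3, 4), keepdims=True)
    # Step 3: normalize each element with its group's statistics
    Xhat = ((Xg - mu) / np.sqrt(var + eps)).reshape(B, C, H, W)
    # Step 4: per-channel affine transform
    gamma = np.asarray(gamma).reshape(1, C, 1, 1)
    beta = np.asarray(beta).reshape(1, C, 1, 1)
    return gamma * Xhat + beta
```

The `keepdims=True` keeps `mu` and `var` broadcastable against the grouped tensor, so Step 3 needs no explicit loop over samples or groups.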
Unlike batch normalization, which requires large batch sizes to compute stable statistics, group normalization computes statistics independently for each sample within channel groups, so its behavior does not depend on batch size. This makes it particularly effective in small-batch or variable-batch settings, such as object detection and segmentation, where batch statistics become unreliable.
Implement a function that performs channel group normalization on a 4D input tensor. The function should normalize each group of channels independently per sample, then apply the learned scale and shift parameters.
Output values should be rounded to 4 decimal places.
X = [[[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]], [[[9.0, 10.0], [11.0, 12.0]], [[13.0, 14.0], [15.0, 16.0]]]]
gamma = [1.0, 1.0]
beta = [0.0, 0.0]
num_groups = 2

Output:
[[[[-1.3416, -0.4472], [0.4472, 1.3416]], [[-1.3416, -0.4472], [0.4472, 1.3416]]], [[[-1.3416, -0.4472], [0.4472, 1.3416]], [[-1.3416, -0.4472], [0.4472, 1.3416]]]]

The input tensor has shape (2, 2, 2, 2) — 2 samples, 2 channels, 2×2 spatial dimensions.
With num_groups = 2 and 2 channels, each group contains 1 channel (2 ÷ 2 = 1).
For Sample 0, Group 0 (Channel 0), the values are [1.0, 2.0, 3.0, 4.0]:
- Mean: μ = (1 + 2 + 3 + 4) / 4 = 2.5
- Variance: σ² = ((−1.5)² + (−0.5)² + 0.5² + 1.5²) / 4 = 1.25
- Normalized (ignoring the small ε): (1 − 2.5)/√1.25 ≈ −1.3416, (2 − 2.5)/√1.25 ≈ −0.4472, (3 − 2.5)/√1.25 ≈ 0.4472, (4 − 2.5)/√1.25 ≈ 1.3416
Since gamma = 1.0 and beta = 0.0, the output equals the normalized values. The same pattern applies to all samples and groups.
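The arithmetic for this first group can be checked with a few lines of plain Python (ε = 1e-5 is an assumption, as the problem does not fix its value):

```python
import math

# Sample 0, Group 0 holds only channel 0: values [[1, 2], [3, 4]]
vals = [1.0, 2.0, 3.0, 4.0]
mu = sum(vals) / len(vals)                           # group mean: 2.5
var = sum((v - mu) ** 2 for v in vals) / len(vals)   # group variance: 1.25
eps = 1e-5                                           # assumed epsilon
norm = [round((v - mu) / math.sqrt(var + eps), 4) for v in vals]
print(mu, var, norm)  # 2.5 1.25 [-1.3416, -0.4472, 0.4472, 1.3416]
```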
X = [[[[1.0, 2.0], [3.0, 4.0]], [[2.0, 3.0], [4.0, 5.0]]]]
gamma = [1.0, 1.0]
beta = [0.0, 0.0]
num_groups = 1

Output:
[[[[-1.633, -0.8165], [0.0, 0.8165]], [[-0.8165, 0.0], [0.8165, 1.633]]]]

The input tensor has shape (1, 2, 2, 2) — 1 sample, 2 channels, 2×2 spatial dimensions.
With num_groups = 1, all 2 channels belong to a single group.
For Sample 0, Group 0 (Channels 0 and 1), all 8 values [1, 2, 3, 4, 2, 3, 4, 5] share one set of statistics:
- Mean: μ = 24 / 8 = 3.0
- Variance: σ² = (4 + 1 + 0 + 1 + 1 + 0 + 1 + 4) / 8 = 1.5
- Normalized (ignoring the small ε): channel 0 gives (1 − 3)/√1.5 ≈ −1.633 through (4 − 3)/√1.5 ≈ 0.8165; channel 1 gives (2 − 3)/√1.5 ≈ −0.8165 through (5 − 3)/√1.5 ≈ 1.633
This demonstrates how a single group computes one mean/variance pair for all channels, creating more uniform normalization across the entire feature map.
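The single-group case can likewise be verified by hand (again assuming ε = 1e-5), confirming that one mean/variance pair covers both channels:

```python
import math

# Sample 0, one group spanning channels 0 and 1: all 8 values together
vals = [1.0, 2.0, 3.0, 4.0, 2.0, 3.0, 4.0, 5.0]
mu = sum(vals) / len(vals)                           # shared mean: 3.0
var = sum((v - mu) ** 2 for v in vals) / len(vals)   # shared variance: 1.5
norm = [round((v - mu) / math.sqrt(var + 1e-5), 4) for v in vals]
print(norm)  # first four entries are channel 0, last four are channel 1
```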
X = [[[[0.0, 1.0], [2.0, 3.0]], [[4.0, 5.0], [6.0, 7.0]], [[8.0, 9.0], [10.0, 11.0]], [[12.0, 13.0], [14.0, 15.0]]]]
gamma = [2.0, 2.0, 0.5, 0.5]
beta = [1.0, -1.0, 0.5, -0.5]
num_groups = 2

Output:
[[[[-2.055, -1.1822], [-0.3093, 0.5636]], [[-0.5636, 0.3093], [1.1822, 2.055]], [[-0.2638, -0.0455], [0.1727, 0.3909]], [[-0.3909, -0.1727], [0.0455, 0.2638]]]]

The input tensor has shape (1, 4, 2, 2) — 1 sample, 4 channels, 2×2 spatial dimensions.
With num_groups = 2, channels are split into 2 groups of 2 channels each: Group 0 contains channels 0 and 1 (values 0–7), and Group 1 contains channels 2 and 3 (values 8–15).
Affine Parameters: gamma = [2.0, 2.0, 0.5, 0.5] scales channels 0–1 by 2 and channels 2–3 by 0.5; beta = [1.0, −1.0, 0.5, −0.5] shifts each channel by its own offset.
Process: Group 0 has mean 3.5 and variance 5.25; Group 1 has mean 11.5 and variance 5.25. Each value is normalized with its group's statistics, then scaled by its channel's gamma and shifted by its channel's beta.
This example demonstrates how different scale/shift parameters per channel allow the network to learn optimal transformations while maintaining normalized group statistics.
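This third example can be reproduced end to end with NumPy. The sketch below computes the two group statistics via a reshape and then applies the per-channel affine parameters; ε = 1e-5 is an assumed value.

```python
import numpy as np

# Example 3 input: 1 sample, 4 channels, 2x2 spatial, values 0..15
X = np.arange(16.0).reshape(1, 4, 2, 2)
gamma = np.array([2.0, 2.0, 0.5, 0.5]).reshape(1, 4, 1, 1)
beta = np.array([1.0, -1.0, 0.5, -0.5]).reshape(1, 4, 1, 1)

# Two groups of two channels each: reshape to (B, G, C//G, H, W)
Xg = X.reshape(1, 2, 2, 2, 2)
mu = Xg.mean(axis=(2, 3, 4), keepdims=True)   # group means: 3.5 and 11.5
var = Xg.var(axis=(2, 3, 4), keepdims=True)   # group variances: both 5.25
Xhat = ((Xg - mu) / np.sqrt(var + 1e-5)).reshape(1, 4, 2, 2)

# Per-channel scale and shift, rounded as the problem requires
Y = np.round(gamma * Xhat + beta, 4)
print(Y[0, 0])  # [[-2.055  -1.1822] [-0.3093  0.5636]]
```

Note that both groups have the same variance (5.25) because each spans 8 consecutive integers; only the means differ.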
Constraints
- C must be evenly divisible by num_groups.