Channel Group Normalization is a powerful normalization technique designed to address the limitations of batch normalization, particularly when batch sizes are small or variable. Instead of normalizing across the entire batch, this method partitions channels into smaller groups and computes normalization statistics independently within each group.
Given a 4D input tensor X with shape (B, C, H, W)—where B represents batch size, C is the number of channels, H is height, and W is width—the normalization proceeds as follows:
Step 1: Channel Partitioning
The C channels are divided into G groups, where each group contains C/G channels. For this division to be valid, C must be evenly divisible by G.
Step 2: Group-wise Statistics Computation
For each sample in the batch and each channel group, compute the mean and variance across all spatial positions (H × W) and all channels within that group:
$$\mu_{b,g} = \frac{1}{(C/G) \cdot H \cdot W} \sum_{c \in \text{group}_g} \sum_{h,w} X_{b,c,h,w}$$

$$\sigma^2_{b,g} = \frac{1}{(C/G) \cdot H \cdot W} \sum_{c \in \text{group}_g} \sum_{h,w} (X_{b,c,h,w} - \mu_{b,g})^2$$
Step 3: Normalization
Each element is normalized using its group's statistics:
$$\hat{X}_{b,c,h,w} = \frac{X_{b,c,h,w} - \mu_{b,g}}{\sqrt{\sigma^2_{b,g} + \epsilon}}$$
Step 4: Scale and Shift (Affine Transformation)
Apply learnable per-channel scale (gamma) and shift (beta) parameters:
$$Y_{b,c,h,w} = \gamma_c \cdot \hat{X}_{b,c,h,w} + \beta_c$$
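The four steps above can be sketched directly in NumPy by reshaping the channel axis into (G, C/G) so that the group statistics fall out of a single reduction. This is a minimal sketch, not a reference solution; the epsilon value 1e-5 is an assumption, since the problem gives the symbol but not its value.

```python
import numpy as np

def group_norm(X, gamma, beta, num_groups, eps=1e-5):
    """Channel group normalization for a (B, C, H, W) tensor (eps assumed)."""
    B, C, H, W = X.shape
    assert C % num_groups == 0, "C must be evenly divisible by G"
    # Step 1: partition the C channels into G groups of C/G channels
    Xg = X.reshape(B, num_groups, C // num_groups, H, W)
    # Step 2: per-sample, per-group mean and variance over (C/G, H, W)
    mu = Xg.mean(axis=(2, 3, 4), keepdims=True)
    var = Xg.var(axis=(2, 3, 4), keepdims=True)
    # Step 3: normalize each element with its group's statistics
    Xhat = ((Xg - mu) / np.sqrt(var + eps)).reshape(B, C, H, W)
    # Step 4: per-channel affine transform
    gamma = np.asarray(gamma).reshape(1, C, 1, 1)
    beta = np.asarray(beta).reshape(1, C, 1, 1)
    return gamma * Xhat + beta
```

The `keepdims=True` keeps `mu` and `var` broadcastable against the grouped tensor, so Step 3 needs no explicit loop over samples or groups.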
Unlike batch normalization, which requires large batch sizes to compute stable statistics, group normalization computes statistics independently for each sample within channel groups, so its behavior does not depend on batch size. This makes it particularly effective in small-batch or variable-batch settings, such as object detection and segmentation, where batch statistics become unreliable.
Implement a function that performs channel group normalization on a 4D input tensor. The function should normalize each group of channels independently per sample, then apply the learned scale and shift parameters.
Output values should be rounded to 4 decimal places.
X = [[[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]], [[[9.0, 10.0], [11.0, 12.0]], [[13.0, 14.0], [15.0, 16.0]]]]
gamma = [1.0, 1.0]
beta = [0.0, 0.0]
num_groups = 2

Output:
[[[[-1.3416, -0.4472], [0.4472, 1.3416]], [[-1.3416, -0.4472], [0.4472, 1.3416]]], [[[-1.3416, -0.4472], [0.4472, 1.3416]], [[-1.3416, -0.4472], [0.4472, 1.3416]]]]

The input tensor has shape (2, 2, 2, 2) — 2 samples, 2 channels, 2×2 spatial dimensions.
With num_groups = 2 and 2 channels, each group contains 1 channel (2 ÷ 2 = 1).
For Sample 0, Group 0 (Channel 0), the values are [1.0, 2.0, 3.0, 4.0]:
- Mean: μ = (1 + 2 + 3 + 4) / 4 = 2.5
- Variance: σ² = ((−1.5)² + (−0.5)² + 0.5² + 1.5²) / 4 = 1.25
- Normalized (ignoring the small ε): (1 − 2.5)/√1.25 ≈ −1.3416, (2 − 2.5)/√1.25 ≈ −0.4472, (3 − 2.5)/√1.25 ≈ 0.4472, (4 − 2.5)/√1.25 ≈ 1.3416
Since gamma = 1.0 and beta = 0.0, the output equals the normalized values. The same pattern applies to all samples and groups.
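The arithmetic for this first group can be checked with a few lines of plain Python (ε = 1e-5 is an assumption, as the problem does not fix its value):

```python
import math

# Sample 0, Group 0 holds only channel 0: values [[1, 2], [3, 4]]
vals = [1.0, 2.0, 3.0, 4.0]
mu = sum(vals) / len(vals)                           # group mean: 2.5
var = sum((v - mu) ** 2 for v in vals) / len(vals)   # group variance: 1.25
eps = 1e-5                                           # assumed epsilon
norm = [round((v - mu) / math.sqrt(var + eps), 4) for v in vals]
print(mu, var, norm)  # 2.5 1.25 [-1.3416, -0.4472, 0.4472, 1.3416]
```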
X = [[[[1.0, 2.0], [3.0, 4.0]], [[2.0, 3.0], [4.0, 5.0]]]]
gamma = [1.0, 1.0]
beta = [0.0, 0.0]
num_groups = 1

Output:
[[[[-1.633, -0.8165], [0.0, 0.8165]], [[-0.8165, 0.0], [0.8165, 1.633]]]]

The input tensor has shape (1, 2, 2, 2) — 1 sample, 2 channels, 2×2 spatial dimensions.
With num_groups = 1, all 2 channels belong to a single group.
For Sample 0, Group 0 (Channels 0 and 1), all 8 values [1, 2, 3, 4, 2, 3, 4, 5] share one set of statistics:
- Mean: μ = 24 / 8 = 3.0
- Variance: σ² = (4 + 1 + 0 + 1 + 1 + 0 + 1 + 4) / 8 = 1.5
- Normalized (ignoring the small ε): channel 0 gives (1 − 3)/√1.5 ≈ −1.633 through (4 − 3)/√1.5 ≈ 0.8165; channel 1 gives (2 − 3)/√1.5 ≈ −0.8165 through (5 − 3)/√1.5 ≈ 1.633
This demonstrates how a single group computes one mean/variance pair for all channels, creating more uniform normalization across the entire feature map.
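The single-group case can likewise be verified by hand (again assuming ε = 1e-5), confirming that one mean/variance pair covers both channels:

```python
import math

# Sample 0, one group spanning channels 0 and 1: all 8 values together
vals = [1.0, 2.0, 3.0, 4.0, 2.0, 3.0, 4.0, 5.0]
mu = sum(vals) / len(vals)                           # shared mean: 3.0
var = sum((v - mu) ** 2 for v in vals) / len(vals)   # shared variance: 1.5
norm = [round((v - mu) / math.sqrt(var + 1e-5), 4) for v in vals]
print(norm)  # first four entries are channel 0, last four are channel 1
```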
X = [[[[0.0, 1.0], [2.0, 3.0]], [[4.0, 5.0], [6.0, 7.0]], [[8.0, 9.0], [10.0, 11.0]], [[12.0, 13.0], [14.0, 15.0]]]]
gamma = [2.0, 2.0, 0.5, 0.5]
beta = [1.0, -1.0, 0.5, -0.5]
num_groups = 2

Output:
[[[[-2.055, -1.1822], [-0.3093, 0.5636]], [[-0.5636, 0.3093], [1.1822, 2.055]], [[-0.2638, -0.0455], [0.1727, 0.3909]], [[-0.3909, -0.1727], [0.0455, 0.2638]]]]

The input tensor has shape (1, 4, 2, 2) — 1 sample, 4 channels, 2×2 spatial dimensions.
With num_groups = 2, channels are split into 2 groups of 2 channels each: Group 0 contains channels 0 and 1 (values 0–7), and Group 1 contains channels 2 and 3 (values 8–15).
Affine Parameters: gamma = [2.0, 2.0, 0.5, 0.5] scales channels 0–1 by 2 and channels 2–3 by 0.5; beta = [1.0, −1.0, 0.5, −0.5] shifts each channel by its own offset.
Process: Group 0 has mean 3.5 and variance 5.25; Group 1 has mean 11.5 and variance 5.25. Each value is normalized with its group's statistics, then scaled by its channel's gamma and shifted by its channel's beta.
This example demonstrates how different scale/shift parameters per channel allow the network to learn optimal transformations while maintaining normalized group statistics.
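This third example can be reproduced end to end with NumPy. The sketch below computes the two group statistics via a reshape and then applies the per-channel affine parameters; ε = 1e-5 is an assumed value.

```python
import numpy as np

# Example 3 input: 1 sample, 4 channels, 2x2 spatial, values 0..15
X = np.arange(16.0).reshape(1, 4, 2, 2)
gamma = np.array([2.0, 2.0, 0.5, 0.5]).reshape(1, 4, 1, 1)
beta = np.array([1.0, -1.0, 0.5, -0.5]).reshape(1, 4, 1, 1)

# Two groups of two channels each: reshape to (B, G, C//G, H, W)
Xg = X.reshape(1, 2, 2, 2, 2)
mu = Xg.mean(axis=(2, 3, 4), keepdims=True)   # group means: 3.5 and 11.5
var = Xg.var(axis=(2, 3, 4), keepdims=True)   # group variances: both 5.25
Xhat = ((Xg - mu) / np.sqrt(var + 1e-5)).reshape(1, 4, 2, 2)

# Per-channel scale and shift, rounded as the problem requires
Y = np.round(gamma * Xhat + beta, 4)
print(Y[0, 0])  # [[-2.055  -1.1822] [-0.3093  0.5636]]
```

Note that both groups have the same variance (5.25) because each spans 8 consecutive integers; only the means differ.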
Constraints
- C must be evenly divisible by num_groups.