In deep learning for computer vision, normalization techniques play a critical role in stabilizing training and improving model convergence. While batch normalization computes statistics across the entire batch, there are scenarios—particularly in image style transfer, image generation, and domain adaptation—where normalizing each sample independently across its spatial dimensions proves more effective.
Channel-Wise Spatial Normalization operates on 4D tensors of shape (B, C, H, W), where:
- B is the batch size (number of samples)
- C is the number of channels
- H is the spatial height
- W is the spatial width
For each individual sample in the batch and each channel independently, this normalization technique:
- computes the mean and standard deviation over the H × W spatial positions,
- standardizes the values of that slice using these statistics, and
- applies a learnable per-channel scale (gamma) and shift (beta).
Mathematical Formulation:
For a given sample b and channel c, let x_{b,c} be the 2D slice of shape (H, W). The normalized output is computed as:
$$\mu_{b,c} = \frac{1}{H \times W} \sum_{h=1}^{H} \sum_{w=1}^{W} x_{b,c,h,w}$$
$$\sigma_{b,c} = \sqrt{\frac{1}{H \times W} \sum_{h=1}^{H} \sum_{w=1}^{W} (x_{b,c,h,w} - \mu_{b,c})^2 + \epsilon}$$
$$\hat{x}_{b,c,h,w} = \frac{x_{b,c,h,w} - \mu_{b,c}}{\sigma_{b,c}}$$
$$y_{b,c,h,w} = \gamma_c \cdot \hat{x}_{b,c,h,w} + \beta_c$$
Where ε (epsilon) is a small constant (typically 1e-5) added for numerical stability to prevent division by zero.
Key Distinction: Unlike batch normalization (which normalizes across the batch dimension), channel-wise spatial normalization treats each sample independently. This makes it particularly useful when:
- batch sizes are small (or 1), making batch statistics unreliable, or
- per-sample feature statistics carry meaningful information, as in style transfer, image generation, and domain adaptation.
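The difference amounts to a choice of reduction axes. A minimal NumPy sketch (the variable names here are illustrative, not from the problem):

```python
import numpy as np

X = np.random.randn(8, 3, 4, 4)  # a dummy (B, C, H, W) batch

# Batch normalization: one mean per channel, shared across the whole batch
bn_mu = X.mean(axis=(0, 2, 3), keepdims=True)  # shape (1, 3, 1, 1)

# Channel-wise spatial normalization: one mean per (sample, channel) pair
cw_mu = X.mean(axis=(2, 3), keepdims=True)     # shape (8, 3, 1, 1)

print(bn_mu.shape, cw_mu.shape)
```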
Your Task:
Implement the channel_wise_spatial_normalization function that takes an input tensor X of shape (B, C, H, W), along with learnable parameters gamma and beta of shape (C,), and returns the normalized tensor of the same shape.
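One possible NumPy implementation, translating the formulas above directly (the epsilon default of 1e-5 is the value suggested earlier; the expected signature may differ slightly):

```python
import numpy as np

def channel_wise_spatial_normalization(X, gamma, beta, epsilon=1e-5):
    """Normalize each (sample, channel) slice over its H x W positions.

    X: array of shape (B, C, H, W)
    gamma, beta: learnable per-channel parameters of shape (C,)
    """
    # Per-sample, per-channel statistics over the spatial axes
    mu = X.mean(axis=(2, 3), keepdims=True)   # shape (B, C, 1, 1)
    var = X.var(axis=(2, 3), keepdims=True)   # shape (B, C, 1, 1)
    X_hat = (X - mu) / np.sqrt(var + epsilon)
    # Reshape gamma and beta to (1, C, 1, 1) so they broadcast per channel
    return gamma.reshape(1, -1, 1, 1) * X_hat + beta.reshape(1, -1, 1, 1)
```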
Example 1:

Input:
X = [
  [
    [[0.497, -0.138], [0.648, 1.523]],
    [[-0.234, -0.234], [1.579, 0.767]]
  ],
  [
    [[-0.469, 0.543], [-0.463, -0.466]],
    [[0.242, -1.913], [-1.725, -0.562]]
  ]
]
gamma = [1.0, 1.0]
beta = [0.0, 0.0]

Output:
[
  [
    [[-0.229, -1.300], [0.026, 1.502]],
    [[-0.926, -0.926], [1.460, 0.392]]
  ],
  [
    [[-0.585, 1.732], [-0.571, -0.576]],
    [[1.401, -1.050], [-0.836, 0.486]]
  ]
]

The input tensor has shape (2, 2, 2, 2), meaning 2 samples, 2 channels, and 2×2 spatial dimensions.
For sample 0, channel 0: the spatial values are [0.497, -0.138, 0.648, 1.523], giving μ ≈ 0.633 and σ ≈ 0.593.
For sample 0, channel 1: the spatial values are [-0.234, -0.234, 1.579, 0.767], giving μ ≈ 0.470 and σ ≈ 0.760.
Since γ = [1.0, 1.0] and β = [0.0, 0.0], the output is simply the standardized values without additional scaling or shifting.
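These statistics can be checked with a few lines of NumPy (a verification snippet, not part of the required solution):

```python
import numpy as np

slice_00 = np.array([[0.497, -0.138], [0.648, 1.523]])  # sample 0, channel 0
mu = slice_00.mean()                    # ≈ 0.633
sigma = np.sqrt(slice_00.var() + 1e-5)  # ≈ 0.593
normalized = (slice_00 - mu) / sigma
print(np.round(normalized, 3))
```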
Example 2:

Input:
X = [
  [
    [[0.497, -0.138, 0.648], [1.523, -0.234, -0.234], [1.579, 0.767, -0.469]],
    [[0.543, -0.463, -0.466], [0.242, -1.913, -1.725], [-0.562, -1.013, 0.314]]
  ]
]
gamma = [2.0, 0.5]
beta = [1.0, -1.0]

Output:
[
  [
    [[1.164, -0.595, 1.582], [4.006, -0.860, -0.860], [4.161, 1.913, -1.512]],
    [[-0.327, -0.941, -0.942], [-0.510, -1.826, -1.711], [-1.001, -1.276, -0.466]]
  ]
]

This example demonstrates the effect of non-trivial gamma and beta values.
For channel 0: μ ≈ 0.438 and σ ≈ 0.722, and each standardized value is scaled by γ = 2.0 and shifted by β = 1.0.
For channel 1: μ ≈ -0.560 and σ ≈ 0.819, and each standardized value is scaled by γ = 0.5 and shifted by β = -1.0.
The gamma and beta parameters are per-channel, allowing the network to learn optimal scales and shifts for each feature channel independently.
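In code, this per-channel application is usually realized by reshaping gamma and beta so they broadcast over the batch and spatial axes; a sketch with the Example 2 parameters (X_hat is a placeholder all-zeros tensor here):

```python
import numpy as np

B, C, H, W = 1, 2, 3, 3
X_hat = np.zeros((B, C, H, W))   # stand-in for an already-normalized tensor
gamma = np.array([2.0, 0.5])
beta = np.array([1.0, -1.0])

# (C,) -> (1, C, 1, 1): each channel gets its own scale and shift
Y = gamma.reshape(1, C, 1, 1) * X_hat + beta.reshape(1, C, 1, 1)
print(Y[0, 0, 0, 0], Y[0, 1, 0, 0])  # zeros map to beta: 1.0 and -1.0
```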
Example 3:

Input:
X = [
  [
    [[0.257, -0.908, -0.379, -0.535], [0.858, -0.413, 0.498, 2.010], [1.263, -0.439, -0.346, 0.455], [-1.669, -0.862, 0.493, -0.124]]
  ]
]
gamma = [1.5]
beta = [0.5]

Output:
[
  [
    [[0.921, -1.064, -0.161, -0.428], [1.944, -0.220, 1.331, 3.906], [2.633, -0.265, -0.107, 1.258], [-2.358, -0.985, 1.322, 0.271]]
  ]
]

A single-sample, single-channel example with larger spatial dimensions (4×4 = 16 elements).
Process:
- Compute the mean over all 16 spatial values: μ ≈ 0.010.
- Compute the standard deviation: σ ≈ 0.881.
- Standardize each value, then scale by γ = 1.5 and shift by β = 0.5.
The 16 spatial values now have a controlled distribution, scaled by 1.5 and shifted by 0.5, demonstrating how the normalization adapts to arbitrary spatial dimensions while preserving per-channel learnable parameters.
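As a final sanity check, the 4×4 statistics can be recomputed directly (approximate values, using the formulas above):

```python
import numpy as np

x = np.array([[0.257, -0.908, -0.379, -0.535],
              [0.858, -0.413, 0.498, 2.010],
              [1.263, -0.439, -0.346, 0.455],
              [-1.669, -0.862, 0.493, -0.124]])
mu = x.mean()                    # ≈ 0.010
sigma = np.sqrt(x.var() + 1e-5)  # ≈ 0.881
y = 1.5 * (x - mu) / sigma + 0.5
print(np.round(y, 3))
```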
Constraints