In modern deep learning architectures, particularly those handling sequential data like Transformers, normalization techniques play a crucial role in stabilizing training and improving convergence. Unlike batch normalization which operates across samples in a batch, layer normalization standardizes activations across the feature dimension independently for each position in a sequence.
Given a 3D input tensor X with dimensions (batch_size, sequence_length, d_model) representing a batch of sequences, your task is to normalize each feature vector at every sequence position. For each position in the sequence, you will:
1. Compute the mean of the features at that position.
2. Compute the variance of the features at that position.
3. Normalize the features using the mean, the variance, and a small constant ε for numerical stability.
4. Apply an affine transformation with scale γ and shift β.
Mathematical Formulation:
For each position in the sequence, given a feature vector $\mathbf{x} \in \mathbb{R}^{d}$:
$$\mu = \frac{1}{d} \sum_{i=1}^{d} x_i$$
$$\sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2$$
$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$$
$$y_i = \gamma_i \cdot \hat{x}_i + \beta_i$$
Where:
• $\mu$ is the mean of the feature vector,
• $\sigma^2$ is its variance,
• $\epsilon$ is a small constant added for numerical stability,
• $\gamma_i$ and $\beta_i$ are the learnable scale and shift parameters.
Your Task: Implement a function that performs layer normalization on a 3D tensor representing batch sequences. The function should normalize across the last dimension (feature dimension) independently for each batch and sequence position, then apply the affine transformation using the provided gamma and beta parameters.
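The task can be sketched in a few lines of NumPy. This is a minimal sketch, not an official reference solution; the function name `layer_norm` and the `eps` default of 1e-5 are my assumptions.

```python
import numpy as np

def layer_norm(X, gamma, beta, eps=1e-5):
    """Normalize X over its last (feature) dimension, then apply
    the elementwise affine transform gamma * x_hat + beta."""
    mu = X.mean(axis=-1, keepdims=True)    # mean per (batch, position)
    var = X.var(axis=-1, keepdims=True)    # biased variance, matching the 1/d formula
    x_hat = (X - mu) / np.sqrt(var + eps)  # standardize each feature vector
    return gamma * x_hat + beta            # broadcast scale and shift over features
```

Because `gamma` and `beta` broadcast against the (batch_size, sequence_length, d_model) input, shapes such as `(d_model,)` or `(1, 1, d_model)` both work.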
Example 1:

Input:
X = [[[0.4967, -0.1383, 0.6477], [1.5230, -0.2342, -0.2341]], [[1.5792, 0.7674, -0.4695], [0.5426, -0.4634, -0.4657]]]
gamma = [[[1.0, 1.0, 1.0]]]
beta = [[[0.0, 0.0, 0.0]]]

Output:
[[[0.47376012, -1.39085726, 0.91709715], [1.41421355, -0.70711669, -0.70709687]], [[1.13193274, 0.16823128, -1.30016402], [1.41421074, -0.70467043, -0.7095403]]]

Explanation: For each position in the sequence, we compute the mean and variance across the 3 features:
• Position (0,0): features [0.4967, -0.1383, 0.6477] → μ ≈ 0.335, σ² ≈ 0.116. Normalized: [(0.4967-0.335)/√0.116, (-0.1383-0.335)/√0.116, (0.6477-0.335)/√0.116] ≈ [0.474, -1.391, 0.917].
• Position (0,1): features [1.5230, -0.2342, -0.2341] → the two nearly equal negative values produce a symmetric output: [1.414, -0.707, -0.707].
With gamma = 1 and beta = 0, the affine transformation preserves the normalized values.
Example 2:

Input:
X = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]]
gamma = [[[1.0, 1.0, 1.0]]]
beta = [[[0.0, 0.0, 0.0]]]

Output:
[[[-1.22474486, 0.0, 1.22474486], [-1.22474486, 0.0, 1.22474486]]]

Explanation: Both positions have evenly spaced features with identical statistics:
• Position (0,0): features [1, 2, 3] → μ = 2, σ² ≈ 0.667, σ ≈ 0.816. Normalized: [(1-2)/0.816, (2-2)/0.816, (3-2)/0.816] ≈ [-1.225, 0, 1.225].
• Position (0,1): features [4, 5, 6] → μ = 5, σ² ≈ 0.667. The same spread produces the identical normalized output: [-1.225, 0, 1.225].
This demonstrates that layer normalization preserves relative relationships while standardizing the scale.
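The shift invariance in Example 2 is easy to verify directly. The snippet below uses an illustrative helper (`normalize` is not part of the task's API) to show that two feature vectors with the same spread normalize identically.

```python
import numpy as np

def normalize(v, eps=1e-5):
    # Standardize one feature vector (illustrative helper only).
    return (v - v.mean()) / np.sqrt(v.var() + eps)

# Shifting every feature by a constant leaves the output unchanged:
a = normalize(np.array([1.0, 2.0, 3.0]))
b = normalize(np.array([4.0, 5.0, 6.0]))
print(np.allclose(a, b))  # True: identical spread → identical output
```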
Example 3:

Input:
X = [[[0.5, -0.5, 0.0]]]
gamma = [[[2.0, 2.0, 2.0]]]
beta = [[[1.0, 1.0, 1.0]]]

Output:
[[[3.44948967, -1.44948967, 1.0]]]

Explanation: This example demonstrates the effect of non-trivial gamma and beta:
• Features [0.5, -0.5, 0.0] → μ = 0, σ² ≈ 0.167, σ ≈ 0.408.
• Normalized: [0.5/0.408, -0.5/0.408, 0/0.408] ≈ [1.225, -1.225, 0].
• Affine transform y = 2 · normalized + 1: [2·1.225+1, 2·(-1.225)+1, 2·0+1] ≈ [3.449, -1.449, 1.0].
The scale parameter (gamma = 2) amplifies variations, while the shift (beta = 1) centers the output around 1.
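The arithmetic in Example 3 can be checked in a few lines. This is a standalone sketch of the formulation above; the variable names are illustrative.

```python
import numpy as np

eps = 1e-5
x = np.array([0.5, -0.5, 0.0])
x_hat = (x - x.mean()) / np.sqrt(x.var() + eps)  # mu = 0, var ≈ 0.167
y = 2.0 * x_hat + 1.0                            # gamma = 2, beta = 1
print(y)  # ≈ [3.449, -1.449, 1.0]
```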
Constraints