In modern deep learning architectures, particularly those handling sequential data like Transformers, normalization techniques play a crucial role in stabilizing training and improving convergence. Unlike batch normalization which operates across samples in a batch, layer normalization standardizes activations across the feature dimension independently for each position in a sequence.
Given a 3D input tensor X with dimensions (batch_size, sequence_length, d_model) representing a batch of sequences, your task is to normalize each feature vector at every sequence position. For each position in the sequence, you will:
1. Compute the mean of the features at that position.
2. Compute the variance of the features at that position.
3. Normalize the features using the mean, the variance, and a small constant ε for numerical stability.
4. Apply an affine transformation with scale γ and shift β.
Mathematical Formulation:
For each position in the sequence, given a feature vector $\mathbf{x} \in \mathbb{R}^{d}$:
$$\mu = \frac{1}{d} \sum_{i=1}^{d} x_i$$
$$\sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2$$
$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$$
$$y_i = \gamma_i \cdot \hat{x}_i + \beta_i$$
Where:
• $\mu$ is the mean of the feature vector,
• $\sigma^2$ is its variance,
• $\epsilon$ is a small constant added for numerical stability,
• $\gamma_i$ and $\beta_i$ are the learnable scale and shift parameters.
Your Task: Implement a function that performs layer normalization on a 3D tensor representing batch sequences. The function should normalize across the last dimension (feature dimension) independently for each batch and sequence position, then apply the affine transformation using the provided gamma and beta parameters.
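The task can be sketched in a few lines of NumPy. This is a minimal sketch, not an official reference solution; the function name `layer_norm` and the `eps` default of 1e-5 are my assumptions.

```python
import numpy as np

def layer_norm(X, gamma, beta, eps=1e-5):
    """Normalize X over its last (feature) dimension, then apply
    the elementwise affine transform gamma * x_hat + beta."""
    mu = X.mean(axis=-1, keepdims=True)    # mean per (batch, position)
    var = X.var(axis=-1, keepdims=True)    # biased variance, matching the 1/d formula
    x_hat = (X - mu) / np.sqrt(var + eps)  # standardize each feature vector
    return gamma * x_hat + beta            # broadcast scale and shift over features
```

Because `gamma` and `beta` broadcast against the (batch_size, sequence_length, d_model) input, shapes such as `(d_model,)` or `(1, 1, d_model)` both work.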
Example 1:

Input:
X = [[[0.4967, -0.1383, 0.6477], [1.5230, -0.2342, -0.2341]], [[1.5792, 0.7674, -0.4695], [0.5426, -0.4634, -0.4657]]]
gamma = [[[1.0, 1.0, 1.0]]]
beta = [[[0.0, 0.0, 0.0]]]

Output:
[[[0.47376012, -1.39085726, 0.91709715], [1.41421355, -0.70711669, -0.70709687]], [[1.13193274, 0.16823128, -1.30016402], [1.41421074, -0.70467043, -0.7095403]]]

Explanation: For each position in the sequence, we compute the mean and variance across the 3 features:
• Position (0,0): features [0.4967, -0.1383, 0.6477] → μ ≈ 0.335, σ² ≈ 0.116. Normalized: [(0.4967-0.335)/√0.116, (-0.1383-0.335)/√0.116, (0.6477-0.335)/√0.116] ≈ [0.474, -1.391, 0.917].
• Position (0,1): features [1.5230, -0.2342, -0.2341] → the two nearly equal negative values produce a symmetric output: [1.414, -0.707, -0.707].
With gamma = 1 and beta = 0, the affine transformation preserves the normalized values.
Example 2:

Input:
X = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]]
gamma = [[[1.0, 1.0, 1.0]]]
beta = [[[0.0, 0.0, 0.0]]]

Output:
[[[-1.22474486, 0.0, 1.22474486], [-1.22474486, 0.0, 1.22474486]]]

Explanation: Both positions have evenly spaced features with identical statistics:
• Position (0,0): features [1, 2, 3] → μ = 2, σ² ≈ 0.667, σ ≈ 0.816. Normalized: [(1-2)/0.816, (2-2)/0.816, (3-2)/0.816] ≈ [-1.225, 0, 1.225].
• Position (0,1): features [4, 5, 6] → μ = 5, σ² ≈ 0.667. The same spread produces the identical normalized output: [-1.225, 0, 1.225].
This demonstrates that layer normalization preserves relative relationships while standardizing the scale.
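The shift invariance in Example 2 is easy to verify directly. The snippet below uses an illustrative helper (`normalize` is not part of the task's API) to show that two feature vectors with the same spread normalize identically.

```python
import numpy as np

def normalize(v, eps=1e-5):
    # Standardize one feature vector (illustrative helper only).
    return (v - v.mean()) / np.sqrt(v.var() + eps)

# Shifting every feature by a constant leaves the output unchanged:
a = normalize(np.array([1.0, 2.0, 3.0]))
b = normalize(np.array([4.0, 5.0, 6.0]))
print(np.allclose(a, b))  # True: identical spread → identical output
```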
Example 3:

Input:
X = [[[0.5, -0.5, 0.0]]]
gamma = [[[2.0, 2.0, 2.0]]]
beta = [[[1.0, 1.0, 1.0]]]

Output:
[[[3.44948967, -1.44948967, 1.0]]]

Explanation: This example demonstrates the effect of non-trivial gamma and beta:
• Features [0.5, -0.5, 0.0] → μ = 0, σ² ≈ 0.167, σ ≈ 0.408.
• Normalized: [0.5/0.408, -0.5/0.408, 0/0.408] ≈ [1.225, -1.225, 0].
• Affine transform y = 2 · normalized + 1: [2·1.225+1, 2·(-1.225)+1, 2·0+1] ≈ [3.449, -1.449, 1.0].
The scale parameter (gamma = 2) amplifies variations, while the shift (beta = 1) centers the output around 1.
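The arithmetic in Example 3 can be checked in a few lines. This is a standalone sketch of the formulation above; the variable names are illustrative.

```python
import numpy as np

eps = 1e-5
x = np.array([0.5, -0.5, 0.0])
x_hat = (x - x.mean()) / np.sqrt(x.var() + eps)  # mu = 0, var ≈ 0.167
y = 2.0 * x_hat + 1.0                            # gamma = 2, beta = 1
print(y)  # ≈ [3.449, -1.449, 1.0]
```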
Constraints