In modern deep learning architectures, normalization layers like Layer Normalization and Batch Normalization play a crucial role in stabilizing training and accelerating convergence. However, these layers introduce computational overhead by requiring explicit statistics computation across features or batches.
The Adaptive Hyperbolic Tangent Transformation offers an elegant, normalization-free alternative that preserves the essential squashing behavior required for stable training while eliminating the need for computing running statistics. This transformation combines the bounded, smooth properties of the hyperbolic tangent function with learnable affine parameters.
Mathematical Formulation:
Given an input tensor x with feature dimension D, and learnable parameters:
• alpha — a scalar that scales the input before the tanh is applied
• gamma — a per-feature scale vector of shape (D,)
• beta — a per-feature shift vector of shape (D,)
The transformation is computed as:
$$\text{output} = \gamma \cdot \tanh(\alpha \cdot x) + \beta$$
Key Properties:
• Bounded: tanh squashes its input into (−1, 1), so each feature's output lies between β − |γ| and β + |γ|
• Smooth: the transformation is differentiable everywhere
• Stateless: no running statistics are computed, unlike Batch or Layer Normalization
Broadcasting Semantics:
The transformation applies element-wise to the input tensor. The gamma and beta parameters are broadcast across all dimensions except the last (feature) dimension. This means gamma and beta must each have shape (D,), where D is the size of the input's last axis, and every position along the leading (batch and spatial) dimensions shares the same per-feature scale and shift.
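As a quick illustration of these broadcasting semantics (the shapes here are arbitrary, chosen only for the demo):

```python
import numpy as np

x = np.zeros((3, 5, 4))   # arbitrary leading dims; feature dim D = 4
gamma = np.ones(4)        # shape (4,) aligns with x's last axis
beta = np.zeros(4)

out = gamma * np.tanh(0.5 * x) + beta
print(out.shape)          # (3, 5, 4): leading dimensions are preserved
```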
Your Task:
Implement a function that applies this adaptive hyperbolic tangent transformation to an input tensor. The function should correctly handle tensors of arbitrary shape, applying the transformation element-wise while properly broadcasting the gamma and beta parameters along the feature dimension.
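A minimal NumPy sketch of such a function (the name `adaptive_tanh` and the choice of NumPy are illustrative, not a prescribed API; it assumes `gamma` and `beta` are length-D arrays matching the input's last dimension):

```python
import numpy as np

def adaptive_tanh(x, alpha, gamma, beta):
    """Apply output = gamma * tanh(alpha * x) + beta element-wise.

    x     : array of arbitrary shape (..., D)
    alpha : scalar input scale
    gamma : per-feature scale, shape (D,)
    beta  : per-feature shift, shape (D,)
    """
    x = np.asarray(x, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    beta = np.asarray(beta, dtype=float)
    # NumPy aligns trailing axes, so (D,) broadcasts against (..., D)
    return gamma * np.tanh(alpha * x) + beta
```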
Example 1:
x = [[[0.1412, 0.0037, 0.2413, 0.2218]]]
alpha = 0.5
gamma = [1.0, 1.0, 1.0, 1.0]
beta = [0.0, 0.0, 0.0, 0.0]

Output: [[[0.0705, 0.0019, 0.1201, 0.1105]]]

With gamma=1 and beta=0, the output is simply tanh(alpha * x):
• tanh(0.5 × 0.1412) = tanh(0.0706) ≈ 0.0705
• tanh(0.5 × 0.0037) = tanh(0.00185) ≈ 0.0019
• tanh(0.5 × 0.2413) = tanh(0.12065) ≈ 0.1201
• tanh(0.5 × 0.2218) = tanh(0.1109) ≈ 0.1105
The alpha scaling controls how steeply the tanh saturates for large inputs.
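These numbers can be reproduced directly in NumPy (a verification sketch, not part of the required solution):

```python
import numpy as np

x = np.array([[[0.1412, 0.0037, 0.2413, 0.2218]]])
out = 1.0 * np.tanh(0.5 * x) + 0.0   # gamma = 1, beta = 0
# matches the expected output to roughly 4 decimal places
```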
Example 2:
x = [[[0.5, -0.5, 0.0, 1.0]]]
alpha = 1.0
gamma = [1.0, 2.0, 1.5, 0.5]
beta = [0.0, 0.1, -0.1, 0.2]

Output: [[[0.4621, -0.8242, -0.1, 0.5808]]]

Each element undergoes the full transformation gamma * tanh(alpha * x) + beta:
• γ₁ × tanh(1.0 × 0.5) + β₁ = 1.0 × 0.4621 + 0.0 = 0.4621
• γ₂ × tanh(1.0 × (-0.5)) + β₂ = 2.0 × (-0.4621) + 0.1 = -0.8242
• γ₃ × tanh(1.0 × 0.0) + β₃ = 1.5 × 0.0 + (-0.1) = -0.1
• γ₄ × tanh(1.0 × 1.0) + β₄ = 0.5 × 0.7616 + 0.2 = 0.5808
The gamma scales and beta shifts allow fine-grained control over the output distribution.
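The same arithmetic, checked in NumPy (a verification sketch):

```python
import numpy as np

x = np.array([[[0.5, -0.5, 0.0, 1.0]]])
gamma = np.array([1.0, 2.0, 1.5, 0.5])
beta = np.array([0.0, 0.1, -0.1, 0.2])

out = gamma * np.tanh(1.0 * x) + beta
# out ≈ [[[0.4621, -0.8242, -0.1, 0.5808]]]
```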
Example 3:
x = [[[[0.1, 0.2], [0.3, 0.4]], [[-0.1, -0.2], [-0.3, -0.4]]]]
alpha = 0.5
gamma = [1.0, 1.0]
beta = [0.0, 0.0]

Output: [[[[0.05, 0.0997], [0.1489, 0.1974]], [[-0.05, -0.0997], [-0.1489, -0.1974]]]]

For 4D tensors with shape (1, 2, 2, 2), the gamma and beta of shape (2,) broadcast across all spatial dimensions. Each position applies tanh(0.5 * x) independently:
• Positive values: tanh(0.5 × 0.1) ≈ 0.05, tanh(0.5 × 0.2) ≈ 0.0997, etc.
• Negative values: the transformation is odd-symmetric, so tanh(−z) = −tanh(z), mirroring the positive results
This demonstrates how the transformation naturally handles multi-dimensional feature maps.
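A verification sketch for the 4D case, also checking the odd symmetry noted above:

```python
import numpy as np

x = np.array([[[[0.1, 0.2], [0.3, 0.4]],
               [[-0.1, -0.2], [-0.3, -0.4]]]])  # shape (1, 2, 2, 2)
gamma = np.array([1.0, 1.0])                    # shape (2,) broadcasts over
beta = np.array([0.0, 0.0])                     # the three leading dims

out = gamma * np.tanh(0.5 * x) + beta
# the second block mirrors the first: tanh(-z) == -tanh(z)
```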
Constraints: