The SwiGLU (Swish-Gated Linear Unit) activation function represents a significant advancement in neural network architecture design, combining the smoothness of the Swish activation with the expressive power of gated linear units. This activation function has become a cornerstone in modern transformer-based models, notably powering models like LLaMA, PaLM, and other state-of-the-art language models.
SwiGLU is a variant of the Gated Linear Unit (GLU) family of activation functions. The core idea behind GLU architectures is to split the input into two parts: one serves as the primary signal, and the other acts as a gate that controls how much of the signal passes through.
The Swish function is defined as:
$$\text{Swish}(x) = x \cdot \sigma(x)$$
where σ(x) is the sigmoid function:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
Swish is notable for being smooth, non-monotonic, and having demonstrated superior performance over ReLU in many deep learning applications.
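As a sketch, Swish can be computed directly in NumPy. The clip guard is an implementation choice, not part of the definition: the sigmoid saturates long before |x| = 60, so clipping changes nothing numerically but prevents overflow in the exponential for extreme inputs.

```python
import numpy as np

def swish(x: np.ndarray) -> np.ndarray:
    """Swish(x) = x * sigmoid(x)."""
    # Clip before exp: sigmoid is already 0 or 1 to machine precision
    # well before |x| = 60, so this only prevents overflow warnings.
    z = np.clip(x, -60.0, 60.0)
    return x * (1.0 / (1.0 + np.exp(-z)))
```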
Given an input vector that has been projected to shape (batch_size, 2d), the SwiGLU operation works as follows:
1. Split the input into two halves along the feature dimension: $x = [x_1, x_2]$, each of shape (batch_size, d).
2. Apply Swish to the second half: $$\text{Swish}(x_2) = x_2 \cdot \sigma(x_2)$$
3. Multiply the first half element-wise with the Swish-activated second half: $$\text{SwiGLU}(x_1, x_2) = x_1 \odot \text{Swish}(x_2)$$
The gating mechanism allows the network to learn which features to amplify and which to suppress, providing more expressive representations than simpler activation functions.
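The three steps above can be sketched as a single NumPy function. This assumes the input's last dimension is even; the function name and the clipping constant are implementation choices, not part of the problem statement.

```python
import numpy as np

def swiglu(x: np.ndarray) -> np.ndarray:
    """SwiGLU(x) = x1 * Swish(x2), splitting x in half along the last axis."""
    d = x.shape[-1] // 2
    x1, x2 = x[..., :d], x[..., d:]      # signal half, gate half
    z = np.clip(x2, -60.0, 60.0)         # keep exp from overflowing
    swish_x2 = x2 / (1.0 + np.exp(-z))   # x2 * sigmoid(x2)
    return x1 * swish_x2
```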
Implement a Python function that applies the SwiGLU activation function to a NumPy array. The function should split the input into two halves along the last dimension, apply Swish to the second half, and return the element-wise product of the first half with the Swish-activated second half.
Note: Pay careful attention to numerical stability when computing the sigmoid function for very large or very small values.
Example 1:

Input:
x = np.array([[1.0, -1.0, 1000.0, -1000.0]])

Expected output: [[1000.0, 0.0]]

The input has shape (1, 4), so d = 2. We split it into:
• x₁ = [1.0, -1.0] (first half)
• x₂ = [1000.0, -1000.0] (second half)

Applying the sigmoid function:
• σ(1000) ≈ 1.0 (sigmoid saturates to 1 for large positive values)
• σ(-1000) ≈ 0.0 (sigmoid saturates to 0 for large negative values)

Computing Swish(x₂):
• Swish(1000) = 1000 × 1.0 = 1000.0
• Swish(-1000) = -1000 × 0.0 = 0.0

Applying the gating multiplication:
• SwiGLU = x₁ ⊙ Swish(x₂) = [1.0 × 1000.0, -1.0 × 0.0] = [1000.0, 0.0]
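This example can be checked numerically with a standalone snippet; the clip guard is an implementation choice that keeps the exponential from overflowing at ±1000.

```python
import numpy as np

x = np.array([[1.0, -1.0, 1000.0, -1000.0]])
x1, x2 = x[:, :2], x[:, 2:]            # split into signal and gate halves
z = np.clip(x2, -60.0, 60.0)           # avoid overflow in exp
out = x1 * (x2 / (1.0 + np.exp(-z)))   # x1 * Swish(x2)
print(out)                             # approximately [[1000. 0.]]
```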
Example 2:

Input:
x = np.array([[0.0, 0.0, 0.0, 0.0]])

Expected output: [[0.0, 0.0]]

With all zeros as input:
• x₁ = [0.0, 0.0]
• x₂ = [0.0, 0.0]

The sigmoid of 0 is σ(0) = 0.5, so:
• Swish(0) = 0 × 0.5 = 0.0

Therefore:
• SwiGLU = x₁ ⊙ Swish(x₂) = [0.0 × 0.0, 0.0 × 0.0] = [0.0, 0.0]

This demonstrates that zeros pass through as zeros, which is important for maintaining sparsity in neural networks.
Example 3:

Input:
x = np.array([[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]])

Expected output: [[2.8577, 7.8561], [34.9681, 47.9839]]

This is a batch of 2 samples, each with 4 features (d = 2):

Sample 1:
• x₁ = [1.0, 2.0], x₂ = [3.0, 4.0]
• σ(3.0) ≈ 0.9526, σ(4.0) ≈ 0.9820
• Swish(3.0) = 3.0 × 0.9526 ≈ 2.8577
• Swish(4.0) = 4.0 × 0.9820 ≈ 3.9281
• SwiGLU = [1.0 × 2.8577, 2.0 × 3.9281] = [2.8577, 7.8561]

Sample 2:
• x₁ = [5.0, 6.0], x₂ = [7.0, 8.0]
• σ(7.0) ≈ 0.9991, σ(8.0) ≈ 0.9997
• Swish(7.0) = 7.0 × 0.9991 ≈ 6.9936
• Swish(8.0) = 8.0 × 0.9997 ≈ 7.9973
• SwiGLU = [5.0 × 6.9936, 6.0 × 7.9973] = [34.9681, 47.9839]
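The batch computation above can be reproduced with NumPy broadcasting; this is a standalone sketch (variable names are arbitrary), and no clipping is needed here since the inputs are small.

```python
import numpy as np

x = np.array([[1.0, 2.0, 3.0, 4.0],
              [5.0, 6.0, 7.0, 8.0]])
x1, x2 = x[:, :2], x[:, 2:]            # per-sample split into halves
out = x1 * (x2 / (1.0 + np.exp(-x2)))  # x1 * Swish(x2), broadcast over the batch
print(np.round(out, 4))                # matches the expected output above
```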
Constraints