The SwiGLU (Swish-Gated Linear Unit) activation function represents a significant advancement in neural network architecture design, combining the smoothness of the Swish activation with the expressive power of gated linear units. This activation function has become a cornerstone in modern transformer-based models, notably powering models like LLaMA, PaLM, and other state-of-the-art language models.
SwiGLU is a variant of the Gated Linear Unit (GLU) family of activation functions. The core idea behind GLU architectures is to split the input into two parts: one serves as the primary signal, and the other acts as a gate that controls how much of the signal passes through.
The Swish function is defined as:
$$\text{Swish}(x) = x \cdot \sigma(x)$$
where σ(x) is the sigmoid function:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
Swish is notable for being smooth, non-monotonic, and having demonstrated superior performance over ReLU in many deep learning applications.
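As a sketch, Swish can be computed directly in NumPy. The clip guard is an implementation choice, not part of the definition: the sigmoid saturates long before |x| = 60, so clipping changes nothing numerically but prevents overflow in the exponential for extreme inputs.

```python
import numpy as np

def swish(x: np.ndarray) -> np.ndarray:
    """Swish(x) = x * sigmoid(x)."""
    # Clip before exp: sigmoid is already 0 or 1 to machine precision
    # well before |x| = 60, so this only prevents overflow warnings.
    z = np.clip(x, -60.0, 60.0)
    return x * (1.0 / (1.0 + np.exp(-z)))
```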
Given an input vector that has been projected to shape (batch_size, 2d), the SwiGLU operation works as follows:
1. Split the input into two halves along the feature dimension: $x = [x_1, x_2]$, each of shape (batch_size, d).
2. Apply Swish to the second half: $$\text{Swish}(x_2) = x_2 \cdot \sigma(x_2)$$
3. Multiply the first half element-wise with the Swish-activated second half: $$\text{SwiGLU}(x_1, x_2) = x_1 \odot \text{Swish}(x_2)$$
The gating mechanism allows the network to learn which features to amplify and which to suppress, providing more expressive representations than simpler activation functions.
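The three steps above can be sketched as a single NumPy function. This assumes the input's last dimension is even; the function name and the clipping constant are implementation choices, not part of the problem statement.

```python
import numpy as np

def swiglu(x: np.ndarray) -> np.ndarray:
    """SwiGLU(x) = x1 * Swish(x2), splitting x in half along the last axis."""
    d = x.shape[-1] // 2
    x1, x2 = x[..., :d], x[..., d:]      # signal half, gate half
    z = np.clip(x2, -60.0, 60.0)         # keep exp from overflowing
    swish_x2 = x2 / (1.0 + np.exp(-z))   # x2 * sigmoid(x2)
    return x1 * swish_x2
```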
Implement a Python function that applies the SwiGLU activation function to a NumPy array. The function should split the input into two halves along the last dimension, apply Swish to the second half, and return the element-wise product of the first half with the Swish-activated second half.
Note: Pay careful attention to numerical stability when computing the sigmoid function for very large or very small values.
Example 1:

Input:
x = np.array([[1.0, -1.0, 1000.0, -1000.0]])

Expected output: [[1000.0, 0.0]]

The input has shape (1, 4), so d = 2. We split it into:
• x₁ = [1.0, -1.0] (first half)
• x₂ = [1000.0, -1000.0] (second half)

Applying the sigmoid function:
• σ(1000) ≈ 1.0 (sigmoid saturates to 1 for large positive values)
• σ(-1000) ≈ 0.0 (sigmoid saturates to 0 for large negative values)

Computing Swish(x₂):
• Swish(1000) = 1000 × 1.0 = 1000.0
• Swish(-1000) = -1000 × 0.0 = 0.0

Applying the gating multiplication:
• SwiGLU = x₁ ⊙ Swish(x₂) = [1.0 × 1000.0, -1.0 × 0.0] = [1000.0, 0.0]
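This example can be checked numerically with a standalone snippet; the clip guard is an implementation choice that keeps the exponential from overflowing at ±1000.

```python
import numpy as np

x = np.array([[1.0, -1.0, 1000.0, -1000.0]])
x1, x2 = x[:, :2], x[:, 2:]            # split into signal and gate halves
z = np.clip(x2, -60.0, 60.0)           # avoid overflow in exp
out = x1 * (x2 / (1.0 + np.exp(-z)))   # x1 * Swish(x2)
print(out)                             # approximately [[1000. 0.]]
```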
Example 2:

Input:
x = np.array([[0.0, 0.0, 0.0, 0.0]])

Expected output: [[0.0, 0.0]]

With all zeros as input:
• x₁ = [0.0, 0.0]
• x₂ = [0.0, 0.0]

The sigmoid of 0 is σ(0) = 0.5, so:
• Swish(0) = 0 × 0.5 = 0.0

Therefore:
• SwiGLU = x₁ ⊙ Swish(x₂) = [0.0 × 0.0, 0.0 × 0.0] = [0.0, 0.0]

This demonstrates that zeros pass through as zeros, which is important for maintaining sparsity in neural networks.
Example 3:

Input:
x = np.array([[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]])

Expected output: [[2.8577, 7.8561], [34.9681, 47.9839]]

This is a batch of 2 samples, each with 4 features (d = 2):

Sample 1:
• x₁ = [1.0, 2.0], x₂ = [3.0, 4.0]
• σ(3.0) ≈ 0.9526, σ(4.0) ≈ 0.9820
• Swish(3.0) = 3.0 × 0.9526 ≈ 2.8577
• Swish(4.0) = 4.0 × 0.9820 ≈ 3.9281
• SwiGLU = [1.0 × 2.8577, 2.0 × 3.9281] = [2.8577, 7.8561]

Sample 2:
• x₁ = [5.0, 6.0], x₂ = [7.0, 8.0]
• σ(7.0) ≈ 0.9991, σ(8.0) ≈ 0.9997
• Swish(7.0) = 7.0 × 0.9991 ≈ 6.9936
• Swish(8.0) = 8.0 × 0.9997 ≈ 7.9973
• SwiGLU = [5.0 × 6.9936, 6.0 × 7.9973] = [34.9681, 47.9839]
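The batch computation above can be reproduced with NumPy broadcasting; this is a standalone sketch (variable names are arbitrary), and no clipping is needed here since the inputs are small.

```python
import numpy as np

x = np.array([[1.0, 2.0, 3.0, 4.0],
              [5.0, 6.0, 7.0, 8.0]])
x1, x2 = x[:, :2], x[:, 2:]            # per-sample split into halves
out = x1 * (x2 / (1.0 + np.exp(-x2)))  # x1 * Swish(x2), broadcast over the batch
print(np.round(out, 4))                # matches the expected output above
```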
Constraints