While ReLU and its variants dominated deep learning from roughly 2012 to 2017, two newer activation functions have since become the de facto standards for state-of-the-art models: Swish (also known as SiLU) and GELU.

These functions share a remarkable property: they are smooth, non-monotonic approximations to ReLU that consistently outperform ReLU across diverse architectures and tasks. They are now the default activations in:

- Transformers for NLP (BERT, GPT, T5 all use GELU)
- Efficient CNNs (EfficientNet uses Swish; MobileNetV3 uses Hard Swish)
- Modern LLM feed-forward layers (SwiGLU in LLaMA and Mistral)
This page provides a complete mathematical and empirical analysis of these modern activation functions.
By completing this page, you will understand:

- The mathematical formulation and derivatives of Swish and GELU
- Their relationship to ReLU and the stochastic interpretation of GELU
- Why non-monotonic activations outperform ReLU
- Computational considerations and common approximations
- When to choose Swish vs GELU in practice
Swish was discovered in 2017 by Ramachandran, Zoph, and Le at Google Brain through neural architecture search (NAS). Rather than hand-designing activation functions, they searched over a space of possible functions using reinforcement learning, evaluating candidates on their downstream task performance.
The search discovered that the simplest effective function was:
$$\text{Swish}(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}}$$
where σ(x) is the sigmoid function.
Domain and Range:

- Domain: all real numbers
- Range: approximately [-0.278, ∞); the global minimum of ≈ -0.278 occurs at x ≈ -1.28, and the function is unbounded above
Key Characteristics:

- Smooth (infinitely differentiable) everywhere
- Non-monotonic: dips below zero for moderately negative inputs before rising
- Approaches the identity for large positive x and approaches 0 for large negative x
- Bounded below, unbounded above
Using the product rule on f(x) = x · σ(x):
$$\text{Swish}'(x) = \sigma(x) + x \cdot \sigma'(x) = \sigma(x) + x \cdot \sigma(x)(1 - \sigma(x))$$
Simplifying:
$$\text{Swish}'(x) = \sigma(x) \cdot [1 + x(1 - \sigma(x))]$$
Or equivalently:
$$\text{Swish}'(x) = \sigma(x) + \frac{x \cdot e^{-x}}{(1 + e^{-x})^2}$$
Properties of the Derivative:

- Swish'(0) = 0.5
- Swish'(x) → 1 as x → ∞ and Swish'(x) → 0 as x → −∞
- The derivative is negative for x below ≈ -1.28, which is the source of Swish's non-monotonicity
- It slightly exceeds 1, peaking near 1.1 around x ≈ 2.5
```python
import numpy as np

def sigmoid(x):
    """Numerically stable sigmoid."""
    return np.where(x >= 0,
                    1 / (1 + np.exp(-x)),
                    np.exp(x) / (1 + np.exp(x)))

def swish(x, beta=1.0):
    """
    Swish activation function: x * sigmoid(beta * x)

    beta=1.0 is standard Swish
    beta→∞  approaches ReLU
    beta→0  approaches linear (x/2)
    """
    return x * sigmoid(beta * x)

def swish_derivative(x, beta=1.0):
    """
    Derivative of Swish.
    d/dx[x * σ(βx)] = σ(βx) + βx * σ(βx) * (1 - σ(βx))
    """
    sig = sigmoid(beta * x)
    return sig + beta * x * sig * (1 - sig)

def swish_derivative_from_output(output, x, beta=1.0):
    """
    More efficient: use cached forward values.
    Swish'(x) = β*Swish(x) + σ(βx)*(1 - β*Swish(x))
    """
    sig = sigmoid(beta * x)
    return beta * output + sig * (1 - beta * output)

# Analyze properties
def analyze_swish():
    x = np.linspace(-4, 4, 1000)
    y = swish(x)

    # Find minimum
    min_idx = np.argmin(y)
    min_x, min_y = x[min_idx], y[min_idx]

    print("Swish Analysis:")
    print("-" * 40)
    print(f"Minimum: x = {min_x:.3f}, Swish(x) = {min_y:.3f}")
    print(f"At x = 0:  Swish(0)   = {swish(np.array([0.0]))[0]:.3f}")
    print(f"At x = 0:  Swish'(0)  = {swish_derivative(np.array([0.0]))[0]:.3f}")
    print(f"At x = 5:  Swish(5)   = {swish(np.array([5.0]))[0]:.3f} (≈ x = 5)")
    print(f"At x = 5:  Swish'(5)  = {swish_derivative(np.array([5.0]))[0]:.4f} (≈ 1)")
    print(f"At x = -5: Swish(-5)  = {swish(np.array([-5.0]))[0]:.4f} (≈ 0)")
    print(f"At x = -5: Swish'(-5) = {swish_derivative(np.array([-5.0]))[0]:.4f} (≈ 0)")

analyze_swish()

# Beta parameter exploration
def explore_beta():
    """Show how the beta parameter controls Swish behavior."""
    x = np.array([0.5, -0.5, -2.0])

    print("Beta parameter effect (interpolating toward ReLU):")
    print("-" * 50)
    for beta in [0.1, 0.5, 1.0, 2.0, 5.0, 10.0]:
        vals = swish(x, beta=beta)
        print(f"beta={beta:4.1f}: Swish([0.5, -0.5, -2]) = {vals}")

    # As beta → ∞, Swish → ReLU
    print("As beta → ∞:")
    print(f"  ReLU([0.5, -0.5, -2]) = {np.maximum(0, x)}")

explore_beta()
```

Swish is also known as SiLU (Sigmoid Linear Unit) in PyTorch and other frameworks. They are identical: Swish(x) = x · σ(x) = SiLU(x). The name 'SiLU' emphasizes the sigmoid-linear structure, while 'Swish' was the term used in the original Google publication.
Swish's non-monotonic behavior in the region x ∈ [-1.28, 0] is a key differentiator from ReLU. For slightly negative inputs:

- The output is small and negative (e.g. Swish(-0.5) ≈ -0.19) rather than exactly zero
- The gradient is nonzero, so the unit continues to receive learning signal
This 'bump' below zero allows the network to use negative pre-activations as soft negative signals rather than completely discarding them.
Why This Helps:

- No dead neurons: gradients flow even for negative pre-activations
- Information preservation: distinct negative inputs produce distinct outputs
- Self-gating: the sigmoid factor softly scales each input by its own magnitude
The non-monotonicity was initially surprising—simpler monotonic functions seem more natural. But empirically, this property consistently improves performance across vision and language tasks.
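A quick numeric check of the bump, as a minimal NumPy sketch: moving from -3 toward 0, Swish first falls to its minimum near x ≈ -1.28 and then rises back toward 0.

```python
import numpy as np

def swish(x):
    return x / (1 + np.exp(-x))

# Non-monotonic 'bump': the value at x = -1.28 is below the values on either side.
xs = np.array([-3.0, -1.28, -0.5, 0.0])
print(swish(xs))  # ≈ [-0.142, -0.278, -0.189, 0.0]
```

A monotonic function like ReLU could never produce this pattern: its output never decreases as the input increases.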
GELU (Gaussian Error Linear Unit) was introduced by Hendrycks and Gimpel in 2016, motivated by a different perspective: stochastic regularization.
Consider a neuron whose output is randomly zeroed with a probability that depends on the input:

$$\text{output} = x \cdot m, \qquad m \sim \text{Bernoulli}(\Phi(x))$$

where Φ(x) is the standard Gaussian CDF, so larger inputs are more likely to be kept.
The expected output of this stochastic process is:
$$\mathbb{E}[\text{output}] = x \cdot \Phi(x) + 0 \cdot (1 - \Phi(x)) = x \cdot \Phi(x)$$
This is precisely the GELU function!
$$\text{GELU}(x) = x \cdot \Phi(x) = x \cdot \frac{1}{2}\left[1 + \text{erf}\left(\frac{x}{\sqrt{2}}\right)\right]$$
where erf is the Gauss error function:
$$\text{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2} dt$$
and Φ(x) is the CDF of the standard normal distribution:
$$\Phi(x) = \frac{1}{2}\left[1 + \text{erf}\left(\frac{x}{\sqrt{2}}\right)\right]$$
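The expectation argument above can be verified numerically. A minimal Monte Carlo sketch (the sample count, seed, and helper name `stochastic_unit` are illustrative choices, not from the original derivation):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def stochastic_unit(x, n_samples=200_000):
    """Keep x with probability Φ(x), output 0 otherwise; return the sample mean."""
    keep = rng.random(n_samples) < norm.cdf(x)
    return np.mean(np.where(keep, x, 0.0))

for x in [-1.0, 0.5, 2.0]:
    print(f"x = {x:+.1f}: E[output] ≈ {stochastic_unit(x):.4f}, "
          f"GELU(x) = {x * norm.cdf(x):.4f}")
```

The Monte Carlo estimates agree with x · Φ(x) to within sampling noise, confirming that GELU is the expected value of this input-dependent dropout process.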
Using the product rule:
$$\text{GELU}'(x) = \Phi(x) + x \cdot \phi(x)$$
where φ(x) is the standard Gaussian PDF:
$$\phi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$$
Properties:

- GELU'(0) = 0.5, matching Swish at the origin
- GELU'(x) → 1 as x → ∞ and GELU'(x) → 0 as x → −∞
- The derivative slightly exceeds 1, peaking at ≈ 1.13 near x = √2 ≈ 1.41
- It is negative for x below ≈ -0.75, the source of GELU's non-monotonicity
The erf function is computationally expensive. Two approximations are widely used:
Tanh Approximation (used in GPT-2):
$$\text{GELU}(x) \approx 0.5x\left[1 + \tanh\left(\sqrt{\frac{2}{\pi}}\left(x + 0.044715x^3\right)\right)\right]$$
Sigmoid Approximation:
$$\text{GELU}(x) \approx x \cdot \sigma(1.702x)$$
The tanh approximation matches exact GELU to within about 0.001 over typical input ranges; the simpler sigmoid approximation is accurate to within about 0.02.
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104
import numpy as npfrom scipy.special import erffrom scipy.stats import norm def gelu_exact(x): """ Exact GELU using error function. GELU(x) = x * Φ(x) where Φ is standard normal CDF. """ return x * norm.cdf(x) def gelu_erf(x): """ Exact GELU using erf directly. Φ(x) = 0.5 * (1 + erf(x / sqrt(2))) """ return 0.5 * x * (1 + erf(x / np.sqrt(2))) def gelu_tanh_approx(x): """ Tanh approximation (used in GPT-2, BERT). Faster than exact erf computation. """ inner = np.sqrt(2 / np.pi) * (x + 0.044715 * x**3) return 0.5 * x * (1 + np.tanh(inner)) def gelu_sigmoid_approx(x): """ Sigmoid approximation. GELU(x) ≈ x * sigmoid(1.702 * x) Even faster, very close to Swish. """ return x * sigmoid(1.702 * x) def sigmoid(x): return 1 / (1 + np.exp(-np.clip(x, -500, 500))) def gelu_derivative_exact(x): """ Exact derivative: GELU'(x) = Φ(x) + x*φ(x) """ return norm.cdf(x) + x * norm.pdf(x) def gelu_derivative_tanh_approx(x): """ Derivative of tanh approximation. Requires more complex chain rule application. 
""" sqrt_2_pi = np.sqrt(2 / np.pi) inner = sqrt_2_pi * (x + 0.044715 * x**3) tanh_inner = np.tanh(inner) sech2_inner = 1 - tanh_inner**2 d_inner = sqrt_2_pi * (1 + 3 * 0.044715 * x**2) return 0.5 * (1 + tanh_inner) + 0.5 * x * sech2_inner * d_inner # Compare implementationsdef compare_gelu_implementations(): x = np.linspace(-4, 4, 1000) exact = gelu_exact(x) tanh_approx = gelu_tanh_approx(x) sigmoid_approx = gelu_sigmoid_approx(x) print("GELU Implementation Comparison:") print("-" * 50) print(f"Max |exact - tanh_approx|: {np.max(np.abs(exact - tanh_approx)):.6f}") print(f"Max |exact - sigmoid_approx|: {np.max(np.abs(exact - sigmoid_approx)):.6f}") # Specific values print("Values at key points:") for xi in [-2.0, -1.0, 0.0, 1.0, 2.0]: x_arr = np.array([xi]) print(f" x={xi:4.1f}: exact={gelu_exact(x_arr)[0]:.4f}, " f"tanh={gelu_tanh_approx(x_arr)[0]:.4f}, " f"sigmoid={gelu_sigmoid_approx(x_arr)[0]:.4f}") compare_gelu_implementations() # Analyze GELU propertiesdef analyze_gelu(): x = np.linspace(-4, 4, 10000) y = gelu_exact(x) dy = gelu_derivative_exact(x) min_idx = np.argmin(y) min_x = x[min_idx] min_y = y[min_idx] max_grad_idx = np.argmax(dy) max_grad_x = x[max_grad_idx] max_grad = dy[max_grad_idx] print("GELU Analysis:") print("-" * 40) print(f"Minimum: x = {min_x:.3f}, GELU(x) = {min_y:.3f}") print(f"Max gradient: x = {max_grad_x:.3f}, GELU'(x) = {max_grad:.4f}") print(f"At x = 0: GELU(0) = {gelu_exact(np.array([0.0]))[0]:.4f}") print(f"At x = 0: GELU'(0) = {gelu_derivative_exact(np.array([0.0]))[0]:.4f}") analyze_gelu()GELU became the standard activation in Transformers (BERT, GPT, T5) because: (1) The stochastic interpretation connects to dropout-like regularization, beneficial for the large overparameterized models, (2) Smooth gradients work well with Adam optimizer, (3) The 'soft' gating behavior based on input magnitude aligns with attention mechanisms, and (4) Empirically outperformed ReLU on language modeling benchmarks.
Swish and GELU are remarkably similar in practice. Both are smooth, non-monotonic functions that interpolate between zero and the identity. Their differences are subtle but sometimes meaningful.
Both can be written as x multiplied by a gate function:

- Swish(x) = x · σ(x) (a sigmoid gate)
- GELU(x) = x · Φ(x) (a Gaussian-CDF gate)
Since $\Phi(x) \approx \sigma(1.702x)$, we have:
$$\text{GELU}(x) \approx \text{Swish}_{\beta=1.702}(x)$$
So GELU is approximately Swish with a steeper sigmoid!
```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def swish(x, beta=1.0):
    return x * (1 / (1 + np.exp(-beta * x)))

def gelu_exact(x):
    return x * norm.cdf(x)

def swish_derivative(x):
    sig = 1 / (1 + np.exp(-x))
    return sig + x * sig * (1 - sig)

def gelu_derivative(x):
    return norm.cdf(x) + x * norm.pdf(x)

def compare_functions():
    """Detailed comparison of Swish and GELU."""
    x = np.linspace(-4, 4, 1000)

    swish_1 = swish(x, beta=1.0)
    swish_17 = swish(x, beta=1.702)  # Should closely match GELU
    gelu = gelu_exact(x)

    print("Swish vs GELU Comparison:")
    print("-" * 50)
    print(f"Max |Swish(β=1)     - GELU|: {np.max(np.abs(swish_1 - gelu)):.4f}")
    print(f"Max |Swish(β=1.702) - GELU|: {np.max(np.abs(swish_17 - gelu)):.4f}")

    # Key property: non-monotonicity (find where each derivative crosses 0)
    print("Non-monotonic regions (where derivative crosses 0):")
    try:
        swish_zero = brentq(swish_derivative, -3, 0)
        gelu_zero = brentq(gelu_derivative, -3, 0)
        print(f"  Swish derivative = 0 at x ≈ {swish_zero:.3f}")
        print(f"  GELU derivative = 0 at x ≈ {gelu_zero:.3f}")
    except ValueError:
        print("  (Could not find derivative zeros)")

    # Find minima
    swish_min_idx = np.argmin(swish_1)
    gelu_min_idx = np.argmin(gelu)
    print("Global minima:")
    print(f"  Swish: x = {x[swish_min_idx]:.3f}, min = {swish_1[swish_min_idx]:.4f}")
    print(f"  GELU:  x = {x[gelu_min_idx]:.3f}, min = {gelu[gelu_min_idx]:.4f}")

compare_functions()

def gradient_comparison():
    """Compare gradient magnitudes."""
    x = np.linspace(-3, 3, 1000)

    sw_grad = swish_derivative(x)
    ge_grad = gelu_derivative(x)

    print("Gradient behavior:")
    print("-" * 50)
    print(f"At x =  0: Swish' = {swish_derivative(0.0):.4f}, GELU' = {gelu_derivative(0.0):.4f}")
    print(f"At x =  1: Swish' = {swish_derivative(1.0):.4f}, GELU' = {gelu_derivative(1.0):.4f}")
    print(f"At x = -1: Swish' = {swish_derivative(-1.0):.4f}, GELU' = {gelu_derivative(-1.0):.4f}")
    print(f"Max Swish gradient: {np.max(sw_grad):.4f} at x = {x[np.argmax(sw_grad)]:.3f}")
    print(f"Max GELU gradient:  {np.max(ge_grad):.4f} at x = {x[np.argmax(ge_grad)]:.3f}")

gradient_comparison()
```

| Property | Swish (SiLU) | GELU | Notes |
|---|---|---|---|
| Formula | x · σ(x) | x · Φ(x) | σ = sigmoid, Φ = normal CDF |
| At x = 0 | 0 | 0 | Both pass through origin |
| Gradient at x = 0 | 0.50 | 0.50 | Identical at origin |
| Global minimum | ≈ -0.278 at x ≈ -1.28 | ≈ -0.170 at x ≈ -0.75 | GELU less negative |
| Non-monotonic range | x ∈ [-1.28, 0] | x ∈ [-0.75, 0] | GELU narrower |
| Computational cost | 1 sigmoid + 1 mul | 1 erf + muls | Swish simpler (exact) |
| Common approx | None needed | tanh polynomial | Both fast in practice |
| Typical use | EfficientNet, MobileNet | BERT, GPT, Transformers | Both work everywhere |
In practice, Swish and GELU perform nearly identically across most tasks. The choice often comes down to framework defaults or model heritage. When migrating between frameworks, using one instead of the other typically has negligible impact on accuracy. That said, for reproducing existing work, always match the exact activation used in the original paper.
The success of Swish and GELU challenged the assumption that activation functions should be monotonic. Here's the theoretical and empirical basis for non-monotonicity's benefits.
Consider inputs in the range [-2, 0]:

- ReLU maps every one of them to exactly 0, discarding all information
- Leaky ReLU preserves ordering, but only as a uniformly scaled copy (0.01x)
- Swish and GELU map them to distinct, input-dependent negative values
Unlike ReLU, which collapses the whole region to a single output, the smooth 'bump' of Swish and GELU keeps distinct negative inputs mapped to (mostly) distinct outputs, so little information is lost. The network can distinguish between x = -0.5 and x = -1.5 in ways that ReLU cannot.
The negative values produced by Swish/GELU for slightly negative inputs act as soft penalties: a unit can express weak negative evidence without being silenced entirely, and its nonzero gradient lets it recover during training.
```python
import numpy as np
from scipy.stats import norm

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def swish(x):
    return x / (1 + np.exp(-x))

def gelu(x):
    return x * norm.cdf(x)

def information_analysis():
    """Analyze information preservation in the negative region."""
    # 1000 distinct inputs in [-2, 0]
    x = np.linspace(-2, 0, 1000)

    outputs = {
        'ReLU': relu(x),
        'Leaky ReLU': leaky_relu(x),
        'Swish': swish(x),
        'GELU': gelu(x),
    }

    print("Information Preservation Analysis (inputs in [-2, 0]):")
    print("-" * 60)
    for name, y in outputs.items():
        unique_outputs = len(np.unique(np.round(y, 6)))
        print(f"{name:12}: {unique_outputs} unique outputs, "
              f"range = [{y.min():.4f}, {y.max():.4f}]")

information_analysis()

def gradient_flow_comparison():
    """Compare gradient flow through negative inputs."""
    x = np.linspace(-3, 0, 100)  # Negative inputs only

    # Derivatives
    relu_grad = np.zeros_like(x)        # All zero
    leaky_grad = np.full_like(x, 0.01)  # Constant

    def swish_grad(x):
        sig = 1 / (1 + np.exp(-x))
        return sig + x * sig * (1 - sig)

    def gelu_grad(x):
        return norm.cdf(x) + x * norm.pdf(x)

    sw_grad = swish_grad(x)
    ge_grad = gelu_grad(x)

    print("Gradient flow for negative inputs (x ∈ [-3, 0]):")
    print("-" * 60)
    print(f"ReLU:       mean grad = {relu_grad.mean():.4f} (dead)")
    print(f"Leaky ReLU: mean grad = {leaky_grad.mean():.4f} (fixed)")
    print(f"Swish:      mean grad = {sw_grad.mean():.4f} (adaptive)")
    print(f"GELU:       mean grad = {ge_grad.mean():.4f} (adaptive)")

    # Gradient varies adaptively with the input!
    print(f"Swish gradient range: [{sw_grad.min():.4f}, {sw_grad.max():.4f}]")
    print(f"GELU gradient range:  [{ge_grad.min():.4f}, {ge_grad.max():.4f}]")

gradient_flow_comparison()

def representation_richness():
    """Demonstrate richer representations from non-monotonic activations."""
    # Two distinct inputs that ReLU conflates
    x1 = np.array([-0.5])
    x2 = np.array([-1.5])

    print("Distinguishing power for negative inputs:")
    print("-" * 60)
    print(f"Inputs: x1 = {x1[0]}, x2 = {x2[0]}")
    for name, f in [('ReLU', relu), ('Leaky ReLU', leaky_relu),
                    ('Swish', swish), ('GELU', gelu)]:
        print(f"{name:12}: f(x1) = {f(x1)[0]:.4f}, f(x2) = {f(x2)[0]:.4f}, "
              f"diff = {abs(f(x1)[0] - f(x2)[0]):.4f}")

representation_richness()
```

Beyond non-monotonicity, the smoothness of Swish and GELU provides optimization benefits:
Continuous second derivatives: The loss landscape is smoother, allowing adaptive optimizers (Adam) to estimate curvature more accurately.
No 'corners': ReLU has a non-differentiable point at x = 0. While this rarely causes issues in practice, the smooth transition of Swish/GELU can help in some optimization scenarios.
Gradient continuity: The gradient changes smoothly with input, avoiding sudden jumps that can destabilize training.
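These points can be illustrated directly. A small sketch comparing the gradients just either side of zero: ReLU's derivative jumps from 0 to 1 across the corner, while Swish's derivative barely moves.

```python
import numpy as np

def relu_grad(x):
    # Subgradient of ReLU: 0 for x <= 0, 1 for x > 0
    return (x > 0).astype(float)

def swish_grad(x):
    # Swish'(x) = σ(x) + x·σ(x)·(1 - σ(x))
    s = 1 / (1 + np.exp(-x))
    return s + x * s * (1 - s)

eps = 1e-3
x = np.array([-eps, eps])
print(relu_grad(x))   # [0. 1.]  -- discontinuous jump at the origin
print(swish_grad(x))  # both ≈ 0.5 -- smooth across the origin
```

The discontinuity in ReLU's gradient means a tiny change in a pre-activation near zero can flip a unit's gradient from 0 to 1; Swish and GELU change gradients continuously instead.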
Google's original Swish paper tested across:

- ImageNet classification with Inception, ResNet, and MobileNet architectures
- CIFAR-10 and CIFAR-100 image classification
- WMT English→German machine translation
Finding: Swish consistently outperformed ReLU by 0.5-1.0% accuracy, often matching or exceeding ELU with lower computational cost.
Similarly, GELU's adoption in BERT and subsequent Transformers was validated by consistent improvements over ReLU on language modeling perplexity and downstream task accuracy.
Modern activation functions are called billions of times during training. Even small efficiency differences matter.
Forward pass complexity:
| Activation | Operations (Forward) | GPU-Optimized |
|---|---|---|
| ReLU | 1 max | Fastest |
| Leaky ReLU | 1 comparison + 1 mul | Very fast |
| Swish | 1 exp + 1 div + 1 mul | Moderate |
| GELU (exact) | 1 erf + 2 mul | Slow |
| GELU (tanh) | 1 tanh + 2 mul + pow | Moderate |
Backward pass: ReLU's backward is just a masked copy of the incoming gradient, while Swish and GELU reuse cached forward values (σ(x) or Φ(x)) plus a handful of extra multiplies.
Frameworks provide fused CUDA kernels that minimize memory access overhead:
```python
import numpy as np
import time

def benchmark_cpu(func, x, num_iterations=100, name="function"):
    """Benchmark an activation function on CPU."""
    # Warmup
    for _ in range(10):
        _ = func(x)

    start = time.perf_counter()
    for _ in range(num_iterations):
        _ = func(x)
    elapsed = time.perf_counter() - start

    throughput = (num_iterations * x.size) / elapsed / 1e6
    print(f"{name:20}: {elapsed:.4f}s, {throughput:.1f}M elements/sec")
    return elapsed

# Activation functions
def relu(x):
    return np.maximum(0, x)

def leaky_relu(x):
    return np.where(x > 0, x, 0.01 * x)

def swish(x):
    return x / (1 + np.exp(-np.clip(x, -500, 500)))

def gelu_tanh(x):
    inner = np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)
    return 0.5 * x * (1 + np.tanh(inner))

def gelu_sigmoid(x):
    return x * (1 / (1 + np.exp(-1.702 * np.clip(x, -500, 500))))

# Benchmark
x = np.random.randn(10_000_000).astype(np.float32)

print("CPU Benchmark (10M elements, 100 iterations):")
print("-" * 60)
benchmark_cpu(relu, x, name="ReLU")
benchmark_cpu(leaky_relu, x, name="Leaky ReLU")
benchmark_cpu(swish, x, name="Swish")
benchmark_cpu(gelu_tanh, x, name="GELU (tanh approx)")
benchmark_cpu(gelu_sigmoid, x, name="GELU (sigmoid approx)")

# PyTorch comparison (if available)
try:
    import torch
    import torch.nn.functional as F

    print("PyTorch GPU Benchmark (if CUDA available):")
    print("-" * 60)

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    x_torch = torch.randn(10_000_000, device=device)

    def benchmark_torch(func, x, num_iterations=100, name="function"):
        # Warmup
        for _ in range(10):
            _ = func(x)
        if device.type == 'cuda':
            torch.cuda.synchronize()

        start = time.perf_counter()
        for _ in range(num_iterations):
            _ = func(x)
        if device.type == 'cuda':
            torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
        print(f"{name:20}: {elapsed:.4f}s on {device}")

    benchmark_torch(F.relu, x_torch, name="ReLU")
    benchmark_torch(F.leaky_relu, x_torch, name="Leaky ReLU")
    benchmark_torch(F.silu, x_torch, name="SiLU/Swish")
    benchmark_torch(F.gelu, x_torch, name="GELU")
except ImportError:
    print("(PyTorch not available for GPU benchmark)")
```

For backpropagation, we cache intermediate values:
| Activation | Cached Values | Memory/Element |
|---|---|---|
| ReLU | Binary mask (x > 0) | 1 bit (packable) |
| Swish | σ(x) or input x | 32 bits |
| GELU | Φ(x) or input x | 32 bits |
For very deep networks or limited GPU memory, ReLU's minimal memory footprint can be significant.
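As a minimal sketch of what this caching looks like (hypothetical helper names, not any framework's internals): ReLU only needs a boolean mask for backward, while Swish must keep full-precision floats.

```python
import numpy as np

def relu_forward(x):
    mask = x > 0                    # boolean mask: 1 bit/element if bit-packed
    return x * mask, mask

def relu_backward(grad_out, mask):
    return grad_out * mask          # backward is just a masked copy

def swish_forward(x):
    sig = 1 / (1 + np.exp(-x))      # full-precision cache needed for backward
    out = x * sig
    return out, (out, sig)

def swish_backward(grad_out, cache):
    out, sig = cache
    # Swish'(x) = Swish(x) + σ(x) * (1 - Swish(x)), computed from cached values
    return grad_out * (out + sig * (1 - out))

x = np.array([-1.5, -0.3, 0.0, 2.0])
y, cache = swish_forward(x)
print(swish_backward(np.ones_like(x), cache))
```

The boolean mask costs 1 bit per element when packed; the Swish cache costs a full 32 bits per element, which is where the memory gap in the table above comes from.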
With gradient checkpointing, Swish/GELU's caching disadvantage is mitigated—all activations are recomputed during the backward pass anyway. In this scenario, the relative cost of Swish/GELU vs ReLU stays close to the forward-pass ratio.
Despite higher computational cost than ReLU, Swish and GELU are fast enough that they rarely bottleneck training. The 0.5-1% accuracy improvement typically outweighs the ~10-20% computational overhead. For inference-constrained deployments (mobile, edge), ReLU or Leaky ReLU may still be preferred for efficiency.
SwiGLU combines Swish with the Gated Linear Unit (GLU) mechanism, becoming the standard activation in LLaMA, Mistral, and other modern LLMs:
$$\text{SwiGLU}(x, W, V, b, c) = \text{Swish}(xW + b) \odot (xV + c)$$
The idea: use Swish to gate another linear projection, combining the benefits of gating (like in LSTMs) with smooth activation.
Advantage: Superior performance on language modeling tasks compared to plain GELU or Swish.
Cost: Doubles the parameters in FFN layers (two projections: W and V).
Hard Swish is a piecewise linear approximation used in MobileNetV3 for efficient inference:
$$\text{HardSwish}(x) = x \cdot \frac{\text{ReLU6}(x + 3)}{6}$$
where ReLU6(x) = min(max(0, x), 6).
This avoids exp() entirely, using only comparisons and multiplications.
```python
import numpy as np

def swish(x):
    return x / (1 + np.exp(-np.clip(x, -500, 500)))

def swiglu(x, W, V, b, c):
    """SwiGLU: gated activation used in LLaMA, Mistral, etc."""
    gate = swish(x @ W + b)   # Gate projection with Swish
    value = x @ V + c         # Value projection (linear)
    return gate * value       # Element-wise gating

def relu6(x):
    """ReLU capped at 6."""
    return np.clip(x, 0, 6)

def hard_swish(x):
    """
    Hard Swish: efficient piecewise-linear approximation.
    Used in MobileNetV3.
    """
    return x * relu6(x + 3) / 6

def hard_sigmoid(x):
    """Hard Sigmoid: piecewise-linear sigmoid approximation."""
    return np.clip((x + 3) / 6, 0, 1)

def mish(x):
    """
    Mish: another smooth non-monotonic activation.
    mish(x) = x * tanh(softplus(x)) = x * tanh(ln(1 + e^x))
    """
    softplus = np.log1p(np.exp(np.clip(x, -500, 20)))
    return x * np.tanh(softplus)

# Compare variants to the original
def compare_variants():
    x = np.linspace(-4, 4, 1000)

    exact_swish = swish(x)
    hard = hard_swish(x)
    m = mish(x)

    print("Variant Analysis:")
    print("-" * 50)
    print(f"Max |Swish - Hard Swish|: {np.max(np.abs(exact_swish - hard)):.4f}")
    print(f"Max |Swish - Mish|:       {np.max(np.abs(exact_swish - m)):.4f}")

    # Properties
    print("At x = -1:")
    print(f"  Swish(-1)      = {swish(np.array([-1.0]))[0]:.4f}")
    print(f"  Hard Swish(-1) = {hard_swish(np.array([-1.0]))[0]:.4f}")
    print(f"  Mish(-1)       = {mish(np.array([-1.0]))[0]:.4f}")

compare_variants()

# SwiGLU example
def swiglu_example():
    """Demonstrate SwiGLU in a simple FFN context."""
    np.random.seed(42)

    # Simulated hidden dimensions
    d_model = 512
    d_hidden = d_model * 4  # Typical 4x expansion
    batch_size = 16

    # Input
    x = np.random.randn(batch_size, d_model).astype(np.float32)

    # Weight matrices (in practice, often a single matrix that is split)
    W_gate = np.random.randn(d_model, d_hidden).astype(np.float32) * 0.02
    W_value = np.random.randn(d_model, d_hidden).astype(np.float32) * 0.02
    b_gate = np.zeros(d_hidden, dtype=np.float32)
    b_value = np.zeros(d_hidden, dtype=np.float32)

    # SwiGLU forward
    gate = swish(x @ W_gate + b_gate)
    value = x @ W_value + b_value
    output = gate * value

    print("SwiGLU FFN Example:")
    print(f"  Input shape:  {x.shape}")
    print(f"  Gate shape:   {gate.shape}")
    print(f"  Output shape: {output.shape}")
    print(f"  Output stats: mean={output.mean():.4f}, std={output.std():.4f}")

    # Note: 2x parameters compared to a standard FFN (W_gate + W_value vs just W)
    standard_params = d_model * d_hidden
    swiglu_params = 2 * d_model * d_hidden
    print(f"  Standard FFN params: {standard_params:,}")
    print(f"  SwiGLU FFN params:   {swiglu_params:,} ({swiglu_params/standard_params:.1f}x)")

swiglu_example()
```

State-of-the-art LLMs in 2024 predominantly use SwiGLU or variants. The doubled parameter count in FFN layers is offset by improved training efficiency—models learn more per parameter. When implementing custom architectures, SwiGLU is now the recommended default for transformer FFN layers.
We have comprehensively analyzed Swish and GELU—the modern smooth activation functions that have become standard in state-of-the-art neural networks.
| Architecture | Recommended Activation | Notes |
|---|---|---|
| CNNs (general) | Swish (SiLU) or ReLU | Swish for quality, ReLU for speed |
| CNNs (mobile) | Hard Swish | Efficient piecewise approximation |
| Transformers (NLP) | GELU | Standard in BERT, GPT, T5 |
| Transformers (vision) | GELU or Swish | Both common in ViT variants |
| LLM FFN layers | SwiGLU | LLaMA, Mistral, modern LLMs |
| Output layer (binary) | Sigmoid | For probability outputs |
| Gating mechanisms | Sigmoid | LSTM/GRU gates, attention gates |
Looking Ahead:
The next page explores Softmax, the essential multi-class output activation that powers classification heads and the entire attention mechanism in Transformers. Understanding softmax is crucial for completing your mastery of neural network activations.
You now have complete mastery of Swish and GELU—the modern smooth activations that power today's best models. You understand their mathematical foundations, why non-monotonicity helps, how to approximate GELU efficiently, and when to choose each variant for different architectures.