While ReLU and its variants dominated deep learning from roughly 2012 to 2017, two newer activation functions have since become the de facto standards for state-of-the-art models: Swish (also known as SiLU) and GELU.

These functions share a remarkable property: they are smooth, non-monotonic approximations to ReLU that consistently outperform ReLU across diverse architectures and tasks. They are now the default activations in:

- Transformers for NLP (BERT, GPT, T5 all use GELU)
- Efficient CNNs (EfficientNet uses Swish; MobileNetV3 uses Hard Swish)
- Modern LLM feed-forward layers (SwiGLU in LLaMA and Mistral)
This page provides a complete mathematical and empirical analysis of these modern activation functions.
By completing this page, you will understand:

- The mathematical formulation and derivatives of Swish and GELU
- Their relationship to ReLU and the stochastic interpretation of GELU
- Why non-monotonic activations outperform ReLU
- Computational considerations and common approximations
- When to choose Swish vs GELU in practice
Swish was discovered in 2017 by Ramachandran, Zoph, and Le at Google Brain through neural architecture search (NAS). Rather than hand-designing activation functions, they searched over a space of possible functions using reinforcement learning, evaluating candidates on their downstream task performance.
The search discovered that the simplest effective function was:
$$\text{Swish}(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}}$$
where σ(x) is the sigmoid function.
Domain and Range:

- Domain: all real numbers
- Range: approximately [-0.278, ∞); the global minimum of ≈ -0.278 occurs at x ≈ -1.28, and the function is unbounded above
Key Characteristics:

- Smooth (infinitely differentiable) everywhere
- Non-monotonic: dips below zero for moderately negative inputs before rising
- Approaches the identity for large positive x and approaches 0 for large negative x
- Bounded below, unbounded above
Using the product rule on f(x) = x · σ(x):
$$\text{Swish}'(x) = \sigma(x) + x \cdot \sigma'(x) = \sigma(x) + x \cdot \sigma(x)(1 - \sigma(x))$$
Simplifying:
$$\text{Swish}'(x) = \sigma(x) \cdot [1 + x(1 - \sigma(x))]$$
Or equivalently:
$$\text{Swish}'(x) = \sigma(x) + \frac{x \cdot e^{-x}}{(1 + e^{-x})^2}$$
Properties of the Derivative:

- Swish'(0) = 0.5
- Swish'(x) → 1 as x → ∞ and Swish'(x) → 0 as x → −∞
- The derivative is negative for x below ≈ -1.28, which is the source of Swish's non-monotonicity
- It slightly exceeds 1, peaking near 1.1 around x ≈ 2.5
```python
import numpy as np

def sigmoid(x):
    """Numerically stable sigmoid."""
    return np.where(x >= 0,
                    1 / (1 + np.exp(-x)),
                    np.exp(x) / (1 + np.exp(x)))

def swish(x, beta=1.0):
    """
    Swish activation function: x * sigmoid(beta * x)

    beta=1.0 is standard Swish
    beta→∞  approaches ReLU
    beta→0  approaches linear (x/2)
    """
    return x * sigmoid(beta * x)

def swish_derivative(x, beta=1.0):
    """
    Derivative of Swish.
    d/dx[x * σ(βx)] = σ(βx) + βx * σ(βx) * (1 - σ(βx))
    """
    sig = sigmoid(beta * x)
    return sig + beta * x * sig * (1 - sig)

def swish_derivative_from_output(output, x, beta=1.0):
    """
    More efficient: use cached forward values.
    Swish'(x) = β*Swish(x) + σ(βx)*(1 - β*Swish(x))
    """
    sig = sigmoid(beta * x)
    return beta * output + sig * (1 - beta * output)

# Analyze properties
def analyze_swish():
    x = np.linspace(-4, 4, 1000)
    y = swish(x)

    # Find minimum
    min_idx = np.argmin(y)
    min_x, min_y = x[min_idx], y[min_idx]

    print("Swish Analysis:")
    print("-" * 40)
    print(f"Minimum: x = {min_x:.3f}, Swish(x) = {min_y:.3f}")
    print(f"At x = 0:  Swish(0)   = {swish(np.array([0.0]))[0]:.3f}")
    print(f"At x = 0:  Swish'(0)  = {swish_derivative(np.array([0.0]))[0]:.3f}")
    print(f"At x = 5:  Swish(5)   = {swish(np.array([5.0]))[0]:.3f} (≈ x = 5)")
    print(f"At x = 5:  Swish'(5)  = {swish_derivative(np.array([5.0]))[0]:.4f} (≈ 1)")
    print(f"At x = -5: Swish(-5)  = {swish(np.array([-5.0]))[0]:.4f} (≈ 0)")
    print(f"At x = -5: Swish'(-5) = {swish_derivative(np.array([-5.0]))[0]:.4f} (≈ 0)")

analyze_swish()

# Beta parameter exploration
def explore_beta():
    """Show how the beta parameter controls Swish behavior."""
    x = np.array([0.5, -0.5, -2.0])

    print("Beta parameter effect (interpolating toward ReLU):")
    print("-" * 50)
    for beta in [0.1, 0.5, 1.0, 2.0, 5.0, 10.0]:
        vals = swish(x, beta=beta)
        print(f"beta={beta:4.1f}: Swish([0.5, -0.5, -2]) = {vals}")

    # As beta → ∞, Swish → ReLU
    print("As beta → ∞:")
    print(f"  ReLU([0.5, -0.5, -2]) = {np.maximum(0, x)}")

explore_beta()
```

Swish is also known as SiLU (Sigmoid Linear Unit) in PyTorch and other frameworks. They are identical: Swish(x) = x · σ(x) = SiLU(x). The name 'SiLU' emphasizes the sigmoid-linear structure, while 'Swish' was the term used in the original Google publication.
Swish's non-monotonic behavior in the region x ∈ [-1.28, 0] is a key differentiator from ReLU. For slightly negative inputs:

- The output is small and negative (e.g. Swish(-0.5) ≈ -0.19) rather than exactly zero
- The gradient is nonzero, so the unit continues to receive learning signal
This 'bump' below zero allows the network to use negative pre-activations as soft negative signals rather than completely discarding them.
Why This Helps:

- No dead neurons: gradients flow even for negative pre-activations
- Information preservation: distinct negative inputs produce distinct outputs
- Self-gating: the sigmoid factor softly scales each input by its own magnitude
The non-monotonicity was initially surprising—simpler monotonic functions seem more natural. But empirically, this property consistently improves performance across vision and language tasks.
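A quick numeric check of the bump, as a minimal NumPy sketch: moving from -3 toward 0, Swish first falls to its minimum near x ≈ -1.28 and then rises back toward 0.

```python
import numpy as np

def swish(x):
    return x / (1 + np.exp(-x))

# Non-monotonic 'bump': the value at x = -1.28 is below the values on either side.
xs = np.array([-3.0, -1.28, -0.5, 0.0])
print(swish(xs))  # ≈ [-0.142, -0.278, -0.189, 0.0]
```

A monotonic function like ReLU could never produce this pattern: its output never decreases as the input increases.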
GELU (Gaussian Error Linear Unit) was introduced by Hendrycks and Gimpel in 2016, motivated by a different perspective: stochastic regularization.
Consider a neuron whose output is randomly zeroed with a probability that depends on the input:

$$\text{output} = x \cdot m, \qquad m \sim \text{Bernoulli}(\Phi(x))$$

where Φ(x) is the standard Gaussian CDF, so larger inputs are more likely to be kept.
The expected output of this stochastic process is:
$$\mathbb{E}[\text{output}] = x \cdot \Phi(x) + 0 \cdot (1 - \Phi(x)) = x \cdot \Phi(x)$$
This is precisely the GELU function!
$$\text{GELU}(x) = x \cdot \Phi(x) = x \cdot \frac{1}{2}\left[1 + \text{erf}\left(\frac{x}{\sqrt{2}}\right)\right]$$
where erf is the Gauss error function:
$$\text{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2} dt$$
and Φ(x) is the CDF of the standard normal distribution:
$$\Phi(x) = \frac{1}{2}\left[1 + \text{erf}\left(\frac{x}{\sqrt{2}}\right)\right]$$
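The expectation argument above can be verified numerically. A minimal Monte Carlo sketch (the sample count, seed, and helper name `stochastic_unit` are illustrative choices, not from the original derivation):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def stochastic_unit(x, n_samples=200_000):
    """Keep x with probability Φ(x), output 0 otherwise; return the sample mean."""
    keep = rng.random(n_samples) < norm.cdf(x)
    return np.mean(np.where(keep, x, 0.0))

for x in [-1.0, 0.5, 2.0]:
    print(f"x = {x:+.1f}: E[output] ≈ {stochastic_unit(x):.4f}, "
          f"GELU(x) = {x * norm.cdf(x):.4f}")
```

The Monte Carlo estimates agree with x · Φ(x) to within sampling noise, confirming that GELU is the expected value of this input-dependent dropout process.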
Using the product rule:
$$\text{GELU}'(x) = \Phi(x) + x \cdot \phi(x)$$
where φ(x) is the standard Gaussian PDF:
$$\phi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$$
Properties:

- GELU'(0) = 0.5, matching Swish at the origin
- GELU'(x) → 1 as x → ∞ and GELU'(x) → 0 as x → −∞
- The derivative slightly exceeds 1, peaking at ≈ 1.13 near x = √2 ≈ 1.41
- It is negative for x below ≈ -0.75, the source of GELU's non-monotonicity
The erf function is computationally expensive. Two approximations are widely used:
Tanh Approximation (used in GPT-2):
$$\text{GELU}(x) \approx 0.5x\left[1 + \tanh\left(\sqrt{\frac{2}{\pi}}\left(x + 0.044715x^3\right)\right)\right]$$
Sigmoid Approximation:
$$\text{GELU}(x) \approx x \cdot \sigma(1.702x)$$
The tanh approximation matches exact GELU to within about 0.001 over typical input ranges; the simpler sigmoid approximation is accurate to within about 0.02.
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104
import numpy as npfrom scipy.special import erffrom scipy.stats import norm def gelu_exact(x): """ Exact GELU using error function. GELU(x) = x * Φ(x) where Φ is standard normal CDF. """ return x * norm.cdf(x) def gelu_erf(x): """ Exact GELU using erf directly. Φ(x) = 0.5 * (1 + erf(x / sqrt(2))) """ return 0.5 * x * (1 + erf(x / np.sqrt(2))) def gelu_tanh_approx(x): """ Tanh approximation (used in GPT-2, BERT). Faster than exact erf computation. """ inner = np.sqrt(2 / np.pi) * (x + 0.044715 * x**3) return 0.5 * x * (1 + np.tanh(inner)) def gelu_sigmoid_approx(x): """ Sigmoid approximation. GELU(x) ≈ x * sigmoid(1.702 * x) Even faster, very close to Swish. """ return x * sigmoid(1.702 * x) def sigmoid(x): return 1 / (1 + np.exp(-np.clip(x, -500, 500))) def gelu_derivative_exact(x): """ Exact derivative: GELU'(x) = Φ(x) + x*φ(x) """ return norm.cdf(x) + x * norm.pdf(x) def gelu_derivative_tanh_approx(x): """ Derivative of tanh approximation. Requires more complex chain rule application. 
""" sqrt_2_pi = np.sqrt(2 / np.pi) inner = sqrt_2_pi * (x + 0.044715 * x**3) tanh_inner = np.tanh(inner) sech2_inner = 1 - tanh_inner**2 d_inner = sqrt_2_pi * (1 + 3 * 0.044715 * x**2) return 0.5 * (1 + tanh_inner) + 0.5 * x * sech2_inner * d_inner # Compare implementationsdef compare_gelu_implementations(): x = np.linspace(-4, 4, 1000) exact = gelu_exact(x) tanh_approx = gelu_tanh_approx(x) sigmoid_approx = gelu_sigmoid_approx(x) print("GELU Implementation Comparison:") print("-" * 50) print(f"Max |exact - tanh_approx|: {np.max(np.abs(exact - tanh_approx)):.6f}") print(f"Max |exact - sigmoid_approx|: {np.max(np.abs(exact - sigmoid_approx)):.6f}") # Specific values print("Values at key points:") for xi in [-2.0, -1.0, 0.0, 1.0, 2.0]: x_arr = np.array([xi]) print(f" x={xi:4.1f}: exact={gelu_exact(x_arr)[0]:.4f}, " f"tanh={gelu_tanh_approx(x_arr)[0]:.4f}, " f"sigmoid={gelu_sigmoid_approx(x_arr)[0]:.4f}") compare_gelu_implementations() # Analyze GELU propertiesdef analyze_gelu(): x = np.linspace(-4, 4, 10000) y = gelu_exact(x) dy = gelu_derivative_exact(x) min_idx = np.argmin(y) min_x = x[min_idx] min_y = y[min_idx] max_grad_idx = np.argmax(dy) max_grad_x = x[max_grad_idx] max_grad = dy[max_grad_idx] print("GELU Analysis:") print("-" * 40) print(f"Minimum: x = {min_x:.3f}, GELU(x) = {min_y:.3f}") print(f"Max gradient: x = {max_grad_x:.3f}, GELU'(x) = {max_grad:.4f}") print(f"At x = 0: GELU(0) = {gelu_exact(np.array([0.0]))[0]:.4f}") print(f"At x = 0: GELU'(0) = {gelu_derivative_exact(np.array([0.0]))[0]:.4f}") analyze_gelu()GELU became the standard activation in Transformers (BERT, GPT, T5) because: (1) The stochastic interpretation connects to dropout-like regularization, beneficial for the large overparameterized models, (2) Smooth gradients work well with Adam optimizer, (3) The 'soft' gating behavior based on input magnitude aligns with attention mechanisms, and (4) Empirically outperformed ReLU on language modeling benchmarks.
Swish and GELU are remarkably similar in practice. Both are smooth, non-monotonic functions that interpolate between zero and the identity. Their differences are subtle but sometimes meaningful.
Both can be written as x multiplied by a gate function:

- Swish(x) = x · σ(x) (a sigmoid gate)
- GELU(x) = x · Φ(x) (a Gaussian-CDF gate)
Since $\Phi(x) \approx \sigma(1.702x)$, we have:
$$\text{GELU}(x) \approx \text{Swish}_{\beta=1.702}(x)$$
So GELU is approximately Swish with a steeper sigmoid!
```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def swish(x, beta=1.0):
    return x * (1 / (1 + np.exp(-beta * x)))

def gelu_exact(x):
    return x * norm.cdf(x)

def swish_derivative(x):
    sig = 1 / (1 + np.exp(-x))
    return sig + x * sig * (1 - sig)

def gelu_derivative(x):
    return norm.cdf(x) + x * norm.pdf(x)

def compare_functions():
    """Detailed comparison of Swish and GELU."""
    x = np.linspace(-4, 4, 1000)

    swish_1 = swish(x, beta=1.0)
    swish_17 = swish(x, beta=1.702)  # Should closely match GELU
    gelu = gelu_exact(x)

    print("Swish vs GELU Comparison:")
    print("-" * 50)
    print(f"Max |Swish(β=1)     - GELU|: {np.max(np.abs(swish_1 - gelu)):.4f}")
    print(f"Max |Swish(β=1.702) - GELU|: {np.max(np.abs(swish_17 - gelu)):.4f}")

    # Key property: non-monotonicity (find where each derivative crosses 0)
    print("Non-monotonic regions (where derivative crosses 0):")
    try:
        swish_zero = brentq(swish_derivative, -3, 0)
        gelu_zero = brentq(gelu_derivative, -3, 0)
        print(f"  Swish derivative = 0 at x ≈ {swish_zero:.3f}")
        print(f"  GELU derivative = 0 at x ≈ {gelu_zero:.3f}")
    except ValueError:
        print("  (Could not find derivative zeros)")

    # Find minima
    swish_min_idx = np.argmin(swish_1)
    gelu_min_idx = np.argmin(gelu)
    print("Global minima:")
    print(f"  Swish: x = {x[swish_min_idx]:.3f}, min = {swish_1[swish_min_idx]:.4f}")
    print(f"  GELU:  x = {x[gelu_min_idx]:.3f}, min = {gelu[gelu_min_idx]:.4f}")

compare_functions()

def gradient_comparison():
    """Compare gradient magnitudes."""
    x = np.linspace(-3, 3, 1000)

    sw_grad = swish_derivative(x)
    ge_grad = gelu_derivative(x)

    print("Gradient behavior:")
    print("-" * 50)
    print(f"At x =  0: Swish' = {swish_derivative(0.0):.4f}, GELU' = {gelu_derivative(0.0):.4f}")
    print(f"At x =  1: Swish' = {swish_derivative(1.0):.4f}, GELU' = {gelu_derivative(1.0):.4f}")
    print(f"At x = -1: Swish' = {swish_derivative(-1.0):.4f}, GELU' = {gelu_derivative(-1.0):.4f}")
    print(f"Max Swish gradient: {np.max(sw_grad):.4f} at x = {x[np.argmax(sw_grad)]:.3f}")
    print(f"Max GELU gradient:  {np.max(ge_grad):.4f} at x = {x[np.argmax(ge_grad)]:.3f}")

gradient_comparison()
```

| Property | Swish (SiLU) | GELU | Notes |
|---|---|---|---|
| Formula | x · σ(x) | x · Φ(x) | σ = sigmoid, Φ = normal CDF |
| At x = 0 | 0 | 0 | Both pass through origin |
| Gradient at x = 0 | 0.50 | 0.50 | Identical at origin |
| Global minimum | ≈ -0.278 at x ≈ -1.28 | ≈ -0.170 at x ≈ -0.75 | GELU less negative |
| Non-monotonic range | x ∈ [-1.28, 0] | x ∈ [-0.75, 0] | GELU narrower |
| Computational cost | 1 sigmoid + 1 mul | 1 erf + muls | Swish simpler (exact) |
| Common approx | None needed | tanh polynomial | Both fast in practice |
| Typical use | EfficientNet, MobileNet | BERT, GPT, Transformers | Both work everywhere |
In practice, Swish and GELU perform nearly identically across most tasks. The choice often comes down to framework defaults or model heritage. When migrating between frameworks, using one instead of the other typically has negligible impact on accuracy. That said, for reproducing existing work, always match the exact activation used in the original paper.
The success of Swish and GELU challenged the assumption that activation functions should be monotonic. Here's the theoretical and empirical basis for non-monotonicity's benefits.
Consider inputs in the range [-2, 0]:

- ReLU maps every one of them to exactly 0, discarding all information
- Leaky ReLU preserves ordering, but only as a uniformly scaled copy (0.01x)
- Swish and GELU map them to distinct, input-dependent negative values
Unlike ReLU, which collapses the whole region to a single output, the smooth 'bump' of Swish and GELU keeps distinct negative inputs mapped to (mostly) distinct outputs, so little information is lost. The network can distinguish between x = -0.5 and x = -1.5 in ways that ReLU cannot.
The negative values produced by Swish/GELU for slightly negative inputs act as soft penalties: a unit can express weak negative evidence without being silenced entirely, and its nonzero gradient lets it recover during training.
```python
import numpy as np
from scipy.stats import norm

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def swish(x):
    return x / (1 + np.exp(-x))

def gelu(x):
    return x * norm.cdf(x)

def information_analysis():
    """Analyze information preservation in the negative region."""
    # 1000 distinct inputs in [-2, 0]
    x = np.linspace(-2, 0, 1000)

    outputs = {
        'ReLU': relu(x),
        'Leaky ReLU': leaky_relu(x),
        'Swish': swish(x),
        'GELU': gelu(x),
    }

    print("Information Preservation Analysis (inputs in [-2, 0]):")
    print("-" * 60)
    for name, y in outputs.items():
        unique_outputs = len(np.unique(np.round(y, 6)))
        print(f"{name:12}: {unique_outputs} unique outputs, "
              f"range = [{y.min():.4f}, {y.max():.4f}]")

information_analysis()

def gradient_flow_comparison():
    """Compare gradient flow through negative inputs."""
    x = np.linspace(-3, 0, 100)  # Negative inputs only

    # Derivatives
    relu_grad = np.zeros_like(x)        # All zero
    leaky_grad = np.full_like(x, 0.01)  # Constant

    def swish_grad(x):
        sig = 1 / (1 + np.exp(-x))
        return sig + x * sig * (1 - sig)

    def gelu_grad(x):
        return norm.cdf(x) + x * norm.pdf(x)

    sw_grad = swish_grad(x)
    ge_grad = gelu_grad(x)

    print("Gradient flow for negative inputs (x ∈ [-3, 0]):")
    print("-" * 60)
    print(f"ReLU:       mean grad = {relu_grad.mean():.4f} (dead)")
    print(f"Leaky ReLU: mean grad = {leaky_grad.mean():.4f} (fixed)")
    print(f"Swish:      mean grad = {sw_grad.mean():.4f} (adaptive)")
    print(f"GELU:       mean grad = {ge_grad.mean():.4f} (adaptive)")

    # Gradient varies adaptively with the input!
    print(f"Swish gradient range: [{sw_grad.min():.4f}, {sw_grad.max():.4f}]")
    print(f"GELU gradient range:  [{ge_grad.min():.4f}, {ge_grad.max():.4f}]")

gradient_flow_comparison()

def representation_richness():
    """Demonstrate richer representations from non-monotonic activations."""
    # Two distinct inputs that ReLU conflates
    x1 = np.array([-0.5])
    x2 = np.array([-1.5])

    print("Distinguishing power for negative inputs:")
    print("-" * 60)
    print(f"Inputs: x1 = {x1[0]}, x2 = {x2[0]}")
    for name, f in [('ReLU', relu), ('Leaky ReLU', leaky_relu),
                    ('Swish', swish), ('GELU', gelu)]:
        print(f"{name:12}: f(x1) = {f(x1)[0]:.4f}, f(x2) = {f(x2)[0]:.4f}, "
              f"diff = {abs(f(x1)[0] - f(x2)[0]):.4f}")

representation_richness()
```

Beyond non-monotonicity, the smoothness of Swish and GELU provides optimization benefits:
Continuous second derivatives: The loss landscape is smoother, allowing adaptive optimizers (Adam) to estimate curvature more accurately.
No 'corners': ReLU has a non-differentiable point at x = 0. While this rarely causes issues in practice, the smooth transition of Swish/GELU can help in some optimization scenarios.
Gradient continuity: The gradient changes smoothly with input, avoiding sudden jumps that can destabilize training.
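These points can be illustrated directly. A small sketch comparing the gradients just either side of zero: ReLU's derivative jumps from 0 to 1 across the corner, while Swish's derivative barely moves.

```python
import numpy as np

def relu_grad(x):
    # Subgradient of ReLU: 0 for x <= 0, 1 for x > 0
    return (x > 0).astype(float)

def swish_grad(x):
    # Swish'(x) = σ(x) + x·σ(x)·(1 - σ(x))
    s = 1 / (1 + np.exp(-x))
    return s + x * s * (1 - s)

eps = 1e-3
x = np.array([-eps, eps])
print(relu_grad(x))   # [0. 1.]  -- discontinuous jump at the origin
print(swish_grad(x))  # both ≈ 0.5 -- smooth across the origin
```

The discontinuity in ReLU's gradient means a tiny change in a pre-activation near zero can flip a unit's gradient from 0 to 1; Swish and GELU change gradients continuously instead.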
Google's original Swish paper tested across:

- ImageNet classification with Inception, ResNet, and MobileNet architectures
- CIFAR-10 and CIFAR-100 image classification
- WMT English→German machine translation
Finding: Swish consistently outperformed ReLU by 0.5-1.0% accuracy, often matching or exceeding ELU with lower computational cost.
Similarly, GELU's adoption in BERT and subsequent Transformers was validated by consistent improvements over ReLU on language modeling perplexity and downstream task accuracy.
Modern activation functions are called billions of times during training. Even small efficiency differences matter.
Forward pass complexity:
| Activation | Operations (Forward) | GPU-Optimized |
|---|---|---|
| ReLU | 1 max | Fastest |
| Leaky ReLU | 1 comparison + 1 mul | Very fast |
| Swish | 1 exp + 1 div + 1 mul | Moderate |
| GELU (exact) | 1 erf + 2 mul | Slow |
| GELU (tanh) | 1 tanh + 2 mul + pow | Moderate |
Backward pass: ReLU's backward is just a masked copy of the incoming gradient, while Swish and GELU reuse cached forward values (σ(x) or Φ(x)) plus a handful of extra multiplies.
Frameworks provide fused CUDA kernels that minimize memory access overhead:
```python
import numpy as np
import time

def benchmark_cpu(func, x, num_iterations=100, name="function"):
    """Benchmark an activation function on CPU."""
    # Warmup
    for _ in range(10):
        _ = func(x)

    start = time.perf_counter()
    for _ in range(num_iterations):
        _ = func(x)
    elapsed = time.perf_counter() - start

    throughput = (num_iterations * x.size) / elapsed / 1e6
    print(f"{name:20}: {elapsed:.4f}s, {throughput:.1f}M elements/sec")
    return elapsed

# Activation functions
def relu(x):
    return np.maximum(0, x)

def leaky_relu(x):
    return np.where(x > 0, x, 0.01 * x)

def swish(x):
    return x / (1 + np.exp(-np.clip(x, -500, 500)))

def gelu_tanh(x):
    inner = np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)
    return 0.5 * x * (1 + np.tanh(inner))

def gelu_sigmoid(x):
    return x * (1 / (1 + np.exp(-1.702 * np.clip(x, -500, 500))))

# Benchmark
x = np.random.randn(10_000_000).astype(np.float32)

print("CPU Benchmark (10M elements, 100 iterations):")
print("-" * 60)
benchmark_cpu(relu, x, name="ReLU")
benchmark_cpu(leaky_relu, x, name="Leaky ReLU")
benchmark_cpu(swish, x, name="Swish")
benchmark_cpu(gelu_tanh, x, name="GELU (tanh approx)")
benchmark_cpu(gelu_sigmoid, x, name="GELU (sigmoid approx)")

# PyTorch comparison (if available)
try:
    import torch
    import torch.nn.functional as F

    print("PyTorch GPU Benchmark (if CUDA available):")
    print("-" * 60)

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    x_torch = torch.randn(10_000_000, device=device)

    def benchmark_torch(func, x, num_iterations=100, name="function"):
        # Warmup
        for _ in range(10):
            _ = func(x)
        if device.type == 'cuda':
            torch.cuda.synchronize()

        start = time.perf_counter()
        for _ in range(num_iterations):
            _ = func(x)
        if device.type == 'cuda':
            torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
        print(f"{name:20}: {elapsed:.4f}s on {device}")

    benchmark_torch(F.relu, x_torch, name="ReLU")
    benchmark_torch(F.leaky_relu, x_torch, name="Leaky ReLU")
    benchmark_torch(F.silu, x_torch, name="SiLU/Swish")
    benchmark_torch(F.gelu, x_torch, name="GELU")
except ImportError:
    print("(PyTorch not available for GPU benchmark)")
```

For backpropagation, we cache intermediate values:
| Activation | Cached Values | Memory/Element |
|---|---|---|
| ReLU | Binary mask (x > 0) | 1 bit (packable) |
| Swish | σ(x) or input x | 32 bits |
| GELU | Φ(x) or input x | 32 bits |
For very deep networks or limited GPU memory, ReLU's minimal memory footprint can be significant.
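As a minimal sketch of what this caching looks like (hypothetical helper names, not any framework's internals): ReLU only needs a boolean mask for backward, while Swish must keep full-precision floats.

```python
import numpy as np

def relu_forward(x):
    mask = x > 0                    # boolean mask: 1 bit/element if bit-packed
    return x * mask, mask

def relu_backward(grad_out, mask):
    return grad_out * mask          # backward is just a masked copy

def swish_forward(x):
    sig = 1 / (1 + np.exp(-x))      # full-precision cache needed for backward
    out = x * sig
    return out, (out, sig)

def swish_backward(grad_out, cache):
    out, sig = cache
    # Swish'(x) = Swish(x) + σ(x) * (1 - Swish(x)), computed from cached values
    return grad_out * (out + sig * (1 - out))

x = np.array([-1.5, -0.3, 0.0, 2.0])
y, cache = swish_forward(x)
print(swish_backward(np.ones_like(x), cache))
```

The boolean mask costs 1 bit per element when packed; the Swish cache costs a full 32 bits per element, which is where the memory gap in the table above comes from.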
With gradient checkpointing, Swish/GELU's caching disadvantage is mitigated—all activations are recomputed during the backward pass anyway. In this scenario, the relative cost of Swish/GELU vs ReLU stays close to the forward-pass ratio.
Despite higher computational cost than ReLU, Swish and GELU are fast enough that they rarely bottleneck training. The 0.5-1% accuracy improvement typically outweighs the ~10-20% computational overhead. For inference-constrained deployments (mobile, edge), ReLU or Leaky ReLU may still be preferred for efficiency.
SwiGLU combines Swish with the Gated Linear Unit (GLU) mechanism, becoming the standard activation in LLaMA, Mistral, and other modern LLMs:
$$\text{SwiGLU}(x, W, V, b, c) = \text{Swish}(xW + b) \odot (xV + c)$$
The idea: use Swish to gate another linear projection, combining the benefits of gating (like in LSTMs) with smooth activation.
Advantage: Superior performance on language modeling tasks compared to plain GELU or Swish.
Cost: Doubles the parameters in FFN layers (two projections: W and V).
Hard Swish is a piecewise linear approximation used in MobileNetV3 for efficient inference:
$$\text{HardSwish}(x) = x \cdot \frac{\text{ReLU6}(x + 3)}{6}$$
where ReLU6(x) = min(max(0, x), 6).
This avoids exp() entirely, using only comparisons and multiplications.
```python
import numpy as np

def swish(x):
    return x / (1 + np.exp(-np.clip(x, -500, 500)))

def swiglu(x, W, V, b, c):
    """SwiGLU: gated activation used in LLaMA, Mistral, etc."""
    gate = swish(x @ W + b)   # Gate projection with Swish
    value = x @ V + c         # Value projection (linear)
    return gate * value       # Element-wise gating

def relu6(x):
    """ReLU capped at 6."""
    return np.clip(x, 0, 6)

def hard_swish(x):
    """
    Hard Swish: efficient piecewise-linear approximation.
    Used in MobileNetV3.
    """
    return x * relu6(x + 3) / 6

def hard_sigmoid(x):
    """Hard Sigmoid: piecewise-linear sigmoid approximation."""
    return np.clip((x + 3) / 6, 0, 1)

def mish(x):
    """
    Mish: another smooth non-monotonic activation.
    mish(x) = x * tanh(softplus(x)) = x * tanh(ln(1 + e^x))
    """
    softplus = np.log1p(np.exp(np.clip(x, -500, 20)))
    return x * np.tanh(softplus)

# Compare variants to the original
def compare_variants():
    x = np.linspace(-4, 4, 1000)

    exact_swish = swish(x)
    hard = hard_swish(x)
    m = mish(x)

    print("Variant Analysis:")
    print("-" * 50)
    print(f"Max |Swish - Hard Swish|: {np.max(np.abs(exact_swish - hard)):.4f}")
    print(f"Max |Swish - Mish|:       {np.max(np.abs(exact_swish - m)):.4f}")

    # Properties
    print("At x = -1:")
    print(f"  Swish(-1)      = {swish(np.array([-1.0]))[0]:.4f}")
    print(f"  Hard Swish(-1) = {hard_swish(np.array([-1.0]))[0]:.4f}")
    print(f"  Mish(-1)       = {mish(np.array([-1.0]))[0]:.4f}")

compare_variants()

# SwiGLU example
def swiglu_example():
    """Demonstrate SwiGLU in a simple FFN context."""
    np.random.seed(42)

    # Simulated hidden dimensions
    d_model = 512
    d_hidden = d_model * 4  # Typical 4x expansion
    batch_size = 16

    # Input
    x = np.random.randn(batch_size, d_model).astype(np.float32)

    # Weight matrices (in practice, often a single matrix that is split)
    W_gate = np.random.randn(d_model, d_hidden).astype(np.float32) * 0.02
    W_value = np.random.randn(d_model, d_hidden).astype(np.float32) * 0.02
    b_gate = np.zeros(d_hidden, dtype=np.float32)
    b_value = np.zeros(d_hidden, dtype=np.float32)

    # SwiGLU forward
    gate = swish(x @ W_gate + b_gate)
    value = x @ W_value + b_value
    output = gate * value

    print("SwiGLU FFN Example:")
    print(f"  Input shape:  {x.shape}")
    print(f"  Gate shape:   {gate.shape}")
    print(f"  Output shape: {output.shape}")
    print(f"  Output stats: mean={output.mean():.4f}, std={output.std():.4f}")

    # Note: 2x parameters compared to a standard FFN (W_gate + W_value vs just W)
    standard_params = d_model * d_hidden
    swiglu_params = 2 * d_model * d_hidden
    print(f"  Standard FFN params: {standard_params:,}")
    print(f"  SwiGLU FFN params:   {swiglu_params:,} ({swiglu_params/standard_params:.1f}x)")

swiglu_example()
```

State-of-the-art LLMs in 2024 predominantly use SwiGLU or variants. The doubled parameter count in FFN layers is offset by improved training efficiency—models learn more per parameter. When implementing custom architectures, SwiGLU is now the recommended default for transformer FFN layers.
We have comprehensively analyzed Swish and GELU—the modern smooth activation functions that have become standard in state-of-the-art neural networks.
| Architecture | Recommended Activation | Notes |
|---|---|---|
| CNNs (general) | Swish (SiLU) or ReLU | Swish for quality, ReLU for speed |
| CNNs (mobile) | Hard Swish | Efficient piecewise approximation |
| Transformers (NLP) | GELU | Standard in BERT, GPT, T5 |
| Transformers (vision) | GELU or Swish | Both common in ViT variants |
| LLM FFN layers | SwiGLU | LLaMA, Mistral, modern LLMs |
| Output layer (binary) | Sigmoid | For probability outputs |
| Gating mechanisms | Sigmoid | LSTM/GRU gates, attention gates |
Looking Ahead:
The next page explores Softmax, the essential multi-class output activation that powers classification heads and the entire attention mechanism in Transformers. Understanding softmax is crucial for completing your mastery of neural network activations.
You now have complete mastery of Swish and GELU—the modern smooth activations that power today's best models. You understand their mathematical foundations, why non-monotonicity helps, how to approximate GELU efficiently, and when to choose each variant for different architectures.