Activation functions are the mathematical soul of neural networks. Without them, a network of any depth would collapse to a single linear transformation—incapable of learning anything beyond what simple linear regression can represent.
The Linear Composition Collapse:
Consider a network with only linear activations ($\sigma(z) = z$):
$$\mathbf{a}^{(L)} = W^{(L)}(W^{(L-1)}(\cdots W^{(1)}\mathbf{x})) = (W^{(L)} W^{(L-1)} \cdots W^{(1)})\mathbf{x} = \tilde{W}\mathbf{x}$$
The composition of linear functions is linear. All those layers collapse to a single matrix $\tilde{W}$. Depth provides no benefit.
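A quick numerical check of this collapse; the layer sizes below are arbitrary, chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Three "layers" with purely linear activations (dimensions chosen arbitrarily)
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((4, 4))
W3 = rng.standard_normal((2, 4))
x = rng.standard_normal(3)

deep_output = W3 @ (W2 @ (W1 @ x))            # forward pass through three linear layers
W_tilde = W3 @ W2 @ W1                        # the single collapsed matrix
print(np.allclose(deep_output, W_tilde @ x))  # True: depth added nothing
```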
The Role of Nonlinearity:
Nonlinear activation functions break this collapse. Each layer then does something fundamentally different from matrix multiplication: it introduces curvature, thresholds, or saturation that linear algebra cannot express. This is what allows depth to add representational power.
This page examines activation functions in depth: their mathematical properties, gradient behavior, computational characteristics, and practical selection for different architectures.
By the end of this page, you will understand: (1) Why nonlinearity is mathematically essential; (2) Classical activations (sigmoid, tanh) and their limitations; (3) Modern activations (ReLU family, Swish, GELU) and their advantages; (4) Gradient flow properties and the vanishing gradient problem; (5) Selection guidelines for different network types.
Before examining specific activations, we establish the mathematical properties that determine their effectiveness.
Definition (Activation Function): An activation function $\sigma: \mathbb{R} \to \mathbb{R}$ is applied element-wise to pre-activation vectors: $$\mathbf{a} = \sigma(\mathbf{z}) = (\sigma(z_1), \sigma(z_2), \ldots, \sigma(z_n))^\top$$
Key Properties to Analyze:
Range: The possible output values. Bounded (sigmoid: [0,1]) vs unbounded (ReLU: [0,∞)).
Monotonicity: Whether $x < y \Rightarrow \sigma(x) < \sigma(y)$. Most activations are monotonic; some (GELU) have slight non-monotonicities.
Differentiability: Required for gradient-based training. Most are differentiable everywhere or almost everywhere (ReLU is non-differentiable at 0).
Gradient Bounds: $|\sigma'(z)|$ determines gradient flow. If always < 1, gradients shrink (vanishing); if > 1, they grow (exploding).
Zero-Centeredness: Whether $\mathbb{E}[\sigma(z)] \approx 0$ for typical inputs. Non-zero mean can cause zig-zagging in gradient descent.
Saturation: Whether outputs approach constant values for extreme inputs, causing near-zero gradients.
Computational Cost: Some activations are cheap (ReLU: one comparison) while others require expensive operations (tanh: exp, division).
| Activation | Range | Derivative Range | Zero-Centered | Saturates | Computation |
|---|---|---|---|---|---|
| Sigmoid | (0, 1) | (0, 0.25] | No | Yes (both ends) | exp, division |
| Tanh | (-1, 1) | (0, 1] | Yes | Yes (both ends) | exp, division |
| ReLU | [0, ∞) | {0, 1} | No | Left only | max(0, z) |
| Leaky ReLU | (-∞, ∞) | {α, 1} | Nearly | No | max(αz, z) |
| ELU | (-α, ∞) | (0, 1] | Nearly | Left only | exp for z<0 |
| SELU | (-λα, ∞) | varies | Self-normalizing | Left only | exp for z<0 |
| Swish/SiLU | ≈(-0.28, ∞) | varies | Nearly | Smooth left | sigmoid × z |
| GELU | (-0.17, ∞) | varies | Nearly | Smooth left | erf or approx |
The most critical property for deep networks is gradient behavior. During backpropagation, gradients are multiplied by σ'(z) at each layer. If |σ'(z)| < 1 consistently (sigmoid), gradients shrink exponentially with depth. If |σ'(z)| = 1 when active (ReLU), gradients can flow unchanged. This is why ReLU revolutionized deep learning.
Sigmoid and tanh dominated neural networks from the 1980s through the early 2010s. Understanding their properties—and limitations—explains the motivation for modern alternatives.
Sigmoid (Logistic) Function:
$$\sigma(z) = \frac{1}{1 + e^{-z}} = \frac{e^z}{1 + e^z}$$
Derivative: $$\sigma'(z) = \sigma(z)(1 - \sigma(z))$$
Properties:
- Range (0, 1): outputs can be read as probabilities.
- Derivative $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ peaks at 0.25 (at $z = 0$), so every layer shrinks the gradient.
- Not zero-centered: outputs are always positive, which biases weight updates and causes zig-zagging.
- Saturates at both ends: for $|z|$ beyond roughly 4, the gradient is nearly zero.
- Requires exp and a division, noticeably more expensive than ReLU.
Tanh (Hyperbolic Tangent):
$$\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}} = 2\sigma(2z) - 1$$
Derivative: $$\tanh'(z) = 1 - \tanh^2(z)$$
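A short numpy check of the identity above and of the two derivative maxima (the `sigmoid` helper is defined inline for the example):

```python
import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

z = np.linspace(-5, 5, 101)
# tanh is a rescaled, recentered sigmoid: tanh(z) = 2*sigmoid(2z) - 1
print(np.allclose(np.tanh(z), 2 * sigmoid(2 * z) - 1))        # True
# tanh's derivative peaks at 1 (at z = 0), four times sigmoid's peak of 0.25
print((1 - np.tanh(z)**2).max(), (sigmoid(z) * (1 - sigmoid(z))).max())
```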
Properties:
- Range (-1, 1) and zero-centered, which usually makes optimization better behaved than with sigmoid.
- Derivative $\tanh'(z) = 1 - \tanh^2(z)$ peaks at 1 (at $z = 0$), but still shrinks gradients once inputs move away from zero.
- Saturates at both ends for $|z|$ beyond roughly 3.
- Requires exp and a division.
```python
import numpy as np
import matplotlib.pyplot as plt


def sigmoid(z):
    """Logistic sigmoid activation."""
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))


def sigmoid_derivative(z):
    """Derivative of sigmoid."""
    s = sigmoid(z)
    return s * (1 - s)


def tanh(z):
    """Hyperbolic tangent activation."""
    return np.tanh(z)


def tanh_derivative(z):
    """Derivative of tanh."""
    return 1 - np.tanh(z)**2


def analyze_classical_activations():
    """
    Comprehensive analysis of sigmoid and tanh.
    """
    z = np.linspace(-6, 6, 1000)

    fig, axes = plt.subplots(1, 3, figsize=(14, 4))

    # Function values
    ax1 = axes[0]
    ax1.plot(z, sigmoid(z), 'b-', linewidth=2, label='Sigmoid')
    ax1.plot(z, tanh(z), 'r-', linewidth=2, label='Tanh')
    ax1.axhline(y=0, color='k', linestyle='--', alpha=0.3)
    ax1.axvline(x=0, color='k', linestyle='--', alpha=0.3)
    ax1.set_title('Activation Values', fontsize=12)
    ax1.set_xlabel('z (pre-activation)')
    ax1.set_ylabel('σ(z)')
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    ax1.set_ylim(-1.2, 1.2)

    # Derivatives
    ax2 = axes[1]
    ax2.plot(z, sigmoid_derivative(z), 'b-', linewidth=2, label="Sigmoid'")
    ax2.plot(z, tanh_derivative(z), 'r-', linewidth=2, label="Tanh'")
    ax2.axhline(y=0.25, color='b', linestyle=':', alpha=0.5, label='Sigmoid max (0.25)')
    ax2.axhline(y=1.0, color='r', linestyle=':', alpha=0.5, label='Tanh max (1.0)')
    ax2.set_title('Derivatives (Gradient Flow)', fontsize=12)
    ax2.set_xlabel('z (pre-activation)')
    ax2.set_ylabel("σ'(z)")
    ax2.legend()
    ax2.grid(True, alpha=0.3)
    ax2.set_ylim(-0.1, 1.2)

    # Gradient decay through layers
    ax3 = axes[2]
    depths = np.arange(1, 21)
    # Assume pre-activations near 0 (best case)
    sigmoid_decay = 0.25 ** depths      # Gradient shrinks by 0.25 each layer
    tanh_decay = 1.0 ** depths          # Best case: no decay
    # More realistic: mixed pre-activations
    sigmoid_realistic = 0.2 ** depths   # Average derivative < max
    tanh_realistic = 0.7 ** depths      # Average derivative < 1

    ax3.semilogy(depths, sigmoid_decay, 'b--', label='Sigmoid (best case)', alpha=0.5)
    ax3.semilogy(depths, sigmoid_realistic, 'b-', linewidth=2, label='Sigmoid (typical)')
    ax3.semilogy(depths, tanh_realistic, 'r-', linewidth=2, label='Tanh (typical)')
    ax3.set_title('Gradient Magnitude vs Depth', fontsize=12)
    ax3.set_xlabel('Network Depth (layers)')
    ax3.set_ylabel('Gradient Scale (log)')
    ax3.legend()
    ax3.grid(True, alpha=0.3)
    ax3.axhline(y=1e-10, color='gray', linestyle=':', label='Numerical underflow')

    plt.tight_layout()
    plt.savefig('classical_activations.png', dpi=150)
    plt.show()


def vanishing_gradient_demo():
    """
    Demonstrate vanishing gradients with sigmoid in a deep network.
    """
    print("Vanishing Gradient Demonstration")
    print("=" * 50)

    # Simulate gradient flow through layers
    depth = 20
    n_units = 100

    # Initialize pre-activations from standard normal
    np.random.seed(42)

    print("\nSimulating gradient backpropagation through sigmoid network:")
    print("-" * 50)

    gradient_norms = []
    grad = np.ones(n_units)  # Start with unit gradient from loss

    for layer in range(depth, 0, -1):
        # Pre-activation at this layer (standard normal)
        z = np.random.randn(n_units)

        # Gradient through sigmoid
        local_grad = sigmoid_derivative(z)
        grad = grad * local_grad  # Element-wise for simplicity

        # Weight matrix contribution would multiply here too
        # (simulating with random orthogonal for now)
        W = np.random.randn(n_units, n_units) * 0.1
        grad = W.T @ grad

        norm = np.linalg.norm(grad)
        gradient_norms.append(norm)

        if layer % 5 == 0 or norm < 1e-10:
            print(f"  Layer {layer:2d}: gradient norm = {norm:.2e}")

    print(f"\nGradient shrunk by factor: {gradient_norms[0] / gradient_norms[-1]:.2e}")
    print("This is why deep sigmoid networks are nearly impossible to train!")


if __name__ == "__main__":
    analyze_classical_activations()
    vanishing_gradient_demo()
```

With sigmoid, gradients multiply by at most 0.25 at each layer. After 10 layers, gradients are at most 0.25¹⁰ ≈ 10⁻⁶ of their original magnitude. After 20 layers, they're essentially zero (≈10⁻¹²). This is why deep networks were considered impractical until around 2010—the gradients simply vanished before reaching early layers.
The Rectified Linear Unit (ReLU) transformed deep learning. Introduced for deep networks by Nair and Hinton (2010) and popularized by Krizhevsky et al. in AlexNet (2012), ReLU addressed the vanishing gradient problem with elegant simplicity.
ReLU Definition:
$$\text{ReLU}(z) = \max(0, z) = \begin{cases} z & \text{if } z > 0 \\ 0 & \text{if } z \leq 0 \end{cases}$$
Derivative:
$$\text{ReLU}'(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z < 0 \\ \text{undefined} & \text{if } z = 0 \end{cases}$$
In practice, we use a subgradient at $z=0$ (typically 0 or 1).
Why ReLU Solved Vanishing Gradients:
- For $z > 0$ the derivative is exactly 1, so gradients pass through active units unchanged instead of shrinking at every layer.
- There is no saturation on the positive side: arbitrarily large pre-activations still receive full gradient.
- Computation is a single comparison, which made much larger and deeper networks practical.
- Roughly half the units output exactly zero, producing sparse activations.
The Dead Neuron Problem:
A neuron "dies" when its pre-activation is negative for all training inputs. Once dead:
Causes:
- A learning rate that is too large, letting one big update push the weights or bias far into the negative region.
- Poorly scaled initialization or strongly negative initial biases.
Solutions:
- Lower the learning rate or use an adaptive optimizer.
- Use sensible initialization (e.g., He initialization) and small or zero initial biases.
- Switch to a variant with a nonzero negative slope: Leaky ReLU, PReLU, or ELU.
How Many Dead Neurons?
In a well-trained ReLU network, some fraction of units typically ends up inactive, and moderate levels are usually tolerated without a measurable hit to accuracy. A large fraction of dead units, however, is a warning sign of an overly aggressive learning rate or poor initialization; the dead-neuron simulation in the code below shows how much initialization alone changes the rate.
```python
import numpy as np
import matplotlib.pyplot as plt


def relu(z):
    return np.maximum(0, z)


def relu_derivative(z):
    return (z > 0).astype(float)


def gradient_flow_comparison():
    """
    Compare gradient flow through ReLU vs Sigmoid networks.
    """
    np.random.seed(42)

    depths = [5, 10, 20, 50]
    n_units = 256
    n_simulations = 100

    results = {'relu': {}, 'sigmoid': {}}

    for depth in depths:
        relu_grads = []
        sigmoid_grads = []

        for _ in range(n_simulations):
            # Initialize weights properly
            relu_grad = np.ones(n_units)
            sigmoid_grad = np.ones(n_units)

            for layer in range(depth):
                # Random pre-activations
                z = np.random.randn(n_units)

                # ReLU gradient
                relu_local = relu_derivative(z)
                relu_grad = relu_grad * relu_local
                # Weight matrix (He initialization scale)
                W = np.random.randn(n_units, n_units) * np.sqrt(2/n_units)
                relu_grad = W.T @ relu_grad

                # Sigmoid gradient
                s = 1 / (1 + np.exp(-z))
                sigmoid_local = s * (1 - s)
                sigmoid_grad = sigmoid_grad * sigmoid_local
                W = np.random.randn(n_units, n_units) * np.sqrt(1/n_units)
                sigmoid_grad = W.T @ sigmoid_grad

            relu_grads.append(np.linalg.norm(relu_grad))
            sigmoid_grads.append(np.linalg.norm(sigmoid_grad))

        results['relu'][depth] = np.mean(relu_grads)
        results['sigmoid'][depth] = np.mean(sigmoid_grads)

    print("Gradient Flow Comparison")
    print("=" * 50)
    print(f"{'Depth':>6} | {'ReLU':>12} | {'Sigmoid':>12} | {'Ratio':>10}")
    print("-" * 50)
    for depth in depths:
        r = results['relu'][depth]
        s = results['sigmoid'][depth]
        ratio = r / s if s > 0 else float('inf')
        print(f"{depth:>6} | {r:>12.2e} | {s:>12.2e} | {ratio:>10.0f}x")

    return results


def dead_neuron_simulation():
    """
    Simulate dead neuron occurrence during training.
    """
    print("\nDead Neuron Simulation")
    print("=" * 50)

    n_units = 1000

    # Different initialization strategies
    strategies = {
        'zero_bias': (np.random.randn(n_units) * 0.1, np.zeros(n_units)),
        'negative_bias': (np.random.randn(n_units) * 0.1, -np.ones(n_units)),
        'positive_bias': (np.random.randn(n_units) * 0.1, np.ones(n_units)),
        'poor_init': (np.random.randn(n_units) * 1.0, np.zeros(n_units)),
    }

    # Simulate inputs (batch of 1000 samples)
    inputs = np.random.randn(1000, 10)
    input_weights = np.random.randn(n_units, 10) * np.sqrt(2/10)

    for name, (hidden_weights, biases) in strategies.items():
        # Compute pre-activations for all inputs
        pre_acts = inputs @ input_weights.T  # (1000, n_units)
        pre_acts += biases                   # Add bias

        # Count dead neurons (never positive across all inputs)
        activations = relu(pre_acts)
        dead_mask = np.all(activations == 0, axis=0)
        dead_count = np.sum(dead_mask)
        dead_pct = 100 * dead_count / n_units

        print(f"{name:>15}: {dead_count:4d} dead neurons ({dead_pct:.1f}%)")


def relu_vs_sigmoid_training():
    """
    Demonstrate training speed difference on a simple task.
    """
    print("\nTraining Speed Comparison (XOR task)")
    print("=" * 50)

    # XOR dataset
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    y = np.array([[0], [1], [1], [0]])

    def mse_loss(y_pred, y_true):
        return 0.5 * np.mean((y_pred - y_true)**2)

    def train_simple_network(activation, activation_deriv, epochs=5000, lr=0.5):
        np.random.seed(42)
        # Simple 2-4-1 network
        W1 = np.random.randn(4, 2) * 0.5
        b1 = np.zeros((1, 4))
        W2 = np.random.randn(1, 4) * 0.5
        b2 = np.zeros((1, 1))

        losses = []
        for epoch in range(epochs):
            # Forward
            z1 = X @ W1.T + b1
            a1 = activation(z1)
            z2 = a1 @ W2.T + b2
            a2 = 1 / (1 + np.exp(-z2))  # Sigmoid output for probability

            loss = mse_loss(a2, y)
            losses.append(loss)

            # Backward
            dz2 = (a2 - y) * a2 * (1 - a2)
            dW2 = dz2.T @ a1
            db2 = np.sum(dz2, axis=0, keepdims=True)

            da1 = dz2 @ W2
            dz1 = da1 * activation_deriv(z1)
            dW1 = dz1.T @ X
            db1 = np.sum(dz1, axis=0, keepdims=True)

            # Update
            W2 -= lr * dW2
            b2 -= lr * db2
            W1 -= lr * dW1
            b1 -= lr * db1

        return losses

    sigmoid_losses = train_simple_network(
        lambda z: 1 / (1 + np.exp(-z)),
        lambda z: (1 / (1 + np.exp(-z))) * (1 - 1 / (1 + np.exp(-z)))
    )
    relu_losses = train_simple_network(relu, relu_derivative)

    # Find convergence point
    threshold = 0.01
    sigmoid_converged = next((i for i, l in enumerate(sigmoid_losses) if l < threshold),
                             len(sigmoid_losses))
    relu_converged = next((i for i, l in enumerate(relu_losses) if l < threshold),
                          len(relu_losses))

    print(f"Sigmoid converged at epoch: {sigmoid_converged}")
    print(f"ReLU converged at epoch: {relu_converged}")
    print(f"ReLU was {sigmoid_converged/relu_converged:.1f}x faster")


if __name__ == "__main__":
    gradient_flow_comparison()
    dead_neuron_simulation()
    relu_vs_sigmoid_training()
```

Several ReLU variants address its limitations while preserving its gradient flow advantages.
Leaky ReLU (Maas et al., 2013):
$$\text{LeakyReLU}(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha z & \text{if } z \leq 0 \end{cases}$$
where $\alpha \in (0, 1)$ is typically $0.01$ or $0.1$.
Advantage: No dead neurons—negative inputs still produce nonzero gradients.
Parametric ReLU (PReLU, He et al., 2015):
$$\text{PReLU}(z) = \begin{cases} z & \text{if } z > 0 \\ a z & \text{if } z \leq 0 \end{cases}$$
where $a$ is a learned parameter (per-channel or shared).
Advantage: Optimal slope is learned from data.
Exponential Linear Unit (ELU, Clevert et al., 2016):
$$\text{ELU}(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha(e^z - 1) & \text{if } z \leq 0 \end{cases}$$
Advantages:
- Smooth transition through the negative region, unlike ReLU's hard corner.
- Negative outputs push mean activations toward zero, reducing the bias shift plain ReLU introduces.
- Nonzero gradient for negative inputs, so units do not die.
Disadvantage: Requires exp computation for negative values.
Scaled ELU (SELU, Klambauer et al., 2017):
$$\text{SELU}(z) = \lambda \begin{cases} z & \text{if } z > 0 \\ \alpha(e^z - 1) & \text{if } z \leq 0 \end{cases}$$
where $\alpha \approx 1.6733$ and $\lambda \approx 1.0507$ are specific constants.
Advantage: Self-normalizing—activations converge to zero mean, unit variance without batch normalization.
Requirement: Proper initialization (LeCun normal) and fully-connected architecture.
```python
import numpy as np
import matplotlib.pyplot as plt


# Activation functions
def relu(z):
    return np.maximum(0, z)


def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)


def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * (np.exp(z) - 1))


def selu(z, alpha=1.6732632423543772, lam=1.0507009873554805):
    return lam * np.where(z > 0, z, alpha * (np.exp(np.clip(z, -10, 10)) - 1))


def prelu(z, a):
    """PReLU with learned parameter a."""
    return np.where(z > 0, z, a * z)


# Derivatives
def relu_deriv(z):
    return (z > 0).astype(float)


def leaky_relu_deriv(z, alpha=0.01):
    return np.where(z > 0, 1, alpha)


def elu_deriv(z, alpha=1.0):
    return np.where(z > 0, 1, alpha * np.exp(z))


def selu_deriv(z, alpha=1.6732632423543772, lam=1.0507009873554805):
    return lam * np.where(z > 0, 1, alpha * np.exp(np.clip(z, -10, 10)))


def visualize_relu_variants():
    """
    Compare ReLU variants visually.
    """
    z = np.linspace(-4, 4, 1000)

    fig, axes = plt.subplots(2, 2, figsize=(12, 10))

    # Function values
    ax1 = axes[0, 0]
    ax1.plot(z, relu(z), 'b-', linewidth=2, label='ReLU')
    ax1.plot(z, leaky_relu(z, 0.1), 'r-', linewidth=2, label='Leaky ReLU (α=0.1)')
    ax1.plot(z, elu(z, 1.0), 'g-', linewidth=2, label='ELU (α=1.0)')
    ax1.axhline(y=0, color='k', linestyle='--', alpha=0.3)
    ax1.axvline(x=0, color='k', linestyle='--', alpha=0.3)
    ax1.set_title('Activation Functions', fontsize=12)
    ax1.set_xlabel('z')
    ax1.set_ylabel('σ(z)')
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    ax1.set_ylim(-2, 4)

    # Derivatives
    ax2 = axes[0, 1]
    ax2.plot(z, relu_deriv(z), 'b-', linewidth=2, label='ReLU')
    ax2.plot(z, leaky_relu_deriv(z, 0.1), 'r-', linewidth=2, label='Leaky ReLU')
    ax2.plot(z, elu_deriv(z, 1.0), 'g-', linewidth=2, label='ELU')
    ax2.axhline(y=0, color='k', linestyle='--', alpha=0.3)
    ax2.axhline(y=1, color='gray', linestyle=':', alpha=0.5)
    ax2.set_title('Derivatives', fontsize=12)
    ax2.set_xlabel('z')
    ax2.set_ylabel("σ'(z)")
    ax2.legend()
    ax2.grid(True, alpha=0.3)
    ax2.set_ylim(-0.2, 1.5)

    # SELU self-normalizing property
    ax3 = axes[1, 0]
    np.random.seed(42)

    # Simulate forward pass through many layers
    n_layers = 50
    n_units = 1000

    means_relu = []
    vars_relu = []
    means_selu = []
    vars_selu = []

    # Start with unit Gaussian
    x_relu = np.random.randn(n_units)
    x_selu = np.random.randn(n_units)

    for _ in range(n_layers):
        # Random weights
        W_relu = np.random.randn(n_units, n_units) * np.sqrt(2/n_units)  # He init
        W_selu = np.random.randn(n_units, n_units) * np.sqrt(1/n_units)  # LeCun init

        x_relu = relu(W_relu @ x_relu)
        x_selu = selu(W_selu @ x_selu)

        means_relu.append(np.mean(x_relu))
        vars_relu.append(np.var(x_relu))
        means_selu.append(np.mean(x_selu))
        vars_selu.append(np.var(x_selu))

    layers = np.arange(1, n_layers + 1)
    ax3.plot(layers, vars_relu, 'b-', linewidth=2, label='ReLU Variance')
    ax3.plot(layers, vars_selu, 'purple', linewidth=2, label='SELU Variance')
    ax3.axhline(y=1, color='gray', linestyle='--', alpha=0.5, label='Target Var=1')
    ax3.set_title('Self-Normalizing Property', fontsize=12)
    ax3.set_xlabel('Layer Depth')
    ax3.set_ylabel('Activation Variance')
    ax3.legend()
    ax3.grid(True, alpha=0.3)
    ax3.set_yscale('log')

    # Dead neuron comparison
    ax4 = axes[1, 1]

    # Simulate dead neuron rate
    n_samples = 1000
    n_neurons = 500

    # Random pre-activations with bias toward negative
    z_samples = np.random.randn(n_samples, n_neurons) - 0.5

    activations = {
        'ReLU': relu(z_samples),
        'Leaky ReLU': leaky_relu(z_samples, 0.1),
        'ELU': elu(z_samples),
    }

    dead_rates = {}
    for name, acts in activations.items():
        # A neuron is "dead" if it's zero for all samples
        if name == 'ReLU':
            dead = np.all(acts == 0, axis=0)
        else:
            dead = np.all(acts <= 0, axis=0)
        dead_rates[name] = 100 * np.mean(dead)

    bars = ax4.bar(dead_rates.keys(), dead_rates.values(), color=['blue', 'red', 'green'])
    ax4.set_title('Dead/Inactive Neuron Rate', fontsize=12)
    ax4.set_ylabel('% Neurons Dead')
    ax4.grid(True, alpha=0.3, axis='y')
    for bar, rate in zip(bars, dead_rates.values()):
        ax4.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1,
                 f'{rate:.1f}%', ha='center', fontsize=10)

    plt.tight_layout()
    plt.savefig('relu_variants.png', dpi=150)
    plt.show()


if __name__ == "__main__":
    visualize_relu_variants()
```

ReLU: Default choice for most architectures, especially CNNs.
Leaky ReLU: When dead neurons are a concern; good general alternative.
ELU: When you want smoother gradients and can afford the computation.
SELU: Specific to fully-connected networks without batch normalization.
PReLU: When you have enough data to learn the slope parameter.
Recent research has produced smooth activations that match or exceed ReLU performance, particularly in deep networks and transformers.
Swish / SiLU (Ramachandran et al., 2017):
$$\text{Swish}(z) = z \cdot \sigma(z) = \frac{z}{1 + e^{-z}}$$
Properties:
- Smooth everywhere and non-monotonic: it dips slightly below zero (minimum ≈ -0.28 near z ≈ -1.28) before increasing.
- Unbounded above and bounded below; for large positive z it behaves like the identity, much like ReLU.
- Self-gated: the input is scaled by its own sigmoid, $\sigma(z)$.
Derivative: $$\text{Swish}'(z) = \sigma(z) + z \cdot \sigma(z) \cdot (1 - \sigma(z)) = \sigma(z)(1 + z(1 - \sigma(z)))$$
GELU (Gaussian Error Linear Unit, Hendrycks & Gimpel, 2016):
$$\text{GELU}(z) = z \cdot \Phi(z) = z \cdot \frac{1}{2}\left(1 + \text{erf}\left(\frac{z}{\sqrt{2}}\right)\right)$$
where $\Phi(z)$ is the CDF of the standard normal distribution.
Approximation (faster): $$\text{GELU}(z) \approx 0.5z\left(1 + \tanh\left(\sqrt{\frac{2}{\pi}}(z + 0.044715z^3)\right)\right)$$
Properties:
- Smooth and slightly non-monotonic, with a minimum of roughly -0.17.
- Weights each input by $\Phi(z)$, the probability that a standard normal variable falls below z, so small negative inputs are attenuated rather than hard-clipped.
- Approaches ReLU for large $|z|$ and the identity for large positive z.
Why Smooth Matters:
Smooth activations have continuous derivatives, so the gradient changes gradually instead of jumping from 0 to 1 at z = 0 as ReLU's does. This tends to give a better-behaved loss surface and steadier updates, which empirically matters most in very deep models such as transformers.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import erf


def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))


def swish(z):
    """Swish/SiLU activation: z * sigmoid(z)"""
    return z * sigmoid(z)


def gelu_exact(z):
    """GELU activation using exact formula with erf."""
    return z * 0.5 * (1 + erf(z / np.sqrt(2)))


def gelu_approx(z):
    """GELU approximation using tanh (faster)."""
    return 0.5 * z * (1 + np.tanh(np.sqrt(2/np.pi) * (z + 0.044715 * z**3)))


def swish_derivative(z):
    """Derivative of Swish."""
    s = sigmoid(z)
    return s + z * s * (1 - s)


def gelu_derivative(z):
    """Derivative of GELU (approximate)."""
    # Numerical derivative for simplicity
    eps = 1e-7
    return (gelu_exact(z + eps) - gelu_exact(z - eps)) / (2 * eps)


def compare_modern_activations():
    """
    Comprehensive comparison of modern smooth activations.
    """
    z = np.linspace(-4, 4, 1000)

    fig, axes = plt.subplots(2, 2, figsize=(12, 10))

    # Function comparison
    ax1 = axes[0, 0]
    ax1.plot(z, np.maximum(0, z), 'b--', linewidth=1.5, alpha=0.7, label='ReLU')
    ax1.plot(z, swish(z), 'r-', linewidth=2, label='Swish')
    ax1.plot(z, gelu_exact(z), 'g-', linewidth=2, label='GELU')
    ax1.axhline(y=0, color='k', linestyle=':', alpha=0.3)
    ax1.axvline(x=0, color='k', linestyle=':', alpha=0.3)
    ax1.set_title('Activation Functions', fontsize=12)
    ax1.set_xlabel('z')
    ax1.set_ylabel('σ(z)')
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    ax1.set_ylim(-0.5, 4)

    # Derivative comparison
    ax2 = axes[0, 1]
    ax2.plot(z, (z > 0).astype(float), 'b--', linewidth=1.5, alpha=0.7, label='ReLU')
    ax2.plot(z, swish_derivative(z), 'r-', linewidth=2, label='Swish')
    ax2.plot(z, gelu_derivative(z), 'g-', linewidth=2, label='GELU')
    ax2.axhline(y=1, color='gray', linestyle=':', alpha=0.5)
    ax2.set_title('Derivatives', fontsize=12)
    ax2.set_xlabel('z')
    ax2.set_ylabel("σ'(z)")
    ax2.legend()
    ax2.grid(True, alpha=0.3)
    ax2.set_ylim(-0.2, 1.5)

    # GELU approximation accuracy
    ax3 = axes[1, 0]
    exact = gelu_exact(z)
    approx = gelu_approx(z)
    error = np.abs(exact - approx)
    ax3.semilogy(z, error + 1e-10, 'purple', linewidth=2)
    ax3.set_title('GELU Approximation Error', fontsize=12)
    ax3.set_xlabel('z')
    ax3.set_ylabel('|exact - approx|')
    ax3.grid(True, alpha=0.3)
    ax3.axhline(y=1e-3, color='r', linestyle='--', alpha=0.5, label='0.001 threshold')
    ax3.legend()

    # Non-monotonicity detail
    ax4 = axes[1, 1]
    z_detail = np.linspace(-3, 1, 500)
    ax4.plot(z_detail, swish(z_detail), 'r-', linewidth=2, label='Swish')
    ax4.plot(z_detail, gelu_exact(z_detail), 'g-', linewidth=2, label='GELU')
    ax4.axhline(y=0, color='k', linestyle=':', alpha=0.3)
    ax4.axvline(x=0, color='k', linestyle=':', alpha=0.3)
    # Mark minimum points
    swish_min_z = -1.278  # Approximate minimum
    gelu_min_z = -0.77    # Approximate minimum
    ax4.scatter([swish_min_z], [swish(swish_min_z)], color='r', s=100, zorder=5, marker='v')
    ax4.scatter([gelu_min_z], [gelu_exact(gelu_min_z)], color='g', s=100, zorder=5, marker='v')
    ax4.set_title('Non-Monotonicity (Zoomed)', fontsize=12)
    ax4.set_xlabel('z')
    ax4.set_ylabel('σ(z)')
    ax4.legend()
    ax4.grid(True, alpha=0.3)
    ax4.set_ylim(-0.4, 0.6)

    plt.tight_layout()
    plt.savefig('modern_activations.png', dpi=150)
    plt.show()


def transformer_activation_comparison():
    """
    Compare activations in transformer-like setting.
    """
    print("Activation Comparison for Transformer FFN")
    print("=" * 50)

    np.random.seed(42)

    # Simulate transformer FFN block
    # d_model = 768, d_ff = 3072 (BERT-base dimensions)
    d_model = 768
    d_ff = 3072
    batch_size = 32
    seq_len = 128

    # Random input (simulating hidden states)
    x = np.random.randn(batch_size * seq_len, d_model).astype(np.float32)

    # FFN weights
    W1 = np.random.randn(d_ff, d_model).astype(np.float32) * np.sqrt(2/d_model)
    W2 = np.random.randn(d_model, d_ff).astype(np.float32) * np.sqrt(2/d_ff)

    # Forward with different activations
    def ffn_forward(x, activation_fn):
        h = x @ W1.T
        h = activation_fn(h)
        out = h @ W2.T
        return out

    activations = {
        'ReLU': lambda z: np.maximum(0, z),
        'GELU': gelu_approx,
        'Swish': swish,
    }

    for name, fn in activations.items():
        out = ffn_forward(x, fn)
        print(f"\n{name}:")
        print(f"  Output mean: {out.mean():.4f}")
        print(f"  Output std: {out.std():.4f}")
        print(f"  Output range: [{out.min():.2f}, {out.max():.2f}]")


if __name__ == "__main__":
    compare_modern_activations()
    transformer_activation_comparison()
```

GELU emerged from a probabilistic interpretation: multiply the input by a Bernoulli random variable with probability Φ(z). This stochastic view connects to dropout-like regularization. Empirically, GELU consistently outperforms ReLU in transformer architectures, likely due to smoother gradients and gentler saturation. The original BERT, GPT-2, and most modern language models use GELU.
Choosing the right activation function is part science, part engineering judgment. Here are practical guidelines based on architecture type and problem characteristics.
| Architecture | Hidden Layers | Output Layer | Notes |
|---|---|---|---|
| MLP (general) | ReLU or Leaky ReLU | Task-dependent | Start with ReLU; try Leaky if dead neurons |
| Deep MLP (>10 layers) | SELU or ELU | Task-dependent | SELU with LeCun init; no BatchNorm needed |
| CNN | ReLU | Softmax (classification) | ReLU is standard; BatchNorm handles activation drift |
| ResNet | ReLU | Softmax | Skip connections allow ReLU to work at any depth |
| Transformer | GELU (preferred) or Swish | Task-dependent | GELU is standard for NLP; Swish for vision |
| RNN/LSTM | Tanh (gates), Sigmoid (gates) | Task-dependent | LSTM gates require bounded activations |
| GAN Generator | ReLU or Leaky ReLU | Tanh | Tanh output for [-1, 1] image range |
| GAN Discriminator | Leaky ReLU | Sigmoid or none | Leaky ReLU keeps gradients flowing to the generator |
| VAE Encoder | ReLU or ELU | Linear (mean/logvar) | Smooth activations help gradient flow |
Output Layer Activations by Task:
| Task | Activation | Loss Function | Output Range |
|---|---|---|---|
| Binary Classification | Sigmoid | Binary Cross-Entropy | (0, 1) |
| Multi-class Classification | Softmax | Cross-Entropy | Probability simplex |
| Regression | Linear (none) | MSE or MAE | (-∞, +∞) |
| Bounded Regression | Sigmoid * scale | MSE | (0, scale) |
| Multi-label Classification | Sigmoid (each) | Binary CE per label | (0, 1) per label |
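To make the output-layer column concrete, here is a minimal numpy sketch of the most common choices; the `sigmoid` and `softmax` helpers are defined inline for the example, not taken from any library:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def softmax(z):
    # Numerically stable softmax over the last axis
    shifted = z - z.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([2.0, -1.0, 0.5])

binary_prob = sigmoid(logits[0])       # binary classification: one probability in (0, 1)
class_probs = softmax(logits)          # multi-class: probabilities summing to 1
regression = logits[0]                 # regression: linear output, no activation
bounded = 10.0 * sigmoid(logits[0])    # bounded regression onto (0, 10)

print(binary_prob, class_probs.sum(), regression, bounded)
```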
Decision Flowchart:
Is it an output layer? → Choose by task: sigmoid, softmax, or linear, per the output-layer table above.
Is it a transformer or modern NLP model? → Use GELU (Swish is a reasonable alternative).
Is it an RNN gate? → Use sigmoid and tanh as the gate equations require.
Are you using batch normalization? → Plain ReLU is usually sufficient; normalization compensates for its drawbacks.
Is the network very deep (>20 layers) without skip connections? → Consider ELU or SELU (with LeCun initialization) to keep activation scales stable.
Default: ReLU is almost always a safe starting choice
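The same logic can be written down as a tiny helper. This is an illustrative sketch of the flowchart only; the function name and arguments are invented for this example, not a library API:

```python
def suggest_hidden_activation(is_transformer=False, is_rnn_gate=False,
                              uses_batchnorm=False, depth=8, has_skip_connections=True):
    """Codifies the decision flowchart above for hidden layers (illustrative only)."""
    if is_transformer:
        return "gelu"            # standard for transformers / modern NLP
    if is_rnn_gate:
        return "sigmoid/tanh"    # gates need bounded activations
    if depth > 20 and not has_skip_connections and not uses_batchnorm:
        return "selu"            # consider self-normalizing activations
    return "relu"                # safe default everywhere else

print(suggest_hidden_activation(is_transformer=True))                    # gelu
print(suggest_hidden_activation(depth=30, has_skip_connections=False))   # selu
```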
In practice, activation function choice rarely makes more than a few percentage points difference in final accuracy. Focus on: (1) Architecture design, (2) Data quality and quantity, (3) Optimization (learning rate, batch size), (4) Regularization. Then fine-tune activation functions. ReLU is the safe default; only switch if you have specific reasons.
Activation functions are the nonlinear ingredient that transforms stacked linear layers into universal function approximators. Understanding their properties guides both architecture design and debugging.
Module Complete:
You have now covered the core components of Multi-Layer Perceptrons, culminating in the activation functions examined on this page.
This foundation is essential for all advanced neural network topics. Next, we explore Universal Approximation—understanding what functions MLPs can represent and what that means for practical applications.
Congratulations! You've mastered Multi-Layer Perceptrons—the foundational neural network architecture. Every modern deep learning system builds on these principles: from CNNs (specialized connectivity) to Transformers (attention-weighted averaging) to ResNets (skip connections). With this foundation, you're ready to explore the rich landscape of neural network architectures and training techniques.