Activation functions are the mathematical soul of neural networks. Without them, a network of any depth would collapse to a single linear transformation—incapable of learning anything beyond what simple linear regression can represent.
The Linear Composition Collapse:
Consider a network with only linear activations ($\sigma(z) = z$):
$$\mathbf{a}^{(L)} = W^{(L)}(W^{(L-1)}(\cdots W^{(1)}\mathbf{x})) = (W^{(L)} W^{(L-1)} \cdots W^{(1)})\mathbf{x} = \tilde{W}\mathbf{x}$$
The composition of linear functions is linear. All those layers collapse to a single matrix $\tilde{W}$. Depth provides no benefit.
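A quick numerical check of this collapse; the layer sizes below are arbitrary, chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Three "layers" with purely linear activations (dimensions chosen arbitrarily)
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((4, 4))
W3 = rng.standard_normal((2, 4))
x = rng.standard_normal(3)

deep_output = W3 @ (W2 @ (W1 @ x))            # forward pass through three linear layers
W_tilde = W3 @ W2 @ W1                        # the single collapsed matrix
print(np.allclose(deep_output, W_tilde @ x))  # True: depth added nothing
```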
The Role of Nonlinearity:
Nonlinear activation functions break this collapse. Each layer then does something fundamentally different from matrix multiplication: it introduces curvature, thresholds, or saturation that linear algebra cannot express. This is what allows depth to add representational power.
This page examines activation functions in depth: their mathematical properties, gradient behavior, computational characteristics, and practical selection for different architectures.
By the end of this page, you will understand: (1) Why nonlinearity is mathematically essential; (2) Classical activations (sigmoid, tanh) and their limitations; (3) Modern activations (ReLU family, Swish, GELU) and their advantages; (4) Gradient flow properties and the vanishing gradient problem; (5) Selection guidelines for different network types.
Before examining specific activations, we establish the mathematical properties that determine their effectiveness.
Definition (Activation Function): An activation function $\sigma: \mathbb{R} \to \mathbb{R}$ is applied element-wise to pre-activation vectors: $$\mathbf{a} = \sigma(\mathbf{z}) = (\sigma(z_1), \sigma(z_2), \ldots, \sigma(z_n))^\top$$
Key Properties to Analyze:
Range: The possible output values. Bounded (sigmoid: [0,1]) vs unbounded (ReLU: [0,∞)).
Monotonicity: Whether $x < y \Rightarrow \sigma(x) < \sigma(y)$. Most activations are monotonic; some (GELU) have slight non-monotonicities.
Differentiability: Required for gradient-based training. Most are differentiable everywhere or almost everywhere (ReLU is non-differentiable at 0).
Gradient Bounds: $|\sigma'(z)|$ determines gradient flow. If always < 1, gradients shrink (vanishing); if > 1, they grow (exploding).
Zero-Centeredness: Whether $\mathbb{E}[\sigma(z)] \approx 0$ for typical inputs. Non-zero mean can cause zig-zagging in gradient descent.
Saturation: Whether outputs approach constant values for extreme inputs, causing near-zero gradients.
Computational Cost: Some activations are cheap (ReLU: one comparison) while others require expensive operations (tanh: exp, division).
| Activation | Range | Derivative Range | Zero-Centered | Saturates | Computation |
|---|---|---|---|---|---|
| Sigmoid | (0, 1) | (0, 0.25] | No | Yes (both ends) | exp, division |
| Tanh | (-1, 1) | (0, 1] | Yes | Yes (both ends) | exp, division |
| ReLU | [0, ∞) | {0, 1} | No | Left only | max(0, z) |
| Leaky ReLU | (-∞, ∞) | {α, 1} | Nearly | No | max(αz, z) |
| ELU | (-α, ∞) | (0, 1] | Nearly | Left only | exp for z<0 |
| SELU | (-λα, ∞) | varies | Self-normalizing | Left only | exp for z<0 |
| Swish/SiLU | ≈(-0.28, ∞) | varies | Nearly | Smooth left | sigmoid × z |
| GELU | (-0.17, ∞) | varies | Nearly | Smooth left | erf or approx |
The most critical property for deep networks is gradient behavior. During backpropagation, gradients are multiplied by σ'(z) at each layer. If |σ'(z)| < 1 consistently (sigmoid), gradients shrink exponentially with depth. If |σ'(z)| = 1 when active (ReLU), gradients can flow unchanged. This is why ReLU revolutionized deep learning.
Sigmoid and tanh dominated neural networks from the 1980s through the early 2010s. Understanding their properties—and limitations—explains the motivation for modern alternatives.
Sigmoid (Logistic) Function:
$$\sigma(z) = \frac{1}{1 + e^{-z}} = \frac{e^z}{1 + e^z}$$
Derivative: $$\sigma'(z) = \sigma(z)(1 - \sigma(z))$$
Properties:
- Range (0, 1): outputs can be read as probabilities.
- Derivative $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ peaks at 0.25 (at $z = 0$), so every layer shrinks the gradient.
- Not zero-centered: outputs are always positive, which biases weight updates and causes zig-zagging.
- Saturates at both ends: for $|z|$ beyond roughly 4, the gradient is nearly zero.
- Requires exp and a division, noticeably more expensive than ReLU.
Tanh (Hyperbolic Tangent):
$$\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}} = 2\sigma(2z) - 1$$
Derivative: $$\tanh'(z) = 1 - \tanh^2(z)$$
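A short numpy check of the identity above and of the two derivative maxima (the `sigmoid` helper is defined inline for the example):

```python
import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

z = np.linspace(-5, 5, 101)
# tanh is a rescaled, recentered sigmoid: tanh(z) = 2*sigmoid(2z) - 1
print(np.allclose(np.tanh(z), 2 * sigmoid(2 * z) - 1))        # True
# tanh's derivative peaks at 1 (at z = 0), four times sigmoid's peak of 0.25
print((1 - np.tanh(z)**2).max(), (sigmoid(z) * (1 - sigmoid(z))).max())
```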
Properties:
- Range (-1, 1) and zero-centered, which usually makes optimization better behaved than with sigmoid.
- Derivative $\tanh'(z) = 1 - \tanh^2(z)$ peaks at 1 (at $z = 0$), but still shrinks gradients once inputs move away from zero.
- Saturates at both ends for $|z|$ beyond roughly 3.
- Requires exp and a division.
```python
import numpy as np
import matplotlib.pyplot as plt


def sigmoid(z):
    """Logistic sigmoid activation."""
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))


def sigmoid_derivative(z):
    """Derivative of sigmoid."""
    s = sigmoid(z)
    return s * (1 - s)


def tanh(z):
    """Hyperbolic tangent activation."""
    return np.tanh(z)


def tanh_derivative(z):
    """Derivative of tanh."""
    return 1 - np.tanh(z)**2


def analyze_classical_activations():
    """
    Comprehensive analysis of sigmoid and tanh.
    """
    z = np.linspace(-6, 6, 1000)

    fig, axes = plt.subplots(1, 3, figsize=(14, 4))

    # Function values
    ax1 = axes[0]
    ax1.plot(z, sigmoid(z), 'b-', linewidth=2, label='Sigmoid')
    ax1.plot(z, tanh(z), 'r-', linewidth=2, label='Tanh')
    ax1.axhline(y=0, color='k', linestyle='--', alpha=0.3)
    ax1.axvline(x=0, color='k', linestyle='--', alpha=0.3)
    ax1.set_title('Activation Values', fontsize=12)
    ax1.set_xlabel('z (pre-activation)')
    ax1.set_ylabel('σ(z)')
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    ax1.set_ylim(-1.2, 1.2)

    # Derivatives
    ax2 = axes[1]
    ax2.plot(z, sigmoid_derivative(z), 'b-', linewidth=2, label="Sigmoid'")
    ax2.plot(z, tanh_derivative(z), 'r-', linewidth=2, label="Tanh'")
    ax2.axhline(y=0.25, color='b', linestyle=':', alpha=0.5, label='Sigmoid max (0.25)')
    ax2.axhline(y=1.0, color='r', linestyle=':', alpha=0.5, label='Tanh max (1.0)')
    ax2.set_title('Derivatives (Gradient Flow)', fontsize=12)
    ax2.set_xlabel('z (pre-activation)')
    ax2.set_ylabel("σ'(z)")
    ax2.legend()
    ax2.grid(True, alpha=0.3)
    ax2.set_ylim(-0.1, 1.2)

    # Gradient decay through layers
    ax3 = axes[2]
    depths = np.arange(1, 21)
    # Assume pre-activations near 0 (best case)
    sigmoid_decay = 0.25 ** depths      # Gradient shrinks by 0.25 each layer
    tanh_decay = 1.0 ** depths          # Best case: no decay
    # More realistic: mixed pre-activations
    sigmoid_realistic = 0.2 ** depths   # Average derivative < max
    tanh_realistic = 0.7 ** depths      # Average derivative < 1

    ax3.semilogy(depths, sigmoid_decay, 'b--', label='Sigmoid (best case)', alpha=0.5)
    ax3.semilogy(depths, sigmoid_realistic, 'b-', linewidth=2, label='Sigmoid (typical)')
    ax3.semilogy(depths, tanh_realistic, 'r-', linewidth=2, label='Tanh (typical)')
    ax3.set_title('Gradient Magnitude vs Depth', fontsize=12)
    ax3.set_xlabel('Network Depth (layers)')
    ax3.set_ylabel('Gradient Scale (log)')
    ax3.legend()
    ax3.grid(True, alpha=0.3)
    ax3.axhline(y=1e-10, color='gray', linestyle=':', label='Numerical underflow')

    plt.tight_layout()
    plt.savefig('classical_activations.png', dpi=150)
    plt.show()


def vanishing_gradient_demo():
    """
    Demonstrate vanishing gradients with sigmoid in a deep network.
    """
    print("Vanishing Gradient Demonstration")
    print("=" * 50)

    # Simulate gradient flow through layers
    depth = 20
    n_units = 100

    # Initialize pre-activations from standard normal
    np.random.seed(42)

    print("\nSimulating gradient backpropagation through sigmoid network:")
    print("-" * 50)

    gradient_norms = []
    grad = np.ones(n_units)  # Start with unit gradient from loss

    for layer in range(depth, 0, -1):
        # Pre-activation at this layer (standard normal)
        z = np.random.randn(n_units)

        # Gradient through sigmoid
        local_grad = sigmoid_derivative(z)
        grad = grad * local_grad  # Element-wise for simplicity

        # Weight matrix contribution would multiply here too
        # (simulating with random orthogonal for now)
        W = np.random.randn(n_units, n_units) * 0.1
        grad = W.T @ grad

        norm = np.linalg.norm(grad)
        gradient_norms.append(norm)

        if layer % 5 == 0 or norm < 1e-10:
            print(f"  Layer {layer:2d}: gradient norm = {norm:.2e}")

    print(f"\nGradient shrunk by factor: {gradient_norms[0] / gradient_norms[-1]:.2e}")
    print("This is why deep sigmoid networks are nearly impossible to train!")


if __name__ == "__main__":
    analyze_classical_activations()
    vanishing_gradient_demo()
```

With sigmoid, gradients multiply by at most 0.25 at each layer. After 10 layers, gradients are at most 0.25¹⁰ ≈ 10⁻⁶ of their original magnitude. After 20 layers, they're essentially zero (≈10⁻¹²). This is why deep networks were considered impractical until around 2010—the gradients simply vanished before reaching early layers.
The Rectified Linear Unit (ReLU) transformed deep learning. Introduced for deep networks by Nair and Hinton (2010) and popularized by Krizhevsky et al. in AlexNet (2012), ReLU addressed the vanishing gradient problem with elegant simplicity.
ReLU Definition:
$$\text{ReLU}(z) = \max(0, z) = \begin{cases} z & \text{if } z > 0 \\ 0 & \text{if } z \leq 0 \end{cases}$$
Derivative:
$$\text{ReLU}'(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z < 0 \\ \text{undefined} & \text{if } z = 0 \end{cases}$$
In practice, we use a subgradient at $z=0$ (typically 0 or 1).
Why ReLU Solved Vanishing Gradients:
- For $z > 0$ the derivative is exactly 1, so gradients pass through active units unchanged instead of shrinking at every layer.
- There is no saturation on the positive side: arbitrarily large pre-activations still receive full gradient.
- Computation is a single comparison, which made much larger and deeper networks practical.
- Roughly half the units output exactly zero, producing sparse activations.
The Dead Neuron Problem:
A neuron "dies" when its pre-activation is negative for all training inputs. Once dead:
Causes:
- A learning rate that is too large, letting one big update push the weights or bias far into the negative region.
- Poorly scaled initialization or strongly negative initial biases.
Solutions:
- Lower the learning rate or use an adaptive optimizer.
- Use sensible initialization (e.g., He initialization) and small or zero initial biases.
- Switch to a variant with a nonzero negative slope: Leaky ReLU, PReLU, or ELU.
How Many Dead Neurons?
In a well-trained ReLU network, some fraction of units typically ends up inactive, and moderate levels are usually tolerated without a measurable hit to accuracy. A large fraction of dead units, however, is a warning sign of an overly aggressive learning rate or poor initialization; the dead-neuron simulation in the code below shows how much initialization alone changes the rate.
```python
import numpy as np
import matplotlib.pyplot as plt


def relu(z):
    return np.maximum(0, z)


def relu_derivative(z):
    return (z > 0).astype(float)


def gradient_flow_comparison():
    """
    Compare gradient flow through ReLU vs Sigmoid networks.
    """
    np.random.seed(42)

    depths = [5, 10, 20, 50]
    n_units = 256
    n_simulations = 100

    results = {'relu': {}, 'sigmoid': {}}

    for depth in depths:
        relu_grads = []
        sigmoid_grads = []

        for _ in range(n_simulations):
            # Initialize weights properly
            relu_grad = np.ones(n_units)
            sigmoid_grad = np.ones(n_units)

            for layer in range(depth):
                # Random pre-activations
                z = np.random.randn(n_units)

                # ReLU gradient
                relu_local = relu_derivative(z)
                relu_grad = relu_grad * relu_local
                # Weight matrix (He initialization scale)
                W = np.random.randn(n_units, n_units) * np.sqrt(2/n_units)
                relu_grad = W.T @ relu_grad

                # Sigmoid gradient
                s = 1 / (1 + np.exp(-z))
                sigmoid_local = s * (1 - s)
                sigmoid_grad = sigmoid_grad * sigmoid_local
                W = np.random.randn(n_units, n_units) * np.sqrt(1/n_units)
                sigmoid_grad = W.T @ sigmoid_grad

            relu_grads.append(np.linalg.norm(relu_grad))
            sigmoid_grads.append(np.linalg.norm(sigmoid_grad))

        results['relu'][depth] = np.mean(relu_grads)
        results['sigmoid'][depth] = np.mean(sigmoid_grads)

    print("Gradient Flow Comparison")
    print("=" * 50)
    print(f"{'Depth':>6} | {'ReLU':>12} | {'Sigmoid':>12} | {'Ratio':>10}")
    print("-" * 50)
    for depth in depths:
        r = results['relu'][depth]
        s = results['sigmoid'][depth]
        ratio = r / s if s > 0 else float('inf')
        print(f"{depth:>6} | {r:>12.2e} | {s:>12.2e} | {ratio:>10.0f}x")

    return results


def dead_neuron_simulation():
    """
    Simulate dead neuron occurrence during training.
    """
    print("\nDead Neuron Simulation")
    print("=" * 50)

    n_units = 1000

    # Different initialization strategies
    strategies = {
        'zero_bias': (np.random.randn(n_units) * 0.1, np.zeros(n_units)),
        'negative_bias': (np.random.randn(n_units) * 0.1, -np.ones(n_units)),
        'positive_bias': (np.random.randn(n_units) * 0.1, np.ones(n_units)),
        'poor_init': (np.random.randn(n_units) * 1.0, np.zeros(n_units)),
    }

    # Simulate inputs (batch of 1000 samples)
    inputs = np.random.randn(1000, 10)
    input_weights = np.random.randn(n_units, 10) * np.sqrt(2/10)

    for name, (hidden_weights, biases) in strategies.items():
        # Compute pre-activations for all inputs
        pre_acts = inputs @ input_weights.T  # (1000, n_units)
        pre_acts += biases                   # Add bias

        # Count dead neurons (never positive across all inputs)
        activations = relu(pre_acts)
        dead_mask = np.all(activations == 0, axis=0)
        dead_count = np.sum(dead_mask)
        dead_pct = 100 * dead_count / n_units

        print(f"{name:>15}: {dead_count:4d} dead neurons ({dead_pct:.1f}%)")


def relu_vs_sigmoid_training():
    """
    Demonstrate training speed difference on a simple task.
    """
    print("\nTraining Speed Comparison (XOR task)")
    print("=" * 50)

    # XOR dataset
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    y = np.array([[0], [1], [1], [0]])

    def mse_loss(y_pred, y_true):
        return 0.5 * np.mean((y_pred - y_true)**2)

    def train_simple_network(activation, activation_deriv, epochs=5000, lr=0.5):
        np.random.seed(42)
        # Simple 2-4-1 network
        W1 = np.random.randn(4, 2) * 0.5
        b1 = np.zeros((1, 4))
        W2 = np.random.randn(1, 4) * 0.5
        b2 = np.zeros((1, 1))

        losses = []
        for epoch in range(epochs):
            # Forward
            z1 = X @ W1.T + b1
            a1 = activation(z1)
            z2 = a1 @ W2.T + b2
            a2 = 1 / (1 + np.exp(-z2))  # Sigmoid output for probability

            loss = mse_loss(a2, y)
            losses.append(loss)

            # Backward
            dz2 = (a2 - y) * a2 * (1 - a2)
            dW2 = dz2.T @ a1
            db2 = np.sum(dz2, axis=0, keepdims=True)

            da1 = dz2 @ W2
            dz1 = da1 * activation_deriv(z1)
            dW1 = dz1.T @ X
            db1 = np.sum(dz1, axis=0, keepdims=True)

            # Update
            W2 -= lr * dW2
            b2 -= lr * db2
            W1 -= lr * dW1
            b1 -= lr * db1

        return losses

    sigmoid_losses = train_simple_network(
        lambda z: 1 / (1 + np.exp(-z)),
        lambda z: (1 / (1 + np.exp(-z))) * (1 - 1 / (1 + np.exp(-z)))
    )
    relu_losses = train_simple_network(relu, relu_derivative)

    # Find convergence point
    threshold = 0.01
    sigmoid_converged = next((i for i, l in enumerate(sigmoid_losses) if l < threshold),
                             len(sigmoid_losses))
    relu_converged = next((i for i, l in enumerate(relu_losses) if l < threshold),
                          len(relu_losses))

    print(f"Sigmoid converged at epoch: {sigmoid_converged}")
    print(f"ReLU converged at epoch: {relu_converged}")
    print(f"ReLU was {sigmoid_converged/relu_converged:.1f}x faster")


if __name__ == "__main__":
    gradient_flow_comparison()
    dead_neuron_simulation()
    relu_vs_sigmoid_training()
```

Several ReLU variants address its limitations while preserving its gradient flow advantages.
Leaky ReLU (Maas et al., 2013):
$$\text{LeakyReLU}(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha z & \text{if } z \leq 0 \end{cases}$$
where $\alpha \in (0, 1)$ is typically $0.01$ or $0.1$.
Advantage: No dead neurons—negative inputs still produce nonzero gradients.
Parametric ReLU (PReLU, He et al., 2015):
$$\text{PReLU}(z) = \begin{cases} z & \text{if } z > 0 \\ a z & \text{if } z \leq 0 \end{cases}$$
where $a$ is a learned parameter (per-channel or shared).
Advantage: Optimal slope is learned from data.
Exponential Linear Unit (ELU, Clevert et al., 2016):
$$\text{ELU}(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha(e^z - 1) & \text{if } z \leq 0 \end{cases}$$
Advantages:
- Smooth transition through the negative region, unlike ReLU's hard corner.
- Negative outputs push mean activations toward zero, reducing the bias shift plain ReLU introduces.
- Nonzero gradient for negative inputs, so units do not die.
Disadvantage: Requires exp computation for negative values.
Scaled ELU (SELU, Klambauer et al., 2017):
$$\text{SELU}(z) = \lambda \begin{cases} z & \text{if } z > 0 \\ \alpha(e^z - 1) & \text{if } z \leq 0 \end{cases}$$
where $\alpha \approx 1.6733$ and $\lambda \approx 1.0507$ are specific constants.
Advantage: Self-normalizing—activations converge to zero mean, unit variance without batch normalization.
Requirement: Proper initialization (LeCun normal) and fully-connected architecture.
```python
import numpy as np
import matplotlib.pyplot as plt


# Activation functions
def relu(z):
    return np.maximum(0, z)


def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)


def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * (np.exp(z) - 1))


def selu(z, alpha=1.6732632423543772, lam=1.0507009873554805):
    return lam * np.where(z > 0, z, alpha * (np.exp(np.clip(z, -10, 10)) - 1))


def prelu(z, a):
    """PReLU with learned parameter a."""
    return np.where(z > 0, z, a * z)


# Derivatives
def relu_deriv(z):
    return (z > 0).astype(float)


def leaky_relu_deriv(z, alpha=0.01):
    return np.where(z > 0, 1, alpha)


def elu_deriv(z, alpha=1.0):
    return np.where(z > 0, 1, alpha * np.exp(z))


def selu_deriv(z, alpha=1.6732632423543772, lam=1.0507009873554805):
    return lam * np.where(z > 0, 1, alpha * np.exp(np.clip(z, -10, 10)))


def visualize_relu_variants():
    """
    Compare ReLU variants visually.
    """
    z = np.linspace(-4, 4, 1000)

    fig, axes = plt.subplots(2, 2, figsize=(12, 10))

    # Function values
    ax1 = axes[0, 0]
    ax1.plot(z, relu(z), 'b-', linewidth=2, label='ReLU')
    ax1.plot(z, leaky_relu(z, 0.1), 'r-', linewidth=2, label='Leaky ReLU (α=0.1)')
    ax1.plot(z, elu(z, 1.0), 'g-', linewidth=2, label='ELU (α=1.0)')
    ax1.axhline(y=0, color='k', linestyle='--', alpha=0.3)
    ax1.axvline(x=0, color='k', linestyle='--', alpha=0.3)
    ax1.set_title('Activation Functions', fontsize=12)
    ax1.set_xlabel('z')
    ax1.set_ylabel('σ(z)')
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    ax1.set_ylim(-2, 4)

    # Derivatives
    ax2 = axes[0, 1]
    ax2.plot(z, relu_deriv(z), 'b-', linewidth=2, label='ReLU')
    ax2.plot(z, leaky_relu_deriv(z, 0.1), 'r-', linewidth=2, label='Leaky ReLU')
    ax2.plot(z, elu_deriv(z, 1.0), 'g-', linewidth=2, label='ELU')
    ax2.axhline(y=0, color='k', linestyle='--', alpha=0.3)
    ax2.axhline(y=1, color='gray', linestyle=':', alpha=0.5)
    ax2.set_title('Derivatives', fontsize=12)
    ax2.set_xlabel('z')
    ax2.set_ylabel("σ'(z)")
    ax2.legend()
    ax2.grid(True, alpha=0.3)
    ax2.set_ylim(-0.2, 1.5)

    # SELU self-normalizing property
    ax3 = axes[1, 0]
    np.random.seed(42)

    # Simulate forward pass through many layers
    n_layers = 50
    n_units = 1000

    means_relu = []
    vars_relu = []
    means_selu = []
    vars_selu = []

    # Start with unit Gaussian
    x_relu = np.random.randn(n_units)
    x_selu = np.random.randn(n_units)

    for _ in range(n_layers):
        # Random weights
        W_relu = np.random.randn(n_units, n_units) * np.sqrt(2/n_units)  # He init
        W_selu = np.random.randn(n_units, n_units) * np.sqrt(1/n_units)  # LeCun init

        x_relu = relu(W_relu @ x_relu)
        x_selu = selu(W_selu @ x_selu)

        means_relu.append(np.mean(x_relu))
        vars_relu.append(np.var(x_relu))
        means_selu.append(np.mean(x_selu))
        vars_selu.append(np.var(x_selu))

    layers = np.arange(1, n_layers + 1)
    ax3.plot(layers, vars_relu, 'b-', linewidth=2, label='ReLU Variance')
    ax3.plot(layers, vars_selu, 'purple', linewidth=2, label='SELU Variance')
    ax3.axhline(y=1, color='gray', linestyle='--', alpha=0.5, label='Target Var=1')
    ax3.set_title('Self-Normalizing Property', fontsize=12)
    ax3.set_xlabel('Layer Depth')
    ax3.set_ylabel('Activation Variance')
    ax3.legend()
    ax3.grid(True, alpha=0.3)
    ax3.set_yscale('log')

    # Dead neuron comparison
    ax4 = axes[1, 1]

    # Simulate dead neuron rate
    n_samples = 1000
    n_neurons = 500

    # Random pre-activations with bias toward negative
    z_samples = np.random.randn(n_samples, n_neurons) - 0.5

    activations = {
        'ReLU': relu(z_samples),
        'Leaky ReLU': leaky_relu(z_samples, 0.1),
        'ELU': elu(z_samples),
    }

    dead_rates = {}
    for name, acts in activations.items():
        # A neuron is "dead" if it's zero for all samples
        if name == 'ReLU':
            dead = np.all(acts == 0, axis=0)
        else:
            dead = np.all(acts <= 0, axis=0)
        dead_rates[name] = 100 * np.mean(dead)

    bars = ax4.bar(dead_rates.keys(), dead_rates.values(), color=['blue', 'red', 'green'])
    ax4.set_title('Dead/Inactive Neuron Rate', fontsize=12)
    ax4.set_ylabel('% Neurons Dead')
    ax4.grid(True, alpha=0.3, axis='y')
    for bar, rate in zip(bars, dead_rates.values()):
        ax4.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1,
                 f'{rate:.1f}%', ha='center', fontsize=10)

    plt.tight_layout()
    plt.savefig('relu_variants.png', dpi=150)
    plt.show()


if __name__ == "__main__":
    visualize_relu_variants()
```

ReLU: Default choice for most architectures, especially CNNs.
Leaky ReLU: When dead neurons are a concern; good general alternative.
ELU: When you want smoother gradients and can afford the computation.
SELU: Specific to fully-connected networks without batch normalization.
PReLU: When you have enough data to learn the slope parameter.
Recent research has produced smooth activations that match or exceed ReLU performance, particularly in deep networks and transformers.
Swish / SiLU (Ramachandran et al., 2017):
$$\text{Swish}(z) = z \cdot \sigma(z) = \frac{z}{1 + e^{-z}}$$
Properties:
- Smooth everywhere and non-monotonic: it dips slightly below zero (minimum ≈ -0.28 near z ≈ -1.28) before increasing.
- Unbounded above and bounded below; for large positive z it behaves like the identity, much like ReLU.
- Self-gated: the input is scaled by its own sigmoid, $\sigma(z)$.
Derivative: $$\text{Swish}'(z) = \sigma(z) + z \cdot \sigma(z) \cdot (1 - \sigma(z)) = \sigma(z)(1 + z(1 - \sigma(z)))$$
GELU (Gaussian Error Linear Unit, Hendrycks & Gimpel, 2016):
$$\text{GELU}(z) = z \cdot \Phi(z) = z \cdot \frac{1}{2}\left(1 + \text{erf}\left(\frac{z}{\sqrt{2}}\right)\right)$$
where $\Phi(z)$ is the CDF of the standard normal distribution.
Approximation (faster): $$\text{GELU}(z) \approx 0.5z\left(1 + \tanh\left(\sqrt{\frac{2}{\pi}}(z + 0.044715z^3)\right)\right)$$
Properties:
- Smooth and slightly non-monotonic, with a minimum of roughly -0.17.
- Weights each input by $\Phi(z)$, the probability that a standard normal variable falls below z, so small negative inputs are attenuated rather than hard-clipped.
- Approaches ReLU for large $|z|$ and the identity for large positive z.
Why Smooth Matters:
Smooth activations have continuous derivatives, so the gradient changes gradually instead of jumping from 0 to 1 at z = 0 as ReLU's does. This tends to give a better-behaved loss surface and steadier updates, which empirically matters most in very deep models such as transformers.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import erf


def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))


def swish(z):
    """Swish/SiLU activation: z * sigmoid(z)"""
    return z * sigmoid(z)


def gelu_exact(z):
    """GELU activation using exact formula with erf."""
    return z * 0.5 * (1 + erf(z / np.sqrt(2)))


def gelu_approx(z):
    """GELU approximation using tanh (faster)."""
    return 0.5 * z * (1 + np.tanh(np.sqrt(2/np.pi) * (z + 0.044715 * z**3)))


def swish_derivative(z):
    """Derivative of Swish."""
    s = sigmoid(z)
    return s + z * s * (1 - s)


def gelu_derivative(z):
    """Derivative of GELU (approximate)."""
    # Numerical derivative for simplicity
    eps = 1e-7
    return (gelu_exact(z + eps) - gelu_exact(z - eps)) / (2 * eps)


def compare_modern_activations():
    """
    Comprehensive comparison of modern smooth activations.
    """
    z = np.linspace(-4, 4, 1000)

    fig, axes = plt.subplots(2, 2, figsize=(12, 10))

    # Function comparison
    ax1 = axes[0, 0]
    ax1.plot(z, np.maximum(0, z), 'b--', linewidth=1.5, alpha=0.7, label='ReLU')
    ax1.plot(z, swish(z), 'r-', linewidth=2, label='Swish')
    ax1.plot(z, gelu_exact(z), 'g-', linewidth=2, label='GELU')
    ax1.axhline(y=0, color='k', linestyle=':', alpha=0.3)
    ax1.axvline(x=0, color='k', linestyle=':', alpha=0.3)
    ax1.set_title('Activation Functions', fontsize=12)
    ax1.set_xlabel('z')
    ax1.set_ylabel('σ(z)')
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    ax1.set_ylim(-0.5, 4)

    # Derivative comparison
    ax2 = axes[0, 1]
    ax2.plot(z, (z > 0).astype(float), 'b--', linewidth=1.5, alpha=0.7, label='ReLU')
    ax2.plot(z, swish_derivative(z), 'r-', linewidth=2, label='Swish')
    ax2.plot(z, gelu_derivative(z), 'g-', linewidth=2, label='GELU')
    ax2.axhline(y=1, color='gray', linestyle=':', alpha=0.5)
    ax2.set_title('Derivatives', fontsize=12)
    ax2.set_xlabel('z')
    ax2.set_ylabel("σ'(z)")
    ax2.legend()
    ax2.grid(True, alpha=0.3)
    ax2.set_ylim(-0.2, 1.5)

    # GELU approximation accuracy
    ax3 = axes[1, 0]
    exact = gelu_exact(z)
    approx = gelu_approx(z)
    error = np.abs(exact - approx)
    ax3.semilogy(z, error + 1e-10, 'purple', linewidth=2)
    ax3.set_title('GELU Approximation Error', fontsize=12)
    ax3.set_xlabel('z')
    ax3.set_ylabel('|exact - approx|')
    ax3.grid(True, alpha=0.3)
    ax3.axhline(y=1e-3, color='r', linestyle='--', alpha=0.5, label='0.001 threshold')
    ax3.legend()

    # Non-monotonicity detail
    ax4 = axes[1, 1]
    z_detail = np.linspace(-3, 1, 500)
    ax4.plot(z_detail, swish(z_detail), 'r-', linewidth=2, label='Swish')
    ax4.plot(z_detail, gelu_exact(z_detail), 'g-', linewidth=2, label='GELU')
    ax4.axhline(y=0, color='k', linestyle=':', alpha=0.3)
    ax4.axvline(x=0, color='k', linestyle=':', alpha=0.3)
    # Mark minimum points
    swish_min_z = -1.278  # Approximate minimum
    gelu_min_z = -0.77    # Approximate minimum
    ax4.scatter([swish_min_z], [swish(swish_min_z)], color='r', s=100, zorder=5, marker='v')
    ax4.scatter([gelu_min_z], [gelu_exact(gelu_min_z)], color='g', s=100, zorder=5, marker='v')
    ax4.set_title('Non-Monotonicity (Zoomed)', fontsize=12)
    ax4.set_xlabel('z')
    ax4.set_ylabel('σ(z)')
    ax4.legend()
    ax4.grid(True, alpha=0.3)
    ax4.set_ylim(-0.4, 0.6)

    plt.tight_layout()
    plt.savefig('modern_activations.png', dpi=150)
    plt.show()


def transformer_activation_comparison():
    """
    Compare activations in transformer-like setting.
    """
    print("Activation Comparison for Transformer FFN")
    print("=" * 50)

    np.random.seed(42)

    # Simulate transformer FFN block
    # d_model = 768, d_ff = 3072 (BERT-base dimensions)
    d_model = 768
    d_ff = 3072
    batch_size = 32
    seq_len = 128

    # Random input (simulating hidden states)
    x = np.random.randn(batch_size * seq_len, d_model).astype(np.float32)

    # FFN weights
    W1 = np.random.randn(d_ff, d_model).astype(np.float32) * np.sqrt(2/d_model)
    W2 = np.random.randn(d_model, d_ff).astype(np.float32) * np.sqrt(2/d_ff)

    # Forward with different activations
    def ffn_forward(x, activation_fn):
        h = x @ W1.T
        h = activation_fn(h)
        out = h @ W2.T
        return out

    activations = {
        'ReLU': lambda z: np.maximum(0, z),
        'GELU': gelu_approx,
        'Swish': swish,
    }

    for name, fn in activations.items():
        out = ffn_forward(x, fn)
        print(f"\n{name}:")
        print(f"  Output mean: {out.mean():.4f}")
        print(f"  Output std: {out.std():.4f}")
        print(f"  Output range: [{out.min():.2f}, {out.max():.2f}]")


if __name__ == "__main__":
    compare_modern_activations()
    transformer_activation_comparison()
```

GELU emerged from a probabilistic interpretation: multiply the input by a Bernoulli random variable with probability Φ(z). This stochastic view connects to dropout-like regularization. Empirically, GELU consistently outperforms ReLU in transformer architectures, likely due to smoother gradients and gentler saturation. The original BERT, GPT-2, and most modern language models use GELU.
Choosing the right activation function is part science, part engineering judgment. Here are practical guidelines based on architecture type and problem characteristics.
| Architecture | Hidden Layers | Output Layer | Notes |
|---|---|---|---|
| MLP (general) | ReLU or Leaky ReLU | Task-dependent | Start with ReLU; try Leaky if dead neurons |
| Deep MLP (>10 layers) | SELU or ELU | Task-dependent | SELU with LeCun init; no BatchNorm needed |
| CNN | ReLU | Softmax (classification) | ReLU is standard; BatchNorm handles activation drift |
| ResNet | ReLU | Softmax | Skip connections allow ReLU to work at any depth |
| Transformer | GELU (preferred) or Swish | Task-dependent | GELU is standard for NLP; Swish for vision |
| RNN/LSTM | Tanh (gates), Sigmoid (gates) | Task-dependent | LSTM gates require bounded activations |
| GAN Generator | ReLU or Leaky ReLU | Tanh | Tanh output for [-1, 1] image range |
| GAN Discriminator | Leaky ReLU | Sigmoid or none | Leaky ReLU keeps gradients flowing to the generator |
| VAE Encoder | ReLU or ELU | Linear (mean/logvar) | Smooth activations help gradient flow |
Output Layer Activations by Task:
| Task | Activation | Loss Function | Output Range |
|---|---|---|---|
| Binary Classification | Sigmoid | Binary Cross-Entropy | (0, 1) |
| Multi-class Classification | Softmax | Cross-Entropy | Probability simplex |
| Regression | Linear (none) | MSE or MAE | (-∞, +∞) |
| Bounded Regression | Sigmoid * scale | MSE | (0, scale) |
| Multi-label Classification | Sigmoid (each) | Binary CE per label | (0, 1) per label |
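To make the output-layer column concrete, here is a minimal numpy sketch of the most common choices; the `sigmoid` and `softmax` helpers are defined inline for the example, not taken from any library:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def softmax(z):
    # Numerically stable softmax over the last axis
    shifted = z - z.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([2.0, -1.0, 0.5])

binary_prob = sigmoid(logits[0])       # binary classification: one probability in (0, 1)
class_probs = softmax(logits)          # multi-class: probabilities summing to 1
regression = logits[0]                 # regression: linear output, no activation
bounded = 10.0 * sigmoid(logits[0])    # bounded regression onto (0, 10)

print(binary_prob, class_probs.sum(), regression, bounded)
```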
Decision Flowchart:
Is it an output layer? → Choose by task: sigmoid, softmax, or linear, per the output-layer table above.
Is it a transformer or modern NLP model? → Use GELU (Swish is a reasonable alternative).
Is it an RNN gate? → Use sigmoid and tanh as the gate equations require.
Are you using batch normalization? → Plain ReLU is usually sufficient; normalization compensates for its drawbacks.
Is the network very deep (>20 layers) without skip connections? → Consider ELU or SELU (with LeCun initialization) to keep activation scales stable.
Default: ReLU is almost always a safe starting choice
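The same logic can be written down as a tiny helper. This is an illustrative sketch of the flowchart only; the function name and arguments are invented for this example, not a library API:

```python
def suggest_hidden_activation(is_transformer=False, is_rnn_gate=False,
                              uses_batchnorm=False, depth=8, has_skip_connections=True):
    """Codifies the decision flowchart above for hidden layers (illustrative only)."""
    if is_transformer:
        return "gelu"            # standard for transformers / modern NLP
    if is_rnn_gate:
        return "sigmoid/tanh"    # gates need bounded activations
    if depth > 20 and not has_skip_connections and not uses_batchnorm:
        return "selu"            # consider self-normalizing activations
    return "relu"                # safe default everywhere else

print(suggest_hidden_activation(is_transformer=True))                    # gelu
print(suggest_hidden_activation(depth=30, has_skip_connections=False))   # selu
```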
In practice, activation function choice rarely makes more than a few percentage points difference in final accuracy. Focus on: (1) Architecture design, (2) Data quality and quantity, (3) Optimization (learning rate, batch size), (4) Regularization. Then fine-tune activation functions. ReLU is the safe default; only switch if you have specific reasons.
Activation functions are the nonlinear ingredient that transforms stacked linear layers into universal function approximators. Understanding their properties guides both architecture design and debugging.
Module Complete:
You have now covered the core components of Multi-Layer Perceptrons, culminating in the activation functions examined on this page.
This foundation is essential for all advanced neural network topics. Next, we explore Universal Approximation—understanding what functions MLPs can represent and what that means for practical applications.
Congratulations! You've mastered Multi-Layer Perceptrons—the foundational neural network architecture. Every modern deep learning system builds on these principles: from CNNs (specialized connectivity) to Transformers (attention-weighted averaging) to ResNets (skip connections). With this foundation, you're ready to explore the rich landscape of neural network architectures and training techniques.