The Rectified Linear Unit (ReLU) is arguably the most important activation function in the history of deep learning. Its widespread adoption, catalyzed by Krizhevsky, Sutskever, and Hinton's AlexNet in 2012, was a key enabler of the deep learning revolution.
Before ReLU, training networks deeper than a few layers was notoriously difficult due to the vanishing gradient problem we discussed with sigmoid and tanh. ReLU's simple, brilliant solution—a piecewise linear function—changed everything.
The impact was immediate and profound: deep networks that had been effectively untrainable suddenly converged, and the activation itself costs only a single comparison to compute.
This page provides a complete analysis of ReLU and its variants, preparing you to make informed activation function choices in any deep learning architecture.
By completing this page, you will deeply understand:

- ReLU's mathematical properties and why it enables deep learning
- The dead neuron problem and its mitigation strategies
- Leaky ReLU, Parametric ReLU, ELU, SELU, and their trade-offs
- How to diagnose and address activation-related training failures
The ReLU function is elegantly simple:
$$\text{ReLU}(x) = \max(0, x) = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{if } x \leq 0 \end{cases}$$
This can also be written as:
$$\text{ReLU}(x) = x \cdot \mathbf{1}_{x > 0}$$
where $\mathbf{1}_{x > 0}$ is the indicator function, equal to 1 when $x > 0$ and 0 otherwise.
Domain and Range: ReLU accepts any real input and produces outputs in $[0, \infty)$; it is unbounded above and clipped to exactly zero below.
Sparsity: For inputs centered around zero, roughly half of all neurons output exactly 0, producing sparse activation patterns.
Non-Saturation (for positive inputs): The positive branch is the identity, so outputs grow without bound and the gradient there is always exactly 1.
The derivative of ReLU is the step function:
$$\text{ReLU}'(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x < 0 \\ \text{undefined} & \text{if } x = 0 \end{cases}$$
In practice, frameworks define ReLU'(0) = 0 (or sometimes 0.5 or 1). This technical detail rarely matters because the probability of any input being exactly 0 is essentially zero for continuous inputs.
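As a quick illustration (a minimal NumPy sketch; the two helper names exist only for this example), the choice of convention at x = 0 comes down to whether the comparison is strict:

```python
import numpy as np

def relu_grad_zero_is_zero(x):
    # Convention ReLU'(0) = 0: strict comparison
    return (x > 0).astype(float)

def relu_grad_zero_is_one(x):
    # Convention ReLU'(0) = 1: non-strict comparison
    return (x >= 0).astype(float)

x = np.array([-1.0, 0.0, 1.0])
print(relu_grad_zero_is_zero(x))  # [0. 0. 1.]
print(relu_grad_zero_is_one(x))   # [0. 1. 1.]
```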
Why This Solves Vanishing Gradients:
Recall that for a deep network with L layers:
$$\frac{\partial \mathcal{L}}{\partial W^{(1)}} \propto \prod_{l=1}^{L} f'(z^{(l)})$$
For sigmoid: $\prod_{l=1}^{L} \sigma'(z^{(l)}) \leq 0.25^L \rightarrow 0$ exponentially fast.
For ReLU: $\prod_{l=1}^{L} \text{ReLU}'(z^{(l)}) = 1^k \cdot 0^{L-k} \in \{0, 1\}$
where k is the number of layers with positive pre-activations. The gradient either flows completely (value 1) or is blocked (value 0). There's no exponential decay—gradients don't vanish, they propagate or stop.
```python
import numpy as np

def relu(x):
    """
    Standard ReLU implementation. Extremely simple and fast.
    """
    return np.maximum(0, x)

def relu_derivative(x):
    """
    Derivative of ReLU: 1 if x > 0, else 0.
    Returns a binary mask that can be used to multiply gradients.
    """
    return (x > 0).astype(np.float64)

def relu_backward(grad_output, x):
    """
    Backward pass for ReLU.
    Gradient flows through only where input was positive.
    """
    return grad_output * relu_derivative(x)

# Demonstrate gradient flow comparison
def gradient_flow_analysis():
    """Compare gradient products through L layers."""
    np.random.seed(42)
    L = 10  # 10 layers

    # Simulate pre-activations (roughly centered at 0)
    pre_activations = [np.random.randn(100) for _ in range(L)]

    # Sigmoid gradient product
    sigmoid_grads = []
    for z in pre_activations:
        s = 1 / (1 + np.exp(-z))
        sigmoid_grads.append(s * (1 - s))  # σ'(z)
    sigmoid_product = np.prod(np.stack(sigmoid_grads), axis=0)

    # ReLU gradient product (binary: flows or not)
    relu_grads = [relu_derivative(z) for z in pre_activations]
    relu_product = np.prod(np.stack(relu_grads), axis=0)

    print(f"Through {L} layers:")
    print(f"  Sigmoid: mean gradient = {sigmoid_product.mean():.2e}")
    print(f"           max gradient  = {sigmoid_product.max():.2e}")
    print(f"  ReLU:    mean gradient = {relu_product.mean():.4f}")
    print(f"           max gradient  = {relu_product.max():.4f}")
    print(f"           % paths open  = {100 * (relu_product > 0).mean():.1f}%")

gradient_flow_analysis()

# Speed comparison
import time

x = np.random.randn(10000, 1000)

# Sigmoid
start = time.perf_counter()
for _ in range(100):
    _ = 1 / (1 + np.exp(-x))
sigmoid_time = time.perf_counter() - start

# ReLU
start = time.perf_counter()
for _ in range(100):
    _ = np.maximum(0, x)
relu_time = time.perf_counter() - start

print(f"\nSpeed comparison (100 iterations on 10M elements):")
print(f"  Sigmoid: {sigmoid_time:.4f}s")
print(f"  ReLU:    {relu_time:.4f}s")
print(f"  Speedup: {sigmoid_time / relu_time:.1f}x")
```

| x | ReLU(x) | ReLU'(x) | Gradient Behavior |
|---|---|---|---|
| -∞ | 0 | 0 | Gradient blocked |
| -5 | 0 | 0 | Gradient blocked |
| -1 | 0 | 0 | Gradient blocked |
| 0 | 0 | 0 (by convention) | Transition point |
| 0.001 | 0.001 | 1 | Gradient flows |
| 1 | 1 | 1 | Gradient flows |
| 5 | 5 | 1 | Gradient flows |
| +∞ | +∞ | 1 | Gradient flows (no saturation) |
ReLU's simplicity comes with a significant drawback: neurons can die.
A neuron is "dead" when it outputs 0 for every input in the training set. Because ReLU's gradient is 0 for negative pre-activations, a dead neuron receives no gradient signal, its weights never update, and it stays dead for the rest of training. There are several common ways this happens.
Scenario 1: Large Negative Bias
If during training the bias term becomes sufficiently negative, the pre-activation might be negative for all inputs:
$$z = Wx + b < 0 \quad \forall x \in \text{training set}$$
Scenario 2: Large Learning Rate Catastrophe
A single large gradient update can push the weights and bias into a configuration where the neuron outputs zero for every input; the sketch after these scenarios makes this concrete.
Scenario 3: Adversarial Data Distribution Shift
If the input distribution shifts such that all inputs push the neuron to the negative region, the neuron dies.
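To make Scenario 2 concrete, here is a minimal NumPy sketch (the gradient magnitude and learning rate are hypothetical values, chosen only to trigger the failure): one oversized update drives the bias so far negative that the pre-activation is negative for every training input, after which the ReLU gradient, and hence every future update to this neuron, is zero.

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(1000, 10)    # training inputs, roughly centered at zero
w = np.random.randn(10) * 0.1    # healthy initial weights
b = 0.0

def firing_rate(w, b):
    """Fraction of training samples for which the neuron outputs a positive value."""
    return float(((X @ w + b) > 0).mean())

print(f"Before the update: firing rate = {firing_rate(w, b):.2f}")   # ~0.5

# One catastrophic step: a huge upstream gradient combined with a large
# learning rate (hypothetical numbers) pushes the bias deep into the negative region.
huge_bias_gradient = 50.0
learning_rate = 1.0
b -= learning_rate * huge_bias_gradient

print(f"After the update:  firing rate = {firing_rate(w, b):.2f}")   # 0.0
# z = Xw + b is now negative for every sample, so ReLU'(z) = 0 everywhere:
# all subsequent gradients for w and b are zero and the neuron stays dead.
```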
Studies have shown that 10-40% of neurons in ReLU networks can become dead during training, depending on learning rate and initialization. In extreme cases, entire layers can die, causing training to collapse. This is particularly problematic in the early layers, which many downstream neurons depend upon.
```python
import numpy as np
import torch

def detect_dead_neurons(model, data_loader, threshold=0.0):
    """
    Detect dead neurons in a ReLU network.
    A neuron is considered dead if it outputs 0 for all samples.

    Returns:
        Dictionary mapping layer names to dead neuron indices
    """
    activation_counts = {}

    # Hook to record activations
    def hook_fn(name):
        def hook(module, input, output):
            # Count how many times each neuron fires (output > 0)
            if name not in activation_counts:
                activation_counts[name] = np.zeros(output.shape[-1])
            activation_counts[name] += (output.detach().cpu().numpy() > 0).sum(axis=0)
        return hook

    # Register hooks on ReLU layers
    hooks = []
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.ReLU):
            hooks.append(module.register_forward_hook(hook_fn(name)))

    # Run forward passes
    total_samples = 0
    with torch.no_grad():
        for batch in data_loader:
            _ = model(batch)
            total_samples += batch.shape[0]

    # Remove hooks
    for hook in hooks:
        hook.remove()

    # Identify dead neurons (those that never fired)
    dead_neurons = {}
    for name, counts in activation_counts.items():
        firing_rate = counts / total_samples
        dead_mask = firing_rate <= threshold
        dead_neurons[name] = {
            'dead_indices': np.where(dead_mask)[0],
            'dead_count': dead_mask.sum(),
            'total_neurons': len(counts),
            'dead_percentage': 100 * dead_mask.sum() / len(counts)
        }

    return dead_neurons

def revive_dead_neurons(model, dead_neurons_info, method='reinitialize'):
    """
    Attempt to revive dead neurons.

    Note: assumes the keys of dead_neurons_info name modules that own
    weight/bias parameters (e.g., the Linear layer feeding each ReLU).

    Methods:
    - 'reinitialize': Reset weights to new random values
    - 'shift_bias': Add small positive bias
    """
    for name, module in model.named_modules():
        if name in dead_neurons_info:
            dead_indices = dead_neurons_info[name]['dead_indices']

            if method == 'reinitialize':
                # Reinitialize dead neurons' weights
                with torch.no_grad():
                    std = module.weight.data.std()
                    module.weight.data[dead_indices] = torch.randn_like(
                        module.weight.data[dead_indices]
                    ) * std
                    module.bias.data[dead_indices] = 0.01  # Small positive bias

            elif method == 'shift_bias':
                # Just shift the bias to allow some positive outputs
                with torch.no_grad():
                    module.bias.data[dead_indices] += 0.1

    return model

# Example analysis without torch (conceptual)
def simulate_dead_neuron_probability(n_trials=10000, n_neurons=100, n_samples=1000):
    """
    Simulate the probability of neurons dying with random weights.
    """
    dead_counts = []
    for _ in range(n_trials):
        # Random weights and biases
        W = np.random.randn(n_neurons, 100) * 0.1  # 100 input features
        b = np.random.randn(n_neurons) * 0.1

        # Random inputs (standard normal -> centered at 0)
        X = np.random.randn(n_samples, 100)

        # Pre-activations
        Z = X @ W.T + b  # Shape: (n_samples, n_neurons)

        # A neuron is dead if Z <= 0 for all samples
        dead = (Z <= 0).all(axis=0)
        dead_counts.append(dead.sum())

    print(f"With random init (mean 0, std 0.1):")
    print(f"  Average dead neurons: {np.mean(dead_counts):.1f} / {n_neurons}")
    print(f"  Probability of ≥1 dead: {100 * np.mean([d > 0 for d in dead_counts]):.1f}%")

simulate_dead_neuron_probability()
```

1. Proper Initialization (He Initialization):
He initialization sets weights with variance 2/n_in:
$$W \sim \mathcal{N}\!\left(0, \frac{2}{n_{\text{in}}}\right)$$
This ensures pre-activations keep an appropriate variance through depth, preventing immediate death (a sketch comparing initialization scales follows this list).
2. Small Learning Rate:
Prevents catastrophic weight updates that can kill neurons.
3. Batch Normalization:
Normalizes pre-activations to have mean ~0 and std ~1, ensuring roughly half are positive.
4. Use Leaky ReLU or Variants:
The most reliable solution—allow small gradients for negative inputs.
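Tying strategies 1 and 3 together, the following minimal NumPy sketch (layer width, depth, and the deliberately "too small" scale of 0.01 are arbitrary illustrative choices) shows that He initialization keeps the activation scale stable through 20 ReLU layers, that a poorly scaled initialization lets the signal collapse toward zero, and that a BatchNorm-like standardization of pre-activations restores a healthy scale even under the poor initialization:

```python
import numpy as np

np.random.seed(0)
n, depth = 256, 20
X = np.random.randn(512, n)   # a batch of standard-normal inputs

def last_layer_std(init_std, standardize=False):
    """Propagate the batch through a deep ReLU stack; return the final activation std."""
    h = X
    for _ in range(depth):
        W = np.random.randn(n, n) * init_std
        z = h @ W.T
        if standardize:
            # BatchNorm-like step: per-unit zero mean and unit variance
            z = (z - z.mean(axis=0)) / (z.std(axis=0) + 1e-8)
        h = np.maximum(0, z)
    return h.std()

print(f"He init, std = sqrt(2/n):  {last_layer_std(np.sqrt(2 / n)):.3f}")
print(f"Small init, std = 0.01:    {last_layer_std(0.01):.3g}")   # signal collapses
print(f"Small init + BN-like step: {last_layer_std(0.01, standardize=True):.3f}")
```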
Leaky ReLU is the simplest modification to address the dead neuron problem. Instead of outputting zero for negative inputs, it outputs a small negative value scaled by α.
$$\text{LeakyReLU}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{if } x \leq 0 \end{cases}$$
where α is a small positive constant, typically 0.01 or 0.1.
This can also be written as:
$$\text{LeakyReLU}(x) = \max(\alpha x, x)$$
The crucial difference from ReLU is in the derivative for x < 0:
$$\text{LeakyReLU}'(x) = \begin{cases} 1 & \text{if } x > 0 \\ \alpha & \text{if } x \leq 0 \end{cases}$$
Now even neurons with negative pre-activations receive some gradient (scaled by α). This means no neuron can die permanently: its weights keep receiving small updates, and the gradient through a stack of layers with negative pre-activations shrinks to $\alpha^k$ at worst instead of dropping to exactly zero.
```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """
    Leaky ReLU: max(αx, x)
    α is the 'leak' coefficient for negative values.
    """
    return np.where(x > 0, x, alpha * x)

def leaky_relu_derivative(x, alpha=0.01):
    """
    Derivative: 1 if x > 0, else α
    """
    return np.where(x > 0, 1.0, alpha)

# Compare gradient flow through "dead" regions
def compare_gradient_flow():
    """
    Show how Leaky ReLU maintains gradient flow where ReLU blocks it.
    """
    # Input that would cause "death" in standard ReLU
    x = np.array([-5.0, -2.0, -1.0, -0.1, 0.0, 0.1, 1.0, 2.0])

    print("Input:           ", x)
    print("ReLU output:     ", np.maximum(0, x))
    print("LeakyReLU output:", leaky_relu(x))
    print()
    print("ReLU gradient:   ", (x > 0).astype(float))
    print("Leaky gradient:  ", leaky_relu_derivative(x))

compare_gradient_flow()

# Effect on deep networks
def deep_gradient_analysis(L=20, alpha=0.01):
    """
    Analyze gradient product through L layers with negative inputs.
    """
    # Worst case: all pre-activations are negative
    worst_case_relu = 0 ** L
    worst_case_leaky = alpha ** L

    # Average case: 50% positive, 50% negative
    np.random.seed(42)
    masks = np.random.rand(L) > 0.5              # True = positive
    relu_grads = masks.astype(float)             # 1 or 0
    leaky_grads = np.where(masks, 1.0, alpha)    # 1 or α

    print(f"Through {L} layers:")
    print(f"  Worst case (all negative):")
    print(f"    ReLU gradient:  {worst_case_relu}")
    print(f"    Leaky gradient: {worst_case_leaky:.2e}")
    print(f"  Random case ({masks.sum()}/{L} positive):")
    print(f"    ReLU gradient:  {np.prod(relu_grads):.2e}")
    print(f"    Leaky gradient: {np.prod(leaky_grads):.2e}")

deep_gradient_analysis()
```

Common values: α = 0.01 (conservative, close to ReLU behavior) or α = 0.1 (more aggressive leak). Very small α preserves ReLU's approximate sparsity while preventing complete death. Larger α reduces sparsity but provides stronger gradient flow. In practice, α = 0.01 is the most common default.
| x | LeakyReLU(x) | LeakyReLU'(x) | Comparison to ReLU |
|---|---|---|---|
| -5 | -0.05 | 0.01 | Non-zero output/gradient |
| -1 | -0.01 | 0.01 | Non-zero output/gradient |
| 0 | 0 | 0.01 | Non-zero gradient |
| 1 | 1 | 1 | Same as ReLU |
| 5 | 5 | 1 | Same as ReLU |
Parametric ReLU (PReLU) takes Leaky ReLU one step further: instead of using a fixed α, it learns the optimal α during training.
$$\text{PReLU}(x_i) = \begin{cases} x_i & \text{if } x_i > 0 \\ a_i x_i & \text{if } x_i \leq 0 \end{cases}$$
where $a_i$ is a learnable parameter for the i-th channel/neuron.
During backpropagation, we need the gradient not just through the activation but also with respect to α:
$$\frac{\partial \text{PReLU}}{\partial a_i} = \begin{cases} 0 & \text{if } x_i > 0 \\ x_i & \text{if } x_i \leq 0 \end{cases}$$
This allows α to be updated via gradient descent along with the network weights.
```python
import numpy as np

class PReLU:
    """
    Parametric ReLU implementation with learnable slopes.
    """
    def __init__(self, num_channels, init_alpha=0.25):
        """
        Args:
            num_channels: Number of channels (each gets its own α)
            init_alpha: Initial value for α parameters
        """
        self.alpha = np.full(num_channels, init_alpha)
        self.alpha_grad = np.zeros_like(self.alpha)

    def forward(self, x):
        """
        Forward pass: max(x, α*x)
        x shape: (batch, channels, ...) or (batch, channels)
        """
        self.x = x  # Cache for backward pass
        # Expand alpha to broadcast correctly
        alpha_broadcast = self.alpha.reshape(1, -1, *([1] * (x.ndim - 2)))
        return np.where(x > 0, x, alpha_broadcast * x)

    def backward(self, grad_output):
        """
        Backward pass: compute gradients w.r.t. input and alpha.
        """
        # Gradient w.r.t. input
        alpha_broadcast = self.alpha.reshape(1, -1, *([1] * (self.x.ndim - 2)))
        grad_input = np.where(self.x > 0, grad_output, alpha_broadcast * grad_output)

        # Gradient w.r.t. alpha: sum over batch and spatial dimensions
        # d(PReLU)/d(alpha) = x when x <= 0, else 0
        negative_mask = self.x <= 0
        # Sum over all dimensions except channel dimension
        sum_axes = tuple([0] + list(range(2, self.x.ndim)))
        self.alpha_grad = np.sum(
            grad_output * self.x * negative_mask,
            axis=sum_axes
        )

        return grad_input

    def update(self, learning_rate=0.001):
        """Update alpha parameters."""
        self.alpha -= learning_rate * self.alpha_grad
        self.alpha_grad.fill(0)

# Demonstration
def prelu_demo():
    np.random.seed(42)

    # Create PReLU layer for 4 channels
    prelu = PReLU(num_channels=4, init_alpha=0.25)

    # Random input (batch=8, channels=4, height=3, width=3)
    x = np.random.randn(8, 4, 3, 3)

    print("Initial α values:", prelu.alpha)

    # Forward pass
    y = prelu.forward(x)

    # Simulated gradient from upstream
    grad_output = np.random.randn(*y.shape)

    # Backward pass
    grad_input = prelu.backward(grad_output)
    print("α gradients:", prelu.alpha_grad)

    # Update
    prelu.update(learning_rate=0.01)
    print("Updated α values:", prelu.alpha)

prelu_demo()
```

PReLU adds learnable parameters, which can improve performance but also increases overfitting risk, especially on small datasets. The learned α values sometimes converge to values very different from the typical 0.01 used in Leaky ReLU, suggesting that optimal slopes are task-dependent. For large datasets, PReLU often outperforms Leaky ReLU; for small datasets, fixed Leaky ReLU may generalize better.
The Exponential Linear Unit (ELU) introduces a smooth, saturating function for negative inputs that provides several theoretical advantages over ReLU variants.
$$\text{ELU}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha(e^x - 1) & \text{if } x \leq 0 \end{cases}$$
where α is typically set to 1.0.
1. Mean Activations Closer to Zero:
Unlike ReLU (mean > 0 for any centered input), ELU pushes activations toward zero mean. The negative saturation at -α balances the positive outputs. This provides a self-normalizing property similar to Batch Normalization's effect.
2. Smooth Everywhere:
The derivative is continuous (though the second derivative is not):
$$\text{ELU}'(x) = \begin{cases} 1 & \text{if } x > 0 \\ \text{ELU}(x) + \alpha = \alpha e^x & \text{if } x \leq 0 \end{cases}$$
This smooth transition can lead to faster optimization compared to the sharp corner of ReLU at x=0.
3. Noise Robustness:
The saturation for negative inputs (approaching -α) makes ELU robust to small deactivations. Unlike ReLU which produces exactly 0, the ELU output for very negative inputs asymptotes to -α, maintaining some activation.
```python
import numpy as np

def elu(x, alpha=1.0):
    """
    Exponential Linear Unit.
    """
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

def elu_derivative(x, alpha=1.0):
    """
    Derivative of ELU.
    Note: For x <= 0, d(ELU)/dx = α*exp(x) = ELU(x) + α
    """
    return np.where(x > 0, 1.0, alpha * np.exp(x))

def elu_derivative_from_output(output, alpha=1.0):
    """
    Efficiently compute derivative from cached forward output.
    ELU'(x) = 1 if x > 0, else ELU(x) + α
    """
    return np.where(output > 0, 1.0, output + alpha)

# Compare properties
def compare_mean_activations():
    """
    Show how ELU achieves closer-to-zero mean activations.
    """
    np.random.seed(42)

    # Standard normal input (mean=0, std=1)
    x = np.random.randn(100000)

    # Activations
    relu_out = np.maximum(0, x)
    leaky_out = np.where(x > 0, x, 0.01 * x)
    elu_out = elu(x, alpha=1.0)

    print("Mean activations for N(0,1) input:")
    print(f"  ReLU:  {relu_out.mean():.4f}")
    print(f"  Leaky: {leaky_out.mean():.4f}")
    print(f"  ELU:   {elu_out.mean():.4f}")  # Closest to 0

    print("\nStandard deviation:")
    print(f"  ReLU:  {relu_out.std():.4f}")
    print(f"  Leaky: {leaky_out.std():.4f}")
    print(f"  ELU:   {elu_out.std():.4f}")

compare_mean_activations()

# Smoothness comparison at x=0
def smoothness_at_zero():
    """
    Demonstrate the smooth transition of ELU vs sharp corner of ReLU.
    """
    x_fine = np.linspace(-0.1, 0.1, 1001)

    relu_deriv = (x_fine > 0).astype(float)
    elu_deriv = elu_derivative(x_fine, alpha=1.0)

    print("\nDerivative behavior near x=0:")
    print("  ReLU jumps from 0 to 1 instantly.")
    print(f"  ELU transitions smoothly: at x=-0.1, ELU'={elu_deriv[0]:.4f}")
    print(f"                            at x=0,    ELU'={elu_deriv[500]:.4f}")
    print(f"                            at x=0.1,  ELU'={elu_deriv[-1]:.4f}")

smoothness_at_zero()
```

ELU is particularly beneficial when: (1) You want self-normalizing behavior without explicit BatchNorm, (2) Your network is sensitive to the mean shift caused by ReLU, (3) You're working with tasks where smooth gradients help optimization. The main downside is computational cost—the exponential is slower than max().
SELU (Scaled Exponential Linear Unit) is a self-normalizing activation function with carefully derived scale factors that provably maintain mean 0 and variance 1 throughout a deep network.
$$\text{SELU}(x) = \lambda \begin{cases} x & \text{if } x > 0 \\ \alpha(e^x - 1) & \text{if } x \leq 0 \end{cases}$$
where the scale factors are $\alpha \approx 1.6733$ and $\lambda \approx 1.0507$.
These specific values are derived from fixed-point analysis of the mean and variance propagation.
Theoretical Guarantee:
Under certain conditions (proper weight initialization with zero mean and specific variance, no standard Dropout), SELU networks maintain:
$$\mathbb{E}[\text{output}] \to 0$$ $$\text{Var}[\text{output}] \to 1$$
This happens because the activation has an attracting fixed point at mean 0 and variance 1: the scale $\lambda > 1$ amplifies activations whose variance has become too small, while the saturating negative branch (bounded below by $-\lambda\alpha$) damps activations whose variance has grown too large and pulls the mean back toward zero.
```python
import numpy as np

# SELU constants (derived analytically)
SELU_ALPHA = 1.6732632423543772
SELU_LAMBDA = 1.0507009873554805

def selu(x):
    """
    Scaled Exponential Linear Unit.
    Self-normalizing activation function.
    """
    return SELU_LAMBDA * np.where(x > 0, x, SELU_ALPHA * (np.exp(x) - 1))

def selu_derivative(x):
    """
    Derivative of SELU.
    """
    return SELU_LAMBDA * np.where(x > 0, 1.0, SELU_ALPHA * np.exp(x))

def lecun_normal_init(shape):
    """
    LeCun Normal Initialization: required for SELU self-normalization.
    Weights ~ N(0, 1/fan_in)
    """
    # fan_in is the number of inputs feeding each unit
    # (second dimension of an (out, in) weight matrix)
    fan_in = shape[1] if len(shape) >= 2 else shape[0]
    std = np.sqrt(1.0 / fan_in)
    return np.random.normal(0, std, shape)

def alpha_dropout(x, rate=0.05, training=True):
    """
    Alpha Dropout: SELU-compatible dropout.
    Standard dropout breaks SELU's self-normalizing property.
    """
    if not training or rate == 0:
        return x

    # Alpha dropout parameters (derived to maintain SELU normalization)
    alpha = -SELU_LAMBDA * SELU_ALPHA

    # Affine transformation parameters to maintain mean=0, var=1
    a = ((1 - rate) * (1 + rate * alpha**2)) ** (-0.5)
    b = -a * alpha * rate

    # Create dropout mask
    mask = np.random.rand(*x.shape) > rate

    # Apply alpha dropout
    y = np.where(mask, x, alpha)

    # Affine transformation to restore normalization
    return a * y + b

def demonstrate_self_normalization():
    """
    Show that SELU maintains mean ≈ 0 and variance ≈ 1 through layers.
    """
    np.random.seed(42)

    # Network parameters
    input_dim = 1000
    hidden_dim = 1000
    num_layers = 50
    num_samples = 5000

    # Initialize with LeCun Normal
    weights = [lecun_normal_init((hidden_dim, input_dim if i == 0 else hidden_dim))
               for i in range(num_layers)]
    biases = [np.zeros(hidden_dim) for _ in range(num_layers)]

    # Input (standard normal)
    x = np.random.randn(num_samples, input_dim)

    print("Self-normalization through layers:")
    print("-" * 50)

    activations = x
    for layer in range(num_layers):
        # Forward pass
        z = activations @ weights[layer].T + biases[layer]
        activations = selu(z)

        if layer % 10 == 0 or layer == num_layers - 1:
            mean = activations.mean()
            std = activations.std()
            print(f"Layer {layer:2d}: mean = {mean:7.4f}, std = {std:.4f}")

demonstrate_self_normalization()

# Compare with ReLU (which would explode or vanish)
def compare_normalization():
    """
    Compare SELU vs ReLU normalization through deep network.
    """
    np.random.seed(42)

    input_dim = 500
    hidden_dim = 500
    num_layers = 30

    # Same weights for fair comparison (using He init for ReLU)
    weights = [np.random.randn(hidden_dim, input_dim if i == 0 else hidden_dim)
               * np.sqrt(2 / (input_dim if i == 0 else hidden_dim))
               for i in range(num_layers)]

    x = np.random.randn(1000, input_dim)

    print("\nComparison: SELU vs ReLU (no BatchNorm)")
    print("-" * 50)

    # SELU
    selu_x = x.copy()
    for layer in range(num_layers):
        z = selu_x @ weights[layer].T
        selu_x = selu(z)

    # ReLU
    relu_x = x.copy()
    for layer in range(num_layers):
        z = relu_x @ weights[layer].T
        relu_x = np.maximum(0, z)

    print(f"After {num_layers} layers:")
    print(f"  SELU: mean = {selu_x.mean():.4f}, std = {selu_x.std():.4f}")
    print(f"  ReLU: mean = {relu_x.mean():.4f}, std = {relu_x.std():.4f}")

compare_normalization()
```

SELU's self-normalizing property requires strict conditions: LeCun initialization, no standard dropout (use Alpha Dropout instead), and a primarily fully-connected architecture. For CNNs, RNNs, and Transformers, the theory doesn't apply, and empirically SELU often underperforms ReLU+BatchNorm. SELU is most useful when BatchNorm is problematic or when you want normalization-free training.
We have comprehensively analyzed the ReLU family of activation functions—understanding how they solve the vanishing gradient problem and the trade-offs between variants.
| Function | Formula (x ≤ 0) | Gradient (x ≤ 0) | Key Property |
|---|---|---|---|
| ReLU | 0 | 0 | Simplest, fastest, but dead neurons |
| Leaky ReLU | αx (α=0.01) | α | No dead neurons, fixed slope |
| PReLU | aᵢx | aᵢ (learned) | Adaptive slope, more parameters |
| ELU | α(eˣ-1) | αeˣ | Smooth, mean closer to 0 |
| SELU | λα(eˣ-1) | λαeˣ | Self-normalizing (with conditions) |
Practical Recommendation: Start with ReLU plus He initialization and a moderate learning rate; it is the fastest option and usually works. If you observe a large fraction of dead neurons, switch to Leaky ReLU with α = 0.01. Consider PReLU when you have a large dataset, ELU when smooth gradients or near-zero mean activations help, and SELU only for fully-connected networks that satisfy its strict conditions.
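One way to put this recommendation into practice (a PyTorch-flavored sketch; make_mlp and the ACTIVATIONS table are illustrative helpers, not library APIs) is to hide the activation choice behind a single constructor argument so that switching variants stays a one-line change:

```python
import torch.nn as nn

# Illustrative mapping from names to torch.nn activation modules.
ACTIVATIONS = {
    "relu": lambda: nn.ReLU(),
    "leaky_relu": lambda: nn.LeakyReLU(negative_slope=0.01),
    "prelu": lambda: nn.PReLU(),    # learnable slope
    "elu": lambda: nn.ELU(alpha=1.0),
    "selu": lambda: nn.SELU(),      # pair with LeCun init and nn.AlphaDropout
}

def make_mlp(in_dim, hidden_dim, out_dim, activation="relu"):
    """Small MLP whose activation can be swapped with one argument."""
    act = ACTIVATIONS[activation]
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim), act(),
        nn.Linear(hidden_dim, hidden_dim), act(),
        nn.Linear(hidden_dim, out_dim),
    )

model = make_mlp(784, 256, 10, activation="leaky_relu")
print(model)
```

Each call to act() builds a fresh module, so a PReLU slope is learned per layer rather than shared across the network.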
Looking Ahead:
The next page explores Swish and GELU—modern activation functions discovered through automated search that have become standard in state-of-the-art architectures like BERT, GPT, and EfficientNet.
You now have complete mastery of ReLU and its variants. You understand why ReLU enabled deep learning, how to diagnose and prevent dead neurons, and how to select appropriately among Leaky ReLU, PReLU, ELU, and SELU based on your network architecture and training requirements.