Every neural network—from the simplest perceptron to the largest transformer—owes its computational power to activation functions. Without them, neural networks would be nothing more than elaborate linear transformations, incapable of learning the complex, nonlinear patterns that define real-world data.
The sigmoid and hyperbolic tangent (tanh) functions are the original activation functions that powered the neural network revolution of the 1980s and 1990s. Understanding them deeply matters not merely for historical appreciation: they remain the correct choice for specific architectural components (output layers, gates, bounded outputs), and their limitations explain why ReLU and its variants came to dominate hidden layers.
By completing this page, you will understand the complete mathematical foundations of sigmoid and tanh, their derivatives and computational properties, the vanishing gradient problem they introduce, their biological motivations, and precisely when to use (and avoid) them in modern architectures.
The sigmoid function (also called the logistic function) is perhaps the most historically important activation function in neural networks. It transforms any real-valued input into an output bounded between 0 and 1, making it a natural choice for modeling probabilities.
The sigmoid function σ(x) is defined as:
$$\sigma(x) = \frac{1}{1 + e^{-x}} = \frac{e^x}{e^x + 1}$$
These two forms are mathematically equivalent, but each is preferable in a different regime: the first is numerically stable for large positive values of x, while the second avoids overflow for large negative values of x.
Domain and Range:

- Domain: all real numbers, $x \in (-\infty, +\infty)$
- Range: $(0, 1)$; the output approaches but never reaches 0 or 1

Symmetry and Fixed Points:

- $\sigma(-x) = 1 - \sigma(x)$, a point symmetry about $(0, 0.5)$
- $\sigma(0) = 0.5$, which is also the inflection point of the curve

Asymptotic Behavior:

- As $x \to +\infty$, $\sigma(x) \to 1$
- As $x \to -\infty$, $\sigma(x) \to 0$
The sigmoid function was introduced to neural networks in the 1970s-1980s as a differentiable alternative to the step function used in Rosenblatt's original perceptron. Its smooth, differentiable nature made gradient-based learning possible—a requirement for the backpropagation algorithm popularized by Rumelhart, Hinton, and Williams in 1986.
The derivative of the sigmoid function has an elegant, self-referential form:
$$\sigma'(x) = \sigma(x) \cdot (1 - \sigma(x))$$
This can be derived using the quotient rule:
$$\sigma'(x) = \frac{d}{dx}\left[\frac{1}{1 + e^{-x}}\right] = \frac{e^{-x}}{(1 + e^{-x})^2}$$
Recognizing that $1 - \sigma(x) = \frac{e^{-x}}{1 + e^{-x}}$, we can verify:

$$\sigma(x) \cdot (1 - \sigma(x)) = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} = \frac{e^{-x}}{(1 + e^{-x})^2} = \sigma'(x)$$
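To make the identity concrete, here is a minimal sanity check (a sketch, not part of any library) that compares the closed form σ'(x) = σ(x)(1 - σ(x)) against a central finite difference:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

def numerical_derivative(f, x, h=1e-5):
    # Central finite difference as an independent reference
    return (f(x + h) - f(x - h)) / (2 * h)

x = np.linspace(-6, 6, 13)
analytic = sigmoid_derivative(x)
numeric = numerical_derivative(sigmoid, x)

# The two estimates should agree closely (difference on the order of 1e-10 or smaller)
print(np.max(np.abs(analytic - numeric)))
```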
Properties of the Derivative:

- Maximum value: $\sigma'(0) = 0.25$, attained at $x = 0$
- Symmetric: $\sigma'(-x) = \sigma'(x)$
- Vanishes in both tails: $\sigma'(x) \to 0$ as $|x| \to \infty$, the source of the vanishing gradient problem discussed below
```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    """
    Numerically stable sigmoid implementation.
    Uses different formulations for positive/negative inputs
    to avoid overflow in exp().
    """
    # For positive x: use 1 / (1 + exp(-x))
    # For negative x: use exp(x) / (1 + exp(x))
    positive_mask = x >= 0
    negative_mask = ~positive_mask

    result = np.zeros_like(x, dtype=np.float64)

    # Positive values: standard formula
    result[positive_mask] = 1.0 / (1.0 + np.exp(-x[positive_mask]))

    # Negative values: numerically stable alternative
    exp_x = np.exp(x[negative_mask])
    result[negative_mask] = exp_x / (1.0 + exp_x)

    return result

def sigmoid_derivative(x):
    """
    Derivative of sigmoid: σ'(x) = σ(x) * (1 - σ(x))
    This elegant form allows efficient computation using
    the already-computed forward pass value.
    """
    s = sigmoid(x)
    return s * (1 - s)

def sigmoid_derivative_from_output(sigmoid_output):
    """
    Often in backpropagation, we have σ(x) from the forward pass.
    We can compute the derivative directly without re-evaluating sigmoid.
    """
    return sigmoid_output * (1 - sigmoid_output)

# Visualization of sigmoid and its derivative
x = np.linspace(-8, 8, 1000)
y_sigmoid = sigmoid(x)
y_derivative = sigmoid_derivative(x)

print("Key properties of sigmoid:")
print(f"  σ(0)  = {sigmoid(np.array([0.0]))[0]:.4f}")               # Should be 0.5
print(f"  σ'(0) = {sigmoid_derivative(np.array([0.0]))[0]:.4f}")    # Should be 0.25
print(f"  σ(5)  = {sigmoid(np.array([5.0]))[0]:.6f}")               # Should be ~0.9933
print(f"  σ'(5) = {sigmoid_derivative(np.array([5.0]))[0]:.6f}")    # Very small
print(f"  σ(-5) = {sigmoid(np.array([-5.0]))[0]:.6f}")              # Should be ~0.0067
```

| x | σ(x) | σ'(x) | Interpretation |
|---|---|---|---|
| -∞ | → 0 | → 0 | Strong negative, gradient vanishes |
| -5 | 0.0067 | 0.0066 | Gradient effectively zero |
| -2 | 0.1192 | 0.1050 | Moderate gradient |
| -1 | 0.2689 | 0.1966 | Reasonable gradient flow |
| 0 | 0.5000 | 0.2500 | Maximum gradient (inflection) |
| 1 | 0.7311 | 0.1966 | Reasonable gradient flow |
| 2 | 0.8808 | 0.1050 | Moderate gradient |
| 5 | 0.9933 | 0.0066 | Gradient effectively zero |
| +∞ | → 1 | → 0 | Strong positive, gradient vanishes |
The hyperbolic tangent (tanh) function is the zero-centered cousin of the sigmoid. It emerged as the preferred activation function in the 1990s precisely because its output distribution, centered around zero, provided more balanced gradient flow during training.
The tanh function is defined as:
$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = \frac{e^{2x} - 1}{e^{2x} + 1}$$
Tanh and sigmoid are intimately related through a simple linear transformation:
$$\tanh(x) = 2\sigma(2x) - 1$$
Equivalently:
$$\sigma(x) = \frac{\tanh(x/2) + 1}{2}$$
This relationship reveals that tanh is essentially a rescaled and shifted sigmoid. Where sigmoid maps inputs to (0, 1), tanh maps to (-1, 1).
Domain and Range:

- Domain: all real numbers
- Range: $(-1, 1)$, centered on zero

Symmetry:

- tanh is an odd function: $\tanh(-x) = -\tanh(x)$
- $\tanh(0) = 0$, so activations stay zero-centered for centered inputs

Asymptotic Behavior:

- As $x \to +\infty$, $\tanh(x) \to 1$
- As $x \to -\infty$, $\tanh(x) \to -1$
Zero-centered activations (like tanh) produce outputs that have mean close to zero. This is crucial because it means gradients during backpropagation can be positive or negative with roughly equal probability, preventing systematic bias in weight updates. Sigmoid's all-positive outputs create a 'zig-zagging' gradient descent path that slows convergence.
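To see where the zig-zag comes from, consider one downstream neuron: the gradient with respect to each of its incoming weights is the upstream gradient times the corresponding input, so if every input is positive (as sigmoid outputs are), all of those weight gradients share a single sign. The toy sketch below illustrates this; the numbers are arbitrary and not from a trained network.

```python
import numpy as np

# Pretend these are activations from the previous layer feeding one neuron
sigmoid_acts = np.array([0.9, 0.2, 0.7, 0.1, 0.5])    # all positive, like sigmoid outputs
tanh_acts = np.array([0.6, -0.3, 0.2, -0.8, 0.5])     # mixed signs, like tanh outputs

# Gradient of the loss w.r.t. this neuron's pre-activation (from upstream)
upstream_grad = -0.7

# dL/dw_i = upstream_grad * input_i
grad_w_sigmoid_inputs = upstream_grad * sigmoid_acts
grad_w_tanh_inputs = upstream_grad * tanh_acts

print(np.sign(grad_w_sigmoid_inputs))  # all the same sign -> weights forced to move together
print(np.sign(grad_w_tanh_inputs))     # mixed signs -> a more direct descent path
```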
The derivative of tanh has a similarly elegant form:
$$\tanh'(x) = 1 - \tanh^2(x) = \text{sech}^2(x)$$
where sech(x) = 1/cosh(x) is the hyperbolic secant.
Derivation:
Let y = tanh(x) = (e^x - e^{-x})/(e^x + e^{-x}). Using the quotient rule:
$$\frac{dy}{dx} = \frac{(e^x + e^{-x})(e^x + e^{-x}) - (e^x - e^{-x})(e^x - e^{-x})}{(e^x + e^{-x})^2}$$
$$= \frac{(e^x + e^{-x})^2 - (e^x - e^{-x})^2}{(e^x + e^{-x})^2}$$
$$= 1 - \left(\frac{e^x - e^{-x}}{e^x + e^{-x}}\right)^2 = 1 - \tanh^2(x)$$
Properties of the Derivative:

- Maximum value: $\tanh'(0) = 1$, four times the peak gradient of sigmoid
- Symmetric: $\tanh'(-x) = \tanh'(x)$
- Still saturates: $\tanh'(x) \to 0$ as $|x| \to \infty$
```python
import numpy as np

def tanh(x):
    """
    Numerically stable tanh implementation.
    np.tanh is already stable, but understanding the formula matters.
    """
    # For very large |x|, exp(2x) or exp(-2x) can overflow.
    # np.tanh handles this internally, but for education:
    #   tanh(x) = (1 - exp(-2x)) / (1 + exp(-2x))  for x >= 0
    #   tanh(x) = (exp(2x) - 1) / (exp(2x) + 1)    for x < 0
    return np.tanh(x)

def tanh_derivative(x):
    """
    Derivative of tanh: tanh'(x) = 1 - tanh²(x)
    """
    t = tanh(x)
    return 1 - t**2

def tanh_derivative_from_output(tanh_output):
    """
    Compute derivative directly from forward pass output.
    Essential for efficient backpropagation.
    """
    return 1 - tanh_output**2

# Relationship between sigmoid and tanh
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def verify_relationship(x):
    """
    Verify: tanh(x) = 2 * sigmoid(2x) - 1
    """
    tanh_direct = np.tanh(x)
    tanh_from_sigmoid = 2 * sigmoid(2 * x) - 1
    difference = np.abs(tanh_direct - tanh_from_sigmoid)
    return np.max(difference) < 1e-10  # Should be True

x_test = np.linspace(-5, 5, 1000)
print(f"Relationship verified: {verify_relationship(x_test)}")

# Compare gradient magnitudes at key points
print("\nGradient comparison at x=0:")
print(f"  σ'(0)    = {sigmoid(0) * (1 - sigmoid(0)):.4f}")  # 0.25
print(f"  tanh'(0) = {1 - tanh(0)**2:.4f}")                 # 1.0

print("\nGradient comparison at x=2:")
print(f"  σ'(2)    = {sigmoid(2) * (1 - sigmoid(2)):.4f}")  # ~0.105
print(f"  tanh'(2) = {1 - tanh(2)**2:.4f}")                 # ~0.071
```

| x | tanh(x) | tanh'(x) | Interpretation |
|---|---|---|---|
| -∞ | → -1 | → 0 | Saturated negative |
| -3 | -0.9951 | 0.0099 | Nearly saturated |
| -2 | -0.9640 | 0.0707 | Moderate gradient |
| -1 | -0.7616 | 0.4200 | Good gradient flow |
| 0 | 0.0000 | 1.0000 | Maximum gradient (origin) |
| 1 | 0.7616 | 0.4200 | Good gradient flow |
| 2 | 0.9640 | 0.0707 | Moderate gradient |
| 3 | 0.9951 | 0.0099 | Nearly saturated |
| +∞ | → 1 | → 0 | Saturated positive |
Understanding when to choose sigmoid versus tanh requires a systematic comparison across multiple dimensions. While they are mathematically related, their different output ranges lead to meaningfully different behavior in neural networks.
The gradient flow during backpropagation is the most critical difference for training deep networks.
Both sigmoid and tanh suffer from the vanishing gradient problem, though to different degrees. This phenomenon is the primary reason both were largely replaced by ReLU for hidden layers.
Mathematical Analysis:
Consider a network with L layers using sigmoid activations. During backpropagation, the gradient reaching the first layer's weights contains a product of L sigmoid derivatives (ignoring the weight matrices, which can further shrink or amplify it):
$$\frac{\partial \mathcal{L}}{\partial W^{(1)}} \propto \prod_{l=1}^{L} \sigma'(z^{(l)})$$
Since σ'(x) ≤ 0.25 always:
$$\left|\frac{\partial \mathcal{L}}{\partial W^{(1)}}\right| \leq 0.25^L$$
- For $L = 10$ layers: gradient $\leq 0.25^{10} \approx 9.5 \times 10^{-7}$
- For $L = 20$ layers: gradient $\leq 0.25^{20} \approx 9.1 \times 10^{-13}$
These gradients are so small that early layers receive essentially zero learning signal—they become untrainable.
Tanh's Partial Mitigation:
Tanh's maximum gradient of 1.0 means that at the origin, gradients can flow without attenuation. However, for any inputs away from zero, tanh'(x) < 1, and deep networks still suffer gradient decay. The improvement is real but insufficient for very deep networks.
When pre-activations (z = Wx + b) become large in magnitude, both sigmoid and tanh 'saturate'—their derivatives become vanishingly small. Worse, once a neuron is saturated, the small gradient means it receives almost no learning signal to escape saturation. This can create 'dead' or 'stuck' neurons that contribute nothing to learning.
```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

def tanh_derivative(x):
    return 1 - np.tanh(x)**2

def analyze_gradient_decay(activation_derivative, num_layers, input_values):
    """
    Simulate gradient flow through multiple layers.
    Returns the cumulative gradient product.
    """
    gradient_product = np.ones_like(input_values)
    for _ in range(num_layers):
        gradient_product *= activation_derivative(input_values)
    return gradient_product

# Analyze gradient decay for different depths
depths = [1, 5, 10, 15, 20]
x = np.linspace(-3, 3, 1000)

print("Gradient product at x=0 (best case) for L layers:")
print("-" * 50)
for L in depths:
    sigmoid_grad = 0.25 ** L
    tanh_grad = 1.0 ** L  # At x=0, tanh'(0) = 1
    print(f"L={L:2d}: Sigmoid = {sigmoid_grad:.2e}, Tanh = {tanh_grad:.2e}")

print("\nGradient product at x=2 (typical case) for L layers:")
print("-" * 50)
sigmoid_local = sigmoid_derivative(np.array([2.0]))[0]
tanh_local = tanh_derivative(np.array([2.0]))[0]
for L in depths:
    sigmoid_grad = sigmoid_local ** L
    tanh_grad = tanh_local ** L
    print(f"L={L:2d}: Sigmoid = {sigmoid_grad:.2e}, Tanh = {tanh_grad:.2e}")

# The practical reality: pre-activations are rarely exactly 0
print("\nPractical gradient decay (random pre-activations in [-2, 2]):")
print("-" * 50)
np.random.seed(42)
num_simulations = 1000

for L in depths:
    sigmoid_products = []
    tanh_products = []
    for _ in range(num_simulations):
        pre_activations = np.random.uniform(-2, 2, L)
        sigmoid_products.append(np.prod(sigmoid_derivative(pre_activations)))
        tanh_products.append(np.prod(tanh_derivative(pre_activations)))
    print(f"L={L:2d}: Sigmoid mean = {np.mean(sigmoid_products):.2e}, "
          f"Tanh mean = {np.mean(tanh_products):.2e}")
```

The sigmoid and tanh functions were originally motivated by analogies to biological neurons. While these analogies are imperfect—and modern deep learning has largely moved away from biological plausibility as a design criterion—understanding the historical motivation provides valuable context.
Real biological neurons exhibit several key behaviors:

- Thresholding: a neuron fires only when its combined input exceeds a threshold
- Saturation: the firing rate cannot exceed a physiological maximum
- Graded response: between threshold and saturation, the firing rate varies smoothly with input strength
The sigmoid function was proposed as a smooth, differentiable approximation to the step function used to model this threshold behavior.
The sigmoid output can be interpreted as the probability of firing or equivalently the normalized firing rate:
$$\text{firing rate} = r_{\max} \cdot \sigma(W \cdot x + b)$$
where r_max is the maximum firing rate. This interpretation links a unit's activation to the rate-coding view of biological neurons and supports reading sigmoid outputs as normalized firing rates or probabilities.
Contemporary deep learning research has largely abandoned biological plausibility as a design goal. ReLU, which is a poor model of biological firing (real neurons saturate at a maximum rate, while ReLU is unbounded above), vastly outperforms biologically inspired functions in most settings. The lesson: inspiration is valuable, but empirical performance trumps theoretical elegance.
The tanh function can be interpreted as modeling excitatory and inhibitory inputs symmetrically:

- Positive outputs correspond to net excitation
- Negative outputs correspond to net inhibition
- An output of zero corresponds to a balanced, resting state
This symmetric model better captures the push-pull nature of neural circuits, where inhibition is as important as excitation.
Beyond biological analogy, sigmoid activations can be motivated from information theory:
- Maximum entropy: a sigmoid of a linear function parameterizes the maximum-entropy distribution over binary outcomes subject to a mean constraint
- Bits of information: the output σ(x) can be read as the probability that the feature detected by this neuron is present
- Logistic regression connection: a neuron with sigmoid activation implements logistic regression, the canonical discriminative model for binary classification with a linear decision boundary (see the sketch below)
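As a rough illustration of that last point (a toy sketch with made-up data, not a reference implementation): training a single sigmoid unit with binary cross-entropy by gradient descent is exactly logistic regression.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy, linearly separable binary classification data (hypothetical)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(2)
b = 0.0
lr = 0.1

# Gradient descent on binary cross-entropy: a single sigmoid neuron = logistic regression
for _ in range(500):
    p = sigmoid(X @ w + b)   # forward pass of one sigmoid neuron
    grad_z = p - y           # dL/dz for sigmoid + BCE (standard identity)
    w -= lr * (X.T @ grad_z) / len(y)
    b -= lr * grad_z.mean()

accuracy = ((sigmoid(X @ w + b) > 0.5) == y).mean()
print(f"training accuracy: {accuracy:.2f}")  # typically well above 0.9 on this separable toy data
```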
Implementing sigmoid and tanh correctly requires attention to numerical stability. Naive implementations can produce NaN or Inf values, undermining model training.
The Problem: The naive formula

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

requires computing $e^{-x}$, which overflows for large negative inputs. Example: σ(-1000) would compute e^1000, which overflows float64.
The Solution: Use different formulations for positive and negative inputs:
$$\sigma(x) = \begin{cases} \frac{1}{1 + e^{-x}} & \text{if } x \geq 0 \\ \frac{e^x}{1 + e^x} & \text{if } x < 0 \end{cases}$$
Both forms are mathematically identical, but numerically stable in their respective domains.
```python
import numpy as np

def sigmoid_naive(x):
    """
    Naive sigmoid - will overflow for large negative x.
    DO NOT USE IN PRODUCTION.
    """
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_stable(x):
    """
    Numerically stable sigmoid using conditional formulation.
    """
    # Vectorized implementation
    result = np.empty_like(x, dtype=np.float64)
    positive = x >= 0
    negative = ~positive

    # For x >= 0: 1 / (1 + exp(-x))
    result[positive] = 1.0 / (1.0 + np.exp(-x[positive]))

    # For x < 0: exp(x) / (1 + exp(x))
    exp_x = np.exp(x[negative])
    result[negative] = exp_x / (1.0 + exp_x)

    return result

def sigmoid_log_stable(x):
    """
    Even more stable version using log-space computation.
    Useful when you need log(σ(x)) or log(1-σ(x)).
    """
    return -np.logaddexp(0, -x)  # Returns log(σ(x))

def log_sigmoid_components(x):
    """
    Return log(σ(x)) and log(1-σ(x)) stably.
    Essential for cross-entropy loss computation.
    """
    # log(σ(x))     = -log(1 + exp(-x)) = -softplus(-x)
    # log(1 - σ(x)) = -log(1 + exp(x))  = -softplus(x)
    log_sigmoid = -np.logaddexp(0, -x)
    log_one_minus_sigmoid = -np.logaddexp(0, x)
    return log_sigmoid, log_one_minus_sigmoid

# Demonstrate the problem
print("Naive vs Stable implementation:")
print("-" * 50)
test_values = np.array([0.0, 10.0, 100.0, 1000.0, -10.0, -100.0, -1000.0])

for x in test_values:
    naive = sigmoid_naive(x) if x > -500 else "overflow"
    stable = sigmoid_stable(np.array([x]))[0]
    print(f"x = {x:7.1f}: naive = {naive!s:12}, stable = {stable:.10f}")

# Log-space computation for extreme values
print("\nLog-space computation (for loss functions):")
print("-" * 50)
log_sig, log_one_minus_sig = log_sigmoid_components(test_values)
for x, ls, loms in zip(test_values, log_sig, log_one_minus_sig):
    print(f"x = {x:7.1f}: log(σ(x)) = {ls:12.6f}, log(1-σ(x)) = {loms:12.6f}")
```

Computational efficiency matters when processing billions of activations. Here's how the functions compare:

| Operation | Sigmoid | Tanh | ReLU |
|---|---|---|---|
| Forward pass | 1 exp + 1 add + 1 div | 2 exp + 1 add + 1 sub + 1 div | 1 max |
| Backward pass | 2 mul + 1 sub | 1 mul + 1 sub + 1 sq | 1 comparison |
| Memory (caching) | Store σ(x) | Store tanh(x) | Store mask |
| Relative speed | 1× (baseline) | ~1.1× | ~10× faster |
| Numerical issues | Overflow for large negative x (naive form) | Overflow for large \|x\| (naive form) | None |
Modern deep learning frameworks (PyTorch, TensorFlow, JAX) implement sigmoid and tanh with hardware-optimized fused kernels. The raw operation count understates the performance—but the ~10× speed advantage of ReLU persists due to the fundamental simplicity of the max() operation.
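If you want to see the gap on your own machine, a rough NumPy timing sketch like the one below works; the absolute numbers and the exact ratio depend heavily on hardware, array size, and library version, so treat it as indicative only.

```python
import timeit
import numpy as np

x = np.random.randn(10_000_000).astype(np.float32)

def bench(fn, repeats=5):
    # Best of several runs, one pass over the array each time
    return min(timeit.repeat(lambda: fn(x), number=1, repeat=repeats))

sigmoid_t = bench(lambda a: 1.0 / (1.0 + np.exp(-a)))
tanh_t = bench(np.tanh)
relu_t = bench(lambda a: np.maximum(a, 0.0))

print(f"sigmoid: {sigmoid_t:.4f}s  tanh: {tanh_t:.4f}s  relu: {relu_t:.4f}s")
# Expect ReLU to be substantially faster; the exact ratio varies by machine.
```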
Despite the dominance of ReLU in hidden layers, sigmoid and tanh remain the correct choice for specific architectural components. Understanding these use cases is essential for proper network design.
Binary Classification: Sigmoid is the canonical output activation for binary classification:
$$P(y=1|x) = \sigma(W \cdot h + b)$$
This provides a proper probability in [0, 1] that can be used with binary cross-entropy loss:
$$\mathcal{L} = -[y \log \sigma(z) + (1-y) \log(1 - \sigma(z))]$$
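In practice, this loss is usually computed directly from the logit z rather than from σ(z), which avoids log(0) when the sigmoid saturates. A minimal sketch of the idea, using the same log-sum-exp identities as in the numerical-stability section (the function name here is our own):

```python
import numpy as np

def bce_from_logits(z, y):
    """
    Binary cross-entropy computed directly from logits.
    Uses log(sigma(z)) = -log(1 + exp(-z)) and log(1 - sigma(z)) = -log(1 + exp(z)),
    both evaluated stably with np.logaddexp.
    """
    log_p = -np.logaddexp(0.0, -z)            # log σ(z)
    log_one_minus_p = -np.logaddexp(0.0, z)   # log(1 - σ(z))
    return -(y * log_p + (1 - y) * log_one_minus_p)

z = np.array([-50.0, 0.0, 50.0])
y = np.array([0.0, 1.0, 1.0])
print(bce_from_logits(z, y))  # finite values even for extreme logits
```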
Multi-Label Classification: When multiple labels can be simultaneously true, use sigmoid on each output independently:
$$P(y_i=1|x) = \sigma(z_i) \quad \text{for } i = 1, ..., K$$
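A quick sketch of the difference (arbitrary logits, purely illustrative): independent sigmoids score each label separately and need not sum to 1, whereas softmax couples the outputs into a single distribution over mutually exclusive classes.

```python
import numpy as np

logits = np.array([2.0, -1.0, 0.5])  # one raw score per label (hypothetical values)

# Multi-label: independent sigmoid per label
sigmoid_probs = 1.0 / (1.0 + np.exp(-logits))
print(sigmoid_probs, sigmoid_probs.sum())   # sum can be anything

# Multi-class: softmax couples the outputs into one distribution
exp_l = np.exp(logits - logits.max())
softmax_probs = exp_l / exp_l.sum()
print(softmax_probs, softmax_probs.sum())   # sums to 1
```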
Sigmoid is essential for the gates in LSTM and GRU architectures. The forget gate f_t, input gate i_t, and output gate o_t all use sigmoid because gates must produce values in [0, 1] that represent 'how much' information flows through. A value of 0 means 'block completely' and 1 means 'allow completely'. ReLU cannot provide this bounded gating behavior.
```python
import numpy as np

def lstm_cell(x, h_prev, c_prev, Wf, Wi, Wc, Wo, bf, bi, bc, bo):
    """
    LSTM cell implementation showing gate activations.
    Note: All gates use sigmoid (σ) for [0,1] gating.
    The candidate cell state uses tanh for [-1,1] values.
    """
    # Combine input and previous hidden state
    combined = np.concatenate([x, h_prev])

    # --- GATES USE SIGMOID ---
    # Forget gate: What to forget from previous cell state
    f = sigmoid(combined @ Wf + bf)   # σ: [0,1] - forget factor

    # Input gate: What new information to store
    i = sigmoid(combined @ Wi + bi)   # σ: [0,1] - input factor

    # Output gate: What to output from cell state
    o = sigmoid(combined @ Wo + bo)   # σ: [0,1] - output factor

    # --- CELL STATE USES TANH ---
    # Candidate cell state: New values to potentially add
    c_candidate = np.tanh(combined @ Wc + bc)  # tanh: [-1,1] - value

    # Update cell state: forget old + add new
    c = f * c_prev + i * c_candidate

    # Compute hidden state: tanh-squashed cell state, gated by output
    h = o * np.tanh(c)

    return h, c

def sigmoid(x):
    return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

# Why this works:
# - Sigmoid gates are multiplicative: value * gate
# - Gate = 0: blocks information completely
# - Gate = 1: allows information completely
# - Gate = 0.5: allows 50% of information
# - ReLU can't do this: values can exceed 1, and 0 is not a clean "off"
```

Additive Attention: Original attention (Bahdanau attention) uses tanh for score computation:
$$e_{ij} = v^T \cdot \tanh(W_h \cdot h_j + W_s \cdot s_i)$$
Tanh bounds the scores, preventing extreme values before softmax.
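A small sketch of additive attention with made-up dimensions and random weights (all names here are illustrative): because tanh outputs lie in (-1, 1), each score is bounded by the magnitude of v before the softmax.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 6 encoder states of size 8, decoder state of size 8, attention size 16
T, d_h, d_s, d_a = 6, 8, 8, 16
H = rng.normal(size=(T, d_h))   # encoder hidden states h_j
s = rng.normal(size=d_s)        # current decoder state s_i

W_h = rng.normal(size=(d_a, d_h)) * 0.1
W_s = rng.normal(size=(d_a, d_s)) * 0.1
v = rng.normal(size=d_a) * 0.1

# Additive (Bahdanau-style) scores: e_ij = v^T tanh(W_h h_j + W_s s_i)
scores = np.array([v @ np.tanh(W_h @ h_j + W_s @ s) for h_j in H])

# Softmax over the bounded scores, then form the context vector
weights = np.exp(scores - scores.max())
weights /= weights.sum()
context = weights @ H
print(weights.round(3), context.shape)
```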
Gated Attention: Sigmoid is used for attention gates in some architectures:
$$g = \sigma(W_g \cdot [h; c])$$
Any situation requiring bounded outputs suggests sigmoid or tanh:
Sigmoid [0, 1]:

- Probabilities and confidence scores
- Pixel intensities or other quantities normalized to [0, 1]
- Gate, mask, and mixing coefficients

Tanh [-1, 1]:

- Zero-centered embeddings and hidden states
- Normalized control signals or actions (see the bounded-regression sketch below)
- Image generator outputs when pixels are scaled to [-1, 1]
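For bounded regression, a common pattern is to squash the raw output with tanh and rescale it to the target range. A minimal sketch with hypothetical names and random weights:

```python
import numpy as np

# Hypothetical bounded-regression head: predict a steering angle in [-max_angle, max_angle]
max_angle = np.pi / 4

def steering_head(features, w, b):
    z = features @ w + b
    return max_angle * np.tanh(z)   # tanh bounds the raw output to (-1, 1), then rescale

rng = np.random.default_rng(1)
features = rng.normal(size=(3, 5))
w = rng.normal(size=5) * 0.1
b = 0.0
print(steering_head(features, w, b))  # every prediction lies inside (-max_angle, max_angle)
```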
| Use Case | Recommended | Reason |
|---|---|---|
| Hidden layers (deep networks) | ReLU/variants | No vanishing gradients |
| Binary classification output | Sigmoid | Produces valid probability |
| Multi-class output | Softmax | Produces probability distribution |
| Multi-label output | Sigmoid (per label) | Independent probabilities |
| LSTM/GRU gates | Sigmoid | Bounded [0,1] gating |
| LSTM/GRU cell state | Tanh | Bounded [-1,1] values |
| Bounded regression | Tanh or Sigmoid | Match output range to target |
| Embeddings | Tanh | Zero-centered, bounded |
| Generative models (latent) | Tanh | Symmetric, bounded range |
We have comprehensively analyzed the sigmoid and hyperbolic tangent activation functions—understanding their mathematical foundations, gradient behavior, biological motivations, computational properties, and modern applications.
Looking Ahead:
The limitations of sigmoid and tanh—particularly the vanishing gradient problem—motivated the search for better activation functions. In the next page, we explore ReLU and its variants, the functions that revolutionized deep learning by enabling training of much deeper networks.
You now have complete mastery of sigmoid and tanh activation functions—their mathematics, behavior, limitations, and proper modern usage. This foundation is essential for understanding why ReLU variants dominate and for correctly applying sigmoid/tanh in the specific contexts where they remain optimal.