Every neural network—from the simplest perceptron to the largest transformer—owes its computational power to activation functions. Without them, neural networks would be nothing more than elaborate linear transformations, incapable of learning the complex, nonlinear patterns that define real-world data.
The sigmoid and hyperbolic tangent (tanh) functions are the original activation functions that powered the neural network revolution of the 1980s and 1990s. Understanding them deeply matters not merely for historical appreciation: they remain the correct choice for specific architectural components (output layers, gates, bounded outputs), and their limitations explain why ReLU and its variants came to dominate hidden layers.
By completing this page, you will understand the complete mathematical foundations of sigmoid and tanh, their derivatives and computational properties, the vanishing gradient problem they introduce, their biological motivations, and precisely when to use (and avoid) them in modern architectures.
The sigmoid function (also called the logistic function) is perhaps the most historically important activation function in neural networks. It transforms any real-valued input into an output bounded between 0 and 1, making it a natural choice for modeling probabilities.
The sigmoid function σ(x) is defined as:
$$\sigma(x) = \frac{1}{1 + e^{-x}} = \frac{e^x}{e^x + 1}$$
These two forms are mathematically equivalent, but each is preferable in a different regime: the first is numerically stable for large positive values of x, while the second avoids overflow for large negative values of x.
Domain and Range:

- Domain: all real numbers, $x \in (-\infty, +\infty)$
- Range: $(0, 1)$; the output approaches but never reaches 0 or 1

Symmetry and Fixed Points:

- $\sigma(-x) = 1 - \sigma(x)$, a point symmetry about $(0, 0.5)$
- $\sigma(0) = 0.5$, which is also the inflection point of the curve

Asymptotic Behavior:

- As $x \to +\infty$, $\sigma(x) \to 1$
- As $x \to -\infty$, $\sigma(x) \to 0$
The sigmoid function was introduced to neural networks in the 1970s-1980s as a differentiable alternative to the step function used in Rosenblatt's original perceptron. Its smooth, differentiable nature made gradient-based learning possible—a requirement for the backpropagation algorithm popularized by Rumelhart, Hinton, and Williams in 1986.
The derivative of the sigmoid function has an elegant, self-referential form:
$$\sigma'(x) = \sigma(x) \cdot (1 - \sigma(x))$$
This can be derived using the quotient rule:
$$\sigma'(x) = \frac{d}{dx}\left[\frac{1}{1 + e^{-x}}\right] = \frac{e^{-x}}{(1 + e^{-x})^2}$$
Recognizing that $1 - \sigma(x) = \frac{e^{-x}}{1 + e^{-x}}$, we can verify:

$$\sigma(x) \cdot (1 - \sigma(x)) = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} = \frac{e^{-x}}{(1 + e^{-x})^2} = \sigma'(x)$$
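To make the identity concrete, here is a minimal sanity check (a sketch, not part of any library) that compares the closed form σ'(x) = σ(x)(1 - σ(x)) against a central finite difference:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

def numerical_derivative(f, x, h=1e-5):
    # Central finite difference as an independent reference
    return (f(x + h) - f(x - h)) / (2 * h)

x = np.linspace(-6, 6, 13)
analytic = sigmoid_derivative(x)
numeric = numerical_derivative(sigmoid, x)

# The two estimates should agree closely (difference on the order of 1e-10 or smaller)
print(np.max(np.abs(analytic - numeric)))
```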
Properties of the Derivative:

- Maximum value: $\sigma'(0) = 0.25$, attained at $x = 0$
- Symmetric: $\sigma'(-x) = \sigma'(x)$
- Vanishes in both tails: $\sigma'(x) \to 0$ as $|x| \to \infty$, the source of the vanishing gradient problem discussed below
```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    """
    Numerically stable sigmoid implementation.
    Uses different formulations for positive/negative inputs
    to avoid overflow in exp().
    """
    # For positive x: use 1 / (1 + exp(-x))
    # For negative x: use exp(x) / (1 + exp(x))
    positive_mask = x >= 0
    negative_mask = ~positive_mask

    result = np.zeros_like(x, dtype=np.float64)

    # Positive values: standard formula
    result[positive_mask] = 1.0 / (1.0 + np.exp(-x[positive_mask]))

    # Negative values: numerically stable alternative
    exp_x = np.exp(x[negative_mask])
    result[negative_mask] = exp_x / (1.0 + exp_x)

    return result

def sigmoid_derivative(x):
    """
    Derivative of sigmoid: σ'(x) = σ(x) * (1 - σ(x))
    This elegant form allows efficient computation using
    the already-computed forward pass value.
    """
    s = sigmoid(x)
    return s * (1 - s)

def sigmoid_derivative_from_output(sigmoid_output):
    """
    Often in backpropagation, we have σ(x) from the forward pass.
    We can compute the derivative directly without re-evaluating sigmoid.
    """
    return sigmoid_output * (1 - sigmoid_output)

# Visualization of sigmoid and its derivative
x = np.linspace(-8, 8, 1000)
y_sigmoid = sigmoid(x)
y_derivative = sigmoid_derivative(x)

print("Key properties of sigmoid:")
print(f"  σ(0)  = {sigmoid(np.array([0.0]))[0]:.4f}")               # Should be 0.5
print(f"  σ'(0) = {sigmoid_derivative(np.array([0.0]))[0]:.4f}")    # Should be 0.25
print(f"  σ(5)  = {sigmoid(np.array([5.0]))[0]:.6f}")               # Should be ~0.9933
print(f"  σ'(5) = {sigmoid_derivative(np.array([5.0]))[0]:.6f}")    # Very small
print(f"  σ(-5) = {sigmoid(np.array([-5.0]))[0]:.6f}")              # Should be ~0.0067
```

| x | σ(x) | σ'(x) | Interpretation |
|---|---|---|---|
| -∞ | → 0 | → 0 | Strong negative, gradient vanishes |
| -5 | 0.0067 | 0.0066 | Gradient effectively zero |
| -2 | 0.1192 | 0.1050 | Moderate gradient |
| -1 | 0.2689 | 0.1966 | Reasonable gradient flow |
| 0 | 0.5000 | 0.2500 | Maximum gradient (inflection) |
| 1 | 0.7311 | 0.1966 | Reasonable gradient flow |
| 2 | 0.8808 | 0.1050 | Moderate gradient |
| 5 | 0.9933 | 0.0066 | Gradient effectively zero |
| +∞ | → 1 | → 0 | Strong positive, gradient vanishes |
The hyperbolic tangent (tanh) function is the zero-centered cousin of the sigmoid. It emerged as the preferred activation function in the 1990s precisely because its output distribution, centered around zero, provided more balanced gradient flow during training.
The tanh function is defined as:
$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = \frac{e^{2x} - 1}{e^{2x} + 1}$$
Tanh and sigmoid are intimately related through a simple linear transformation:
$$\tanh(x) = 2\sigma(2x) - 1$$
Equivalently:
$$\sigma(x) = \frac{\tanh(x/2) + 1}{2}$$
This relationship reveals that tanh is essentially a rescaled and shifted sigmoid. Where sigmoid maps inputs to (0, 1), tanh maps to (-1, 1).
Domain and Range:

- Domain: all real numbers
- Range: $(-1, 1)$, centered on zero

Symmetry:

- tanh is an odd function: $\tanh(-x) = -\tanh(x)$
- $\tanh(0) = 0$, so activations stay zero-centered for centered inputs

Asymptotic Behavior:

- As $x \to +\infty$, $\tanh(x) \to 1$
- As $x \to -\infty$, $\tanh(x) \to -1$
Zero-centered activations (like tanh) produce outputs that have mean close to zero. This is crucial because it means gradients during backpropagation can be positive or negative with roughly equal probability, preventing systematic bias in weight updates. Sigmoid's all-positive outputs create a 'zig-zagging' gradient descent path that slows convergence.
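To see where the zig-zag comes from, consider one downstream neuron: the gradient with respect to each of its incoming weights is the upstream gradient times the corresponding input, so if every input is positive (as sigmoid outputs are), all of those weight gradients share a single sign. The toy sketch below illustrates this; the numbers are arbitrary and not from a trained network.

```python
import numpy as np

# Pretend these are activations from the previous layer feeding one neuron
sigmoid_acts = np.array([0.9, 0.2, 0.7, 0.1, 0.5])    # all positive, like sigmoid outputs
tanh_acts = np.array([0.6, -0.3, 0.2, -0.8, 0.5])     # mixed signs, like tanh outputs

# Gradient of the loss w.r.t. this neuron's pre-activation (from upstream)
upstream_grad = -0.7

# dL/dw_i = upstream_grad * input_i
grad_w_sigmoid_inputs = upstream_grad * sigmoid_acts
grad_w_tanh_inputs = upstream_grad * tanh_acts

print(np.sign(grad_w_sigmoid_inputs))  # all the same sign -> weights forced to move together
print(np.sign(grad_w_tanh_inputs))     # mixed signs -> a more direct descent path
```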
The derivative of tanh has a similarly elegant form:
$$\tanh'(x) = 1 - \tanh^2(x) = \text{sech}^2(x)$$
where sech(x) = 1/cosh(x) is the hyperbolic secant.
Derivation:
Let y = tanh(x) = (e^x - e^{-x})/(e^x + e^{-x}). Using the quotient rule:
$$\frac{dy}{dx} = \frac{(e^x + e^{-x})(e^x + e^{-x}) - (e^x - e^{-x})(e^x - e^{-x})}{(e^x + e^{-x})^2}$$
$$= \frac{(e^x + e^{-x})^2 - (e^x - e^{-x})^2}{(e^x + e^{-x})^2}$$
$$= 1 - \left(\frac{e^x - e^{-x}}{e^x + e^{-x}}\right)^2 = 1 - \tanh^2(x)$$
Properties of the Derivative:

- Maximum value: $\tanh'(0) = 1$, four times the peak gradient of sigmoid
- Symmetric: $\tanh'(-x) = \tanh'(x)$
- Still saturates: $\tanh'(x) \to 0$ as $|x| \to \infty$
```python
import numpy as np

def tanh(x):
    """
    Numerically stable tanh implementation.
    np.tanh is already stable, but understanding the formula matters.
    """
    # For very large |x|, exp(2x) or exp(-2x) can overflow.
    # np.tanh handles this internally, but for education:
    #   tanh(x) = (1 - exp(-2x)) / (1 + exp(-2x))  for x >= 0
    #   tanh(x) = (exp(2x) - 1) / (exp(2x) + 1)    for x < 0
    return np.tanh(x)

def tanh_derivative(x):
    """
    Derivative of tanh: tanh'(x) = 1 - tanh²(x)
    """
    t = tanh(x)
    return 1 - t**2

def tanh_derivative_from_output(tanh_output):
    """
    Compute derivative directly from forward pass output.
    Essential for efficient backpropagation.
    """
    return 1 - tanh_output**2

# Relationship between sigmoid and tanh
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def verify_relationship(x):
    """
    Verify: tanh(x) = 2 * sigmoid(2x) - 1
    """
    tanh_direct = np.tanh(x)
    tanh_from_sigmoid = 2 * sigmoid(2 * x) - 1
    difference = np.abs(tanh_direct - tanh_from_sigmoid)
    return np.max(difference) < 1e-10  # Should be True

x_test = np.linspace(-5, 5, 1000)
print(f"Relationship verified: {verify_relationship(x_test)}")

# Compare gradient magnitudes at key points
print("\nGradient comparison at x=0:")
print(f"  σ'(0)    = {sigmoid(0) * (1 - sigmoid(0)):.4f}")  # 0.25
print(f"  tanh'(0) = {1 - tanh(0)**2:.4f}")                 # 1.0

print("\nGradient comparison at x=2:")
print(f"  σ'(2)    = {sigmoid(2) * (1 - sigmoid(2)):.4f}")  # ~0.105
print(f"  tanh'(2) = {1 - tanh(2)**2:.4f}")                 # ~0.071
```

| x | tanh(x) | tanh'(x) | Interpretation |
|---|---|---|---|
| -∞ | → -1 | → 0 | Saturated negative |
| -3 | -0.9951 | 0.0099 | Nearly saturated |
| -2 | -0.9640 | 0.0707 | Moderate gradient |
| -1 | -0.7616 | 0.4200 | Good gradient flow |
| 0 | 0.0000 | 1.0000 | Maximum gradient (origin) |
| 1 | 0.7616 | 0.4200 | Good gradient flow |
| 2 | 0.9640 | 0.0707 | Moderate gradient |
| 3 | 0.9951 | 0.0099 | Nearly saturated |
| +∞ | → 1 | → 0 | Saturated positive |
Understanding when to choose sigmoid versus tanh requires a systematic comparison across multiple dimensions. While they are mathematically related, their different output ranges lead to meaningfully different behavior in neural networks.
The gradient flow during backpropagation is the most critical difference for training deep networks.
Both sigmoid and tanh suffer from the vanishing gradient problem, though to different degrees. This phenomenon is the primary reason both were largely replaced by ReLU for hidden layers.
Mathematical Analysis:
Consider a network with L layers using sigmoid activations. During backpropagation, the gradient reaching the first layer's weights contains a product of L sigmoid derivatives (ignoring the weight matrices, which can further shrink or amplify it):
$$\frac{\partial \mathcal{L}}{\partial W^{(1)}} \propto \prod_{l=1}^{L} \sigma'(z^{(l)})$$
Since σ'(x) ≤ 0.25 always:
$$\left|\frac{\partial \mathcal{L}}{\partial W^{(1)}}\right| \leq 0.25^L$$
- For $L = 10$ layers: gradient $\leq 0.25^{10} \approx 9.5 \times 10^{-7}$
- For $L = 20$ layers: gradient $\leq 0.25^{20} \approx 9.1 \times 10^{-13}$
These gradients are so small that early layers receive essentially zero learning signal—they become untrainable.
Tanh's Partial Mitigation:
Tanh's maximum gradient of 1.0 means that at the origin, gradients can flow without attenuation. However, for any inputs away from zero, tanh'(x) < 1, and deep networks still suffer gradient decay. The improvement is real but insufficient for very deep networks.
When pre-activations (z = Wx + b) become large in magnitude, both sigmoid and tanh 'saturate'—their derivatives become vanishingly small. Worse, once a neuron is saturated, the small gradient means it receives almost no learning signal to escape saturation. This can create 'dead' or 'stuck' neurons that contribute nothing to learning.
```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

def tanh_derivative(x):
    return 1 - np.tanh(x)**2

def analyze_gradient_decay(activation_derivative, num_layers, input_values):
    """
    Simulate gradient flow through multiple layers.
    Returns the cumulative gradient product.
    """
    gradient_product = np.ones_like(input_values)
    for _ in range(num_layers):
        gradient_product *= activation_derivative(input_values)
    return gradient_product

# Analyze gradient decay for different depths
depths = [1, 5, 10, 15, 20]
x = np.linspace(-3, 3, 1000)

print("Gradient product at x=0 (best case) for L layers:")
print("-" * 50)
for L in depths:
    sigmoid_grad = 0.25 ** L
    tanh_grad = 1.0 ** L  # At x=0, tanh'(0) = 1
    print(f"L={L:2d}: Sigmoid = {sigmoid_grad:.2e}, Tanh = {tanh_grad:.2e}")

print("\nGradient product at x=2 (typical case) for L layers:")
print("-" * 50)
sigmoid_local = sigmoid_derivative(np.array([2.0]))[0]
tanh_local = tanh_derivative(np.array([2.0]))[0]
for L in depths:
    sigmoid_grad = sigmoid_local ** L
    tanh_grad = tanh_local ** L
    print(f"L={L:2d}: Sigmoid = {sigmoid_grad:.2e}, Tanh = {tanh_grad:.2e}")

# The practical reality: pre-activations are rarely exactly 0
print("\nPractical gradient decay (random pre-activations in [-2, 2]):")
print("-" * 50)
np.random.seed(42)
num_simulations = 1000

for L in depths:
    sigmoid_products = []
    tanh_products = []
    for _ in range(num_simulations):
        pre_activations = np.random.uniform(-2, 2, L)
        sigmoid_products.append(np.prod(sigmoid_derivative(pre_activations)))
        tanh_products.append(np.prod(tanh_derivative(pre_activations)))
    print(f"L={L:2d}: Sigmoid mean = {np.mean(sigmoid_products):.2e}, "
          f"Tanh mean = {np.mean(tanh_products):.2e}")
```

The sigmoid and tanh functions were originally motivated by analogies to biological neurons. While these analogies are imperfect—and modern deep learning has largely moved away from biological plausibility as a design criterion—understanding the historical motivation provides valuable context.
Real biological neurons exhibit several key behaviors:

- Thresholding: a neuron fires only when its combined input exceeds a threshold
- Saturation: the firing rate cannot exceed a physiological maximum
- Graded response: between threshold and saturation, the firing rate varies smoothly with input strength
The sigmoid function was proposed as a smooth, differentiable approximation to the step function used to model this threshold behavior.
The sigmoid output can be interpreted as the probability of firing or equivalently the normalized firing rate:
$$\text{firing rate} = r_{\max} \cdot \sigma(W \cdot x + b)$$
where r_max is the maximum firing rate. This interpretation links a unit's activation to the rate-coding view of biological neurons and supports reading sigmoid outputs as normalized firing rates or probabilities.
Contemporary deep learning research has largely abandoned biological plausibility as a design goal. ReLU, which is a poor model of biological firing (real neurons saturate at a maximum rate, while ReLU is unbounded above), vastly outperforms biologically inspired functions in most settings. The lesson: inspiration is valuable, but empirical performance trumps theoretical elegance.
The tanh function can be interpreted as modeling excitatory and inhibitory inputs symmetrically:

- Positive outputs correspond to net excitation
- Negative outputs correspond to net inhibition
- An output of zero corresponds to a balanced, resting state
This symmetric model better captures the push-pull nature of neural circuits, where inhibition is as important as excitation.
Beyond biological analogy, sigmoid activations can be motivated from information theory:
- Maximum entropy: a sigmoid of a linear function parameterizes the maximum-entropy distribution over binary outcomes subject to a mean constraint
- Bits of information: the output σ(x) can be read as the probability that the feature detected by this neuron is present
- Logistic regression connection: a neuron with sigmoid activation implements logistic regression, the canonical discriminative model for binary classification with a linear decision boundary (see the sketch below)
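As a rough illustration of that last point (a toy sketch with made-up data, not a reference implementation): training a single sigmoid unit with binary cross-entropy by gradient descent is exactly logistic regression.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy, linearly separable binary classification data (hypothetical)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(2)
b = 0.0
lr = 0.1

# Gradient descent on binary cross-entropy: a single sigmoid neuron = logistic regression
for _ in range(500):
    p = sigmoid(X @ w + b)   # forward pass of one sigmoid neuron
    grad_z = p - y           # dL/dz for sigmoid + BCE (standard identity)
    w -= lr * (X.T @ grad_z) / len(y)
    b -= lr * grad_z.mean()

accuracy = ((sigmoid(X @ w + b) > 0.5) == y).mean()
print(f"training accuracy: {accuracy:.2f}")  # typically well above 0.9 on this separable toy data
```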
Implementing sigmoid and tanh correctly requires attention to numerical stability. Naive implementations can produce NaN or Inf values, undermining model training.
The Problem: The naive formula

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

requires computing $e^{-x}$, which overflows for large negative inputs. Example: σ(-1000) would compute e^1000, which overflows float64.
The Solution: Use different formulations for positive and negative inputs:
$$\sigma(x) = \begin{cases} \frac{1}{1 + e^{-x}} & \text{if } x \geq 0 \\ \frac{e^x}{1 + e^x} & \text{if } x < 0 \end{cases}$$
Both forms are mathematically identical, but numerically stable in their respective domains.
```python
import numpy as np

def sigmoid_naive(x):
    """
    Naive sigmoid - will overflow for large negative x.
    DO NOT USE IN PRODUCTION.
    """
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_stable(x):
    """
    Numerically stable sigmoid using conditional formulation.
    """
    # Vectorized implementation
    result = np.empty_like(x, dtype=np.float64)
    positive = x >= 0
    negative = ~positive

    # For x >= 0: 1 / (1 + exp(-x))
    result[positive] = 1.0 / (1.0 + np.exp(-x[positive]))

    # For x < 0: exp(x) / (1 + exp(x))
    exp_x = np.exp(x[negative])
    result[negative] = exp_x / (1.0 + exp_x)

    return result

def sigmoid_log_stable(x):
    """
    Even more stable version using log-space computation.
    Useful when you need log(σ(x)) or log(1-σ(x)).
    """
    return -np.logaddexp(0, -x)  # Returns log(σ(x))

def log_sigmoid_components(x):
    """
    Return log(σ(x)) and log(1-σ(x)) stably.
    Essential for cross-entropy loss computation.
    """
    # log(σ(x))     = -log(1 + exp(-x)) = -softplus(-x)
    # log(1 - σ(x)) = -log(1 + exp(x))  = -softplus(x)
    log_sigmoid = -np.logaddexp(0, -x)
    log_one_minus_sigmoid = -np.logaddexp(0, x)
    return log_sigmoid, log_one_minus_sigmoid

# Demonstrate the problem
print("Naive vs Stable implementation:")
print("-" * 50)
test_values = np.array([0.0, 10.0, 100.0, 1000.0, -10.0, -100.0, -1000.0])

for x in test_values:
    naive = sigmoid_naive(x) if x > -500 else "overflow"
    stable = sigmoid_stable(np.array([x]))[0]
    print(f"x = {x:7.1f}: naive = {naive!s:12}, stable = {stable:.10f}")

# Log-space computation for extreme values
print("\nLog-space computation (for loss functions):")
print("-" * 50)
log_sig, log_one_minus_sig = log_sigmoid_components(test_values)
for x, ls, loms in zip(test_values, log_sig, log_one_minus_sig):
    print(f"x = {x:7.1f}: log(σ(x)) = {ls:12.6f}, log(1-σ(x)) = {loms:12.6f}")
```

Computational efficiency matters when processing billions of activations. Here's how the functions compare:

| Operation | Sigmoid | Tanh | ReLU |
|---|---|---|---|
| Forward pass | 1 exp + 1 add + 1 div | 2 exp + 1 add + 1 sub + 1 div | 1 max |
| Backward pass | 2 mul + 1 sub | 1 mul + 1 sub + 1 sq | 1 comparison |
| Memory (caching) | Store σ(x) | Store tanh(x) | Store mask |
| Relative speed | 1× (baseline) | ~1.1× | ~10× faster |
| Numerical issues | Overflow for large negative x (naive form) | Overflow for large \|x\| (naive form) | None |
Modern deep learning frameworks (PyTorch, TensorFlow, JAX) implement sigmoid and tanh with hardware-optimized fused kernels. The raw operation count understates the performance—but the ~10× speed advantage of ReLU persists due to the fundamental simplicity of the max() operation.
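If you want to see the gap on your own machine, a rough NumPy timing sketch like the one below works; the absolute numbers and the exact ratio depend heavily on hardware, array size, and library version, so treat it as indicative only.

```python
import timeit
import numpy as np

x = np.random.randn(10_000_000).astype(np.float32)

def bench(fn, repeats=5):
    # Best of several runs, one pass over the array each time
    return min(timeit.repeat(lambda: fn(x), number=1, repeat=repeats))

sigmoid_t = bench(lambda a: 1.0 / (1.0 + np.exp(-a)))
tanh_t = bench(np.tanh)
relu_t = bench(lambda a: np.maximum(a, 0.0))

print(f"sigmoid: {sigmoid_t:.4f}s  tanh: {tanh_t:.4f}s  relu: {relu_t:.4f}s")
# Expect ReLU to be substantially faster; the exact ratio varies by machine.
```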
Despite the dominance of ReLU in hidden layers, sigmoid and tanh remain the correct choice for specific architectural components. Understanding these use cases is essential for proper network design.
Binary Classification: Sigmoid is the canonical output activation for binary classification:
$$P(y=1|x) = \sigma(W \cdot h + b)$$
This provides a proper probability in [0, 1] that can be used with binary cross-entropy loss:
$$\mathcal{L} = -[y \log \sigma(z) + (1-y) \log(1 - \sigma(z))]$$
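In practice, this loss is usually computed directly from the logit z rather than from σ(z), which avoids log(0) when the sigmoid saturates. A minimal sketch of the idea, using the same log-sum-exp identities as in the numerical-stability section (the function name here is our own):

```python
import numpy as np

def bce_from_logits(z, y):
    """
    Binary cross-entropy computed directly from logits.
    Uses log(sigma(z)) = -log(1 + exp(-z)) and log(1 - sigma(z)) = -log(1 + exp(z)),
    both evaluated stably with np.logaddexp.
    """
    log_p = -np.logaddexp(0.0, -z)            # log σ(z)
    log_one_minus_p = -np.logaddexp(0.0, z)   # log(1 - σ(z))
    return -(y * log_p + (1 - y) * log_one_minus_p)

z = np.array([-50.0, 0.0, 50.0])
y = np.array([0.0, 1.0, 1.0])
print(bce_from_logits(z, y))  # finite values even for extreme logits
```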
Multi-Label Classification: When multiple labels can be simultaneously true, use sigmoid on each output independently:
$$P(y_i=1|x) = \sigma(z_i) \quad \text{for } i = 1, ..., K$$
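A quick sketch of the difference (arbitrary logits, purely illustrative): independent sigmoids score each label separately and need not sum to 1, whereas softmax couples the outputs into a single distribution over mutually exclusive classes.

```python
import numpy as np

logits = np.array([2.0, -1.0, 0.5])  # one raw score per label (hypothetical values)

# Multi-label: independent sigmoid per label
sigmoid_probs = 1.0 / (1.0 + np.exp(-logits))
print(sigmoid_probs, sigmoid_probs.sum())   # sum can be anything

# Multi-class: softmax couples the outputs into one distribution
exp_l = np.exp(logits - logits.max())
softmax_probs = exp_l / exp_l.sum()
print(softmax_probs, softmax_probs.sum())   # sums to 1
```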
Sigmoid is essential for the gates in LSTM and GRU architectures. The forget gate f_t, input gate i_t, and output gate o_t all use sigmoid because gates must produce values in [0, 1] that represent 'how much' information flows through. A value of 0 means 'block completely' and 1 means 'allow completely'. ReLU cannot provide this bounded gating behavior.
```python
import numpy as np

def lstm_cell(x, h_prev, c_prev, Wf, Wi, Wc, Wo, bf, bi, bc, bo):
    """
    LSTM cell implementation showing gate activations.
    Note: All gates use sigmoid (σ) for [0,1] gating.
    The candidate cell state uses tanh for [-1,1] values.
    """
    # Combine input and previous hidden state
    combined = np.concatenate([x, h_prev])

    # --- GATES USE SIGMOID ---
    # Forget gate: What to forget from previous cell state
    f = sigmoid(combined @ Wf + bf)   # σ: [0,1] - forget factor

    # Input gate: What new information to store
    i = sigmoid(combined @ Wi + bi)   # σ: [0,1] - input factor

    # Output gate: What to output from cell state
    o = sigmoid(combined @ Wo + bo)   # σ: [0,1] - output factor

    # --- CELL STATE USES TANH ---
    # Candidate cell state: New values to potentially add
    c_candidate = np.tanh(combined @ Wc + bc)  # tanh: [-1,1] - value

    # Update cell state: forget old + add new
    c = f * c_prev + i * c_candidate

    # Compute hidden state: tanh-squashed cell state, gated by output
    h = o * np.tanh(c)

    return h, c

def sigmoid(x):
    return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

# Why this works:
# - Sigmoid gates are multiplicative: value * gate
# - Gate = 0: blocks information completely
# - Gate = 1: allows information completely
# - Gate = 0.5: allows 50% of information
# - ReLU can't do this: values can exceed 1, and 0 is not a clean "off"
```

Additive Attention: Original attention (Bahdanau attention) uses tanh for score computation:
$$e_{ij} = v^T \cdot \tanh(W_h \cdot h_j + W_s \cdot s_i)$$
Tanh bounds the scores, preventing extreme values before softmax.
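A small sketch of additive attention with made-up dimensions and random weights (all names here are illustrative): because tanh outputs lie in (-1, 1), each score is bounded by the magnitude of v before the softmax.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 6 encoder states of size 8, decoder state of size 8, attention size 16
T, d_h, d_s, d_a = 6, 8, 8, 16
H = rng.normal(size=(T, d_h))   # encoder hidden states h_j
s = rng.normal(size=d_s)        # current decoder state s_i

W_h = rng.normal(size=(d_a, d_h)) * 0.1
W_s = rng.normal(size=(d_a, d_s)) * 0.1
v = rng.normal(size=d_a) * 0.1

# Additive (Bahdanau-style) scores: e_ij = v^T tanh(W_h h_j + W_s s_i)
scores = np.array([v @ np.tanh(W_h @ h_j + W_s @ s) for h_j in H])

# Softmax over the bounded scores, then form the context vector
weights = np.exp(scores - scores.max())
weights /= weights.sum()
context = weights @ H
print(weights.round(3), context.shape)
```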
Gated Attention: Sigmoid is used for attention gates in some architectures:
$$g = \sigma(W_g \cdot [h; c])$$
Any situation requiring bounded outputs suggests sigmoid or tanh:
Sigmoid [0, 1]:

- Probabilities and confidence scores
- Pixel intensities or other quantities normalized to [0, 1]
- Gate, mask, and mixing coefficients

Tanh [-1, 1]:

- Zero-centered embeddings and hidden states
- Normalized control signals or actions (see the bounded-regression sketch below)
- Image generator outputs when pixels are scaled to [-1, 1]
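For bounded regression, a common pattern is to squash the raw output with tanh and rescale it to the target range. A minimal sketch with hypothetical names and random weights:

```python
import numpy as np

# Hypothetical bounded-regression head: predict a steering angle in [-max_angle, max_angle]
max_angle = np.pi / 4

def steering_head(features, w, b):
    z = features @ w + b
    return max_angle * np.tanh(z)   # tanh bounds the raw output to (-1, 1), then rescale

rng = np.random.default_rng(1)
features = rng.normal(size=(3, 5))
w = rng.normal(size=5) * 0.1
b = 0.0
print(steering_head(features, w, b))  # every prediction lies inside (-max_angle, max_angle)
```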
| Use Case | Recommended | Reason |
|---|---|---|
| Hidden layers (deep networks) | ReLU/variants | No vanishing gradients |
| Binary classification output | Sigmoid | Produces valid probability |
| Multi-class output | Softmax | Produces probability distribution |
| Multi-label output | Sigmoid (per label) | Independent probabilities |
| LSTM/GRU gates | Sigmoid | Bounded [0,1] gating |
| LSTM/GRU cell state | Tanh | Bounded [-1,1] values |
| Bounded regression | Tanh or Sigmoid | Match output range to target |
| Embeddings | Tanh | Zero-centered, bounded |
| Generative models (latent) | Tanh | Symmetric, bounded range |
We have comprehensively analyzed the sigmoid and hyperbolic tangent activation functions—understanding their mathematical foundations, gradient behavior, biological motivations, computational properties, and modern applications.
Looking Ahead:
The limitations of sigmoid and tanh—particularly the vanishing gradient problem—motivated the search for better activation functions. In the next page, we explore ReLU and its variants, the functions that revolutionized deep learning by enabling training of much deeper networks.
You now have complete mastery of sigmoid and tanh activation functions—their mathematics, behavior, limitations, and proper modern usage. This foundation is essential for understanding why ReLU variants dominate and for correctly applying sigmoid/tanh in the specific contexts where they remain optimal.