In the transition from regression to classification, we face a fundamental challenge: how do we convert unbounded real-valued outputs into probabilities?
Linear regression produces predictions that span all real numbers—from negative infinity to positive infinity. Yet probabilities must live in the interval [0, 1]. We need a mathematical bridge between these two worlds, and the sigmoid function (also called the logistic function) provides exactly this transformation.
This isn't merely a convenient mathematical trick. The sigmoid function emerges naturally from probabilistic reasoning about binary outcomes, making it the theoretically justified choice for classification—not just an arbitrary squashing function. Understanding why requires diving deep into its origins, properties, and connections to probability theory.
By the end of this page, you will understand: (1) the mathematical definition and key properties of the sigmoid function, (2) why it emerges naturally from maximum entropy principles and exponential family distributions, (3) how it relates to odds and log-odds, (4) its computational properties and numerical considerations, and (5) its limitations and alternatives.
The logistic function has a fascinating history that predates machine learning by over a century. Understanding this history illuminates why the function is so natural for modeling growth processes and binary outcomes.
Origins in Population Dynamics (1838)
The Belgian mathematician Pierre François Verhulst introduced the logistic function in 1838 to model population growth. Unlike the exponential growth model proposed by Malthus (which predicts unbounded growth), Verhulst recognized that populations face resource constraints. His logistic growth model describes how populations grow quickly when small, slow as they approach carrying capacity, and eventually stabilize:
$$P(t) = \frac{K}{1 + e^{-r(t-t_0)}}$$
where $K$ is the carrying capacity, $r$ is the growth rate, and $t_0$ is the time of maximum growth rate.
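As a quick illustration (with hypothetical parameters, not data from Verhulst), the model can be evaluated at a few time points to see the characteristic slow-fast-slow growth:

```python
import numpy as np

# Verhulst's logistic growth P(t) = K / (1 + exp(-r (t - t0))),
# with hypothetical parameters chosen only for illustration
K, r, t0 = 1000.0, 0.5, 10.0  # carrying capacity, growth rate, time of fastest growth

for t in (0, 5, 10, 15, 20, 30):
    P = K / (1 + np.exp(-r * (t - t0)))
    print(f"t = {t:>2}: P(t) = {P:7.1f}")
```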
Transition to Statistics (1944)
Joseph Berkson coined the term logit in 1944, introducing the logistic function to biostatistics. He recognized that the logistic function could model the probability of binary outcomes—like survival vs. death, disease vs. health—as a function of continuous predictors. This was revolutionary because it provided a principled way to perform regression on binary outcomes.
The name 'logistic' comes from the Greek 'logistikos' (skilled in calculating). Verhulst chose this name somewhat arbitrarily, and it has no deep mathematical significance. The term 'sigmoid' (σ-shaped, from the Greek letter sigma) describes the function's S-shaped curve and is often used interchangeably with 'logistic' in machine learning contexts.
Why Not Simply Use a Linear Model for Classification?
Before examining the sigmoid function's definition, let's understand why we can't just use linear regression for classification. Consider predicting whether an email is spam (1) or not spam (0) based on word counts.
If we fit a linear regression model $\hat{y} = w^T x + b$, several problems arise:
Predictions outside [0, 1]: Linear models can predict values like -0.3 or 1.7, which are meaningless as probabilities.
Sensitivity to outliers: Extreme feature values can wildly swing predictions, making the decision boundary unstable.
No probabilistic interpretation: Linear regression minimizes squared error, but this doesn't correspond to any sensible probabilistic model for binary outcomes.
Homoscedasticity violation: Binary outcomes have variance $p(1-p)$, which depends on the prediction itself, violating linear regression assumptions.
The sigmoid function addresses all these issues by providing a principled mapping from real-valued linear predictors to valid probabilities.
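To see the first failure mode concretely, here is a minimal sketch (synthetic one-feature data, not the spam example above) that fits ordinary least squares to 0/1 labels and evaluates the fitted line at extreme feature values; the resulting "probabilities" drift outside [0, 1]:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary labels driven by a single feature plus noise
x = rng.normal(scale=3.0, size=(100, 1))
y = (x[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(float)

# Ordinary least squares fit of y on [x, 1]
X = np.hstack([x, np.ones_like(x)])
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Evaluate the fitted line at extreme feature values: outputs escape [0, 1]
for xi in (-6.0, -3.0, 0.0, 3.0, 6.0):
    pred = w[0] * xi + w[1]
    print(f"x = {xi:+.1f}: linear 'probability' = {pred:+.3f}")
```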
The sigmoid function (or standard logistic function) is defined as:
$$\sigma(z) = \frac{1}{1 + e^{-z}} = \frac{e^z}{1 + e^z}$$
where $z \in \mathbb{R}$ is any real number, and $\sigma(z) \in (0, 1)$ is the output.
These two forms are mathematically equivalent—the second form is obtained by multiplying numerator and denominator by $e^z$. Each form has computational advantages in different contexts, as we'll see.
Deriving from a Simple Constraint
We can derive the sigmoid function axiomatically by asking: what is the simplest function $f(z)$ that maps every real number into $(0, 1)$, increases smoothly and monotonically, and treats the two classes symmetrically?
Starting from the exponential family of distributions and requiring maximum entropy subject to these constraints, we arrive uniquely at the sigmoid function. This isn't coincidence—it's a deep mathematical necessity.
```python
import numpy as np


def sigmoid_basic(z):
    """
    Basic sigmoid implementation.
    Simple but numerically unstable for large negative z.
    """
    return 1 / (1 + np.exp(-z))


def sigmoid_stable(z):
    """
    Numerically stable sigmoid implementation.
    Uses different formulas for positive and negative z to avoid overflow.
    """
    # For z >= 0: use the standard form
    # For z < 0: use the alternative form to avoid computing exp of large positive numbers
    positive_mask = z >= 0
    negative_mask = ~positive_mask

    result = np.zeros_like(z, dtype=np.float64)

    # For z >= 0: σ(z) = 1 / (1 + exp(-z))
    result[positive_mask] = 1 / (1 + np.exp(-z[positive_mask]))

    # For z < 0: σ(z) = exp(z) / (1 + exp(z))
    exp_z = np.exp(z[negative_mask])
    result[negative_mask] = exp_z / (1 + exp_z)

    return result


def sigmoid_scipy(z):
    """
    Using scipy's optimized implementation via the logistic CDF.
    Most efficient for production use.
    """
    from scipy.special import expit
    return expit(z)


# Demonstration of numerical stability
z_extreme = np.array([-1000, -100, -10, 0, 10, 100, 1000])
print("z values:", z_extreme)
print("Basic (may overflow):", sigmoid_basic(z_extreme.astype(float)))
print("Stable:", sigmoid_stable(z_extreme.astype(float)))
print("Scipy expit:", sigmoid_scipy(z_extreme.astype(float)))
```

The basic sigmoid formula 1/(1 + exp(-z)) can cause numerical overflow when z is a large negative number (e.g., z = -1000). Computing exp(1000) produces infinity in floating-point arithmetic. Always use the stable implementation that switches formulas based on the sign of z.
The sigmoid function possesses several remarkable properties that make it ideal for probabilistic classification. Understanding these properties deeply is essential for understanding logistic regression.
Property 1: Output Range (0, 1)
$$\lim_{z \to -\infty} \sigma(z) = 0, \quad \lim_{z \to +\infty} \sigma(z) = 1$$
The sigmoid function maps any real number to a value strictly between 0 and 1. This is exactly what we need to interpret outputs as probabilities. Note that the limits 0 and 1 are approached but never reached—we can never be completely certain of either class based on a logistic model.
Property 2: Symmetry
$$\sigma(-z) = 1 - \sigma(z)$$
This beautiful symmetry means that if the probability of class 1 given input $z$ is $\sigma(z)$, then the probability of class 0 is simply $\sigma(-z)$. The function is symmetric around the point $(0, 0.5)$.
Property 3: Derivative (Self-Referential Form)
$$\frac{d\sigma}{dz} = \sigma(z)(1 - \sigma(z))$$
This is perhaps the most elegant property: the derivative of the sigmoid is expressible entirely in terms of the sigmoid itself. Once σ(z) has been computed, its gradient costs only one extra multiplication. The table below tabulates both functions at key points, and a numerical check follows it.
| z | σ(z) | σ'(z) = σ(z)(1-σ(z)) | Interpretation |
|---|---|---|---|
| -∞ | → 0 | → 0 | Strong negative evidence, near-certain class 0 |
| -4.6 | ≈ 0.01 | ≈ 0.01 | 99% confidence in class 0 |
| -2.2 | ≈ 0.10 | ≈ 0.09 | 90% confidence in class 0 |
| 0 | 0.50 | 0.25 | Maximum uncertainty, decision boundary |
| 2.2 | ≈ 0.90 | ≈ 0.09 | 90% confidence in class 1 |
| 4.6 | ≈ 0.99 | ≈ 0.01 | 99% confidence in class 1 |
| +∞ | → 1 | → 0 | Strong positive evidence, near-certain class 1 |
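Properties 1-3 and the table entries can be checked numerically in a few lines; this is a quick sanity check rather than a proof:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

z = np.array([-4.6, -2.2, 0.0, 2.2, 4.6])

# Property 1: outputs stay strictly inside (0, 1)
assert np.all((sigmoid(z) > 0) & (sigmoid(z) < 1))

# Property 2: symmetry sigma(-z) = 1 - sigma(z)
assert np.allclose(sigmoid(-z), 1 - sigmoid(z))

# Property 3: the self-referential derivative matches a finite-difference estimate
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
assert np.allclose(numeric, sigmoid_prime(z), atol=1e-8)

for zi in z:
    print(f"z = {zi:+.1f}: σ(z) = {sigmoid(zi):.3f}, σ'(z) = {sigmoid_prime(zi):.3f}")
```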
Property 4: Log-Odds Representation
The inverse of the sigmoid function is the logit function:
$$\text{logit}(p) = \sigma^{-1}(p) = \log\left(\frac{p}{1-p}\right)$$
The quantity $\frac{p}{1-p}$ is called the odds of success. If $p = 0.75$, the odds are $0.75/0.25 = 3$, meaning success is 3 times more likely than failure.
The logit function converts probabilities to log-odds (also called the logit scale). This reveals that logistic regression is simply linear regression on the log-odds scale:
$$\log\left(\frac{p}{1-p}\right) = w^T x + b$$
This connection is profound and will be explored in depth in the next page.
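As a concrete check of the odds arithmetic, the sketch below converts a few probabilities to odds and log-odds and confirms that the logit and the sigmoid are inverses (using scipy's expit and logit helpers):

```python
import numpy as np
from scipy.special import expit, logit  # expit is the sigmoid, logit its inverse

for p in (0.10, 0.50, 0.75, 0.90):
    odds = p / (1 - p)
    log_odds = np.log(odds)
    # logit(p) equals log(p / (1 - p)); applying the sigmoid recovers p
    assert np.isclose(logit(p), log_odds)
    assert np.isclose(expit(log_odds), p)
    print(f"p = {p:.2f}: odds = {odds:5.2f}, log-odds = {log_odds:+.3f}")
```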
Property 5: Belongs to the Exponential Family
The sigmoid function is the canonical link function for Bernoulli-distributed outcomes. This means logistic regression can be understood as a special case of Generalized Linear Models (GLMs), providing a unified theoretical framework. This will be covered extensively in a later module.
The self-referential derivative σ'(z) = σ(z)(1 - σ(z)) is computationally crucial. During backpropagation in neural networks or gradient descent in logistic regression, we need derivatives constantly. Since we already compute σ(z) for the forward pass, getting the derivative is nearly free—just multiply σ(z) by (1 - σ(z)).
Visualizing the sigmoid function reveals intuitions that mathematical formulas alone cannot convey. Let's examine its shape, its derivative, and how these translate to classification behavior.
The S-Curve Shape
The sigmoid produces the characteristic S-curve (hence 'sigmoid' from the Greek sigma). This shape reflects three distinct behavioral regions:
Saturation Region (z ≪ 0): Output close to 0, near-flat gradient. Strong evidence for class 0.
Linear Region (z ≈ 0): Approximately linear with slope ~0.25. Maximum uncertainty, where small input changes significantly affect output.
Saturation Region (z ≫ 0): Output close to 1, near-flat gradient. Strong evidence for class 1.
This behavior is exactly what we want: confidence should saturate. Once we have overwhelming evidence for a class, additional evidence shouldn't dramatically change our probability estimate.
Comparison with Step Function
The sigmoid can be viewed as a 'soft' version of the unit step function:
$$\text{step}(z) = \begin{cases} 0 & \text{if } z < 0 \\ 1 & \text{if } z \geq 0 \end{cases}$$
The step function gives a hard decision but has zero gradient everywhere (except at zero, where it's undefined). This makes it completely unusable for gradient-based optimization. The sigmoid provides a smooth, differentiable approximation that retains the 'decision' character while enabling optimization.
```python
import numpy as np
import matplotlib.pyplot as plt


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)


# Create visualization
z = np.linspace(-8, 8, 1000)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Plot 1: Sigmoid function
ax1 = axes[0]
ax1.plot(z, sigmoid(z), 'b-', linewidth=2, label='σ(z)')
ax1.axhline(y=0.5, color='gray', linestyle='--', alpha=0.5)
ax1.axvline(x=0, color='gray', linestyle='--', alpha=0.5)
ax1.fill_between(z, 0, sigmoid(z), alpha=0.2)
ax1.set_xlabel('z')
ax1.set_ylabel('σ(z)')
ax1.set_title('Sigmoid Function')
ax1.legend()
ax1.grid(True, alpha=0.3)
ax1.set_ylim(-0.1, 1.1)

# Plot 2: Sigmoid derivative
ax2 = axes[1]
ax2.plot(z, sigmoid_derivative(z), 'r-', linewidth=2, label="σ'(z)")
ax2.fill_between(z, 0, sigmoid_derivative(z), alpha=0.2, color='red')
ax2.axvline(x=0, color='gray', linestyle='--', alpha=0.5)
ax2.set_xlabel('z')
ax2.set_ylabel("σ'(z)")
ax2.set_title('Sigmoid Derivative: σ(z)(1-σ(z))')
ax2.legend()
ax2.grid(True, alpha=0.3)

# Plot 3: Sigmoid vs Step function
ax3 = axes[2]
ax3.plot(z, sigmoid(z), 'b-', linewidth=2, label='Sigmoid')
ax3.plot(z, np.where(z >= 0, 1, 0), 'g--', linewidth=2, label='Step')
ax3.axhline(y=0.5, color='gray', linestyle='--', alpha=0.5)
ax3.set_xlabel('z')
ax3.set_ylabel('Output')
ax3.set_title('Sigmoid as Smooth Step Function')
ax3.legend()
ax3.grid(True, alpha=0.3)
ax3.set_ylim(-0.1, 1.1)

plt.tight_layout()
plt.savefig('sigmoid_visualization.png', dpi=150)
plt.show()

# Annotate key regions
print("Sigmoid Behavior Regions:")
print("-" * 50)
print(f"z = -5: σ(z) = {sigmoid(-5):.6f} (Saturated low)")
print(f"z = -1: σ(z) = {sigmoid(-1):.6f} (Transition)")
print(f"z = 0: σ(z) = {sigmoid(0):.6f} (Decision boundary)")
print(f"z = +1: σ(z) = {sigmoid(1):.6f} (Transition)")
print(f"z = +5: σ(z) = {sigmoid(5):.6f} (Saturated high)")
```

Think of the sigmoid as a 'soft' or 'probabilistic' threshold. Instead of a hard cutoff at z = 0 that says 'definitely class 1' or 'definitely class 0', the sigmoid gradually transitions through uncertainty. Values very close to the threshold (z ≈ 0) get probabilities near 0.5, expressing maximum uncertainty.
The sigmoid function isn't an arbitrary choice—it emerges inevitably from foundational probability theory. Understanding these connections reveals logistic regression as a deeply principled model, not merely a convenient hack.
Derivation from Maximum Entropy Principle
The maximum entropy principle states that among all probability distributions consistent with known constraints, we should choose the one with maximum entropy (maximum uncertainty). For binary outcomes with a linear constraint on the expected value of features, the maximum entropy distribution is precisely the logistic model.
Given a binary outcome $y \in \{0, 1\}$ and the requirement that the model reproduce the empirical correlations between the features $x$ and the outcome (a linear constraint on expected feature values), the maximum entropy solution is:
$$P(y = 1 | x) = \frac{e^{w^T x}}{1 + e^{w^T x}} = \sigma(w^T x)$$
This derivation shows that logistic regression makes the minimum assumptions necessary to respect the observed correlations between features and outcomes.
Derivation from Bernoulli-Exponential Family
The Bernoulli distribution belongs to the exponential family:
$$P(y | p) = p^y (1-p)^{1-y} = \exp\left(y \log\frac{p}{1-p} + \log(1-p)\right)$$
The natural parameter of this distribution is $\eta = \log\frac{p}{1-p}$ (the log-odds). The canonical link function maps the linear predictor to the natural parameter:
$$\eta = w^T x + b$$
Solving for $p$:
$$p = \frac{e^\eta}{1 + e^\eta} = \sigma(w^T x + b)$$
The sigmoid is thus the unique link function that makes the Bernoulli distribution a proper exponential family GLM.
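The algebra can be mirrored numerically: start from a probability $p$, compute the natural parameter $\eta = \log\frac{p}{1-p}$, and apply the sigmoid to recover $p$ exactly. A minimal round-trip sketch:

```python
import numpy as np

def sigmoid(eta):
    return 1 / (1 + np.exp(-eta))

p = np.array([0.05, 0.25, 0.5, 0.75, 0.95])

# Natural parameter of the Bernoulli distribution: the log-odds
eta = np.log(p / (1 - p))

# Inverting the canonical link with the sigmoid recovers the original probability
p_recovered = sigmoid(eta)
assert np.allclose(p, p_recovered)

print("eta:", np.round(eta, 3))
print("recovered p:", p_recovered)
```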
The Binary Cross-Entropy Connection
When we use the sigmoid to produce probabilities, the natural loss function is the binary cross-entropy (also called log-loss):
$$\mathcal{L}(y, \hat{p}) = -[y \log(\hat{p}) + (1-y) \log(1-\hat{p})]$$
This loss is not arbitrary—it's the negative log-likelihood of the Bernoulli distribution. Minimizing cross-entropy is equivalent to maximum likelihood estimation. The sigmoid and cross-entropy are paired constructs: using one naturally implies the other.
This pairing has computational benefits too. The gradient of cross-entropy with sigmoid output simplifies beautifully:
$$\frac{\partial \mathcal{L}}{\partial z} = \hat{p} - y = \sigma(z) - y$$
This simple form (predicted probability minus true label) makes implementation clean and numerically stable.
The fact that the gradient is simply (σ(z) - y) is profound. It says that the update size is proportional to the error—predictions that are confident and wrong get large updates, while correct predictions get small or zero updates. This matches our intuition for learning.
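The simplification $\frac{\partial \mathcal{L}}{\partial z} = \sigma(z) - y$ can be verified with a finite-difference check on the composed loss; a small sketch:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def bce_loss(y, z):
    """Binary cross-entropy of the sigmoid output, as a function of the logit z."""
    p = sigmoid(z)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

h = 1e-6
for y in (0.0, 1.0):
    for z in (-3.0, -0.5, 0.0, 2.0):
        # Central finite difference of the loss with respect to the logit z
        numeric = (bce_loss(y, z + h) - bce_loss(y, z - h)) / (2 * h)
        analytic = sigmoid(z) - y
        assert np.isclose(numeric, analytic, atol=1e-6)
        print(f"y = {y:.0f}, z = {z:+.1f}: dL/dz = {analytic:+.4f}")
```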
While the sigmoid function has elegant mathematical properties, it suffers from a critical practical issue: vanishing gradients. This problem is particularly severe in deep neural networks but also affects logistic regression optimization.
The Problem Explained
Recall that the sigmoid derivative is:
$$\sigma'(z) = \sigma(z)(1 - \sigma(z))$$
This derivative has maximum value 0.25 (at z = 0) and approaches 0 exponentially as |z| increases. When the input $z$ is far from zero in either direction, $\sigma(z)$ saturates near 0 or 1, so the product $\sigma(z)(1 - \sigma(z))$ collapses toward zero. These tiny gradients cause slow learning when the model is confident (whether correctly or incorrectly).
Implications for Optimization
Slow Convergence from Poor Initialization: If initial weights produce saturated sigmoids (outputs near 0 or 1), gradients are tiny and learning crawls.
Difficulty Updating Early Layers: In multi-layer networks, gradients are multiplied together. If each sigmoid contributes a factor < 0.25, gradients shrink exponentially with depth.
Confident Mistakes Are Slow to Fix: If the model is 99% confident and wrong, the gradient is small despite the large error. This seems counterintuitive—shouldn't we update more when we're very wrong?
```python
import numpy as np


def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))


def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)


# Demonstrate gradient magnitudes at different z values
z_values = [0, 1, 2, 3, 4, 5, 10, 15, 20]

grad_label = "σ'(z)"  # defined outside the f-string to avoid escaping a quote inside it
print("Sigmoid Gradient Magnitudes (Vanishing Gradient Problem)")
print("-" * 60)
print(f"{'z':>5} | {'σ(z)':>10} | {grad_label:>15} | {'Relative to max':>15}")
print("-" * 60)

max_grad = sigmoid_grad(0)  # 0.25
for z in z_values:
    s = sigmoid(z)
    grad = sigmoid_grad(z)
    relative = grad / max_grad
    print(f"{z:>5} | {s:>10.6f} | {grad:>15.10f} | {relative:>13.2%}")

# Show impact in deep networks (chained gradients)
print("\n" + "=" * 60)
print("Chained Gradients in Deep Networks")
print("=" * 60)

for depth in [2, 5, 10, 20]:
    # Even at z=2 (mild saturation), chained gradients vanish
    grad_z2 = sigmoid_grad(2)  # ≈ 0.1
    chained = grad_z2 ** depth
    print(f"Depth {depth:>2}: (σ'(2))^{depth:<2} = {grad_z2:.4f}^{depth:<2} = {chained:.2e}")

# Contrast with more modern activations
print("\nComparison with ReLU (for context):")
print("ReLU gradient = 1 for z > 0, 0 for z <= 0")
print("No gradient shrinkage regardless of z magnitude!")
```

For logistic regression specifically, vanishing gradients are manageable with proper initialization and learning rate tuning. However, for deep networks, the sigmoid activation has been largely replaced by ReLU and its variants, which don't saturate for positive inputs. The sigmoid remains important as the final output layer for binary classification.
While the sigmoid function is the standard for binary classification, several related functions serve different purposes or offer different properties.
The Hyperbolic Tangent (tanh)
$$\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}} = 2\sigma(2z) - 1$$
The tanh function is a scaled and shifted sigmoid, mapping to $(-1, 1)$ instead of $(0, 1)$. It is zero-centered, which can improve gradient flow in neural networks; the identity $\tanh(z) = 2\sigma(2z) - 1$ above makes the relationship to the sigmoid explicit.
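That identity is easy to confirm numerically; a minimal sketch:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-5, 5, 11)
# tanh(z) = 2*sigma(2z) - 1 holds pointwise
assert np.allclose(np.tanh(z), 2 * sigmoid(2 * z) - 1)
print("tanh(z) == 2*sigmoid(2z) - 1 verified on", len(z), "points")
```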
The Softmax Function (Multi-class Generalization)
For K-class classification, the sigmoid generalizes to the softmax function:
$$\text{softmax}(z)_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$
When K = 2, softmax reduces to the sigmoid:
$$\text{softmax}([z, 0])_1 = \frac{e^z}{e^z + e^0} = \frac{e^z}{e^z + 1} = \sigma(z)$$
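This reduction can be checked directly: for any $z$, the first component of the softmax over the logits $[z, 0]$ equals $\sigma(z)$. A minimal sketch:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def softmax(v):
    e = np.exp(v - np.max(v))  # shift by the max for numerical stability
    return e / e.sum()

for z in (-4.0, -1.0, 0.0, 2.5, 7.0):
    # Two-class softmax over logits [z, 0]: its first component is exactly sigma(z)
    assert np.isclose(softmax(np.array([z, 0.0]))[0], sigmoid(z))
print("softmax([z, 0])[0] == sigmoid(z) verified")
```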
The Probit Function
An alternative to the logistic link is the probit function, which uses the cumulative distribution function (CDF) of the standard normal distribution:
$$p = \Phi(w^T x + b) \quad \Longleftrightarrow \quad \Phi^{-1}(p) = w^T x + b$$
Probit regression assumes errors follow a normal distribution, while logistic regression assumes a logistic distribution. In practice, results are nearly identical for most problems, but logistic regression is computationally simpler and has more interpretable coefficients (odds ratios).
| Function | Range | Formula | Use Case |
|---|---|---|---|
| Sigmoid (Logistic) | (0, 1) | 1 / (1 + e^(-z)) | Binary classification probability |
| Tanh | (-1, 1) | (e^z - e^(-z)) / (e^z + e^(-z)) | Neural network hidden layers |
| Softmax | (0, 1)^K, sums to 1 | exp(z_k) / Σexp(z_j) | Multi-class classification |
| Probit (Φ) | (0, 1) | Gaussian CDF | When normality is assumed |
| Hard sigmoid | [0, 1] | clip(0.2z + 0.5, 0, 1) | Fast approximation |
```python
import numpy as np
from scipy.stats import norm


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def tanh_func(z):
    return np.tanh(z)


def softmax(z):
    """Softmax for a vector of logits."""
    exp_z = np.exp(z - np.max(z))  # Subtract max for numerical stability
    return exp_z / exp_z.sum()


def probit(p):
    """Probit function: inverse of normal CDF."""
    return norm.ppf(p)


def probit_link(z):
    """Probit link: normal CDF."""
    return norm.cdf(z)


def hard_sigmoid(z):
    """Piecewise linear approximation to sigmoid."""
    return np.clip(0.2 * z + 0.5, 0, 1)


# Compare functions
z = np.linspace(-5, 5, 11)
print("Comparison of sigmoid-like functions:")
print("-" * 75)
print(f"{'z':>6} | {'sigmoid':>10} | {'tanh':>10} | {'probit':>10} | {'hard_sig':>10}")
print("-" * 75)
for zi in z:
    print(f"{zi:>6.1f} | {sigmoid(zi):>10.4f} | {tanh_func(zi):>10.4f} | "
          f"{probit_link(zi):>10.4f} | {hard_sigmoid(zi):>10.4f}")

# Show softmax for a 3-class example
logits = np.array([2.0, 1.0, 0.5])
probs = softmax(logits)
print("\nSoftmax example (3 classes):")
print(f"Logits: {logits}")
print(f"Probabilities: {probs}")
print(f"Sum: {probs.sum():.4f}")
```

Use sigmoid for binary classification output. Use softmax for multi-class classification output. Use tanh (or ReLU) for hidden layers in neural networks. The probit is rarely used in ML but appears in econometrics and some specialized applications where normal distribution assumptions are appropriate.
We've explored the sigmoid function from multiple angles—historical, mathematical, computational, and theoretical. Let's consolidate the essential takeaways:
What's Next:
With the sigmoid function understood, we're ready to explore its deeper implications. The next page examines the log-odds (logit) interpretation—how the logistic model makes a linear assumption not on probabilities directly, but on their logarithm of odds. This interpretation is crucial for understanding coefficient meanings and the geometric structure of logistic regression.
You now have a deep understanding of the sigmoid function—its definition, properties, theoretical justification, and practical considerations. This foundation is essential for everything that follows in logistic regression and beyond.