In the transition from regression to classification, we face a fundamental challenge: how do we convert unbounded real-valued outputs into probabilities?
Linear regression produces predictions that span all real numbers—from negative infinity to positive infinity. Yet probabilities must live in the interval [0, 1]. We need a mathematical bridge between these two worlds, and the sigmoid function (also called the logistic function) provides exactly this transformation.
This isn't merely a convenient mathematical trick. The sigmoid function emerges naturally from probabilistic reasoning about binary outcomes, making it the theoretically justified choice for classification—not just an arbitrary squashing function. Understanding why requires diving deep into its origins, properties, and connections to probability theory.
By the end of this page, you will understand: (1) the mathematical definition and key properties of the sigmoid function, (2) why it emerges naturally from maximum entropy principles and exponential family distributions, (3) how it relates to odds and log-odds, (4) its computational properties and numerical considerations, and (5) its limitations and alternatives.
The logistic function has a fascinating history that predates machine learning by over a century. Understanding this history illuminates why the function is so natural for modeling growth processes and binary outcomes.
Origins in Population Dynamics (1838)
The Belgian mathematician Pierre François Verhulst introduced the logistic function in 1838 to model population growth. Unlike the exponential growth model proposed by Malthus (which predicts unbounded growth), Verhulst recognized that populations face resource constraints. His logistic growth model describes how populations grow quickly when small, slow as they approach carrying capacity, and eventually stabilize:
$$P(t) = \frac{K}{1 + e^{-r(t-t_0)}}$$
where $K$ is the carrying capacity, $r$ is the growth rate, and $t_0$ is the time of maximum growth rate.
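As a quick illustration (with hypothetical parameters, not data from Verhulst), the model can be evaluated at a few time points to see the characteristic slow-fast-slow growth:

```python
import numpy as np

# Verhulst's logistic growth P(t) = K / (1 + exp(-r (t - t0))),
# with hypothetical parameters chosen only for illustration
K, r, t0 = 1000.0, 0.5, 10.0  # carrying capacity, growth rate, time of fastest growth

for t in (0, 5, 10, 15, 20, 30):
    P = K / (1 + np.exp(-r * (t - t0)))
    print(f"t = {t:>2}: P(t) = {P:7.1f}")
```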
Transition to Statistics (1944)
Joseph Berkson coined the term logit in 1944, introducing the logistic function to biostatistics. He recognized that the logistic function could model the probability of binary outcomes—like survival vs. death, disease vs. health—as a function of continuous predictors. This was revolutionary because it provided a principled way to perform regression on binary outcomes.
The name 'logistic' comes from the Greek 'logistikos' (skilled in calculating). Verhulst chose this name somewhat arbitrarily, and it has no deep mathematical significance. The term 'sigmoid' (σ-shaped, from the Greek letter sigma) describes the function's S-shaped curve and is often used interchangeably with 'logistic' in machine learning contexts.
Why Not Simply Use a Linear Model for Classification?
Before examining the sigmoid function's definition, let's understand why we can't just use linear regression for classification. Consider predicting whether an email is spam (1) or not spam (0) based on word counts.
If we fit a linear regression model $\hat{y} = w^T x + b$, several problems arise:
Predictions outside [0, 1]: Linear models can predict values like -0.3 or 1.7, which are meaningless as probabilities.
Sensitivity to outliers: Extreme feature values can wildly swing predictions, making the decision boundary unstable.
No probabilistic interpretation: Linear regression minimizes squared error, but this doesn't correspond to any sensible probabilistic model for binary outcomes.
Homoscedasticity violation: Binary outcomes have variance $p(1-p)$, which depends on the prediction itself, violating linear regression assumptions.
The sigmoid function addresses all these issues by providing a principled mapping from real-valued linear predictors to valid probabilities.
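To see the first failure mode concretely, here is a minimal sketch (synthetic one-feature data, not the spam example above) that fits ordinary least squares to 0/1 labels and evaluates the fitted line at extreme feature values; the resulting "probabilities" drift outside [0, 1]:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary labels driven by a single feature plus noise
x = rng.normal(scale=3.0, size=(100, 1))
y = (x[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(float)

# Ordinary least squares fit of y on [x, 1]
X = np.hstack([x, np.ones_like(x)])
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Evaluate the fitted line at extreme feature values: outputs escape [0, 1]
for xi in (-6.0, -3.0, 0.0, 3.0, 6.0):
    pred = w[0] * xi + w[1]
    print(f"x = {xi:+.1f}: linear 'probability' = {pred:+.3f}")
```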
The sigmoid function (or standard logistic function) is defined as:
$$\sigma(z) = \frac{1}{1 + e^{-z}} = \frac{e^z}{1 + e^z}$$
where $z \in \mathbb{R}$ is any real number, and $\sigma(z) \in (0, 1)$ is the output.
These two forms are mathematically equivalent—the second form is obtained by multiplying numerator and denominator by $e^z$. Each form has computational advantages in different contexts, as we'll see.
Deriving from a Simple Constraint
We can derive the sigmoid function axiomatically by asking: what is the simplest function $f(z)$ that maps every real number into $(0, 1)$, increases smoothly and monotonically, and treats the two classes symmetrically?
Starting from the exponential family of distributions and requiring maximum entropy subject to these constraints, we arrive uniquely at the sigmoid function. This isn't coincidence—it's a deep mathematical necessity.
```python
import numpy as np


def sigmoid_basic(z):
    """
    Basic sigmoid implementation.
    Simple but numerically unstable for large negative z.
    """
    return 1 / (1 + np.exp(-z))


def sigmoid_stable(z):
    """
    Numerically stable sigmoid implementation.
    Uses different formulas for positive and negative z to avoid overflow.
    """
    # For z >= 0: use the standard form
    # For z < 0: use the alternative form to avoid computing exp of large positive numbers
    positive_mask = z >= 0
    negative_mask = ~positive_mask

    result = np.zeros_like(z, dtype=np.float64)

    # For z >= 0: σ(z) = 1 / (1 + exp(-z))
    result[positive_mask] = 1 / (1 + np.exp(-z[positive_mask]))

    # For z < 0: σ(z) = exp(z) / (1 + exp(z))
    exp_z = np.exp(z[negative_mask])
    result[negative_mask] = exp_z / (1 + exp_z)

    return result


def sigmoid_scipy(z):
    """
    Using scipy's optimized implementation via the logistic CDF.
    Most efficient for production use.
    """
    from scipy.special import expit
    return expit(z)


# Demonstration of numerical stability
z_extreme = np.array([-1000, -100, -10, 0, 10, 100, 1000])
print("z values:", z_extreme)
print("Basic (may overflow):", sigmoid_basic(z_extreme.astype(float)))
print("Stable:", sigmoid_stable(z_extreme.astype(float)))
print("Scipy expit:", sigmoid_scipy(z_extreme.astype(float)))
```

The basic sigmoid formula 1/(1 + exp(-z)) can cause numerical overflow when z is a large negative number (e.g., z = -1000). Computing exp(1000) produces infinity in floating-point arithmetic. Always use the stable implementation that switches formulas based on the sign of z.
The sigmoid function possesses several remarkable properties that make it ideal for probabilistic classification. Understanding these properties deeply is essential for understanding logistic regression.
Property 1: Output Range (0, 1)
$$\lim_{z \to -\infty} \sigma(z) = 0, \quad \lim_{z \to +\infty} \sigma(z) = 1$$
The sigmoid function maps any real number to a value strictly between 0 and 1. This is exactly what we need to interpret outputs as probabilities. Note that the limits 0 and 1 are approached but never reached—we can never be completely certain of either class based on a logistic model.
Property 2: Symmetry
$$\sigma(-z) = 1 - \sigma(z)$$
This beautiful symmetry means that if the probability of class 1 given input $z$ is $\sigma(z)$, then the probability of class 0 is simply $\sigma(-z)$. The function is symmetric around the point $(0, 0.5)$.
Property 3: Derivative (Self-Referential Form)
$$\frac{d\sigma}{dz} = \sigma(z)(1 - \sigma(z))$$
This is perhaps the most elegant property: the derivative of the sigmoid is expressible entirely in terms of the sigmoid itself. Once σ(z) has been computed, its gradient costs only one extra multiplication. The table below tabulates both functions at key points, and a numerical check follows it.
| z | σ(z) | σ'(z) = σ(z)(1-σ(z)) | Interpretation |
|---|---|---|---|
| -∞ | → 0 | → 0 | Strong negative evidence, near-certain class 0 |
| -4.6 | ≈ 0.01 | ≈ 0.01 | 99% confidence in class 0 |
| -2.2 | ≈ 0.10 | ≈ 0.09 | 90% confidence in class 0 |
| 0 | 0.50 | 0.25 | Maximum uncertainty, decision boundary |
| 2.2 | ≈ 0.90 | ≈ 0.09 | 90% confidence in class 1 |
| 4.6 | ≈ 0.99 | ≈ 0.01 | 99% confidence in class 1 |
| +∞ | → 1 | → 0 | Strong positive evidence, near-certain class 1 |
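Properties 1-3 and the table entries can be checked numerically in a few lines; this is a quick sanity check rather than a proof:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

z = np.array([-4.6, -2.2, 0.0, 2.2, 4.6])

# Property 1: outputs stay strictly inside (0, 1)
assert np.all((sigmoid(z) > 0) & (sigmoid(z) < 1))

# Property 2: symmetry sigma(-z) = 1 - sigma(z)
assert np.allclose(sigmoid(-z), 1 - sigmoid(z))

# Property 3: the self-referential derivative matches a finite-difference estimate
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
assert np.allclose(numeric, sigmoid_prime(z), atol=1e-8)

for zi in z:
    print(f"z = {zi:+.1f}: σ(z) = {sigmoid(zi):.3f}, σ'(z) = {sigmoid_prime(zi):.3f}")
```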
Property 4: Log-Odds Representation
The inverse of the sigmoid function is the logit function:
$$\text{logit}(p) = \sigma^{-1}(p) = \log\left(\frac{p}{1-p}\right)$$
The quantity $\frac{p}{1-p}$ is called the odds of success. If $p = 0.75$, the odds are $0.75/0.25 = 3$, meaning success is 3 times more likely than failure.
The logit function converts probabilities to log-odds (also called the logit scale). This reveals that logistic regression is simply linear regression on the log-odds scale:
$$\log\left(\frac{p}{1-p}\right) = w^T x + b$$
This connection is profound and will be explored in depth in the next page.
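As a concrete check of the odds arithmetic, the sketch below converts a few probabilities to odds and log-odds and confirms that the logit and the sigmoid are inverses (using scipy's expit and logit helpers):

```python
import numpy as np
from scipy.special import expit, logit  # expit is the sigmoid, logit its inverse

for p in (0.10, 0.50, 0.75, 0.90):
    odds = p / (1 - p)
    log_odds = np.log(odds)
    # logit(p) equals log(p / (1 - p)); applying the sigmoid recovers p
    assert np.isclose(logit(p), log_odds)
    assert np.isclose(expit(log_odds), p)
    print(f"p = {p:.2f}: odds = {odds:5.2f}, log-odds = {log_odds:+.3f}")
```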
Property 5: Belongs to the Exponential Family
The sigmoid function is the canonical link function for Bernoulli-distributed outcomes. This means logistic regression can be understood as a special case of Generalized Linear Models (GLMs), providing a unified theoretical framework. This will be covered extensively in a later module.
The self-referential derivative σ'(z) = σ(z)(1 - σ(z)) is computationally crucial. During backpropagation in neural networks or gradient descent in logistic regression, we need derivatives constantly. Since we already compute σ(z) for the forward pass, getting the derivative is nearly free—just multiply σ(z) by (1 - σ(z)).
Visualizing the sigmoid function reveals intuitions that mathematical formulas alone cannot convey. Let's examine its shape, its derivative, and how these translate to classification behavior.
The S-Curve Shape
The sigmoid produces the characteristic S-curve (hence 'sigmoid' from the Greek sigma). This shape reflects three distinct behavioral regions:
Saturation Region (z ≪ 0): Output close to 0, near-flat gradient. Strong evidence for class 0.
Linear Region (z ≈ 0): Approximately linear with slope ~0.25. Maximum uncertainty, where small input changes significantly affect output.
Saturation Region (z ≫ 0): Output close to 1, near-flat gradient. Strong evidence for class 1.
This behavior is exactly what we want: confidence should saturate. Once we have overwhelming evidence for a class, additional evidence shouldn't dramatically change our probability estimate.
Comparison with Step Function
The sigmoid can be viewed as a 'soft' version of the unit step function:
$$\text{step}(z) = \begin{cases} 0 & \text{if } z < 0 \\ 1 & \text{if } z \geq 0 \end{cases}$$
The step function gives a hard decision but has zero gradient everywhere (except at zero, where it's undefined). This makes it completely unusable for gradient-based optimization. The sigmoid provides a smooth, differentiable approximation that retains the 'decision' character while enabling optimization.
```python
import numpy as np
import matplotlib.pyplot as plt


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)


# Create visualization
z = np.linspace(-8, 8, 1000)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Plot 1: Sigmoid function
ax1 = axes[0]
ax1.plot(z, sigmoid(z), 'b-', linewidth=2, label='σ(z)')
ax1.axhline(y=0.5, color='gray', linestyle='--', alpha=0.5)
ax1.axvline(x=0, color='gray', linestyle='--', alpha=0.5)
ax1.fill_between(z, 0, sigmoid(z), alpha=0.2)
ax1.set_xlabel('z')
ax1.set_ylabel('σ(z)')
ax1.set_title('Sigmoid Function')
ax1.legend()
ax1.grid(True, alpha=0.3)
ax1.set_ylim(-0.1, 1.1)

# Plot 2: Sigmoid derivative
ax2 = axes[1]
ax2.plot(z, sigmoid_derivative(z), 'r-', linewidth=2, label="σ'(z)")
ax2.fill_between(z, 0, sigmoid_derivative(z), alpha=0.2, color='red')
ax2.axvline(x=0, color='gray', linestyle='--', alpha=0.5)
ax2.set_xlabel('z')
ax2.set_ylabel("σ'(z)")
ax2.set_title('Sigmoid Derivative: σ(z)(1-σ(z))')
ax2.legend()
ax2.grid(True, alpha=0.3)

# Plot 3: Sigmoid vs Step function
ax3 = axes[2]
ax3.plot(z, sigmoid(z), 'b-', linewidth=2, label='Sigmoid')
ax3.plot(z, np.where(z >= 0, 1, 0), 'g--', linewidth=2, label='Step')
ax3.axhline(y=0.5, color='gray', linestyle='--', alpha=0.5)
ax3.set_xlabel('z')
ax3.set_ylabel('Output')
ax3.set_title('Sigmoid as Smooth Step Function')
ax3.legend()
ax3.grid(True, alpha=0.3)
ax3.set_ylim(-0.1, 1.1)

plt.tight_layout()
plt.savefig('sigmoid_visualization.png', dpi=150)
plt.show()

# Annotate key regions
print("Sigmoid Behavior Regions:")
print("-" * 50)
print(f"z = -5: σ(z) = {sigmoid(-5):.6f} (Saturated low)")
print(f"z = -1: σ(z) = {sigmoid(-1):.6f} (Transition)")
print(f"z = 0: σ(z) = {sigmoid(0):.6f} (Decision boundary)")
print(f"z = +1: σ(z) = {sigmoid(1):.6f} (Transition)")
print(f"z = +5: σ(z) = {sigmoid(5):.6f} (Saturated high)")
```

Think of the sigmoid as a 'soft' or 'probabilistic' threshold. Instead of a hard cutoff at z = 0 that says 'definitely class 1' or 'definitely class 0', the sigmoid gradually transitions through uncertainty. Values very close to the threshold (z ≈ 0) get probabilities near 0.5, expressing maximum uncertainty.
The sigmoid function isn't an arbitrary choice—it emerges inevitably from foundational probability theory. Understanding these connections reveals logistic regression as a deeply principled model, not merely a convenient hack.
Derivation from Maximum Entropy Principle
The maximum entropy principle states that among all probability distributions consistent with known constraints, we should choose the one with maximum entropy (maximum uncertainty). For binary outcomes with a linear constraint on the expected value of features, the maximum entropy distribution is precisely the logistic model.
Given a binary outcome $y \in \{0, 1\}$ and the requirement that the model reproduce the empirical correlations between the features $x$ and the outcome (a linear constraint on expected feature values), the maximum entropy solution is:
$$P(y = 1 | x) = \frac{e^{w^T x}}{1 + e^{w^T x}} = \sigma(w^T x)$$
This derivation shows that logistic regression makes the minimum assumptions necessary to respect the observed correlations between features and outcomes.
Derivation from Bernoulli-Exponential Family
The Bernoulli distribution belongs to the exponential family:
$$P(y | p) = p^y (1-p)^{1-y} = \exp\left(y \log\frac{p}{1-p} + \log(1-p)\right)$$
The natural parameter of this distribution is $\eta = \log\frac{p}{1-p}$ (the log-odds). The canonical link function maps the linear predictor to the natural parameter:
$$\eta = w^T x + b$$
Solving for $p$:
$$p = \frac{e^\eta}{1 + e^\eta} = \sigma(w^T x + b)$$
The sigmoid is thus the unique link function that makes the Bernoulli distribution a proper exponential family GLM.
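The algebra can be mirrored numerically: start from a probability $p$, compute the natural parameter $\eta = \log\frac{p}{1-p}$, and apply the sigmoid to recover $p$ exactly. A minimal round-trip sketch:

```python
import numpy as np

def sigmoid(eta):
    return 1 / (1 + np.exp(-eta))

p = np.array([0.05, 0.25, 0.5, 0.75, 0.95])

# Natural parameter of the Bernoulli distribution: the log-odds
eta = np.log(p / (1 - p))

# Inverting the canonical link with the sigmoid recovers the original probability
p_recovered = sigmoid(eta)
assert np.allclose(p, p_recovered)

print("eta:", np.round(eta, 3))
print("recovered p:", p_recovered)
```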
The Binary Cross-Entropy Connection
When we use the sigmoid to produce probabilities, the natural loss function is the binary cross-entropy (also called log-loss):
$$\mathcal{L}(y, \hat{p}) = -[y \log(\hat{p}) + (1-y) \log(1-\hat{p})]$$
This loss is not arbitrary—it's the negative log-likelihood of the Bernoulli distribution. Minimizing cross-entropy is equivalent to maximum likelihood estimation. The sigmoid and cross-entropy are paired constructs: using one naturally implies the other.
This pairing has computational benefits too. The gradient of cross-entropy with sigmoid output simplifies beautifully:
$$\frac{\partial \mathcal{L}}{\partial z} = \hat{p} - y = \sigma(z) - y$$
This simple form (predicted probability minus true label) makes implementation clean and numerically stable.
The fact that the gradient is simply (σ(z) - y) is profound. It says that the update size is proportional to the error—predictions that are confident and wrong get large updates, while correct predictions get small or zero updates. This matches our intuition for learning.
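The simplification $\frac{\partial \mathcal{L}}{\partial z} = \sigma(z) - y$ can be verified with a finite-difference check on the composed loss; a small sketch:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def bce_loss(y, z):
    """Binary cross-entropy of the sigmoid output, as a function of the logit z."""
    p = sigmoid(z)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

h = 1e-6
for y in (0.0, 1.0):
    for z in (-3.0, -0.5, 0.0, 2.0):
        # Central finite difference of the loss with respect to the logit z
        numeric = (bce_loss(y, z + h) - bce_loss(y, z - h)) / (2 * h)
        analytic = sigmoid(z) - y
        assert np.isclose(numeric, analytic, atol=1e-6)
        print(f"y = {y:.0f}, z = {z:+.1f}: dL/dz = {analytic:+.4f}")
```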
While the sigmoid function has elegant mathematical properties, it suffers from a critical practical issue: vanishing gradients. This problem is particularly severe in deep neural networks but also affects logistic regression optimization.
The Problem Explained
Recall that the sigmoid derivative is:
$$\sigma'(z) = \sigma(z)(1 - \sigma(z))$$
This derivative has maximum value 0.25 (at z = 0) and approaches 0 exponentially as |z| increases. When the input $z$ is far from zero in either direction, $\sigma(z)$ saturates near 0 or 1, so the product $\sigma(z)(1 - \sigma(z))$ collapses toward zero. These tiny gradients cause slow learning when the model is confident (whether correctly or incorrectly).
Implications for Optimization
Slow Convergence from Poor Initialization: If initial weights produce saturated sigmoids (outputs near 0 or 1), gradients are tiny and learning crawls.
Difficulty Updating Early Layers: In multi-layer networks, gradients are multiplied together. If each sigmoid contributes a factor < 0.25, gradients shrink exponentially with depth.
Confident Mistakes Are Slow to Fix: If the model is 99% confident and wrong, the gradient is small despite the large error. This seems counterintuitive—shouldn't we update more when we're very wrong?
```python
import numpy as np


def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))


def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)


# Demonstrate gradient magnitudes at different z values
z_values = [0, 1, 2, 3, 4, 5, 10, 15, 20]

grad_label = "σ'(z)"  # defined outside the f-string to avoid escaping a quote inside it
print("Sigmoid Gradient Magnitudes (Vanishing Gradient Problem)")
print("-" * 60)
print(f"{'z':>5} | {'σ(z)':>10} | {grad_label:>15} | {'Relative to max':>15}")
print("-" * 60)

max_grad = sigmoid_grad(0)  # 0.25
for z in z_values:
    s = sigmoid(z)
    grad = sigmoid_grad(z)
    relative = grad / max_grad
    print(f"{z:>5} | {s:>10.6f} | {grad:>15.10f} | {relative:>13.2%}")

# Show impact in deep networks (chained gradients)
print("\n" + "=" * 60)
print("Chained Gradients in Deep Networks")
print("=" * 60)

for depth in [2, 5, 10, 20]:
    # Even at z=2 (mild saturation), chained gradients vanish
    grad_z2 = sigmoid_grad(2)  # ≈ 0.1
    chained = grad_z2 ** depth
    print(f"Depth {depth:>2}: (σ'(2))^{depth:<2} = {grad_z2:.4f}^{depth:<2} = {chained:.2e}")

# Contrast with more modern activations
print("\nComparison with ReLU (for context):")
print("ReLU gradient = 1 for z > 0, 0 for z <= 0")
print("No gradient shrinkage regardless of z magnitude!")
```

For logistic regression specifically, vanishing gradients are manageable with proper initialization and learning rate tuning. However, for deep networks, the sigmoid activation has been largely replaced by ReLU and its variants, which don't saturate for positive inputs. The sigmoid remains important as the final output layer for binary classification.
While the sigmoid function is the standard for binary classification, several related functions serve different purposes or offer different properties.
The Hyperbolic Tangent (tanh)
$$\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}} = 2\sigma(2z) - 1$$
The tanh function is a scaled and shifted sigmoid, mapping to $(-1, 1)$ instead of $(0, 1)$. It is zero-centered, which can improve gradient flow in neural networks; the identity $\tanh(z) = 2\sigma(2z) - 1$ above makes the relationship to the sigmoid explicit.
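That identity is easy to confirm numerically; a minimal sketch:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-5, 5, 11)
# tanh(z) = 2*sigma(2z) - 1 holds pointwise
assert np.allclose(np.tanh(z), 2 * sigmoid(2 * z) - 1)
print("tanh(z) == 2*sigmoid(2z) - 1 verified on", len(z), "points")
```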
The Softmax Function (Multi-class Generalization)
For K-class classification, the sigmoid generalizes to the softmax function:
$$\text{softmax}(z)_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$
When K = 2, softmax reduces to the sigmoid:
$$\text{softmax}([z, 0])_1 = \frac{e^z}{e^z + e^0} = \frac{e^z}{e^z + 1} = \sigma(z)$$
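This reduction can be checked directly: for any $z$, the first component of the softmax over the logits $[z, 0]$ equals $\sigma(z)$. A minimal sketch:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def softmax(v):
    e = np.exp(v - np.max(v))  # shift by the max for numerical stability
    return e / e.sum()

for z in (-4.0, -1.0, 0.0, 2.5, 7.0):
    # Two-class softmax over logits [z, 0]: its first component is exactly sigma(z)
    assert np.isclose(softmax(np.array([z, 0.0]))[0], sigmoid(z))
print("softmax([z, 0])[0] == sigmoid(z) verified")
```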
The Probit Function
An alternative to the logistic link is the probit function, which uses the cumulative distribution function (CDF) of the standard normal distribution:
$$p = \Phi(w^T x + b) \quad \Longleftrightarrow \quad \Phi^{-1}(p) = w^T x + b$$
Probit regression assumes errors follow a normal distribution, while logistic regression assumes a logistic distribution. In practice, results are nearly identical for most problems, but logistic regression is computationally simpler and has more interpretable coefficients (odds ratios).
| Function | Range | Formula | Use Case |
|---|---|---|---|
| Sigmoid (Logistic) | (0, 1) | 1 / (1 + e^(-z)) | Binary classification probability |
| Tanh | (-1, 1) | (e^z - e^(-z)) / (e^z + e^(-z)) | Neural network hidden layers |
| Softmax | (0, 1)^K, sums to 1 | exp(z_k) / Σexp(z_j) | Multi-class classification |
| Probit (Φ) | (0, 1) | Gaussian CDF | When normality is assumed |
| Hard sigmoid | [0, 1] | clip(0.2z + 0.5, 0, 1) | Fast approximation |
```python
import numpy as np
from scipy.stats import norm


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def tanh_func(z):
    return np.tanh(z)


def softmax(z):
    """Softmax for a vector of logits."""
    exp_z = np.exp(z - np.max(z))  # Subtract max for numerical stability
    return exp_z / exp_z.sum()


def probit(p):
    """Probit function: inverse of normal CDF."""
    return norm.ppf(p)


def probit_link(z):
    """Probit link: normal CDF."""
    return norm.cdf(z)


def hard_sigmoid(z):
    """Piecewise linear approximation to sigmoid."""
    return np.clip(0.2 * z + 0.5, 0, 1)


# Compare functions
z = np.linspace(-5, 5, 11)
print("Comparison of sigmoid-like functions:")
print("-" * 75)
print(f"{'z':>6} | {'sigmoid':>10} | {'tanh':>10} | {'probit':>10} | {'hard_sig':>10}")
print("-" * 75)
for zi in z:
    print(f"{zi:>6.1f} | {sigmoid(zi):>10.4f} | {tanh_func(zi):>10.4f} | "
          f"{probit_link(zi):>10.4f} | {hard_sigmoid(zi):>10.4f}")

# Show softmax for a 3-class example
logits = np.array([2.0, 1.0, 0.5])
probs = softmax(logits)
print("\nSoftmax example (3 classes):")
print(f"Logits: {logits}")
print(f"Probabilities: {probs}")
print(f"Sum: {probs.sum():.4f}")
```

Use sigmoid for binary classification output. Use softmax for multi-class classification output. Use tanh (or ReLU) for hidden layers in neural networks. The probit is rarely used in ML but appears in econometrics and some specialized applications where normal distribution assumptions are appropriate.
We've explored the sigmoid function from multiple angles—historical, mathematical, computational, and theoretical. Let's consolidate the essential takeaways:
What's Next:
With the sigmoid function understood, we're ready to explore its deeper implications. The next page examines the log-odds (logit) interpretation—how the logistic model makes a linear assumption not on probabilities directly, but on their logarithm of odds. This interpretation is crucial for understanding coefficient meanings and the geometric structure of logistic regression.
You now have a deep understanding of the sigmoid function—its definition, properties, theoretical justification, and practical considerations. This foundation is essential for everything that follows in logistic regression and beyond.