Binary classification—distinguishing between exactly two classes—is elegantly handled by logistic regression. The sigmoid function maps any real-valued score to the probability range $(0, 1)$, and the decision rule follows naturally. But the real world rarely presents such neat dichotomies.
Consider classifying handwritten digits (10 classes), identifying animal species in photographs (thousands of possibilities), or categorizing customer support tickets into departments (dozens of routing options). These multi-class classification problems demand a principled extension of our binary framework.
Enter the softmax function—the mathematical transformation that generalizes the sigmoid to arbitrary numbers of classes, enabling probabilistic interpretation and gradient-based learning for multi-class problems.
By the end of this page, you will understand the softmax function at a deep level: its derivation from maximum entropy principles, its mathematical properties, its relationship to the sigmoid, numerical stability considerations, and its role as the fundamental building block of multi-class classification systems—from simple logistic regression to deep neural networks.
Before diving into the softmax function itself, let's understand precisely why we need it—and why simpler alternatives fail.
The Constraint We Must Satisfy:
In multi-class classification with $K$ classes, we want to output a probability distribution $\mathbf{p} = (p_1, p_2, \ldots, p_K)$ over all possible classes. This probability vector must satisfy two fundamental constraints: non-negativity, $p_k \geq 0$ for every class $k$, and normalization, $\sum_{k=1}^{K} p_k = 1$.
These constraints define the $(K-1)$-dimensional probability simplex $\Delta^{K-1}$, the geometric space of all valid probability distributions over $K$ outcomes.
For $K=3$ classes, the probability simplex is a 2D triangle in 3D space with vertices at $(1,0,0)$, $(0,1,0)$, and $(0,0,1)$. Every point inside this triangle represents a valid probability distribution. The simplex sits in the hyperplane $p_1 + p_2 + p_3 = 1$ and is bounded by $p_k \geq 0$.
Why Not Simply Normalize Raw Scores?
Suppose our model produces raw scores (logits) $z_1, z_2, \ldots, z_K$ for each class. A naive approach might be:
$$p_k = \frac{z_k}{\sum_{j=1}^{K} z_j}$$
This simple normalization seems reasonable but fails catastrophically: logits can be negative or can sum to zero, so the resulting 'probabilities' may be negative, undefined, or fail to sum to one.
Examples of Failure:
Consider logits $\mathbf{z} = (2, -3, 1)$: $$p_1 = \frac{2}{2-3+1} = \frac{2}{0} \rightarrow \text{undefined}$$
Or with $\mathbf{z} = (1, -2, -3)$: $$p_1 = \frac{1}{1-2-3} = \frac{1}{-4} = -0.25 \rightarrow \text{negative 'probability'}$$
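A quick numerical check makes these failure modes concrete. This is a toy sketch; the helper name `normalize_naive` is just for illustration:

```python
import numpy as np

def normalize_naive(z):
    """Naive normalization: divide each raw score by the sum of all scores."""
    return z / np.sum(z)

with np.errstate(divide="ignore", invalid="ignore"):
    # Logits summing to zero: division by zero
    print(normalize_naive(np.array([2.0, -3.0, 1.0])))   # inf / nan entries
    # Negative logit sum: 'probabilities' outside [0, 1]
    print(normalize_naive(np.array([1.0, -2.0, -3.0])))  # [-0.25, 0.5, 0.75]
```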
Why Not Per-class Sigmoids?
Another intuitive approach: apply the sigmoid function independently to each logit:
$$p_k = \sigma(z_k) = \frac{1}{1 + e^{-z_k}}$$
This ensures $p_k \in (0, 1)$—satisfying non-negativity—but violates normalization:
$$\sum_{k=1}^{K} \sigma(z_k) \neq 1 \quad \text{(in general)}$$
Example:
With $\mathbf{z} = (0, 0, 0)$: $$\sigma(0) = 0.5 \quad \Rightarrow \quad \sum_{k=1}^{3} \sigma(z_k) = 1.5 \neq 1$$
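A short check (a sketch using only NumPy) confirms that independent sigmoids generally do not sum to 1:

```python
import numpy as np

def sigmoid(z):
    """Elementwise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))

for z in [np.array([0.0, 0.0, 0.0]), np.array([2.0, -3.0, 1.0])]:
    p = sigmoid(z)
    print(f"z={z}, sigmoid(z)={np.round(p, 3)}, sum={p.sum():.3f}")
# Neither sum equals 1 (1.5 and about 1.66): fine for multi-label, wrong for multi-class.
```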
Independent sigmoids are used in multi-label classification where an input can belong to multiple classes simultaneously. But for multi-class classification with mutually exclusive classes, we need a function that respects the probability simplex constraint.
Don't confuse these settings. Multi-class: exactly one correct class (use softmax). Multi-label: possibly multiple correct classes (use independent sigmoids). The constraint structure differs fundamentally, and using the wrong function is a common error.
The softmax function isn't an arbitrary choice—it emerges naturally from several foundational principles. Let's explore multiple derivations that reveal different facets of this elegant transformation.
Definition: The Softmax Function
Given a vector of real-valued scores (logits) $\mathbf{z} = (z_1, z_2, \ldots, z_K) \in \mathbb{R}^K$, the softmax function $\text{softmax}: \mathbb{R}^K \rightarrow \Delta^{K-1}$ produces a probability distribution:
$$\text{softmax}(\mathbf{z})_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}} \quad \text{for } k = 1, 2, \ldots, K$$
We often write this compactly as:
$$p_k = \frac{\exp(z_k)}{\sum_{j=1}^{K} \exp(z_j)}$$
The exponential function $e^z$ is the key ingredient. It maps any real number to a strictly positive value: $e^z > 0$ for all $z \in \mathbb{R}$. This guarantees non-negativity of all probabilities, and normalization then ensures they sum to one.
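Translating the definition directly into NumPy (a minimal sketch; numerical stability is addressed later on this page):

```python
import numpy as np

def softmax(z):
    """Direct implementation of the softmax definition (not yet numerically stable)."""
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

z = np.array([2.0, -3.0, 1.0])       # logits that broke naive normalization
p = softmax(z)
print(p)                              # all entries strictly positive
print(p.sum())                        # 1.0 (up to floating-point rounding)
print(np.argmax(p) == np.argmax(z))   # True: softmax preserves the ranking of logits
```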
Derivation 1: Maximum Entropy Principle
The softmax function can be derived as the solution to a constrained optimization problem rooted in information theory.
Problem Statement: Find the probability distribution $\mathbf{p}$ that maximizes entropy subject to expected value constraints.
Given features $\mathbf{x}$ and model parameters $\boldsymbol{\theta}_k$ for each class, we want:
$$\max_{\mathbf{p} \in \Delta^{K-1}} H(\mathbf{p}) = -\sum_{k=1}^{K} p_k \log p_k$$
subject to: $$\mathbb{E}[\phi_k(\mathbf{x})] = \sum_{k=1}^{K} p_k \cdot \phi_k(\mathbf{x}) = c_k \quad \text{for each } k$$
Using Lagrange multipliers $\lambda_k$ for the feature constraints and $\mu$ for normalization:
$$\mathcal{L} = -\sum_k p_k \log p_k + \mu\left(1 - \sum_k p_k\right) + \sum_k \lambda_k\left(c_k - p_k \phi_k\right)$$
Taking derivatives and solving:
$$\frac{\partial \mathcal{L}}{\partial p_k} = -\log p_k - 1 - \mu - \lambda_k \phi_k = 0$$
$$\Rightarrow p_k = \exp(-1 - \mu - \lambda_k \phi_k) = \frac{\exp(-\lambda_k \phi_k)}{\exp(1 + \mu)}$$
Normalizing to sum to 1:
$$p_k = \frac{\exp(z_k)}{\sum_j \exp(z_j)}$$
where we identify $z_k = -\lambda_k \phi_k$; in multinomial logistic regression this score is parameterized linearly as $z_k = \boldsymbol{\theta}_k^T \mathbf{x}$, the linear score for class $k$.
The maximum entropy interpretation reveals that softmax produces the least biased probability distribution consistent with our model's linear scores.
Derivation 2: Generalization of the Sigmoid
Recall that for binary classification with sigmoid:
$$P(y=1|\mathbf{x}) = \sigma(z) = \frac{1}{1 + e^{-z}} = \frac{e^z}{e^z + 1}$$
$$P(y=0|\mathbf{x}) = 1 - \sigma(z) = \frac{1}{1 + e^z}$$
Now consider the ratio:
$$\frac{P(y=1|\mathbf{x})}{P(y=0|\mathbf{x})} = \frac{e^z/(e^z + 1)}{1/(1 + e^z)} = e^z$$
This suggests that the log-odds of class 1 vs. class 0 equals the score $z$:
$$\log \frac{P(y=1|\mathbf{x})}{P(y=0|\mathbf{x})} = z$$
Extending to $K$ classes:
For multi-class, we want the log-odds of class $k$ vs. a reference class (say, class $K$) to equal some score:
$$\log \frac{P(y=k|\mathbf{x})}{P(y=K|\mathbf{x})} = z_k - z_K$$
This implies: $$P(y=k|\mathbf{x}) = P(y=K|\mathbf{x}) \cdot e^{z_k - z_K} = P(y=K|\mathbf{x}) \cdot \frac{e^{z_k}}{e^{z_K}}$$
Summing over all classes and using normalization: $$1 = \sum_{k=1}^{K} P(y=k|\mathbf{x}) = P(y=K|\mathbf{x}) \cdot \frac{\sum_k e^{z_k}}{e^{z_K}}$$
Solving for $P(y=K|\mathbf{x})$ and substituting back:
$$P(y=k|\mathbf{x}) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$
Thus, softmax is the unique function that generalizes sigmoid while maintaining log-linear relationships between class probabilities.
Derivation 3: Exponential Family Connection
The softmax function arises naturally when modeling discrete distributions from the exponential family perspective.
The categorical distribution with parameter $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_K)$ can be written in exponential family form:
$$P(y=k|\boldsymbol{\pi}) = \exp\left(\eta_k - A(\boldsymbol{\eta})\right)$$
where $\eta_k$ is the natural parameter for class $k$ and $A(\boldsymbol{\eta}) = \log \sum_j e^{\eta_j}$ is the log-partition function ensuring normalization.
To recover probabilities from natural parameters:
$$\pi_k = \exp\left(\eta_k - A(\boldsymbol{\eta})\right) = \frac{e^{\eta_k}}{\sum_j e^{\eta_j}}$$
This is precisely the softmax function! In multinomial logistic regression, we parameterize $\eta_k = \boldsymbol{\theta}_k^T \mathbf{x}$, making the natural parameter a linear function of features.
The exponential family derivation connects softmax to the broader theory of generalized linear models and explains why maximum likelihood estimation leads to well-behaved convex optimization problems.
The softmax function emerges independently from maximum entropy, generalization of sigmoid, and exponential family theory. This convergence of derivations suggests that softmax isn't merely convenient—it's mathematically canonical for multi-class probability modeling.
Understanding the softmax function requires examining its mathematical properties in detail. These properties explain its behavior and inform practical implementation.
Property 1: Invariance to Translation
For any constant $c \in \mathbb{R}$:
$$\text{softmax}(\mathbf{z} + c \cdot \mathbf{1}) = \text{softmax}(\mathbf{z})$$
where $\mathbf{1} = (1, 1, \ldots, 1)$.
Proof: $$\text{softmax}(\mathbf{z} + c)_k = \frac{e^{z_k + c}}{\sum_j e^{z_j + c}} = \frac{e^c \cdot e^{z_k}}{e^c \cdot \sum_j e^{z_j}} = \frac{e^{z_k}}{\sum_j e^{z_j}} = \text{softmax}(\mathbf{z})_k$$
Implication: Adding the same constant to all logits doesn't change the output probabilities. This property is crucial for numerical stability (as we'll see) and means softmax only cares about relative differences between logits.
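The proof is easy to verify numerically; a small sketch (reusing the direct definition from above):

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

z = np.array([1.0, 2.0, 3.0])
for c in [-100.0, 0.0, 7.5]:
    print(c, np.allclose(softmax(z + c), softmax(z)))  # True for every constant shift
```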
Property 2: Sensitivity to Scale
For $\alpha > 0$, consider the temperature-scaled softmax:
$$\text{softmax}(\alpha \mathbf{z})_k = \frac{e^{\alpha z_k}}{\sum_j e^{\alpha z_j}}$$
Mathematical analysis:
Let $z_\text{max} = \max_k z_k$ and assume it's unique. As $\alpha \to \infty$:
$$\text{softmax}(\alpha \mathbf{z})_k = \frac{e^{\alpha z_k}}{\sum_j e^{\alpha z_j}} = \frac{e^{\alpha(z_k - z_\text{max})}}{\sum_j e^{\alpha(z_j - z_\text{max})}}$$
For $z_k < z_\text{max}$, $e^{\alpha(z_k - z_\text{max})} \to 0$. For $z_k = z_\text{max}$, $e^{\alpha(z_k - z_\text{max})} = 1$.
Thus, as $\alpha \to \infty$ the argmax class gets probability 1 and all others get 0. Conversely, as $\alpha \to 0$, every scaled logit approaches 0 and the distribution approaches the uniform distribution.
Temperature scaling replaces $\mathbf{z}$ with $\mathbf{z}/\tau$, i.e., $\alpha = 1/\tau$, and is widely used in neural networks. During training, the standard temperature ($\tau=1$) is used. At inference, a higher temperature (e.g., $\tau=2$) softens predictions for uncertainty estimation or diverse sampling in generative models, while a lower temperature (e.g., $\tau=0.5$) sharpens predictions for more confident decisions.
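The limiting behavior in both directions is easy to see numerically; a small sketch scaling the same logits by different temperatures:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)          # subtract the max (allowed by translation invariance)
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

z = np.array([1.0, 2.0, 5.0])
for tau in [0.1, 0.5, 1.0, 2.0, 100.0]:
    print(f"tau={tau:>6}: {np.round(softmax(z / tau), 3)}")
# tau -> 0 approaches one-hot on the argmax; tau -> infinity approaches uniform (1/3 each)
```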
Property 3: Gradient Structure
The Jacobian of the softmax function has a beautiful form essential for backpropagation.
Let $\mathbf{p} = \text{softmax}(\mathbf{z})$. The partial derivatives are:
$$\frac{\partial p_i}{\partial z_j} = \begin{cases} p_i(1 - p_i) & \text{if } i = j \\ -p_i p_j & \text{if } i \neq j \end{cases}$$
This can be written compactly as:
$$\frac{\partial \mathbf{p}}{\partial \mathbf{z}} = \text{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^T$$
Derivation:
Using the quotient rule for $p_i = \frac{e^{z_i}}{\sum_k e^{z_k}}$:
$$\frac{\partial p_i}{\partial z_j} = \frac{\mathbf{1}_{i=j} \cdot e^{z_i} \cdot \sum_k e^{z_k} - e^{z_i} \cdot e^{z_j}}{\left(\sum_k e^{z_k}\right)^2}$$
For $i = j$: $$\frac{\partial p_i}{\partial z_i} = \frac{e^{z_i}}{\sum_k e^{z_k}} - \frac{e^{z_i} \cdot e^{z_i}}{\left(\sum_k e^{z_k}\right)^2} = p_i - p_i^2 = p_i(1 - p_i)$$
For $i \neq j$: $$\frac{\partial p_i}{\partial z_j} = - \frac{e^{z_i} \cdot e^{z_j}}{\left(\sum_k e^{z_k}\right)^2} = -p_i p_j$$
Interpretation: The diagonal terms show each class probability responds positively to its own logit (with diminishing returns as $p_i \to 1$). Off-diagonal terms show competition: increasing one logit decreases all other probabilities.
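The Jacobian formula can be checked against finite differences; a small sketch (the step size `eps` is an arbitrary choice):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

def softmax_jacobian(z):
    """Analytic Jacobian: diag(p) - p p^T."""
    p = softmax(z)
    return np.diag(p) - np.outer(p, p)

def numerical_jacobian(z, eps=1e-6):
    """Central-difference approximation of d softmax / d z."""
    K = len(z)
    J = np.zeros((K, K))
    for j in range(K):
        e = np.zeros(K)
        e[j] = eps
        J[:, j] = (softmax(z + e) - softmax(z - e)) / (2 * eps)
    return J

z = np.array([0.5, -1.0, 2.0])
print(np.allclose(softmax_jacobian(z), numerical_jacobian(z), atol=1e-6))  # True
```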
Property 4: Convexity and Non-Injectivity
Softmax is the gradient of the convex log-sum-exp function $\text{LSE}(\mathbf{z}) = \log\left(\sum_k e^{z_k}\right)$:
$$\nabla \text{LSE}(\mathbf{z}) = \text{softmax}(\mathbf{z})$$
This means softmax maps $\mathbb{R}^K$ onto the interior of the probability simplex (never onto the boundary where some $p_k = 0$).
Softmax is not injective, however: by translation invariance,
$$\text{softmax}(\mathbf{z}) = \text{softmax}(\mathbf{z} + c\mathbf{1}) \quad \forall c \in \mathbb{R}$$
To achieve identifiability in parameter estimation, we typically fix one class's logit to zero (e.g., $z_K = 0$), reducing the parameter space from $\mathbb{R}^K$ to $\mathbb{R}^{K-1}$.
| Property | Mathematical Form | Practical Implication |
|---|---|---|
| Translation invariance | $\text{softmax}(\mathbf{z} + c) = \text{softmax}(\mathbf{z})$ | Only relative logit differences matter; enables numerical stability tricks |
| Scale sensitivity | $\alpha \to \infty \Rightarrow$ argmax; $\alpha \to 0 \Rightarrow$ uniform | Temperature controls sharpness of distribution |
| Output range | $p_k \in (0, 1)$, never exactly 0 or 1 | All classes always have some probability mass |
| Gradient structure | $\frac{\partial p_i}{\partial z_j} = p_i(\delta_{ij} - p_j)$ | Enables efficient backpropagation; shows class competition |
| Smoothness | Infinitely differentiable | Well-behaved for gradient-based optimization |
The softmax function, despite its mathematical elegance, poses serious numerical challenges in practice. Understanding and addressing these issues is essential for robust implementations.
The Overflow Problem
Consider computing softmax for $\mathbf{z} = (1000, 1001, 1002)$:
$$e^{1000} \approx 1.97 \times 10^{434}$$
This exceeds the maximum representable float64 value ($\approx 1.8 \times 10^{308}$), causing overflow to infinity.
The Underflow Problem
Conversely, for $\mathbf{z} = (-1000, -1001, -1002)$:
$$e^{-1000} \approx 5.08 \times 10^{-435}$$
This is smaller than the smallest positive normalized float64 value ($\approx 2.2 \times 10^{-308}$), causing underflow to zero. If every exponential underflows, the sum in the denominator becomes zero, leading to division by zero.
Numerical instability often manifests silently as NaN (Not a Number) values that propagate through computations, causing training to diverge without obvious error messages. Always implement numerically stable versions of softmax.
The Stable Softmax Solution
Using translation invariance, we subtract the maximum logit before exponentiating:
$$\text{softmax}_{\text{stable}}(\mathbf{z})_k = \frac{e^{z_k - \max(\mathbf{z})}}{\sum_{j=1}^{K} e^{z_j - \max(\mathbf{z})}}$$
This transformation ensures that the largest shifted logit is exactly $0$, so every exponential lies in $(0, 1]$ (no overflow), and the denominator is at least $1$ (no division by zero from underflow).
Correctness proof (by translation invariance): $$\frac{e^{z_k - m}}{\sum_j e^{z_j - m}} = \frac{e^{-m} \cdot e^{z_k}}{e^{-m} \cdot \sum_j e^{z_j}} = \frac{e^{z_k}}{\sum_j e^{z_j}}$$
```python
import numpy as np

def softmax_naive(z):
    """
    Naive softmax implementation - DO NOT USE IN PRODUCTION.
    Fails for large or small logit values.
    """
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

def softmax_stable(z):
    """
    Numerically stable softmax implementation.
    Subtracts max to prevent overflow while preserving output.

    Args:
        z: Array of logits, shape (K,) or (N, K)

    Returns:
        Probability distribution, same shape as input
    """
    # Handle both 1D and 2D inputs
    z = np.asarray(z)
    if z.ndim == 1:
        z_max = np.max(z)
        exp_z = np.exp(z - z_max)
        return exp_z / np.sum(exp_z)
    else:
        # For batched inputs, max over class dimension
        z_max = np.max(z, axis=-1, keepdims=True)
        exp_z = np.exp(z - z_max)
        return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

def softmax_temperature(z, temperature=1.0):
    """
    Temperature-scaled softmax.

    Args:
        z: Logits
        temperature: Scaling factor (higher = softer distribution)

    Returns:
        Temperature-scaled probability distribution
    """
    return softmax_stable(z / temperature)

# Demonstration of numerical issues
print("=== Numerical Stability Demonstration ===")
print()

# Case 1: Normal logits - both work
z_normal = np.array([1.0, 2.0, 3.0])
print(f"Normal logits: {z_normal}")
print(f"Naive softmax: {softmax_naive(z_normal)}")
print(f"Stable softmax: {softmax_stable(z_normal)}")
print()

# Case 2: Large logits - naive fails
z_large = np.array([1000.0, 1001.0, 1002.0])
print(f"Large logits: {z_large}")
print(f"Naive softmax: {softmax_naive(z_large)}")    # [nan, nan, nan] or [0, 0, nan]
print(f"Stable softmax: {softmax_stable(z_large)}")  # Correct output
print()

# Case 3: Very negative logits - naive may fail
z_negative = np.array([-1000.0, -1001.0, -1002.0])
print(f"Negative logits: {z_negative}")
print(f"Naive softmax: {softmax_naive(z_negative)}")    # May give nan
print(f"Stable softmax: {softmax_stable(z_negative)}")  # Correct output
print()

# Temperature scaling demonstration
print("=== Temperature Scaling ===")
z = np.array([1.0, 2.0, 5.0])
print(f"Logits: {z}")
for temp in [0.5, 1.0, 2.0, 5.0]:
    print(f"T={temp}: {softmax_temperature(z, temp)}")
```

Log-Softmax for Enhanced Stability
When computing cross-entropy loss, we often need $\log(\text{softmax}(\mathbf{z})_k) = \log p_k$. Computing softmax then taking log incurs unnecessary numerical error. Instead, compute log-softmax directly:
$$\log\text{-softmax}(\mathbf{z})_k = z_k - \text{LSE}(\mathbf{z})$$
where $\text{LSE}(\mathbf{z}) = \log\left(\sum_j e^{z_j}\right)$ is the log-sum-exp.
Stable log-sum-exp: $$\text{LSE}(\mathbf{z}) = m + \log\left(\sum_j e^{z_j - m}\right)$$ where $m = \max(\mathbf{z})$.
This formulation never exponentiates large positive numbers (the shifted exponents are at most $0$), avoids the lossy exp-then-log round trip, and preserves precision for very negative log-probabilities. It is available directly as torch.nn.functional.log_softmax and tf.nn.log_softmax.
Modern deep learning frameworks (PyTorch, TensorFlow, JAX) implement numerically stable softmax and log-softmax. Use torch.nn.functional.softmax, tf.nn.softmax, or jax.nn.softmax instead of implementing manually. When combining softmax with cross-entropy, use fused operations like F.cross_entropy (PyTorch) or tf.nn.softmax_cross_entropy_with_logits for best numerical behavior.
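A minimal NumPy sketch of the stable log-sum-exp and log-softmax formulas above; in practice, prefer the framework built-ins just mentioned:

```python
import numpy as np

def logsumexp(z):
    """Stable log-sum-exp: m + log(sum(exp(z - m))) with m = max(z)."""
    m = np.max(z)
    return m + np.log(np.sum(np.exp(z - m)))

def log_softmax(z):
    """Stable log-softmax: z_k - LSE(z), computed without an explicit softmax."""
    return z - logsumexp(z)

z = np.array([1000.0, 1001.0, 1002.0])   # would overflow a naive implementation
print(log_softmax(z))                     # finite log-probabilities
print(np.exp(log_softmax(z)).sum())       # 1.0: exponentiating recovers softmax
```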
Softmax and sigmoid are intimately related—the sigmoid is a special case of softmax for two classes. Understanding this connection deepens intuition and clarifies when to use each.
Binary Softmax Equals Sigmoid
Consider softmax with $K=2$ classes and logits $(z_1, z_2)$:
$$p_1 = \frac{e^{z_1}}{e^{z_1} + e^{z_2}}$$
Divide numerator and denominator by $e^{z_1}$:
$$p_1 = \frac{1}{1 + e^{z_2 - z_1}} = \frac{1}{1 + e^{-z}} = \sigma(z)$$
where $z = z_1 - z_2$ is the log-odds of class 1 vs. class 2.
Key insight: For binary classification, we only need one logit (the difference), not two. The sigmoid function compactly represents this.
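A quick numerical confirmation of this equivalence (a sketch; the logit difference plays the role of the binary score $z$):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
for _ in range(3):
    z1, z2 = rng.normal(size=2)
    p1_softmax = softmax(np.array([z1, z2]))[0]
    p1_sigmoid = sigmoid(z1 - z2)               # single logit: the difference
    print(np.isclose(p1_softmax, p1_sigmoid))   # True
```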
Parameter Efficiency Comparison
For a linear model with $d$ features:
| Setting | Parameters Needed | Why |
|---|---|---|
| Binary + Sigmoid | $d + 1$ | Single weight vector + bias |
| Binary + Softmax | $2(d + 1)$ | Two weight vectors + biases |
| Binary + Softmax (constrained) | $d + 1$ | Set $\mathbf{w}_2 = \mathbf{0}$ |
The softmax formulation is overparameterized for binary classification due to translation invariance. In practice, binary problems are handled with the sigmoid and a single weight vector, which is equivalent to a two-class softmax with the second class's weights constrained to zero.
Reparameterization:
In softmax with $K$ classes, we can always set $\mathbf{w}_K = \mathbf{0}$ and $b_K = 0$ without loss of generality. This gives the equivalent model:
$$z_k = \mathbf{w}_k^T \mathbf{x} + b_k \quad \text{for } k = 1, \ldots, K-1$$ $$z_K = 0$$
Then $p_k$ represents the probability relative to class $K$, which serves as the reference.
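The equivalence of the full and reference-class parameterizations can be checked directly (a sketch: shifting all logits by $-z_K$ pins the last logit to zero without changing the output):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

z_full = np.array([1.3, -0.7, 2.1, 0.4])   # unconstrained logits for K = 4 classes
z_ref = z_full - z_full[-1]                # reparameterized: last logit is exactly 0
print(z_ref[-1])                           # 0.0
print(np.allclose(softmax(z_full), softmax(z_ref)))  # True: same distribution
```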
Using softmax for multi-label classification (where an image might be labeled both 'sunny' and 'scenic') forces probabilities to compete. If $p(\text{sunny}) = 0.9$, then $p(\text{scenic})$ is constrained to be at most $0.1$. Use independent sigmoids instead, allowing both to be high simultaneously.
The softmax function appears throughout modern deep learning, far beyond simple classification layers. Understanding its architectural roles reveals the function's versatility.
Role 1: Classification Output Layer
The most common use—the final layer of a classification network:
$$\text{Input} \to \text{Hidden Layers} \to \mathbf{z} \in \mathbb{R}^K \to \text{softmax} \to \mathbf{p} \in \Delta^{K-1}$$
The network learns feature representations in hidden layers; the final linear layer projects to $K$ logits; softmax converts to probabilities.
Role 2: Attention Mechanisms
In Transformers and attention-based models, softmax computes attention weights:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$
Here, softmax ensures attention weights over keys sum to 1, creating a proper weighted average of values. The scaling by $\sqrt{d_k}$ prevents softmax saturation in high dimensions.
Role 3: Reinforcement Learning Policies
In policy gradient methods, the policy network outputs action probabilities:
$$\pi_\theta(a|s) = \text{softmax}(f_\theta(s))_a$$
Softmax ensures a valid probability distribution over the discrete action space, enabling sampling during exploration and gradient computation for policy updates.
Role 4: Knowledge Distillation
In model compression, a large 'teacher' network's soft predictions guide a smaller 'student':
$$\mathbf{p}_{\text{soft}} = \text{softmax}(\mathbf{z} / T)$$
High temperature $T$ produces smoother distributions that encode more information about inter-class relationships than hard labels.
Role 5: Mixture of Experts
In MoE architectures, softmax computes gating weights determining which experts process each input:
$$\mathbf{g} = \text{softmax}(W_g \cdot \mathbf{x})$$ $$\text{Output} = \sum_i g_i \cdot \text{Expert}_i(\mathbf{x})$$
The softmax ensures expert contributions sum to 1, maintaining interpretability as mixture weights.
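A toy sketch of softmax gating over experts; the experts here are arbitrary random linear maps chosen only for illustration:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

rng = np.random.default_rng(0)
d, n_experts = 4, 3
x = rng.normal(size=d)                          # one input vector
W_g = rng.normal(size=(n_experts, d))           # gating weights
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]  # toy linear experts

g = softmax(W_g @ x)                            # gating weights over experts
output = sum(g_i * (E @ x) for g_i, E in zip(g, experts))
print(np.round(g, 3), g.sum())                  # mixture weights and their sum (1.0)
print(output.shape)                             # (4,)
```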
Modern architectures employ softmax variants: Sparse softmax zeros out small probabilities for efficiency. Gumbel-softmax enables differentiable discrete sampling. Entmax (α-softmax) provides learnable sparsity. Softmax with relative positions handles sequence length variation in attention.
```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """
    Simplified attention mechanism using softmax.

    Args:
        Q: Query matrix (seq_len_q, d_k)
        K: Key matrix (seq_len_k, d_k)
        V: Value matrix (seq_len_k, d_v)

    Returns:
        Attention output: (seq_len_q, d_v)
    """
    d_k = Q.shape[-1]

    # Compute attention scores
    scores = Q @ K.T / np.sqrt(d_k)

    # Softmax to get attention weights (each row sums to 1)
    weights = softmax_stable(scores)  # Shape: (seq_len_q, seq_len_k)

    # Weighted sum of values
    output = weights @ V
    return output, weights

def gumbel_softmax(logits, temperature=1.0):
    """
    Gumbel-softmax for differentiable discrete sampling.
    Produces approximately one-hot vectors that are differentiable.

    Args:
        logits: Unnormalized log probabilities
        temperature: Controls discreteness (lower = more discrete)

    Returns:
        Approximately one-hot vector
    """
    # Sample from Gumbel(0, 1)
    gumbels = -np.log(-np.log(np.random.uniform(size=logits.shape) + 1e-20) + 1e-20)

    # Add Gumbel noise and apply temperature-scaled softmax
    y = softmax_stable((logits + gumbels) / temperature)
    return y

def softmax_stable(z):
    """Numerically stable softmax (same as before)."""
    z = np.asarray(z)
    if z.ndim == 1:
        z_max = np.max(z)
        exp_z = np.exp(z - z_max)
        return exp_z / np.sum(exp_z)
    else:
        z_max = np.max(z, axis=-1, keepdims=True)
        exp_z = np.exp(z - z_max)
        return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

# Demo: Attention mechanism
print("=== Attention Mechanism Demo ===")
np.random.seed(42)
Q = np.random.randn(3, 4)  # 3 queries, dimension 4
K = np.random.randn(5, 4)  # 5 keys
V = np.random.randn(5, 6)  # 5 values, dimension 6

output, weights = scaled_dot_product_attention(Q, K, V)
print(f"Query shape: {Q.shape}")
print(f"Attention weights shape: {weights.shape}")
print(f"Weights sum per query: {weights.sum(axis=1)}")  # Should be [1, 1, 1]
print(f"Output shape: {output.shape}")
print()

# Demo: Gumbel-softmax for discrete sampling
print("=== Gumbel-Softmax Demo ===")
logits = np.array([1.0, 2.0, 5.0])
print(f"Logits: {logits}")
print(f"Regular softmax: {softmax_stable(logits)}")
print(f"Gumbel softmax (T=1.0): {gumbel_softmax(logits, 1.0)}")
print(f"Gumbel softmax (T=0.1): {gumbel_softmax(logits, 0.1)}")  # More discrete
```

The softmax function connects to several fundamental concepts in mathematics and machine learning. These connections provide deeper understanding and suggest generalizations.
Connection to Boltzmann Distribution
In statistical physics, the probability of a system being in state $k$ with energy $E_k$ at temperature $T$ is:
$$P(\text{state } k) = \frac{e^{-E_k / (k_B T)}}{\sum_j e^{-E_j / (k_B T)}}$$
This is the Boltzmann distribution—identical to softmax with logits $z_k = -E_k / (k_B T)$.
The analogy: the logits play the role of scaled negative energies, $z_k \leftrightarrow -E_k / (k_B T)$, so high-scoring classes correspond to low-energy states, and the softmax temperature plays the role of physical temperature.
This physical interpretation explains temperature scaling: at low temperature, the system 'freezes' into the lowest energy state (highest logit class); at high temperature, it explores all states equally.
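A small numerical illustration of the Boltzmann correspondence; the energies and temperatures are arbitrary toy values, with units chosen so that $k_B = 1$:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

E = np.array([1.0, 2.0, 5.0])        # toy state energies (k_B = 1)
for T in [0.1, 1.0, 10.0]:
    p = softmax(-E / T)              # Boltzmann distribution = softmax of -E/T
    print(f"T={T:>5}: {np.round(p, 3)}")
# Low T concentrates on the lowest-energy state; high T approaches uniform.
```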
Connection to KL Divergence and Cross-Entropy
The softmax output solves a KL-regularized linear objective. Up to an additive constant, $\sum_k p_k \log p_k$ is the KL divergence from the uniform distribution, and:
$$\text{softmax}(\mathbf{z}) = \arg\min_{\mathbf{p} \in \Delta^{K-1}} \left[ \sum_k p_k \log p_k - \mathbf{z}^T \mathbf{p} \right]$$
This is equivalent to: $$\text{softmax}(\mathbf{z}) = \arg\max_{\mathbf{p}} \left[ H(\mathbf{p}) + \mathbf{z}^T \mathbf{p} \right]$$
where $H(\mathbf{p}) = -\sum_k p_k \log p_k$ is entropy.
Interpretation: Softmax produces the maximum entropy distribution subject to the constraint that expected logits equal a specific value. It's the 'most uncertain' distribution consistent with our belief encoded in logits.
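A numerical sanity check of this variational characterization (a sketch: random points on the simplex should never beat the softmax output on the objective $H(\mathbf{p}) + \mathbf{z}^T \mathbf{p}$):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

def objective(p, z):
    """H(p) + z^T p, with 0 log 0 treated as 0."""
    with np.errstate(divide="ignore", invalid="ignore"):
        entropy = -np.sum(np.where(p > 0, p * np.log(p), 0.0))
    return entropy + z @ p

rng = np.random.default_rng(0)
z = np.array([0.5, -1.0, 2.0])
best = objective(softmax(z), z)
random_points = rng.dirichlet(np.ones(3), size=10_000)   # random distributions on the simplex
print(all(objective(p, z) <= best + 1e-12 for p in random_points))  # True
```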
Connection to Convex Optimization
The softmax function is the gradient of the log-sum-exp (LSE) function:
$$\text{LSE}(\mathbf{z}) = \log\left(\sum_k e^{z_k}\right)$$
$$\nabla \text{LSE}(\mathbf{z}) = \text{softmax}(\mathbf{z})$$
Since LSE is convex, its gradient (softmax) maps $\mathbb{R}^K$ monotonically onto the simplex interior. This gradient relationship underlies the equivalence between maximum likelihood estimation and convex optimization for softmax regression.
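This gradient identity is also easy to verify numerically; a sketch using central finite differences:

```python
import numpy as np

def softmax(z):
    z_shift = z - np.max(z)
    exp_z = np.exp(z_shift)
    return exp_z / np.sum(exp_z)

def lse(z):
    """Stable log-sum-exp."""
    m = np.max(z)
    return m + np.log(np.sum(np.exp(z - m)))

def numerical_grad(f, z, eps=1e-6):
    g = np.zeros_like(z)
    for i in range(len(z)):
        e = np.zeros_like(z)
        e[i] = eps
        g[i] = (f(z + e) - f(z - e)) / (2 * eps)
    return g

z = np.array([0.3, -1.2, 2.0, 0.7])
print(np.allclose(numerical_grad(lse, z), softmax(z), atol=1e-6))  # True
```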
Connection to Information Geometry
In information geometry, the probability simplex is viewed as a Riemannian manifold with the Fisher information metric. The softmax parameterization:
$$\mathbf{p} = \text{softmax}(\boldsymbol{\eta})$$
defines a natural coordinate system in which the logits $\boldsymbol{\eta}$ are the natural (canonical) parameters of the categorical distribution, and the Fisher information matrix is $\text{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^T$, exactly the softmax Jacobian.
This geometric perspective explains why optimization in the logit space (before softmax) is often more stable than directly optimizing probabilities.
Generalization: The α-Softmax and Tsallis Entropy
Softmax maximizes Shannon entropy. Generalizing to Tsallis entropy (with parameter $\alpha$) yields the $\alpha$-softmax or entmax:
$$\text{entmax}_{\alpha}(\mathbf{z}) = \arg\max_{\mathbf{p} \in \Delta^{K-1}} \left[ \mathbf{z}^T \mathbf{p} + H_\alpha^T(\mathbf{p}) \right]$$
where $H_\alpha^T$ is Tsallis entropy. For $\alpha=1$, this recovers standard softmax. For $\alpha=2$, it gives sparsemax, which produces truly sparse probability distributions.
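Sparsemax has a closed form: it is the Euclidean projection of the logits onto the probability simplex. The sketch below follows the sorting-based projection commonly attributed to Martins & Astudillo (2016); treat it as an illustrative implementation rather than a reference one:

```python
import numpy as np

def sparsemax(z):
    """Sparsemax: Euclidean projection of z onto the probability simplex.
    Produces exact zeros for low-scoring classes."""
    z_sorted = np.sort(z)[::-1]                  # sort logits in descending order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cumsum          # classes kept in the support
    k_z = k[support][-1]                         # size of the support
    tau = (cumsum[support][-1] - 1) / k_z        # threshold
    return np.maximum(z - tau, 0.0)

def softmax(z):
    z = z - np.max(z)
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

z = np.array([1.0, 2.0, 2.5, 0.5])
print(np.round(softmax(z), 4))   # dense: every class gets some probability mass
print(sparsemax(z))              # sparse: [0, 0.25, 0.75, 0], low-scoring classes get exactly 0
```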
These theoretical connections aren't merely academic. Understanding softmax through statistical physics explains temperature scaling. The convex optimization perspective guarantees global optima. Information geometry motivates natural gradient methods. The α-softmax generalization enables learnable sparsity in attention mechanisms.
We have thoroughly explored the softmax function, the fundamental transformation enabling multi-class probabilistic classification. The key insights: softmax maps arbitrary real-valued logits onto the interior of the probability simplex; it is invariant to translation but sensitive to scale, with temperature controlling the sharpness of the distribution; stable implementations subtract the maximum logit (or work in log-softmax space) to avoid overflow and underflow; for two classes it reduces exactly to the sigmoid; and it appears throughout deep learning, from classification output layers to attention, policies, distillation, and mixture-of-experts gating.
What's Next:
With the softmax function firmly understood, we're ready to build the complete multinomial logistic regression model. The next page develops the full probabilistic framework, showing how softmax combines with linear scoring functions to create a powerful multi-class classifier with solid theoretical foundations.
You now possess a comprehensive understanding of the softmax function—from its basic definition through numerical implementation to theoretical foundations. This knowledge forms the bedrock for understanding multi-class classification, attention mechanisms, and modern deep learning architectures.