Binary classification—distinguishing between exactly two classes—is elegantly handled by logistic regression. The sigmoid function maps any real-valued score to the probability range $(0, 1)$, and the decision rule follows naturally. But the real world rarely presents such neat dichotomies.
Consider classifying handwritten digits (10 classes), identifying animal species in photographs (thousands of possibilities), or categorizing customer support tickets into departments (dozens of routing options). These multi-class classification problems demand a principled extension of our binary framework.
Enter the softmax function—the mathematical transformation that generalizes the sigmoid to arbitrary numbers of classes, enabling probabilistic interpretation and gradient-based learning for multi-class problems.
By the end of this page, you will understand the softmax function at a deep level: its derivation from maximum entropy principles, its mathematical properties, its relationship to the sigmoid, numerical stability considerations, and its role as the fundamental building block of multi-class classification systems—from simple logistic regression to deep neural networks.
Before diving into the softmax function itself, let's understand precisely why we need it—and why simpler alternatives fail.
The Constraint We Must Satisfy:
In multi-class classification with $K$ classes, we want to output a probability distribution $\mathbf{p} = (p_1, p_2, \ldots, p_K)$ over all possible classes. This probability vector must satisfy two fundamental constraints: non-negativity, $p_k \geq 0$ for every class $k$, and normalization, $\sum_{k=1}^{K} p_k = 1$.
These constraints define the $(K-1)$-dimensional probability simplex $\Delta^{K-1}$, the geometric space of all valid probability distributions over $K$ outcomes.
For $K=3$ classes, the probability simplex is a 2D triangle in 3D space with vertices at $(1,0,0)$, $(0,1,0)$, and $(0,0,1)$. Every point inside this triangle represents a valid probability distribution. The simplex sits in the hyperplane $p_1 + p_2 + p_3 = 1$ and is bounded by $p_k \geq 0$.
Why Not Simply Normalize Raw Scores?
Suppose our model produces raw scores (logits) $z_1, z_2, \ldots, z_K$ for each class. A naive approach might be:
$$p_k = \frac{z_k}{\sum_{j=1}^{K} z_j}$$
This simple normalization seems reasonable but fails catastrophically: logits can be negative or can sum to zero, so the resulting 'probabilities' may be negative, undefined, or fail to sum to one.
Examples of Failure:
Consider logits $\mathbf{z} = (2, -3, 1)$: $$p_1 = \frac{2}{2-3+1} = \frac{2}{0} \rightarrow \text{undefined}$$
Or with $\mathbf{z} = (1, -2, -3)$: $$p_1 = \frac{1}{1-2-3} = \frac{1}{-4} = -0.25 \rightarrow \text{negative 'probability'}$$
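A quick numerical check makes these failure modes concrete. This is a toy sketch; the helper name `normalize_naive` is just for illustration:

```python
import numpy as np

def normalize_naive(z):
    """Naive normalization: divide each raw score by the sum of all scores."""
    return z / np.sum(z)

with np.errstate(divide="ignore", invalid="ignore"):
    # Logits summing to zero: division by zero
    print(normalize_naive(np.array([2.0, -3.0, 1.0])))   # inf / nan entries
    # Negative logit sum: 'probabilities' outside [0, 1]
    print(normalize_naive(np.array([1.0, -2.0, -3.0])))  # [-0.25, 0.5, 0.75]
```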
Why Not Per-class Sigmoids?
Another intuitive approach: apply the sigmoid function independently to each logit:
$$p_k = \sigma(z_k) = \frac{1}{1 + e^{-z_k}}$$
This ensures $p_k \in (0, 1)$—satisfying non-negativity—but violates normalization:
$$\sum_{k=1}^{K} \sigma(z_k) \neq 1 \quad \text{(in general)}$$
Example:
With $\mathbf{z} = (0, 0, 0)$: $$\sigma(0) = 0.5 \quad \Rightarrow \quad \sum_{k=1}^{3} \sigma(z_k) = 1.5 \neq 1$$
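A short check (a sketch using only NumPy) confirms that independent sigmoids generally do not sum to 1:

```python
import numpy as np

def sigmoid(z):
    """Elementwise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))

for z in [np.array([0.0, 0.0, 0.0]), np.array([2.0, -3.0, 1.0])]:
    p = sigmoid(z)
    print(f"z={z}, sigmoid(z)={np.round(p, 3)}, sum={p.sum():.3f}")
# Neither sum equals 1 (1.5 and about 1.66): fine for multi-label, wrong for multi-class.
```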
Independent sigmoids are used in multi-label classification where an input can belong to multiple classes simultaneously. But for multi-class classification with mutually exclusive classes, we need a function that respects the probability simplex constraint.
Don't confuse these settings. Multi-class: exactly one correct class (use softmax). Multi-label: possibly multiple correct classes (use independent sigmoids). The constraint structure differs fundamentally, and using the wrong function is a common error.
The softmax function isn't an arbitrary choice—it emerges naturally from several foundational principles. Let's explore multiple derivations that reveal different facets of this elegant transformation.
Definition: The Softmax Function
Given a vector of real-valued scores (logits) $\mathbf{z} = (z_1, z_2, \ldots, z_K) \in \mathbb{R}^K$, the softmax function $\text{softmax}: \mathbb{R}^K \rightarrow \Delta^{K-1}$ produces a probability distribution:
$$\text{softmax}(\mathbf{z})_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}} \quad \text{for } k = 1, 2, \ldots, K$$
We often write this compactly as:
$$p_k = \frac{\exp(z_k)}{\sum_{j=1}^{K} \exp(z_j)}$$
The exponential function $e^z$ is the key ingredient. It maps any real number to a strictly positive value: $e^z > 0$ for all $z \in \mathbb{R}$. This guarantees non-negativity of all probabilities, and normalization then ensures they sum to one.
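Translating the definition directly into NumPy (a minimal sketch; numerical stability is addressed later on this page):

```python
import numpy as np

def softmax(z):
    """Direct implementation of the softmax definition (not yet numerically stable)."""
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

z = np.array([2.0, -3.0, 1.0])       # logits that broke naive normalization
p = softmax(z)
print(p)                              # all entries strictly positive
print(p.sum())                        # 1.0 (up to floating-point rounding)
print(np.argmax(p) == np.argmax(z))   # True: softmax preserves the ranking of logits
```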
Derivation 1: Maximum Entropy Principle
The softmax function can be derived as the solution to a constrained optimization problem rooted in information theory.
Problem Statement: Find the probability distribution $\mathbf{p}$ that maximizes entropy subject to expected value constraints.
Given features $\mathbf{x}$ and model parameters $\boldsymbol{\theta}_k$ for each class, we want:
$$\max_{\mathbf{p} \in \Delta^{K-1}} H(\mathbf{p}) = -\sum_{k=1}^{K} p_k \log p_k$$
subject to: $$\mathbb{E}[\phi_k(\mathbf{x})] = \sum_{k=1}^{K} p_k \cdot \phi_k(\mathbf{x}) = c_k \quad \text{for each } k$$
Using Lagrange multipliers $\lambda_k$ for the feature constraints and $\mu$ for normalization:
$$\mathcal{L} = -\sum_k p_k \log p_k + \mu\left(1 - \sum_k p_k\right) + \sum_k \lambda_k\left(c_k - p_k \phi_k\right)$$
Taking derivatives and solving:
$$\frac{\partial \mathcal{L}}{\partial p_k} = -\log p_k - 1 - \mu - \lambda_k \phi_k = 0$$
$$\Rightarrow p_k = \exp(-1 - \mu - \lambda_k \phi_k) = \frac{\exp(-\lambda_k \phi_k)}{\exp(1 + \mu)}$$
Normalizing to sum to 1:
$$p_k = \frac{\exp(z_k)}{\sum_j \exp(z_j)}$$
where we identify $z_k = -\lambda_k \phi_k$; in multinomial logistic regression this score is parameterized linearly as $z_k = \boldsymbol{\theta}_k^T \mathbf{x}$, the linear score for class $k$.
The maximum entropy interpretation reveals that softmax produces the least biased probability distribution consistent with our model's linear scores.
Derivation 2: Generalization of the Sigmoid
Recall that for binary classification with sigmoid:
$$P(y=1|\mathbf{x}) = \sigma(z) = \frac{1}{1 + e^{-z}} = \frac{e^z}{e^z + 1}$$
$$P(y=0|\mathbf{x}) = 1 - \sigma(z) = \frac{1}{1 + e^z}$$
Now consider the ratio:
$$\frac{P(y=1|\mathbf{x})}{P(y=0|\mathbf{x})} = \frac{e^z/(e^z + 1)}{1/(1 + e^z)} = e^z$$
This suggests that the log-odds of class 1 vs. class 0 equals the score $z$:
$$\log \frac{P(y=1|\mathbf{x})}{P(y=0|\mathbf{x})} = z$$
Extending to $K$ classes:
For multi-class, we want the log-odds of class $k$ vs. a reference class (say, class $K$) to equal some score:
$$\log \frac{P(y=k|\mathbf{x})}{P(y=K|\mathbf{x})} = z_k - z_K$$
This implies: $$P(y=k|\mathbf{x}) = P(y=K|\mathbf{x}) \cdot e^{z_k - z_K} = P(y=K|\mathbf{x}) \cdot \frac{e^{z_k}}{e^{z_K}}$$
Summing over all classes and using normalization: $$1 = \sum_{k=1}^{K} P(y=k|\mathbf{x}) = P(y=K|\mathbf{x}) \cdot \frac{\sum_k e^{z_k}}{e^{z_K}}$$
Solving for $P(y=K|\mathbf{x})$ and substituting back:
$$P(y=k|\mathbf{x}) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$
Thus, softmax is the unique function that generalizes sigmoid while maintaining log-linear relationships between class probabilities.
Derivation 3: Exponential Family Connection
The softmax function arises naturally when modeling discrete distributions from the exponential family perspective.
The categorical distribution with parameter $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_K)$ can be written in exponential family form:
$$P(y=k|\boldsymbol{\pi}) = \exp\left(\eta_k - A(\boldsymbol{\eta})\right)$$
where $\eta_k$ is the natural parameter for class $k$ and $A(\boldsymbol{\eta}) = \log \sum_j e^{\eta_j}$ is the log-partition function ensuring normalization.
To recover probabilities from natural parameters:
$$\pi_k = \exp\left(\eta_k - A(\boldsymbol{\eta})\right) = \frac{e^{\eta_k}}{\sum_j e^{\eta_j}}$$
This is precisely the softmax function! In multinomial logistic regression, we parameterize $\eta_k = \boldsymbol{\theta}_k^T \mathbf{x}$, making the natural parameter a linear function of features.
The exponential family derivation connects softmax to the broader theory of generalized linear models and explains why maximum likelihood estimation leads to well-behaved convex optimization problems.
The softmax function emerges independently from maximum entropy, generalization of sigmoid, and exponential family theory. This convergence of derivations suggests that softmax isn't merely convenient—it's mathematically canonical for multi-class probability modeling.
Understanding the softmax function requires examining its mathematical properties in detail. These properties explain its behavior and inform practical implementation.
Property 1: Invariance to Translation
For any constant $c \in \mathbb{R}$:
$$\text{softmax}(\mathbf{z} + c \cdot \mathbf{1}) = \text{softmax}(\mathbf{z})$$
where $\mathbf{1} = (1, 1, \ldots, 1)$.
Proof: $$\text{softmax}(\mathbf{z} + c)_k = \frac{e^{z_k + c}}{\sum_j e^{z_j + c}} = \frac{e^c \cdot e^{z_k}}{e^c \cdot \sum_j e^{z_j}} = \frac{e^{z_k}}{\sum_j e^{z_j}} = \text{softmax}(\mathbf{z})_k$$
Implication: Adding the same constant to all logits doesn't change the output probabilities. This property is crucial for numerical stability (as we'll see) and means softmax only cares about relative differences between logits.
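The proof is easy to verify numerically; a small sketch (reusing the direct definition from above):

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

z = np.array([1.0, 2.0, 3.0])
for c in [-100.0, 0.0, 7.5]:
    print(c, np.allclose(softmax(z + c), softmax(z)))  # True for every constant shift
```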
Property 2: Sensitivity to Scale
For $\alpha > 0$, consider the temperature-scaled softmax:
$$\text{softmax}(\alpha \mathbf{z})_k = \frac{e^{\alpha z_k}}{\sum_j e^{\alpha z_j}}$$
Mathematical analysis:
Let $z_\text{max} = \max_k z_k$ and assume it's unique. As $\alpha \to \infty$:
$$\text{softmax}(\alpha \mathbf{z})_k = \frac{e^{\alpha z_k}}{\sum_j e^{\alpha z_j}} = \frac{e^{\alpha(z_k - z_\text{max})}}{\sum_j e^{\alpha(z_j - z_\text{max})}}$$
For $z_k < z_\text{max}$, $e^{\alpha(z_k - z_\text{max})} \to 0$. For $z_k = z_\text{max}$, $e^{\alpha(z_k - z_\text{max})} = 1$.
Thus, as $\alpha \to \infty$ the argmax class gets probability 1 and all others get 0. Conversely, as $\alpha \to 0$, every scaled logit approaches 0 and the distribution approaches the uniform distribution.
Temperature scaling replaces $\mathbf{z}$ with $\mathbf{z}/\tau$, i.e., $\alpha = 1/\tau$, and is widely used in neural networks. During training, the standard temperature ($\tau=1$) is used. At inference, a higher temperature (e.g., $\tau=2$) softens predictions for uncertainty estimation or diverse sampling in generative models, while a lower temperature (e.g., $\tau=0.5$) sharpens predictions for more confident decisions.
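The limiting behavior in both directions is easy to see numerically; a small sketch scaling the same logits by different temperatures:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)          # subtract the max (allowed by translation invariance)
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

z = np.array([1.0, 2.0, 5.0])
for tau in [0.1, 0.5, 1.0, 2.0, 100.0]:
    print(f"tau={tau:>6}: {np.round(softmax(z / tau), 3)}")
# tau -> 0 approaches one-hot on the argmax; tau -> infinity approaches uniform (1/3 each)
```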
Property 3: Gradient Structure
The Jacobian of the softmax function has a beautiful form essential for backpropagation.
Let $\mathbf{p} = \text{softmax}(\mathbf{z})$. The partial derivatives are:
$$\frac{\partial p_i}{\partial z_j} = \begin{cases} p_i(1 - p_i) & \text{if } i = j \\ -p_i p_j & \text{if } i \neq j \end{cases}$$
This can be written compactly as:
$$\frac{\partial \mathbf{p}}{\partial \mathbf{z}} = \text{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^T$$
Derivation:
Using the quotient rule for $p_i = \frac{e^{z_i}}{\sum_k e^{z_k}}$:
$$\frac{\partial p_i}{\partial z_j} = \frac{\mathbf{1}_{i=j} \cdot e^{z_i} \cdot \sum_k e^{z_k} - e^{z_i} \cdot e^{z_j}}{\left(\sum_k e^{z_k}\right)^2}$$
For $i = j$: $$\frac{\partial p_i}{\partial z_i} = \frac{e^{z_i}}{\sum_k e^{z_k}} - \frac{e^{z_i} \cdot e^{z_i}}{\left(\sum_k e^{z_k}\right)^2} = p_i - p_i^2 = p_i(1 - p_i)$$
For $i \neq j$: $$\frac{\partial p_i}{\partial z_j} = - \frac{e^{z_i} \cdot e^{z_j}}{\left(\sum_k e^{z_k}\right)^2} = -p_i p_j$$
Interpretation: The diagonal terms show each class probability responds positively to its own logit (with diminishing returns as $p_i \to 1$). Off-diagonal terms show competition: increasing one logit decreases all other probabilities.
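The Jacobian formula can be checked against finite differences; a small sketch (the step size `eps` is an arbitrary choice):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

def softmax_jacobian(z):
    """Analytic Jacobian: diag(p) - p p^T."""
    p = softmax(z)
    return np.diag(p) - np.outer(p, p)

def numerical_jacobian(z, eps=1e-6):
    """Central-difference approximation of d softmax / d z."""
    K = len(z)
    J = np.zeros((K, K))
    for j in range(K):
        e = np.zeros(K)
        e[j] = eps
        J[:, j] = (softmax(z + e) - softmax(z - e)) / (2 * eps)
    return J

z = np.array([0.5, -1.0, 2.0])
print(np.allclose(softmax_jacobian(z), numerical_jacobian(z), atol=1e-6))  # True
```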
Property 4: Convexity and Non-Injectivity
Softmax is the gradient of the convex log-sum-exp function $\text{LSE}(\mathbf{z}) = \log\left(\sum_k e^{z_k}\right)$:
$$\nabla \text{LSE}(\mathbf{z}) = \text{softmax}(\mathbf{z})$$
This means softmax maps $\mathbb{R}^K$ onto the interior of the probability simplex (never onto the boundary where some $p_k = 0$).
Softmax is not injective, however: by translation invariance,
$$\text{softmax}(\mathbf{z}) = \text{softmax}(\mathbf{z} + c\mathbf{1}) \quad \forall c \in \mathbb{R}$$
To achieve identifiability in parameter estimation, we typically fix one class's logit to zero (e.g., $z_K = 0$), reducing the parameter space from $\mathbb{R}^K$ to $\mathbb{R}^{K-1}$.
| Property | Mathematical Form | Practical Implication |
|---|---|---|
| Translation invariance | $\text{softmax}(\mathbf{z} + c) = \text{softmax}(\mathbf{z})$ | Only relative logit differences matter; enables numerical stability tricks |
| Scale sensitivity | $\alpha \to \infty \Rightarrow$ argmax; $\alpha \to 0 \Rightarrow$ uniform | Temperature controls sharpness of distribution |
| Output range | $p_k \in (0, 1)$, never exactly 0 or 1 | All classes always have some probability mass |
| Gradient structure | $\frac{\partial p_i}{\partial z_j} = p_i(\delta_{ij} - p_j)$ | Enables efficient backpropagation; shows class competition |
| Smoothness | Infinitely differentiable | Well-behaved for gradient-based optimization |
The softmax function, despite its mathematical elegance, poses serious numerical challenges in practice. Understanding and addressing these issues is essential for robust implementations.
The Overflow Problem
Consider computing softmax for $\mathbf{z} = (1000, 1001, 1002)$:
$$e^{1000} \approx 1.97 \times 10^{434}$$
This exceeds the maximum representable float64 value ($\approx 1.8 \times 10^{308}$), causing overflow to infinity.
The Underflow Problem
Conversely, for $\mathbf{z} = (-1000, -1001, -1002)$:
$$e^{-1000} \approx 5.08 \times 10^{-435}$$
This is smaller than the smallest positive normalized float64 value ($\approx 2.2 \times 10^{-308}$), causing underflow to zero. If every exponential underflows, the sum in the denominator becomes zero, leading to division by zero.
Numerical instability often manifests silently as NaN (Not a Number) values that propagate through computations, causing training to diverge without obvious error messages. Always implement numerically stable versions of softmax.
The Stable Softmax Solution
Using translation invariance, we subtract the maximum logit before exponentiating:
$$\text{softmax}_{\text{stable}}(\mathbf{z})_k = \frac{e^{z_k - \max(\mathbf{z})}}{\sum_{j=1}^{K} e^{z_j - \max(\mathbf{z})}}$$
This transformation ensures that the largest shifted logit is exactly $0$, so every exponential lies in $(0, 1]$ (no overflow), and the denominator is at least $1$ (no division by zero from underflow).
Correctness proof (by translation invariance): $$\frac{e^{z_k - m}}{\sum_j e^{z_j - m}} = \frac{e^{-m} \cdot e^{z_k}}{e^{-m} \cdot \sum_j e^{z_j}} = \frac{e^{z_k}}{\sum_j e^{z_j}}$$
```python
import numpy as np

def softmax_naive(z):
    """
    Naive softmax implementation - DO NOT USE IN PRODUCTION.
    Fails for large or small logit values.
    """
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

def softmax_stable(z):
    """
    Numerically stable softmax implementation.
    Subtracts max to prevent overflow while preserving output.

    Args:
        z: Array of logits, shape (K,) or (N, K)

    Returns:
        Probability distribution, same shape as input
    """
    # Handle both 1D and 2D inputs
    z = np.asarray(z)
    if z.ndim == 1:
        z_max = np.max(z)
        exp_z = np.exp(z - z_max)
        return exp_z / np.sum(exp_z)
    else:
        # For batched inputs, max over class dimension
        z_max = np.max(z, axis=-1, keepdims=True)
        exp_z = np.exp(z - z_max)
        return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

def softmax_temperature(z, temperature=1.0):
    """
    Temperature-scaled softmax.

    Args:
        z: Logits
        temperature: Scaling factor (higher = softer distribution)

    Returns:
        Temperature-scaled probability distribution
    """
    return softmax_stable(z / temperature)

# Demonstration of numerical issues
print("=== Numerical Stability Demonstration ===")
print()

# Case 1: Normal logits - both work
z_normal = np.array([1.0, 2.0, 3.0])
print(f"Normal logits: {z_normal}")
print(f"Naive softmax: {softmax_naive(z_normal)}")
print(f"Stable softmax: {softmax_stable(z_normal)}")
print()

# Case 2: Large logits - naive fails
z_large = np.array([1000.0, 1001.0, 1002.0])
print(f"Large logits: {z_large}")
print(f"Naive softmax: {softmax_naive(z_large)}")    # [nan, nan, nan] or [0, 0, nan]
print(f"Stable softmax: {softmax_stable(z_large)}")  # Correct output
print()

# Case 3: Very negative logits - naive may fail
z_negative = np.array([-1000.0, -1001.0, -1002.0])
print(f"Negative logits: {z_negative}")
print(f"Naive softmax: {softmax_naive(z_negative)}")    # May give nan
print(f"Stable softmax: {softmax_stable(z_negative)}")  # Correct output
print()

# Temperature scaling demonstration
print("=== Temperature Scaling ===")
z = np.array([1.0, 2.0, 5.0])
print(f"Logits: {z}")
for temp in [0.5, 1.0, 2.0, 5.0]:
    print(f"T={temp}: {softmax_temperature(z, temp)}")
```

Log-Softmax for Enhanced Stability
When computing cross-entropy loss, we often need $\log(\text{softmax}(\mathbf{z})_k) = \log p_k$. Computing softmax then taking log incurs unnecessary numerical error. Instead, compute log-softmax directly:
$$\log\text{-softmax}(\mathbf{z})_k = z_k - \text{LSE}(\mathbf{z})$$
where $\text{LSE}(\mathbf{z}) = \log\left(\sum_j e^{z_j}\right)$ is the log-sum-exp.
Stable log-sum-exp: $$\text{LSE}(\mathbf{z}) = m + \log\left(\sum_j e^{z_j - m}\right)$$ where $m = \max(\mathbf{z})$.
This formulation never exponentiates large positive numbers (the shifted exponents are at most $0$), avoids the lossy exp-then-log round trip, and preserves precision for very negative log-probabilities. It is available directly as torch.nn.functional.log_softmax and tf.nn.log_softmax.
Modern deep learning frameworks (PyTorch, TensorFlow, JAX) implement numerically stable softmax and log-softmax. Use torch.nn.functional.softmax, tf.nn.softmax, or jax.nn.softmax instead of implementing manually. When combining softmax with cross-entropy, use fused operations like F.cross_entropy (PyTorch) or tf.nn.softmax_cross_entropy_with_logits for best numerical behavior.
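A minimal NumPy sketch of the stable log-sum-exp and log-softmax formulas above; in practice, prefer the framework built-ins just mentioned:

```python
import numpy as np

def logsumexp(z):
    """Stable log-sum-exp: m + log(sum(exp(z - m))) with m = max(z)."""
    m = np.max(z)
    return m + np.log(np.sum(np.exp(z - m)))

def log_softmax(z):
    """Stable log-softmax: z_k - LSE(z), computed without an explicit softmax."""
    return z - logsumexp(z)

z = np.array([1000.0, 1001.0, 1002.0])   # would overflow a naive implementation
print(log_softmax(z))                     # finite log-probabilities
print(np.exp(log_softmax(z)).sum())       # 1.0: exponentiating recovers softmax
```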
Softmax and sigmoid are intimately related—the sigmoid is a special case of softmax for two classes. Understanding this connection deepens intuition and clarifies when to use each.
Binary Softmax Equals Sigmoid
Consider softmax with $K=2$ classes and logits $(z_1, z_2)$:
$$p_1 = \frac{e^{z_1}}{e^{z_1} + e^{z_2}}$$
Divide numerator and denominator by $e^{z_1}$:
$$p_1 = \frac{1}{1 + e^{z_2 - z_1}} = \frac{1}{1 + e^{-z}} = \sigma(z)$$
where $z = z_1 - z_2$ is the log-odds of class 1 vs. class 2.
Key insight: For binary classification, we only need one logit (the difference), not two. The sigmoid function compactly represents this.
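A quick numerical confirmation of this equivalence (a sketch; the logit difference plays the role of the binary score $z$):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
for _ in range(3):
    z1, z2 = rng.normal(size=2)
    p1_softmax = softmax(np.array([z1, z2]))[0]
    p1_sigmoid = sigmoid(z1 - z2)               # single logit: the difference
    print(np.isclose(p1_softmax, p1_sigmoid))   # True
```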
Parameter Efficiency Comparison
For a linear model with $d$ features:
| Setting | Parameters Needed | Why |
|---|---|---|
| Binary + Sigmoid | $d + 1$ | Single weight vector + bias |
| Binary + Softmax | $2(d + 1)$ | Two weight vectors + biases |
| Binary + Softmax (constrained) | $d + 1$ | Set $\mathbf{w}_2 = \mathbf{0}$ |
The softmax formulation is overparameterized for binary classification due to translation invariance. In practice, binary problems are handled with the sigmoid and a single weight vector, which is equivalent to a two-class softmax with the second class's weights constrained to zero.
Reparameterization:
In softmax with $K$ classes, we can always set $\mathbf{w}_K = \mathbf{0}$ and $b_K = 0$ without loss of generality. This gives the equivalent model:
$$z_k = \mathbf{w}_k^T \mathbf{x} + b_k \quad \text{for } k = 1, \ldots, K-1$$ $$z_K = 0$$
Then $p_k$ represents the probability relative to class $K$, which serves as the reference.
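The equivalence of the full and reference-class parameterizations can be checked directly (a sketch: shifting all logits by $-z_K$ pins the last logit to zero without changing the output):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

z_full = np.array([1.3, -0.7, 2.1, 0.4])   # unconstrained logits for K = 4 classes
z_ref = z_full - z_full[-1]                # reparameterized: last logit is exactly 0
print(z_ref[-1])                           # 0.0
print(np.allclose(softmax(z_full), softmax(z_ref)))  # True: same distribution
```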
Using softmax for multi-label classification (where an image might be labeled both 'sunny' and 'scenic') forces probabilities to compete. If $p(\text{sunny}) = 0.9$, then $p(\text{scenic})$ is constrained to be at most $0.1$. Use independent sigmoids instead, allowing both to be high simultaneously.
The softmax function appears throughout modern deep learning, far beyond simple classification layers. Understanding its architectural roles reveals the function's versatility.
Role 1: Classification Output Layer
The most common use—the final layer of a classification network:
$$\text{Input} \to \text{Hidden Layers} \to \mathbf{z} \in \mathbb{R}^K \to \text{softmax} \to \mathbf{p} \in \Delta^{K-1}$$
The network learns feature representations in hidden layers; the final linear layer projects to $K$ logits; softmax converts to probabilities.
Role 2: Attention Mechanisms
In Transformers and attention-based models, softmax computes attention weights:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$
Here, softmax ensures attention weights over keys sum to 1, creating a proper weighted average of values. The scaling by $\sqrt{d_k}$ prevents softmax saturation in high dimensions.
Role 3: Reinforcement Learning Policies
In policy gradient methods, the policy network outputs action probabilities:
$$\pi_\theta(a|s) = \text{softmax}(f_\theta(s))_a$$
Softmax ensures a valid probability distribution over the discrete action space, enabling sampling during exploration and gradient computation for policy updates.
Role 4: Knowledge Distillation
In model compression, a large 'teacher' network's soft predictions guide a smaller 'student':
$$\mathbf{p}_{\text{soft}} = \text{softmax}(\mathbf{z} / T)$$
High temperature $T$ produces smoother distributions that encode more information about inter-class relationships than hard labels.
Role 5: Mixture of Experts
In MoE architectures, softmax computes gating weights determining which experts process each input:
$$\mathbf{g} = \text{softmax}(W_g \cdot \mathbf{x})$$ $$\text{Output} = \sum_i g_i \cdot \text{Expert}_i(\mathbf{x})$$
The softmax ensures expert contributions sum to 1, maintaining interpretability as mixture weights.
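A toy sketch of softmax gating over experts; the experts here are arbitrary random linear maps chosen only for illustration:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

rng = np.random.default_rng(0)
d, n_experts = 4, 3
x = rng.normal(size=d)                          # one input vector
W_g = rng.normal(size=(n_experts, d))           # gating weights
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]  # toy linear experts

g = softmax(W_g @ x)                            # gating weights over experts
output = sum(g_i * (E @ x) for g_i, E in zip(g, experts))
print(np.round(g, 3), g.sum())                  # mixture weights and their sum (1.0)
print(output.shape)                             # (4,)
```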
Modern architectures employ softmax variants: Sparse softmax zeros out small probabilities for efficiency. Gumbel-softmax enables differentiable discrete sampling. Entmax (α-softmax) provides learnable sparsity. Softmax with relative positions handles sequence length variation in attention.
```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """
    Simplified attention mechanism using softmax.

    Args:
        Q: Query matrix (seq_len_q, d_k)
        K: Key matrix (seq_len_k, d_k)
        V: Value matrix (seq_len_k, d_v)

    Returns:
        Attention output: (seq_len_q, d_v)
    """
    d_k = Q.shape[-1]

    # Compute attention scores
    scores = Q @ K.T / np.sqrt(d_k)

    # Softmax to get attention weights (each row sums to 1)
    weights = softmax_stable(scores)  # Shape: (seq_len_q, seq_len_k)

    # Weighted sum of values
    output = weights @ V
    return output, weights

def gumbel_softmax(logits, temperature=1.0):
    """
    Gumbel-softmax for differentiable discrete sampling.
    Produces approximately one-hot vectors that are differentiable.

    Args:
        logits: Unnormalized log probabilities
        temperature: Controls discreteness (lower = more discrete)

    Returns:
        Approximately one-hot vector
    """
    # Sample from Gumbel(0, 1)
    gumbels = -np.log(-np.log(np.random.uniform(size=logits.shape) + 1e-20) + 1e-20)

    # Add Gumbel noise and apply temperature-scaled softmax
    y = softmax_stable((logits + gumbels) / temperature)
    return y

def softmax_stable(z):
    """Numerically stable softmax (same as before)."""
    z = np.asarray(z)
    if z.ndim == 1:
        z_max = np.max(z)
        exp_z = np.exp(z - z_max)
        return exp_z / np.sum(exp_z)
    else:
        z_max = np.max(z, axis=-1, keepdims=True)
        exp_z = np.exp(z - z_max)
        return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

# Demo: Attention mechanism
print("=== Attention Mechanism Demo ===")
np.random.seed(42)
Q = np.random.randn(3, 4)  # 3 queries, dimension 4
K = np.random.randn(5, 4)  # 5 keys
V = np.random.randn(5, 6)  # 5 values, dimension 6

output, weights = scaled_dot_product_attention(Q, K, V)
print(f"Query shape: {Q.shape}")
print(f"Attention weights shape: {weights.shape}")
print(f"Weights sum per query: {weights.sum(axis=1)}")  # Should be [1, 1, 1]
print(f"Output shape: {output.shape}")
print()

# Demo: Gumbel-softmax for discrete sampling
print("=== Gumbel-Softmax Demo ===")
logits = np.array([1.0, 2.0, 5.0])
print(f"Logits: {logits}")
print(f"Regular softmax: {softmax_stable(logits)}")
print(f"Gumbel softmax (T=1.0): {gumbel_softmax(logits, 1.0)}")
print(f"Gumbel softmax (T=0.1): {gumbel_softmax(logits, 0.1)}")  # More discrete
```

The softmax function connects to several fundamental concepts in mathematics and machine learning. These connections provide deeper understanding and suggest generalizations.
Connection to Boltzmann Distribution
In statistical physics, the probability of a system being in state $k$ with energy $E_k$ at temperature $T$ is:
$$P(\text{state } k) = \frac{e^{-E_k / (k_B T)}}{\sum_j e^{-E_j / (k_B T)}}$$
This is the Boltzmann distribution—identical to softmax with logits $z_k = -E_k / (k_B T)$.
The analogy: the logits play the role of scaled negative energies, $z_k \leftrightarrow -E_k / (k_B T)$, so high-scoring classes correspond to low-energy states, and the softmax temperature plays the role of physical temperature.
This physical interpretation explains temperature scaling: at low temperature, the system 'freezes' into the lowest energy state (highest logit class); at high temperature, it explores all states equally.
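A small numerical illustration of the Boltzmann correspondence; the energies and temperatures are arbitrary toy values, with units chosen so that $k_B = 1$:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

E = np.array([1.0, 2.0, 5.0])        # toy state energies (k_B = 1)
for T in [0.1, 1.0, 10.0]:
    p = softmax(-E / T)              # Boltzmann distribution = softmax of -E/T
    print(f"T={T:>5}: {np.round(p, 3)}")
# Low T concentrates on the lowest-energy state; high T approaches uniform.
```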
Connection to KL Divergence and Cross-Entropy
The softmax output solves a KL-regularized linear objective. Up to an additive constant, $\sum_k p_k \log p_k$ is the KL divergence from the uniform distribution, and:
$$\text{softmax}(\mathbf{z}) = \arg\min_{\mathbf{p} \in \Delta^{K-1}} \left[ \sum_k p_k \log p_k - \mathbf{z}^T \mathbf{p} \right]$$
This is equivalent to: $$\text{softmax}(\mathbf{z}) = \arg\max_{\mathbf{p}} \left[ H(\mathbf{p}) + \mathbf{z}^T \mathbf{p} \right]$$
where $H(\mathbf{p}) = -\sum_k p_k \log p_k$ is entropy.
Interpretation: Softmax produces the maximum entropy distribution subject to the constraint that expected logits equal a specific value. It's the 'most uncertain' distribution consistent with our belief encoded in logits.
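A numerical sanity check of this variational characterization (a sketch: random points on the simplex should never beat the softmax output on the objective $H(\mathbf{p}) + \mathbf{z}^T \mathbf{p}$):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

def objective(p, z):
    """H(p) + z^T p, with 0 log 0 treated as 0."""
    with np.errstate(divide="ignore", invalid="ignore"):
        entropy = -np.sum(np.where(p > 0, p * np.log(p), 0.0))
    return entropy + z @ p

rng = np.random.default_rng(0)
z = np.array([0.5, -1.0, 2.0])
best = objective(softmax(z), z)
random_points = rng.dirichlet(np.ones(3), size=10_000)   # random distributions on the simplex
print(all(objective(p, z) <= best + 1e-12 for p in random_points))  # True
```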
Connection to Convex Optimization
The softmax function is the gradient of the log-sum-exp (LSE) function:
$$\text{LSE}(\mathbf{z}) = \log\left(\sum_k e^{z_k}\right)$$
$$\nabla \text{LSE}(\mathbf{z}) = \text{softmax}(\mathbf{z})$$
Since LSE is convex, its gradient (softmax) maps $\mathbb{R}^K$ monotonically onto the simplex interior. This gradient relationship underlies the equivalence between maximum likelihood estimation and convex optimization for softmax regression.
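This gradient identity is also easy to verify numerically; a sketch using central finite differences:

```python
import numpy as np

def softmax(z):
    z_shift = z - np.max(z)
    exp_z = np.exp(z_shift)
    return exp_z / np.sum(exp_z)

def lse(z):
    """Stable log-sum-exp."""
    m = np.max(z)
    return m + np.log(np.sum(np.exp(z - m)))

def numerical_grad(f, z, eps=1e-6):
    g = np.zeros_like(z)
    for i in range(len(z)):
        e = np.zeros_like(z)
        e[i] = eps
        g[i] = (f(z + e) - f(z - e)) / (2 * eps)
    return g

z = np.array([0.3, -1.2, 2.0, 0.7])
print(np.allclose(numerical_grad(lse, z), softmax(z), atol=1e-6))  # True
```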
Connection to Information Geometry
In information geometry, the probability simplex is viewed as a Riemannian manifold with the Fisher information metric. The softmax parameterization:
$$\mathbf{p} = \text{softmax}(\boldsymbol{\eta})$$
defines a natural coordinate system in which the logits $\boldsymbol{\eta}$ are the natural (canonical) parameters of the categorical distribution, and the Fisher information matrix is $\text{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^T$, exactly the softmax Jacobian.
This geometric perspective explains why optimization in the logit space (before softmax) is often more stable than directly optimizing probabilities.
Generalization: The α-Softmax and Tsallis Entropy
Softmax maximizes Shannon entropy. Generalizing to Tsallis entropy (with parameter $\alpha$) yields the $\alpha$-softmax or entmax:
$$\text{entmax}_{\alpha}(\mathbf{z}) = \arg\max_{\mathbf{p} \in \Delta^{K-1}} \left[ \mathbf{z}^T \mathbf{p} + H_\alpha^T(\mathbf{p}) \right]$$
where $H_\alpha^T$ is Tsallis entropy. For $\alpha=1$, this recovers standard softmax. For $\alpha=2$, it gives sparsemax, which produces truly sparse probability distributions.
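Sparsemax has a closed form: it is the Euclidean projection of the logits onto the probability simplex. The sketch below follows the sorting-based projection commonly attributed to Martins & Astudillo (2016); treat it as an illustrative implementation rather than a reference one:

```python
import numpy as np

def sparsemax(z):
    """Sparsemax: Euclidean projection of z onto the probability simplex.
    Produces exact zeros for low-scoring classes."""
    z_sorted = np.sort(z)[::-1]                  # sort logits in descending order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cumsum          # classes kept in the support
    k_z = k[support][-1]                         # size of the support
    tau = (cumsum[support][-1] - 1) / k_z        # threshold
    return np.maximum(z - tau, 0.0)

def softmax(z):
    z = z - np.max(z)
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

z = np.array([1.0, 2.0, 2.5, 0.5])
print(np.round(softmax(z), 4))   # dense: every class gets some probability mass
print(sparsemax(z))              # sparse: [0, 0.25, 0.75, 0], low-scoring classes get exactly 0
```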
These theoretical connections aren't merely academic. Understanding softmax through statistical physics explains temperature scaling. The convex optimization perspective guarantees global optima. Information geometry motivates natural gradient methods. The α-softmax generalization enables learnable sparsity in attention mechanisms.
We have thoroughly explored the softmax function, the fundamental transformation enabling multi-class probabilistic classification. The key insights: softmax maps arbitrary real-valued logits onto the interior of the probability simplex; it is invariant to translation but sensitive to scale, with temperature controlling the sharpness of the distribution; stable implementations subtract the maximum logit (or work in log-softmax space) to avoid overflow and underflow; for two classes it reduces exactly to the sigmoid; and it appears throughout deep learning, from classification output layers to attention, policies, distillation, and mixture-of-experts gating.
What's Next:
With the softmax function firmly understood, we're ready to build the complete multinomial logistic regression model. The next page develops the full probabilistic framework, showing how softmax combines with linear scoring functions to create a powerful multi-class classifier with solid theoretical foundations.
You now possess a comprehensive understanding of the softmax function—from its basic definition through numerical implementation to theoretical foundations. This knowledge forms the bedrock for understanding multi-class classification, attention mechanisms, and modern deep learning architectures.