When neural networks learn to classify images, detect spam, or predict the next word in a sentence, they are fundamentally learning to assign probabilities to discrete outcomes. But here lies a crucial question: how do we measure the quality of a probability distribution?
Enter cross-entropy loss—arguably the most important loss function in modern machine learning. It doesn't merely measure whether predictions are 'right' or 'wrong'; it quantifies the information-theoretic cost of using predicted probabilities instead of the true distribution. This elegant perspective, rooted in Claude Shannon's foundational work on information theory, provides deep insight into why cross-entropy is so effective for training classifiers.
This page will take you on a rigorous journey through cross-entropy loss, from its information-theoretic origins to its role as the cornerstone of neural network training for classification tasks.
By the end of this page, you will understand: (1) the information-theoretic foundations of cross-entropy, (2) its derivation from maximum likelihood estimation, (3) why it produces superior gradients compared to alternatives, (4) numerical stability techniques essential for implementation, and (5) its deep connections to KL divergence and the broader landscape of probabilistic learning.
To truly understand cross-entropy, we must first grasp the foundational concepts of information theory. Claude Shannon's 1948 paper 'A Mathematical Theory of Communication' introduced a revolutionary way to quantify information, uncertainty, and the cost of communication.
Entropy quantifies the average uncertainty or 'information content' in a probability distribution. For a discrete random variable $X$ with possible outcomes $x_1, x_2, \ldots, x_n$ and associated probabilities $p(x_i)$, the entropy is defined as:
$$H(P) = -\sum_{i=1}^{n} p(x_i) \log p(x_i)$$
The choice of logarithm base determines the unit: base 2 gives bits, base $e$ gives nats. In machine learning, we typically use natural logarithms (nats) for computational convenience.
Intuition: Entropy measures the average number of bits (or nats) needed to encode samples from a distribution using an optimal coding scheme. Events that occur rarely carry more information when they happen—a rare event is 'surprising'.
Key properties of entropy: it is non-negative, it equals zero only for a deterministic outcome, and for a fixed number of outcomes it is maximized by the uniform distribution. The table below illustrates these properties:
| Distribution | Probabilities | Entropy (bits) | Interpretation |
|---|---|---|---|
| Deterministic | [1.0, 0.0] | 0.00 | No uncertainty—outcome is certain |
| Fair coin | [0.5, 0.5] | 1.00 | Maximum uncertainty for 2 outcomes |
| Biased coin | [0.9, 0.1] | 0.47 | Low uncertainty—one outcome dominates |
| Fair die (6-sided) | [1/6, ...] | 2.58 | Maximum uncertainty for 6 outcomes |
| Loaded die | [0.5, 0.1, 0.1, 0.1, 0.1, 0.1] | 2.16 | Reduced uncertainty due to bias |
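These values are easy to verify numerically. Below is a minimal NumPy sketch (the `entropy` helper is ours, not a library function):

```python
import numpy as np

def entropy(p, base=2):
    """Shannon entropy of a discrete distribution (0 * log 0 treated as 0)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p)) / np.log(base)

print(f"{entropy([1.0, 0.0]):.2f}")                      # 0.00  deterministic
print(f"{entropy([0.5, 0.5]):.2f}")                      # 1.00  fair coin
print(f"{entropy([0.9, 0.1]):.2f}")                      # 0.47  biased coin
print(f"{entropy([1/6] * 6):.2f}")                       # 2.58  fair die
print(f"{entropy([0.5, 0.1, 0.1, 0.1, 0.1, 0.1]):.2f}")  # 2.16  loaded die
```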
Now consider a fundamental problem: you have a true distribution $P$ over outcomes, but you're using a different distribution $Q$ to design your encoding scheme. The cross-entropy between $P$ and $Q$ measures the expected number of bits needed to encode samples from $P$ when using a code optimized for $Q$:
$$H(P, Q) = -\sum_{i=1}^{n} p(x_i) \log q(x_i)$$
Critical insight: Cross-entropy is always at least as large as entropy: $H(P, Q) \geq H(P)$, with equality only when $P = Q$. This extra cost is precisely the KL divergence:
$$D_{KL}(P \| Q) = H(P, Q) - H(P) = \sum_{i=1}^{n} p(x_i) \log \frac{p(x_i)}{q(x_i)}$$
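A quick numeric check of this decomposition, using an arbitrary pair of distributions:

```python
import numpy as np

P = np.array([0.7, 0.2, 0.1])   # "true" distribution
Q = np.array([0.4, 0.4, 0.2])   # model / coding distribution

H_P  = -np.sum(P * np.log(P))       # entropy of P (nats)
H_PQ = -np.sum(P * np.log(Q))       # cross-entropy H(P, Q)
KL   = np.sum(P * np.log(P / Q))    # KL divergence D_KL(P || Q)

print(f"H(P)       = {H_P:.4f}")
print(f"H(P, Q)    = {H_PQ:.4f}")
print(f"D_KL(P||Q) = {KL:.4f}")
print(f"H(P) + KL  = {H_P + KL:.4f}  (equals H(P, Q))")
assert H_PQ >= H_P  # cross-entropy is never smaller than entropy
```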
In the machine learning context:
Cross-entropy penalizes predictions that assign low probability to the true outcome. If the true label is class 3 but your model predicts p(class 3) = 0.01, the loss is -log(0.01) ≈ 4.6 nats—a heavy penalty. If p(class 3) = 0.99, the loss is only -log(0.99) ≈ 0.01 nats. This logarithmic scaling is precisely what makes cross-entropy so effective: it creates strong pressure to assign high probability to correct classes.
For binary classification problems—where we predict between two classes (often labeled 0 and 1)—we use binary cross-entropy (BCE), also known as log loss. Let's derive it rigorously.
Given a true label $y \in \{0, 1\}$ and a predicted probability $\hat{y} \in (0, 1)$:
The true distribution $P$ is: $$P(Y=1) = y, \quad P(Y=0) = 1 - y$$
The predicted distribution $Q$ is: $$Q(Y=1) = \hat{y}, \quad Q(Y=0) = 1 - \hat{y}$$
Applying the cross-entropy formula:
$$\mathcal{L}_{BCE} = H(P, Q) = -[y \log \hat{y} + (1-y) \log(1-\hat{y})]$$
Understanding the formula: when $y = 1$, only the $-\log \hat{y}$ term is active, penalizing low probability on the positive class; when $y = 0$, only $-\log(1 - \hat{y})$ is active, penalizing probability mass wrongly assigned to the positive class.
```python
import numpy as np

def sigmoid(z):
    """Numerically stable sigmoid function."""
    # np.where evaluates both branches, so a harmless overflow warning may
    # appear for extreme inputs, but the selected result is always finite.
    return np.where(
        z >= 0,
        1 / (1 + np.exp(-z)),
        np.exp(z) / (1 + np.exp(z))
    )

def binary_cross_entropy(y_true, y_pred, epsilon=1e-15):
    """
    Compute binary cross-entropy loss.

    Args:
        y_true: Ground truth labels (0 or 1)
        y_pred: Predicted probabilities (0 to 1)
        epsilon: Small constant for numerical stability

    Returns:
        BCE loss value
    """
    # Clip predictions to prevent log(0)
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    # Compute BCE
    loss = -np.mean(
        y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)
    )
    return loss

def binary_cross_entropy_with_logits(y_true, logits):
    """
    Numerically stable BCE directly from logits.

    Equivalent to -y*log(sigmoid(z)) - (1-y)*log(1 - sigmoid(z)),
    but computed without ever forming the sigmoid explicitly.
    """
    # Stable identity: BCE = max(z, 0) - z*y + log(1 + exp(-|z|))
    return np.mean(
        np.maximum(logits, 0)
        - logits * y_true
        + np.log1p(np.exp(-np.abs(logits)))
    )

# Demonstration
print("Binary Cross-Entropy Examples:")
print("=" * 50)

# Perfect prediction
y, p = 1.0, 0.99
loss = binary_cross_entropy(np.array([y]), np.array([p]))
print(f"y=1, pred=0.99: BCE = {loss:.4f}")

# Confident wrong prediction
y, p = 1.0, 0.01
loss = binary_cross_entropy(np.array([y]), np.array([p]))
print(f"y=1, pred=0.01: BCE = {loss:.4f} (heavily penalized)")

# Uncertain prediction
y, p = 1.0, 0.5
loss = binary_cross_entropy(np.array([y]), np.array([p]))
print(f"y=1, pred=0.50: BCE = {loss:.4f} (moderate penalty)")
```

Binary cross-entropy has a beautiful probabilistic interpretation as negative log-likelihood. Consider a Bernoulli likelihood for each sample:
$$p(y | \hat{y}) = \hat{y}^y (1 - \hat{y})^{1-y}$$
For a dataset of $N$ independent samples, the likelihood is:
$$\mathcal{L}(\theta) = \prod_{i=1}^{N} \hat{y}_i^{y_i} (1 - \hat{y}_i)^{1-y_i}$$
Taking the negative log and dividing by $N$:
$$-\frac{1}{N} \log \mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} [y_i \log \hat{y}_i + (1-y_i) \log(1-\hat{y}_i)]$$
This is exactly the binary cross-entropy loss! Minimizing BCE is equivalent to maximum likelihood estimation under a Bernoulli model.
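A small sketch confirming the equivalence on a toy batch (the array values are arbitrary):

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0], dtype=float)
y_pred = np.array([0.9, 0.2, 0.7, 0.6, 0.1])

# Average negative log-likelihood under a per-sample Bernoulli model
nll = -np.mean(np.log(y_pred**y_true * (1 - y_pred)**(1 - y_true)))

# Binary cross-entropy
bce = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

print(nll, bce)             # identical up to floating-point error
assert np.isclose(nll, bce)
```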
The gradient of BCE with respect to the logit $z$ has a remarkably elegant form:
$$\frac{\partial \mathcal{L}_{BCE}}{\partial z} = \hat{y} - y = \sigma(z) - y$$
This is extraordinarily important. The gradient is simply the difference between prediction and target, bounded between -1 and 1. This provides a strong, well-scaled learning signal: confidently wrong predictions receive a gradient near $\pm 1$ rather than a vanishing one, and the gradient shrinks smoothly to zero only as the prediction approaches the target.
Let $\mathcal{L} = -[y \log \sigma(z) + (1-y) \log(1-\sigma(z))]$. Using the chain rule and $\sigma'(z) = \sigma(z)(1-\sigma(z))$:

$$\begin{aligned}
\frac{\partial \mathcal{L}}{\partial z} &= -y \cdot \frac{1}{\sigma(z)} \cdot \sigma(z)(1-\sigma(z)) - (1-y) \cdot \frac{1}{1-\sigma(z)} \cdot \bigl(-\sigma(z)(1-\sigma(z))\bigr) \\
&= -y(1-\sigma(z)) + (1-y)\sigma(z) \\
&= -y + y\sigma(z) + \sigma(z) - y\sigma(z) \\
&= \sigma(z) - y
\end{aligned}$$
This elegant result is why sigmoid + BCE is such a powerful combination.
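A finite-difference check makes this concrete (a self-contained sketch; `bce_from_logit` is our own helper):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def bce_from_logit(y, z):
    p = sigmoid(z)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

y, z, eps = 1.0, 0.3, 1e-6
numeric = (bce_from_logit(y, z + eps) - bce_from_logit(y, z - eps)) / (2 * eps)
analytic = sigmoid(z) - y

print(numeric, analytic)  # both ≈ -0.4256
assert np.isclose(numeric, analytic, atol=1e-6)
```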
When we have $K > 2$ mutually exclusive classes, we generalize to categorical cross-entropy (CCE), also called softmax cross-entropy or simply cross-entropy loss in many frameworks.
Given a one-hot label vector $\mathbf{y} \in \{0, 1\}^K$ with $y_c = 1$ for the true class $c$, and predicted probabilities $\hat{\mathbf{y}} = \text{softmax}(\mathbf{z})$ computed from the logits $\mathbf{z}$:
$$\mathcal{L}_{CCE} = -\sum_{k=1}^{K} y_k \log \hat{y}_k$$
Since $\mathbf{y}$ is one-hot with the true class at index $c$, this simplifies to:
$$\mathcal{L}_{CCE} = -\log \hat{y}_c = -\log \frac{e^{z_c}}{\sum_{j=1}^{K} e^{z_j}} = -z_c + \log \sum_{j=1}^{K} e^{z_j}$$
This formulation is known as the softmax cross-entropy or log-softmax loss.
```python
import numpy as np

def softmax(logits, axis=-1):
    """
    Numerically stable softmax.
    Subtracts max to prevent exp overflow.
    """
    # Shift for numerical stability
    shifted = logits - np.max(logits, axis=axis, keepdims=True)
    exp_shifted = np.exp(shifted)
    return exp_shifted / np.sum(exp_shifted, axis=axis, keepdims=True)

def categorical_cross_entropy(y_true_onehot, y_pred_probs, epsilon=1e-15):
    """
    Compute categorical cross-entropy loss.

    Args:
        y_true_onehot: Ground truth as one-hot vectors (N, K)
        y_pred_probs: Predicted probabilities (N, K)
        epsilon: Small constant for numerical stability

    Returns:
        Average CCE loss
    """
    y_pred_probs = np.clip(y_pred_probs, epsilon, 1 - epsilon)
    loss = -np.sum(y_true_onehot * np.log(y_pred_probs), axis=-1)
    return np.mean(loss)

def softmax_cross_entropy_with_logits(y_true_onehot, logits):
    """
    Numerically stable CCE directly from logits.

    Avoids computing softmax explicitly.
    Uses the log-sum-exp trick for stability.
    """
    # Compute log(softmax(z)) in a stable way:
    #   log_softmax = z - log(sum(exp(z)))
    # using log-sum-exp with max subtraction for stability.
    max_logits = np.max(logits, axis=-1, keepdims=True)
    log_sum_exp = max_logits + np.log(
        np.sum(np.exp(logits - max_logits), axis=-1, keepdims=True)
    )
    log_softmax = logits - log_sum_exp
    # CCE = -sum(y * log_softmax)
    loss = -np.sum(y_true_onehot * log_softmax, axis=-1)
    return np.mean(loss)

def sparse_categorical_cross_entropy(y_true_indices, logits):
    """
    CCE when labels are given as class indices, not one-hot.
    More memory efficient for large K.

    Args:
        y_true_indices: Ground truth class indices (N,)
        logits: Raw model outputs (N, K)
    """
    n_samples = logits.shape[0]
    # Compute log-sum-exp for stability
    max_logits = np.max(logits, axis=-1, keepdims=True)
    log_sum_exp = max_logits.squeeze() + np.log(
        np.sum(np.exp(logits - max_logits), axis=-1)
    )
    # Get logits at true class indices
    true_class_logits = logits[np.arange(n_samples), y_true_indices]
    # Loss = -log(softmax) = -z_c + log_sum_exp
    loss = -true_class_logits + log_sum_exp
    return np.mean(loss)

# Demonstration
print("Categorical Cross-Entropy Examples:")
print("=" * 50)

# 3-class problem
logits = np.array([[2.0, 1.0, 0.1]])  # Raw model output
y_true = np.array([[1, 0, 0]])        # Class 0 is correct

probs = softmax(logits)
print(f"Logits: {logits[0]}")
print(f"Softmax probabilities: {probs[0]}")
print(f"True class probability: {probs[0, 0]:.4f}")

loss1 = categorical_cross_entropy(y_true, probs)
loss2 = softmax_cross_entropy_with_logits(y_true, logits)
loss3 = sparse_categorical_cross_entropy(np.array([0]), logits)

print(f"\nCCE (from probs): {loss1:.4f}")
print(f"CCE (from logits): {loss2:.4f}")
print(f"Sparse CCE: {loss3:.4f}")
```

Remarkably, the gradient of CCE with respect to logits has the same elegant form as BCE:
$$\frac{\partial \mathcal{L}_{CCE}}{\partial z_k} = \hat{y}_k - y_k$$
For the true class $c$: $\frac{\partial \mathcal{L}}{\partial z_c} = \hat{y}_c - 1$ (pushes logit up)
For other classes $j \neq c$: $\frac{\partial \mathcal{L}}{\partial z_j} = \hat{y}_j$ (pushes logits down)
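This identity can be verified numerically with finite differences (a self-contained sketch; the helper names are ours):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cce_from_logits(z, c):
    # -log softmax(z)[c], via the stable identity -z_c + logsumexp(z)
    m = np.max(z)
    return -z[c] + m + np.log(np.sum(np.exp(z - m)))

z = np.array([2.0, 1.0, 0.1])
c = 0                      # true class index
y = np.eye(3)[c]           # one-hot target
eps = 1e-6

numeric = np.array([
    (cce_from_logits(z + eps * np.eye(3)[k], c)
     - cce_from_logits(z - eps * np.eye(3)[k], c)) / (2 * eps)
    for k in range(3)
])
analytic = softmax(z) - y

print(numeric)   # ≈ [-0.341, 0.242, 0.099]
print(analytic)
assert np.allclose(numeric, analytic, atol=1e-5)
```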
This beautiful symmetry is no coincidence. Both binary and categorical cross-entropy are derived from minimizing KL divergence between the true distribution and the model's predictions. The gradient being simply prediction minus target is a consequence of the exponential family structure of the softmax/sigmoid functions.
In practice, one-hot labels can cause overconfidence. Label smoothing softens targets:
$$y_k^{smooth} = (1 - \alpha) \cdot y_k + \frac{\alpha}{K}$$
where $\alpha$ is the smoothing parameter (typically 0.1). This regularizes the model by preventing it from driving the true-class logit arbitrarily far above the rest, which curbs overconfidence and tends to improve calibration and generalization.
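For concreteness, here is what the smoothed targets look like for $K = 5$ classes and $\alpha = 0.1$ (a toy sketch):

```python
import numpy as np

K, alpha = 5, 0.1
true_class = 2

hard = np.eye(K)[true_class]
smooth = (1 - alpha) * hard + alpha / K

print(hard)    # [0.   0.   1.   0.   0.  ]
print(smooth)  # [0.02 0.02 0.92 0.02 0.02]  -- still sums to 1
```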
For problems with many classes (e.g., ImageNet with 1000 classes), storing one-hot vectors wastes memory. 'Sparse' categorical cross-entropy uses integer class indices instead, computing the same loss more efficiently. Most deep learning frameworks support both formats.
Cross-entropy loss involves logarithms and exponentials—operations notorious for numerical issues. Understanding and implementing stable versions is crucial for reliable training.
Problem 1 (log of zero): when the predicted probability $\hat{y} \to 0$, we get $\log(\hat{y}) \to -\infty$. This happens when the model is very confident about the wrong class.
Problem 2 (exponential overflow): for large logits, $e^{z}$ can overflow the floating-point representation (above roughly 1e308 for float64, 1e38 for float32).
Problem 3 (exponential underflow): for large negative logits, $e^{z} \to 0$, which causes division by zero in a naive softmax when every term underflows.
The naive sigmoid $\sigma(z) = \frac{1}{1 + e^{-z}}$ is problematic for large negative $z$, where $e^{-z}$ overflows. The stable version branches on the sign of $z$:
$$\sigma(z) = \begin{cases} \frac{1}{1 + e^{-z}} & \text{if } z \geq 0 \\ \frac{e^z}{1 + e^z} & \text{if } z < 0 \end{cases}$$
The key insight: softmax is invariant to adding a constant to all logits. We subtract the maximum:
$$\hat{y}_k = \frac{e^{z_k - \max(\mathbf{z})}}{\sum_j e^{z_j - \max(\mathbf{z})}}$$
This ensures all exponents are $\leq 0$, preventing overflow.
For cross-entropy, we need $\log(\text{softmax})$. Computing this directly risks underflow. Instead:
$$\log \hat{y}_k = z_k - \log \sum_j e^{z_j}$$
The log-sum-exp is computed stably as:
$$\log \sum_j e^{z_j} = m + \log \sum_j e^{z_j - m}$$
where $m = \max(\mathbf{z})$.
```python
import numpy as np

def log_sum_exp_stable(logits, axis=-1):
    """
    Compute log(sum(exp(logits))) in a numerically stable way.
    This is the key primitive for stable softmax cross-entropy.
    """
    max_val = np.max(logits, axis=axis, keepdims=True)
    return max_val.squeeze(axis) + np.log(
        np.sum(np.exp(logits - max_val), axis=axis)
    )

def stable_softmax_cross_entropy(y_true_idx, logits):
    """
    Production-quality softmax cross-entropy.

    Computes: -logit[true_class] + log(sum(exp(logits)))
    All operations are numerically stable.
    """
    n_samples = len(y_true_idx)
    # Stable log-sum-exp
    lse = log_sum_exp_stable(logits, axis=-1)
    # Get logit at true class
    true_logits = logits[np.arange(n_samples), y_true_idx]
    # Cross-entropy = -z_c + LSE
    loss = -true_logits + lse
    return np.mean(loss)

def stable_binary_cross_entropy(y_true, logits):
    """
    Production-quality binary cross-entropy from logits.

    Uses the identity:
        -y*log(σ(z)) - (1-y)*log(1-σ(z)) = max(z,0) - z*y + log(1+exp(-|z|))
    This formulation avoids computing sigmoid explicitly
    and is stable for all values of z.
    """
    # For large positive z: max(z,0)=z, exp(-|z|)≈0, so loss ≈ z - z*y = z(1-y)
    # For large negative z: max(z,0)=0, exp(-|z|)≈0, so loss ≈ -z*y
    # Both limits are correct and finite.
    return np.mean(
        np.maximum(logits, 0)
        - logits * y_true
        + np.log1p(np.exp(-np.abs(logits)))
    )

# Demonstration: Why stability matters
print("Demonstrating Numerical Stability")
print("=" * 50)

# Extreme logits that break a naive implementation
extreme_logits = np.array([[1000.0, 0.0, -1000.0]])  # exp(1000) overflows to inf
y_true = np.array([0])  # True class is 0

# Naive approach: the overflow turns the softmax into nan (inf / inf)
with np.errstate(over='ignore', invalid='ignore'):
    naive_exp = np.exp(extreme_logits)
    naive_softmax = naive_exp / np.sum(naive_exp, axis=-1, keepdims=True)
print(f"Naive softmax: {naive_softmax}")  # contains nan -- useless as a loss input

# Stable approach
loss = stable_softmax_cross_entropy(y_true, extreme_logits)
print(f"\nStable CCE with extreme logits: {loss:.6f}")
print("(Loss is finite and correct!)")

# Binary case with extreme logit
print("\n" + "=" * 50)
y_binary = np.array([1.0])
extreme_z = np.array([50.0])  # Very confident prediction

loss = stable_binary_cross_entropy(y_binary, extreme_z)
print(f"Stable BCE with logit=50: {loss:.10f}")
print("(Correctly close to 0 since prediction is confident and correct)")
```

Never implement cross-entropy loss by first computing softmax/sigmoid and then taking the log. Use combined functions like `tf.nn.softmax_cross_entropy_with_logits` (TensorFlow), `F.cross_entropy` (PyTorch), or `jax.nn.log_softmax` (JAX). These implement the stable formulations internally and are also more efficient (fewer passes through memory).
Cross-entropy and KL divergence are intimately connected, and understanding this relationship illuminates why cross-entropy is the 'right' loss for classification.
The Kullback-Leibler divergence from distribution $Q$ to $P$ measures the information lost when $Q$ is used to approximate $P$:
$$D_{KL}(P \| Q) = \sum_x p(x) \log \frac{p(x)}{q(x)} = H(P, Q) - H(P)$$
Key properties: $D_{KL}(P \| Q) \geq 0$, with equality if and only if $P = Q$; it is asymmetric, so $D_{KL}(P \| Q) \neq D_{KL}(Q \| P)$ in general; and it is not a true distance metric.
In supervised learning, $P_{data}$ is the true distribution (one-hot labels) and $P_{model}$ is our neural network's output. We minimize:
$$D_{KL}(P_{data} \| P_{model}) = H(P_{data}, P_{model}) - H(P_{data})$$
Since $H(P_{data})$ is constant (determined by the data), minimizing KL divergence is equivalent to minimizing cross-entropy $H(P_{data}, P_{model})$.
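A tiny example with a one-hot label makes this concrete: since $H(P_{data}) = 0$ for hard labels, cross-entropy and KL divergence coincide (the values below are arbitrary):

```python
import numpy as np

p_data  = np.array([0.0, 1.0, 0.0])   # one-hot label: entropy is 0
p_model = np.array([0.2, 0.7, 0.1])   # model prediction

cross_entropy = -np.sum(p_data * np.log(p_model))   # = -log(0.7)
mask = p_data > 0
kl = np.sum(p_data[mask] * np.log(p_data[mask] / p_model[mask]))

print(cross_entropy, kl)  # both ≈ 0.3567: with one-hot targets, CE and KL agree
```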
The direction of KL divergence matters:
$D_{KL}(P \| Q)$ (forward KL): the expectation is taken under the true distribution $P$, so the model $Q$ is penalized heavily wherever it assigns low probability to outcomes that $P$ makes likely (mass-covering behavior).
$D_{KL}(Q \| P)$ (reverse KL): the expectation is taken under the model $Q$, so $Q$ is penalized for placing probability where $P$ has little mass (mode-seeking behavior).
In classification, we use forward KL because we want the model to assign high probability wherever the true label says so.
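The asymmetry is easy to see numerically (the distributions below are arbitrary and have full support):

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) for discrete distributions with full support."""
    return np.sum(p * np.log(p / q))

P = np.array([0.8, 0.1, 0.1])
Q = np.array([0.4, 0.3, 0.3])

print(kl(P, Q))  # forward KL  D_KL(P || Q) ≈ 0.335
print(kl(Q, P))  # reverse KL  D_KL(Q || P) ≈ 0.382  -- generally different
```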
Minimizing cross-entropy has a beautiful interpretation: we are finding the model distribution $Q$ that requires the fewest extra bits (beyond the entropy of the true distribution) to encode samples from $P$.
$$\underbrace{H(P, Q)}_{\text{bits to encode}} = \underbrace{H(P)}_{\text{min possible}} + \underbrace{D_{KL}(P \| Q)}_{\text{wasted bits}}$$
Training a classifier is literally finding the most efficient encoding of the labels given the inputs.
This perspective explains why cross-entropy generalizes well: a model that compresses data well must have learned the true underlying patterns, not memorized spurious correlations.
A natural question arises: why use cross-entropy for classification instead of simpler alternatives? Let's compare.
For binary classification with target $y \in \{0, 1\}$ and prediction $\hat{y} = \sigma(z)$:
MSE Loss: $\mathcal{L}_{MSE} = (y - \hat{y})^2$
Gradient of MSE w.r.t. logit $z$: $$\frac{\partial \mathcal{L}_{MSE}}{\partial z} = 2(\hat{y} - y) \cdot \sigma(z)(1 - \sigma(z))$$
Problem: The gradient includes $\sigma(z)(1-\sigma(z))$, which vanishes when $\sigma(z) \to 0$ or $\sigma(z) \to 1$. This means that a saturated but wrong prediction (e.g., $\sigma(z) \approx 0$ when $y = 1$) receives almost no gradient, so learning stalls exactly when the largest correction is needed.
Cross-Entropy Gradient: $\frac{\partial \mathcal{L}_{BCE}}{\partial z} = \hat{y} - y$
For a true label $y = 1$, compare the gradient magnitudes:

| Prediction (ŷ) | MSE Gradient 2(ŷ−y)·σ′(z) | BCE Gradient ŷ−y | Learning Signal |
|---|---|---|---|
| 0.01 (wrong) | 2(0.01−1) × 0.0099 ≈ −0.02 | −0.99 | BCE ≈ 50× stronger |
| 0.10 (wrong) | 2(0.10−1) × 0.09 ≈ −0.16 | −0.90 | BCE ≈ 5.6× stronger |
| 0.50 (uncertain) | 2(0.50−1) × 0.25 = −0.25 | −0.50 | BCE 2× stronger |
| 0.90 (correct) | 2(0.90−1) × 0.09 ≈ −0.02 | −0.10 | Both small |
| 0.99 (correct) | 2(0.99−1) × 0.0099 ≈ −0.0002 | −0.01 | Both near zero |
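These entries can be reproduced directly (assuming a true label $y = 1$ and a sigmoid output, so $\hat{y} = \sigma(z)$):

```python
import numpy as np

y = 1.0
for p in [0.01, 0.10, 0.50, 0.90, 0.99]:
    mse_grad = 2 * (p - y) * p * (1 - p)  # d/dz of (y - sigma(z))^2
    bce_grad = p - y                      # d/dz of binary cross-entropy
    print(f"p={p:.2f}  MSE grad={mse_grad:+.4f}  BCE grad={bce_grad:+.2f}  "
          f"ratio={abs(bce_grad / mse_grad):5.1f}x")
```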
Cross-entropy is the natural loss for classification because:
It respects the geometry of probability space: Probabilities live on the simplex, and cross-entropy is a Bregman divergence that respects this geometry.
It matches the natural gradient: The gradient of cross-entropy corresponds to the "natural gradient" in the space of distributions, leading to more efficient optimization.
It's derived from first principles: Unlike MSE, which is chosen for mathematical convenience in regression, cross-entropy emerges from maximum likelihood—the principled approach to fitting probability distributions.
Despite its gradient issues, MSE can still work in specific settings, but for standard classification, cross-entropy remains the gold standard.
Cross-entropy's theoretical justification goes beyond just 'good gradients'. By the principle of maximum entropy, among all distributions satisfying known constraints, the one with highest entropy is the least biased. Cross-entropy loss encourages the model to find this least biased distribution that still explains the data—a form of regularization built into the loss itself.
Armed with theoretical understanding, let's discuss practical considerations for using cross-entropy in real systems.
When classes are severely imbalanced (e.g., 99% negative, 1% positive), standard cross-entropy can lead to models that simply predict the majority class. Solutions:
Weighted Cross-Entropy: $$\mathcal{L} = -\sum_k w_k \cdot y_k \log \hat{y}_k$$
where $w_k$ is inversely proportional to class frequency. Common choice: $w_k = \frac{N}{K \cdot n_k}$ where $n_k$ is the count of class $k$.
Focal Loss (covered in a later page): Down-weights easy examples to focus on hard ones.
When samples can belong to multiple classes simultaneously, use binary cross-entropy per class:
$$\mathcal{L} = -\frac{1}{K}\sum_{k=1}^{K} [y_k \log \sigma(z_k) + (1-y_k) \log(1 - \sigma(z_k))]$$
Each output is now an independent binary prediction.
```python
import numpy as np

def weighted_cross_entropy(y_true, y_pred, class_weights, epsilon=1e-15):
    """
    Cross-entropy with class weights for imbalanced data.

    Args:
        y_true: One-hot encoded labels (N, K)
        y_pred: Predicted probabilities (N, K)
        class_weights: Weight for each class (K,)
    """
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    # Weight each sample by its true class weight.
    # For one-hot y_true, this picks out the weight of the true class.
    sample_weights = np.sum(y_true * class_weights, axis=-1)
    # Standard cross-entropy per sample
    ce = -np.sum(y_true * np.log(y_pred), axis=-1)
    # Weighted average
    return np.mean(sample_weights * ce)

def multilabel_binary_cross_entropy(y_true, logits):
    """
    BCE for multi-label classification.
    Each class is treated as an independent binary prediction.

    Args:
        y_true: Binary labels per class (N, K)
        logits: Raw model outputs (N, K)
    """
    # Stable BCE for each class independently
    bce_per_class = (
        np.maximum(logits, 0)
        - logits * y_true
        + np.log1p(np.exp(-np.abs(logits)))
    )
    # Average over classes, then over samples
    return np.mean(bce_per_class)

def label_smoothing_cross_entropy(y_true_indices, logits, alpha=0.1):
    """
    Cross-entropy with label smoothing regularization.

    Soft labels: (1 - alpha + alpha/K) for the true class, alpha/K for the others.
    Prevents overconfidence and improves generalization.

    Args:
        y_true_indices: True class indices (N,)
        logits: Raw outputs (N, K)
        alpha: Smoothing parameter (0.1 typical)
    """
    n_samples, n_classes = logits.shape
    # Create smoothed labels
    smooth_labels = np.full((n_samples, n_classes), alpha / n_classes)
    smooth_labels[np.arange(n_samples), y_true_indices] = 1 - alpha + alpha / n_classes
    # Stable log-softmax
    max_logits = np.max(logits, axis=-1, keepdims=True)
    log_softmax = logits - max_logits - np.log(
        np.sum(np.exp(logits - max_logits), axis=-1, keepdims=True)
    )
    # Cross-entropy with soft labels
    loss = -np.sum(smooth_labels * log_softmax, axis=-1)
    return np.mean(loss)

# Demonstration
# (softmax and sparse_categorical_cross_entropy are reused from the earlier listing)
print("Practical Cross-Entropy Variants")
print("=" * 50)

# Setup
np.random.seed(42)
logits = np.random.randn(4, 5)  # 4 samples, 5 classes
y_true = np.array([0, 1, 2, 0])

# Standard CCE
loss_standard = sparse_categorical_cross_entropy(y_true, logits)
print(f"Standard CCE: {loss_standard:.4f}")

# With label smoothing
loss_smooth = label_smoothing_cross_entropy(y_true, logits, alpha=0.1)
print(f"Label-smoothed CCE (α=0.1): {loss_smooth:.4f}")

# Class-weighted (pretend class 2 is rare)
weights = np.array([1.0, 1.0, 5.0, 1.0, 1.0])  # 5× weight for class 2
y_onehot = np.zeros((4, 5))
y_onehot[np.arange(4), y_true] = 1
probs = softmax(logits)
loss_weighted = weighted_cross_entropy(y_onehot, probs, weights)
print(f"Class-weighted CCE: {loss_weighted:.4f}")
```

We've taken a comprehensive journey through cross-entropy loss, from information theory to practical implementation. Let's consolidate the key insights:

- Cross-entropy $H(P, Q)$ measures the expected coding cost of using the model distribution $Q$ in place of the true distribution $P$; the excess over $H(P)$ is exactly the KL divergence.
- Minimizing cross-entropy is equivalent to maximum likelihood estimation under a Bernoulli (binary) or categorical (multi-class) model.
- Paired with sigmoid or softmax, the gradient with respect to the logits is simply $\hat{y} - y$, avoiding the saturation that plagues MSE on probabilities.
- Stable implementations work directly from logits using the log-sum-exp trick; never compute softmax or sigmoid and then take the log.
- Practical variants (class weighting, label smoothing, per-class BCE for multi-label problems) adapt the same core loss to real-world data.
Looking Forward
Cross-entropy is the foundation for classification losses, but it's not the only option. The following pages explore alternatives and refinements, including focal loss for handling hard examples and class imbalance.
You now possess deep understanding of cross-entropy loss—its theoretical foundations, gradient properties, numerical implementation, and practical usage. This knowledge will serve as the benchmark against which you evaluate all other loss functions.