In classification tasks, neural networks are typically trained using hard labels—one-hot encoded vectors where the true class has probability 1.0 and all other classes have probability 0.0. However, this approach can lead to overconfident predictions and poor generalization, especially when training data is noisy or limited.
Soft target distributions address this limitation by redistributing a fraction ε (epsilon) of the probability mass uniformly across all K classes, so the true class keeps most, but not all, of the mass. This technique, a form of label smoothing (also called label relaxation or confidence regularization), encourages the model to be less certain about its predictions while still learning the correct class.
Given a classification problem with K classes, a smoothing parameter ε, and a true class label y, the soft target distribution q is defined as:
$$q_i = \begin{cases} 1 - \varepsilon + \frac{\varepsilon}{K} & \text{if } i = y \text{ (true class)} \\ \frac{\varepsilon}{K} & \text{otherwise} \end{cases}$$
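Given only integer class labels, the soft targets can be built in a vectorized way. A minimal NumPy sketch (the function name `soft_targets` is illustrative, not part of any prescribed API):

```python
import numpy as np

def soft_targets(y_true, num_classes, epsilon):
    """Label-smoothed target distributions for a batch of integer class labels."""
    y_true = np.asarray(y_true)
    # Every class starts with eps/K of the probability mass...
    q = np.full((y_true.shape[0], num_classes), epsilon / num_classes)
    # ...and the true class receives the remaining 1 - eps on top.
    q[np.arange(y_true.shape[0]), y_true] += 1.0 - epsilon
    return q
```

Each row sums to (1 − ε) + K·(ε/K) = 1, so every target is still a valid probability distribution.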
The cross-entropy loss between the model's predicted distribution p (obtained via softmax) and the soft target q is:
$$\mathcal{L} = -\sum_{i=1}^{K} q_i \log(p_i)$$
Direct computation of softmax followed by logarithm can cause numerical overflow or underflow (e.g., computing $e^{1000}$ or $\log(0)$). A numerically stable approach computes the log-softmax directly:
$$\log(\text{softmax}(z))_i = z_i - \log\left(\sum_{j=1}^{K} e^{z_j}\right)$$
Using the log-sum-exp trick with a max-shift for stability:
$$\log\left(\sum_{j=1}^{K} e^{z_j}\right) = m + \log\left(\sum_{j=1}^{K} e^{z_j - m}\right), \text{ where } m = \max_j(z_j)$$
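The max-shift guarantees that every exponent is at most zero, so the exponential can no longer overflow. A sketch of a stable log-softmax using this trick (plain NumPy, illustrative function name):

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax along the last axis via the max-shift trick."""
    z = np.asarray(z, dtype=float)
    m = z.max(axis=-1, keepdims=True)  # m = max_j z_j
    # log(sum_j e^{z_j}) = m + log(sum_j e^{z_j - m}); all shifted exponents are <= 0.
    lse = m + np.log(np.exp(z - m).sum(axis=-1, keepdims=True))
    return z - lse
```

Even extreme logits such as `[[1000.0, 0.0]]` stay finite here, whereas a naive `np.log(np.exp(z) / np.exp(z).sum())` would overflow to `inf` and then produce `nan`.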
Implement a function that builds the soft target distribution for each sample, computes the log-softmax of the logits in a numerically stable way, and returns the mean cross-entropy loss over the batch, rounded to round_decimals decimal places.
The function should handle arbitrary batch sizes, varying numbers of classes, and different smoothing intensities from 0.0 (hard labels) to values approaching 1.0 (maximum smoothing).
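Putting the pieces together, here is a minimal end-to-end sketch in NumPy (the name `smoothed_cross_entropy` and its signature are assumptions for illustration, not a prescribed API):

```python
import numpy as np

def smoothed_cross_entropy(logits, y_true, num_classes, epsilon):
    """Mean cross-entropy between log-softmax(logits) and label-smoothed targets."""
    z = np.asarray(logits, dtype=float)
    y = np.asarray(y_true)
    # Soft targets: eps/K everywhere, plus the remaining 1 - eps on the true class.
    q = np.full((y.shape[0], num_classes), epsilon / num_classes)
    q[np.arange(y.shape[0]), y] += 1.0 - epsilon
    # Stable log-softmax via the max-shifted log-sum-exp.
    m = z.max(axis=1, keepdims=True)
    log_p = z - (m + np.log(np.exp(z - m).sum(axis=1, keepdims=True)))
    # Per-sample cross-entropy, averaged over the batch.
    return float(np.mean(-(q * log_p).sum(axis=1)))

print(round(smoothed_cross_entropy([[2.0, 0.0, -1.0], [0.0, 1.0, 0.0]],
                                   [0, 2], 3, 0.1), 6))  # → 0.927312 (Example 1)
```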
logits = [[2.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
y_true = [0, 2]
num_classes = 3
epsilon = 0.1
round_decimals = 6

Output: 0.927312

Sample 1: True class is 0. With ε=0.1 and K=3: • Soft target: [1 - 0.1 + 0.1/3, 0.1/3, 0.1/3] = [0.9333, 0.0333, 0.0333] • Log-softmax of [2.0, 0.0, -1.0] ≈ [-0.170, -2.170, -3.170] • Cross-entropy: -[(0.9333 × -0.170) + (0.0333 × -2.170) + (0.0333 × -3.170)] ≈ 0.3365

Sample 2: True class is 2. • Soft target: [0.0333, 0.0333, 0.9333] • Log-softmax of [0.0, 1.0, 0.0] ≈ [-1.551, -0.551, -1.551] • Cross-entropy ≈ 1.5181

Mean Loss: (0.3365 + 1.5181) / 2 ≈ 0.927312
logits = [[1.0, -1.0], [0.5, 0.5]]
y_true = [0, 1]
num_classes = 2
epsilon = 0.05
round_decimals = 6

Output: 0.435038

Sample 1: True class is 0. With ε=0.05 and K=2: • Soft target: [0.975, 0.025] • Log-softmax of [1.0, -1.0] ≈ [-0.127, -2.127] • Cross-entropy ≈ 0.177
Sample 2: True class is 1. • Soft target: [0.025, 0.975] • Log-softmax of [0.5, 0.5] = [-0.693, -0.693] • Cross-entropy ≈ 0.693
Mean Loss: (0.177 + 0.693) / 2 ≈ 0.435038
logits = [[3.0, 1.0, 0.0, -1.0]]
y_true = [0]
num_classes = 4
epsilon = 0.0
round_decimals = 6

Output: 0.185182

When ε=0.0, we revert to standard hard labels (no smoothing): • Soft target: [1.0, 0.0, 0.0, 0.0] (pure one-hot) • Log-softmax of [3.0, 1.0, 0.0, -1.0] ≈ [-0.185182, -2.185182, -3.185182, -4.185182] • Cross-entropy: -(1.0 × -0.185182) = 0.185182
This demonstrates that with zero smoothing, the loss reduces to standard categorical cross-entropy focusing only on the true class probability.
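That reduction is easy to confirm numerically: with ε = 0 the soft target is one-hot, so the loss collapses to the negative log-probability of the true class. A quick check on this example (plain NumPy, variable names are illustrative):

```python
import numpy as np

z = np.array([[3.0, 1.0, 0.0, -1.0]])
y = 0  # true class index
# Stable log-softmax of the single sample.
m = z.max()
log_p = z - (m + np.log(np.exp(z - m).sum()))
# eps = 0 means a pure one-hot target, so the loss is just -log p[true class].
hard_ce = -log_p[0, y]
print(round(float(hard_ce), 6))  # → 0.185182
```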
Constraints