In the realm of model compression and neural network optimization, knowledge distillation stands as one of the most powerful techniques for creating efficient, deployable models. The core insight is that a smaller "student" model can learn far more effectively from the rich, nuanced outputs of a larger "teacher" model than from raw ground-truth labels alone.
When a teacher network produces predictions, it doesn't just indicate which class is correct: it reveals a complete probability landscape showing the relationships between all classes. For instance, a teacher classifying handwritten digits might assign probabilities [0.02, 0.01, 0.80, 0.05, 0.01, 0.02, 0.04, 0.02, 0.02, 0.01] for the digit "2", subtly indicating that "2" somewhat resembles "3" and, to a lesser degree, "6". These "dark knowledge" signals, encoded in the "wrong" class probabilities, contain invaluable information about the data's structure that hard labels discard entirely.
Temperature Scaling: The standard softmax function tends to produce very peaked distributions when one class dominates. To expose the hidden inter-class relationships, we introduce a temperature parameter T that softens the distribution:
$$\text{softmax}_T(z_i) = \frac{e^{z_i / T}}{\sum_j e^{z_j / T}}$$
At T = 1, this is the standard softmax. As T → ∞, the distribution approaches uniform. Moderate temperatures (typically T = 2 to 20) reveal the most useful structural information.
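As a minimal sketch, the temperature-softened softmax can be written in a few lines of plain Python (the function name `softmax_t` is an illustrative choice, not part of the problem statement):

```python
import math

def softmax_t(logits, temperature=1.0):
    """Temperature-scaled softmax: higher T flattens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

For example, `softmax_t([1.0, 2.0, 3.0])` gives roughly [0.0900, 0.2447, 0.6652], while the same logits at T = 20 produce a nearly uniform distribution.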
The Distillation Loss: The distillation loss measures how well the student matches the teacher's softened distribution using Kullback-Leibler (KL) divergence:
$$\mathcal{L}_{\text{distill}} = T^2 \cdot D_{\text{KL}}(P_T \,\|\, Q_T) = T^2 \sum_i P_T(i) \cdot \log\left(\frac{P_T(i)}{Q_T(i)}\right)$$
Where P_T is the teacher's temperature-softened distribution and Q_T is the student's. The T^2 factor compensates for the 1/T^2 scaling that softening applies to the gradients, keeping loss magnitudes comparable across temperatures.
Your Task: Implement a function that computes the knowledge distillation loss between a student and teacher model. Given raw logits from both models and a temperature parameter, compute the temperature-softened distributions and return the scaled KL divergence loss (rounded to 4 decimal places).
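One way to put the pieces together, sketched in plain Python (the function name and list-based interface are assumptions, not a reference solution):

```python
import math

def distillation_loss(student_logits, teacher_logits, temperature):
    """Sketch of the scaled KL distillation loss described above.

    Returns T^2 * D_KL(P_T || Q_T), rounded to 4 decimal places, where
    P_T / Q_T are the temperature-softened teacher / student distributions.
    """
    def softened(logits):
        scaled = [z / temperature for z in logits]
        m = max(scaled)                      # stabilize the exponentials
        exps = [math.exp(s - m) for s in scaled]
        total = sum(exps)
        return [e / total for e in exps]

    p = softened(teacher_logits)   # P_T: teacher distribution
    q = softened(student_logits)   # Q_T: student distribution
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return round(temperature ** 2 * kl, 4)
```

Note the argument order inside the KL sum: the teacher distribution P_T plays the role of the reference, so each term is P_T(i) · log(P_T(i)/Q_T(i)).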
student_logits = [1.0, 2.0]
teacher_logits = [1.5, 2.5]
temperature = 1.0
Output: 0.0000
First, we apply the temperature-softened softmax to both logit vectors at T = 1.0:
Teacher: softmax([1.5, 2.5]/1.0) = softmax([1.5, 2.5]) ≈ [0.2689, 0.7311]
Student: softmax([1.0, 2.0]/1.0) = softmax([1.0, 2.0]) ≈ [0.2689, 0.7311]
The teacher logits are just the student logits shifted by a constant (+0.5), and softmax is invariant to adding a constant to every logit, so the two distributions are identical and the KL divergence is exactly 0.
At T = 1 the T^2 factor is 1, so the final loss is 0.0000.
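This can be checked numerically with a short, self-contained snippet (the helper name `probs` is illustrative):

```python
import math

def probs(logits, t):
    """Temperature-softened softmax over a list of logits."""
    exps = [math.exp(z / t) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

p = probs([1.5, 2.5], 1.0)  # teacher distribution
q = probs([1.0, 2.0], 1.0)  # student distribution
kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
# p and q agree up to floating point, so kl is ~0
```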
student_logits = [0.5, 1.5, 2.5]
teacher_logits = [1.0, 2.0, 3.0]
temperature = 2.0
Output: 0.0000
At temperature T = 2.0, both logit vectors are divided by 2 before the softmax:
Teacher: softmax([1.0, 2.0, 3.0]/2.0) = softmax([0.5, 1.0, 1.5])
Student: softmax([0.5, 1.5, 2.5]/2.0) = softmax([0.25, 0.75, 1.25])
After scaling, the student logits are again a constant shift (−0.25) of the teacher logits, so by softmax's shift invariance the two distributions are identical and the KL divergence is 0.
With T^2 = 4 scaling applied to 0, the final loss is 0.0000.
student_logits = [1.0, 2.0, 3.0]
teacher_logits = [1.0, 2.0, 3.0]
temperature = 1.0
Output: 0.0000
When student and teacher have identical logits, their softmax distributions are exactly the same:
Both: softmax([1.0, 2.0, 3.0]) ≈ [0.0900, 0.2447, 0.6652]
The KL divergence satisfies D_KL(P || Q) ≥ 0, with equality exactly when P = Q (Gibbs' inequality), so D_KL(P || P) = 0 for any distribution P.
This demonstrates that when the student perfectly matches the teacher, the distillation loss is zero, which is the optimization target during training.
Constraints