In the realm of model compression and neural network optimization, knowledge distillation stands as one of the most powerful techniques for creating efficient, deployable models. The core insight is that a smaller "student" model can learn far more effectively from the rich, nuanced outputs of a larger "teacher" model than from raw ground-truth labels alone.
When a teacher network produces predictions, it doesn't just indicate which class is correct: it reveals a complete probability landscape showing the relationships between all classes. For instance, a teacher classifying handwritten digits might assign probabilities [0.02, 0.01, 0.80, 0.05, 0.01, 0.02, 0.04, 0.02, 0.02, 0.01] for the digit "2", subtly indicating that "2" somewhat resembles "3" and, to a lesser degree, "6". These "dark knowledge" signals, encoded in the "wrong" class probabilities, contain invaluable information about the data's structure that hard labels discard entirely.
Temperature Scaling: The standard softmax function tends to produce very peaked distributions when one class dominates. To expose the hidden inter-class relationships, we introduce a temperature parameter T that softens the distribution:
$$\text{softmax}_T(z_i) = \frac{e^{z_i / T}}{\sum_j e^{z_j / T}}$$
At T = 1, this is the standard softmax. As T → ∞, the distribution approaches uniform. Moderate temperatures (typically T = 2 to 20) reveal the most useful structural information.
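As a minimal sketch, the temperature-softened softmax can be written in a few lines of plain Python (the function name `softmax_t` is an illustrative choice, not part of the problem statement):

```python
import math

def softmax_t(logits, temperature=1.0):
    """Temperature-scaled softmax: higher T flattens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

For example, `softmax_t([1.0, 2.0, 3.0])` gives roughly [0.0900, 0.2447, 0.6652], while the same logits at T = 20 produce a nearly uniform distribution.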
The Distillation Loss: The distillation loss measures how well the student matches the teacher's softened distribution using Kullback-Leibler (KL) divergence:
$$\mathcal{L}_{\text{distill}} = T^2 \cdot D_{\text{KL}}(P_T \,\|\, Q_T) = T^2 \sum_i P_T(i) \cdot \log\left(\frac{P_T(i)}{Q_T(i)}\right)$$
Where P_T is the teacher's temperature-softened distribution and Q_T is the student's. The T^2 factor compensates for the 1/T^2 scaling that softening applies to the gradients, keeping loss magnitudes comparable across temperatures.
Your Task: Implement a function that computes the knowledge distillation loss between a student and teacher model. Given raw logits from both models and a temperature parameter, compute the temperature-softened distributions and return the scaled KL divergence loss (rounded to 4 decimal places).
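One way to put the pieces together, sketched in plain Python (the function name and list-based interface are assumptions, not a reference solution):

```python
import math

def distillation_loss(student_logits, teacher_logits, temperature):
    """Sketch of the scaled KL distillation loss described above.

    Returns T^2 * D_KL(P_T || Q_T), rounded to 4 decimal places, where
    P_T / Q_T are the temperature-softened teacher / student distributions.
    """
    def softened(logits):
        scaled = [z / temperature for z in logits]
        m = max(scaled)                      # stabilize the exponentials
        exps = [math.exp(s - m) for s in scaled]
        total = sum(exps)
        return [e / total for e in exps]

    p = softened(teacher_logits)   # P_T: teacher distribution
    q = softened(student_logits)   # Q_T: student distribution
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return round(temperature ** 2 * kl, 4)
```

Note the argument order inside the KL sum: the teacher distribution P_T plays the role of the reference, so each term is P_T(i) · log(P_T(i)/Q_T(i)).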
student_logits = [1.0, 2.0]
teacher_logits = [1.5, 2.5]
temperature = 1.0
Output: 0.0000
First, we apply the temperature-softened softmax to both logit vectors at T = 1.0:
Teacher: softmax([1.5, 2.5]/1.0) = softmax([1.5, 2.5]) ≈ [0.2689, 0.7311]
Student: softmax([1.0, 2.0]/1.0) = softmax([1.0, 2.0]) ≈ [0.2689, 0.7311]
The teacher logits are just the student logits shifted by a constant (+0.5), and softmax is invariant to adding a constant to every logit, so the two distributions are identical and the KL divergence is exactly 0.
At T = 1 the T^2 factor is 1, so the final loss is 0.0000.
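This can be checked numerically with a short, self-contained snippet (the helper name `probs` is illustrative):

```python
import math

def probs(logits, t):
    """Temperature-softened softmax over a list of logits."""
    exps = [math.exp(z / t) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

p = probs([1.5, 2.5], 1.0)  # teacher distribution
q = probs([1.0, 2.0], 1.0)  # student distribution
kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
# p and q agree up to floating point, so kl is ~0
```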
student_logits = [0.5, 1.5, 2.5]
teacher_logits = [1.0, 2.0, 3.0]
temperature = 2.0
Output: 0.0000
At temperature T = 2.0, both logit vectors are divided by 2 before the softmax:
Teacher: softmax([1.0, 2.0, 3.0]/2.0) = softmax([0.5, 1.0, 1.5])
Student: softmax([0.5, 1.5, 2.5]/2.0) = softmax([0.25, 0.75, 1.25])
After scaling, the student logits are again a constant shift (−0.25) of the teacher logits, so by softmax's shift invariance the two distributions are identical and the KL divergence is 0.
With T^2 = 4 scaling applied to 0, the final loss is 0.0000.
student_logits = [1.0, 2.0, 3.0]
teacher_logits = [1.0, 2.0, 3.0]
temperature = 1.0
Output: 0.0000
When student and teacher have identical logits, their softmax distributions are exactly the same:
Both: softmax([1.0, 2.0, 3.0]) ≈ [0.0900, 0.2447, 0.6652]
The KL divergence satisfies D_KL(P || Q) ≥ 0, with equality exactly when P = Q (Gibbs' inequality), so D_KL(P || P) = 0 for any distribution P.
This demonstrates that when the student perfectly matches the teacher, the distillation loss is zero, which is the optimization target during training.
Constraints