In many real-world classification problems, class imbalance isn't just a minor annoyance—it's a fundamental obstacle that standard loss functions fail to overcome. Consider object detection: in a single image, there might be 3 objects of interest and 100,000 background regions. Traditional cross-entropy treats each sample equally, causing the overwhelming majority of easy negatives to dominate the loss and gradient.
The result? Models that predict "background" everywhere, achieving low loss by ignoring the rare but important foreground objects.
Focal Loss, introduced by Lin et al. in 2017 for the RetinaNet detector, elegantly solves this problem. Rather than just reweighting classes (which helps but doesn't fully solve the problem), focal loss fundamentally changes how cross-entropy behaves: it down-weights easy examples so that the model focuses its learning capacity on hard examples.
This page will take you through focal loss from motivation to implementation, exploring why it works, when to use it, and how to tune its hyperparameters.
By the end of this page, you will understand: (1) the class imbalance problem and why simple reweighting is insufficient, (2) focal loss's mathematical formulation and the modulating factor, (3) gradient analysis showing how easy examples are suppressed, (4) hyperparameter tuning (focusing weight γ, class balancing α), and (5) applications from object detection to medical imaging to NLP.
Object Detection Example: a dense detector scores tens of thousands of candidate regions (anchors) per image, of which only a handful contain objects of interest.
Medical Diagnosis: in screening datasets, positive cases (a disease actually present) are often a small fraction of all patients examined.
Fraud Detection: fraudulent transactions typically make up far less than 1% of all transactions.
Let's analyze what happens with standard cross-entropy under extreme imbalance.
The Loss Breakdown: For binary classification with cross-entropy: $$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}[y_i \log p_i + (1-y_i) \log(1-p_i)]$$
With 99% negatives and 1% positives, the sum is dominated by the negative terms: each easy negative contributes only a small loss, but collectively the negatives swamp the contribution of the rare positives.
The Gradient Problem: Even worse, the easy negatives contribute non-trivial gradients. For a sigmoid output, the cross-entropy gradient with respect to the logit is $p - y$, so a model predicting $p=0.1$ for an easy negative still has a per-sample gradient of magnitude 0.1.
With 99,000 such samples, these small per-sample gradients sum to a massive signal that overwhelms the 100 hard positives.
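A minimal sketch of that aggregate effect, using the counts just mentioned (99,000 easy negatives, 100 hard positives); the per-example probabilities are illustrative assumptions, not values from the text:

```python
import numpy as np

# Assumed scenario: 99,000 easy negatives predicted at p=0.1,
# 100 hard positives predicted at p=0.2.
n_easy_neg, n_hard_pos = 99_000, 100
p_easy_neg = 0.1   # model's predicted P(class 1) for an easy negative (y=0)
p_hard_pos = 0.2   # model's predicted P(class 1) for a hard positive (y=1)

# For sigmoid + cross-entropy, the per-sample gradient w.r.t. the logit is p - y.
grad_easy_neg = p_easy_neg - 0   # 0.1 per easy negative
grad_hard_pos = p_hard_pos - 1   # -0.8 per hard positive

total_neg = n_easy_neg * abs(grad_easy_neg)   # 9,900
total_pos = n_hard_pos * abs(grad_hard_pos)   # 80

print(f"Summed |gradient| from easy negatives: {total_neg:,.0f}")
print(f"Summed |gradient| from hard positives: {total_pos:,.0f}")
print(f"Easy negatives outweigh hard positives by {total_neg / total_pos:.0f}x")
```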
| Probability p | CE Loss (if y=1) | CE Loss (if y=0) | Classification |
|---|---|---|---|
| 0.99 | 0.010 | 4.605 | Confident positive |
| 0.90 | 0.105 | 2.303 | Positive leaning |
| 0.60 | 0.511 | 0.916 | Uncertain positive |
| 0.40 | 0.916 | 0.511 | Uncertain negative |
| 0.10 | 2.303 | 0.105 | Negative leaning |
| 0.01 | 4.605 | 0.010 | Confident negative |
1. Class Reweighting: $$\mathcal{L} = -\frac{1}{N}\sum[w_1 \cdot y_i \log p_i + w_0 \cdot (1-y_i) \log(1-p_i)]$$
with weights inversely proportional to class frequency.
Problem: Helps with the frequency imbalance but doesn't address that most samples are easy. Easy negatives (p=0.01 for background) still contribute, just with lower weight.
2. Oversampling / Undersampling: duplicate minority-class samples (oversampling) or drop majority-class samples (undersampling) to rebalance the training set.
Problem: Oversampling can cause overfitting to minority class; undersampling discards potentially useful data.
3. Hard Example Mining: explicitly select only the highest-loss (hardest) examples in each batch and train on those, as in online hard example mining (OHEM).
Problem: Introduces additional complexity, hyperparameters, and can miss useful information from medium-difficulty examples.
The key insight missed: The problem isn't just class frequency—it's that easy examples dominate the gradient, regardless of their frequency.
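A minimal numeric sketch of this point, with illustrative counts and probabilities: a fixed class weight rescales every negative equally, so it cannot change the balance between easy and hard negatives within the class.

```python
import numpy as np

# Assumed scenario inside the NEGATIVE class: 99,900 easy negatives at p=0.01
# and 100 hard negatives at p=0.6 (p = predicted probability of class 1).
n_easy, n_hard = 99_900, 100
loss_easy = -np.log(1 - 0.01)   # ≈ 0.0101 per easy negative
loss_hard = -np.log(1 - 0.60)   # ≈ 0.9163 per hard negative

w0 = 0.5  # any class weight applied to ALL negatives (value is illustrative)
print(f"Easy negatives' weighted loss: {w0 * n_easy * loss_easy:.1f}")
print(f"Hard negatives' weighted loss: {w0 * n_hard * loss_hard:.1f}")
# The ratio between easy and hard contributions is unchanged by w0:
print(f"Easy/hard ratio: {(n_easy * loss_easy) / (n_hard * loss_hard):.1f}x (independent of w0)")
```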
Focal loss modifies cross-entropy to reduce the contribution from easy, well-classified examples:
$$\mathcal{L}_{FL} = -\alpha_t (1 - p_t)^\gamma \log(p_t)$$
where:
- $p_t$ is the probability the model assigns to the true class: $p_t = p$ if $y = 1$, and $p_t = 1 - p$ otherwise,
- $\gamma \geq 0$ is the focusing parameter,
- $\alpha_t$ is an optional class-balancing weight ($\alpha$ for positives, $1 - \alpha$ for negatives), covered in detail later.
The new term: $(1 - p_t)^\gamma$
This is the modulating factor. Let's see how it behaves:
When $p_t$ is high (correct and confident): $(1 - p_t)^\gamma \to 0$
When $p_t$ is low (wrong or uncertain): $(1 - p_t)^\gamma \to 1$
Interpretation: The modulating factor acts as a soft attention mechanism that focuses learning on hard examples.
$\gamma$ controls how much to down-weight easy examples: $\gamma = 0$ recovers standard cross-entropy, $\gamma = 2$ is the most common default, and larger values suppress easy examples ever more aggressively.
```python
import numpy as np

def cross_entropy_loss(y_true, p_pred, epsilon=1e-15):
    """Standard cross-entropy loss."""
    p_pred = np.clip(p_pred, epsilon, 1 - epsilon)
    loss = -(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
    return loss

def focal_loss(y_true, p_pred, gamma=2.0, alpha=None, epsilon=1e-15):
    """
    Focal loss: down-weights easy examples.

    Args:
        y_true: True labels (0 or 1)
        p_pred: Predicted probability of class 1
        gamma: Focusing parameter (0 = cross-entropy)
        alpha: Class balancing weight for class 1 (None = no balancing)
    """
    p_pred = np.clip(p_pred, epsilon, 1 - epsilon)

    # p_t: probability assigned to the true class
    p_t = np.where(y_true == 1, p_pred, 1 - p_pred)

    # Modulating factor
    modulating_factor = (1 - p_t) ** gamma

    # Cross-entropy term
    ce = -np.log(p_t)

    # Focal loss
    loss = modulating_factor * ce

    # Optional alpha weighting
    if alpha is not None:
        alpha_t = np.where(y_true == 1, alpha, 1 - alpha)
        loss = alpha_t * loss

    return loss

def focal_loss_reduction_factor(p_t, gamma=2.0):
    """How much focal loss reduces compared to cross-entropy."""
    ce = -np.log(p_t)
    fl = (1 - p_t) ** gamma * ce
    return fl / ce  # Should equal (1 - p_t)^gamma

# Demonstration: Effect of gamma
print("Focal Loss vs Cross-Entropy")
print("=" * 60)
print(f"{'p_t (conf.)':>12} {'CE Loss':>10} {'FL (γ=1)':>10} {'FL (γ=2)':>10} {'FL (γ=5)':>10}")
print("-" * 60)

for p_t in [0.9999, 0.99, 0.9, 0.8, 0.6, 0.5, 0.3, 0.1, 0.01]:
    ce = -np.log(p_t)
    fl1 = (1 - p_t) ** 1 * ce
    fl2 = (1 - p_t) ** 2 * ce
    fl5 = (1 - p_t) ** 5 * ce
    print(f"{p_t:12.4f} {ce:10.4f} {fl1:10.4f} {fl2:10.4f} {fl5:10.4f}")

print("\n" + "=" * 60)
print("Loss Reduction Factor: FL / CE = (1 - p_t)^γ")
print(f"{'p_t':>12} {'γ=1':>10} {'γ=2':>10} {'γ=5':>10}")
print("-" * 60)

for p_t in [0.99, 0.9, 0.8, 0.5, 0.2]:
    r1 = (1 - p_t) ** 1
    r2 = (1 - p_t) ** 2
    r5 = (1 - p_t) ** 5
    print(f"{p_t:12.2f} {r1:10.4f} {r2:10.4f} {r5:10.4f}")
```

At γ=2, an easy example with p_t=0.9 has its loss reduced by (0.1)² = 0.01, a 100× reduction! Meanwhile, a hard example with p_t=0.2 has its loss reduced by only (0.8)² = 0.64, less than 2×.
This differential treatment is the key insight: easy examples contribute almost nothing; hard examples dominate the gradient.
To see exactly how this suppression happens during optimization, let's derive the gradient of focal loss with respect to the logit $z$ (where $p = \sigma(z)$).
Setup: $\mathcal{L}_{FL} = -(1-p_t)^\gamma \log(p_t)$
For $y = 1$ (positive class): $p_t = p = \sigma(z)$
$$\frac{\partial \mathcal{L}_{FL}}{\partial z} = -(1-p)^\gamma \cdot \frac{1}{p} \cdot p(1-p) + \gamma(1-p)^{\gamma-1} \cdot \log(p) \cdot p(1-p)$$
Simplifying: $$\frac{\partial \mathcal{L}_{FL}}{\partial z} = (1-p)^\gamma (p - 1) + \gamma(1-p)^{\gamma} p \log(p)$$
Further simplification (after algebra): $$\frac{\partial \mathcal{L}_{FL}}{\partial z} = (1-p)^\gamma \left[\gamma p \log(p) + p - 1\right]$$
Cross-Entropy Gradient (for y=1): $p - 1$
Focal Loss Gradient (for y=1): $(1-p)^\gamma \cdot [\gamma p \log(p) + p - 1]$
The modulating factor $(1-p)^\gamma$ appears in the gradient, directly suppressing gradients from easy examples.
```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def ce_gradient_wrt_logit(y, z):
    """Gradient of cross-entropy w.r.t. logit z."""
    p = sigmoid(z)
    return p - y  # Simple and beautiful

def focal_gradient_wrt_logit(y, z, gamma=2.0):
    """
    Gradient of focal loss w.r.t. logit z.

    For y=1: (1-p)^γ * [γp*log(p) + p - 1]
    For y=0: p^γ * [-γ(1-p)*log(1-p) + p]   (symmetric derivation)
    """
    p = sigmoid(z)
    if y == 1:
        grad = (1 - p) ** gamma * (gamma * p * np.log(p + 1e-15) + p - 1)
    else:
        grad = p ** gamma * (-gamma * (1 - p) * np.log(1 - p + 1e-15) + p)
    return grad

def focal_gradient_numerical(y, z, gamma=2.0, epsilon=1e-5):
    """Numerical gradient for verification."""
    p_plus = sigmoid(z + epsilon)
    p_minus = sigmoid(z - epsilon)

    def fl(p, y):
        pt = p if y == 1 else 1 - p
        return -(1 - pt) ** gamma * np.log(pt + 1e-15)

    return (fl(p_plus, y) - fl(p_minus, y)) / (2 * epsilon)

# Gradient comparison
print("Gradient Comparison: Cross-Entropy vs Focal Loss (y=1)")
print("=" * 70)
print(f"{'p':>8} {'z':>8} {'CE grad':>12} {'FL grad (γ=2)':>15} {'FL/CE ratio':>12}")
print("-" * 70)

for p in [0.99, 0.95, 0.9, 0.7, 0.5, 0.3, 0.1, 0.05]:
    z = np.log(p / (1 - p))  # logit
    ce_grad = ce_gradient_wrt_logit(1, z)
    fl_grad = focal_gradient_numerical(1, z, gamma=2.0)
    ratio = abs(fl_grad / ce_grad) if abs(ce_grad) > 1e-10 else 0
    print(f"{p:8.2f} {z:8.2f} {ce_grad:12.4f} {fl_grad:15.6f} {ratio:12.4f}")

print("\nKey Insight: For well-classified examples (high p), the focal loss gradient is")
print("             orders of magnitude smaller than the cross-entropy gradient.")
```

Consider an illustrative batch with 1 hard positive (the model predicts $p = 0.2$ for it) and 1,000 easy negatives (the model predicts $p = 0.01$ for each, so $p_t = 0.99$).
Cross-Entropy Gradients: the hard positive contributes a gradient of magnitude $|p - 1| = 0.8$; the easy negatives contribute $1{,}000 \times 0.01 = 10$ in total. The easy negatives outweigh the hard positive by more than 12×.
Focal Loss Gradients (γ=2): the hard positive contributes a gradient of magnitude $\approx 0.92$; the easy negatives contribute roughly $1{,}000 \times 3\times10^{-6} \approx 0.003$ in total. The hard positive now outweighs the easy negatives by a factor of about 300.
The reversal is dramatic: Cross-entropy lets easy negatives dominate; focal loss makes hard positives dominate.
Focal loss can be viewed as an adaptive weighting scheme:
$$w_i = (1 - p_{t,i})^\gamma$$
Easy examples get weight near 0; hard examples get weight near 1. Unlike fixed class weights, this is dynamic—it adapts based on current model confidence, effectively doing implicit curriculum learning where easy examples are removed from training as they become easy.
Focal loss naturally implements a form of self-paced learning. As training progresses:
- Initially, most examples are hard → broad learning
- As the model improves, easy examples get down-weighted → focused learning
- Training automatically concentrates on the remaining hard cases
This is similar to 'self-paced learning' and 'hard example mining' but without explicit sample selection.
Focal loss can include an optional class balancing weight $\alpha$:
$$\mathcal{L}_{FL} = -\alpha_t (1 - p_t)^\gamma \log(p_t)$$
where $\alpha_t = \alpha$ for the positive class ($y = 1$) and $\alpha_t = 1 - \alpha$ for the negative class, with $\alpha \in [0, 1]$.
While $\gamma$ addresses the easy/hard imbalance, $\alpha$ addresses the frequency imbalance:
Common settings: $\alpha = 0.25$ paired with $\gamma = 2$ (the RetinaNet default), and $\alpha = 0.5$ (no class preference) when $\gamma$ alone handles the imbalance.
With $\gamma = 2$, easy negatives are already heavily down-weighted. Setting $\alpha > 0.5$ would further reduce their weight, potentially making positives too dominant.
The optimal $\alpha$ depends on $\gamma$: the larger $\gamma$ is, the more aggressively the abundant easy negatives are already suppressed, so less additional weight on the positive class is needed (and it can even help to weight positives down slightly).
The paper's finding: $\alpha = 0.25$ works well with $\gamma = 2$. This slightly down-weights positives because the $(1-p_t)^2$ term so dramatically up-weights hard examples that additional raw weighting would over-correct.
```python
import numpy as np

def focal_loss_with_alpha(y_true, p_pred, gamma=2.0, alpha=0.25, epsilon=1e-15):
    """
    Full focal loss with both gamma and alpha.

    Args:
        y_true: True labels (0 or 1)
        p_pred: Predicted probability of class 1
        gamma: Focusing parameter
        alpha: Weight for positive class (class 1)
    """
    p_pred = np.clip(p_pred, epsilon, 1 - epsilon)

    # p_t and alpha_t depend on true label
    p_t = np.where(y_true == 1, p_pred, 1 - p_pred)
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)

    # Focal loss
    modulating = (1 - p_t) ** gamma
    ce = -np.log(p_t)
    loss = alpha_t * modulating * ce

    return loss

# Analyze effective weighting for different scenarios
print("Effective Weight Analysis: α × (1-p_t)^γ")
print("=" * 60)
print("With α=0.25, γ=2 (RetinaNet default)")
print()

alpha = 0.25
gamma = 2.0

print("Positive examples (y=1):")
print(f"  {'p_t (conf)':>12} {'raw α':>10} {'modulating':>12} {'effective wt':>14}")
for p_t in [0.9, 0.7, 0.5, 0.3, 0.1]:
    mod = (1 - p_t) ** gamma
    eff = alpha * mod
    print(f"  {p_t:12.2f} {alpha:10.2f} {mod:12.4f} {eff:14.4f}")

print("\nNegative examples (y=0):")
print(f"  {'p_t (conf)':>12} {'raw 1-α':>10} {'modulating':>12} {'effective wt':>14}")
for p_t in [0.9, 0.7, 0.5, 0.3, 0.1]:
    mod = (1 - p_t) ** gamma
    eff = (1 - alpha) * mod
    print(f"  {p_t:12.2f} {1 - alpha:10.2f} {mod:12.4f} {eff:14.4f}")

# Compare aggregate contributions
print("\n" + "=" * 60)
print("Aggregate Contribution in Typical Batch")
print("Scenario: 1 hard positive (p=0.2), 100 easy negatives (p=0.05)")
print()

# Hard positive: true y=1, model gives p=0.2, so p_t=0.2
p_t_pos = 0.2
mod_pos = (1 - p_t_pos) ** gamma
ce_pos = -np.log(p_t_pos)
loss_pos = alpha * mod_pos * ce_pos
print(f"Hard positive:  p_t={p_t_pos}, loss = {loss_pos:.4f}")

# Easy negatives: true y=0, model gives p=0.05, so p_t=0.95
p_t_neg = 0.95  # probability of the correct class (0)
mod_neg = (1 - p_t_neg) ** gamma
ce_neg = -np.log(p_t_neg)
loss_neg = (1 - alpha) * mod_neg * ce_neg
print(f"Easy negative:  p_t={p_t_neg}, loss = {loss_neg:.6f}")

print(f"\n100 easy negatives contribute: {100 * loss_neg:.4f}")
print(f"1 hard positive contributes:   {loss_pos:.4f}")
print(f"Ratio (pos/100neg): {loss_pos / (100 * loss_neg):.1f}x")
print(f"\nThe hard positive dominates despite being outnumbered 100:1!")
```

| Task | γ (Focusing) | α (Positive Weight) | Notes |
|---|---|---|---|
| RetinaNet (detection) | 2.0 | 0.25 | Original paper, dense detectors |
| Medical imaging | 2.0-3.0 | 0.25-0.5 | Depends on disease prevalence |
| Text classification | 1.0-2.0 | 0.5 | Less extreme imbalance typically |
| Semantic segmentation | 2.0 | 0.25 | Similar to detection |
| Fraud detection | 2.0-5.0 | varies | May need high γ for extreme imbalance |
The focal loss naturally extends to multi-class classification:
$$\mathcal{L}_{FL-multi} = -\sum_{k=1}^{K} \alpha_k (1 - p_k)^\gamma y_k \log(p_k)$$
where $p_k$ is the softmax probability for class $k$, $y_k \in \{0, 1\}$ is the one-hot label, and $\alpha_k$ is an optional per-class weight.
Simplified form (for one-hot labels where only true class $c$ has $y_c = 1$):
$$\mathcal{L}_{FL-multi} = -\alpha_c (1 - p_c)^\gamma \log(p_c)$$
For multi-class, $\alpha$ becomes a vector $\boldsymbol{\alpha} = [\alpha_1, \ldots, \alpha_K]$:
Inversely proportional to frequency: $$\alpha_k = \frac{N}{K \cdot n_k}$$ where $n_k$ is the number of samples in class $k$.
Effective number of samples (Cui et al., 2019): $$\alpha_k \propto \frac{1 - \beta^{n_k}}{1 - \beta}$$ where $\beta \in [0, 1)$ is a hyperparameter. This provides smoother weights than inverse frequency.
Learned weights: Treat $\boldsymbol{\alpha}$ as trainable parameters or tune on validation set.
```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp_shifted = np.exp(shifted)
    return exp_shifted / np.sum(exp_shifted, axis=-1, keepdims=True)

def multiclass_focal_loss(y_true_idx, logits, gamma=2.0, alpha=None, epsilon=1e-15):
    """
    Multi-class focal loss from logits.

    Args:
        y_true_idx: True class indices (N,)
        logits: Raw model outputs (N, K)
        gamma: Focusing parameter
        alpha: Per-class weights (K,) or None for uniform
    """
    n_samples, n_classes = logits.shape

    # Convert to probabilities
    probs = softmax(logits)

    # Get probability of true class
    p_true = probs[np.arange(n_samples), y_true_idx]
    p_true = np.clip(p_true, epsilon, 1 - epsilon)

    # Focal loss components
    modulating = (1 - p_true) ** gamma
    ce = -np.log(p_true)
    loss = modulating * ce

    # Apply class-specific alpha if provided
    if alpha is not None:
        alpha_true = alpha[y_true_idx]
        loss = alpha_true * loss

    return np.mean(loss)

def compute_class_weights_inverse_freq(class_counts):
    """Inverse frequency class weights."""
    n_total = np.sum(class_counts)
    n_classes = len(class_counts)
    weights = n_total / (n_classes * class_counts)
    return weights / np.sum(weights)  # Normalize

def compute_class_weights_effective_number(class_counts, beta=0.999):
    """
    Effective number of samples weighting (Cui et al., 2019).
    More moderate than inverse frequency for extreme imbalance.
    """
    effective_num = (1 - beta ** class_counts) / (1 - beta)
    weights = 1.0 / effective_num
    return weights / np.sum(weights)  # Normalize

# Demonstration
print("Multi-Class Focal Loss")
print("=" * 50)

# Simulate imbalanced scenario
np.random.seed(42)
class_counts = np.array([1000, 100, 50, 10, 5])  # Highly imbalanced
n_classes = len(class_counts)

print(f"Class counts: {class_counts}")
print(f"Class frequencies: {class_counts / np.sum(class_counts)}")

# Compute different weight schemes
w_uniform = np.ones(n_classes) / n_classes
w_inverse = compute_class_weights_inverse_freq(class_counts)
w_effective = compute_class_weights_effective_number(class_counts, beta=0.999)

print(f"\nClass weights:")
print(f"  Uniform:       {w_uniform}")
print(f"  Inverse freq:  {w_inverse}")
print(f"  Effective num: {w_effective}")

# Sample batch with class imbalance
batch_size = 50
# Proportional sampling (realistic)
logits = np.random.randn(batch_size, n_classes)
y_true = np.random.choice(n_classes, size=batch_size, p=class_counts / class_counts.sum())

print(f"\nBatch class distribution: {np.bincount(y_true, minlength=n_classes)}")

# Compare losses
loss_ce = multiclass_focal_loss(y_true, logits, gamma=0, alpha=None)
loss_focal = multiclass_focal_loss(y_true, logits, gamma=2.0, alpha=None)
loss_focal_weighted = multiclass_focal_loss(y_true, logits, gamma=2.0, alpha=w_effective)

print(f"\nLoss values:")
print(f"  Standard CE (γ=0, no α):  {loss_ce:.4f}")
print(f"  Focal (γ=2, no α):        {loss_focal:.4f}")
print(f"  Focal + class weights:    {loss_focal_weighted:.4f}")
```

Cui et al.'s 'Class-Balanced Loss' paper (CVPR 2019) recommends the 'effective number of samples' weighting, which provides smoother weights than inverse frequency:
$$E_n = \frac{1 - \beta^n}{1 - \beta}$$
As $\beta \to 1$, $E_n \to n$ and the weights approach inverse class frequency; as $\beta \to 0$, $E_n \to 1$ and all classes get equal weight. $\beta = 0.999$ is a good default for extreme imbalance.
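As a quick illustration of the role of $\beta$ (the class counts below are made up), the normalized class-balanced weights move from nearly uniform toward inverse frequency as $\beta \to 1$:

```python
import numpy as np

# Illustrative class counts; E_n = (1 - β^n) / (1 - β) per class.
class_counts = np.array([10_000, 1_000, 100, 10])

for beta in [0.9, 0.99, 0.999, 0.9999]:
    effective_num = (1 - beta ** class_counts) / (1 - beta)
    weights = 1.0 / effective_num
    weights /= weights.sum()   # normalize for comparison
    print(f"beta={beta}: weights = {np.round(weights, 4)}")
# Small beta -> nearly uniform weights; beta -> 1 -> close to inverse-frequency weights.
```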
Focal loss was invented to save the one-stage detector paradigm.
Problem: Two-stage detectors (R-CNN family) use a region proposal network to filter easy negatives before classification. One-stage detectors (YOLO, SSD) classify all anchors directly, suffering from extreme foreground-background imbalance.
Solution: RetinaNet with focal loss achieved accuracy comparable to two-stage detectors with one-stage speed.
Key insight: The problem wasn't the one-stage architecture; it was using standard cross-entropy.
Medical imaging is rife with class imbalance: rare diseases, lesions that occupy a tiny fraction of an image's pixels, and screening populations where the vast majority of cases are healthy.
Focal loss benefits: it keeps the rare positive findings from being drowned out by abundant normal tissue and background, without discarding data (undersampling) or duplicating it (oversampling).
Named Entity Recognition (NER): most tokens belong to the "O" (non-entity) class, so entity tokens are the rare, harder cases.
Sentiment Analysis with Nuanced Classes: extreme or fine-grained sentiment labels are typically much rarer than neutral or mildly positive/negative ones.
Intent Classification: real-world intent distributions are long-tailed, with a few frequent intents and many rare ones.
Instance Segmentation: In Mask R-CNN and similar models, dense per-pixel or per-anchor classification faces a foreground-background imbalance similar to detection, and focal-style losses are sometimes used for those heads.
Focal loss involves $\log(p_t)$ where $p_t$ can be very close to 0 or 1. Combined with the modulating factor, numerical issues can arise.
Stable Binary Focal Loss Implementation:
Using the identity for binary cross-entropy from logits: $$-\log(\sigma(z)) = \log(1 + e^{-z}) = \text{softplus}(-z)$$ $$-\log(1 - \sigma(z)) = \log(1 + e^z) = \text{softplus}(z)$$
We can therefore compute the cross-entropy term of focal loss directly from the logits, never taking the log of a probability that may have rounded to exactly 0 or 1.
```python
import numpy as np

def stable_sigmoid(z):
    """Numerically stable sigmoid."""
    return np.where(
        z >= 0,
        1 / (1 + np.exp(-z)),
        np.exp(z) / (1 + np.exp(z))
    )

def stable_softplus(x):
    """log(1 + exp(x)) computed stably."""
    return np.where(
        x > 20,
        x,  # For large x, softplus(x) ≈ x
        np.log1p(np.exp(np.minimum(x, 20)))
    )

def binary_focal_loss_from_logits(y_true, logits, gamma=2.0, alpha=0.25):
    """
    Numerically stable binary focal loss from logits.
    Avoids computing probabilities explicitly where possible.
    """
    # Compute p and 1-p in a stable way
    p = stable_sigmoid(logits)

    # p_t = p if y=1, else 1-p
    p_t = np.where(y_true == 1, p, 1 - p)

    # -log(p_t) computed stably from logits:
    #   -log(sigmoid(z))     = softplus(-z)
    #   -log(1 - sigmoid(z)) = softplus(z)
    ce_loss = np.where(
        y_true == 1,
        stable_softplus(-logits),  # -log(p)
        stable_softplus(logits)    # -log(1-p)
    )

    # Modulating factor
    modulating = (1 - p_t) ** gamma

    # Alpha weighting
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)

    # Full focal loss
    loss = alpha_t * modulating * ce_loss

    return np.mean(loss)

def multiclass_focal_loss_from_logits(y_true_idx, logits, gamma=2.0, alpha=None):
    """
    Numerically stable multi-class focal loss.
    Uses log-sum-exp trick for stability.
    """
    n_samples, n_classes = logits.shape

    # Stable log-softmax
    max_logits = np.max(logits, axis=-1, keepdims=True)
    log_sum_exp = max_logits + np.log(
        np.sum(np.exp(logits - max_logits), axis=-1, keepdims=True)
    )
    log_probs = logits - log_sum_exp

    # Log probability and probability of true class
    log_p_true = log_probs[np.arange(n_samples), y_true_idx]
    p_true = np.exp(log_p_true)

    # Focal modulating factor
    modulating = (1 - p_true) ** gamma

    # CE is just -log_p_true
    ce = -log_p_true

    loss = modulating * ce

    if alpha is not None:
        loss = alpha[y_true_idx] * loss

    return np.mean(loss)

# Test numerical stability
print("Numerical Stability Test")
print("=" * 50)

# Extreme logits
test_cases = [
    ("Moderate positive", 2.0),
    ("Large positive", 20.0),
    ("Very large positive", 100.0),
    ("Moderate negative", -2.0),
    ("Large negative", -20.0),
    ("Very large negative", -100.0),
]

print(f"{'Case':>25} {'logit':>8} {'p':>12} {'loss (y=1)':>12} {'stable?':>8}")
print("-" * 70)

for name, z in test_cases:
    logits = np.array([z])
    y_true = np.array([1.0])
    p = stable_sigmoid(z)
    loss = binary_focal_loss_from_logits(y_true, logits, gamma=2.0, alpha=0.25)
    is_stable = np.isfinite(loss)
    print(f"{name:>25} {z:8.1f} {p:12.6f} {loss:12.6f} {'✓' if is_stable else '✗':>8}")
```

PyTorch (manual implementation; not built into core PyTorch as of 2024):
```python
import torch
import torch.nn.functional as F

def focal_loss(inputs, targets, gamma=2.0, alpha=0.25):
    bce = F.binary_cross_entropy_with_logits(inputs, targets, reduction='none')
    p = torch.sigmoid(inputs)
    p_t = p * targets + (1 - p) * (1 - targets)
    modulating = (1 - p_t) ** gamma
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * modulating * bce).mean()
```
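For a quick sanity check, here is a small, purely illustrative usage of the `focal_loss` helper defined above, with random logits and sparse positive targets:

```python
import torch

torch.manual_seed(0)
logits = torch.randn(16)                   # raw model outputs (pre-sigmoid)
targets = (torch.rand(16) > 0.9).float()   # ~10% positives, illustrative data
loss = focal_loss(logits, targets, gamma=2.0, alpha=0.25)  # uses the function above
print(loss)  # scalar tensor
```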
TensorFlow:
```python
import tensorflow as tf

# Note: in recent TF versions (2.10+), alpha only takes effect when
# apply_class_balancing=True is set on BinaryFocalCrossentropy.
loss_fn = tf.keras.losses.BinaryFocalCrossentropy(
    apply_class_balancing=True,
    alpha=0.25,
    gamma=2.0,
    from_logits=True,
)
```
Detectron2 (Facebook's detection library):
```python
from fvcore.nn import sigmoid_focal_loss_jit
```
Label smoothing can be combined with focal loss: $$y_{smooth} = (1 - \epsilon) \cdot y + \epsilon / K$$
The focal loss then uses smoothed targets, which keeps $p_t$ from ever reaching exactly 1 (so gradients never fully vanish), discourages extreme overconfidence on easy examples, and can improve the calibration of predicted probabilities.
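A minimal NumPy sketch of one way to combine the two; the soft-target formulation below is an assumption for illustration (frameworks differ in how they extend focal loss to non-binary targets), not the RetinaNet paper's definition:

```python
import numpy as np

def binary_focal_loss_smoothed(y_true, p_pred, gamma=2.0, alpha=0.25,
                               smoothing=0.1, epsilon=1e-15):
    """Focal loss on label-smoothed binary targets (illustrative formulation)."""
    # Smooth hard 0/1 labels toward 0.5: y_s = (1 - eps) * y + eps / 2  (K = 2 classes)
    y_s = (1 - smoothing) * y_true + smoothing / 2
    p = np.clip(p_pred, epsilon, 1 - epsilon)

    # Treat the smoothed target as a soft mixture of the two classes
    pos_term = alpha * y_s * (1 - p) ** gamma * (-np.log(p))
    neg_term = (1 - alpha) * (1 - y_s) * p ** gamma * (-np.log(1 - p))
    return pos_term + neg_term

y = np.array([1, 1, 0, 0])
p = np.array([0.95, 0.30, 0.05, 0.60])
print(binary_focal_loss_smoothed(y, p))  # per-sample losses; confident predictions are no longer "free"
```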
We've explored focal loss from its motivation in object detection to practical implementation. Here are the essential takeaways:
- Standard cross-entropy fails under extreme imbalance because the many easy examples dominate both the loss and the gradient.
- The modulating factor $(1 - p_t)^\gamma$ down-weights easy, well-classified examples and focuses learning on hard ones.
- $\gamma$ controls the easy/hard focus (commonly 2), while $\alpha$ handles class frequency (commonly 0.25 with $\gamma = 2$).
- The gradient analysis shows that well-classified examples are suppressed by orders of magnitude, so hard examples drive the updates.
- In practice, compute focal loss from logits using numerically stable softplus / log-sum-exp formulations.
Looking Forward
We've now covered the major loss functions: cross-entropy for general classification, MSE for regression, hinge for margin-based learning, and focal for imbalanced data. The final topic brings these together:
You now understand focal loss deeply—why class imbalance defeats standard losses, how the modulating factor solves this, and when focal loss is your best tool. This knowledge is essential for object detection, medical imaging, and any domain with extreme class imbalance.