In many real-world classification problems, class imbalance isn't just a minor annoyance—it's a fundamental obstacle that standard loss functions fail to overcome. Consider object detection: in a single image, there might be 3 objects of interest and 100,000 background regions. Traditional cross-entropy treats each sample equally, causing the overwhelming majority of easy negatives to dominate the loss and gradient.
The result? Models that predict "background" everywhere, achieving low loss by ignoring the rare but important foreground objects.
Focal Loss, introduced by Lin et al. in 2017 for the RetinaNet detector, elegantly solves this problem. Rather than just reweighting classes (which helps but doesn't fully solve the problem), focal loss fundamentally changes how cross-entropy behaves: it down-weights easy examples so that the model focuses its learning capacity on hard examples.
This page will take you through focal loss from motivation to implementation, exploring why it works, when to use it, and how to tune its hyperparameters.
By the end of this page, you will understand: (1) the class imbalance problem and why simple reweighting is insufficient, (2) focal loss's mathematical formulation and the modulating factor, (3) gradient analysis showing how easy examples are suppressed, (4) hyperparameter tuning (focusing weight γ, class balancing α), and (5) applications from object detection to medical imaging to NLP.
Object Detection Example: a dense detector scores tens of thousands of candidate regions (anchors) per image, of which only a handful contain objects of interest.
Medical Diagnosis: in screening datasets, positive cases (a disease actually present) are often a small fraction of all patients examined.
Fraud Detection: fraudulent transactions typically make up far less than 1% of all transactions.
Let's analyze what happens with standard cross-entropy under extreme imbalance.
The Loss Breakdown: For binary classification with cross-entropy: $$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}[y_i \log p_i + (1-y_i) \log(1-p_i)]$$
With 99% negatives and 1% positives, the sum is dominated by the negative terms: each easy negative contributes only a small loss, but collectively the negatives swamp the contribution of the rare positives.
The Gradient Problem: Even worse, the easy negatives contribute non-trivial gradients. For a sigmoid output, the cross-entropy gradient with respect to the logit is $p - y$, so a model predicting $p=0.1$ for an easy negative still has a per-sample gradient of magnitude 0.1.
With 99,000 such samples, these small per-sample gradients sum to a massive signal that overwhelms the 100 hard positives.
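A minimal sketch of that aggregate effect, using the counts just mentioned (99,000 easy negatives, 100 hard positives); the per-example probabilities are illustrative assumptions, not values from the text:

```python
import numpy as np

# Assumed scenario: 99,000 easy negatives predicted at p=0.1,
# 100 hard positives predicted at p=0.2.
n_easy_neg, n_hard_pos = 99_000, 100
p_easy_neg = 0.1   # model's predicted P(class 1) for an easy negative (y=0)
p_hard_pos = 0.2   # model's predicted P(class 1) for a hard positive (y=1)

# For sigmoid + cross-entropy, the per-sample gradient w.r.t. the logit is p - y.
grad_easy_neg = p_easy_neg - 0   # 0.1 per easy negative
grad_hard_pos = p_hard_pos - 1   # -0.8 per hard positive

total_neg = n_easy_neg * abs(grad_easy_neg)   # 9,900
total_pos = n_hard_pos * abs(grad_hard_pos)   # 80

print(f"Summed |gradient| from easy negatives: {total_neg:,.0f}")
print(f"Summed |gradient| from hard positives: {total_pos:,.0f}")
print(f"Easy negatives outweigh hard positives by {total_neg / total_pos:.0f}x")
```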
| Probability p | CE Loss (if y=1) | CE Loss (if y=0) | Classification |
|---|---|---|---|
| 0.99 | 0.010 | 4.605 | Confident positive |
| 0.90 | 0.105 | 2.303 | Positive leaning |
| 0.60 | 0.511 | 0.916 | Uncertain positive |
| 0.40 | 0.916 | 0.511 | Uncertain negative |
| 0.10 | 2.303 | 0.105 | Negative leaning |
| 0.01 | 4.605 | 0.010 | Confident negative |
1. Class Reweighting: $$\mathcal{L} = -\frac{1}{N}\sum[w_1 \cdot y_i \log p_i + w_0 \cdot (1-y_i) \log(1-p_i)]$$
with weights inversely proportional to class frequency.
Problem: Helps with the frequency imbalance but doesn't address that most samples are easy. Easy negatives (p=0.01 for background) still contribute, just with lower weight.
2. Oversampling / Undersampling: duplicate minority-class samples (oversampling) or drop majority-class samples (undersampling) to rebalance the training set.
Problem: Oversampling can cause overfitting to minority class; undersampling discards potentially useful data.
3. Hard Example Mining: explicitly select only the highest-loss (hardest) examples in each batch and train on those, as in online hard example mining (OHEM).
Problem: Introduces additional complexity, hyperparameters, and can miss useful information from medium-difficulty examples.
The key insight missed: The problem isn't just class frequency—it's that easy examples dominate the gradient, regardless of their frequency.
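A minimal numeric sketch of this point, with illustrative counts and probabilities: a fixed class weight rescales every negative equally, so it cannot change the balance between easy and hard negatives within the class.

```python
import numpy as np

# Assumed scenario inside the NEGATIVE class: 99,900 easy negatives at p=0.01
# and 100 hard negatives at p=0.6 (p = predicted probability of class 1).
n_easy, n_hard = 99_900, 100
loss_easy = -np.log(1 - 0.01)   # ≈ 0.0101 per easy negative
loss_hard = -np.log(1 - 0.60)   # ≈ 0.9163 per hard negative

w0 = 0.5  # any class weight applied to ALL negatives (value is illustrative)
print(f"Easy negatives' weighted loss: {w0 * n_easy * loss_easy:.1f}")
print(f"Hard negatives' weighted loss: {w0 * n_hard * loss_hard:.1f}")
# The ratio between easy and hard contributions is unchanged by w0:
print(f"Easy/hard ratio: {(n_easy * loss_easy) / (n_hard * loss_hard):.1f}x (independent of w0)")
```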
Focal loss modifies cross-entropy to reduce the contribution from easy, well-classified examples:
$$\mathcal{L}_{FL} = -\alpha_t (1 - p_t)^\gamma \log(p_t)$$
where:
- $p_t$ is the probability the model assigns to the true class: $p_t = p$ if $y = 1$, and $p_t = 1 - p$ otherwise,
- $\gamma \geq 0$ is the focusing parameter,
- $\alpha_t$ is an optional class-balancing weight ($\alpha$ for positives, $1 - \alpha$ for negatives), covered in detail later.
The new term: $(1 - p_t)^\gamma$
This is the modulating factor. Let's see how it behaves:
When $p_t$ is high (correct and confident): $(1 - p_t)^\gamma \to 0$
When $p_t$ is low (wrong or uncertain): $(1 - p_t)^\gamma \to 1$
Interpretation: The modulating factor acts as a soft attention mechanism that focuses learning on hard examples.
$\gamma$ controls how much to down-weight easy examples: $\gamma = 0$ recovers standard cross-entropy, $\gamma = 2$ is the most common default, and larger values suppress easy examples ever more aggressively.
```python
import numpy as np

def cross_entropy_loss(y_true, p_pred, epsilon=1e-15):
    """Standard cross-entropy loss."""
    p_pred = np.clip(p_pred, epsilon, 1 - epsilon)
    loss = -(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
    return loss

def focal_loss(y_true, p_pred, gamma=2.0, alpha=None, epsilon=1e-15):
    """
    Focal loss: down-weights easy examples.

    Args:
        y_true: True labels (0 or 1)
        p_pred: Predicted probability of class 1
        gamma: Focusing parameter (0 = cross-entropy)
        alpha: Class balancing weight for class 1 (None = no balancing)
    """
    p_pred = np.clip(p_pred, epsilon, 1 - epsilon)

    # p_t: probability assigned to the true class
    p_t = np.where(y_true == 1, p_pred, 1 - p_pred)

    # Modulating factor
    modulating_factor = (1 - p_t) ** gamma

    # Cross-entropy term
    ce = -np.log(p_t)

    # Focal loss
    loss = modulating_factor * ce

    # Optional alpha weighting
    if alpha is not None:
        alpha_t = np.where(y_true == 1, alpha, 1 - alpha)
        loss = alpha_t * loss

    return loss

def focal_loss_reduction_factor(p_t, gamma=2.0):
    """How much focal loss reduces compared to cross-entropy."""
    ce = -np.log(p_t)
    fl = (1 - p_t) ** gamma * ce
    return fl / ce  # Should equal (1 - p_t)^gamma

# Demonstration: Effect of gamma
print("Focal Loss vs Cross-Entropy")
print("=" * 60)
print(f"{'p_t (conf.)':>12} {'CE Loss':>10} {'FL (γ=1)':>10} {'FL (γ=2)':>10} {'FL (γ=5)':>10}")
print("-" * 60)

for p_t in [0.9999, 0.99, 0.9, 0.8, 0.6, 0.5, 0.3, 0.1, 0.01]:
    ce = -np.log(p_t)
    fl1 = (1 - p_t) ** 1 * ce
    fl2 = (1 - p_t) ** 2 * ce
    fl5 = (1 - p_t) ** 5 * ce
    print(f"{p_t:12.4f} {ce:10.4f} {fl1:10.4f} {fl2:10.4f} {fl5:10.4f}")

print("\n" + "=" * 60)
print("Loss Reduction Factor: FL / CE = (1 - p_t)^γ")
print(f"{'p_t':>12} {'γ=1':>10} {'γ=2':>10} {'γ=5':>10}")
print("-" * 60)

for p_t in [0.99, 0.9, 0.8, 0.5, 0.2]:
    r1 = (1 - p_t) ** 1
    r2 = (1 - p_t) ** 2
    r5 = (1 - p_t) ** 5
    print(f"{p_t:12.2f} {r1:10.4f} {r2:10.4f} {r5:10.4f}")
```

At γ=2, an easy example with p_t=0.9 has its loss reduced by (0.1)² = 0.01, a 100× reduction! Meanwhile, a hard example with p_t=0.2 has its loss reduced by only (0.8)² = 0.64, less than 2×.
This differential treatment is the key insight: easy examples contribute almost nothing; hard examples dominate the gradient.
To see exactly how this suppression happens during optimization, let's derive the gradient of focal loss with respect to the logit $z$ (where $p = \sigma(z)$).
Setup: $\mathcal{L}_{FL} = -(1-p_t)^\gamma \log(p_t)$
For $y = 1$ (positive class): $p_t = p = \sigma(z)$
$$\frac{\partial \mathcal{L}_{FL}}{\partial z} = -(1-p)^\gamma \cdot \frac{1}{p} \cdot p(1-p) + \gamma(1-p)^{\gamma-1} \cdot \log(p) \cdot p(1-p)$$
Simplifying: $$\frac{\partial \mathcal{L}_{FL}}{\partial z} = (1-p)^\gamma (p - 1) + \gamma(1-p)^{\gamma} p \log(p)$$
Further simplification (after algebra): $$\frac{\partial \mathcal{L}_{FL}}{\partial z} = (1-p)^\gamma \left[\gamma p \log(p) + p - 1\right]$$
Cross-Entropy Gradient (for y=1): $p - 1$
Focal Loss Gradient (for y=1): $(1-p)^\gamma \cdot [\gamma p \log(p) + p - 1]$
The modulating factor $(1-p)^\gamma$ appears in the gradient, directly suppressing gradients from easy examples.
```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def ce_gradient_wrt_logit(y, z):
    """Gradient of cross-entropy w.r.t. logit z."""
    p = sigmoid(z)
    return p - y  # Simple and beautiful

def focal_gradient_wrt_logit(y, z, gamma=2.0):
    """
    Gradient of focal loss w.r.t. logit z.

    For y=1: (1-p)^γ * [γp*log(p) + p - 1]
    For y=0: p^γ * [-γ(1-p)*log(1-p) + p]   (symmetric derivation)
    """
    p = sigmoid(z)
    if y == 1:
        grad = (1 - p) ** gamma * (gamma * p * np.log(p + 1e-15) + p - 1)
    else:
        grad = p ** gamma * (-gamma * (1 - p) * np.log(1 - p + 1e-15) + p)
    return grad

def focal_gradient_numerical(y, z, gamma=2.0, epsilon=1e-5):
    """Numerical gradient for verification."""
    p_plus = sigmoid(z + epsilon)
    p_minus = sigmoid(z - epsilon)

    def fl(p, y):
        pt = p if y == 1 else 1 - p
        return -(1 - pt) ** gamma * np.log(pt + 1e-15)

    return (fl(p_plus, y) - fl(p_minus, y)) / (2 * epsilon)

# Gradient comparison
print("Gradient Comparison: Cross-Entropy vs Focal Loss (y=1)")
print("=" * 70)
print(f"{'p':>8} {'z':>8} {'CE grad':>12} {'FL grad (γ=2)':>15} {'FL/CE ratio':>12}")
print("-" * 70)

for p in [0.99, 0.95, 0.9, 0.7, 0.5, 0.3, 0.1, 0.05]:
    z = np.log(p / (1 - p))  # logit
    ce_grad = ce_gradient_wrt_logit(1, z)
    fl_grad = focal_gradient_numerical(1, z, gamma=2.0)
    ratio = abs(fl_grad / ce_grad) if abs(ce_grad) > 1e-10 else 0
    print(f"{p:8.2f} {z:8.2f} {ce_grad:12.4f} {fl_grad:15.6f} {ratio:12.4f}")

print("\nKey Insight: For well-classified examples (high p), the focal loss gradient is")
print("             orders of magnitude smaller than the cross-entropy gradient.")
```

Consider an illustrative batch with 1 hard positive (the model predicts $p = 0.2$ for it) and 1,000 easy negatives (the model predicts $p = 0.01$ for each, so $p_t = 0.99$).
Cross-Entropy Gradients: the hard positive contributes a gradient of magnitude $|p - 1| = 0.8$; the easy negatives contribute $1{,}000 \times 0.01 = 10$ in total. The easy negatives outweigh the hard positive by more than 12×.
Focal Loss Gradients (γ=2): the hard positive contributes a gradient of magnitude $\approx 0.92$; the easy negatives contribute roughly $1{,}000 \times 3\times10^{-6} \approx 0.003$ in total. The hard positive now outweighs the easy negatives by a factor of about 300.
The reversal is dramatic: Cross-entropy lets easy negatives dominate; focal loss makes hard positives dominate.
Focal loss can be viewed as an adaptive weighting scheme:
$$w_i = (1 - p_{t,i})^\gamma$$
Easy examples get weight near 0; hard examples get weight near 1. Unlike fixed class weights, this is dynamic—it adapts based on current model confidence, effectively doing implicit curriculum learning where easy examples are removed from training as they become easy.
Focal loss naturally implements a form of self-paced learning. As training progresses:
- Initially, most examples are hard → broad learning
- As the model improves, easy examples get down-weighted → focused learning
- Training automatically concentrates on the remaining hard cases
This is similar to 'self-paced learning' and 'hard example mining' but without explicit sample selection.
Focal loss can include an optional class balancing weight $\alpha$:
$$\mathcal{L}_{FL} = -\alpha_t (1 - p_t)^\gamma \log(p_t)$$
where $\alpha_t = \alpha$ for the positive class ($y = 1$) and $\alpha_t = 1 - \alpha$ for the negative class, with $\alpha \in [0, 1]$.
While $\gamma$ addresses the easy/hard imbalance, $\alpha$ addresses the frequency imbalance:
Common settings: $\alpha = 0.25$ paired with $\gamma = 2$ (the RetinaNet default), and $\alpha = 0.5$ (no class preference) when $\gamma$ alone handles the imbalance.
With $\gamma = 2$, easy negatives are already heavily down-weighted. Setting $\alpha > 0.5$ would further reduce their weight, potentially making positives too dominant.
The optimal $\alpha$ depends on $\gamma$: the larger $\gamma$ is, the more aggressively the abundant easy negatives are already suppressed, so less additional weight on the positive class is needed (and it can even help to weight positives down slightly).
The paper's finding: $\alpha = 0.25$ works well with $\gamma = 2$. This slightly down-weights positives because the $(1-p_t)^2$ term so dramatically up-weights hard examples that additional raw weighting would over-correct.
```python
import numpy as np

def focal_loss_with_alpha(y_true, p_pred, gamma=2.0, alpha=0.25, epsilon=1e-15):
    """
    Full focal loss with both gamma and alpha.

    Args:
        y_true: True labels (0 or 1)
        p_pred: Predicted probability of class 1
        gamma: Focusing parameter
        alpha: Weight for positive class (class 1)
    """
    p_pred = np.clip(p_pred, epsilon, 1 - epsilon)

    # p_t and alpha_t depend on true label
    p_t = np.where(y_true == 1, p_pred, 1 - p_pred)
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)

    # Focal loss
    modulating = (1 - p_t) ** gamma
    ce = -np.log(p_t)
    loss = alpha_t * modulating * ce

    return loss

# Analyze effective weighting for different scenarios
print("Effective Weight Analysis: α × (1-p_t)^γ")
print("=" * 60)
print("With α=0.25, γ=2 (RetinaNet default)")
print()

alpha = 0.25
gamma = 2.0

print("Positive examples (y=1):")
print(f"  {'p_t (conf)':>12} {'raw α':>10} {'modulating':>12} {'effective wt':>14}")
for p_t in [0.9, 0.7, 0.5, 0.3, 0.1]:
    mod = (1 - p_t) ** gamma
    eff = alpha * mod
    print(f"  {p_t:12.2f} {alpha:10.2f} {mod:12.4f} {eff:14.4f}")

print("\nNegative examples (y=0):")
print(f"  {'p_t (conf)':>12} {'raw 1-α':>10} {'modulating':>12} {'effective wt':>14}")
for p_t in [0.9, 0.7, 0.5, 0.3, 0.1]:
    mod = (1 - p_t) ** gamma
    eff = (1 - alpha) * mod
    print(f"  {p_t:12.2f} {1 - alpha:10.2f} {mod:12.4f} {eff:14.4f}")

# Compare aggregate contributions
print("\n" + "=" * 60)
print("Aggregate Contribution in Typical Batch")
print("Scenario: 1 hard positive (p=0.2), 100 easy negatives (p=0.05)")
print()

# Hard positive: true y=1, model gives p=0.2, so p_t=0.2
p_t_pos = 0.2
mod_pos = (1 - p_t_pos) ** gamma
ce_pos = -np.log(p_t_pos)
loss_pos = alpha * mod_pos * ce_pos
print(f"Hard positive:  p_t={p_t_pos}, loss = {loss_pos:.4f}")

# Easy negatives: true y=0, model gives p=0.05, so p_t=0.95
p_t_neg = 0.95  # probability of the correct class (0)
mod_neg = (1 - p_t_neg) ** gamma
ce_neg = -np.log(p_t_neg)
loss_neg = (1 - alpha) * mod_neg * ce_neg
print(f"Easy negative:  p_t={p_t_neg}, loss = {loss_neg:.6f}")

print(f"\n100 easy negatives contribute: {100 * loss_neg:.4f}")
print(f"1 hard positive contributes:   {loss_pos:.4f}")
print(f"Ratio (pos/100neg): {loss_pos / (100 * loss_neg):.1f}x")
print(f"\nThe hard positive dominates despite being outnumbered 100:1!")
```

| Task | γ (Focusing) | α (Positive Weight) | Notes |
|---|---|---|---|
| RetinaNet (detection) | 2.0 | 0.25 | Original paper, dense detectors |
| Medical imaging | 2.0-3.0 | 0.25-0.5 | Depends on disease prevalence |
| Text classification | 1.0-2.0 | 0.5 | Less extreme imbalance typically |
| Semantic segmentation | 2.0 | 0.25 | Similar to detection |
| Fraud detection | 2.0-5.0 | varies | May need high γ for extreme imbalance |
The focal loss naturally extends to multi-class classification:
$$\mathcal{L}_{FL-multi} = -\sum_{k=1}^{K} \alpha_k (1 - p_k)^\gamma y_k \log(p_k)$$
where $p_k$ is the softmax probability for class $k$, $y_k \in \{0, 1\}$ is the one-hot label, and $\alpha_k$ is an optional per-class weight.
Simplified form (for one-hot labels where only true class $c$ has $y_c = 1$):
$$\mathcal{L}_{FL-multi} = -\alpha_c (1 - p_c)^\gamma \log(p_c)$$
For multi-class, $\alpha$ becomes a vector $\boldsymbol{\alpha} = [\alpha_1, \ldots, \alpha_K]$:
Inversely proportional to frequency: $$\alpha_k = \frac{N}{K \cdot n_k}$$ where $n_k$ is the number of samples in class $k$.
Effective number of samples (Cui et al., 2019): $$\alpha_k \propto \frac{1 - \beta^{n_k}}{1 - \beta}$$ where $\beta \in [0, 1)$ is a hyperparameter. This provides smoother weights than inverse frequency.
Learned weights: Treat $\boldsymbol{\alpha}$ as trainable parameters or tune on validation set.
```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp_shifted = np.exp(shifted)
    return exp_shifted / np.sum(exp_shifted, axis=-1, keepdims=True)

def multiclass_focal_loss(y_true_idx, logits, gamma=2.0, alpha=None, epsilon=1e-15):
    """
    Multi-class focal loss from logits.

    Args:
        y_true_idx: True class indices (N,)
        logits: Raw model outputs (N, K)
        gamma: Focusing parameter
        alpha: Per-class weights (K,) or None for uniform
    """
    n_samples, n_classes = logits.shape

    # Convert to probabilities
    probs = softmax(logits)

    # Get probability of true class
    p_true = probs[np.arange(n_samples), y_true_idx]
    p_true = np.clip(p_true, epsilon, 1 - epsilon)

    # Focal loss components
    modulating = (1 - p_true) ** gamma
    ce = -np.log(p_true)
    loss = modulating * ce

    # Apply class-specific alpha if provided
    if alpha is not None:
        alpha_true = alpha[y_true_idx]
        loss = alpha_true * loss

    return np.mean(loss)

def compute_class_weights_inverse_freq(class_counts):
    """Inverse frequency class weights."""
    n_total = np.sum(class_counts)
    n_classes = len(class_counts)
    weights = n_total / (n_classes * class_counts)
    return weights / np.sum(weights)  # Normalize

def compute_class_weights_effective_number(class_counts, beta=0.999):
    """
    Effective number of samples weighting (Cui et al., 2019).
    More moderate than inverse frequency for extreme imbalance.
    """
    effective_num = (1 - beta ** class_counts) / (1 - beta)
    weights = 1.0 / effective_num
    return weights / np.sum(weights)  # Normalize

# Demonstration
print("Multi-Class Focal Loss")
print("=" * 50)

# Simulate imbalanced scenario
np.random.seed(42)
class_counts = np.array([1000, 100, 50, 10, 5])  # Highly imbalanced
n_classes = len(class_counts)

print(f"Class counts: {class_counts}")
print(f"Class frequencies: {class_counts / np.sum(class_counts)}")

# Compute different weight schemes
w_uniform = np.ones(n_classes) / n_classes
w_inverse = compute_class_weights_inverse_freq(class_counts)
w_effective = compute_class_weights_effective_number(class_counts, beta=0.999)

print(f"\nClass weights:")
print(f"  Uniform:       {w_uniform}")
print(f"  Inverse freq:  {w_inverse}")
print(f"  Effective num: {w_effective}")

# Sample batch with class imbalance
batch_size = 50
# Proportional sampling (realistic)
logits = np.random.randn(batch_size, n_classes)
y_true = np.random.choice(n_classes, size=batch_size, p=class_counts / class_counts.sum())

print(f"\nBatch class distribution: {np.bincount(y_true, minlength=n_classes)}")

# Compare losses
loss_ce = multiclass_focal_loss(y_true, logits, gamma=0, alpha=None)
loss_focal = multiclass_focal_loss(y_true, logits, gamma=2.0, alpha=None)
loss_focal_weighted = multiclass_focal_loss(y_true, logits, gamma=2.0, alpha=w_effective)

print(f"\nLoss values:")
print(f"  Standard CE (γ=0, no α):  {loss_ce:.4f}")
print(f"  Focal (γ=2, no α):        {loss_focal:.4f}")
print(f"  Focal + class weights:    {loss_focal_weighted:.4f}")
```

Cui et al.'s 'Class-Balanced Loss' paper (CVPR 2019) recommends the 'effective number of samples' weighting, which provides smoother weights than inverse frequency:
$$E_n = \frac{1 - \beta^n}{1 - \beta}$$
As $\beta \to 1$, $E_n \to n$ and the weights approach inverse class frequency; as $\beta \to 0$, $E_n \to 1$ and all classes get equal weight. $\beta = 0.999$ is a good default for extreme imbalance.
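As a quick illustration of the role of $\beta$ (the class counts below are made up), the normalized class-balanced weights move from nearly uniform toward inverse frequency as $\beta \to 1$:

```python
import numpy as np

# Illustrative class counts; E_n = (1 - β^n) / (1 - β) per class.
class_counts = np.array([10_000, 1_000, 100, 10])

for beta in [0.9, 0.99, 0.999, 0.9999]:
    effective_num = (1 - beta ** class_counts) / (1 - beta)
    weights = 1.0 / effective_num
    weights /= weights.sum()   # normalize for comparison
    print(f"beta={beta}: weights = {np.round(weights, 4)}")
# Small beta -> nearly uniform weights; beta -> 1 -> close to inverse-frequency weights.
```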
Focal loss was invented to save the one-stage detector paradigm.
Problem: Two-stage detectors (R-CNN family) use a region proposal network to filter easy negatives before classification. One-stage detectors (YOLO, SSD) classify all anchors directly, suffering from extreme foreground-background imbalance.
Solution: RetinaNet with focal loss achieved accuracy comparable to two-stage detectors with one-stage speed.
Key insight: The problem wasn't the one-stage architecture; it was using standard cross-entropy.
Medical imaging is rife with class imbalance: rare diseases, lesions that occupy a tiny fraction of an image's pixels, and screening populations where the vast majority of cases are healthy.
Focal loss benefits: it keeps the rare positive findings from being drowned out by abundant normal tissue and background, without discarding data (undersampling) or duplicating it (oversampling).
Named Entity Recognition (NER): most tokens belong to the "O" (non-entity) class, so entity tokens are the rare, harder cases.
Sentiment Analysis with Nuanced Classes: extreme or fine-grained sentiment labels are typically much rarer than neutral or mildly positive/negative ones.
Intent Classification: real-world intent distributions are long-tailed, with a few frequent intents and many rare ones.
Instance Segmentation: In Mask R-CNN and similar models, dense per-pixel or per-anchor classification faces a foreground-background imbalance similar to detection, and focal-style losses are sometimes used for those heads.
Focal loss involves $\log(p_t)$ where $p_t$ can be very close to 0 or 1. Combined with the modulating factor, numerical issues can arise.
Stable Binary Focal Loss Implementation:
Using the identity for binary cross-entropy from logits: $$-\log(\sigma(z)) = \log(1 + e^{-z}) = \text{softplus}(-z)$$ $$-\log(1 - \sigma(z)) = \log(1 + e^z) = \text{softplus}(z)$$
We can therefore compute the cross-entropy term of focal loss directly from the logits, never taking the log of a probability that may have rounded to exactly 0 or 1.
```python
import numpy as np

def stable_sigmoid(z):
    """Numerically stable sigmoid."""
    return np.where(
        z >= 0,
        1 / (1 + np.exp(-z)),
        np.exp(z) / (1 + np.exp(z))
    )

def stable_softplus(x):
    """log(1 + exp(x)) computed stably."""
    return np.where(
        x > 20,
        x,  # For large x, softplus(x) ≈ x
        np.log1p(np.exp(np.minimum(x, 20)))
    )

def binary_focal_loss_from_logits(y_true, logits, gamma=2.0, alpha=0.25):
    """
    Numerically stable binary focal loss from logits.
    Avoids computing probabilities explicitly where possible.
    """
    # Compute p and 1-p in a stable way
    p = stable_sigmoid(logits)

    # p_t = p if y=1, else 1-p
    p_t = np.where(y_true == 1, p, 1 - p)

    # -log(p_t) computed stably from logits:
    #   -log(sigmoid(z))     = softplus(-z)
    #   -log(1 - sigmoid(z)) = softplus(z)
    ce_loss = np.where(
        y_true == 1,
        stable_softplus(-logits),  # -log(p)
        stable_softplus(logits)    # -log(1-p)
    )

    # Modulating factor
    modulating = (1 - p_t) ** gamma

    # Alpha weighting
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)

    # Full focal loss
    loss = alpha_t * modulating * ce_loss

    return np.mean(loss)

def multiclass_focal_loss_from_logits(y_true_idx, logits, gamma=2.0, alpha=None):
    """
    Numerically stable multi-class focal loss.
    Uses log-sum-exp trick for stability.
    """
    n_samples, n_classes = logits.shape

    # Stable log-softmax
    max_logits = np.max(logits, axis=-1, keepdims=True)
    log_sum_exp = max_logits + np.log(
        np.sum(np.exp(logits - max_logits), axis=-1, keepdims=True)
    )
    log_probs = logits - log_sum_exp

    # Log probability and probability of true class
    log_p_true = log_probs[np.arange(n_samples), y_true_idx]
    p_true = np.exp(log_p_true)

    # Focal modulating factor
    modulating = (1 - p_true) ** gamma

    # CE is just -log_p_true
    ce = -log_p_true

    loss = modulating * ce

    if alpha is not None:
        loss = alpha[y_true_idx] * loss

    return np.mean(loss)

# Test numerical stability
print("Numerical Stability Test")
print("=" * 50)

# Extreme logits
test_cases = [
    ("Moderate positive", 2.0),
    ("Large positive", 20.0),
    ("Very large positive", 100.0),
    ("Moderate negative", -2.0),
    ("Large negative", -20.0),
    ("Very large negative", -100.0),
]

print(f"{'Case':>25} {'logit':>8} {'p':>12} {'loss (y=1)':>12} {'stable?':>8}")
print("-" * 70)

for name, z in test_cases:
    logits = np.array([z])
    y_true = np.array([1.0])
    p = stable_sigmoid(z)
    loss = binary_focal_loss_from_logits(y_true, logits, gamma=2.0, alpha=0.25)
    is_stable = np.isfinite(loss)
    print(f"{name:>25} {z:8.1f} {p:12.6f} {loss:12.6f} {'✓' if is_stable else '✗':>8}")
```

PyTorch (manual implementation; not built into core PyTorch as of 2024):
```python
import torch
import torch.nn.functional as F

def focal_loss(inputs, targets, gamma=2.0, alpha=0.25):
    bce = F.binary_cross_entropy_with_logits(inputs, targets, reduction='none')
    p = torch.sigmoid(inputs)
    p_t = p * targets + (1 - p) * (1 - targets)
    modulating = (1 - p_t) ** gamma
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * modulating * bce).mean()
```
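For a quick sanity check, here is a small, purely illustrative usage of the `focal_loss` helper defined above, with random logits and sparse positive targets:

```python
import torch

torch.manual_seed(0)
logits = torch.randn(16)                   # raw model outputs (pre-sigmoid)
targets = (torch.rand(16) > 0.9).float()   # ~10% positives, illustrative data
loss = focal_loss(logits, targets, gamma=2.0, alpha=0.25)  # uses the function above
print(loss)  # scalar tensor
```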
TensorFlow:
```python
import tensorflow as tf

# Note: in recent TF versions (2.10+), alpha only takes effect when
# apply_class_balancing=True is set on BinaryFocalCrossentropy.
loss_fn = tf.keras.losses.BinaryFocalCrossentropy(
    apply_class_balancing=True,
    alpha=0.25,
    gamma=2.0,
    from_logits=True,
)
```
Detectron2 (Facebook's detection library):
```python
from fvcore.nn import sigmoid_focal_loss_jit
```
Label smoothing can be combined with focal loss: $$y_{smooth} = (1 - \epsilon) \cdot y + \epsilon / K$$
The focal loss then uses smoothed targets, which keeps $p_t$ from ever reaching exactly 1 (so gradients never fully vanish), discourages extreme overconfidence on easy examples, and can improve the calibration of predicted probabilities.
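A minimal NumPy sketch of one way to combine the two; the soft-target formulation below is an assumption for illustration (frameworks differ in how they extend focal loss to non-binary targets), not the RetinaNet paper's definition:

```python
import numpy as np

def binary_focal_loss_smoothed(y_true, p_pred, gamma=2.0, alpha=0.25,
                               smoothing=0.1, epsilon=1e-15):
    """Focal loss on label-smoothed binary targets (illustrative formulation)."""
    # Smooth hard 0/1 labels toward 0.5: y_s = (1 - eps) * y + eps / 2  (K = 2 classes)
    y_s = (1 - smoothing) * y_true + smoothing / 2
    p = np.clip(p_pred, epsilon, 1 - epsilon)

    # Treat the smoothed target as a soft mixture of the two classes
    pos_term = alpha * y_s * (1 - p) ** gamma * (-np.log(p))
    neg_term = (1 - alpha) * (1 - y_s) * p ** gamma * (-np.log(1 - p))
    return pos_term + neg_term

y = np.array([1, 1, 0, 0])
p = np.array([0.95, 0.30, 0.05, 0.60])
print(binary_focal_loss_smoothed(y, p))  # per-sample losses; confident predictions are no longer "free"
```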
We've explored focal loss from its motivation in object detection to practical implementation. Here are the essential takeaways:
- Standard cross-entropy fails under extreme imbalance because the many easy examples dominate both the loss and the gradient.
- The modulating factor $(1 - p_t)^\gamma$ down-weights easy, well-classified examples and focuses learning on hard ones.
- $\gamma$ controls the easy/hard focus (commonly 2), while $\alpha$ handles class frequency (commonly 0.25 with $\gamma = 2$).
- The gradient analysis shows that well-classified examples are suppressed by orders of magnitude, so hard examples drive the updates.
- In practice, compute focal loss from logits using numerically stable softplus / log-sum-exp formulations.
Looking Forward
We've now covered the major loss functions: cross-entropy for general classification, MSE for regression, hinge for margin-based learning, and focal for imbalanced data. The final topic brings these together:
You now understand focal loss deeply—why class imbalance defeats standard losses, how the modulating factor solves this, and when focal loss is your best tool. This knowledge is essential for object detection, medical imaging, and any domain with extreme class imbalance.