Cross-entropy, MSE, hinge, and focal loss cover most use cases—but not all. Real-world problems often have unique requirements: physical constraints that predictions must satisfy, business objectives that don't map to standard accuracy metrics, or prior knowledge about the problem structure.
Custom loss functions are the bridge between what neural networks naturally optimize and what your application truly needs. This page teaches you how to design, implement, and debug custom objectives.
By the end of this page, you will understand: (1) principles for combining multiple losses, (2) perceptual and feature-matching losses, (3) physics-informed and constraint-based losses, (4) ranking and contrastive objectives, and (5) practical debugging strategies.
Many applications require multiple objectives:
$$\mathcal{L}_{total} = \sum_{i=1}^{T} \lambda_i \mathcal{L}_i$$
Strategies for choosing the weights $\lambda_i$:
1. Fixed weights: Set $\lambda_i$ manually based on domain knowledge or preliminary experiments.
2. Uncertainty weighting (Kendall et al., 2018): $$\mathcal{L} = \sum_i \frac{1}{2\sigma_i^2}\mathcal{L}_i + \log\sigma_i$$
where $\sigma_i$ are learned. Tasks with higher uncertainty get lower weight.
3. Gradient normalization: Scale losses so gradients have similar magnitudes.
4. Dynamic weighting: Adjust weights during training based on task progress.
```python
import numpy as np

class UncertaintyWeightedLoss:
    """
    Learns task weights based on homoscedastic uncertainty.
    """
    def __init__(self, n_tasks):
        # log(sigma^2) for each task, initialized to 0
        self.log_vars = np.zeros(n_tasks)

    def __call__(self, task_losses):
        """
        Args:
            task_losses: List of loss values, one per task
        Returns:
            Combined loss (a scalar)
        """
        total = 0
        for i, loss in enumerate(task_losses):
            precision = np.exp(-self.log_vars[i])  # 1/sigma^2
            total += precision * loss + self.log_vars[i]
        return 0.5 * total

    def get_weights(self):
        """Current effective weights (inverse variance)."""
        return np.exp(-self.log_vars)

# Demo
losses = [0.5, 2.0, 0.1]  # Classification, bbox regression, mask
weighted = UncertaintyWeightedLoss(3)
print(f"Initial weights: {weighted.get_weights()}")
print(f"Combined loss: {weighted(losses):.4f}")
```

Losses with different scales can cause training instability. If $\mathcal{L}_1$ is typically 0.1 and $\mathcal{L}_2$ is typically 100, raw combination means $\mathcal{L}_2$ dominates completely. Always normalize or weight appropriately.
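One simple way to act on that warning, and on the gradient-normalization strategy listed earlier, is to rescale each loss by a running estimate of its own magnitude. This is a simplified sketch, not full GradNorm; the class name and `momentum` default are illustrative:

```python
import numpy as np

class LossScaleNormalizer:
    """Rescale each task loss by a running mean of its magnitude,
    so all tasks contribute at roughly unit scale."""

    def __init__(self, n_tasks, momentum=0.9):
        self.momentum = momentum
        self.running = np.ones(n_tasks)  # running magnitude per task
        self.initialized = False

    def __call__(self, task_losses):
        losses = np.asarray(task_losses, dtype=float)
        if not self.initialized:
            self.running = np.abs(losses) + 1e-8
            self.initialized = True
        else:
            self.running = (self.momentum * self.running
                            + (1 - self.momentum) * np.abs(losses))
        # Each normalized term is ~1 regardless of its raw scale
        return float(np.sum(losses / self.running))

# Demo: wildly different raw scales still contribute comparably
norm = LossScaleNormalizer(2)
print(norm([0.1, 100.0]))  # first call: each term normalized to ~1, total ~2
```

Note that this normalizes loss values rather than gradient norms; it is a cheap proxy that works when loss magnitude tracks gradient magnitude.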
MSE on pixels encourages blurriness: when uncertain, predicting the mean minimizes squared error. For images, this means losing high-frequency details.
Compare images in feature space of a pretrained network:
$$\mathcal{L}_{perceptual} = \sum_l \lambda_l \|\phi_l(y) - \phi_l(\hat{y})\|_2^2$$
where $\phi_l$ extracts features from layer $l$ of a pretrained network (typically VGG).
Why it works: features from a network pretrained on a large dataset correlate with human perceptual judgments, so matching them preserves semantic content and texture rather than exact pixel values.
For style transfer, match feature correlations:
$$\mathcal{L}_{style} = \sum_l \|G_l(y) - G_l(\hat{y})\|_F^2$$
where $G_l = \phi_l \phi_l^T$ is the Gram matrix (captures texture statistics).
```python
import numpy as np

def perceptual_loss(target_features, pred_features, weights=None):
    """
    Perceptual loss across multiple feature layers.

    Args:
        target_features: List of feature maps from target image
        pred_features: List of feature maps from predicted image
        weights: Per-layer weights (default: uniform)
    """
    if weights is None:
        weights = [1.0] * len(target_features)
    loss = 0
    for w, t, p in zip(weights, target_features, pred_features):
        loss += w * np.mean((t - p) ** 2)
    return loss

def gram_matrix(features):
    """Compute Gram matrix for style loss."""
    # features: (C, H, W), channels first; other shapes are
    # flattened along the first axis
    if len(features.shape) == 3:
        c, h, w = features.shape
        f = features.reshape(c, -1)
    else:
        f = features.reshape(features.shape[0], -1)
    return f @ f.T / f.shape[1]

def style_loss(target_grams, pred_grams):
    """Style loss using Gram matrices."""
    loss = 0
    for t, p in zip(target_grams, pred_grams):
        loss += np.mean((t - p) ** 2)
    return loss
```

| Application | Network | Layers | Notes |
|---|---|---|---|
| Super-resolution | VGG19 | relu5_4 | High-level features |
| Style transfer | VGG19 | relu1_1 to relu5_1 | Multiple scales |
| Image synthesis | VGG16 | relu2_2, relu3_3 | Mid-level features |
| Face generation | VGGFace | Various | Face-specific features |
When you care about ordering, not absolute values:
Pairwise Ranking Loss: $$\mathcal{L} = \max(0, \Delta + s_{neg} - s_{pos})$$
Positive items should score higher than negatives by margin $\Delta$.
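A minimal numpy sketch of this pairwise hinge (the function name and margin default are illustrative):

```python
import numpy as np

def pairwise_ranking_loss(pos_scores, neg_scores, margin=1.0):
    """Hinge on score differences: positives should beat negatives by `margin`."""
    return np.mean(np.maximum(0, margin + neg_scores - pos_scores))

# Demo: well-separated pairs incur no loss; violations are penalized linearly
pos = np.array([2.0, 3.0])
neg = np.array([0.5, 2.8])
print(pairwise_ranking_loss(pos, neg))  # max(0, 1+0.5-2)=0; max(0, 1+2.8-3)=0.8; mean = 0.4
```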
ListNet/ListMLE: treat rankings as a probability distribution over permutations, minimizing cross-entropy against the target distribution (ListNet) or the negative log-likelihood of the correct ordering (ListMLE).
Learn embeddings where similar items are close:
$$\mathcal{L}_{triplet} = \max(0, \|f(a) - f(p)\|^2 - \|f(a) - f(n)\|^2 + \alpha)$$
where $a$ is anchor, $p$ is positive, $n$ is negative.
The foundation of SimCLR, CLIP, etc.:
$$\mathcal{L}_{NCE} = -\log \frac{\exp(s_{pos}/\tau)}{\sum_j \exp(s_j/\tau)}$$
Push positive pairs together, negatives apart in embedding space.
```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet loss for metric learning."""
    pos_dist = np.sum((anchor - positive) ** 2)
    neg_dist = np.sum((anchor - negative) ** 2)
    return np.maximum(0, margin + pos_dist - neg_dist)

def info_nce_loss(query, positive_key, negative_keys, temperature=0.07):
    """
    InfoNCE loss for contrastive learning.

    Args:
        query: Query embedding (d,)
        positive_key: Positive sample embedding (d,)
        negative_keys: Negative sample embeddings (N, d)
    """
    # Similarities
    pos_sim = np.dot(query, positive_key) / temperature
    neg_sims = negative_keys @ query / temperature

    # Log-softmax denominator (numerically stable log-sum-exp)
    all_sims = np.concatenate([[pos_sim], neg_sims])
    max_sim = np.max(all_sims)
    log_sum_exp = max_sim + np.log(np.sum(np.exp(all_sims - max_sim)))
    return -pos_sim + log_sum_exp

# Demo
np.random.seed(42)
query = np.random.randn(128)
positive = query + 0.1 * np.random.randn(128)  # Similar
negatives = np.random.randn(100, 128)  # Random

print(f"InfoNCE loss: {info_nce_loss(query, positive, negatives):.4f}")
```

Contrastive losses benefit greatly from hard negatives—samples that are difficult to distinguish from positives. Random negatives become too easy as training progresses. Strategies: (1) in-batch hard mining, (2) memory banks, (3) momentum encoders (MoCo).
Sometimes predictions must satisfy constraints:
Soft constraints (add penalty term): $$\mathcal{L} = \mathcal{L}_{data} + \lambda \mathcal{L}_{constraint}$$
Examples: output probabilities that must sum to one, physical quantities that must be non-negative, sequences that must be monotonic, and boundary conditions that must be matched.
Encode physical laws as loss terms:
$$\mathcal{L} = \mathcal{L}_{data} + \lambda_{physics} \mathcal{L}_{PDE}$$
where $\mathcal{L}_{PDE}$ penalizes violations of differential equations.
Example: Heat equation $\frac{\partial u}{\partial t} = \alpha \nabla^2 u$
$$\mathcal{L}_{PDE} = \left\|\frac{\partial \hat{u}}{\partial t} - \alpha \nabla^2 \hat{u}\right\|^2$$
evaluated at collocation points.
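As a concrete sketch, the heat-equation residual can be approximated with finite differences on a regular grid of predictions. A real PINN would differentiate the network with autodiff instead; the grid sizes and α default here are illustrative:

```python
import numpy as np

def heat_equation_residual(u, dt, dx, alpha=1.0):
    """
    PDE residual for du/dt = alpha * d2u/dx2, via finite differences.
    u: predicted field of shape (T, X) on a regular grid.
    Returns mean squared residual at interior grid points.
    """
    du_dt = (u[1:, 1:-1] - u[:-1, 1:-1]) / dt  # forward difference in time
    d2u_dx2 = (u[:-1, 2:] - 2 * u[:-1, 1:-1] + u[:-1, :-2]) / dx**2  # central difference in space
    residual = du_dt - alpha * d2u_dx2
    return np.mean(residual ** 2)

# Demo: u = exp(-t) * sin(x) solves the equation exactly for alpha=1,
# so its residual is tiny (limited only by discretization error)
x = np.linspace(0, np.pi, 50)
t = np.linspace(0, 0.1, 50)
u = np.exp(-t[:, None]) * np.sin(x[None, :])
print(heat_equation_residual(u, t[1] - t[0], x[1] - x[0]))  # small
```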
```python
import numpy as np

def sum_to_one_constraint(predictions):
    """Penalize if predictions don't sum to 1."""
    return (np.sum(predictions, axis=-1) - 1) ** 2

def non_negative_constraint(predictions):
    """Penalize negative predictions."""
    return np.mean(np.maximum(0, -predictions) ** 2)

def monotonicity_constraint(predictions):
    """
    Penalize non-monotonic sequences.
    Assumes predictions should be increasing.
    """
    diffs = predictions[1:] - predictions[:-1]
    return np.mean(np.maximum(0, -diffs) ** 2)

def boundary_constraint(edge_values, boundary_conditions):
    """Penalize deviation from boundary conditions."""
    return np.mean((edge_values - boundary_conditions) ** 2)

# Physics-informed example: simple ODE constraint
def ode_residual_loss(t, y_pred, dy_dt_pred, f):
    """
    For ODE: dy/dt = f(t, y)
    Penalize: dy_dt_pred - f(t, y_pred)
    """
    expected_derivative = f(t, y_pred)
    return np.mean((dy_dt_pred - expected_derivative) ** 2)
```

| Constraint Type | Loss Term | Example Application |
|---|---|---|
| Sum constraint | (Σyᵢ - target)² | Probability distributions |
| Bound constraint | max(0, y-upper)² + max(0, lower-y)² | Physical quantities |
| Symmetry | ‖f(x) - f(flip(x))‖² | Image tasks |
| Smoothness | ‖∇²y‖² | Denoising, interpolation |
| Conservation | (Σinput - Σoutput)² | Mass/energy balance |
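The soft-constraint recipe combines a data term with weighted penalties in a single objective. A minimal sketch, assuming an MSE data term plus sum-to-one and non-negativity penalties; the function name and λ defaults are illustrative and normally need tuning:

```python
import numpy as np

def constrained_loss(predictions, targets, lam_sum=1.0, lam_nonneg=1.0):
    """Data loss plus soft-constraint penalties for a predicted distribution."""
    data_loss = np.mean((predictions - targets) ** 2)                # MSE data term
    sum_penalty = np.mean((np.sum(predictions, axis=-1) - 1) ** 2)   # rows should sum to 1
    nonneg_penalty = np.mean(np.maximum(0, -predictions) ** 2)       # no negative entries
    return data_loss + lam_sum * sum_penalty + lam_nonneg * nonneg_penalty

# Demo: a valid distribution pays only the data term
pred = np.array([[0.2, 0.3, 0.5]])
target = np.array([[0.1, 0.4, 0.5]])
print(constrained_loss(pred, target))
```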
Generative Adversarial Networks use a minimax game:
$$\min_G \max_D \mathbb{E}[\log D(x)] + \mathbb{E}[\log(1 - D(G(z)))]$$
Generator loss: $-\log(D(G(z)))$ (non-saturating version)
Discriminator loss: standard binary cross-entropy on real vs. fake
Wasserstein GAN (WGAN): the critic maximizes $\mathbb{E}[D(x)] - \mathbb{E}[D(G(z))]$ with $D$ constrained to be 1-Lipschitz (via weight clipping or a gradient penalty).
Hinge GAN: $\mathcal{L}_D = \mathbb{E}[\max(0, 1 - D(x))] + \mathbb{E}[\max(0, 1 + D(G(z)))]$, with generator loss $-\mathbb{E}[D(G(z))]$.
Least Squares GAN (LSGAN): replaces cross-entropy with squared error, e.g. $\mathcal{L}_D = \tfrac{1}{2}\mathbb{E}[(D(x)-1)^2] + \tfrac{1}{2}\mathbb{E}[D(G(z))^2]$.
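LSGAN's least-squares objectives can be sketched in the same numpy style as the other GAN losses on this page (function names are illustrative):

```python
import numpy as np

def lsgan_d_loss(real_scores, fake_scores):
    """LSGAN discriminator: push real scores toward 1, fake scores toward 0."""
    return 0.5 * (np.mean((real_scores - 1) ** 2) + np.mean(fake_scores ** 2))

def lsgan_g_loss(fake_scores):
    """LSGAN generator: push fake scores toward 1."""
    return 0.5 * np.mean((fake_scores - 1) ** 2)

# Demo
real = np.array([0.9, 0.8])
fake = np.array([0.1, 0.2])
print(lsgan_d_loss(real, fake))  # 0.5 * (0.025 + 0.025) = 0.025
print(lsgan_g_loss(fake))        # 0.5 * 0.725 = 0.3625
```

The quadratic penalty keeps gradients non-zero even for confidently classified samples, which is LSGAN's main advantage over the saturating sigmoid loss.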
For conditional generation (pix2pix, image-to-image): $$\mathcal{L} = \lambda_{adv}\mathcal{L}_{GAN} + \lambda_{rec}\mathcal{L}_{L1}$$
Adversarial loss provides sharpness; reconstruction loss provides structure.
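A sketch of this combined generator objective, using a non-saturating adversarial term plus an L1 reconstruction term. The function name and weight defaults are illustrative; pix2pix famously weighted reconstruction heavily, around λ_rec = 100:

```python
import numpy as np

def combined_generator_loss(fake_scores, pred_img, target_img,
                            lam_adv=1.0, lam_rec=100.0):
    """Conditional-GAN generator objective: adversarial + L1 reconstruction."""
    adv = -np.mean(np.log(fake_scores + 1e-15))   # non-saturating GAN term (sharpness)
    rec = np.mean(np.abs(pred_img - target_img))  # L1 reconstruction term (structure)
    return lam_adv * adv + lam_rec * rec

# Demo
fake_scores = np.array([0.9, 0.8])  # discriminator outputs on generated images
pred = np.zeros((4, 4))
target = np.full((4, 4), 0.05)
print(combined_generator_loss(fake_scores, pred, target))
```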
```python
import numpy as np

def vanilla_gan_d_loss(real_scores, fake_scores, epsilon=1e-15):
    """Discriminator loss for vanilla GAN."""
    real_loss = -np.mean(np.log(real_scores + epsilon))
    fake_loss = -np.mean(np.log(1 - fake_scores + epsilon))
    return real_loss + fake_loss

def vanilla_gan_g_loss(fake_scores, epsilon=1e-15):
    """Non-saturating generator loss."""
    return -np.mean(np.log(fake_scores + epsilon))

def wasserstein_d_loss(real_scores, fake_scores):
    """WGAN discriminator (critic) loss."""
    return np.mean(fake_scores) - np.mean(real_scores)

def wasserstein_g_loss(fake_scores):
    """WGAN generator loss."""
    return -np.mean(fake_scores)

def hinge_d_loss(real_scores, fake_scores):
    """Hinge loss for discriminator."""
    real_loss = np.mean(np.maximum(0, 1 - real_scores))
    fake_loss = np.mean(np.maximum(0, 1 + fake_scores))
    return real_loss + fake_loss

def hinge_g_loss(fake_scores):
    """Hinge loss for generator."""
    return -np.mean(fake_scores)
```

1. Gradient issues: NaN, inf, or vanishing gradients
2. Loss scale mismatch: One component dominates
3. Conflicting objectives: Losses fight each other
4. Wrong minimum: Loss decreases but model fails
```python
import numpy as np

def gradient_check(loss_fn, params, epsilon=1e-5):
    """Numerical gradient check for custom loss."""
    numerical_grad = np.zeros_like(params)
    for i in range(len(params)):
        params_plus = params.copy()
        params_plus[i] += epsilon
        params_minus = params.copy()
        params_minus[i] -= epsilon
        numerical_grad[i] = (loss_fn(params_plus) - loss_fn(params_minus)) / (2 * epsilon)
    return numerical_grad

def check_gradient_health(gradients, name="grad"):
    """Check for common gradient issues."""
    issues = []
    if np.any(np.isnan(gradients)):
        issues.append(f"{name} contains NaN")
    if np.any(np.isinf(gradients)):
        issues.append(f"{name} contains Inf")
    if np.max(np.abs(gradients)) > 1000:
        issues.append(f"{name} may be exploding (max={np.max(np.abs(gradients)):.1f})")
    if np.max(np.abs(gradients)) < 1e-7:
        issues.append(f"{name} may be vanishing (max={np.max(np.abs(gradients)):.2e})")
    return issues if issues else ["OK"]
```

You've now mastered the complete landscape of loss functions: cross-entropy for classification, MSE for regression, hinge for margins, focal for imbalance, and the principles for designing custom objectives. This knowledge empowers you to match any learning problem with an appropriate objective function.