Cross-entropy, MSE, hinge, and focal loss cover most use cases—but not all. Real-world problems often have unique requirements: physical constraints that predictions must satisfy, business objectives that don't map to standard accuracy metrics, or prior knowledge about the problem structure.
Custom loss functions are the bridge between what neural networks naturally optimize and what your application truly needs. This page teaches you how to design, implement, and debug custom objectives.
By the end of this page, you will understand: (1) principles for combining multiple losses, (2) perceptual and feature-matching losses, (3) physics-informed and constraint-based losses, (4) ranking and contrastive objectives, and (5) practical debugging strategies.
Many applications require multiple objectives:
$$\mathcal{L}_{total} = \sum_{i=1}^{T} \lambda_i \mathcal{L}_i$$
Strategies for choosing the weights $\lambda_i$:
1. Fixed weights: Set $\lambda_i$ manually based on domain knowledge or preliminary experiments.
2. Uncertainty weighting (Kendall et al., 2018): $$\mathcal{L} = \sum_i \frac{1}{2\sigma_i^2}\mathcal{L}_i + \log\sigma_i$$
where $\sigma_i$ are learned. Tasks with higher uncertainty get lower weight.
3. Gradient normalization: Scale losses so gradients have similar magnitudes.
4. Dynamic weighting: Adjust weights during training based on task progress.
```python
import numpy as np

class UncertaintyWeightedLoss:
    """
    Learns task weights based on homoscedastic uncertainty.
    """
    def __init__(self, n_tasks):
        # log(sigma^2) for each task, initialized to 0
        self.log_vars = np.zeros(n_tasks)

    def __call__(self, task_losses):
        """
        Args:
            task_losses: List of loss values, one per task
        Returns:
            Combined loss (a scalar)
        """
        total = 0
        for i, loss in enumerate(task_losses):
            precision = np.exp(-self.log_vars[i])  # 1/sigma^2
            total += precision * loss + self.log_vars[i]
        return 0.5 * total

    def get_weights(self):
        """Current effective weights (inverse variance)."""
        return np.exp(-self.log_vars)

# Demo
losses = [0.5, 2.0, 0.1]  # Classification, bbox regression, mask
weighted = UncertaintyWeightedLoss(3)
print(f"Initial weights: {weighted.get_weights()}")
print(f"Combined loss: {weighted(losses):.4f}")
```

Losses with different scales can cause training instability. If $\mathcal{L}_1$ is typically 0.1 and $\mathcal{L}_2$ is typically 100, raw combination means $\mathcal{L}_2$ dominates completely. Always normalize or weight appropriately.
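One simple way to act on that warning, and on the gradient-normalization strategy listed earlier, is to rescale each loss by a running estimate of its own magnitude. This is a simplified sketch, not full GradNorm; the class name and `momentum` default are illustrative:

```python
import numpy as np

class LossScaleNormalizer:
    """Rescale each task loss by a running mean of its magnitude,
    so all tasks contribute at roughly unit scale."""

    def __init__(self, n_tasks, momentum=0.9):
        self.momentum = momentum
        self.running = np.ones(n_tasks)  # running magnitude per task
        self.initialized = False

    def __call__(self, task_losses):
        losses = np.asarray(task_losses, dtype=float)
        if not self.initialized:
            self.running = np.abs(losses) + 1e-8
            self.initialized = True
        else:
            self.running = (self.momentum * self.running
                            + (1 - self.momentum) * np.abs(losses))
        # Each normalized term is ~1 regardless of its raw scale
        return float(np.sum(losses / self.running))

# Demo: wildly different raw scales still contribute comparably
norm = LossScaleNormalizer(2)
print(norm([0.1, 100.0]))  # first call: each term normalized to ~1, total ~2
```

Note that this normalizes loss values rather than gradient norms; it is a cheap proxy that works when loss magnitude tracks gradient magnitude.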
MSE on pixels encourages blurriness: when uncertain, predicting the mean minimizes squared error. For images, this means losing high-frequency details.
Compare images in feature space of a pretrained network:
$$\mathcal{L}_{perceptual} = \sum_l \lambda_l \|\phi_l(y) - \phi_l(\hat{y})\|_2^2$$
where $\phi_l$ extracts features from layer $l$ of a pretrained network (typically VGG).
Why it works: features from a network pretrained on a large dataset correlate with human perceptual judgments, so matching them preserves semantic content and texture rather than exact pixel values.
For style transfer, match feature correlations:
$$\mathcal{L}_{style} = \sum_l \|G_l(y) - G_l(\hat{y})\|_F^2$$
where $G_l = \phi_l \phi_l^T$ is the Gram matrix (captures texture statistics).
```python
import numpy as np

def perceptual_loss(target_features, pred_features, weights=None):
    """
    Perceptual loss across multiple feature layers.

    Args:
        target_features: List of feature maps from target image
        pred_features: List of feature maps from predicted image
        weights: Per-layer weights (default: uniform)
    """
    if weights is None:
        weights = [1.0] * len(target_features)
    loss = 0
    for w, t, p in zip(weights, target_features, pred_features):
        loss += w * np.mean((t - p) ** 2)
    return loss

def gram_matrix(features):
    """Compute Gram matrix for style loss."""
    # features: (C, H, W), channels first; other shapes are
    # flattened along the first axis
    if len(features.shape) == 3:
        c, h, w = features.shape
        f = features.reshape(c, -1)
    else:
        f = features.reshape(features.shape[0], -1)
    return f @ f.T / f.shape[1]

def style_loss(target_grams, pred_grams):
    """Style loss using Gram matrices."""
    loss = 0
    for t, p in zip(target_grams, pred_grams):
        loss += np.mean((t - p) ** 2)
    return loss
```

| Application | Network | Layers | Notes |
|---|---|---|---|
| Super-resolution | VGG19 | relu5_4 | High-level features |
| Style transfer | VGG19 | relu1_1 to relu5_1 | Multiple scales |
| Image synthesis | VGG16 | relu2_2, relu3_3 | Mid-level features |
| Face generation | VGGFace | Various | Face-specific features |
When you care about ordering, not absolute values:
Pairwise Ranking Loss: $$\mathcal{L} = \max(0, \Delta + s_{neg} - s_{pos})$$
Positive items should score higher than negatives by margin $\Delta$.
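A minimal numpy sketch of this pairwise hinge (the function name and margin default are illustrative):

```python
import numpy as np

def pairwise_ranking_loss(pos_scores, neg_scores, margin=1.0):
    """Hinge on score differences: positives should beat negatives by `margin`."""
    return np.mean(np.maximum(0, margin + neg_scores - pos_scores))

# Demo: well-separated pairs incur no loss; violations are penalized linearly
pos = np.array([2.0, 3.0])
neg = np.array([0.5, 2.8])
print(pairwise_ranking_loss(pos, neg))  # max(0, 1+0.5-2)=0; max(0, 1+2.8-3)=0.8; mean = 0.4
```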
ListNet/ListMLE: treat rankings as a probability distribution over permutations, minimizing cross-entropy against the target distribution (ListNet) or the negative log-likelihood of the correct ordering (ListMLE).
Learn embeddings where similar items are close:
$$\mathcal{L}_{triplet} = \max(0, \|f(a) - f(p)\|^2 - \|f(a) - f(n)\|^2 + \alpha)$$
where $a$ is anchor, $p$ is positive, $n$ is negative.
The foundation of SimCLR, CLIP, etc.:
$$\mathcal{L}_{NCE} = -\log \frac{\exp(s_{pos}/\tau)}{\sum_j \exp(s_j/\tau)}$$
Push positive pairs together, negatives apart in embedding space.
```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet loss for metric learning."""
    pos_dist = np.sum((anchor - positive) ** 2)
    neg_dist = np.sum((anchor - negative) ** 2)
    return np.maximum(0, margin + pos_dist - neg_dist)

def info_nce_loss(query, positive_key, negative_keys, temperature=0.07):
    """
    InfoNCE loss for contrastive learning.

    Args:
        query: Query embedding (d,)
        positive_key: Positive sample embedding (d,)
        negative_keys: Negative sample embeddings (N, d)
    """
    # Similarities
    pos_sim = np.dot(query, positive_key) / temperature
    neg_sims = negative_keys @ query / temperature

    # Log-softmax denominator (numerically stable log-sum-exp)
    all_sims = np.concatenate([[pos_sim], neg_sims])
    max_sim = np.max(all_sims)
    log_sum_exp = max_sim + np.log(np.sum(np.exp(all_sims - max_sim)))
    return -pos_sim + log_sum_exp

# Demo
np.random.seed(42)
query = np.random.randn(128)
positive = query + 0.1 * np.random.randn(128)  # Similar
negatives = np.random.randn(100, 128)  # Random

print(f"InfoNCE loss: {info_nce_loss(query, positive, negatives):.4f}")
```

Contrastive losses benefit greatly from hard negatives—samples that are difficult to distinguish from positives. Random negatives become too easy as training progresses. Strategies: (1) in-batch hard mining, (2) memory banks, (3) momentum encoders (MoCo).
Sometimes predictions must satisfy constraints:
Soft constraints (add penalty term): $$\mathcal{L} = \mathcal{L}_{data} + \lambda \mathcal{L}_{constraint}$$
Examples: output probabilities that must sum to one, physical quantities that must be non-negative, sequences that must be monotonic, and boundary conditions that must be matched.
Encode physical laws as loss terms:
$$\mathcal{L} = \mathcal{L}_{data} + \lambda_{physics} \mathcal{L}_{PDE}$$
where $\mathcal{L}_{PDE}$ penalizes violations of differential equations.
Example: Heat equation $\frac{\partial u}{\partial t} = \alpha \nabla^2 u$
$$\mathcal{L}_{PDE} = \left\|\frac{\partial \hat{u}}{\partial t} - \alpha \nabla^2 \hat{u}\right\|^2$$
evaluated at collocation points.
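As a concrete sketch, the heat-equation residual can be approximated with finite differences on a regular grid of predictions. A real PINN would differentiate the network with autodiff instead; the grid sizes and α default here are illustrative:

```python
import numpy as np

def heat_equation_residual(u, dt, dx, alpha=1.0):
    """
    PDE residual for du/dt = alpha * d2u/dx2, via finite differences.
    u: predicted field of shape (T, X) on a regular grid.
    Returns mean squared residual at interior grid points.
    """
    du_dt = (u[1:, 1:-1] - u[:-1, 1:-1]) / dt  # forward difference in time
    d2u_dx2 = (u[:-1, 2:] - 2 * u[:-1, 1:-1] + u[:-1, :-2]) / dx**2  # central difference in space
    residual = du_dt - alpha * d2u_dx2
    return np.mean(residual ** 2)

# Demo: u = exp(-t) * sin(x) solves the equation exactly for alpha=1,
# so its residual is tiny (limited only by discretization error)
x = np.linspace(0, np.pi, 50)
t = np.linspace(0, 0.1, 50)
u = np.exp(-t[:, None]) * np.sin(x[None, :])
print(heat_equation_residual(u, t[1] - t[0], x[1] - x[0]))  # small
```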
```python
import numpy as np

def sum_to_one_constraint(predictions):
    """Penalize if predictions don't sum to 1."""
    return (np.sum(predictions, axis=-1) - 1) ** 2

def non_negative_constraint(predictions):
    """Penalize negative predictions."""
    return np.mean(np.maximum(0, -predictions) ** 2)

def monotonicity_constraint(predictions):
    """
    Penalize non-monotonic sequences.
    Assumes predictions should be increasing.
    """
    diffs = predictions[1:] - predictions[:-1]
    return np.mean(np.maximum(0, -diffs) ** 2)

def boundary_constraint(edge_values, boundary_conditions):
    """Penalize deviation from boundary conditions."""
    return np.mean((edge_values - boundary_conditions) ** 2)

# Physics-informed example: simple ODE constraint
def ode_residual_loss(t, y_pred, dy_dt_pred, f):
    """
    For ODE: dy/dt = f(t, y)
    Penalize: dy_dt_pred - f(t, y_pred)
    """
    expected_derivative = f(t, y_pred)
    return np.mean((dy_dt_pred - expected_derivative) ** 2)
```

| Constraint Type | Loss Term | Example Application |
|---|---|---|
| Sum constraint | (Σyᵢ - target)² | Probability distributions |
| Bound constraint | max(0, y-upper)² + max(0, lower-y)² | Physical quantities |
| Symmetry | ‖f(x) - f(flip(x))‖² | Image tasks |
| Smoothness | ‖∇²y‖² | Denoising, interpolation |
| Conservation | (Σinput - Σoutput)² | Mass/energy balance |
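The soft-constraint recipe combines a data term with weighted penalties in a single objective. A minimal sketch, assuming an MSE data term plus sum-to-one and non-negativity penalties; the function name and λ defaults are illustrative and normally need tuning:

```python
import numpy as np

def constrained_loss(predictions, targets, lam_sum=1.0, lam_nonneg=1.0):
    """Data loss plus soft-constraint penalties for a predicted distribution."""
    data_loss = np.mean((predictions - targets) ** 2)                # MSE data term
    sum_penalty = np.mean((np.sum(predictions, axis=-1) - 1) ** 2)   # rows should sum to 1
    nonneg_penalty = np.mean(np.maximum(0, -predictions) ** 2)       # no negative entries
    return data_loss + lam_sum * sum_penalty + lam_nonneg * nonneg_penalty

# Demo: a valid distribution pays only the data term
pred = np.array([[0.2, 0.3, 0.5]])
target = np.array([[0.1, 0.4, 0.5]])
print(constrained_loss(pred, target))
```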
Generative Adversarial Networks use a minimax game:
$$\min_G \max_D \mathbb{E}[\log D(x)] + \mathbb{E}[\log(1 - D(G(z)))]$$
Generator loss: $-\log(D(G(z)))$ (non-saturating version)
Discriminator loss: standard binary cross-entropy on real vs. fake
Wasserstein GAN (WGAN): the critic maximizes $\mathbb{E}[D(x)] - \mathbb{E}[D(G(z))]$ with $D$ constrained to be 1-Lipschitz (via weight clipping or a gradient penalty).
Hinge GAN: $\mathcal{L}_D = \mathbb{E}[\max(0, 1 - D(x))] + \mathbb{E}[\max(0, 1 + D(G(z)))]$, with generator loss $-\mathbb{E}[D(G(z))]$.
Least Squares GAN (LSGAN): replaces cross-entropy with squared error, e.g. $\mathcal{L}_D = \tfrac{1}{2}\mathbb{E}[(D(x)-1)^2] + \tfrac{1}{2}\mathbb{E}[D(G(z))^2]$.
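LSGAN's least-squares objectives can be sketched in the same numpy style as the other GAN losses on this page (function names are illustrative):

```python
import numpy as np

def lsgan_d_loss(real_scores, fake_scores):
    """LSGAN discriminator: push real scores toward 1, fake scores toward 0."""
    return 0.5 * (np.mean((real_scores - 1) ** 2) + np.mean(fake_scores ** 2))

def lsgan_g_loss(fake_scores):
    """LSGAN generator: push fake scores toward 1."""
    return 0.5 * np.mean((fake_scores - 1) ** 2)

# Demo
real = np.array([0.9, 0.8])
fake = np.array([0.1, 0.2])
print(lsgan_d_loss(real, fake))  # 0.5 * (0.025 + 0.025) = 0.025
print(lsgan_g_loss(fake))        # 0.5 * 0.725 = 0.3625
```

The quadratic penalty keeps gradients non-zero even for confidently classified samples, which is LSGAN's main advantage over the saturating sigmoid loss.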
For conditional generation (pix2pix, image-to-image): $$\mathcal{L} = \lambda_{adv}\mathcal{L}_{GAN} + \lambda_{rec}\mathcal{L}_{L1}$$
Adversarial loss provides sharpness; reconstruction loss provides structure.
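A sketch of this combined generator objective, using a non-saturating adversarial term plus an L1 reconstruction term. The function name and weight defaults are illustrative; pix2pix famously weighted reconstruction heavily, around λ_rec = 100:

```python
import numpy as np

def combined_generator_loss(fake_scores, pred_img, target_img,
                            lam_adv=1.0, lam_rec=100.0):
    """Conditional-GAN generator objective: adversarial + L1 reconstruction."""
    adv = -np.mean(np.log(fake_scores + 1e-15))   # non-saturating GAN term (sharpness)
    rec = np.mean(np.abs(pred_img - target_img))  # L1 reconstruction term (structure)
    return lam_adv * adv + lam_rec * rec

# Demo
fake_scores = np.array([0.9, 0.8])  # discriminator outputs on generated images
pred = np.zeros((4, 4))
target = np.full((4, 4), 0.05)
print(combined_generator_loss(fake_scores, pred, target))
```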
```python
import numpy as np

def vanilla_gan_d_loss(real_scores, fake_scores, epsilon=1e-15):
    """Discriminator loss for vanilla GAN."""
    real_loss = -np.mean(np.log(real_scores + epsilon))
    fake_loss = -np.mean(np.log(1 - fake_scores + epsilon))
    return real_loss + fake_loss

def vanilla_gan_g_loss(fake_scores, epsilon=1e-15):
    """Non-saturating generator loss."""
    return -np.mean(np.log(fake_scores + epsilon))

def wasserstein_d_loss(real_scores, fake_scores):
    """WGAN discriminator (critic) loss."""
    return np.mean(fake_scores) - np.mean(real_scores)

def wasserstein_g_loss(fake_scores):
    """WGAN generator loss."""
    return -np.mean(fake_scores)

def hinge_d_loss(real_scores, fake_scores):
    """Hinge loss for discriminator."""
    real_loss = np.mean(np.maximum(0, 1 - real_scores))
    fake_loss = np.mean(np.maximum(0, 1 + fake_scores))
    return real_loss + fake_loss

def hinge_g_loss(fake_scores):
    """Hinge loss for generator."""
    return -np.mean(fake_scores)
```

1. Gradient issues: NaN, inf, or vanishing gradients
2. Loss scale mismatch: One component dominates
3. Conflicting objectives: Losses fight each other
4. Wrong minimum: Loss decreases but model fails
```python
import numpy as np

def gradient_check(loss_fn, params, epsilon=1e-5):
    """Numerical gradient check for custom loss."""
    numerical_grad = np.zeros_like(params)
    for i in range(len(params)):
        params_plus = params.copy()
        params_plus[i] += epsilon
        params_minus = params.copy()
        params_minus[i] -= epsilon
        numerical_grad[i] = (loss_fn(params_plus) - loss_fn(params_minus)) / (2 * epsilon)
    return numerical_grad

def check_gradient_health(gradients, name="grad"):
    """Check for common gradient issues."""
    issues = []
    if np.any(np.isnan(gradients)):
        issues.append(f"{name} contains NaN")
    if np.any(np.isinf(gradients)):
        issues.append(f"{name} contains Inf")
    if np.max(np.abs(gradients)) > 1000:
        issues.append(f"{name} may be exploding (max={np.max(np.abs(gradients)):.1f})")
    if np.max(np.abs(gradients)) < 1e-7:
        issues.append(f"{name} may be vanishing (max={np.max(np.abs(gradients)):.2e})")
    return issues if issues else ["OK"]
```

You've now mastered the complete landscape of loss functions: cross-entropy for classification, MSE for regression, hinge for margins, focal for imbalance, and the principles for designing custom objectives. This knowledge empowers you to match any learning problem with an appropriate objective function.