Throughout this module, we've developed the core concepts of information theory: entropy, cross-entropy, KL divergence, and mutual information. Now we step back to see the remarkable unity these concepts bring to machine learning.
Information theory provides a principled framework for understanding loss functions, generative modeling objectives, learned representations, and the limits of generalization.
This page synthesizes these connections, showing how information theory isn't just a collection of tools—it's a complete theoretical framework for understanding machine learning. We'll see how Shannon's insights from 1948 directly inform the design of modern deep learning systems in 2024.
By the end of this page, you will see how information theory unifies loss functions across ML, understand how generative models exploit information-theoretic objectives, appreciate the information bottleneck perspective on deep learning, and recognize information-theoretic bounds on what's learnable.
Every standard loss function in ML has an information-theoretic interpretation. This perspective explains why certain losses are natural choices and when alternatives might be appropriate.
| Loss Function | IT Interpretation | When to Use |
|---|---|---|
| Cross-Entropy | H(P, Q) = H(P) + D_KL(P||Q); measures coding cost | Classification; maximizing likelihood |
| KL Divergence | D_KL(P||Q); extra bits for using wrong distribution | VAE regularization; distribution matching |
| MSE (Gaussian) | -log p(y|x) for Gaussian p(y|x) = N(f(x), σ²) | Regression with Gaussian noise assumption |
| MAE (Laplace) | -log p(y|x) for Laplacian p(y|x) | Robust regression; heavy-tailed noise |
| InfoNCE | Lower bound on I(X; Y) | Contrastive learning; representation learning |
| ELBO | E_q[log p(x, z)] + H(q) = log p(x) − D_KL(q(z|x) || p(z|x)) | Variational inference; VAEs |
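The decomposition in the first row is easy to verify numerically. Below is a minimal sketch (using a made-up pair of discrete distributions) confirming that H(P, Q) = H(P) + D_KL(P || Q), which is why minimizing cross-entropy against fixed data is the same optimization problem as minimizing KL divergence to the data distribution.

```python
import numpy as np
from scipy.special import rel_entr

# Two made-up discrete distributions over 4 classes
p = np.array([0.50, 0.25, 0.15, 0.10])   # "data" distribution P
q = np.array([0.40, 0.30, 0.20, 0.10])   # "model" distribution Q

entropy_p = -np.sum(p * np.log(p))       # H(P)
cross_entropy = -np.sum(p * np.log(q))   # H(P, Q)
kl_pq = np.sum(rel_entr(p, q))           # D_KL(P || Q)

print(f"H(P)         = {entropy_p:.4f} nats")
print(f"H(P, Q)      = {cross_entropy:.4f} nats")
print(f"H(P) + D_KL  = {entropy_p + kl_pq:.4f} nats")  # matches H(P, Q)
```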
```python
import numpy as np
import torch
import torch.nn.functional as F

# Core insight: Loss = -log p(y|x) for some probabilistic model

# 1. Cross-Entropy = -log p(y|x) for a Categorical distribution
def cross_entropy_loss(logits, targets):
    """
    Cross-entropy for classification.
    Assumes p(y|x) = Categorical(softmax(f(x)))
    Loss = -log p(y=target|x)
    """
    log_probs = F.log_softmax(logits, dim=-1)
    return -log_probs.gather(1, targets.unsqueeze(1)).mean()

# 2. MSE = -log p(y|x) for a Gaussian distribution (up to a constant)
def mse_as_nll(predictions, targets, sigma=1.0):
    """
    MSE for regression.
    Assumes p(y|x) = N(f(x), σ²)
    -log p(y|x) = (y - f(x))² / (2σ²) + log(σ√(2π))
    MSE ignores the constant terms.
    """
    mse = ((predictions - targets) ** 2).mean()
    # Full NLL (for the proper probabilistic interpretation)
    nll = mse / (2 * sigma**2) + np.log(sigma * np.sqrt(2 * np.pi))
    return mse, nll

# 3. MAE = -log p(y|x) for a Laplace distribution (up to a constant)
def mae_as_nll(predictions, targets, b=1.0):
    """
    MAE (L1 loss) for regression.
    Assumes p(y|x) = Laplace(f(x), b)
    -log p(y|x) = |y - f(x)| / b + log(2b)
    """
    mae = torch.abs(predictions - targets).mean()
    nll = mae / b + np.log(2 * b)
    return mae, nll

# Example
print("Loss Functions as Negative Log-Likelihood")
print("=" * 60)

# Generate data
torch.manual_seed(42)
preds = torch.tensor([0.5, 1.2, 0.8, 1.5])
targets = torch.tensor([0.6, 1.0, 1.0, 1.3])

mse, mse_nll = mse_as_nll(preds, targets)
mae, mae_nll = mae_as_nll(preds, targets)

print(f"MSE: {mse.item():.4f}")
print(f"MSE as Gaussian NLL (σ=1): {mse_nll.item():.4f}")
print()
print(f"MAE: {mae.item():.4f}")
print(f"MAE as Laplace NLL (b=1): {mae_nll.item():.4f}")
print()
print("Key insight: Choosing a loss = choosing a noise distribution")
print("MSE → Gaussian noise | MAE → Laplacian noise | Huber → hybrid")
```

All these losses are negative log-likelihoods under specific distributional assumptions. Training with any loss implicitly assumes a particular noise distribution. This explains why:

- MSE is sensitive to outliers (the Gaussian assigns very low probability to large residuals, so they are penalized heavily)
- MAE is robust (the Laplace distribution has heavier tails)
- Cross-entropy works for classification (categorical likelihood)
The major generative modeling frameworks—VAEs, GANs, flow-based models, and diffusion models—all have deep information-theoretic foundations.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE_InformationTheoretic(nn.Module):
    """
    VAE viewed through an information-theoretic lens.

    Objective: reconstruct X well (which requires a high I(X; Z))
    while keeping q(z|x) close to the prior p(z).

    ELBO = E_q(z|x)[log p(x|z)] - D_KL(q(z|x) || p(z))
         = Reconstruction - Rate

    Rate-distortion interpretation:
    - Reconstruction = How well can we decode X from Z?
    - KL term = How many bits does Z use beyond the prior?
    """
    def __init__(self, input_dim, latent_dim):
        super().__init__()
        self.latent_dim = latent_dim

        # Encoder: q(z|x)
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU()
        )
        self.fc_mu = nn.Linear(128, latent_dim)
        self.fc_logvar = nn.Linear(128, latent_dim)

        # Decoder: p(x|z)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid()
        )

    def encode(self, x):
        h = self.encoder(x)
        return self.fc_mu(h), self.fc_logvar(h)

    def reparameterize(self, mu, logvar):
        """Reparameterization trick: z = μ + σ * ε, ε ~ N(0, 1)"""
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def decode(self, z):
        return self.decoder(z)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        x_recon = self.decode(z)
        return x_recon, mu, logvar

    def loss_function(self, x, x_recon, mu, logvar, beta=1.0):
        """
        ELBO loss with β-VAE weighting.

        Returns components for analysis:
        - Reconstruction: -E_q[log p(x|z)] ≈ BCE(x, x_recon)
        - Rate: D_KL(q(z|x) || p(z)) = D_KL(N(μ, σ²) || N(0, 1))
        """
        # Reconstruction loss (negative log-likelihood)
        # For a Bernoulli decoder: BCE
        recon = F.binary_cross_entropy(x_recon, x, reduction='sum')

        # KL divergence: D_KL(N(μ, σ²) || N(0, 1))
        # = 0.5 * Σ(μ² + σ² - 1 - log(σ²))
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

        # Total loss (negative ELBO)
        total = recon + beta * kl
        return total, recon, kl

# Information quantities
print("VAE Information-Theoretic Analysis")
print("=" * 60)
print()
print("ELBO = E_q[log p(x|z)] - D_KL(q(z|x) || p(z))")
print()
print("Decomposition:")
print("1. E_q[log p(x|z)]: Reconstruction quality")
print("   - How much information about X is preserved in Z?")
print("   - Maximizing this → high I(X; Z)")
print()
print("2. D_KL(q(z|x) || p(z)): Rate / Compression")
print("   - How many 'extra bits' does z use?")
print("   - Minimizing this → low I(X; Z)")
print()
print("β-VAE: Total = Reconstruction + β × KL")
print("   β > 1: More compression, better disentanglement")
print("   β < 1: Better reconstruction, less regularization")
```

GANs and Divergence Estimation:
The GAN training objective can be understood as estimating and minimizing a divergence between distributions: with an optimal discriminator, the original minimax objective reduces to a scaled and shifted Jensen-Shannon divergence between p_data and p_gen.
The discriminator D serves as a density ratio estimator: D(x) ≈ p_data(x) / (p_data(x) + p_gen(x)), enabling divergence computation without explicit density estimation.
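Here is a minimal sketch of that idea under a simplifying assumption: instead of training a discriminator, we use two known 1-D Gaussians as stand-ins for p_data and p_gen, plug the Bayes-optimal discriminator D*(x) = p_data(x) / (p_data(x) + p_gen(x)) into the GAN value function, and recover an estimate of the Jensen-Shannon divergence by Monte Carlo.

```python
import numpy as np
from scipy.stats import norm

# Known densities standing in for p_data and p_gen (illustration only)
p_data = norm(loc=0.0, scale=1.0)
p_gen = norm(loc=1.5, scale=1.0)

rng = np.random.default_rng(0)
x_data = rng.normal(0.0, 1.0, size=100_000)   # samples ~ p_data
x_gen = rng.normal(1.5, 1.0, size=100_000)    # samples ~ p_gen

def d_star(x):
    """Bayes-optimal discriminator: p_data(x) / (p_data(x) + p_gen(x))."""
    a, b = p_data.pdf(x), p_gen.pdf(x)
    return a / (a + b)

# GAN value function at the optimal discriminator:
#   V(D*) = E_data[log D*(x)] + E_gen[log(1 - D*(x))] = 2 * JSD(p_data, p_gen) - log 4
v_star = np.mean(np.log(d_star(x_data))) + np.mean(np.log(1.0 - d_star(x_gen)))
jsd_estimate = 0.5 * (v_star + np.log(4.0))

print(f"V(D*) ≈ {v_star:.4f}")
print(f"Estimated JSD(p_data, p_gen) ≈ {jsd_estimate:.4f} nats")
```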
GAN mode collapse can be understood information-theoretically: the generator effectively minimizes a divergence such as D_KL(P_gen || P_data) (reverse KL), which is mode-seeking. This rewards fitting a few modes of the data well while ignoring the others. Alternative divergences (forward KL, Wasserstein) have different mode-covering properties.
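The mode-seeking versus mode-covering distinction is easy to see on a toy problem. The sketch below (entirely synthetic) fits a single Gaussian Q to a bimodal mixture P by grid search, once minimizing forward KL D_KL(P || Q) and once minimizing reverse KL D_KL(Q || P): the forward-KL fit spreads over both modes, while the reverse-KL fit locks onto one.

```python
import numpy as np
from scipy.stats import norm
from scipy.special import rel_entr

# Discretize a bimodal target P (mixture of two Gaussians) on a grid
x = np.linspace(-8, 8, 2001)
p = 0.5 * norm.pdf(x, -3, 0.7) + 0.5 * norm.pdf(x, 3, 0.7)
p /= p.sum()

def kl(a, b):
    """Discrete D_KL(a || b)."""
    return np.sum(rel_entr(a, b))

best_fwd, best_rev = None, None
for mu in np.linspace(-5, 5, 101):
    for sigma in np.linspace(0.3, 5, 48):
        q = norm.pdf(x, mu, sigma)
        q /= q.sum()
        fwd, rev = kl(p, q), kl(q, p)   # forward D_KL(P||Q), reverse D_KL(Q||P)
        if best_fwd is None or fwd < best_fwd[0]:
            best_fwd = (fwd, mu, sigma)
        if best_rev is None or rev < best_rev[0]:
            best_rev = (rev, mu, sigma)

print(f"Forward KL minimizer: mu={best_fwd[1]:+.2f}, sigma={best_fwd[2]:.2f}  (covers both modes)")
print(f"Reverse KL minimizer: mu={best_rev[1]:+.2f}, sigma={best_rev[2]:.2f}  (locks onto one mode)")
```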
The Information Bottleneck (IB) principle, introduced by Tishby et al. in 1999, provides a profound framework for understanding learning as optimal compression.
The IB Objective:
Given input X and target Y, find a representation T that solves:
minimize: I(X; T) − β · I(T; Y)
This trades off compression (reducing I(X; T), discarding input detail) against prediction (increasing I(T; Y), retaining what is relevant to the target).
The parameter β controls the tradeoff. As β varies from 0 to ∞, we trace an information curve from maximal compression to maximal prediction.
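Before the conceptual layer-by-layer picture below, here is a minimal self-contained sketch of the quantities themselves: for a small, made-up joint distribution p(x, y) and a hand-specified stochastic encoder p(t|x), both I(X; T) and I(T; Y) can be computed exactly, which makes the compression-prediction tradeoff concrete.

```python
import numpy as np

def mutual_information(p_joint):
    """Exact MI (in nats) from a joint probability table p(a, b)."""
    pa = p_joint.sum(axis=1, keepdims=True)
    pb = p_joint.sum(axis=0, keepdims=True)
    mask = p_joint > 0
    return np.sum(p_joint[mask] * np.log(p_joint[mask] / (pa @ pb)[mask]))

# Made-up joint p(x, y): 4 inputs, 2 labels. x∈{0,1} mostly maps to y=0, x∈{2,3} to y=1.
p_xy = np.array([
    [0.22, 0.03],
    [0.20, 0.05],
    [0.04, 0.21],
    [0.02, 0.23],
])
p_x = p_xy.sum(axis=1)

beta = 2.0
print(f"{'noise':>6} {'I(X;T)':>8} {'I(T;Y)':>8} {'IB objective':>14}")
for noise in [0.0, 0.1, 0.3, 0.5]:
    # Encoder p(t|x): t=0 for x in {0,1}, t=1 for x in {2,3}, flipped with prob `noise`
    p_t_given_x = np.array([
        [1 - noise, noise],
        [1 - noise, noise],
        [noise, 1 - noise],
        [noise, 1 - noise],
    ])
    p_xt = p_x[:, None] * p_t_given_x       # joint p(x, t)
    p_ty = p_t_given_x.T @ p_xy             # joint p(t, y), valid since T ⊥ Y | X
    I_xt, I_ty = mutual_information(p_xt), mutual_information(p_ty)
    print(f"{noise:>6.1f} {I_xt:>8.3f} {I_ty:>8.3f} {I_xt - beta * I_ty:>14.3f}")
```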
```python
# Information Bottleneck Framework
# ================================

# The IB objective:
#   min I(X; T) - β × I(T; Y)
#
# Equivalently:
#   max I(T; Y) - β⁻¹ × I(X; T)

# Key insights:
# 1. The optimal T discards information in X not relevant to Y
# 2. β controls the compression-prediction tradeoff
# 3. Neural network layers approximately solve IB

# The IB Lagrangian: L = I(X; T) - β × I(T; Y)

# With the Markov chain Y → X → T (T is computed from X alone):
#   I(T; Y) ≤ I(X; Y)   (data processing inequality)
#   I(T; Y) ≤ I(X; T)   (T must retain enough about X to predict Y)

# In neural networks:
# - Early layers: High I(X; T), moderate I(T; Y)
# - Later layers: Lower I(X; T), higher I(T; Y)
# - This is "compression" as information flows through the network

def plot_information_plane():
    """
    Conceptual illustration of the information plane.
    Each point (I(X;T), I(T;Y)) represents a layer's information content.
    (The values below are illustrative, not measured.)
    """
    # Simulated layer-wise information (conceptual)
    layers = ['Input', 'Hidden1', 'Hidden2', 'Hidden3', 'Output']

    # During training, networks compress (reduce I(X;T)) while maintaining I(T;Y)
    I_XT_init = [5.0, 4.8, 4.5, 4.0, 3.5]   # Initially, preserve most input info
    I_TY_init = [0.5, 0.8, 1.0, 1.2, 1.5]   # Gradually improve prediction
    I_XT_final = [5.0, 3.5, 2.5, 2.0, 1.8]  # Compress unnecessary info
    I_TY_final = [0.5, 1.2, 1.5, 1.8, 2.0]  # Improve prediction

    print("Information Plane Analysis")
    print("=" * 60)
    print(f"{'Layer':<15} {'I(X;T) init':<15} {'I(X;T) final':<15} {'Compression':<15}")
    print("-" * 60)

    for i, layer in enumerate(layers):
        comp = I_XT_init[i] - I_XT_final[i]
        print(f"{layer:<15} {I_XT_init[i]:<15.2f} {I_XT_final[i]:<15.2f} {comp:+.2f}")

    print()
    print("Key observation: Networks compress I(X;T) while improving I(T;Y)")
    print("This is the 'information bottleneck' at work!")

plot_information_plane()

# Variational Information Bottleneck (VIB)
print()
print("Variational Information Bottleneck (VIB)")
print("-" * 60)
print("Tractable approximation to IB using variational bounds:")
print()
print("max I(T; Y) - β × I(X; T)")
print("≈ max E[log p(Y|T)] - β × D_KL(p(T|X) || r(T))")
print()
print("Where:")
print("  p(T|X): Encoder (stochastic neural network)")
print("  r(T):   Variational prior (e.g., N(0, I))")
print("  p(Y|T): Predictor")
print()
print("This is a stochastic neural network with KL regularization!")
```

Deep Learning as Implicit IB:
Tishby and colleagues proposed that deep neural networks implicitly solve the information bottleneck: training first fits the data (increasing I(T; Y)) and then compresses (decreasing I(X; T)), with successive layers sitting at increasingly compressed points on the information plane.
This compression hypothesis is debated but influential. What is clear is that the information plane gives a concrete vocabulary for describing what layers retain and discard; the main points of contention are summarized below.
The compression-phase hypothesis is controversial. Some argue that:
- compression depends on the activation function (tanh shows it, ReLU may not)
- discrete MI estimates are unreliable for continuous networks
- generalization may not require compression
Regardless, IB provides a useful conceptual framework for thinking about representations.
Modern representation learning—especially self-supervised learning—is deeply grounded in information theory. The key objectives can all be expressed in terms of mutual information:
| Method | IT Objective | Intuition |
|---|---|---|
| Contrastive (SimCLR) | max I(view₁; view₂) | Augmented views of same image should share info |
| Masked Prediction (BERT) | max I(context; masked) | Context should predict masked tokens |
| InfoMax (DIM) | max I(x; f(x)) | Representation should preserve input info |
| VIB | max I(z;y) - β·I(x;z) | Task-relevant, input-compressed representations |
| Predictive Coding (CPC) | max I(c_t; x_{t+k}) | A summary of the past should predict future observations |
| InfoGAN | max I(c; G(z,c)) | Latent codes should be recoverable from generated samples |
```python
import torch
import torch.nn.functional as F

class ContrastiveLearner:
    """
    Contrastive learning maximizes I(view_1; view_2).

    The InfoNCE bound:
        I(X; Y) >= log(K) - L_NCE
    where K is the number of negatives + 1 and L_NCE is the contrastive loss.
    """
    def __init__(self, temperature=0.07):
        self.temperature = temperature

    def info_nce_loss(self, z_i, z_j):
        """
        Compute the InfoNCE loss, using the other samples
        in the batch as negatives.

        Args:
            z_i: First view representations (batch_size, dim)
            z_j: Second view representations (batch_size, dim)
        """
        batch_size = z_i.size(0)

        # Normalize embeddings
        z_i = F.normalize(z_i, dim=1)
        z_j = F.normalize(z_j, dim=1)

        # Compute similarity matrix
        representations = torch.cat([z_i, z_j], dim=0)  # (2*batch, dim)
        similarity = representations @ representations.T / self.temperature

        # Mask self-similarity (diagonal)
        mask = torch.eye(2 * batch_size, device=z_i.device).bool()
        similarity = similarity.masked_fill(mask, float('-inf'))

        # For each z_i, the positive is the corresponding z_j (and vice versa)
        labels = torch.cat([
            torch.arange(batch_size, 2 * batch_size),
            torch.arange(batch_size)
        ], dim=0).to(z_i.device)

        # Cross-entropy loss
        loss = F.cross_entropy(similarity, labels)
        return loss

    def estimate_mi_bound(self, loss, num_negatives):
        """
        Estimate the MI lower bound from the InfoNCE loss:
        I(X; Y) >= log(K) - L_NCE
        """
        K = num_negatives + 1  # +1 for the positive
        return torch.log(torch.tensor(float(K))) - loss

# Example
learner = ContrastiveLearner(temperature=0.07)

# Simulated representations from two views
batch_size = 256
dim = 128

z1 = torch.randn(batch_size, dim)
z2 = z1 + torch.randn(batch_size, dim) * 0.3  # Views are similar

loss = learner.info_nce_loss(z1, z2)
mi_bound = learner.estimate_mi_bound(loss, num_negatives=2 * batch_size - 2)

print("Contrastive Learning Analysis")
print("=" * 60)
print(f"InfoNCE Loss: {loss.item():.4f}")
print(f"Number of negatives: {2 * batch_size - 2}")
print(f"MI Lower Bound: {mi_bound.item():.4f} nats")
print()
print("Key insight: Lower loss → higher MI bound → better representations")
```

Maximizing I(view₁; view₂):
- forces the representation to capture shared semantics (the "content")
- discards view-specific noise (augmentation artifacts)
- yields representations invariant to irrelevant transformations
- tends to produce hierarchical features when combined with sufficient data diversity
Information theory provides theoretical guarantees about when and how well machine learning algorithms can generalize. These bounds complement classical VC-dimension and Rademacher complexity bounds with an information-theoretic perspective.
The Key Insight:
If a learning algorithm extracts less information from the training data, it's less likely to overfit. This leads to bounds of the form:
Generalization Gap ≤ O(√(I(Algorithm Output; Training Data) / n))
The less the algorithm "remembers" about specific training points, the better it generalizes.
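To get a feel for the numbers, the sketch below evaluates the simplified bound √(2 · I(W; S) / n) for a few hypothetical values of I(W; S) and n; the values are illustrative, not measured from any real training run.

```python
import numpy as np

def it_generalization_bound(mi_ws_bits, n):
    """Simplified bound: E[|L_test - L_train|] <= sqrt(2 * I(W;S) / n), with I(W;S) in nats."""
    mi_nats = mi_ws_bits * np.log(2)   # convert bits -> nats
    return np.sqrt(2 * mi_nats / n)

print(f"{'I(W;S) [bits]':>14} {'n':>10} {'bound on gap':>14}")
for mi_bits in [10, 1_000, 100_000]:
    for n in [1_000, 100_000]:
        print(f"{mi_bits:>14,} {n:>10,} {it_generalization_bound(mi_bits, n):>14.3f}")

# Extracting few bits per training example keeps the bound small;
# memorizing many bits relative to n makes it vacuous.
```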
```python
# Information-Theoretic Generalization Bounds
# ===========================================

# Classic bound (simplified form):
#   E[|L_test - L_train|] ≤ √(2 * I(W; S) / n)
#
# Where:
# - W: Learned parameters (output of the algorithm)
# - S: Training set
# - n: Number of training samples
# - I(W; S): Mutual information between weights and training data

# Key implications:
# 1. Less information extracted → better generalization
# 2. SGD noise limits I(W; S), aiding generalization
# 3. Regularization reduces I(W; S)
# 4. Data augmentation increases effective n

# Practical insights:
print("Information-Theoretic View of Regularization")
print("=" * 60)
print()

regularizers = {
    "Weight Decay (L2)": "Constrains weights to a low-norm region, limiting information capacity",
    "Dropout": "Prevents co-adaptation, reducing I(W; S) by adding noise",
    "Batch Normalization": "Stabilizes representations, limiting individual sample influence",
    "Label Smoothing": "Reduces I(Y_pred; Y_true|X), preventing overfitting to hard labels",
    "Data Augmentation": "Increases effective n without changing I(W; S)",
    "Early Stopping": "Stops before full training data memorization, limiting I(W; S)",
}

for reg, explanation in regularizers.items():
    print(f"{reg}:")
    print(f"  {explanation}")
    print()

# PAC-Bayes bounds (information-theoretic flavor)
print("PAC-Bayes Bounds")
print("-" * 60)
print()
print("Generalization Error ≤ Train Error + √(D_KL(posterior || prior) / n)")
print()
print("This connects:")
print("- KL divergence between the learned distribution and the prior")
print("- Sample size n")
print("- Generalization gap")
print()
print("Insight: Learning 'close to the prior' generalizes better")
```

Compression Implies Generalization:
A profound result connects compression to generalization: if the learned hypothesis can be described in k bits, its generalization gap scales roughly as √(k / n) (an Occam's-razor bound). The fewer bits needed to write the model down, the stronger the guarantee.
This partially explains why networks that can be heavily pruned or quantized without losing accuracy tend to generalize well, and why flat minima, which admit short descriptions, are associated with better generalization.
The description length is an information-theoretic measure of complexity.
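A minimal sketch of the classic Occam bound behind this statement: a hypothesis that can be written down in k bits has, with probability at least 1 − δ, a generalization gap of roughly √((k · ln 2 + ln(1/δ)) / (2n)) under a 0/1 loss. The description lengths below are made-up examples.

```python
import numpy as np

def occam_bound(k_bits, n, delta=0.05):
    """Occam's-razor bound for a hypothesis describable in k_bits bits (0/1 loss)."""
    return np.sqrt((k_bits * np.log(2) + np.log(1 / delta)) / (2 * n))

n = 50_000  # training examples
print(f"{'description length [bits]':>26} {'bound on gap':>14}")
for k_bits in [1_000, 100_000, 10_000_000]:
    print(f"{k_bits:>26,} {occam_bound(k_bits, n):>14.3f}")

# A model that compresses to few bits gets a much tighter guarantee
# than one whose description is as large as its raw parameter count.
```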
The "double descent" phenomenon—where test error drops again after the interpolation threshold—may have an information-theoretic explanation. Over-parameterized networks find minimum-norm solutions, which have low "effective" information content despite high parameter counts. The implicit bias of gradient descent toward simple solutions limits I(W; S) even when parameters exceed samples.
Let's connect information theory to practical ML engineering with concrete examples and tools:
```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression
from sklearn.metrics import log_loss
from scipy.stats import entropy
from scipy.special import rel_entr

class InformationTheoreticML:
    """
    Toolkit for IT-informed machine learning practice.
    """

    @staticmethod
    def feature_importance_mi(X, y, task='classification'):
        """
        Compute MI-based feature importance.
        Superior to correlation for nonlinear relationships.
        """
        if task == 'classification':
            mi_scores = mutual_info_classif(X, y, random_state=42)
        else:
            mi_scores = mutual_info_regression(X, y, random_state=42)
        return mi_scores

    @staticmethod
    def prediction_entropy(probs):
        """
        Compute the entropy of predicted probabilities.
        High entropy = uncertain model.
        """
        return entropy(probs, axis=1) / np.log(probs.shape[1])  # Normalized

    @staticmethod
    def calibration_analysis(y_true, y_pred_proba):
        """
        Information-theoretic calibration analysis.
        Well-calibrated: predicted entropy ≈ actual uncertainty
        """
        # Cross-entropy (nats needed using the model's predictions)
        ce = log_loss(y_true, y_pred_proba)

        # True entropy (if we knew perfect probabilities),
        # approximated from empirical class frequencies
        class_freq = np.bincount(y_true) / len(y_true)
        true_entropy = entropy(class_freq)

        return {
            'cross_entropy': ce,
            'true_entropy': true_entropy,
            'calibration_gap': ce - true_entropy,  # Ideally small
        }

    @staticmethod
    def kl_divergence(p, q):
        """Compute D_KL(P || Q) for discrete distributions."""
        return np.sum(rel_entr(p, q))

    @staticmethod
    def vae_kl_monitor(mu, log_var):
        """
        Monitor the KL term during VAE training.
        Useful for detecting posterior collapse.
        """
        kl = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var), axis=1)
        kl_per_dim = np.mean(-0.5 * (1 + log_var - mu**2 - np.exp(log_var)), axis=0)
        return {
            'mean_kl': np.mean(kl),
            'std_kl': np.std(kl),
            'kl_per_dim': kl_per_dim,
            'collapsed_dims': np.sum(kl_per_dim < 0.01),
        }

# Example usage
toolkit = InformationTheoreticML()

# Feature importance
np.random.seed(42)
X = np.random.randn(1000, 10)
y = (X[:, 0]**2 + X[:, 1] > 0.5).astype(int)  # Nonlinear in X0, linear in X1

mi_scores = toolkit.feature_importance_mi(X, y)
print("MI-based Feature Importance (nonlinear data)")
print("=" * 50)
for i, score in enumerate(mi_scores):
    importance = "HIGH" if i < 2 else "low"
    print(f"Feature {i}: MI = {score:.4f} ({importance})")

print()

# Prediction uncertainty
probs = np.array([
    [0.10, 0.10, 0.80],   # Confident
    [0.33, 0.33, 0.34],   # Uncertain
    [0.05, 0.05, 0.90],   # Very confident
])
entropies = toolkit.prediction_entropy(probs)
print("Prediction Entropy Analysis")
print("-" * 50)
for i, ent in enumerate(entropies):
    print(f"Sample {i+1}: Entropy = {ent:.4f} (normalized)")
```

Apply information-theoretic thinking when:
- choosing between loss functions (find the right likelihood)
- selecting features for complex models (MI beats correlation)
- debugging generative models (monitor KL and reconstruction separately)
- understanding model uncertainty (entropy of predictions)
- designing self-supervised objectives (maximize information preservation)
Information theory continues to drive ML research. Here are active frontiers:
The Scaling Laws Connection:
Recent work suggests that neural scaling laws (loss ∝ compute^(−α)) may have information-theoretic explanations: the irreducible part of the loss corresponds to the entropy of the data itself, while the power-law decay reflects the diminishing rate at which additional data and parameters extract the remaining information.
This connects compute budgets directly to information extraction rates.
What LLMs Know (In Bits):
An intriguing research direction: quantify LLM knowledge in bits. If a model achieves perplexity P on a test set, it is spending log₂ P bits per token on average, and the reduction relative to a simple baseline is the number of bits per token the model has effectively learned.
This gives a rough measure of how much the model has learned about language.
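A minimal sketch of that conversion, with made-up perplexity numbers: a model with perplexity P spends log₂ P bits per token on average, so comparing against a weaker baseline yields a rough "bits saved per token" figure.

```python
import numpy as np

def bits_per_token(perplexity):
    """Cross-entropy in bits/token implied by a given perplexity."""
    return np.log2(perplexity)

baseline_ppl = 600.0   # e.g., a weak baseline model (illustrative number)
model_ppl = 12.0       # e.g., a strong language model (illustrative number)

print(f"Baseline: {bits_per_token(baseline_ppl):.2f} bits/token")
print(f"Model:    {bits_per_token(model_ppl):.2f} bits/token")
print(f"Saved:    {bits_per_token(baseline_ppl) - bits_per_token(model_ppl):.2f} bits/token")

# Multiplying the per-token saving by a corpus size gives a (very rough)
# estimate, in bits, of how much the model has learned about the text.
```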
Claude Shannon's 1948 paper remains foundational 75+ years later. The concepts of entropy, channel capacity, and optimal coding directly inform how we:
- design loss functions (cross-entropy)
- train generative models (ELBO, divergences)
- understand representations (mutual information)
- bound generalization (IT generalization bounds)
Information theory provides the mathematical language for reasoning about learning itself.
We've completed our journey through information theory and its applications in machine learning. Let's consolidate the unified view:
The Big Picture:
Machine learning is, fundamentally, about extracting and using information: compressing what the data says into parameters and representations, keeping the parts that are relevant to the task, and discarding the rest.
Information theory provides the mathematical framework for reasoning precisely about these goals. Whether you're debugging a loss function, designing a representation learning objective, or understanding why regularization helps, information-theoretic thinking offers clarity and rigor.
Congratulations! You've mastered the fundamentals of information theory for machine learning. You understand entropy as the measure of uncertainty, cross-entropy as the natural classification loss, KL divergence as the measure of distributional difference, mutual information as the measure of relevance, and how these concepts unify our understanding of learning itself. This foundation will serve you throughout your ML journey.