Throughout this module, we've developed the core concepts of information theory: entropy, cross-entropy, KL divergence, and mutual information. Now we step back to see the remarkable unity these concepts bring to machine learning.
Information theory provides a principled framework for understanding loss functions, generative modeling objectives, learned representations, and the limits of generalization.
This page synthesizes these connections, showing how information theory isn't just a collection of tools—it's a complete theoretical framework for understanding machine learning. We'll see how Shannon's insights from 1948 directly inform the design of modern deep learning systems in 2024.
By the end of this page, you will see how information theory unifies loss functions across ML, understand how generative models exploit information-theoretic objectives, appreciate the information bottleneck perspective on deep learning, and recognize information-theoretic bounds on what's learnable.
Every standard loss function in ML has an information-theoretic interpretation. This perspective explains why certain losses are natural choices and when alternatives might be appropriate.
| Loss Function | IT Interpretation | When to Use |
|---|---|---|
| Cross-Entropy | H(P, Q) = H(P) + D_KL(P||Q); measures coding cost | Classification; maximizing likelihood |
| KL Divergence | D_KL(P||Q); extra bits for using wrong distribution | VAE regularization; distribution matching |
| MSE (Gaussian) | -log p(y|x) for Gaussian p(y|x) = N(f(x), σ²) | Regression with Gaussian noise assumption |
| MAE (Laplace) | -log p(y|x) for Laplacian p(y|x) | Robust regression; heavy-tailed noise |
| InfoNCE | Lower bound on I(X; Y) | Contrastive learning; representation learning |
| ELBO | E_q[log p(x, z)] + H(q) = log p(x) − D_KL(q(z|x) || p(z|x)) | Variational inference; VAEs |
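The decomposition in the first row is easy to verify numerically. Below is a minimal sketch (using a made-up pair of discrete distributions) confirming that H(P, Q) = H(P) + D_KL(P || Q), which is why minimizing cross-entropy against fixed data is the same optimization problem as minimizing KL divergence to the data distribution.

```python
import numpy as np
from scipy.special import rel_entr

# Two made-up discrete distributions over 4 classes
p = np.array([0.50, 0.25, 0.15, 0.10])   # "data" distribution P
q = np.array([0.40, 0.30, 0.20, 0.10])   # "model" distribution Q

entropy_p = -np.sum(p * np.log(p))       # H(P)
cross_entropy = -np.sum(p * np.log(q))   # H(P, Q)
kl_pq = np.sum(rel_entr(p, q))           # D_KL(P || Q)

print(f"H(P)         = {entropy_p:.4f} nats")
print(f"H(P, Q)      = {cross_entropy:.4f} nats")
print(f"H(P) + D_KL  = {entropy_p + kl_pq:.4f} nats")  # matches H(P, Q)
```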
```python
import numpy as np
import torch
import torch.nn.functional as F

# Core insight: Loss = -log p(y|x) for some probabilistic model

# 1. Cross-Entropy = -log p(y|x) for a Categorical distribution
def cross_entropy_loss(logits, targets):
    """
    Cross-entropy for classification.
    Assumes p(y|x) = Categorical(softmax(f(x)))
    Loss = -log p(y=target|x)
    """
    log_probs = F.log_softmax(logits, dim=-1)
    return -log_probs.gather(1, targets.unsqueeze(1)).mean()

# 2. MSE = -log p(y|x) for a Gaussian distribution (up to a constant)
def mse_as_nll(predictions, targets, sigma=1.0):
    """
    MSE for regression.
    Assumes p(y|x) = N(f(x), σ²)
    -log p(y|x) = (y - f(x))² / (2σ²) + log(σ√(2π))
    MSE ignores the constant terms.
    """
    mse = ((predictions - targets) ** 2).mean()
    # Full NLL (for the proper probabilistic interpretation)
    nll = mse / (2 * sigma**2) + np.log(sigma * np.sqrt(2 * np.pi))
    return mse, nll

# 3. MAE = -log p(y|x) for a Laplace distribution (up to a constant)
def mae_as_nll(predictions, targets, b=1.0):
    """
    MAE (L1 loss) for regression.
    Assumes p(y|x) = Laplace(f(x), b)
    -log p(y|x) = |y - f(x)| / b + log(2b)
    """
    mae = torch.abs(predictions - targets).mean()
    nll = mae / b + np.log(2 * b)
    return mae, nll

# Example
print("Loss Functions as Negative Log-Likelihood")
print("=" * 60)

# Generate data
torch.manual_seed(42)
preds = torch.tensor([0.5, 1.2, 0.8, 1.5])
targets = torch.tensor([0.6, 1.0, 1.0, 1.3])

mse, mse_nll = mse_as_nll(preds, targets)
mae, mae_nll = mae_as_nll(preds, targets)

print(f"MSE: {mse.item():.4f}")
print(f"MSE as Gaussian NLL (σ=1): {mse_nll.item():.4f}")
print()
print(f"MAE: {mae.item():.4f}")
print(f"MAE as Laplace NLL (b=1): {mae_nll.item():.4f}")
print()
print("Key insight: Choosing a loss = choosing a noise distribution")
print("MSE → Gaussian noise | MAE → Laplacian noise | Huber → hybrid")
```

All these losses are negative log-likelihoods under specific distributional assumptions. Training with any loss implicitly assumes a particular noise distribution. This explains why:

- MSE is sensitive to outliers (the Gaussian assigns very low probability to large residuals, so they are penalized heavily)
- MAE is robust (the Laplace distribution has heavier tails)
- Cross-entropy works for classification (categorical likelihood)
The major generative modeling frameworks—VAEs, GANs, flow-based models, and diffusion models—all have deep information-theoretic foundations.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE_InformationTheoretic(nn.Module):
    """
    VAE viewed through an information-theoretic lens.

    Objective: reconstruct X well (which requires a high I(X; Z))
    while keeping q(z|x) close to the prior p(z).

    ELBO = E_q(z|x)[log p(x|z)] - D_KL(q(z|x) || p(z))
         = Reconstruction - Rate

    Rate-distortion interpretation:
    - Reconstruction = How well can we decode X from Z?
    - KL term = How many bits does Z use beyond the prior?
    """
    def __init__(self, input_dim, latent_dim):
        super().__init__()
        self.latent_dim = latent_dim

        # Encoder: q(z|x)
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU()
        )
        self.fc_mu = nn.Linear(128, latent_dim)
        self.fc_logvar = nn.Linear(128, latent_dim)

        # Decoder: p(x|z)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid()
        )

    def encode(self, x):
        h = self.encoder(x)
        return self.fc_mu(h), self.fc_logvar(h)

    def reparameterize(self, mu, logvar):
        """Reparameterization trick: z = μ + σ * ε, ε ~ N(0, 1)"""
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def decode(self, z):
        return self.decoder(z)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        x_recon = self.decode(z)
        return x_recon, mu, logvar

    def loss_function(self, x, x_recon, mu, logvar, beta=1.0):
        """
        ELBO loss with β-VAE weighting.

        Returns components for analysis:
        - Reconstruction: -E_q[log p(x|z)] ≈ BCE(x, x_recon)
        - Rate: D_KL(q(z|x) || p(z)) = D_KL(N(μ, σ²) || N(0, 1))
        """
        # Reconstruction loss (negative log-likelihood)
        # For a Bernoulli decoder: BCE
        recon = F.binary_cross_entropy(x_recon, x, reduction='sum')

        # KL divergence: D_KL(N(μ, σ²) || N(0, 1))
        # = 0.5 * Σ(μ² + σ² - 1 - log(σ²))
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

        # Total loss (negative ELBO)
        total = recon + beta * kl
        return total, recon, kl

# Information quantities
print("VAE Information-Theoretic Analysis")
print("=" * 60)
print()
print("ELBO = E_q[log p(x|z)] - D_KL(q(z|x) || p(z))")
print()
print("Decomposition:")
print("1. E_q[log p(x|z)]: Reconstruction quality")
print("   - How much information about X is preserved in Z?")
print("   - Maximizing this → high I(X; Z)")
print()
print("2. D_KL(q(z|x) || p(z)): Rate / Compression")
print("   - How many 'extra bits' does z use?")
print("   - Minimizing this → low I(X; Z)")
print()
print("β-VAE: Total = Reconstruction + β × KL")
print("   β > 1: More compression, better disentanglement")
print("   β < 1: Better reconstruction, less regularization")
```

GANs and Divergence Estimation:
The GAN training objective can be understood as estimating and minimizing a divergence between distributions: with an optimal discriminator, the original minimax objective reduces to a scaled and shifted Jensen-Shannon divergence between p_data and p_gen.
The discriminator D serves as a density ratio estimator: D(x) ≈ p_data(x) / (p_data(x) + p_gen(x)), enabling divergence computation without explicit density estimation.
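Here is a minimal sketch of that idea under a simplifying assumption: instead of training a discriminator, we use two known 1-D Gaussians as stand-ins for p_data and p_gen, plug the Bayes-optimal discriminator D*(x) = p_data(x) / (p_data(x) + p_gen(x)) into the GAN value function, and recover an estimate of the Jensen-Shannon divergence by Monte Carlo.

```python
import numpy as np
from scipy.stats import norm

# Known densities standing in for p_data and p_gen (illustration only)
p_data = norm(loc=0.0, scale=1.0)
p_gen = norm(loc=1.5, scale=1.0)

rng = np.random.default_rng(0)
x_data = rng.normal(0.0, 1.0, size=100_000)   # samples ~ p_data
x_gen = rng.normal(1.5, 1.0, size=100_000)    # samples ~ p_gen

def d_star(x):
    """Bayes-optimal discriminator: p_data(x) / (p_data(x) + p_gen(x))."""
    a, b = p_data.pdf(x), p_gen.pdf(x)
    return a / (a + b)

# GAN value function at the optimal discriminator:
#   V(D*) = E_data[log D*(x)] + E_gen[log(1 - D*(x))] = 2 * JSD(p_data, p_gen) - log 4
v_star = np.mean(np.log(d_star(x_data))) + np.mean(np.log(1.0 - d_star(x_gen)))
jsd_estimate = 0.5 * (v_star + np.log(4.0))

print(f"V(D*) ≈ {v_star:.4f}")
print(f"Estimated JSD(p_data, p_gen) ≈ {jsd_estimate:.4f} nats")
```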
GAN mode collapse can be understood information-theoretically: the generator effectively minimizes a divergence such as D_KL(P_gen || P_data) (reverse KL), which is mode-seeking. This rewards fitting a few modes of the data well while ignoring the others. Alternative divergences (forward KL, Wasserstein) have different mode-covering properties.
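The mode-seeking versus mode-covering distinction is easy to see on a toy problem. The sketch below (entirely synthetic) fits a single Gaussian Q to a bimodal mixture P by grid search, once minimizing forward KL D_KL(P || Q) and once minimizing reverse KL D_KL(Q || P): the forward-KL fit spreads over both modes, while the reverse-KL fit locks onto one.

```python
import numpy as np
from scipy.stats import norm
from scipy.special import rel_entr

# Discretize a bimodal target P (mixture of two Gaussians) on a grid
x = np.linspace(-8, 8, 2001)
p = 0.5 * norm.pdf(x, -3, 0.7) + 0.5 * norm.pdf(x, 3, 0.7)
p /= p.sum()

def kl(a, b):
    """Discrete D_KL(a || b)."""
    return np.sum(rel_entr(a, b))

best_fwd, best_rev = None, None
for mu in np.linspace(-5, 5, 101):
    for sigma in np.linspace(0.3, 5, 48):
        q = norm.pdf(x, mu, sigma)
        q /= q.sum()
        fwd, rev = kl(p, q), kl(q, p)   # forward D_KL(P||Q), reverse D_KL(Q||P)
        if best_fwd is None or fwd < best_fwd[0]:
            best_fwd = (fwd, mu, sigma)
        if best_rev is None or rev < best_rev[0]:
            best_rev = (rev, mu, sigma)

print(f"Forward KL minimizer: mu={best_fwd[1]:+.2f}, sigma={best_fwd[2]:.2f}  (covers both modes)")
print(f"Reverse KL minimizer: mu={best_rev[1]:+.2f}, sigma={best_rev[2]:.2f}  (locks onto one mode)")
```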
The Information Bottleneck (IB) principle, introduced by Tishby et al. in 1999, provides a profound framework for understanding learning as optimal compression.
The IB Objective:
Given input X and target Y, find a representation T that solves:
minimize: I(X; T) − β · I(T; Y)
This trades off compression (reducing I(X; T), discarding input detail) against prediction (increasing I(T; Y), retaining what is relevant to the target).
The parameter β controls the tradeoff. As β varies from 0 to ∞, we trace an information curve from maximal compression to maximal prediction.
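Before the conceptual layer-by-layer picture below, here is a minimal self-contained sketch of the quantities themselves: for a small, made-up joint distribution p(x, y) and a hand-specified stochastic encoder p(t|x), both I(X; T) and I(T; Y) can be computed exactly, which makes the compression-prediction tradeoff concrete.

```python
import numpy as np

def mutual_information(p_joint):
    """Exact MI (in nats) from a joint probability table p(a, b)."""
    pa = p_joint.sum(axis=1, keepdims=True)
    pb = p_joint.sum(axis=0, keepdims=True)
    mask = p_joint > 0
    return np.sum(p_joint[mask] * np.log(p_joint[mask] / (pa @ pb)[mask]))

# Made-up joint p(x, y): 4 inputs, 2 labels. x∈{0,1} mostly maps to y=0, x∈{2,3} to y=1.
p_xy = np.array([
    [0.22, 0.03],
    [0.20, 0.05],
    [0.04, 0.21],
    [0.02, 0.23],
])
p_x = p_xy.sum(axis=1)

beta = 2.0
print(f"{'noise':>6} {'I(X;T)':>8} {'I(T;Y)':>8} {'IB objective':>14}")
for noise in [0.0, 0.1, 0.3, 0.5]:
    # Encoder p(t|x): t=0 for x in {0,1}, t=1 for x in {2,3}, flipped with prob `noise`
    p_t_given_x = np.array([
        [1 - noise, noise],
        [1 - noise, noise],
        [noise, 1 - noise],
        [noise, 1 - noise],
    ])
    p_xt = p_x[:, None] * p_t_given_x       # joint p(x, t)
    p_ty = p_t_given_x.T @ p_xy             # joint p(t, y), valid since T ⊥ Y | X
    I_xt, I_ty = mutual_information(p_xt), mutual_information(p_ty)
    print(f"{noise:>6.1f} {I_xt:>8.3f} {I_ty:>8.3f} {I_xt - beta * I_ty:>14.3f}")
```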
```python
# Information Bottleneck Framework
# ================================

# The IB objective:
#   min I(X; T) - β × I(T; Y)
#
# Equivalently:
#   max I(T; Y) - β⁻¹ × I(X; T)

# Key insights:
# 1. The optimal T discards information in X not relevant to Y
# 2. β controls the compression-prediction tradeoff
# 3. Neural network layers approximately solve IB

# The IB Lagrangian: L = I(X; T) - β × I(T; Y)

# With the Markov chain Y → X → T (T is computed from X alone):
#   I(T; Y) ≤ I(X; Y)   (data processing inequality)
#   I(T; Y) ≤ I(X; T)   (T must retain enough about X to predict Y)

# In neural networks:
# - Early layers: High I(X; T), moderate I(T; Y)
# - Later layers: Lower I(X; T), higher I(T; Y)
# - This is "compression" as information flows through the network

def plot_information_plane():
    """
    Conceptual illustration of the information plane.
    Each point (I(X;T), I(T;Y)) represents a layer's information content.
    (The values below are illustrative, not measured.)
    """
    # Simulated layer-wise information (conceptual)
    layers = ['Input', 'Hidden1', 'Hidden2', 'Hidden3', 'Output']

    # During training, networks compress (reduce I(X;T)) while maintaining I(T;Y)
    I_XT_init = [5.0, 4.8, 4.5, 4.0, 3.5]   # Initially, preserve most input info
    I_TY_init = [0.5, 0.8, 1.0, 1.2, 1.5]   # Gradually improve prediction
    I_XT_final = [5.0, 3.5, 2.5, 2.0, 1.8]  # Compress unnecessary info
    I_TY_final = [0.5, 1.2, 1.5, 1.8, 2.0]  # Improve prediction

    print("Information Plane Analysis")
    print("=" * 60)
    print(f"{'Layer':<15} {'I(X;T) init':<15} {'I(X;T) final':<15} {'Compression':<15}")
    print("-" * 60)

    for i, layer in enumerate(layers):
        comp = I_XT_init[i] - I_XT_final[i]
        print(f"{layer:<15} {I_XT_init[i]:<15.2f} {I_XT_final[i]:<15.2f} {comp:+.2f}")

    print()
    print("Key observation: Networks compress I(X;T) while improving I(T;Y)")
    print("This is the 'information bottleneck' at work!")

plot_information_plane()

# Variational Information Bottleneck (VIB)
print()
print("Variational Information Bottleneck (VIB)")
print("-" * 60)
print("Tractable approximation to IB using variational bounds:")
print()
print("max I(T; Y) - β × I(X; T)")
print("≈ max E[log p(Y|T)] - β × D_KL(p(T|X) || r(T))")
print()
print("Where:")
print("  p(T|X): Encoder (stochastic neural network)")
print("  r(T):   Variational prior (e.g., N(0, I))")
print("  p(Y|T): Predictor")
print()
print("This is a stochastic neural network with KL regularization!")
```

Deep Learning as Implicit IB:
Tishby and colleagues proposed that deep neural networks implicitly solve the information bottleneck: training first fits the data (increasing I(T; Y)) and then compresses (decreasing I(X; T)), with successive layers sitting at increasingly compressed points on the information plane.
This compression hypothesis is debated but influential. What is clear is that the information plane gives a concrete vocabulary for describing what layers retain and discard; the main points of contention are summarized below.
The compression-phase hypothesis is controversial. Some argue that:
- compression depends on the activation function (tanh shows it, ReLU may not)
- discrete MI estimates are unreliable for continuous networks
- generalization may not require compression
Regardless, IB provides a useful conceptual framework for thinking about representations.
Modern representation learning—especially self-supervised learning—is deeply grounded in information theory. The key objectives can all be expressed in terms of mutual information:
| Method | IT Objective | Intuition |
|---|---|---|
| Contrastive (SimCLR) | max I(view₁; view₂) | Augmented views of same image should share info |
| Masked Prediction (BERT) | max I(context; masked) | Context should predict masked tokens |
| InfoMax (DIM) | max I(x; f(x)) | Representation should preserve input info |
| VIB | max I(z;y) - β·I(x;z) | Task-relevant, input-compressed representations |
| Predictive Coding (CPC) | max I(c_t; x_{t+k}) | A summary of the past should predict future observations |
| InfoGAN | max I(c; G(z,c)) | Latent codes should be recoverable from generated samples |
```python
import torch
import torch.nn.functional as F

class ContrastiveLearner:
    """
    Contrastive learning maximizes I(view_1; view_2).

    The InfoNCE bound:
        I(X; Y) >= log(K) - L_NCE
    where K is the number of negatives + 1 and L_NCE is the contrastive loss.
    """
    def __init__(self, temperature=0.07):
        self.temperature = temperature

    def info_nce_loss(self, z_i, z_j):
        """
        Compute the InfoNCE loss, using the other samples
        in the batch as negatives.

        Args:
            z_i: First view representations (batch_size, dim)
            z_j: Second view representations (batch_size, dim)
        """
        batch_size = z_i.size(0)

        # Normalize embeddings
        z_i = F.normalize(z_i, dim=1)
        z_j = F.normalize(z_j, dim=1)

        # Compute similarity matrix
        representations = torch.cat([z_i, z_j], dim=0)  # (2*batch, dim)
        similarity = representations @ representations.T / self.temperature

        # Mask self-similarity (diagonal)
        mask = torch.eye(2 * batch_size, device=z_i.device).bool()
        similarity = similarity.masked_fill(mask, float('-inf'))

        # For each z_i, the positive is the corresponding z_j (and vice versa)
        labels = torch.cat([
            torch.arange(batch_size, 2 * batch_size),
            torch.arange(batch_size)
        ], dim=0).to(z_i.device)

        # Cross-entropy loss
        loss = F.cross_entropy(similarity, labels)
        return loss

    def estimate_mi_bound(self, loss, num_negatives):
        """
        Estimate the MI lower bound from the InfoNCE loss:
        I(X; Y) >= log(K) - L_NCE
        """
        K = num_negatives + 1  # +1 for the positive
        return torch.log(torch.tensor(float(K))) - loss

# Example
learner = ContrastiveLearner(temperature=0.07)

# Simulated representations from two views
batch_size = 256
dim = 128

z1 = torch.randn(batch_size, dim)
z2 = z1 + torch.randn(batch_size, dim) * 0.3  # Views are similar

loss = learner.info_nce_loss(z1, z2)
mi_bound = learner.estimate_mi_bound(loss, num_negatives=2 * batch_size - 2)

print("Contrastive Learning Analysis")
print("=" * 60)
print(f"InfoNCE Loss: {loss.item():.4f}")
print(f"Number of negatives: {2 * batch_size - 2}")
print(f"MI Lower Bound: {mi_bound.item():.4f} nats")
print()
print("Key insight: Lower loss → higher MI bound → better representations")
```

Maximizing I(view₁; view₂):
- forces the representation to capture shared semantics (the "content")
- discards view-specific noise (augmentation artifacts)
- yields representations invariant to irrelevant transformations
- tends to produce hierarchical features when combined with sufficient data diversity
Information theory provides theoretical guarantees about when and how well machine learning algorithms can generalize. These bounds complement classical VC-dimension and Rademacher complexity bounds with an information-theoretic perspective.
The Key Insight:
If a learning algorithm extracts less information from the training data, it's less likely to overfit. This leads to bounds of the form:
Generalization Gap ≤ O(√(I(Algorithm Output; Training Data) / n))
The less the algorithm "remembers" about specific training points, the better it generalizes.
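To get a feel for the numbers, the sketch below evaluates the simplified bound √(2 · I(W; S) / n) for a few hypothetical values of I(W; S) and n; the values are illustrative, not measured from any real training run.

```python
import numpy as np

def it_generalization_bound(mi_ws_bits, n):
    """Simplified bound: E[|L_test - L_train|] <= sqrt(2 * I(W;S) / n), with I(W;S) in nats."""
    mi_nats = mi_ws_bits * np.log(2)   # convert bits -> nats
    return np.sqrt(2 * mi_nats / n)

print(f"{'I(W;S) [bits]':>14} {'n':>10} {'bound on gap':>14}")
for mi_bits in [10, 1_000, 100_000]:
    for n in [1_000, 100_000]:
        print(f"{mi_bits:>14,} {n:>10,} {it_generalization_bound(mi_bits, n):>14.3f}")

# Extracting few bits per training example keeps the bound small;
# memorizing many bits relative to n makes it vacuous.
```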
```python
# Information-Theoretic Generalization Bounds
# ===========================================

# Classic bound (simplified form):
#   E[|L_test - L_train|] ≤ √(2 * I(W; S) / n)
#
# Where:
# - W: Learned parameters (output of the algorithm)
# - S: Training set
# - n: Number of training samples
# - I(W; S): Mutual information between weights and training data

# Key implications:
# 1. Less information extracted → better generalization
# 2. SGD noise limits I(W; S), aiding generalization
# 3. Regularization reduces I(W; S)
# 4. Data augmentation increases effective n

# Practical insights:
print("Information-Theoretic View of Regularization")
print("=" * 60)
print()

regularizers = {
    "Weight Decay (L2)": "Constrains weights to a low-norm region, limiting information capacity",
    "Dropout": "Prevents co-adaptation, reducing I(W; S) by adding noise",
    "Batch Normalization": "Stabilizes representations, limiting individual sample influence",
    "Label Smoothing": "Reduces I(Y_pred; Y_true|X), preventing overfitting to hard labels",
    "Data Augmentation": "Increases effective n without changing I(W; S)",
    "Early Stopping": "Stops before full training data memorization, limiting I(W; S)",
}

for reg, explanation in regularizers.items():
    print(f"{reg}:")
    print(f"  {explanation}")
    print()

# PAC-Bayes bounds (information-theoretic flavor)
print("PAC-Bayes Bounds")
print("-" * 60)
print()
print("Generalization Error ≤ Train Error + √(D_KL(posterior || prior) / n)")
print()
print("This connects:")
print("- KL divergence between the learned distribution and the prior")
print("- Sample size n")
print("- Generalization gap")
print()
print("Insight: Learning 'close to the prior' generalizes better")
```

Compression Implies Generalization:
A profound result connects compression to generalization: if the learned hypothesis can be described in k bits, its generalization gap scales roughly as √(k / n) (an Occam's-razor bound). The fewer bits needed to write the model down, the stronger the guarantee.
This partially explains why networks that can be heavily pruned or quantized without losing accuracy tend to generalize well, and why flat minima, which admit short descriptions, are associated with better generalization.
The description length is an information-theoretic measure of complexity.
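A minimal sketch of the classic Occam bound behind this statement: a hypothesis that can be written down in k bits has, with probability at least 1 − δ, a generalization gap of roughly √((k · ln 2 + ln(1/δ)) / (2n)) under a 0/1 loss. The description lengths below are made-up examples.

```python
import numpy as np

def occam_bound(k_bits, n, delta=0.05):
    """Occam's-razor bound for a hypothesis describable in k_bits bits (0/1 loss)."""
    return np.sqrt((k_bits * np.log(2) + np.log(1 / delta)) / (2 * n))

n = 50_000  # training examples
print(f"{'description length [bits]':>26} {'bound on gap':>14}")
for k_bits in [1_000, 100_000, 10_000_000]:
    print(f"{k_bits:>26,} {occam_bound(k_bits, n):>14.3f}")

# A model that compresses to few bits gets a much tighter guarantee
# than one whose description is as large as its raw parameter count.
```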
The "double descent" phenomenon—where test error drops again after the interpolation threshold—may have an information-theoretic explanation. Over-parameterized networks find minimum-norm solutions, which have low "effective" information content despite high parameter counts. The implicit bias of gradient descent toward simple solutions limits I(W; S) even when parameters exceed samples.
Let's connect information theory to practical ML engineering with concrete examples and tools:
```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression
from sklearn.metrics import log_loss
from scipy.stats import entropy
from scipy.special import rel_entr

class InformationTheoreticML:
    """
    Toolkit for IT-informed machine learning practice.
    """

    @staticmethod
    def feature_importance_mi(X, y, task='classification'):
        """
        Compute MI-based feature importance.
        Superior to correlation for nonlinear relationships.
        """
        if task == 'classification':
            mi_scores = mutual_info_classif(X, y, random_state=42)
        else:
            mi_scores = mutual_info_regression(X, y, random_state=42)
        return mi_scores

    @staticmethod
    def prediction_entropy(probs):
        """
        Compute the entropy of predicted probabilities.
        High entropy = uncertain model.
        """
        return entropy(probs, axis=1) / np.log(probs.shape[1])  # Normalized

    @staticmethod
    def calibration_analysis(y_true, y_pred_proba):
        """
        Information-theoretic calibration analysis.
        Well-calibrated: predicted entropy ≈ actual uncertainty
        """
        # Cross-entropy (nats needed using the model's predictions)
        ce = log_loss(y_true, y_pred_proba)

        # True entropy (if we knew perfect probabilities),
        # approximated from empirical class frequencies
        class_freq = np.bincount(y_true) / len(y_true)
        true_entropy = entropy(class_freq)

        return {
            'cross_entropy': ce,
            'true_entropy': true_entropy,
            'calibration_gap': ce - true_entropy,  # Ideally small
        }

    @staticmethod
    def kl_divergence(p, q):
        """Compute D_KL(P || Q) for discrete distributions."""
        return np.sum(rel_entr(p, q))

    @staticmethod
    def vae_kl_monitor(mu, log_var):
        """
        Monitor the KL term during VAE training.
        Useful for detecting posterior collapse.
        """
        kl = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var), axis=1)
        kl_per_dim = np.mean(-0.5 * (1 + log_var - mu**2 - np.exp(log_var)), axis=0)
        return {
            'mean_kl': np.mean(kl),
            'std_kl': np.std(kl),
            'kl_per_dim': kl_per_dim,
            'collapsed_dims': np.sum(kl_per_dim < 0.01),
        }

# Example usage
toolkit = InformationTheoreticML()

# Feature importance
np.random.seed(42)
X = np.random.randn(1000, 10)
y = (X[:, 0]**2 + X[:, 1] > 0.5).astype(int)  # Nonlinear in X0, linear in X1

mi_scores = toolkit.feature_importance_mi(X, y)
print("MI-based Feature Importance (nonlinear data)")
print("=" * 50)
for i, score in enumerate(mi_scores):
    importance = "HIGH" if i < 2 else "low"
    print(f"Feature {i}: MI = {score:.4f} ({importance})")

print()

# Prediction uncertainty
probs = np.array([
    [0.10, 0.10, 0.80],   # Confident
    [0.33, 0.33, 0.34],   # Uncertain
    [0.05, 0.05, 0.90],   # Very confident
])
entropies = toolkit.prediction_entropy(probs)
print("Prediction Entropy Analysis")
print("-" * 50)
for i, ent in enumerate(entropies):
    print(f"Sample {i+1}: Entropy = {ent:.4f} (normalized)")
```

Apply information-theoretic thinking when:
- choosing between loss functions (find the right likelihood)
- selecting features for complex models (MI beats correlation)
- debugging generative models (monitor KL and reconstruction separately)
- understanding model uncertainty (entropy of predictions)
- designing self-supervised objectives (maximize information preservation)
Information theory continues to drive ML research. Here are active frontiers:
The Scaling Laws Connection:
Recent work suggests that neural scaling laws (loss ∝ compute^(−α)) may have information-theoretic explanations: the irreducible part of the loss corresponds to the entropy of the data itself, while the power-law decay reflects the diminishing rate at which additional data and parameters extract the remaining information.
This connects compute budgets directly to information extraction rates.
What LLMs Know (In Bits):
An intriguing research direction: quantify LLM knowledge in bits. If a model achieves perplexity P on a test set, it is spending log₂ P bits per token on average, and the reduction relative to a simple baseline is the number of bits per token the model has effectively learned.
This gives a rough measure of how much the model has learned about language.
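A minimal sketch of that conversion, with made-up perplexity numbers: a model with perplexity P spends log₂ P bits per token on average, so comparing against a weaker baseline yields a rough "bits saved per token" figure.

```python
import numpy as np

def bits_per_token(perplexity):
    """Cross-entropy in bits/token implied by a given perplexity."""
    return np.log2(perplexity)

baseline_ppl = 600.0   # e.g., a weak baseline model (illustrative number)
model_ppl = 12.0       # e.g., a strong language model (illustrative number)

print(f"Baseline: {bits_per_token(baseline_ppl):.2f} bits/token")
print(f"Model:    {bits_per_token(model_ppl):.2f} bits/token")
print(f"Saved:    {bits_per_token(baseline_ppl) - bits_per_token(model_ppl):.2f} bits/token")

# Multiplying the per-token saving by a corpus size gives a (very rough)
# estimate, in bits, of how much the model has learned about the text.
```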
Claude Shannon's 1948 paper remains foundational 75+ years later. The concepts of entropy, channel capacity, and optimal coding directly inform how we:
- design loss functions (cross-entropy)
- train generative models (ELBO, divergences)
- understand representations (mutual information)
- bound generalization (IT generalization bounds)
Information theory provides the mathematical language for reasoning about learning itself.
We've completed our journey through information theory and its applications in machine learning. Let's consolidate the unified view:
The Big Picture:
Machine learning is, fundamentally, about extracting and using information: compressing what the data says into parameters and representations, keeping the parts that are relevant to the task, and discarding the rest.
Information theory provides the mathematical framework for reasoning precisely about these goals. Whether you're debugging a loss function, designing a representation learning objective, or understanding why regularization helps, information-theoretic thinking offers clarity and rigor.
Congratulations! You've mastered the fundamentals of information theory for machine learning. You understand entropy as the measure of uncertainty, cross-entropy as the natural classification loss, KL divergence as the measure of distributional difference, mutual information as the measure of relevance, and how these concepts unify our understanding of learning itself. This foundation will serve you throughout your ML journey.