When building machine learning models, we constantly grapple with questions of relevance: Which features are most informative for predicting the target? How much does knowing X help us predict Y? What information do representations capture about the input?
These questions have a precise answer in information theory: mutual information. Denoted I(X; Y), mutual information quantifies the amount of information that knowing one variable provides about another. Unlike correlation, which only captures linear relationships, mutual information captures all statistical dependencies.
Mutual information is:
• symmetric: I(X; Y) = I(Y; X)
• non-negative: I(X; Y) ≥ 0
• zero if and only if X and Y are independent
This elegant measure connects directly to entropy and KL divergence:
I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X) = D_KL(P(X,Y) || P(X)P(Y))
The last form is particularly revealing: mutual information is the KL divergence between the joint distribution and the product of marginals—the "statistical distance" from independence.
By the end of this page, you will:
• understand mutual information's definition and multiple equivalent forms
• appreciate its relationship to entropy, conditional entropy, and KL divergence
• apply mutual information to feature selection and representation analysis
• tackle the challenge of estimating MI from samples
Mutual information can be defined in several equivalent ways, each offering different intuition:
```python
# Mutual Information: Equivalent Definitions
# ==========================================
#
# Definition 1: Reduction in uncertainty
# "How much does knowing Y reduce uncertainty about X?"
#   I(X; Y) = H(X) - H(X|Y)
#
# Definition 2: Symmetric form
# "How much does knowing either reduce uncertainty about the other?"
#   I(X; Y) = H(X) + H(Y) - H(X, Y)
#
# Definition 3: KL divergence from independence
# "How far is the joint from the product of marginals?"
#   I(X; Y) = D_KL(P(X,Y) || P(X)P(Y))
#           = Σ_x Σ_y P(x,y) log[P(x,y) / (P(x)P(y))]
#
# Definition 4: Expected pointwise mutual information
# "Average surprise about joint vs. independent co-occurrence"
#   I(X; Y) = E_{P(X,Y)}[log(P(X,Y) / (P(X)P(Y)))]
#
# All definitions are mathematically equivalent!

# Python implementation:
import numpy as np

def entropy(probs):
    """Compute entropy of a distribution in bits."""
    probs = np.array(probs).flatten()
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

def mutual_information(joint_probs):
    """
    Compute mutual information from a joint probability matrix.

    Args:
        joint_probs: 2D array where joint_probs[i, j] = P(X=i, Y=j)

    Returns:
        Mutual information I(X; Y) in bits, computed two equivalent ways.
    """
    joint = np.array(joint_probs)
    assert np.abs(joint.sum() - 1.0) < 1e-6, "Joint probs must sum to 1"

    # Marginal distributions
    p_x = joint.sum(axis=1)  # Sum over Y
    p_y = joint.sum(axis=0)  # Sum over X

    # Method 1: I(X;Y) = H(X) + H(Y) - H(X,Y)
    h_x = entropy(p_x)
    h_y = entropy(p_y)
    h_xy = entropy(joint)
    mi_method1 = h_x + h_y - h_xy

    # Method 2: I(X;Y) = Σ P(x,y) log[P(x,y) / (P(x)P(y))]
    mi_method2 = 0
    for i in range(joint.shape[0]):
        for j in range(joint.shape[1]):
            if joint[i, j] > 0:
                mi_method2 += joint[i, j] * np.log2(joint[i, j] / (p_x[i] * p_y[j]))

    return mi_method1, mi_method2

# Example: Weather and umbrella carrying
joint = np.array([
    [0.63, 0.07],  # Sunny: [no umbrella, umbrella]
    [0.03, 0.27]   # Rainy: [no umbrella, umbrella]
])

mi1, mi2 = mutual_information(joint)
print(f"Mutual Information (method 1): {mi1:.4f} bits")
print(f"Mutual Information (method 2): {mi2:.4f} bits")

# Verify: marginals
p_weather = joint.sum(axis=1)   # [0.7, 0.3]
p_umbrella = joint.sum(axis=0)  # [0.66, 0.34]
print(f"\nP(Weather): {p_weather}")
print(f"P(Umbrella): {p_umbrella}")
```

Understanding the Venn diagram:
The relationships between entropy, conditional entropy, joint entropy, and mutual information can be visualized as overlapping regions (a Venn-style information diagram):
```
┌─────────────────────────────────────┐
│               H(X, Y)               │
│  ┌─────────────┬─────────────┐      │
│  │   H(X|Y)    │   H(Y|X)    │      │
│  │             │             │      │
│  │      ┌──────┴──────┐      │      │
│  │      │   I(X; Y)   │      │      │
│  │      └──────┬──────┘      │      │
│  │             │             │      │
│  └─────────────┴─────────────┘      │
│       H(X)          H(Y)            │
└─────────────────────────────────────┘
```
From the Venn diagram structure:
• I(X; Y) = H(X) − H(X|Y) = "uncertainty reduction"
• I(X; Y) = H(X) + H(Y) − H(X, Y) = "redundancy"
• H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y) = "chain rule"
• 0 ≤ I(X; Y) ≤ min(H(X), H(Y)) = "bounded by marginals"
Mutual information has several elegant properties that make it the canonical measure of statistical dependence:
```python
import numpy as np

def mi_from_joint(joint):
    """Compute MI (in bits) from a joint probability matrix."""
    joint = np.array(joint)
    p_x = joint.sum(axis=1)
    p_y = joint.sum(axis=0)
    mi = 0
    for i in range(joint.shape[0]):
        for j in range(joint.shape[1]):
            if joint[i, j] > 0:
                mi += joint[i, j] * np.log2(joint[i, j] / (p_x[i] * p_y[j] + 1e-15))
    return mi

def entropy(probs):
    probs = np.array(probs)
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

# Property 1: Symmetry
print("Property: Symmetry")
joint = np.array([[0.4, 0.1], [0.1, 0.4]])
joint_transposed = joint.T
print(f"I(X; Y) = {mi_from_joint(joint):.4f}")
print(f"I(Y; X) = {mi_from_joint(joint_transposed):.4f}")
print()

# Property 2: Non-negativity
print("Property: Non-negativity (random tests)")
for _ in range(3):
    # Random joint distribution
    joint = np.random.dirichlet(np.ones(9)).reshape(3, 3)
    mi = mi_from_joint(joint)
    print(f"  I(X; Y) = {mi:.4f} >= 0: {mi >= -1e-10}")
print()

# Property 3: Zero for independence
print("Property: Zero for independent variables")
p_x = np.array([0.3, 0.4, 0.3])
p_y = np.array([0.5, 0.5])
joint_independent = np.outer(p_x, p_y)  # Product of marginals
print(f"Independent joint: I(X; Y) = {mi_from_joint(joint_independent):.6f} ≈ 0")
print()

# Property 4: Self-information
print("Property: I(X; X) = H(X)")
p = np.array([0.25, 0.25, 0.25, 0.25])
# Diagonal joint: P(X=x, X=x) = P(X=x), off-diagonal = 0
joint_self = np.diag(p)
i_xx = mi_from_joint(joint_self)
h_x = entropy(p)
print(f"I(X; X) = {i_xx:.4f}")
print(f"H(X) = {h_x:.4f}")
print()

# Property 5: Upper bound
print("Property: Upper bound")
joint = np.array([[0.5, 0.0], [0.0, 0.5]])  # Perfectly dependent: X determines Y
p_x = joint.sum(axis=1)
p_y = joint.sum(axis=0)
mi = mi_from_joint(joint)
print(f"I(X; Y) = {mi:.4f}")
print(f"min(H(X), H(Y)) = min({entropy(p_x):.4f}, {entropy(p_y):.4f}) = {min(entropy(p_x), entropy(p_y)):.4f}")
# Here the bound is attained: I(X; Y) = min(H(X), H(Y)) = 1 bit
```

One further property, the data processing inequality, deserves special attention. For any Markov chain X → Y → Z (meaning Z is conditionally independent of X given Y):
I(X; Z) ≤ I(X; Y)
This profound result says that processing (transforming) data can only destroy information, never create it. Every layer in a neural network can only lose information about the input (though it may make the remaining information more useful for the task).
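To see the inequality numerically, here is a minimal sketch (the chain construction and noise levels are illustrative assumptions, not part of any standard benchmark). It builds a Markov chain X → Y → Z by adding independent corruption at each step and compares plug-in MI estimates:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
n = 100_000

x = rng.integers(0, 4, n)              # source variable
y = np.where(rng.random(n) < 0.2,
             rng.integers(0, 4, n), x)  # noisy copy of X
z = np.where(rng.random(n) < 0.2,
             rng.integers(0, 4, n), y)  # noisy copy of Y (sees only Y, not X)

print(f"I(X; Y) ≈ {mutual_info_score(x, y):.4f} nats")
print(f"I(X; Z) ≈ {mutual_info_score(x, z):.4f} nats  (never exceeds I(X; Y))")
```

Each stage of processing can only degrade what Z retains about X, which is exactly what the estimates show.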
A common question: "Why use mutual information instead of correlation?" The answer reveals a fundamental difference in what each measure captures.
Correlation (Pearson's r) measures linear dependence: it can be exactly zero even when Y is a deterministic nonlinear function of X.
Mutual information measures any statistical dependence: it is zero if and only if X and Y are independent.
```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression
from sklearn.metrics import mutual_info_score

np.random.seed(42)
n = 5000

print("Comparing Correlation vs Mutual Information")
print("=" * 60)
print()

# Case 1: Linear relationship (both work)
x1 = np.random.normal(0, 1, n)
y1 = 2 * x1 + np.random.normal(0, 0.5, n)

corr1 = np.corrcoef(x1, y1)[0, 1]
mi1 = mutual_info_regression(x1.reshape(-1, 1), y1)[0]

print("Case 1: Linear relationship (Y = 2X + noise)")
print(f"  Correlation:        {corr1:.4f}")
print(f"  Mutual Information: {mi1:.4f} nats")
print()

# Case 2: Quadratic relationship (correlation fails!)
x2 = np.random.uniform(-3, 3, n)
y2 = x2**2 + np.random.normal(0, 0.5, n)

corr2 = np.corrcoef(x2, y2)[0, 1]
mi2 = mutual_info_regression(x2.reshape(-1, 1), y2)[0]

print("Case 2: Quadratic relationship (Y = X² + noise)")
print(f"  Correlation:        {corr2:.4f} (near zero!)")
print(f"  Mutual Information: {mi2:.4f} nats (high!)")
print()

# Case 3: Sinusoidal relationship
x3 = np.random.uniform(0, 4*np.pi, n)
y3 = np.sin(x3) + np.random.normal(0, 0.2, n)

corr3 = np.corrcoef(x3, y3)[0, 1]
mi3 = mutual_info_regression(x3.reshape(-1, 1), y3)[0]

print("Case 3: Sinusoidal relationship (Y = sin(X) + noise)")
print(f"  Correlation:        {corr3:.4f} (near zero!)")
print(f"  Mutual Information: {mi3:.4f} nats (high!)")
print()

# Case 4: XOR-like relationship (binary)
x4 = np.random.randint(0, 2, n)
y4 = np.random.randint(0, 2, n)
z4 = (x4 + y4) % 2  # XOR

corr4 = np.corrcoef(x4, z4)[0, 1]
mi4 = mutual_info_score(x4, z4)  # discrete MI

print("Case 4: XOR relationship (Z = X XOR Y)")
print(f"  Correlation(X, Z):  {corr4:.4f}")
print(f"  Mutual Information: {mi4:.4f} nats")
# Note: both are ≈ 0 here. Z depends on (X, Y) jointly, not on X alone, so even
# pairwise MI misses it; I(X, Y; Z) or the conditional I(X; Z | Y) would reveal it.
print()

# Case 5: Independence (both should be zero)
x5 = np.random.normal(0, 1, n)
y5 = np.random.normal(0, 1, n)

corr5 = np.corrcoef(x5, y5)[0, 1]
mi5 = mutual_info_regression(x5.reshape(-1, 1), y5)[0]

print("Case 5: Independence (X and Y unrelated)")
print(f"  Correlation:        {corr5:.4f}")
print(f"  Mutual Information: {mi5:.4f} nats (≈ 0)")

print()
print("Key insight: Correlation can be zero even with strong dependence!")
print("MI captures ALL dependencies, linear and nonlinear.")
```

| Aspect | Correlation (r) | Mutual Information (I) |
|---|---|---|
| Relationships detected | Linear only | Any statistical dependence |
| Range | [-1, 1] | [0, min(H(X), H(Y))] |
| Zero means | No linear relationship | Independence |
| For Y = X² | ≈ 0 | High |
| Symmetry | Symmetric | Symmetric |
| Units | Unitless | Bits or nats |
| Computation | O(n) | O(n log n) or harder |
Use correlation when:
• You expect linear relationships
• Computational efficiency matters
• You need a standardized effect size
Use mutual information when:
• Nonlinear relationships are expected
• You need true independence testing
• You are selecting features for complex models
One of the most practical applications of mutual information is feature selection: ranking and selecting features by how much information they provide about the target variable.
The intuition: a feature with high I(feature; target) tells us a lot about the target, while a feature with I(feature; target) ≈ 0 is uninformative and can be dropped. Ranking features by their MI with the target gives a simple, model-agnostic relevance score.
This is principled—we're directly measuring how much uncertainty about the target is reduced by knowing the feature.
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif, SelectKBest
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Create dataset with known structure
np.random.seed(42)
n_samples = 1000
n_informative = 5
n_redundant = 3
n_useless = 12

# Generate data (shuffle=False keeps the informative and redundant features in the
# leading columns, so the type labels printed below line up with the indices)
X, y = make_classification(
    n_samples=n_samples,
    n_features=n_informative + n_redundant + n_useless,
    n_informative=n_informative,
    n_redundant=n_redundant,
    n_clusters_per_class=2,
    flip_y=0.05,
    shuffle=False,
    random_state=42
)

feature_names = [f"F{i}" for i in range(X.shape[1])]

# Compute MI for each feature
mi_scores = mutual_info_classif(X, y, random_state=42)

# Rank features
ranking = np.argsort(mi_scores)[::-1]

print("Feature Ranking by Mutual Information")
print("=" * 50)
print(f"{'Rank':<6} {'Feature':<10} {'MI Score':<12} {'Type'}")
print("-" * 50)

for i, idx in enumerate(ranking):
    if idx < n_informative:
        ftype = "Informative"
    elif idx < n_informative + n_redundant:
        ftype = "Redundant"
    else:
        ftype = "Useless"
    print(f"{i+1:<6} {feature_names[idx]:<10} {mi_scores[idx]:<12.4f} {ftype}")

print()

# Select top features and compare performance
print("Model Performance Comparison")
print("-" * 50)

for k in [5, 10, 20]:
    selector = SelectKBest(mutual_info_classif, k=k)
    X_selected = selector.fit_transform(X, y)

    clf = RandomForestClassifier(n_estimators=50, random_state=42)
    scores = cross_val_score(clf, X_selected, y, cv=5)

    print(f"Top {k:2d} features: Accuracy = {scores.mean():.4f} ± {scores.std():.4f}")

# Full features
clf = RandomForestClassifier(n_estimators=50, random_state=42)
scores_full = cross_val_score(clf, X, y, cv=5)
print(f"All {X.shape[1]:2d} features: Accuracy = {scores_full.mean():.4f} ± {scores_full.std():.4f}")
```

Handling redundancy:
Simple MI-based selection ranks features independently, which can select redundant features (multiple features carrying the same information). Advanced methods account for this:
mRMR (Minimum Redundancy Maximum Relevance): Maximize I(feature; target) while minimizing I(feature; already_selected_features)
JMI (Joint Mutual Information): Consider I(selected_features, new_feature; target)
CMIM (Conditional Mutual Information Maximization): Select features that provide additional information given already selected ones
```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import mutual_info_score

def mrmr_feature_selection(X, y, n_features=5):
    """
    Minimum Redundancy Maximum Relevance feature selection.

    Score = I(f; y) - (1/|S|) * Σ I(f; s) for s in S

    where S is the set of already selected features.
    """
    n_total = X.shape[1]

    # Compute relevance: I(feature; target)
    relevance = mutual_info_classif(X, y, random_state=42)

    # For redundancy, discretize continuous features
    def discretize(x, n_bins=10):
        return np.digitize(x, np.percentile(x, np.linspace(0, 100, n_bins)))

    X_discrete = np.apply_along_axis(discretize, 0, X)

    selected = []
    remaining = list(range(n_total))

    for i in range(n_features):
        if i == 0:
            # First feature: pure relevance
            scores = relevance[remaining]
        else:
            # Subsequent: relevance - avg(redundancy with selected)
            scores = []
            for f in remaining:
                rel = relevance[f]
                # Average MI with already selected features
                redundancy = np.mean([
                    mutual_info_score(X_discrete[:, f], X_discrete[:, s])
                    for s in selected
                ])
                scores.append(rel - redundancy)
            scores = np.array(scores)

        # Select feature with best score
        best_idx = np.argmax(scores)
        best_feature = remaining[best_idx]
        selected.append(best_feature)
        remaining.remove(best_feature)

        print(f"Step {i+1}: Selected F{best_feature} "
              f"(relevance={relevance[best_feature]:.4f})")

    return selected

# Apply mRMR (uses X, y from the previous example)
print("mRMR Feature Selection")
print("=" * 50)
selected_features = mrmr_feature_selection(X, y, n_features=5)
print(f"\nSelected features: {['F' + str(i) for i in selected_features]}")
```

MI estimation from finite samples can be noisy, especially with continuous variables or many categories. For reliable results:
• Use enough samples (thousands, not dozens)
• Consider binning strategies for continuous variables
• Use cross-validation to validate selected features
• Be wary of overfitting on feature selection itself
In practice, we rarely have access to true probability distributions—we have samples. Estimating mutual information from finite samples is surprisingly challenging, especially for continuous variables.
The challenge: we never observe the underlying distributions, only samples. Naive plug-in estimates are biased for finite samples, continuous variables require binning or density estimation, and the problem gets harder as dimensionality grows.
Several approaches have been developed:
```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def mi_histogram(x, y, bins=10):
    """
    Histogram-based MI estimation.
    Simple but sensitive to the choice of binning.
    """
    # Create 2D histogram
    hist_2d, x_edges, y_edges = np.histogram2d(x, y, bins=bins)

    # Normalize to probabilities
    pxy = hist_2d / hist_2d.sum()
    px = pxy.sum(axis=1)
    py = pxy.sum(axis=0)

    # Compute MI
    mi = 0
    for i in range(len(px)):
        for j in range(len(py)):
            if pxy[i, j] > 0 and px[i] > 0 and py[j] > 0:
                mi += pxy[i, j] * np.log(pxy[i, j] / (px[i] * py[j]))
    return mi  # in nats

def mi_ksg(x, y, k=3):
    """
    KSG estimator (using the sklearn implementation).
    More robust than histogram methods.
    """
    return mutual_info_regression(x.reshape(-1, 1), y, n_neighbors=k)[0]

# Generate test data with known relationship
np.random.seed(42)
n = 1000

# Linear relationship
x_linear = np.random.normal(0, 1, n)
y_linear = x_linear + np.random.normal(0, 0.5, n)

# Theoretical MI for jointly Gaussian: I(X;Y) = -0.5 * log(1 - ρ²)
rho = np.corrcoef(x_linear, y_linear)[0, 1]
mi_theoretical = -0.5 * np.log(1 - rho**2)

print("MI Estimation for Linear Gaussian (Y = X + noise)")
print("=" * 60)
print(f"Theoretical MI (Gaussian formula): {mi_theoretical:.4f} nats")
print(f"Correlation: {rho:.4f}")
print()

# Compare estimators
print(f"{'Method':<30} {'Estimate':<12} {'Error':<12}")
print("-" * 60)

# Histogram with different bin counts
for bins in [5, 10, 20, 50]:
    est = mi_histogram(x_linear, y_linear, bins=bins)
    err = est - mi_theoretical
    print(f"Histogram (bins={bins}){'':<15} {est:<12.4f} {err:+.4f}")

# KSG with different k
for k in [3, 5, 10]:
    est = mi_ksg(x_linear, y_linear, k=k)
    err = est - mi_theoretical
    print(f"KSG (k={k}){'':<20} {est:<12.4f} {err:+.4f}")

print()
print("Note: KSG is generally more accurate and robust than histogram methods.")
```

Estimating MI in high dimensions is notoriously difficult due to the curse of dimensionality. Methods like MINE and InfoNCE provide lower bounds rather than exact estimates. These bounds are sufficient for optimization (training) but may not accurately reflect the true MI value. For analysis purposes, dimensionality reduction before MI estimation is often necessary.
Mutual information has become central to understanding and training deep neural networks. Here are key applications:
```python
import numpy as np
import torch
import torch.nn.functional as F

def info_nce_loss(query, positive_key, negative_keys, temperature=0.07):
    """
    InfoNCE loss for contrastive learning.

    This gives a lower bound on I(query; key):
        I(Q; K) >= log(N) - L_NCE

    Args:
        query: Query representations (batch_size, dim)
        positive_key: Positive key for each query (batch_size, dim)
        negative_keys: Negative keys (num_negatives, dim) or (batch_size, num_neg, dim)
        temperature: Softmax temperature (lower = sharper)

    Returns:
        InfoNCE loss (lower is better for training)
    """
    batch_size = query.size(0)

    # Normalize representations
    query = F.normalize(query, dim=1)
    positive_key = F.normalize(positive_key, dim=1)

    # Positive logits: q · k+
    positive_logits = torch.sum(query * positive_key, dim=1, keepdim=True)
    positive_logits = positive_logits / temperature

    # Negative logits
    if negative_keys.dim() == 2:
        # Shared negatives across the batch
        negative_keys = F.normalize(negative_keys, dim=1)
        negative_logits = query @ negative_keys.T / temperature
    else:
        # Per-sample negatives
        negative_keys = F.normalize(negative_keys, dim=2)
        negative_logits = torch.bmm(query.unsqueeze(1),
                                    negative_keys.transpose(1, 2)).squeeze(1)
        negative_logits = negative_logits / temperature

    # Concatenate: [positive, negatives]
    logits = torch.cat([positive_logits, negative_logits], dim=1)

    # Labels: the positive is always at index 0
    labels = torch.zeros(batch_size, dtype=torch.long, device=query.device)

    # Cross-entropy loss
    loss = F.cross_entropy(logits, labels)
    return loss

# Example usage
batch_size = 32
dim = 128
num_negatives = 256

# Simulated representations
query = torch.randn(batch_size, dim)
positive_key = query + torch.randn(batch_size, dim) * 0.1  # Similar
negative_keys = torch.randn(num_negatives, dim)            # Random

loss = info_nce_loss(query, positive_key, negative_keys)
print(f"InfoNCE Loss: {loss.item():.4f}")

# Lower bound on MI
mi_lower_bound = np.log(num_negatives + 1) - loss.item()
print(f"MI Lower Bound: {mi_lower_bound:.4f} nats")

# With more negatives
for n_neg in [16, 64, 256, 1024]:
    neg_keys = torch.randn(n_neg, dim)
    loss = info_nce_loss(query, positive_key, neg_keys)
    mi_bound = np.log(n_neg + 1) - loss.item()
    print(f"N={n_neg:4d}: Loss={loss.item():.4f}, MI bound={mi_bound:.4f}")
```

The Information Bottleneck View:
A deep neural network can be viewed through the lens of information theory: the input X flows through a sequence of internal representations T (one per layer) on the way to a prediction of the target Y, forming a Markov chain X → T → Ŷ.
The network must preserve information relevant to Y while discarding irrelevant details. This is precisely the Information Bottleneck objective:
minimize: I(X; T) − β · I(T; Y)
Find the representation T that maximally compresses X while retaining information about Y.
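Exact MI terms are intractable for a deep encoder, so in practice they are replaced by variational bounds. Below is a minimal sketch in the spirit of a variational information bottleneck classifier (the layer sizes, the standard-normal prior, and the β value are illustrative assumptions, not a prescribed recipe): the KL term upper-bounds I(X; T), and the cross-entropy term corresponds to maximizing a lower bound on I(T; Y). Note that β here weights the compression term, which matches the objective above up to a rescaling of β.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VIBClassifier(nn.Module):
    """Minimal variational information bottleneck sketch (illustrative sizes)."""

    def __init__(self, in_dim=20, bottleneck_dim=8, n_classes=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU())
        self.to_mu = nn.Linear(64, bottleneck_dim)
        self.to_logvar = nn.Linear(64, bottleneck_dim)
        self.classifier = nn.Linear(bottleneck_dim, n_classes)

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Sample the stochastic representation T ~ q(t|x) (reparameterization trick)
        t = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.classifier(t), mu, logvar

def vib_loss(logits, y, mu, logvar, beta=1e-3):
    # Cross-entropy: variational surrogate for maximizing I(T; Y)
    ce = F.cross_entropy(logits, y)
    # KL(q(t|x) || N(0, I)): variational upper bound on I(X; T)
    kl = 0.5 * torch.sum(mu**2 + logvar.exp() - logvar - 1, dim=1).mean()
    return ce + beta * kl

# Tiny usage example with random data (shapes are illustrative)
x = torch.randn(32, 20)
y = torch.randint(0, 2, (32,))
model = VIBClassifier()
logits, mu, logvar = model(x)
print(f"VIB loss: {vib_loss(logits, y, mu, logvar).item():.4f}")
```

Sweeping β traces out the compression/prediction trade-off: a larger β squeezes more information out of T, while a smaller β keeps T closer to a deterministic feature map.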
SimCLR and similar methods work by maximizing MI between different augmentations of the same image while minimizing MI between different images. The InfoNCE loss achieves this: numerator pulls positive pairs together (high MI), denominator pushes negatives apart (low MI). More negatives give tighter bounds but require more compute.
Sometimes we want to know the shared information between X and Y after accounting for a third variable Z. This is conditional mutual information:
I(X; Y | Z) = H(X | Z) − H(X | Y, Z)
This measures how much information Y provides about X beyond what Z already provides.
```python
# Conditional Mutual Information
# ==============================
#
# Definition:
#   I(X; Y | Z) = H(X | Z) - H(X | Y, Z)
#               = H(X, Z) + H(Y, Z) - H(Z) - H(X, Y, Z)
#               = E_Z[I(X; Y | Z=z)]
#
# Properties:
# 1. Chain rule: I(X; Y, Z) = I(X; Z) + I(X; Y | Z)
# 2. Non-negative: I(X; Y | Z) >= 0
# 3. Reduces to MI when Z is constant: I(X; Y | ∅) = I(X; Y)
#
# Key insight: I(X; Y | Z) can be LESS than I(X; Y).
# This happens when Z "explains" some of the dependence between X and Y.
#
# Example: Confounding
#   X = Ice cream sales
#   Y = Drowning deaths
#   Z = Temperature
#   I(X; Y) > 0       (correlated!)
#   I(X; Y | Z) ≈ 0   (independent given temperature)

import numpy as np
from collections import defaultdict
from sklearn.metrics import mutual_info_score

def conditional_mi_discrete(x, y, z):
    """Compute I(X; Y | Z) in nats from discrete samples."""
    n = len(x)

    # Count joint occurrences
    xyz_counts = defaultdict(int)
    xz_counts = defaultdict(int)
    yz_counts = defaultdict(int)
    z_counts = defaultdict(int)

    for i in range(n):
        xyz_counts[(x[i], y[i], z[i])] += 1
        xz_counts[(x[i], z[i])] += 1
        yz_counts[(y[i], z[i])] += 1
        z_counts[z[i]] += 1

    # Convert to probabilities and compute CMI
    cmi = 0
    for (xi, yi, zi), count in xyz_counts.items():
        p_xyz = count / n
        p_xz = xz_counts[(xi, zi)] / n
        p_yz = yz_counts[(yi, zi)] / n
        p_z = z_counts[zi] / n

        if p_xyz > 0 and p_xz > 0 and p_yz > 0 and p_z > 0:
            # I(X;Y|Z) = Σ p(x,y,z) log[p(x,y,z)p(z) / (p(x,z)p(y,z))]
            cmi += p_xyz * np.log((p_xyz * p_z) / (p_xz * p_yz))  # natural log -> nats

    return cmi

# Example: Confounding
np.random.seed(42)
n = 5000

# Z causes both X and Y
z = np.random.randint(0, 3, n)  # Low, Medium, High temperature

# X depends on Z (ice cream sales)
x = z + np.random.binomial(1, 0.3, n)

# Y depends on Z (drownings - more swimming in hot weather)
y = z + np.random.binomial(1, 0.2, n)

# Compute MIs
mi_xy = mutual_info_score(x, y)
cmi_xy_z = conditional_mi_discrete(x, y, z)

print("Confounding Example: Ice Cream Sales (X) vs Drownings (Y)")
print("=" * 60)
print(f"I(X; Y) = {mi_xy:.4f} nats  <- Appears correlated!")
print(f"I(X; Y | Z=Temperature) = {cmi_xy_z:.4f} nats  <- Much smaller after conditioning!")
print()
print("Conditioning on temperature 'explains away' the spurious correlation.")
```

Conditional MI is central to causal inference. If X and Y are conditionally independent given Z (I(X; Y | Z) = 0), then Z "screens off" the dependence. This is the basis for identifying confounders and understanding causal structure. Conditional independence testing uses CMI as the test statistic.
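As a rough illustration of that last point, here is a minimal permutation-test sketch (it reuses conditional_mi_discrete and the x, y, z samples from the example above; the permutation count is an arbitrary choice). Shuffling Y within each stratum of Z preserves P(Y|Z) but destroys any dependence between X and Y beyond what Z explains, which gives a null distribution for the CMI statistic:

```python
def cmi_permutation_test(x, y, z, n_permutations=200, seed=0):
    """Permutation test for H0: X independent of Y given Z, using CMI as the statistic."""
    rng = np.random.default_rng(seed)
    observed = conditional_mi_discrete(x, y, z)

    null_stats = []
    for _ in range(n_permutations):
        y_perm = np.array(y, copy=True)
        # Shuffle Y only within each value of Z
        for z_val in np.unique(z):
            idx = np.where(z == z_val)[0]
            y_perm[idx] = y_perm[rng.permutation(idx)]
        null_stats.append(conditional_mi_discrete(x, y_perm, z))

    # Fraction of permuted CMIs at least as large as the observed one
    p_value = (1 + np.sum(np.array(null_stats) >= observed)) / (1 + n_permutations)
    return observed, p_value

cmi_obs, p_val = cmi_permutation_test(x, y, z)
print(f"Observed I(X; Y | Z) = {cmi_obs:.4f} nats, permutation p-value = {p_val:.3f}")
```

A large p-value means the observed CMI is consistent with conditional independence; a small one suggests Y still carries information about X beyond what Z provides.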
Mutual information is the definitive measure of statistical dependence, with deep connections throughout ML. Let's consolidate the key ideas:
• I(X; Y) = H(X) − H(X|Y) = H(X) + H(Y) − H(X, Y) = D_KL(P(X,Y) || P(X)P(Y)): symmetric, non-negative, and zero exactly when X and Y are independent.
• Unlike correlation, MI captures nonlinear dependencies, which makes it a principled criterion for feature selection (with mRMR-style corrections for redundancy).
• Estimating MI from finite samples is hard; histogram, KSG, and variational bounds (MINE, InfoNCE) trade off accuracy against scalability.
• In deep learning, MI underlies contrastive objectives and the information bottleneck view of representations, and conditional MI links dependence to causal structure.
What's next:
We've covered the core concepts of information theory: entropy, cross-entropy, KL divergence, and mutual information. The final page synthesizes these concepts, showing how information theory provides a unified lens for understanding machine learning—from loss functions to generative models to neural network analysis.
You now understand mutual information as the canonical measure of statistical dependence, can apply it to feature selection, appreciate its role in modern deep learning through contrastive methods, and understand the challenges of estimation from finite samples.