When building machine learning models, we constantly grapple with questions of relevance: Which features are most informative for predicting the target? How much does knowing X help us predict Y? What information do representations capture about the input?
These questions have a precise answer in information theory: mutual information. Denoted I(X; Y), mutual information quantifies the amount of information that knowing one variable provides about another. Unlike correlation, which only captures linear relationships, mutual information captures all statistical dependencies.
Mutual information is:
• symmetric: I(X; Y) = I(Y; X)
• non-negative: I(X; Y) ≥ 0
• zero if and only if X and Y are independent
This elegant measure connects directly to entropy and KL divergence:
I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X) = D_KL(P(X,Y) || P(X)P(Y))
The last form is particularly revealing: mutual information is the KL divergence between the joint distribution and the product of marginals—the "statistical distance" from independence.
By the end of this page, you will:
• understand mutual information's definition and multiple equivalent forms
• appreciate its relationship to entropy, conditional entropy, and KL divergence
• apply mutual information to feature selection and representation analysis
• tackle the challenge of estimating MI from samples
Mutual information can be defined in several equivalent ways, each offering different intuition:
```python
# Mutual Information: Equivalent Definitions
# ==========================================
#
# Definition 1: Reduction in uncertainty
# "How much does knowing Y reduce uncertainty about X?"
#   I(X; Y) = H(X) - H(X|Y)
#
# Definition 2: Symmetric form
# "How much does knowing either reduce uncertainty about the other?"
#   I(X; Y) = H(X) + H(Y) - H(X, Y)
#
# Definition 3: KL divergence from independence
# "How far is the joint from the product of marginals?"
#   I(X; Y) = D_KL(P(X,Y) || P(X)P(Y))
#           = Σ_x Σ_y P(x,y) log[P(x,y) / (P(x)P(y))]
#
# Definition 4: Expected pointwise mutual information
# "Average surprise about joint vs. independent co-occurrence"
#   I(X; Y) = E_{P(X,Y)}[log(P(X,Y) / (P(X)P(Y)))]
#
# All definitions are mathematically equivalent!

# Python implementation:
import numpy as np

def entropy(probs):
    """Compute entropy of a distribution in bits."""
    probs = np.array(probs).flatten()
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

def mutual_information(joint_probs):
    """
    Compute mutual information from a joint probability matrix.

    Args:
        joint_probs: 2D array where joint_probs[i, j] = P(X=i, Y=j)

    Returns:
        Mutual information I(X; Y) in bits, computed two equivalent ways.
    """
    joint = np.array(joint_probs)
    assert np.abs(joint.sum() - 1.0) < 1e-6, "Joint probs must sum to 1"

    # Marginal distributions
    p_x = joint.sum(axis=1)  # Sum over Y
    p_y = joint.sum(axis=0)  # Sum over X

    # Method 1: I(X;Y) = H(X) + H(Y) - H(X,Y)
    h_x = entropy(p_x)
    h_y = entropy(p_y)
    h_xy = entropy(joint)
    mi_method1 = h_x + h_y - h_xy

    # Method 2: I(X;Y) = Σ P(x,y) log[P(x,y) / (P(x)P(y))]
    mi_method2 = 0
    for i in range(joint.shape[0]):
        for j in range(joint.shape[1]):
            if joint[i, j] > 0:
                mi_method2 += joint[i, j] * np.log2(joint[i, j] / (p_x[i] * p_y[j]))

    return mi_method1, mi_method2

# Example: Weather and umbrella carrying
joint = np.array([
    [0.63, 0.07],  # Sunny: [no umbrella, umbrella]
    [0.03, 0.27]   # Rainy: [no umbrella, umbrella]
])

mi1, mi2 = mutual_information(joint)
print(f"Mutual Information (method 1): {mi1:.4f} bits")
print(f"Mutual Information (method 2): {mi2:.4f} bits")

# Verify: marginals
p_weather = joint.sum(axis=1)   # [0.7, 0.3]
p_umbrella = joint.sum(axis=0)  # [0.66, 0.34]
print(f"\nP(Weather): {p_weather}")
print(f"P(Umbrella): {p_umbrella}")
```

Understanding the Venn diagram:
The relationships between entropy, conditional entropy, joint entropy, and mutual information can be visualized as overlapping regions (a Venn-style information diagram):
```
┌─────────────────────────────────────┐
│               H(X, Y)               │
│  ┌─────────────┬─────────────┐      │
│  │   H(X|Y)    │   H(Y|X)    │      │
│  │             │             │      │
│  │      ┌──────┴──────┐      │      │
│  │      │   I(X; Y)   │      │      │
│  │      └──────┬──────┘      │      │
│  │             │             │      │
│  └─────────────┴─────────────┘      │
│       H(X)          H(Y)            │
└─────────────────────────────────────┘
```
From the Venn diagram structure:
• I(X; Y) = H(X) − H(X|Y) = "uncertainty reduction"
• I(X; Y) = H(X) + H(Y) − H(X, Y) = "redundancy"
• H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y) = "chain rule"
• 0 ≤ I(X; Y) ≤ min(H(X), H(Y)) = "bounded by marginals"
Mutual information has several elegant properties that make it the canonical measure of statistical dependence:
```python
import numpy as np

def mi_from_joint(joint):
    """Compute MI (in bits) from a joint probability matrix."""
    joint = np.array(joint)
    p_x = joint.sum(axis=1)
    p_y = joint.sum(axis=0)
    mi = 0
    for i in range(joint.shape[0]):
        for j in range(joint.shape[1]):
            if joint[i, j] > 0:
                mi += joint[i, j] * np.log2(joint[i, j] / (p_x[i] * p_y[j] + 1e-15))
    return mi

def entropy(probs):
    probs = np.array(probs)
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

# Property 1: Symmetry
print("Property: Symmetry")
joint = np.array([[0.4, 0.1], [0.1, 0.4]])
joint_transposed = joint.T
print(f"I(X; Y) = {mi_from_joint(joint):.4f}")
print(f"I(Y; X) = {mi_from_joint(joint_transposed):.4f}")
print()

# Property 2: Non-negativity
print("Property: Non-negativity (random tests)")
for _ in range(3):
    # Random joint distribution
    joint = np.random.dirichlet(np.ones(9)).reshape(3, 3)
    mi = mi_from_joint(joint)
    print(f"  I(X; Y) = {mi:.4f} >= 0: {mi >= -1e-10}")
print()

# Property 3: Zero for independence
print("Property: Zero for independent variables")
p_x = np.array([0.3, 0.4, 0.3])
p_y = np.array([0.5, 0.5])
joint_independent = np.outer(p_x, p_y)  # Product of marginals
print(f"Independent joint: I(X; Y) = {mi_from_joint(joint_independent):.6f} ≈ 0")
print()

# Property 4: Self-information
print("Property: I(X; X) = H(X)")
p = np.array([0.25, 0.25, 0.25, 0.25])
# Diagonal joint: P(X=x, X=x) = P(X=x), off-diagonal = 0
joint_self = np.diag(p)
i_xx = mi_from_joint(joint_self)
h_x = entropy(p)
print(f"I(X; X) = {i_xx:.4f}")
print(f"H(X) = {h_x:.4f}")
print()

# Property 5: Upper bound
print("Property: Upper bound")
joint = np.array([[0.5, 0.0], [0.0, 0.5]])  # Perfectly dependent: X determines Y
p_x = joint.sum(axis=1)
p_y = joint.sum(axis=0)
mi = mi_from_joint(joint)
print(f"I(X; Y) = {mi:.4f}")
print(f"min(H(X), H(Y)) = min({entropy(p_x):.4f}, {entropy(p_y):.4f}) = {min(entropy(p_x), entropy(p_y)):.4f}")
# Here the bound is attained: I(X; Y) = min(H(X), H(Y)) = 1 bit
```

One further property, the data processing inequality, deserves special attention. For any Markov chain X → Y → Z (meaning Z is conditionally independent of X given Y):
I(X; Z) ≤ I(X; Y)
This profound result says that processing (transforming) data can only destroy information, never create it. Every layer in a neural network can only lose information about the input (though it may make the remaining information more useful for the task).
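To see the inequality numerically, here is a minimal sketch (the chain construction and noise levels are illustrative assumptions, not part of any standard benchmark). It builds a Markov chain X → Y → Z by adding independent corruption at each step and compares plug-in MI estimates:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
n = 100_000

x = rng.integers(0, 4, n)              # source variable
y = np.where(rng.random(n) < 0.2,
             rng.integers(0, 4, n), x)  # noisy copy of X
z = np.where(rng.random(n) < 0.2,
             rng.integers(0, 4, n), y)  # noisy copy of Y (sees only Y, not X)

print(f"I(X; Y) ≈ {mutual_info_score(x, y):.4f} nats")
print(f"I(X; Z) ≈ {mutual_info_score(x, z):.4f} nats  (never exceeds I(X; Y))")
```

Each stage of processing can only degrade what Z retains about X, which is exactly what the estimates show.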
A common question: "Why use mutual information instead of correlation?" The answer reveals a fundamental difference in what each measure captures.
Correlation (Pearson's r) measures linear dependence: it can be exactly zero even when Y is a deterministic nonlinear function of X.
Mutual information measures any statistical dependence: it is zero if and only if X and Y are independent.
```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression
from sklearn.metrics import mutual_info_score

np.random.seed(42)
n = 5000

print("Comparing Correlation vs Mutual Information")
print("=" * 60)
print()

# Case 1: Linear relationship (both work)
x1 = np.random.normal(0, 1, n)
y1 = 2 * x1 + np.random.normal(0, 0.5, n)

corr1 = np.corrcoef(x1, y1)[0, 1]
mi1 = mutual_info_regression(x1.reshape(-1, 1), y1)[0]

print("Case 1: Linear relationship (Y = 2X + noise)")
print(f"  Correlation:        {corr1:.4f}")
print(f"  Mutual Information: {mi1:.4f} nats")
print()

# Case 2: Quadratic relationship (correlation fails!)
x2 = np.random.uniform(-3, 3, n)
y2 = x2**2 + np.random.normal(0, 0.5, n)

corr2 = np.corrcoef(x2, y2)[0, 1]
mi2 = mutual_info_regression(x2.reshape(-1, 1), y2)[0]

print("Case 2: Quadratic relationship (Y = X² + noise)")
print(f"  Correlation:        {corr2:.4f} (near zero!)")
print(f"  Mutual Information: {mi2:.4f} nats (high!)")
print()

# Case 3: Sinusoidal relationship
x3 = np.random.uniform(0, 4*np.pi, n)
y3 = np.sin(x3) + np.random.normal(0, 0.2, n)

corr3 = np.corrcoef(x3, y3)[0, 1]
mi3 = mutual_info_regression(x3.reshape(-1, 1), y3)[0]

print("Case 3: Sinusoidal relationship (Y = sin(X) + noise)")
print(f"  Correlation:        {corr3:.4f} (near zero!)")
print(f"  Mutual Information: {mi3:.4f} nats (high!)")
print()

# Case 4: XOR-like relationship (binary)
x4 = np.random.randint(0, 2, n)
y4 = np.random.randint(0, 2, n)
z4 = (x4 + y4) % 2  # XOR

corr4 = np.corrcoef(x4, z4)[0, 1]
mi4 = mutual_info_score(x4, z4)  # discrete MI

print("Case 4: XOR relationship (Z = X XOR Y)")
print(f"  Correlation(X, Z):  {corr4:.4f}")
print(f"  Mutual Information: {mi4:.4f} nats")
# Note: both are ≈ 0 here. Z depends on (X, Y) jointly, not on X alone, so even
# pairwise MI misses it; I(X, Y; Z) or the conditional I(X; Z | Y) would reveal it.
print()

# Case 5: Independence (both should be zero)
x5 = np.random.normal(0, 1, n)
y5 = np.random.normal(0, 1, n)

corr5 = np.corrcoef(x5, y5)[0, 1]
mi5 = mutual_info_regression(x5.reshape(-1, 1), y5)[0]

print("Case 5: Independence (X and Y unrelated)")
print(f"  Correlation:        {corr5:.4f}")
print(f"  Mutual Information: {mi5:.4f} nats (≈ 0)")

print()
print("Key insight: Correlation can be zero even with strong dependence!")
print("MI captures ALL dependencies, linear and nonlinear.")
```

| Aspect | Correlation (r) | Mutual Information (I) |
|---|---|---|
| Relationships detected | Linear only | Any statistical dependence |
| Range | [-1, 1] | [0, min(H(X), H(Y))] |
| Zero means | No linear relationship | Independence |
| For Y = X² | ≈ 0 | High |
| Symmetry | Symmetric | Symmetric |
| Units | Unitless | Bits or nats |
| Computation | O(n) | O(n log n) or harder |
Use correlation when:
• You expect linear relationships
• Computational efficiency matters
• You need a standardized effect size
Use mutual information when:
• Nonlinear relationships are expected
• You need true independence testing
• You are selecting features for complex models
One of the most practical applications of mutual information is feature selection: ranking and selecting features by how much information they provide about the target variable.
The intuition: a feature with high I(feature; target) tells us a lot about the target, while a feature with I(feature; target) ≈ 0 is uninformative and can be dropped. Ranking features by their MI with the target gives a simple, model-agnostic relevance score.
This is principled—we're directly measuring how much uncertainty about the target is reduced by knowing the feature.
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif, SelectKBest
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Create dataset with known structure
np.random.seed(42)
n_samples = 1000
n_informative = 5
n_redundant = 3
n_useless = 12

# Generate data (shuffle=False keeps the informative and redundant features in the
# leading columns, so the type labels printed below line up with the indices)
X, y = make_classification(
    n_samples=n_samples,
    n_features=n_informative + n_redundant + n_useless,
    n_informative=n_informative,
    n_redundant=n_redundant,
    n_clusters_per_class=2,
    flip_y=0.05,
    shuffle=False,
    random_state=42
)

feature_names = [f"F{i}" for i in range(X.shape[1])]

# Compute MI for each feature
mi_scores = mutual_info_classif(X, y, random_state=42)

# Rank features
ranking = np.argsort(mi_scores)[::-1]

print("Feature Ranking by Mutual Information")
print("=" * 50)
print(f"{'Rank':<6} {'Feature':<10} {'MI Score':<12} {'Type'}")
print("-" * 50)

for i, idx in enumerate(ranking):
    if idx < n_informative:
        ftype = "Informative"
    elif idx < n_informative + n_redundant:
        ftype = "Redundant"
    else:
        ftype = "Useless"
    print(f"{i+1:<6} {feature_names[idx]:<10} {mi_scores[idx]:<12.4f} {ftype}")

print()

# Select top features and compare performance
print("Model Performance Comparison")
print("-" * 50)

for k in [5, 10, 20]:
    selector = SelectKBest(mutual_info_classif, k=k)
    X_selected = selector.fit_transform(X, y)

    clf = RandomForestClassifier(n_estimators=50, random_state=42)
    scores = cross_val_score(clf, X_selected, y, cv=5)

    print(f"Top {k:2d} features: Accuracy = {scores.mean():.4f} ± {scores.std():.4f}")

# Full features
clf = RandomForestClassifier(n_estimators=50, random_state=42)
scores_full = cross_val_score(clf, X, y, cv=5)
print(f"All {X.shape[1]:2d} features: Accuracy = {scores_full.mean():.4f} ± {scores_full.std():.4f}")
```

Handling redundancy:
Simple MI-based selection ranks features independently, which can select redundant features (multiple features carrying the same information). Advanced methods account for this:
mRMR (Minimum Redundancy Maximum Relevance): Maximize I(feature; target) while minimizing I(feature; already_selected_features)
JMI (Joint Mutual Information): Consider I(selected_features, new_feature; target)
CMIM (Conditional Mutual Information Maximization): Select features that provide additional information given already selected ones
```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import mutual_info_score

def mrmr_feature_selection(X, y, n_features=5):
    """
    Minimum Redundancy Maximum Relevance feature selection.

    Score = I(f; y) - (1/|S|) * Σ I(f; s) for s in S

    where S is the set of already selected features.
    """
    n_total = X.shape[1]

    # Compute relevance: I(feature; target)
    relevance = mutual_info_classif(X, y, random_state=42)

    # For redundancy, discretize continuous features
    def discretize(x, n_bins=10):
        return np.digitize(x, np.percentile(x, np.linspace(0, 100, n_bins)))

    X_discrete = np.apply_along_axis(discretize, 0, X)

    selected = []
    remaining = list(range(n_total))

    for i in range(n_features):
        if i == 0:
            # First feature: pure relevance
            scores = relevance[remaining]
        else:
            # Subsequent: relevance - avg(redundancy with selected)
            scores = []
            for f in remaining:
                rel = relevance[f]
                # Average MI with already selected features
                redundancy = np.mean([
                    mutual_info_score(X_discrete[:, f], X_discrete[:, s])
                    for s in selected
                ])
                scores.append(rel - redundancy)
            scores = np.array(scores)

        # Select feature with best score
        best_idx = np.argmax(scores)
        best_feature = remaining[best_idx]
        selected.append(best_feature)
        remaining.remove(best_feature)

        print(f"Step {i+1}: Selected F{best_feature} "
              f"(relevance={relevance[best_feature]:.4f})")

    return selected

# Apply mRMR (uses X, y from the previous example)
print("mRMR Feature Selection")
print("=" * 50)
selected_features = mrmr_feature_selection(X, y, n_features=5)
print(f"\nSelected features: {['F' + str(i) for i in selected_features]}")
```

MI estimation from finite samples can be noisy, especially with continuous variables or many categories. For reliable results:
• Use enough samples (thousands, not dozens)
• Consider binning strategies for continuous variables
• Use cross-validation to validate selected features
• Be wary of overfitting on feature selection itself
In practice, we rarely have access to true probability distributions—we have samples. Estimating mutual information from finite samples is surprisingly challenging, especially for continuous variables.
The challenge: we never observe the underlying distributions, only samples. Naive plug-in estimates are biased for finite samples, continuous variables require binning or density estimation, and the problem gets harder as dimensionality grows.
Several approaches have been developed:
```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def mi_histogram(x, y, bins=10):
    """
    Histogram-based MI estimation.
    Simple but sensitive to the choice of binning.
    """
    # Create 2D histogram
    hist_2d, x_edges, y_edges = np.histogram2d(x, y, bins=bins)

    # Normalize to probabilities
    pxy = hist_2d / hist_2d.sum()
    px = pxy.sum(axis=1)
    py = pxy.sum(axis=0)

    # Compute MI
    mi = 0
    for i in range(len(px)):
        for j in range(len(py)):
            if pxy[i, j] > 0 and px[i] > 0 and py[j] > 0:
                mi += pxy[i, j] * np.log(pxy[i, j] / (px[i] * py[j]))
    return mi  # in nats

def mi_ksg(x, y, k=3):
    """
    KSG estimator (using the sklearn implementation).
    More robust than histogram methods.
    """
    return mutual_info_regression(x.reshape(-1, 1), y, n_neighbors=k)[0]

# Generate test data with known relationship
np.random.seed(42)
n = 1000

# Linear relationship
x_linear = np.random.normal(0, 1, n)
y_linear = x_linear + np.random.normal(0, 0.5, n)

# Theoretical MI for jointly Gaussian: I(X;Y) = -0.5 * log(1 - ρ²)
rho = np.corrcoef(x_linear, y_linear)[0, 1]
mi_theoretical = -0.5 * np.log(1 - rho**2)

print("MI Estimation for Linear Gaussian (Y = X + noise)")
print("=" * 60)
print(f"Theoretical MI (Gaussian formula): {mi_theoretical:.4f} nats")
print(f"Correlation: {rho:.4f}")
print()

# Compare estimators
print(f"{'Method':<30} {'Estimate':<12} {'Error':<12}")
print("-" * 60)

# Histogram with different bin counts
for bins in [5, 10, 20, 50]:
    est = mi_histogram(x_linear, y_linear, bins=bins)
    err = est - mi_theoretical
    print(f"Histogram (bins={bins}){'':<15} {est:<12.4f} {err:+.4f}")

# KSG with different k
for k in [3, 5, 10]:
    est = mi_ksg(x_linear, y_linear, k=k)
    err = est - mi_theoretical
    print(f"KSG (k={k}){'':<20} {est:<12.4f} {err:+.4f}")

print()
print("Note: KSG is generally more accurate and robust than histogram methods.")
```

Estimating MI in high dimensions is notoriously difficult due to the curse of dimensionality. Methods like MINE and InfoNCE provide lower bounds rather than exact estimates. These bounds are sufficient for optimization (training) but may not accurately reflect the true MI value. For analysis purposes, dimensionality reduction before MI estimation is often necessary.
Mutual information has become central to understanding and training deep neural networks. Here are key applications:
```python
import numpy as np
import torch
import torch.nn.functional as F

def info_nce_loss(query, positive_key, negative_keys, temperature=0.07):
    """
    InfoNCE loss for contrastive learning.

    This gives a lower bound on I(query; key):
        I(Q; K) >= log(N) - L_NCE

    Args:
        query: Query representations (batch_size, dim)
        positive_key: Positive key for each query (batch_size, dim)
        negative_keys: Negative keys (num_negatives, dim) or (batch_size, num_neg, dim)
        temperature: Softmax temperature (lower = sharper)

    Returns:
        InfoNCE loss (lower is better for training)
    """
    batch_size = query.size(0)

    # Normalize representations
    query = F.normalize(query, dim=1)
    positive_key = F.normalize(positive_key, dim=1)

    # Positive logits: q · k+
    positive_logits = torch.sum(query * positive_key, dim=1, keepdim=True)
    positive_logits = positive_logits / temperature

    # Negative logits
    if negative_keys.dim() == 2:
        # Shared negatives across the batch
        negative_keys = F.normalize(negative_keys, dim=1)
        negative_logits = query @ negative_keys.T / temperature
    else:
        # Per-sample negatives
        negative_keys = F.normalize(negative_keys, dim=2)
        negative_logits = torch.bmm(query.unsqueeze(1),
                                    negative_keys.transpose(1, 2)).squeeze(1)
        negative_logits = negative_logits / temperature

    # Concatenate: [positive, negatives]
    logits = torch.cat([positive_logits, negative_logits], dim=1)

    # Labels: the positive is always at index 0
    labels = torch.zeros(batch_size, dtype=torch.long, device=query.device)

    # Cross-entropy loss
    loss = F.cross_entropy(logits, labels)
    return loss

# Example usage
batch_size = 32
dim = 128
num_negatives = 256

# Simulated representations
query = torch.randn(batch_size, dim)
positive_key = query + torch.randn(batch_size, dim) * 0.1  # Similar
negative_keys = torch.randn(num_negatives, dim)            # Random

loss = info_nce_loss(query, positive_key, negative_keys)
print(f"InfoNCE Loss: {loss.item():.4f}")

# Lower bound on MI
mi_lower_bound = np.log(num_negatives + 1) - loss.item()
print(f"MI Lower Bound: {mi_lower_bound:.4f} nats")

# With more negatives
for n_neg in [16, 64, 256, 1024]:
    neg_keys = torch.randn(n_neg, dim)
    loss = info_nce_loss(query, positive_key, neg_keys)
    mi_bound = np.log(n_neg + 1) - loss.item()
    print(f"N={n_neg:4d}: Loss={loss.item():.4f}, MI bound={mi_bound:.4f}")
```

The Information Bottleneck View:
A deep neural network can be viewed through the lens of information theory: the input X flows through a sequence of internal representations T (one per layer) on the way to a prediction of the target Y, forming a Markov chain X → T → Ŷ.
The network must preserve information relevant to Y while discarding irrelevant details. This is precisely the Information Bottleneck objective:
minimize: I(X; T) − β · I(T; Y)
Find the representation T that maximally compresses X while retaining information about Y.
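Exact MI terms are intractable for a deep encoder, so in practice they are replaced by variational bounds. Below is a minimal sketch in the spirit of a variational information bottleneck classifier (the layer sizes, the standard-normal prior, and the β value are illustrative assumptions, not a prescribed recipe): the KL term upper-bounds I(X; T), and the cross-entropy term corresponds to maximizing a lower bound on I(T; Y). Note that β here weights the compression term, which matches the objective above up to a rescaling of β.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VIBClassifier(nn.Module):
    """Minimal variational information bottleneck sketch (illustrative sizes)."""

    def __init__(self, in_dim=20, bottleneck_dim=8, n_classes=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU())
        self.to_mu = nn.Linear(64, bottleneck_dim)
        self.to_logvar = nn.Linear(64, bottleneck_dim)
        self.classifier = nn.Linear(bottleneck_dim, n_classes)

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Sample the stochastic representation T ~ q(t|x) (reparameterization trick)
        t = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.classifier(t), mu, logvar

def vib_loss(logits, y, mu, logvar, beta=1e-3):
    # Cross-entropy: variational surrogate for maximizing I(T; Y)
    ce = F.cross_entropy(logits, y)
    # KL(q(t|x) || N(0, I)): variational upper bound on I(X; T)
    kl = 0.5 * torch.sum(mu**2 + logvar.exp() - logvar - 1, dim=1).mean()
    return ce + beta * kl

# Tiny usage example with random data (shapes are illustrative)
x = torch.randn(32, 20)
y = torch.randint(0, 2, (32,))
model = VIBClassifier()
logits, mu, logvar = model(x)
print(f"VIB loss: {vib_loss(logits, y, mu, logvar).item():.4f}")
```

Sweeping β traces out the compression/prediction trade-off: a larger β squeezes more information out of T, while a smaller β keeps T closer to a deterministic feature map.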
SimCLR and similar methods work by maximizing MI between different augmentations of the same image while minimizing MI between different images. The InfoNCE loss achieves this: numerator pulls positive pairs together (high MI), denominator pushes negatives apart (low MI). More negatives give tighter bounds but require more compute.
Sometimes we want to know the shared information between X and Y after accounting for a third variable Z. This is conditional mutual information:
I(X; Y | Z) = H(X | Z) − H(X | Y, Z)
This measures how much information Y provides about X beyond what Z already provides.
```python
# Conditional Mutual Information
# ==============================
#
# Definition:
#   I(X; Y | Z) = H(X | Z) - H(X | Y, Z)
#               = H(X, Z) + H(Y, Z) - H(Z) - H(X, Y, Z)
#               = E_Z[I(X; Y | Z=z)]
#
# Properties:
# 1. Chain rule: I(X; Y, Z) = I(X; Z) + I(X; Y | Z)
# 2. Non-negative: I(X; Y | Z) >= 0
# 3. Reduces to MI when Z is constant: I(X; Y | ∅) = I(X; Y)
#
# Key insight: I(X; Y | Z) can be LESS than I(X; Y).
# This happens when Z "explains" some of the dependence between X and Y.
#
# Example: Confounding
#   X = Ice cream sales
#   Y = Drowning deaths
#   Z = Temperature
#   I(X; Y) > 0       (correlated!)
#   I(X; Y | Z) ≈ 0   (independent given temperature)

import numpy as np
from collections import defaultdict
from sklearn.metrics import mutual_info_score

def conditional_mi_discrete(x, y, z):
    """Compute I(X; Y | Z) in nats from discrete samples."""
    n = len(x)

    # Count joint occurrences
    xyz_counts = defaultdict(int)
    xz_counts = defaultdict(int)
    yz_counts = defaultdict(int)
    z_counts = defaultdict(int)

    for i in range(n):
        xyz_counts[(x[i], y[i], z[i])] += 1
        xz_counts[(x[i], z[i])] += 1
        yz_counts[(y[i], z[i])] += 1
        z_counts[z[i]] += 1

    # Convert to probabilities and compute CMI
    cmi = 0
    for (xi, yi, zi), count in xyz_counts.items():
        p_xyz = count / n
        p_xz = xz_counts[(xi, zi)] / n
        p_yz = yz_counts[(yi, zi)] / n
        p_z = z_counts[zi] / n

        if p_xyz > 0 and p_xz > 0 and p_yz > 0 and p_z > 0:
            # I(X;Y|Z) = Σ p(x,y,z) log[p(x,y,z)p(z) / (p(x,z)p(y,z))]
            cmi += p_xyz * np.log((p_xyz * p_z) / (p_xz * p_yz))  # natural log -> nats

    return cmi

# Example: Confounding
np.random.seed(42)
n = 5000

# Z causes both X and Y
z = np.random.randint(0, 3, n)  # Low, Medium, High temperature

# X depends on Z (ice cream sales)
x = z + np.random.binomial(1, 0.3, n)

# Y depends on Z (drownings - more swimming in hot weather)
y = z + np.random.binomial(1, 0.2, n)

# Compute MIs
mi_xy = mutual_info_score(x, y)
cmi_xy_z = conditional_mi_discrete(x, y, z)

print("Confounding Example: Ice Cream Sales (X) vs Drownings (Y)")
print("=" * 60)
print(f"I(X; Y) = {mi_xy:.4f} nats  <- Appears correlated!")
print(f"I(X; Y | Z=Temperature) = {cmi_xy_z:.4f} nats  <- Much smaller after conditioning!")
print()
print("Conditioning on temperature 'explains away' the spurious correlation.")
```

Conditional MI is central to causal inference. If X and Y are conditionally independent given Z (I(X; Y | Z) = 0), then Z "screens off" the dependence. This is the basis for identifying confounders and understanding causal structure. Conditional independence testing uses CMI as the test statistic.
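As a rough illustration of that last point, here is a minimal permutation-test sketch (it reuses conditional_mi_discrete and the x, y, z samples from the example above; the permutation count is an arbitrary choice). Shuffling Y within each stratum of Z preserves P(Y|Z) but destroys any dependence between X and Y beyond what Z explains, which gives a null distribution for the CMI statistic:

```python
def cmi_permutation_test(x, y, z, n_permutations=200, seed=0):
    """Permutation test for H0: X independent of Y given Z, using CMI as the statistic."""
    rng = np.random.default_rng(seed)
    observed = conditional_mi_discrete(x, y, z)

    null_stats = []
    for _ in range(n_permutations):
        y_perm = np.array(y, copy=True)
        # Shuffle Y only within each value of Z
        for z_val in np.unique(z):
            idx = np.where(z == z_val)[0]
            y_perm[idx] = y_perm[rng.permutation(idx)]
        null_stats.append(conditional_mi_discrete(x, y_perm, z))

    # Fraction of permuted CMIs at least as large as the observed one
    p_value = (1 + np.sum(np.array(null_stats) >= observed)) / (1 + n_permutations)
    return observed, p_value

cmi_obs, p_val = cmi_permutation_test(x, y, z)
print(f"Observed I(X; Y | Z) = {cmi_obs:.4f} nats, permutation p-value = {p_val:.3f}")
```

A large p-value means the observed CMI is consistent with conditional independence; a small one suggests Y still carries information about X beyond what Z provides.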
Mutual information is the definitive measure of statistical dependence, with deep connections throughout ML. Let's consolidate the key ideas:
• I(X; Y) = H(X) − H(X|Y) = H(X) + H(Y) − H(X, Y) = D_KL(P(X,Y) || P(X)P(Y)): symmetric, non-negative, and zero exactly when X and Y are independent.
• Unlike correlation, MI captures nonlinear dependencies, which makes it a principled criterion for feature selection (with mRMR-style corrections for redundancy).
• Estimating MI from finite samples is hard; histogram, KSG, and variational bounds (MINE, InfoNCE) trade off accuracy against scalability.
• In deep learning, MI underlies contrastive objectives and the information bottleneck view of representations, and conditional MI links dependence to causal structure.
What's next:
We've covered the core concepts of information theory: entropy, cross-entropy, KL divergence, and mutual information. The final page synthesizes these concepts, showing how information theory provides a unified lens for understanding machine learning—from loss functions to generative models to neural network analysis.
You now understand mutual information as the canonical measure of statistical dependence, can apply it to feature selection, appreciate its role in modern deep learning through contrastive methods, and understand the challenges of estimation from finite samples.