In 1948, Claude Shannon published "A Mathematical Theory of Communication," a paper that would fundamentally transform how we understand information itself. Shannon's insight was revolutionary: information could be precisely quantified, measured in bits, and optimized just like physical quantities such as energy or mass.
At the heart of Shannon's framework lies a deceptively simple concept: entropy. While the name was borrowed from thermodynamics (at the suggestion of John von Neumann, who quipped that "no one really knows what entropy is, so in any argument you will always have the advantage"), Shannon's entropy is a fundamentally new idea—a measure of uncertainty, surprise, and information content inherent in a probability distribution.
For machine learning practitioners, entropy is not merely a theoretical curiosity. It appears everywhere:
• Cross-entropy loss, the standard training objective for classification models
• Information gain, the splitting criterion in decision trees
• Entropy regularization, which encourages exploration in reinforcement learning
• Maximum-entropy reasoning behind Gaussian assumptions and the entropy terms in variational objectives
Understanding entropy deeply—not just as a formula, but as a conceptual tool—unlocks intuition about why certain ML methods work and when they might fail.
By the end of this page, you will understand entropy as a rigorous measure of uncertainty, derive its mathematical properties from first principles, connect it to optimal coding theory, and appreciate its central role in machine learning. You'll be equipped to reason about information content in any probabilistic setting.
Before diving into mathematics, let's build intuition. Consider two scenarios:
Scenario A: A Biased Coin
You have a coin that lands heads 99% of the time. Someone flips it and asks you to guess the result. You confidently say "heads"—and you're almost always right. When the result is revealed, you're rarely surprised. The outcome was predictable.

Scenario B: A Fair Coin
Now consider a perfectly fair coin (50% heads, 50% tails). Each flip is genuinely uncertain. When the result is revealed, you experience maximum surprise for a binary outcome—there's no way to predict it better than random chance.
The key insight: Entropy quantifies the average surprise you experience when observing outcomes from a probability distribution. High entropy means high average surprise (unpredictability). Low entropy means low average surprise (predictability).
But what exactly do we mean by "surprise"?
Shannon defined self-information (or "surprisal") of an event with probability p as: I(p) = -log₂(p). A rare event (small p) has high self-information—it's very surprising. A certain event (p = 1) has zero self-information—no surprise at all. Entropy is simply the expected value of self-information across all possible outcomes.
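To make the surprisal scale concrete, here is a minimal sketch (the probabilities are arbitrary) computing I(p) = −log₂(p) for a few events:

```python
import numpy as np

def self_information(p):
    """Surprisal I(p) = -log2(p), in bits."""
    return -np.log2(p)

# A certain event carries no surprise; rarer events carry more
for p in [1.0, 0.5, 0.25, 0.01]:
    print(f"I({p}) = {self_information(p):.4f} bits")
# Prints 0, 1, 2, and ≈6.64 bits respectively
```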
Why logarithms?
Shannon's choice of the logarithm wasn't arbitrary. It emerges naturally from three fundamental requirements that any measure of information should satisfy:
• Monotonicity: rarer events carry more information, so I(p) should decrease as p increases.
• Zero for certainty: an event with probability 1 carries no information, so I(1) = 0.
• Additivity: information from independent events should add, so I(p · q) = I(p) + I(q).
It can be proven mathematically that the only function satisfying all three properties is the logarithm. This isn't a design choice—it's a mathematical necessity.
The connection to bits:
When we use log base 2, we measure information in bits. A bit is the amount of information needed to specify one of two equally likely alternatives. The fair coin flip contains exactly 1 bit of information. The biased 99/1 coin contains only about 0.08 bits—far less information, because the outcome is mostly predetermined.
With intuition established, we now present the formal definition. For a discrete random variable X with possible outcomes {x₁, x₂, ..., xₙ} and probability mass function P(X = xᵢ) = pᵢ, the Shannon entropy H(X) is defined as:
```python
# Shannon Entropy for Discrete Random Variables
# ================================================
#
# H(X) = -∑ᵢ P(X = xᵢ) · log₂(P(X = xᵢ))
#      = -∑ᵢ pᵢ · log₂(pᵢ)
#      = E[-log₂(P(X))]

# In Python:
import numpy as np

def entropy(probs):
    """
    Compute Shannon entropy in bits.

    Args:
        probs: Array of probabilities (must sum to 1)

    Returns:
        Entropy H(X) in bits
    """
    # Filter out zero probabilities (0 * log(0) is defined as 0)
    probs = np.array(probs)
    probs = probs[probs > 0]
    # H(X) = -Σ p_i * log_2(p_i)
    return -np.sum(probs * np.log2(probs))

# Examples
fair_coin = [0.5, 0.5]
biased_coin = [0.99, 0.01]
fair_die = [1/6] * 6

print(f"Fair coin entropy:   {entropy(fair_coin):.4f} bits")    # 1.0000 bits
print(f"Biased coin entropy: {entropy(biased_coin):.4f} bits")  # 0.0808 bits
print(f"Fair die entropy:    {entropy(fair_die):.4f} bits")     # 2.5850 bits
```

Understanding the components:
• −log₂(pᵢ): This is the self-information or surprisal of outcome xᵢ. When pᵢ is small (rare event), −log₂(pᵢ) is large. When pᵢ = 1 (certain event), −log₂(1) = 0.
• pᵢ · (−log₂(pᵢ)): The surprisal weighted by probability. We weight each outcome's surprise by how often it occurs.
• Σᵢ: Summing over all outcomes gives the expected surprisal—the average surprise across the entire distribution.
The negative sign: We use −log₂(p) because log₂(p) is negative for 0 < p < 1. The negative sign makes entropy a positive quantity.
Edge case (p = 0): Mathematically, 0 · log(0) is undefined, but by L'Hôpital's rule, lim_{p→0⁺} p · log(p) = 0. We define 0 · log(0) = 0, meaning impossible events contribute zero to entropy.
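A quick numerical check of this convention (a minimal sketch) shows p · log₂(p) vanishing as p → 0:

```python
import numpy as np

# p * log2(p) approaches 0 as p approaches 0 from above,
# which justifies defining 0 * log(0) = 0 in the entropy sum.
for p in [0.1, 0.01, 0.001, 1e-6, 1e-12]:
    print(f"p = {p:<8g}  p*log2(p) = {p * np.log2(p):.6f}")
```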
The unit of entropy depends on the logarithm base:
• Base 2 → bits (most common in information theory and ML)
• Base e → nats (natural units, common in physics)
• Base 10 → hartleys (rarely used)
Conversion: H_nats = H_bits × ln(2) ≈ 0.693 × H_bits. In machine learning, cross-entropy loss typically uses natural logarithms (nats), while information-theoretic analyses often use bits.
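The conversion is easy to verify in code. A small sketch (the distribution is an arbitrary example) comparing scipy's entropy in different bases:

```python
import numpy as np
from scipy.stats import entropy as scipy_entropy

p = [0.5, 0.25, 0.125, 0.125]   # arbitrary example distribution

H_bits = scipy_entropy(p, base=2)   # base 2 -> bits
H_nats = scipy_entropy(p)           # natural log -> nats

print(f"H = {H_bits:.4f} bits")
print(f"H = {H_nats:.4f} nats")
print(f"bits * ln(2) = {H_bits * np.log(2):.4f}  (matches the nats value)")
```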
Entropy satisfies several elegant properties that make it the canonical measure of uncertainty. Understanding these properties provides deep insight into why entropy appears throughout machine learning.
```python
import numpy as np
import matplotlib.pyplot as plt

def entropy(probs):
    probs = np.array(probs)
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

# Property: Non-negativity, and zero entropy for deterministic distributions
print("Property: Non-negativity and Zero for Deterministic")
print(f"H([1.0, 0.0]) = {entropy([1.0, 0.0]):.4f}")  # 0.0000
print(f"H([0.9, 0.1]) = {entropy([0.9, 0.1]):.4f}")  # 0.4690
print()

# Property: Maximum for the uniform distribution
print("Property: Maximum for Uniform Distribution")
for n in [2, 4, 8, 16]:
    uniform = [1/n] * n
    max_theoretical = np.log2(n)
    actual = entropy(uniform)
    print(f"n={n:2d}: H(uniform) = {actual:.4f}, log₂(n) = {max_theoretical:.4f}")
print()

# Property: Additivity for independent variables
print("Property: Additivity for Independent Variables")
px = [0.3, 0.7]
py = [0.4, 0.6]
# Joint distribution of independent X, Y
joint = [px[i] * py[j] for i in range(2) for j in range(2)]
print(f"H(X) = {entropy(px):.4f}")
print(f"H(Y) = {entropy(py):.4f}")
print(f"H(X) + H(Y) = {entropy(px) + entropy(py):.4f}")
print(f"H(X, Y) computed from joint = {entropy(joint):.4f}")
print()

# Visualization: the binary entropy function
# H(p) = -p*log(p) - (1-p)*log(1-p), maximum at p = 0.5 with value 1 bit
p_values = np.linspace(0.001, 0.999, 500)
h_binary = [-p * np.log2(p) - (1-p) * np.log2(1-p) for p in p_values]

plt.plot(p_values, h_binary)
plt.xlabel("p")
plt.ylabel("H(p) in bits")
plt.title("Binary Entropy Function")
plt.show()

print("Binary Entropy Function H(p):")
print(f"H(0.5)  = {entropy([0.5, 0.5]):.4f} bits (maximum)")
print(f"H(0.9)  = {entropy([0.9, 0.1]):.4f} bits")
print(f"H(0.99) = {entropy([0.99, 0.01]):.4f} bits")
```

The Binary Entropy Function:
For a Bernoulli random variable with P(X = 1) = p and P(X = 0) = 1 - p, the entropy has a special name: the binary entropy function, denoted H(p) or H_b(p):
H(p) = −p · log₂(p) − (1−p) · log₂(1−p)
This function is:
• Symmetric: H(p) = H(1 − p)
• Maximized at p = 0.5, where it equals exactly 1 bit
• Zero at p = 0 and p = 1 (the outcome is certain)
• Concave in p
The binary entropy function appears constantly in machine learning, particularly in binary classification problems.
Entropy measures uncertainty, not complexity or value. The passwords "aaaaaaaa" and "x7#mK9pL" could both be drawn from distributions with identical entropy, yet one is far less secure in practice: entropy describes the distribution an outcome is drawn from, not the individual outcome. Similarly, purely random data and highly structured data with the same marginal distribution have the same marginal entropy. Entropy is about unpredictability, not usefulness.
Shannon's most profound insight was connecting entropy to data compression. The entropy H(X) has a beautiful operational interpretation: it represents the minimum average number of bits needed to encode samples from distribution X.
Shannon's Source Coding Theorem:
For a discrete source X with entropy H(X):
• No lossless code can achieve an average length below H(X) bits per symbol.
• Codes exist whose average length comes within 1 bit of H(X) per symbol (Huffman codes achieve this), and encoding long blocks of symbols pushes the average arbitrarily close to H(X).
This is remarkable. Entropy isn't just an abstract measure—it's a fundamental limit on compression.
Why this matters for ML:
The connection between entropy and coding explains why cross-entropy loss is the natural choice for classification: the cross-entropy H(P, Q) is the average number of bits needed to encode samples from the true distribution P using a code optimized for the model distribution Q, and it is smallest (equal to H(P)) exactly when Q = P.
Training to minimize cross-entropy is therefore equivalent to making our model Q as close as possible to the true distribution P in terms of coding efficiency.
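A minimal sketch of this coding view (the distributions P and Q below are made up for illustration) computes H(P), the cross-entropy H(P, Q), and the extra bits paid for encoding with the wrong model:

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(P) in bits."""
    p = np.asarray(p)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def cross_entropy(p, q):
    """Cross-entropy H(P, Q) in bits: average code length when data
    drawn from P is encoded with a code optimized for Q."""
    p, q = np.asarray(p), np.asarray(q)
    mask = p > 0
    return -np.sum(p[mask] * np.log2(q[mask]))

# Hypothetical true distribution P and a mismatched (uniform) model Q
P = np.array([0.5, 0.25, 0.125, 0.125])
Q = np.array([0.25, 0.25, 0.25, 0.25])

print(f"H(P)    = {entropy(P):.4f} bits")           # 1.7500
print(f"H(P, Q) = {cross_entropy(P, Q):.4f} bits")  # 2.0000
print(f"Extra bits from using Q's code: {cross_entropy(P, Q) - entropy(P):.4f}")

# When the model matches the truth, no bits are wasted
print(f"H(P, P) = {cross_entropy(P, P):.4f} bits")  # equals H(P)
```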
```python
import numpy as np
from collections import Counter

def entropy(probs):
    probs = np.array(probs)
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

# Example: Huffman-like coding for a biased source
symbols = ['A', 'B', 'C', 'D']
probs = [0.5, 0.25, 0.125, 0.125]

# Calculate theoretical entropy
H = entropy(probs)
print(f"Entropy H(X) = {H:.4f} bits")

# Fixed-length code (2 bits each)
fixed_length_avg = 2.0
print(f"Fixed-length code: {fixed_length_avg:.4f} bits/symbol")

# Huffman code lengths: A=1, B=2, C=3, D=3
huffman_lengths = [1, 2, 3, 3]
huffman_avg = sum(p * l for p, l in zip(probs, huffman_lengths))
print(f"Huffman code: {huffman_avg:.4f} bits/symbol")

# Efficiency
print(f"\nCoding efficiency:")
print(f"  Fixed-length: {H/fixed_length_avg * 100:.1f}%")
print(f"  Huffman:      {H/huffman_avg * 100:.1f}%")

# Simulate encoding a message
message = ''.join(np.random.choice(symbols, size=10000, p=probs))
actual_freq = Counter(message)
print(f"\nSimulated 10,000 symbols:")
for s, p in zip(symbols, probs):
    empirical = actual_freq[s] / 10000
    print(f"  {s}: expected {p:.3f}, observed {empirical:.3f}")

# With Huffman, total bits needed
total_fixed = 10000 * 2
total_huffman = sum(actual_freq[s] * l for s, l in zip(symbols, huffman_lengths))
print(f"\nTotal bits for 10,000 symbols:")
print(f"  Fixed-length: {total_fixed:,} bits")
print(f"  Huffman:      {total_huffman:,} bits")
print(f"  Savings:      {(1 - total_huffman/total_fixed) * 100:.1f}%")
```

So far, we've discussed entropy for discrete random variables. What about continuous distributions like Gaussians? The natural generalization is differential entropy, denoted h(X) (lowercase h to distinguish from discrete entropy H(X)):
```python
# Differential Entropy for Continuous Random Variables
# =====================================================
#
# h(X) = -∫ p(x) · log p(x) dx
#
# where p(x) is the probability density function (PDF)
#
# Key Examples:
#
# 1. Uniform Distribution on [a, b]
#    h(Uniform[a, b]) = log(b - a)
#
# 2. Gaussian Distribution N(μ, σ²)
#    h(Gaussian) = (1/2) · log(2πeσ²)
#                = (1/2) + (1/2)·log(2πσ²)
#                ≈ 0.5 + 0.5·log(2π) + log(σ)
#                ≈ 1.42 + log(σ)        [in nats]
#
# 3. Exponential Distribution with rate λ
#    h(Exponential) = 1 - log(λ)
#
# 4. Multivariate Gaussian N(μ, Σ) in d dimensions
#    h(MVN) = (d/2)·log(2πe) + (1/2)·log|Σ|
#           = (d/2)·(1 + log(2π)) + (1/2)·log|Σ|
```

Differential entropy differs from discrete entropy in important ways:
• It can be negative (for example, a uniform distribution on a very narrow interval).
• It is not invariant under a change of variables: rescaling X → aX shifts h by log|a|.
• It is not simply the limit of discrete entropy: the discrete entropy of a quantized continuous variable diverges as the bin width shrinks to zero.
Despite these differences, differential entropy retains many useful properties and is essential for analyzing continuous ML models.
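As a quick illustration of the change-of-variables point, here is a minimal sketch using scipy (the scale factor 2 is an arbitrary choice): scaling a standard Gaussian by 2 raises its differential entropy by log 2.

```python
import numpy as np
from scipy import stats

a = 2.0  # arbitrary scale factor for illustration

h_X = stats.norm(0, 1).entropy()   # h(X) for X ~ N(0, 1), in nats
h_aX = stats.norm(0, a).entropy()  # scaling N(0, 1) by a gives N(0, a²)

print(f"h(X)       = {h_X:.4f} nats")
print(f"h(2X)      = {h_aX:.4f} nats")
print(f"difference = {h_aX - h_X:.4f} nats (should equal ln 2 ≈ {np.log(2):.4f})")
```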
Why the Gaussian has maximum entropy:
Among all continuous distributions with fixed mean μ and variance σ², the Gaussian distribution uniquely maximizes differential entropy. This is a profound result: given only a mean and a variance, the Gaussian is the distribution that assumes nothing beyond those two constraints.
This maximum entropy property makes Gaussians special. When we use Gaussian assumptions in ML, we're not being lazy—we're being maximally non-committal given our partial knowledge.
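To see the maximum-entropy property numerically, the following sketch (the comparison distributions are chosen here purely for illustration) compares the differential entropies of a Gaussian, a uniform, and a Laplace distribution, each with mean 0 and variance 1; the Gaussian comes out highest.

```python
import numpy as np
from scipy import stats

# All three distributions have mean 0 and variance 1
gaussian = stats.norm(0, 1)                          # variance = 1
uniform = stats.uniform(-np.sqrt(3), 2 * np.sqrt(3)) # Uniform[-√3, √3], variance = 12/12 = 1
laplace = stats.laplace(0, 1 / np.sqrt(2))           # variance = 2b² = 1

print(f"Gaussian: h = {gaussian.entropy():.4f} nats")
print(f"Uniform:  h = {uniform.entropy():.4f} nats")
print(f"Laplace:  h = {laplace.entropy():.4f} nats")
# The Gaussian's entropy (≈ 1.4189 nats) exceeds the others,
# consistent with it being the maximum-entropy distribution for fixed variance.
```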
```python
import numpy as np
from scipy import stats

def differential_entropy_gaussian(sigma):
    """Differential entropy of N(0, σ²) in nats."""
    return 0.5 * np.log(2 * np.pi * np.e * sigma**2)

def differential_entropy_uniform(a, b):
    """Differential entropy of Uniform[a, b] in nats."""
    return np.log(b - a)

def differential_entropy_exponential(lam):
    """Differential entropy of Exp(λ) in nats."""
    return 1 - np.log(lam)

# Examples
print("Differential Entropy Examples (in nats):")
print("=" * 50)

# Gaussian distributions with different variances
for sigma in [0.5, 1.0, 2.0, 5.0]:
    h = differential_entropy_gaussian(sigma)
    print(f"N(0, {sigma}²): h = {h:.4f} nats")
print()

# Uniform distributions
for a, b in [(0, 1), (0, 0.1), (0, 10)]:
    h = differential_entropy_uniform(a, b)
    print(f"Uniform[{a}, {b}]: h = {h:.4f} nats")
print()

# Note: Uniform[0, 0.1] has NEGATIVE entropy!
print("Note: Uniform[0, 0.1] has negative entropy because")
print("the distribution is 'concentrated' in a small interval.")
print()

# Verify using scipy
print("Verification with scipy.stats:")
print(f"N(0, 1) entropy:       {stats.norm(0, 1).entropy():.4f} nats")
print(f"Uniform[0, 1] entropy: {stats.uniform(0, 1).entropy():.4f} nats")
print(f"Exp(1) entropy:        {stats.expon(scale=1).entropy():.4f} nats")
```

When dealing with multiple random variables, we need to understand how uncertainty combines and how knowing one variable affects uncertainty about another.
Joint Entropy H(X, Y) measures the total uncertainty in the pair (X, Y):
H(X, Y) = −ΣᵢΣⱼ P(X = xᵢ, Y = yⱼ) · log P(X = xᵢ, Y = yⱼ)
Conditional Entropy H(Y|X) measures the remaining uncertainty in Y after observing X:
H(Y|X) = Σᵢ P(X = xᵢ) · H(Y|X = xᵢ) = −ΣᵢΣⱼ P(X = xᵢ, Y = yⱼ) · log P(Y = yⱼ|X = xᵢ)
These quantities satisfy the elegant chain rule for entropy:
```python
# Chain Rule for Entropy
# ======================
#
# H(X, Y) = H(X) + H(Y|X)    # Joint = Marginal + Conditional
#         = H(Y) + H(X|Y)    # Symmetric form
#
# Interpretation:
# Total uncertainty in (X, Y) = Uncertainty in X + Remaining uncertainty in Y given X
#
# For multiple variables:
# H(X₁, X₂, ..., Xₙ) = H(X₁) + H(X₂|X₁) + H(X₃|X₁,X₂) + ... + H(Xₙ|X₁,...,Xₙ₋₁)
#
# Key Properties:
# 1. Conditioning reduces entropy: H(Y|X) ≤ H(Y)
#    "Knowing X can only reduce (or maintain) uncertainty about Y"
#
# 2. Equality when independent: H(Y|X) = H(Y) iff X and Y are independent
#
# 3. H(X, Y) ≤ H(X) + H(Y), with equality iff independent
```

Conditioning reduces entropy (on average):
H(Y|X) ≤ H(Y)
This inequality is often summarized as "information never hurts." Learning X can only reduce (never increase) our uncertainty about Y on average. Equality holds if and only if X and Y are independent, in which case learning X tells us nothing about Y.
This has profound implications for machine learning: on average, observing an informative feature X can only reduce our uncertainty about a target Y, and the size of the reduction, H(Y) − H(Y|X), is exactly the mutual information that drives feature selection and decision-tree splitting. The worked example below makes this concrete with a small joint distribution over weather and umbrella use.
```python
import numpy as np

def entropy(probs):
    probs = np.array(probs).flatten()
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

# Joint distribution P(Weather, Umbrella)
# Rows: Weather (Sunny, Rainy)
# Cols: Umbrella (No, Yes)
joint = np.array([
    [0.63, 0.07],  # Sunny
    [0.03, 0.27]   # Rainy
])

# Marginal distributions
p_weather = joint.sum(axis=1)   # Sum over umbrella
p_umbrella = joint.sum(axis=0)  # Sum over weather

print("Marginal Distributions:")
print(f"  P(Weather)  = {p_weather}")   # [0.7, 0.3]
print(f"  P(Umbrella) = {p_umbrella}")  # [0.66, 0.34]
print()

# Individual entropies
H_W = entropy(p_weather)
H_U = entropy(p_umbrella)
H_joint = entropy(joint)

print("Entropies:")
print(f"  H(Weather) = {H_W:.4f} bits")
print(f"  H(Umbrella) = {H_U:.4f} bits")
print(f"  H(Weather, Umbrella) = {H_joint:.4f} bits")
print()

# Conditional entropies (using chain rule)
H_U_given_W = H_joint - H_W
H_W_given_U = H_joint - H_U

print("Conditional Entropies:")
print(f"  H(Umbrella | Weather) = {H_U_given_W:.4f} bits")
print(f"  H(Weather | Umbrella) = {H_W_given_U:.4f} bits")
print()

# Verify chain rule
print("Chain Rule Verification:")
print(f"  H(W) + H(U|W) = {H_W + H_U_given_W:.4f}")
print(f"  H(U) + H(W|U) = {H_U + H_W_given_U:.4f}")
print(f"  H(W, U)       = {H_joint:.4f}")
print()

# Information gained about U by knowing W
info_gain = H_U - H_U_given_W
print(f"Information gain about Umbrella from Weather: {info_gain:.4f} bits")
print(f"This is the Mutual Information I(W; U)")
```

Entropy permeates machine learning. Here we examine its most important applications, building intuition for why this measure of uncertainty is so central to learning algorithms.
Decision trees choose splits that maximize information gain, the reduction in label entropy produced by the split:

```python
import numpy as np

def entropy(labels):
    """Compute entropy of label array."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / len(labels)
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

def information_gain(y, splits):
    """
    Compute information gain for a split.

    Args:
        y: Original labels
        splits: List of label arrays after split
    """
    H_before = entropy(y)
    n = len(y)
    # Weighted average entropy after split
    H_after = sum(
        len(s)/n * entropy(s)
        for s in splits
    )
    return H_before - H_after

# Example: Binary classification
y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
print(f"Initial H(Y): {entropy(y):.4f}")

# Split A: Perfect separation
split_a = [[0,0,0,0], [1,1,1,1,1,1]]
ig_a = information_gain(y, split_a)
print(f"Perfect split IG: {ig_a:.4f}")

# Split B: Imperfect separation
split_b = [[0,0,1], [0,0,1,1,1,1,1]]
ig_b = information_gain(y, split_b)
print(f"Imperfect split IG: {ig_b:.4f}")
```
In reinforcement learning, adding an entropy bonus to the objective keeps the policy stochastic and encourages exploration:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Softmax with temperature."""
    exp_logits = np.exp(
        (logits - np.max(logits)) / temperature
    )
    return exp_logits / exp_logits.sum()

def policy_entropy(probs):
    """Entropy of action distribution (in nats)."""
    probs = probs[probs > 0]
    return -np.sum(probs * np.log(probs))

# RL objective with entropy regularization
# J(π) = E[Σ r_t] + β * H(π(a|s))
beta = 0.01  # Entropy coefficient

# Compare greedy vs entropy-regularized
logits = np.array([10.0, 1.0, 0.1, 0.1])

# Greedy policy (low entropy)
greedy = softmax(logits, temperature=0.1)
print(f"Greedy policy: {greedy.round(3)}")
print(f"  Entropy: {policy_entropy(greedy):.4f}")

# High-temp policy (high entropy)
exploratory = softmax(logits, temperature=2.0)
print(f"\nExploratory: {exploratory.round(3)}")
print(f"  Entropy: {policy_entropy(exploratory):.4f}")

# Maximum entropy (uniform)
uniform = np.ones(4) / 4
print(f"\nUniform: {uniform}")
print(f"  Entropy: {policy_entropy(uniform):.4f}")
```

In many ML contexts, entropy acts as a regularizer. High entropy in policy distributions prevents overconfident decisions. High entropy in generative models encourages diverse outputs. The entropy term in variational free energy prevents overfitting to narrow modes. This pattern—using entropy to keep distributions "spread out"—recurs throughout probabilistic ML.
We've established entropy as the foundational concept of information theory and a cornerstone of machine learning. Let's consolidate the key insights:
• Entropy H(X) = −Σᵢ pᵢ · log pᵢ is the expected surprisal of a distribution, measured in bits (base 2) or nats (base e).
• It is zero for deterministic outcomes, maximal for uniform distributions, and additive across independent variables.
• Operationally, H(X) is the minimum average number of bits needed to losslessly encode samples from X (Shannon's source coding theorem).
• For continuous variables, differential entropy h(X) plays the analogous role, with the Gaussian maximizing entropy among distributions of fixed variance.
• Joint and conditional entropy obey the chain rule H(X, Y) = H(X) + H(Y|X), and conditioning never increases entropy on average.
• In ML, entropy underlies cross-entropy loss, information gain in decision trees, and entropy regularization of policies and generative models.
What's next:
With entropy established, we're ready to explore cross-entropy—the measure that compares two distributions and forms the basis of nearly every classification loss function in machine learning. We'll see how cross-entropy naturally emerges when using one distribution to encode data from another, and why minimizing cross-entropy loss is equivalent to maximum likelihood estimation.
You now understand Shannon entropy as a rigorous measure of uncertainty with deep connections to optimal coding and machine learning. This foundation prepares you for cross-entropy, KL divergence, and the information-theoretic principles that underpin modern ML systems.