In 1948, Claude Shannon published "A Mathematical Theory of Communication," a paper that would fundamentally transform how we understand information itself. Shannon's insight was revolutionary: information could be precisely quantified, measured in bits, and optimized just like physical quantities such as energy or mass.
At the heart of Shannon's framework lies a deceptively simple concept: entropy. While the name was borrowed from thermodynamics (at the suggestion of John von Neumann, who quipped that "no one really knows what entropy is, so in any argument you will always have the advantage"), Shannon's entropy is a fundamentally new idea—a measure of uncertainty, surprise, and information content inherent in a probability distribution.
For machine learning practitioners, entropy is not merely a theoretical curiosity. It appears everywhere:
• Cross-entropy loss, the standard training objective for classification models
• Information gain, the splitting criterion in decision trees
• Entropy regularization, which encourages exploration in reinforcement learning
• Maximum-entropy reasoning behind Gaussian assumptions and the entropy terms in variational objectives
Understanding entropy deeply—not just as a formula, but as a conceptual tool—unlocks intuition about why certain ML methods work and when they might fail.
By the end of this page, you will understand entropy as a rigorous measure of uncertainty, derive its mathematical properties from first principles, connect it to optimal coding theory, and appreciate its central role in machine learning. You'll be equipped to reason about information content in any probabilistic setting.
Before diving into mathematics, let's build intuition. Consider two scenarios:
Scenario A: A Biased Coin
You have a coin that lands heads 99% of the time. Someone flips it and asks you to guess the result. You confidently say "heads"—and you're almost always right. When the result is revealed, you're rarely surprised. The outcome was predictable.

Scenario B: A Fair Coin
Now consider a perfectly fair coin (50% heads, 50% tails). Each flip is genuinely uncertain. When the result is revealed, you experience maximum surprise for a binary outcome—there's no way to predict it better than random chance.
The key insight: Entropy quantifies the average surprise you experience when observing outcomes from a probability distribution. High entropy means high average surprise (unpredictability). Low entropy means low average surprise (predictability).
But what exactly do we mean by "surprise"?
Shannon defined self-information (or "surprisal") of an event with probability p as: I(p) = -log₂(p). A rare event (small p) has high self-information—it's very surprising. A certain event (p = 1) has zero self-information—no surprise at all. Entropy is simply the expected value of self-information across all possible outcomes.
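To make the surprisal scale concrete, here is a minimal sketch (the probabilities are arbitrary) computing I(p) = −log₂(p) for a few events:

```python
import numpy as np

def self_information(p):
    """Surprisal I(p) = -log2(p), in bits."""
    return -np.log2(p)

# A certain event carries no surprise; rarer events carry more
for p in [1.0, 0.5, 0.25, 0.01]:
    print(f"I({p}) = {self_information(p):.4f} bits")
# Prints 0, 1, 2, and ≈6.64 bits respectively
```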
Why logarithms?
Shannon's choice of the logarithm wasn't arbitrary. It emerges naturally from three fundamental requirements that any measure of information should satisfy:
• Monotonicity: rarer events carry more information, so I(p) should decrease as p increases.
• Zero for certainty: an event with probability 1 carries no information, so I(1) = 0.
• Additivity: information from independent events should add, so I(p · q) = I(p) + I(q).
It can be proven mathematically that the only function satisfying all three properties is the logarithm. This isn't a design choice—it's a mathematical necessity.
The connection to bits:
When we use log base 2, we measure information in bits. A bit is the amount of information needed to specify one of two equally likely alternatives. The fair coin flip contains exactly 1 bit of information. The biased 99/1 coin contains only about 0.08 bits—far less information, because the outcome is mostly predetermined.
With intuition established, we now present the formal definition. For a discrete random variable X with possible outcomes {x₁, x₂, ..., xₙ} and probability mass function P(X = xᵢ) = pᵢ, the Shannon entropy H(X) is defined as:
```python
# Shannon Entropy for Discrete Random Variables
# ================================================
#
# H(X) = -∑ᵢ P(X = xᵢ) · log₂(P(X = xᵢ))
#      = -∑ᵢ pᵢ · log₂(pᵢ)
#      = E[-log₂(P(X))]

# In Python:
import numpy as np

def entropy(probs):
    """
    Compute Shannon entropy in bits.

    Args:
        probs: Array of probabilities (must sum to 1)

    Returns:
        Entropy H(X) in bits
    """
    # Filter out zero probabilities (0 * log(0) is defined as 0)
    probs = np.array(probs)
    probs = probs[probs > 0]
    # H(X) = -Σ p_i * log_2(p_i)
    return -np.sum(probs * np.log2(probs))

# Examples
fair_coin = [0.5, 0.5]
biased_coin = [0.99, 0.01]
fair_die = [1/6] * 6

print(f"Fair coin entropy:   {entropy(fair_coin):.4f} bits")    # 1.0000 bits
print(f"Biased coin entropy: {entropy(biased_coin):.4f} bits")  # 0.0808 bits
print(f"Fair die entropy:    {entropy(fair_die):.4f} bits")     # 2.5850 bits
```

Understanding the components:
• −log₂(pᵢ): This is the self-information or surprisal of outcome xᵢ. When pᵢ is small (rare event), −log₂(pᵢ) is large. When pᵢ = 1 (certain event), −log₂(1) = 0.
• pᵢ · (−log₂(pᵢ)): The surprisal weighted by probability. We weight each outcome's surprise by how often it occurs.
• Σᵢ: Summing over all outcomes gives the expected surprisal—the average surprise across the entire distribution.
The negative sign: We use −log₂(p) because log₂(p) is negative for 0 < p < 1. The negative sign makes entropy a positive quantity.
Edge case (p = 0): Mathematically, 0 · log(0) is undefined, but by L'Hôpital's rule, lim_{p→0⁺} p · log(p) = 0. We define 0 · log(0) = 0, meaning impossible events contribute zero to entropy.
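A quick numerical check of this convention (a minimal sketch) shows p · log₂(p) vanishing as p → 0:

```python
import numpy as np

# p * log2(p) approaches 0 as p approaches 0 from above,
# which justifies defining 0 * log(0) = 0 in the entropy sum.
for p in [0.1, 0.01, 0.001, 1e-6, 1e-12]:
    print(f"p = {p:<8g}  p*log2(p) = {p * np.log2(p):.6f}")
```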
The unit of entropy depends on the logarithm base:
• Base 2 → bits (most common in information theory and ML)
• Base e → nats (natural units, common in physics)
• Base 10 → hartleys (rarely used)
Conversion: H_nats = H_bits × ln(2) ≈ 0.693 × H_bits. In machine learning, cross-entropy loss typically uses natural logarithms (nats), while information-theoretic analyses often use bits.
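The conversion is easy to verify in code. A small sketch (the distribution is an arbitrary example) comparing scipy's entropy in different bases:

```python
import numpy as np
from scipy.stats import entropy as scipy_entropy

p = [0.5, 0.25, 0.125, 0.125]   # arbitrary example distribution

H_bits = scipy_entropy(p, base=2)   # base 2 -> bits
H_nats = scipy_entropy(p)           # natural log -> nats

print(f"H = {H_bits:.4f} bits")
print(f"H = {H_nats:.4f} nats")
print(f"bits * ln(2) = {H_bits * np.log(2):.4f}  (matches the nats value)")
```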
Entropy satisfies several elegant properties that make it the canonical measure of uncertainty. Understanding these properties provides deep insight into why entropy appears throughout machine learning.
```python
import numpy as np
import matplotlib.pyplot as plt

def entropy(probs):
    probs = np.array(probs)
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

# Property: Non-negativity, and zero entropy for deterministic distributions
print("Property: Non-negativity and Zero for Deterministic")
print(f"H([1.0, 0.0]) = {entropy([1.0, 0.0]):.4f}")  # 0.0000
print(f"H([0.9, 0.1]) = {entropy([0.9, 0.1]):.4f}")  # 0.4690
print()

# Property: Maximum for the uniform distribution
print("Property: Maximum for Uniform Distribution")
for n in [2, 4, 8, 16]:
    uniform = [1/n] * n
    max_theoretical = np.log2(n)
    actual = entropy(uniform)
    print(f"n={n:2d}: H(uniform) = {actual:.4f}, log₂(n) = {max_theoretical:.4f}")
print()

# Property: Additivity for independent variables
print("Property: Additivity for Independent Variables")
px = [0.3, 0.7]
py = [0.4, 0.6]
# Joint distribution of independent X, Y
joint = [px[i] * py[j] for i in range(2) for j in range(2)]
print(f"H(X) = {entropy(px):.4f}")
print(f"H(Y) = {entropy(py):.4f}")
print(f"H(X) + H(Y) = {entropy(px) + entropy(py):.4f}")
print(f"H(X, Y) computed from joint = {entropy(joint):.4f}")
print()

# Visualization: the binary entropy function
# H(p) = -p*log(p) - (1-p)*log(1-p), maximum at p = 0.5 with value 1 bit
p_values = np.linspace(0.001, 0.999, 500)
h_binary = [-p * np.log2(p) - (1-p) * np.log2(1-p) for p in p_values]

plt.plot(p_values, h_binary)
plt.xlabel("p")
plt.ylabel("H(p) in bits")
plt.title("Binary Entropy Function")
plt.show()

print("Binary Entropy Function H(p):")
print(f"H(0.5)  = {entropy([0.5, 0.5]):.4f} bits (maximum)")
print(f"H(0.9)  = {entropy([0.9, 0.1]):.4f} bits")
print(f"H(0.99) = {entropy([0.99, 0.01]):.4f} bits")
```

The Binary Entropy Function:
For a Bernoulli random variable with P(X = 1) = p and P(X = 0) = 1 - p, the entropy has a special name: the binary entropy function, denoted H(p) or H_b(p):
H(p) = −p · log₂(p) − (1−p) · log₂(1−p)
This function is:
• Symmetric: H(p) = H(1 − p)
• Maximized at p = 0.5, where it equals exactly 1 bit
• Zero at p = 0 and p = 1 (the outcome is certain)
• Concave in p
The binary entropy function appears constantly in machine learning, particularly in binary classification problems.
Entropy measures uncertainty, not complexity or value. The passwords "aaaaaaaa" and "x7#mK9pL" could both be drawn from distributions with identical entropy, yet one is far less secure in practice: entropy describes the distribution an outcome is drawn from, not the individual outcome. Similarly, purely random data and highly structured data with the same marginal distribution have the same marginal entropy. Entropy is about unpredictability, not usefulness.
Shannon's most profound insight was connecting entropy to data compression. The entropy H(X) has a beautiful operational interpretation: it represents the minimum average number of bits needed to encode samples from distribution X.
Shannon's Source Coding Theorem:
For a discrete source X with entropy H(X):
• No lossless code can achieve an average length below H(X) bits per symbol.
• Codes exist whose average length comes within 1 bit of H(X) per symbol (Huffman codes achieve this), and encoding long blocks of symbols pushes the average arbitrarily close to H(X).
This is remarkable. Entropy isn't just an abstract measure—it's a fundamental limit on compression.
Why this matters for ML:
The connection between entropy and coding explains why cross-entropy loss is the natural choice for classification: the cross-entropy H(P, Q) is the average number of bits needed to encode samples from the true distribution P using a code optimized for the model distribution Q, and it is smallest (equal to H(P)) exactly when Q = P.
Training to minimize cross-entropy is therefore equivalent to making our model Q as close as possible to the true distribution P in terms of coding efficiency.
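A minimal sketch of this coding view (the distributions P and Q below are made up for illustration) computes H(P), the cross-entropy H(P, Q), and the extra bits paid for encoding with the wrong model:

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(P) in bits."""
    p = np.asarray(p)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def cross_entropy(p, q):
    """Cross-entropy H(P, Q) in bits: average code length when data
    drawn from P is encoded with a code optimized for Q."""
    p, q = np.asarray(p), np.asarray(q)
    mask = p > 0
    return -np.sum(p[mask] * np.log2(q[mask]))

# Hypothetical true distribution P and a mismatched (uniform) model Q
P = np.array([0.5, 0.25, 0.125, 0.125])
Q = np.array([0.25, 0.25, 0.25, 0.25])

print(f"H(P)    = {entropy(P):.4f} bits")           # 1.7500
print(f"H(P, Q) = {cross_entropy(P, Q):.4f} bits")  # 2.0000
print(f"Extra bits from using Q's code: {cross_entropy(P, Q) - entropy(P):.4f}")

# When the model matches the truth, no bits are wasted
print(f"H(P, P) = {cross_entropy(P, P):.4f} bits")  # equals H(P)
```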
```python
import numpy as np
from collections import Counter

def entropy(probs):
    probs = np.array(probs)
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

# Example: Huffman-like coding for a biased source
symbols = ['A', 'B', 'C', 'D']
probs = [0.5, 0.25, 0.125, 0.125]

# Calculate theoretical entropy
H = entropy(probs)
print(f"Entropy H(X) = {H:.4f} bits")

# Fixed-length code (2 bits each)
fixed_length_avg = 2.0
print(f"Fixed-length code: {fixed_length_avg:.4f} bits/symbol")

# Huffman code lengths: A=1, B=2, C=3, D=3
huffman_lengths = [1, 2, 3, 3]
huffman_avg = sum(p * l for p, l in zip(probs, huffman_lengths))
print(f"Huffman code: {huffman_avg:.4f} bits/symbol")

# Efficiency
print(f"\nCoding efficiency:")
print(f"  Fixed-length: {H/fixed_length_avg * 100:.1f}%")
print(f"  Huffman:      {H/huffman_avg * 100:.1f}%")

# Simulate encoding a message
message = ''.join(np.random.choice(symbols, size=10000, p=probs))
actual_freq = Counter(message)
print(f"\nSimulated 10,000 symbols:")
for s, p in zip(symbols, probs):
    empirical = actual_freq[s] / 10000
    print(f"  {s}: expected {p:.3f}, observed {empirical:.3f}")

# With Huffman, total bits needed
total_fixed = 10000 * 2
total_huffman = sum(actual_freq[s] * l for s, l in zip(symbols, huffman_lengths))
print(f"\nTotal bits for 10,000 symbols:")
print(f"  Fixed-length: {total_fixed:,} bits")
print(f"  Huffman:      {total_huffman:,} bits")
print(f"  Savings:      {(1 - total_huffman/total_fixed) * 100:.1f}%")
```

So far, we've discussed entropy for discrete random variables. What about continuous distributions like Gaussians? The natural generalization is differential entropy, denoted h(X) (lowercase h to distinguish from discrete entropy H(X)):
```python
# Differential Entropy for Continuous Random Variables
# =====================================================
#
# h(X) = -∫ p(x) · log p(x) dx
#
# where p(x) is the probability density function (PDF)
#
# Key Examples:
#
# 1. Uniform Distribution on [a, b]
#    h(Uniform[a, b]) = log(b - a)
#
# 2. Gaussian Distribution N(μ, σ²)
#    h(Gaussian) = (1/2) · log(2πeσ²)
#                = (1/2) + (1/2)·log(2πσ²)
#                ≈ 0.5 + 0.5·log(2π) + log(σ)
#                ≈ 1.42 + log(σ)        [in nats]
#
# 3. Exponential Distribution with rate λ
#    h(Exponential) = 1 - log(λ)
#
# 4. Multivariate Gaussian N(μ, Σ) in d dimensions
#    h(MVN) = (d/2)·log(2πe) + (1/2)·log|Σ|
#           = (d/2)·(1 + log(2π)) + (1/2)·log|Σ|
```

Differential entropy differs from discrete entropy in important ways:
• It can be negative (for example, a uniform distribution on a very narrow interval).
• It is not invariant under a change of variables: rescaling X → aX shifts h by log|a|.
• It is not simply the limit of discrete entropy: the discrete entropy of a quantized continuous variable diverges as the bin width shrinks to zero.
Despite these differences, differential entropy retains many useful properties and is essential for analyzing continuous ML models.
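As a quick illustration of the change-of-variables point, here is a minimal sketch using scipy (the scale factor 2 is an arbitrary choice): scaling a standard Gaussian by 2 raises its differential entropy by log 2.

```python
import numpy as np
from scipy import stats

a = 2.0  # arbitrary scale factor for illustration

h_X = stats.norm(0, 1).entropy()   # h(X) for X ~ N(0, 1), in nats
h_aX = stats.norm(0, a).entropy()  # scaling N(0, 1) by a gives N(0, a²)

print(f"h(X)       = {h_X:.4f} nats")
print(f"h(2X)      = {h_aX:.4f} nats")
print(f"difference = {h_aX - h_X:.4f} nats (should equal ln 2 ≈ {np.log(2):.4f})")
```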
Why the Gaussian has maximum entropy:
Among all continuous distributions with fixed mean μ and variance σ², the Gaussian distribution uniquely maximizes differential entropy. This is a profound result: given only a mean and a variance, the Gaussian is the distribution that assumes nothing beyond those two constraints.
This maximum entropy property makes Gaussians special. When we use Gaussian assumptions in ML, we're not being lazy—we're being maximally non-committal given our partial knowledge.
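To see the maximum-entropy property numerically, the following sketch (the comparison distributions are chosen here purely for illustration) compares the differential entropies of a Gaussian, a uniform, and a Laplace distribution, each with mean 0 and variance 1; the Gaussian comes out highest.

```python
import numpy as np
from scipy import stats

# All three distributions have mean 0 and variance 1
gaussian = stats.norm(0, 1)                          # variance = 1
uniform = stats.uniform(-np.sqrt(3), 2 * np.sqrt(3)) # Uniform[-√3, √3], variance = 12/12 = 1
laplace = stats.laplace(0, 1 / np.sqrt(2))           # variance = 2b² = 1

print(f"Gaussian: h = {gaussian.entropy():.4f} nats")
print(f"Uniform:  h = {uniform.entropy():.4f} nats")
print(f"Laplace:  h = {laplace.entropy():.4f} nats")
# The Gaussian's entropy (≈ 1.4189 nats) exceeds the others,
# consistent with it being the maximum-entropy distribution for fixed variance.
```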
```python
import numpy as np
from scipy import stats

def differential_entropy_gaussian(sigma):
    """Differential entropy of N(0, σ²) in nats."""
    return 0.5 * np.log(2 * np.pi * np.e * sigma**2)

def differential_entropy_uniform(a, b):
    """Differential entropy of Uniform[a, b] in nats."""
    return np.log(b - a)

def differential_entropy_exponential(lam):
    """Differential entropy of Exp(λ) in nats."""
    return 1 - np.log(lam)

# Examples
print("Differential Entropy Examples (in nats):")
print("=" * 50)

# Gaussian distributions with different variances
for sigma in [0.5, 1.0, 2.0, 5.0]:
    h = differential_entropy_gaussian(sigma)
    print(f"N(0, {sigma}²): h = {h:.4f} nats")
print()

# Uniform distributions
for a, b in [(0, 1), (0, 0.1), (0, 10)]:
    h = differential_entropy_uniform(a, b)
    print(f"Uniform[{a}, {b}]: h = {h:.4f} nats")
print()

# Note: Uniform[0, 0.1] has NEGATIVE entropy!
print("Note: Uniform[0, 0.1] has negative entropy because")
print("the distribution is 'concentrated' in a small interval.")
print()

# Verify using scipy
print("Verification with scipy.stats:")
print(f"N(0, 1) entropy:       {stats.norm(0, 1).entropy():.4f} nats")
print(f"Uniform[0, 1] entropy: {stats.uniform(0, 1).entropy():.4f} nats")
print(f"Exp(1) entropy:        {stats.expon(scale=1).entropy():.4f} nats")
```

When dealing with multiple random variables, we need to understand how uncertainty combines and how knowing one variable affects uncertainty about another.
Joint Entropy H(X, Y) measures the total uncertainty in the pair (X, Y):
H(X, Y) = −ΣᵢΣⱼ P(X = xᵢ, Y = yⱼ) · log P(X = xᵢ, Y = yⱼ)
Conditional Entropy H(Y|X) measures the remaining uncertainty in Y after observing X:
H(Y|X) = Σᵢ P(X = xᵢ) · H(Y|X = xᵢ) = −ΣᵢΣⱼ P(X = xᵢ, Y = yⱼ) · log P(Y = yⱼ|X = xᵢ)
These quantities satisfy the elegant chain rule for entropy:
```python
# Chain Rule for Entropy
# ======================
#
# H(X, Y) = H(X) + H(Y|X)    # Joint = Marginal + Conditional
#         = H(Y) + H(X|Y)    # Symmetric form
#
# Interpretation:
# Total uncertainty in (X, Y) = Uncertainty in X + Remaining uncertainty in Y given X
#
# For multiple variables:
# H(X₁, X₂, ..., Xₙ) = H(X₁) + H(X₂|X₁) + H(X₃|X₁,X₂) + ... + H(Xₙ|X₁,...,Xₙ₋₁)
#
# Key Properties:
# 1. Conditioning reduces entropy: H(Y|X) ≤ H(Y)
#    "Knowing X can only reduce (or maintain) uncertainty about Y"
#
# 2. Equality when independent: H(Y|X) = H(Y) iff X and Y are independent
#
# 3. H(X, Y) ≤ H(X) + H(Y), with equality iff independent
```

Conditioning reduces entropy (on average):
H(Y|X) ≤ H(Y)
This inequality is often summarized as "information never hurts." Learning X can only reduce (never increase) our uncertainty about Y on average. Equality holds if and only if X and Y are independent, in which case learning X tells us nothing about Y.
This has profound implications for machine learning: on average, observing an informative feature X can only reduce our uncertainty about a target Y, and the size of the reduction, H(Y) − H(Y|X), is exactly the mutual information that drives feature selection and decision-tree splitting. The worked example below makes this concrete with a small joint distribution over weather and umbrella use.
```python
import numpy as np

def entropy(probs):
    probs = np.array(probs).flatten()
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

# Joint distribution P(Weather, Umbrella)
# Rows: Weather (Sunny, Rainy)
# Cols: Umbrella (No, Yes)
joint = np.array([
    [0.63, 0.07],  # Sunny
    [0.03, 0.27]   # Rainy
])

# Marginal distributions
p_weather = joint.sum(axis=1)   # Sum over umbrella
p_umbrella = joint.sum(axis=0)  # Sum over weather

print("Marginal Distributions:")
print(f"  P(Weather)  = {p_weather}")   # [0.7, 0.3]
print(f"  P(Umbrella) = {p_umbrella}")  # [0.66, 0.34]
print()

# Individual entropies
H_W = entropy(p_weather)
H_U = entropy(p_umbrella)
H_joint = entropy(joint)

print("Entropies:")
print(f"  H(Weather) = {H_W:.4f} bits")
print(f"  H(Umbrella) = {H_U:.4f} bits")
print(f"  H(Weather, Umbrella) = {H_joint:.4f} bits")
print()

# Conditional entropies (using chain rule)
H_U_given_W = H_joint - H_W
H_W_given_U = H_joint - H_U

print("Conditional Entropies:")
print(f"  H(Umbrella | Weather) = {H_U_given_W:.4f} bits")
print(f"  H(Weather | Umbrella) = {H_W_given_U:.4f} bits")
print()

# Verify chain rule
print("Chain Rule Verification:")
print(f"  H(W) + H(U|W) = {H_W + H_U_given_W:.4f}")
print(f"  H(U) + H(W|U) = {H_U + H_W_given_U:.4f}")
print(f"  H(W, U)       = {H_joint:.4f}")
print()

# Information gained about U by knowing W
info_gain = H_U - H_U_given_W
print(f"Information gain about Umbrella from Weather: {info_gain:.4f} bits")
print(f"This is the Mutual Information I(W; U)")
```

Entropy permeates machine learning. Here we examine its most important applications, building intuition for why this measure of uncertainty is so central to learning algorithms.
Decision trees choose splits that maximize information gain, the reduction in label entropy produced by the split:

```python
import numpy as np

def entropy(labels):
    """Compute entropy of label array."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / len(labels)
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

def information_gain(y, splits):
    """
    Compute information gain for a split.

    Args:
        y: Original labels
        splits: List of label arrays after split
    """
    H_before = entropy(y)
    n = len(y)
    # Weighted average entropy after split
    H_after = sum(
        len(s)/n * entropy(s)
        for s in splits
    )
    return H_before - H_after

# Example: Binary classification
y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
print(f"Initial H(Y): {entropy(y):.4f}")

# Split A: Perfect separation
split_a = [[0,0,0,0], [1,1,1,1,1,1]]
ig_a = information_gain(y, split_a)
print(f"Perfect split IG: {ig_a:.4f}")

# Split B: Imperfect separation
split_b = [[0,0,1], [0,0,1,1,1,1,1]]
ig_b = information_gain(y, split_b)
print(f"Imperfect split IG: {ig_b:.4f}")
```
In reinforcement learning, adding an entropy bonus to the objective keeps the policy stochastic and encourages exploration:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Softmax with temperature."""
    exp_logits = np.exp(
        (logits - np.max(logits)) / temperature
    )
    return exp_logits / exp_logits.sum()

def policy_entropy(probs):
    """Entropy of action distribution (in nats)."""
    probs = probs[probs > 0]
    return -np.sum(probs * np.log(probs))

# RL objective with entropy regularization
# J(π) = E[Σ r_t] + β * H(π(a|s))
beta = 0.01  # Entropy coefficient

# Compare greedy vs entropy-regularized
logits = np.array([10.0, 1.0, 0.1, 0.1])

# Greedy policy (low entropy)
greedy = softmax(logits, temperature=0.1)
print(f"Greedy policy: {greedy.round(3)}")
print(f"  Entropy: {policy_entropy(greedy):.4f}")

# High-temp policy (high entropy)
exploratory = softmax(logits, temperature=2.0)
print(f"\nExploratory: {exploratory.round(3)}")
print(f"  Entropy: {policy_entropy(exploratory):.4f}")

# Maximum entropy (uniform)
uniform = np.ones(4) / 4
print(f"\nUniform: {uniform}")
print(f"  Entropy: {policy_entropy(uniform):.4f}")
```

In many ML contexts, entropy acts as a regularizer. High entropy in policy distributions prevents overconfident decisions. High entropy in generative models encourages diverse outputs. The entropy term in variational free energy prevents overfitting to narrow modes. This pattern—using entropy to keep distributions "spread out"—recurs throughout probabilistic ML.
We've established entropy as the foundational concept of information theory and a cornerstone of machine learning. Let's consolidate the key insights:
• Entropy H(X) = −Σᵢ pᵢ · log pᵢ is the expected surprisal of a distribution, measured in bits (base 2) or nats (base e).
• It is zero for deterministic outcomes, maximal for uniform distributions, and additive across independent variables.
• Operationally, H(X) is the minimum average number of bits needed to losslessly encode samples from X (Shannon's source coding theorem).
• For continuous variables, differential entropy h(X) plays the analogous role, with the Gaussian maximizing entropy among distributions of fixed variance.
• Joint and conditional entropy obey the chain rule H(X, Y) = H(X) + H(Y|X), and conditioning never increases entropy on average.
• In ML, entropy underlies cross-entropy loss, information gain in decision trees, and entropy regularization of policies and generative models.
What's next:
With entropy established, we're ready to explore cross-entropy—the measure that compares two distributions and forms the basis of nearly every classification loss function in machine learning. We'll see how cross-entropy naturally emerges when using one distribution to encode data from another, and why minimizing cross-entropy loss is equivalent to maximum likelihood estimation.
You now understand Shannon entropy as a rigorous measure of uncertainty with deep connections to optimal coding and machine learning. This foundation prepares you for cross-entropy, KL divergence, and the information-theoretic principles that underpin modern ML systems.