We've seen that Huffman's algorithm repeatedly combines the two nodes with minimum frequency. The algorithm is simple and elegant—but simplicity doesn't guarantee correctness. Many plausible greedy strategies fail to produce optimal solutions.
So why does this particular greedy choice work? Why is it safe to commit to combining the two smallest frequencies without considering the consequences? How do we know there isn't some other tree structure that would give a better result?
This page answers these questions rigorously. We'll establish the greedy choice property for Huffman coding and prove that the algorithm's local decisions always lead to a globally optimal solution.
By the end of this page, you will understand why combining the two minimum-frequency nodes is always a safe choice, be able to articulate the greedy choice property in the context of Huffman coding, follow the exchange argument that proves optimality, and recognize the structural properties of optimal prefix codes.
Recall from our study of greedy algorithms that the greedy choice property states:
A locally optimal choice can always be incorporated into a globally optimal solution.
For Huffman coding, the precise statement is:
Huffman's Greedy Choice Property:
Let x and y be the two symbols with the lowest frequencies in alphabet Σ. There exists an optimal prefix code for Σ in which x and y are siblings at the maximum depth.
In other words: it is always safe to make x and y siblings with the longest codes. We don't need to consider alternatives—some optimal solution necessarily has this structure.
A greedy choice is 'safe' when making it doesn't eliminate the possibility of finding an optimal solution. If we can prove that some optimal solution includes our greedy choice, then making that choice cannot lead us astray.
Intuition: Why Maximum Depth for Minimum Frequency?
The quantity to minimize is the frequency-weighted code length: $$\text{Cost} = \sum_i f_i \cdot l_i$$
Where $f_i$ is frequency and $l_i$ is code length (depth in the tree). Dividing by the total frequency gives the average code length.
To minimize this sum, we want:
The two lowest frequency symbols should be at maximum depth because increasing their depth increases the weighted sum by the smallest amount.
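To see the effect numerically, here is a small sketch comparing two depth assignments for the same (illustrative) frequencies; both depth profiles correspond to valid prefix trees, and the names used are just for illustration:

# Weighted cost of a code assignment: sum of frequency * code length.
def weighted_cost(freqs, depths):
    return sum(freqs[s] * depths[s] for s in freqs)

freqs = {"A": 5, "B": 3, "C": 2, "D": 1}          # hypothetical frequencies

# Put the two rarest symbols (C, D) at the greatest depth...
deep_rare   = {"A": 1, "B": 2, "C": 3, "D": 3}
# ...versus putting the two most common symbols (A, B) there.
deep_common = {"A": 3, "B": 3, "C": 2, "D": 1}

print(weighted_cost(freqs, deep_rare))    # 20
print(weighted_cost(freqs, deep_common))  # 29: pushing common symbols deep costs more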
But intuition isn't proof. Let's make this rigorous.
Before proving the greedy choice property, we need a key structural observation about optimal trees.
Lemma (Sibling Existence at Maximum Depth):
In any optimal prefix code tree for an alphabet with at least two symbols, there exist two symbols that are siblings at the maximum depth.
Why This Must Be True:
A full binary tree (required for optimal codes) has the property that every internal node has exactly two children. Consider the deepest level of the tree: any leaf at maximum depth has a sibling (its parent has two children), and that sibling must also be a leaf at maximum depth, since any child of the sibling would be even deeper.
This lemma tells us: we can always find a pair of sibling symbols at the bottom of any optimal tree.
# In any full binary tree, leaves at maximum depth have siblings
#
# Example: Maximum depth = 3
#
#              ( )
#             /   \
#           ( )   [ ]          ← depth 1 leaf
#          /   \
#        ( )   ( )             ← depth 2 internal nodes
#       /  \   /  \
#     [ ] [ ] [ ] [ ]          ← depth 3 leaves (all have siblings!)
#
# If a leaf at depth d had no sibling (parent has only one child),
# we could remove the parent and move the leaf up—a shorter code!
# This contradicts optimality, so siblings must exist.

An optimal code tree must be a full binary tree (every internal node has exactly 2 children). If some internal node had only one child, we could remove that node and shorten all codes in its subtree, reducing average length—contradicting optimality.
We now prove the greedy choice property using an exchange argument—a classic proof technique for greedy algorithms.
Theorem (Greedy Choice Property):
Let x and y be the two symbols with the lowest frequencies. There exists an optimal prefix code tree T where x and y are siblings at maximum depth.
Proof:
Step 1: Consider any optimal tree T*. By the structural lemma, T* has some pair of sibling symbols (a, b) at maximum depth.
Step 2: If {a, b} = {x, y}, we're done—T* already has the desired structure.
Step 3: If not, we'll show we can exchange symbols to create another optimal tree T' where x and y are at maximum depth.
Without loss of generality, assume a ≠ x. We'll exchange a with x.
# Exchange Argument Visualization
#
# Given: x, y are the two lowest-frequency symbols
# Have:  Optimal tree T* with siblings a, b at max depth
# Want:  Optimal tree T' with x, y at max depth
#
# --- Before Exchange (Tree T*) ---
#
#            (root)
#            /    \
#          ...    ...
#                /    \
#              [x]    ( )        ← a, b are siblings at max depth
#         depth d_x   /  \
#                   [a]  [b]
#                   depth d_max (deepest)
#
# --- After Exchanging x and a (Tree T') ---
#
#            (root)
#            /    \
#          ...    ...
#                /    \
#              [a]    ( )        ← now x, b are siblings at max depth
#         depth d_x   /  \
#                   [x]  [b]
#                   depth d_max
#
# The structure is identical; only leaf labels changed.

Step 4: Analyze the cost change
Let the depth of x in T* be $d_x$ and the depth of a (at max depth) be $d_{max}$.
Since a is at maximum depth: $d_x \leq d_{max}$.
Cost of T* (for symbols x and a): $$\text{Cost}(T^*) = f_x \cdot d_x + f_a \cdot d_{max} + (\text{other symbols})$$
Cost of T' (after swapping x and a): $$\text{Cost}(T') = f_x \cdot d_{max} + f_a \cdot d_x + (\text{other symbols})$$
The difference: $$\text{Cost}(T') - \text{Cost}(T^*) = f_x \cdot d_{max} + f_a \cdot d_x - f_x \cdot d_x - f_a \cdot d_{max}$$ $$= f_x(d_{max} - d_x) - f_a(d_{max} - d_x)$$ $$= (f_x - f_a)(d_{max} - d_x)$$
Since x has minimum frequency: f_x ≤ f_a, so (f_x - f_a) ≤ 0. Since d_max is maximum depth: d_x ≤ d_max, so (d_max - d_x) ≥ 0. Therefore: (f_x - f_a)(d_max - d_x) ≤ 0. This means Cost(T') ≤ Cost(T*). Since T* was optimal, T' must also be optimal!
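Here is a quick numeric sanity check of this cost-difference formula, a sketch with illustrative values for the frequencies and depths:

# Illustrative values: x has the minimum frequency, a sits at the maximum depth.
f_x, f_a = 1, 4          # f_x <= f_a
d_x, d_max = 2, 3        # d_x <= d_max

cost_before = f_x * d_x   + f_a * d_max   # contributions of x and a in T*
cost_after  = f_x * d_max + f_a * d_x     # contributions after swapping x and a

print(cost_after - cost_before)           # -3: the swap does not increase the cost
print((f_x - f_a) * (d_max - d_x))        # -3: matches the derived formula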
Step 5: Similarly, if y is not at maximum depth (i.e., y ≠ b), we can exchange y with b using the same argument, getting another optimal tree T'' where both x and y are at maximum depth.
Step 6: Since x and y are now both at maximum depth, and they displaced the original siblings a and b, x and y must themselves be siblings (they share the same parent that a and b had).
Conclusion: There exists an optimal tree where x and y (the two minimum-frequency symbols) are siblings at maximum depth. ∎
The exchange argument proves the greedy choice property: combining the two smallest-frequency symbols is safe. But we also need optimal substructure to complete the correctness proof.
Optimal Substructure Property:
If T is an optimal tree for alphabet Σ, and we replace the subtree containing x and y (minimum-frequency siblings) with a single node z of frequency f_x + f_y, then the resulting tree T' is optimal for the reduced alphabet Σ' = Σ - {x, y} ∪ {z}.
Why This Holds:
The cost of T can be written as: $$\text{Cost}(T) = \text{Cost}(T') + f_x + f_y$$
This is because x and y sit one level deeper in T than z sits in T'. If z is at depth $d_z$ in T', then x and y are at depth $d_z + 1$ in T, so their contribution differs from z's by:
$(f_x + f_y)(d_z + 1) - (f_x + f_y) \cdot d_z = f_x + f_y$
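As a quick numeric check, here is the relation evaluated with the frequencies from the worked example later on this page (A=5, B=3, C=2, D=1, with C and D merged into a node z of frequency 3):

cost_T  = 5*1 + 3*2 + 2*3 + 1*3        # full tree T: depths A=1, B=2, C=3, D=3 -> 20
cost_Tp = 5*1 + 3*2 + (2 + 1)*2        # reduced tree T': z replaces C, D at depth 2 -> 17
print(cost_T == cost_Tp + (2 + 1))     # True: Cost(T) = Cost(T') + f_C + f_D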
# Optimal Substructure Visualization
#
# Original tree T (optimal for {x, y, ...}):
#
#        ...
#         |
#        (z)        ← internal node
#        /  \
#      [x]  [y]     at depth d
#
# Cost contribution: f_x * d + f_y * d
#
# Reduced tree T' (for {...} with z replacing x, y):
#
#        ...
#         |
#        [z]        ← leaf node with frequency f_x + f_y
#                     at depth d-1
#
# Cost contribution: (f_x + f_y) * (d-1)
#
# Difference:
#   [f_x * d + f_y * d] - [(f_x + f_y) * (d-1)]
#   = f_x * d + f_y * d - f_x * d + f_x - f_y * d + f_y
#   = f_x + f_y
#
# So: Cost(T) = Cost(T') + f_x + f_y
#
# If T' were suboptimal, we could find a better T'' and build a
# better tree for the original problem—contradicting T's optimality.

The Complete Correctness Argument:
Base Case: For a 1-symbol alphabet, the optimal tree is trivially a single leaf.
Inductive Step: For an n-symbol alphabet (n ≥ 2): (1) combine the two minimum-frequency symbols x and y into a merged symbol z with frequency f_x + f_y, which is safe by the greedy choice property; (2) recursively build a tree for the reduced (n−1)-symbol alphabet, which is optimal by the inductive hypothesis; (3) expand z back into the siblings x and y, which yields an optimal tree for the original alphabet by the optimal substructure property.
Conclusion: Huffman's algorithm produces an optimal prefix code for any alphabet.
While Huffman's algorithm builds the tree bottom-up (from leaves to root), the correctness proof proceeds top-down by induction on alphabet size. This reversed perspective is common in greedy algorithm proofs.
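To connect the proof to the construction, here is a minimal Python sketch of the bottom-up algorithm (the function name huffman_code_lengths and the tie-breaking counter are illustrative choices; a production encoder would also treat the one-symbol alphabet as a special case):

import heapq

def huffman_code_lengths(freqs):
    """Return {symbol: code length} for a frequency map, via Huffman's greedy merging."""
    # Each heap entry: (frequency, tiebreak, {symbol: depth so far})
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)   # the two minimum-frequency subtrees:
        f2, _, d2 = heapq.heappop(heap)   # the greedy choice the proof justifies
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}  # everything one level deeper
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

freqs = {"A": 5, "B": 3, "C": 2, "D": 1}
lengths = huffman_code_lengths(freqs)
print(sorted(lengths.items()))                    # [('A', 1), ('B', 2), ('C', 3), ('D', 3)]
print(sum(freqs[s] * lengths[s] for s in freqs))  # 20, the optimal weighted path length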
Let's trace through the proof with a concrete example. Consider alphabet {A, B, C, D} with frequencies {5, 3, 2, 1}.
Claim: Huffman's algorithm produces an optimal tree.
Step 1: Identify minimum-frequency symbols
The two minimum-frequency symbols are C (freq=2) and D (freq=1).
Step 2: Apply greedy choice property
By the greedy choice property, there exists an optimal tree where C and D are siblings at maximum depth. Huffman's algorithm makes them siblings: it combines them into a node with frequency 2+1=3.
Step 3: Reduce the problem
The new alphabet is {A, B, (CD)} with frequencies {5, 3, 3}.
Step 4: Recurse
Minimum frequencies: B (3) and (CD) (3). By the greedy choice property, there exists an optimal tree for this reduced alphabet where B and (CD) are siblings. Combine them: frequency 3+3=6.
Step 5: Continue recursion
Alphabet: {A, (B,CD)} with frequencies {5, 6}. Minimum: A (5) and (B,CD) (6). Combine them: root with frequency 11.
# Huffman's algorithm trace with proof verification
# Frequencies: A=5, B=3, C=2, D=1
#
# Step 1: Combine D(1) and C(2) → (CD:3)
#
# Greedy choice property guarantees:
# ∃ optimal tree with C, D as deepest siblings ✓
#
#      (CD:3)
#      /    \
#   [D:1]  [C:2]
#
# Step 2: Combine B(3) and (CD:3) → (BCD:6)
#
# Greedy choice property for reduced alphabet:
# ∃ optimal tree with B, (CD) as siblings ✓
#
#      (BCD:6)
#      /     \
#   [B:3]   (CD:3)
#           /    \
#        [D:1]  [C:2]
#
# Step 3: Combine A(5) and (BCD:6) → Root(11)
#
#        (11)
#       /    \
#   [A:5]   (BCD:6)
#           /     \
#        [B:3]   (CD:3)
#                /    \
#             [D:1]  [C:2]
#
# Final codes:
#   A: 0    (freq 5, depth 1) → contribution: 5
#   B: 10   (freq 3, depth 2) → contribution: 6
#   C: 111  (freq 2, depth 3) → contribution: 6
#   D: 110  (freq 1, depth 3) → contribution: 3
#
# Total weighted path length: 5 + 6 + 6 + 3 = 20
# Average code length: 20 / 11 ≈ 1.82 bits/symbol

Verifying optimality:
Let's calculate the entropy (theoretical minimum): $$H = -\left(\frac{5}{11}\log_2\frac{5}{11} + \frac{3}{11}\log_2\frac{3}{11} + \frac{2}{11}\log_2\frac{2}{11} + \frac{1}{11}\log_2\frac{1}{11}\right)$$ $$H \approx 1.79 \text{ bits/symbol}$$
Our Huffman code achieves 1.82 bits/symbol, which is within the theoretical bound of: entropy ≤ average ≤ entropy + 1.
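These figures can be reproduced with a few lines, a sketch using the frequencies and code lengths from the trace above:

from math import log2

freqs = {"A": 5, "B": 3, "C": 2, "D": 1}
total = sum(freqs.values())                          # 11

entropy = -sum((f / total) * log2(f / total) for f in freqs.values())
lengths = {"A": 1, "B": 2, "C": 3, "D": 3}           # depths from the Huffman tree above
average = sum(freqs[s] * lengths[s] for s in freqs) / total

print(round(entropy, 2), round(average, 2))          # about 1.79 and 1.82
print(entropy <= average <= entropy + 1)             # True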
No prefix code can average below entropy, and Huffman gives the best possible among prefix codes!
At each step, the exchange argument guarantees that making the two smallest frequencies siblings is part of some optimal solution. By optimal substructure, solving the reduced problem optimally and then expanding gives an optimal solution to the original problem.
The fact that Huffman's specific greedy choice works doesn't mean any greedy approach would. Let's examine some alternative strategies that seem reasonable but fail to produce optimal codes.
Counterexample for Maximum-First Strategy:
Consider frequencies: A=10, B=5, C=3, D=2.
Maximum-first approach: repeatedly combine the two highest-frequency symbols first, instead of the two lowest.
# Maximum-First Strategy (WRONG)
#
#          (20)
#         /    \
#      (18)   [D:2]    ← rare symbol gets 1-bit code!
#      /   \
#    (15)  [C:3]       ← rare symbol gets 2-bit code!
#    /   \
# [A:10] [B:5]         ← COMMON symbols get 3-bit codes!
#
# Codes: D=1, C=01, A=000, B=001
#
# Weighted path length:
#   A: 10 × 3 = 30
#   B:  5 × 3 = 15
#   C:  3 × 2 = 6
#   D:  2 × 1 = 2
#   Total: 53
#
# Huffman's approach (minimum-first):
#
#        (20)
#       /    \
#   [A:10]   (10)
#            /   \
#         [B:5]  (5)
#                /  \
#             [C:3] [D:2]
#
# Codes: A=0, B=10, C=110, D=111
# Weighted path length:
#   A: 10 × 1 = 10
#   B:  5 × 2 = 10
#   C:  3 × 3 = 9
#   D:  2 × 3 = 6
#   Total: 35
#
# Huffman: 35 vs Maximum-first: 53
# The wrong strategy is 51% worse!

The greedy approach only works because of the specific choice to combine minimum frequencies. The mathematical structure that makes this work (the exchange argument) does not apply to other selection criteria. Never assume a greedy strategy is correct—prove it!
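The comparison can be reproduced with a small sketch that runs the same merging loop under either selection rule (greedy_cost is an illustrative helper written for this page, not a standard library function):

import heapq

def greedy_cost(freqs, pick_max=False):
    """Weighted path length when repeatedly merging two nodes, smallest-first or largest-first."""
    sign = -1 if pick_max else 1                      # negate to turn the min-heap into a max-heap
    heap = [(sign * f, i, {s: 0}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    nxt = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, nxt, merged))
        nxt += 1
    depths = heap[0][2]
    return sum(freqs[s] * depths[s] for s in freqs)

freqs = {"A": 10, "B": 5, "C": 3, "D": 2}
print(greedy_cost(freqs))                 # 35  (minimum-first: Huffman)
print(greedy_cost(freqs, pick_max=True))  # 53  (maximum-first: clearly worse)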
When multiple symbols have the same frequency, Huffman's algorithm must choose which pair to combine. Does this choice affect optimality?
Key insight: All valid Huffman trees for the same frequencies have the same average code length—they're all optimal! The specific tree structure may differ, but the weighted path length is identical.
Why Multiple Optima Exist:
When frequencies are equal, the exchange argument shows we can swap siblings without changing the cost. Consider frequencies {A:2, B:2, C:1, D:1}:
# Two different but equally optimal Huffman trees
# Frequencies: A=2, B=2, C=1, D=1
#
# Tree 1: Combine CD first, then AB, then combine the results
#
#           ( )
#         /     \
#       (AB)    (CD)
#       /  \    /  \
#     [A]  [B] [C] [D]
#
# Codes: A=00, B=01, C=10, D=11 (all 2 bits)
# Cost: 2×2 + 2×2 + 1×2 + 1×2 = 12
#
# Tree 2: Different combination order
#
#           ( )
#          /   \
#        ( )   [A]
#        /  \
#      ( )  [B]
#      /  \
#    [C]  [D]
#
# Codes: A=1, B=01, C=000, D=001
# Cost: 2×1 + 2×2 + 1×3 + 1×3 = 2+4+3+3 = 12
#
# Same cost! Both are optimal.
# The structure differs, but weighted path length is identical.

Practical Implications:
Reproducibility: If you need identical codes across implementations or runs, you must specify deterministic tie-breaking (e.g., alphabetical order, earliest insertion).
Canonical Huffman Codes: Some applications (like DEFLATE) require a specific 'canonical' form where codes of the same length are assigned in symbol order. This doesn't change the average length but standardizes the bit patterns.
Tree Shape vs. Code Lengths: What matters for compression is the depths (code lengths) of symbols, not the specific tree shape. Multiple shapes can yield the same depth distribution.
In practice, many compression formats use 'canonical' Huffman codes where codes are standardized: for each code length, symbols are assigned codes in alphabetical order. This means you only need to store the code lengths, not the actual bit patterns—significant space savings for the code table itself.
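Here is a minimal sketch of canonical code assignment from code lengths alone (ties broken by symbol order here; real formats such as DEFLATE fix their own ordering conventions):

def canonical_codes(lengths):
    """Assign canonical codes: shorter codes first; equal lengths in symbol order."""
    code, prev_len, codes = 0, 0, {}
    for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= (length - prev_len)        # extend the running code to the new length
        codes[sym] = format(code, f"0{length}b")
        code += 1
        prev_len = length
    return codes

# Only the lengths need to be stored or transmitted; the bit patterns are reproducible.
print(canonical_codes({"A": 1, "B": 2, "C": 3, "D": 3}))
# {'A': '0', 'B': '10', 'C': '110', 'D': '111'}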
Stepping back, let's understand why the minimum-frequency greedy choice enjoys the optimal substructure property that makes it work.
The Matroid Structure (Advanced):
The mathematical structure underlying Huffman optimality is related to matroids—abstract structures that generalize the notion of linear independence. While a full matroid treatment is beyond our scope, the key idea is:
The set of 'valid choices' at each step forms a structure where local optimality implies global optimality.
For Huffman coding, this manifests as: the contribution of a symbol to total cost depends only on its frequency and depth, not on what other symbols do. This independence is why we can greedily optimize depth assignments.
The Cost Function is Decomposable:
$$\text{Cost}(T) = \sum_{i=1}^{n} f_i \cdot d_i = \sum_{\text{leaves } i} (\text{frequency} \times \text{depth})$$
Each symbol contributes independently to the total cost. When you move a symbol to a different depth, only that symbol's contribution changes. This decomposability is essential for the exchange argument to work.
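A tiny illustration of this decomposability, reusing the frequencies and code lengths from the trace above:

freqs  = {"A": 5, "B": 3, "C": 2, "D": 1}
depths = {"A": 1, "B": 2, "C": 3, "D": 3}

contributions = {s: freqs[s] * depths[s] for s in freqs}
print(contributions)              # {'A': 5, 'B': 6, 'C': 6, 'D': 3}; they sum to the cost, 20

depths["C"] += 1                  # move only C one level deeper
print(freqs["C"] * depths["C"])   # 8: only C's term changes; A, B, D are untouched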
Contrast with Non-Greedy-Able Problems:
Consider the Traveling Salesman Problem (TSP). The cost of visiting city A before city B depends on which city you came from—the contribution is not independent. This interdependence prevents greedy approaches from working.
In Huffman coding, symbol A's code cost is f_A × depth_A, regardless of what codes other symbols have. This independence is the deep reason the greedy approach succeeds.
When analyzing whether a greedy approach might work, look for decomposable cost functions where each element's contribution is independent of others. This is necessary (though not sufficient) for greedy optimality.
We've rigorously established why Huffman's greedy strategy produces optimal codes. Let's consolidate: the greedy choice property guarantees that some optimal tree has the two minimum-frequency symbols as siblings at maximum depth; the exchange argument proves this by swapping leaves without increasing cost; optimal substructure lets us solve the reduced alphabet and expand the merged node; and induction on alphabet size combines these into a complete correctness proof.
What's Next:
We've proven that Huffman's algorithm is correct. But how good is the result? In the final page, we'll examine the optimality guarantees quantitatively—how close Huffman codes come to the theoretical entropy limit, when they achieve it exactly, and the formal theorem that establishes Huffman coding as optimal among all prefix codes.
You now understand the theoretical foundation of Huffman's algorithm: the greedy choice property, the exchange argument proof technique, and why this specific greedy strategy—unlike many alternatives—always yields optimal prefix codes.