On the previous page, we established that our goal is to find a prefix-free binary code with minimum average length. We framed this as finding a binary tree that minimizes weighted path length. But with the number of possible tree structures growing exponentially, how can we efficiently find the optimal one?
The answer lies in one of the most beautiful greedy algorithms in computer science: Huffman's algorithm. Rather than searching through all possible trees, Huffman discovered that building the tree bottom-up by repeatedly combining the two lowest-frequency nodes always produces an optimal result.
This page walks through the complete construction process, from intuition to implementation, ensuring you understand not just what the algorithm does but why each step is necessary.
By the end of this page, you will be able to construct a Huffman tree by hand for any frequency distribution, implement the algorithm in code using a priority queue, trace through the algorithm step-by-step, and extract the optimal codes from the completed tree.
Huffman's key insight can be stated simply:
The two symbols with the lowest frequencies should be siblings in the optimal tree, as deep as possible.
Why? Because in a full binary tree, the deepest leaves have the longest codes. By placing the least frequent symbols at the bottom, we ensure that the longest codes are assigned to symbols that appear least often—exactly what we want to minimize average code length.
The Greedy Strategy
Huffman's algorithm works by repeatedly finding the two lowest-frequency nodes and combining them into a new internal node whose frequency is the sum of its children. This new node then participates in future combinations.
The process continues until only one node remains—the root of the complete Huffman tree.
Most tree algorithms work top-down: start at the root and descend. Huffman works bottom-up: start with leaves (individual symbols) and build upward by combining. This reversal is what makes the greedy approach work.
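Before walking through a full example, here is the core loop in compact form. This is a minimal sketch, not the full implementation (that appears later on this page): it assumes a min-heap keyed by frequency, represents trees as nested tuples just to stay short, and the name `huffman_skeleton` is invented for this illustration.

```python
import heapq
from itertools import count

def huffman_skeleton(frequencies):
    """Core greedy loop: repeatedly merge the two lowest-frequency trees."""
    tie = count()  # tiebreaker so equal frequencies never compare trees directly
    heap = [(freq, next(tie), sym) for sym, freq in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)  # lowest frequency
        f2, _, t2 = heapq.heappop(heap)  # second lowest
        # The merged node rejoins the pool and can be merged again later
        heapq.heappush(heap, (f1 + f2, next(tie), (t1, t2)))
    return heap[0][2]  # the tree, as nested tuples

# Example: the six-symbol distribution traced below
print(huffman_skeleton({'A': 45, 'B': 13, 'C': 12, 'D': 16, 'E': 9, 'F': 5}))
```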
The Algorithm at a Glance
Let's trace through Huffman's algorithm with a concrete example. Consider a message where characters appear with the following frequencies:
| Character | Frequency |
|---|---|
| A | 45 |
| B | 13 |
| C | 12 |
| D | 16 |
| E | 9 |
| F | 5 |
Initial State: Six leaf nodes in our priority queue, ordered by frequency:
```
# Step 0: Initial priority queue (min-heap by frequency)

# Queue: [F:5, E:9, C:12, B:13, D:16, A:45]

# Each entry is a leaf node for the corresponding symbol

# Forest of 6 individual leaf nodes:

# [F:5] [E:9] [C:12] [B:13] [D:16] [A:45]
```

Step 1: Extract the two minimum nodes (F:5 and E:9). Combine them into a new internal node with frequency 5+9=14. Insert the new node back into the queue.
```
# Step 1: Combine F and E

# Extract: F:5, E:9
# Create internal node: (14) with children F, E
# Insert (14) back into queue

# Queue: [C:12, B:13, (14), D:16, A:45]

# Current forest:
#
#    (14)
#    /  \
# [F:5] [E:9]
#
# [C:12] [B:13] [D:16] [A:45]
```

Step 2: Extract C:12 and B:13. Combine into node with frequency 25.
```
# Step 2: Combine C and B

# Extract: C:12, B:13
# Create internal node: (25) with children C, B
# Insert (25) back into queue

# Queue: [(14), D:16, (25), A:45]

# Current forest:
#
#    (14)         (25)
#    /  \         /  \
# [F:5] [E:9] [C:12] [B:13]
#
# [D:16] [A:45]
```

Step 3: Extract (14) and D:16. Combine into node with frequency 30.
```
# Step 3: Combine (14) and D

# Extract: (14), D:16
# Create internal node: (30) with children (14), D
# Insert (30) back into queue

# Queue: [(25), (30), A:45]

# Current forest:
#
#       (30)
#       /  \
#    (14)  [D:16]
#    /  \
# [F:5] [E:9]
#
#    (25)
#    /  \
# [C:12] [B:13]
#
# [A:45]
```

Step 4: Extract (25) and (30). Combine into node with frequency 55.
```
# Step 4: Combine (25) and (30)

# Extract: (25), (30)
# Create internal node: (55) with children (25), (30)
# Insert (55) back into queue

# Queue: [A:45, (55)]

# Current forest:
#
#          (55)
#         /    \
#      (25)    (30)
#      /  \    /  \
# [C:12][B:13](14)[D:16]
#             /  \
#          [F:5] [E:9]
#
# [A:45]
```

Step 5 (Final): Extract A:45 and (55). Combine into root node with frequency 100.
```
# Step 5: Combine A and (55) - FINAL STEP

# Extract: A:45, (55)
# Create root node: (100) with children A, (55)
# Queue now has only one node - we're done!

# COMPLETE HUFFMAN TREE:
#
#         (100)
#         /   \
#    [A:45]   (55)
#            /    \
#         (25)    (30)
#         /  \    /  \
#    [C:12][B:13](14)[D:16]
#                /  \
#            [F:5] [E:9]

# Reading codes (path from root, left=0, right=1):
#   A: 0    (just left)
#   C: 100  (right, left, left)
#   B: 101  (right, left, right)
#   F: 1100 (right, right, left, left)
#   E: 1101 (right, right, left, right)
#   D: 111  (right, right, right)
```

Notice how the most frequent symbol (A:45) ends up closest to the root with a 1-bit code, while the least frequent symbols (F:5, E:9) end up deepest with 4-bit codes. This is precisely what we want for minimal average code length!
Once the Huffman tree is built, extracting the codes is straightforward. We perform a traversal from the root to each leaf, recording '0' for each left edge and '1' for each right edge.
The Resulting Codes:
| Character | Frequency | Code | Code Length |
|---|---|---|---|
| A | 45 | 0 | 1 |
| B | 13 | 101 | 3 |
| C | 12 | 100 | 3 |
| D | 16 | 111 | 3 |
| E | 9 | 1101 | 4 |
| F | 5 | 1100 | 4 |
Verifying Prefix-Free Property:
Let's confirm no code is a prefix of another:
Code 0 (for A) is the only code that starts with 0, so it cannot be a prefix of any other code. The 3-bit codes 100, 101, and 111 are pairwise distinct, and none of them equals 110, the 3-bit prefix shared by the 4-bit codes 1100 and 1101.

✓ Valid prefix-free code!
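If you prefer to check this mechanically, a short helper can compare every ordered pair of codes. This is just a sketch; `is_prefix_free` is a name made up for this page, not part of the implementation below.

```python
def is_prefix_free(codes: dict) -> bool:
    """Return True if no code is a proper prefix of another code."""
    values = list(codes.values())
    return not any(
        i != j and values[j].startswith(values[i])
        for i in range(len(values))
        for j in range(len(values))
    )

huffman_codes = {'A': '0', 'B': '101', 'C': '100',
                 'D': '111', 'E': '1101', 'F': '1100'}
print(is_prefix_free(huffman_codes))           # True
print(is_prefix_free({'A': '0', 'B': '01'}))   # False: '0' is a prefix of '01'
```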
Calculating Average Code Length:
Total characters = 45 + 13 + 12 + 16 + 9 + 5 = 100
Average = (45×1 + 13×3 + 12×3 + 16×3 + 9×4 + 5×4) / 100 = (45 + 39 + 36 + 48 + 36 + 20) / 100 = 224 / 100 = 2.24 bits per symbol
Compare to fixed-length: ⌈log₂(6)⌉ = 3 bits per symbol
Compression achieved: (3 - 2.24) / 3 = 25.3% savings!
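The same arithmetic in a few lines of Python, using the frequencies and codes from the tables above (a quick sketch for checking the numbers):

```python
freqs = {'A': 45, 'B': 13, 'C': 12, 'D': 16, 'E': 9, 'F': 5}
codes = {'A': '0', 'B': '101', 'C': '100', 'D': '111', 'E': '1101', 'F': '1100'}

total = sum(freqs.values())                                 # 100
avg = sum(freqs[s] * len(codes[s]) for s in freqs) / total  # 2.24
fixed = 3                                                   # ceil(log2(6))
print(f"{avg} bits/symbol, savings {(fixed - avg) / fixed:.1%}")  # 2.24, 25.3%
```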
```python
def extract_codes(root, current_code="", codes=None):
    """
    Extract Huffman codes by traversing the tree.

    Args:
        root: The root of the Huffman tree
        current_code: The code built so far (used in recursion)
        codes: Dictionary to store symbol -> code mappings

    Returns:
        Dictionary mapping each symbol to its Huffman code
    """
    if codes is None:
        codes = {}

    if root is None:
        return codes

    # If this is a leaf node (has a symbol), record the code
    if root.symbol is not None:
        # Edge case: single-symbol alphabet gets code "0"
        codes[root.symbol] = current_code if current_code else "0"
        return codes

    # Recurse: left child gets '0', right child gets '1'
    extract_codes(root.left, current_code + "0", codes)
    extract_codes(root.right, current_code + "1", codes)
    return codes


# After building the Huffman tree, call:
# codes = extract_codes(root)
```

The choice of '0' for left and '1' for right is an arbitrary convention. Some implementations use the opposite. What matters is consistency: whichever convention you choose for encoding must be used for decoding.
Now let's implement the complete Huffman coding algorithm. The key data structures are a `HuffmanNode` class (leaves carry a symbol; internal nodes have `symbol=None`) and a min-heap, via Python's `heapq` module, serving as the priority queue.
```python
import heapq
from collections import Counter
from typing import Dict, Optional


class HuffmanNode:
    """
    Node in a Huffman tree.

    Leaf nodes have a symbol; internal nodes have symbol=None.
    All nodes have a frequency (weight).
    """
    def __init__(
        self,
        frequency: int,
        symbol: Optional[str] = None,
        left: Optional['HuffmanNode'] = None,
        right: Optional['HuffmanNode'] = None
    ):
        self.frequency = frequency
        self.symbol = symbol
        self.left = left
        self.right = right

    # Define comparison for heap ordering
    # We compare by frequency; ties broken arbitrarily
    def __lt__(self, other: 'HuffmanNode') -> bool:
        return self.frequency < other.frequency

    def is_leaf(self) -> bool:
        return self.symbol is not None


def build_huffman_tree(frequencies: Dict[str, int]) -> HuffmanNode:
    """
    Build a Huffman tree from symbol frequencies.

    Args:
        frequencies: Dict mapping symbols to their frequencies

    Returns:
        The root node of the Huffman tree

    Time Complexity: O(n log n) where n = number of symbols
    Space Complexity: O(n) for the heap and tree nodes
    """
    if not frequencies:
        raise ValueError("Cannot build Huffman tree for empty alphabet")

    # Handle single-symbol edge case
    if len(frequencies) == 1:
        symbol, freq = next(iter(frequencies.items()))
        # Create a tree with just one leaf
        return HuffmanNode(freq, symbol)

    # Step 1: Create leaf nodes and add to min-heap
    heap = []
    for symbol, freq in frequencies.items():
        node = HuffmanNode(freq, symbol)
        heapq.heappush(heap, node)

    # Step 2: Repeatedly combine two smallest nodes
    while len(heap) > 1:
        # Extract two minimum nodes
        left = heapq.heappop(heap)
        right = heapq.heappop(heap)

        # Create parent with combined frequency
        parent = HuffmanNode(
            frequency=left.frequency + right.frequency,
            symbol=None,  # Internal node, no symbol
            left=left,
            right=right
        )

        # Insert parent back into heap
        heapq.heappush(heap, parent)

    # Step 3: Return the root (last remaining node)
    return heap[0]


def extract_codes(node: HuffmanNode, prefix: str = "") -> Dict[str, str]:
    """
    Extract Huffman codes by traversing the tree.

    Args:
        node: Current node in traversal
        prefix: Code built so far on path from root

    Returns:
        Dict mapping each symbol to its Huffman code
    """
    codes = {}
    if node.is_leaf():
        # Leaf node: record the code (use "0" for single-symbol case)
        codes[node.symbol] = prefix if prefix else "0"
    else:
        # Internal node: recurse on children
        if node.left:
            codes.update(extract_codes(node.left, prefix + "0"))
        if node.right:
            codes.update(extract_codes(node.right, prefix + "1"))
    return codes


def huffman_encode(text: str) -> tuple[str, Dict[str, str]]:
    """
    Encode text using Huffman coding.

    Args:
        text: The string to encode

    Returns:
        Tuple of (encoded_bits, code_table)
    """
    # Count frequencies
    frequencies = Counter(text)

    # Build tree and extract codes
    root = build_huffman_tree(dict(frequencies))
    codes = extract_codes(root)

    # Encode the text
    encoded = ''.join(codes[char] for char in text)
    return encoded, codes


def huffman_decode(encoded: str, codes: Dict[str, str]) -> str:
    """
    Decode Huffman-encoded bits using the code table.

    Args:
        encoded: String of '0' and '1' representing encoded data
        codes: Dict mapping symbols to their codes

    Returns:
        Decoded string
    """
    # Build reverse lookup: code -> symbol
    reverse_codes = {code: symbol for symbol, code in codes.items()}

    result = []
    current = ""
    for bit in encoded:
        current += bit
        if current in reverse_codes:
            result.append(reverse_codes[current])
            current = ""

    if current:
        # Leftover bits indicate invalid encoding
        raise ValueError(f"Invalid encoding: leftover bits '{current}'")

    return ''.join(result)


# Example usage
if __name__ == "__main__":
    # Example 1: Simple message
    text = "ABRACADABRA"
    encoded, codes = huffman_encode(text)
    decoded = huffman_decode(encoded, codes)

    print("=== Huffman Coding Demo ===")
    print(f"Original: {text}")
    print(f"Codes: {codes}")
    print(f"Encoded: {encoded}")
    print(f"Decoded: {decoded}")
    print(f"Original bits (8-bit ASCII): {len(text) * 8}")
    print(f"Encoded bits: {len(encoded)}")
    print(f"Compression ratio: {len(encoded) / (len(text) * 8):.1%}")
```

The priority queue (min-heap) is crucial for efficiency. Each extraction and insertion takes O(log n) time. With n-1 iterations, the total time is O(n log n). Without a heap, a naive linear search for the minimum would make the algorithm O(n²).
Let's analyze the time and space complexity of Huffman's algorithm in detail.
Let n = size of the alphabet (number of distinct symbols)
Time Complexity: O(n log n)
| Operation | Frequency | Cost per Operation | Total |
|---|---|---|---|
| Build initial heap | 1 | O(n) | O(n) |
| Extract minimum | 2(n-1) | O(log n) | O(n log n) |
| Insert new node | n-1 | O(log n) | O(n log n) |
| Extract codes (tree traversal) | 1 | O(n) | O(n) |
| Overall | | | O(n log n) |
Space Complexity: O(n). The heap never holds more than n nodes, and the finished tree contains exactly 2n−1 nodes (n leaves plus n−1 internal nodes).
Encoding a message of length m: O(m) time once the code table is built, since each character becomes a single dictionary lookup.
Decoding a message: time linear in the number of encoded bits, since each bit advances the decoder by exactly one step.
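To make the per-bit cost concrete, here is an alternative decoder that walks the tree directly instead of using the reverse code table. This is a sketch, assuming the `HuffmanNode` class from the implementation above; it skips the single-symbol edge case for brevity.

```python
def huffman_decode_tree(encoded: str, root: HuffmanNode) -> str:
    """Decode by walking the tree: O(1) work per bit, O(m) total for m bits."""
    result = []
    node = root
    for bit in encoded:
        node = node.left if bit == '0' else node.right  # one edge per bit
        if node.is_leaf():
            result.append(node.symbol)
            node = root  # next symbol starts again at the root
    return ''.join(result)
```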
In practice, Huffman tables are often pre-computed and standardized (like in JPEG). Building the tree is a one-time cost; the ongoing cost is the O(m) encoding/decoding of data, where m is the message length.
A robust Huffman implementation must handle several edge cases correctly.
```python
# Edge Case 1: Single symbol
text = "AAAAA"
encoded, codes = huffman_encode(text)
print(f"Single symbol: {codes}")  # {'A': '0'}

# Edge Case 2: Two symbols with equal frequency
text = "ABABAB"
encoded, codes = huffman_encode(text)
print(f"Equal frequency: {codes}")  # e.g., {'A': '0', 'B': '1'}

# Edge Case 3: Highly skewed distribution
text = "A" * 99 + "B"  # 99 A's and 1 B
encoded, codes = huffman_encode(text)
print(f"Skewed: {codes}")  # A gets shorter code

# Edge Case 4: All characters once (uniform distribution)
text = "ABCDEFGH"
encoded, codes = huffman_encode(text)
# With 8 symbols, all codes will be 3 bits (since log₂(8) = 3)
print(f"Uniform: {codes}")
```

When two nodes have equal frequency, the choice of which to extract first is arbitrary for optimality but matters for reproducibility. If you need identical codes across runs, implement deterministic tie-breaking (e.g., by symbol value).
While the priority queue approach is standard, there are other ways to implement Huffman's algorithm, each with different trade-offs.
```python
from collections import deque


def build_huffman_tree_linear(sorted_symbols: list) -> HuffmanNode:
    """
    Build Huffman tree in O(n) time when input is pre-sorted.

    Args:
        sorted_symbols: List of (symbol, frequency) sorted by frequency

    Key insight: Internal nodes are created in non-decreasing order,
    so they're automatically sorted as we create them.
    """
    # Queue 1: Original leaf nodes (already sorted)
    leaves = deque(
        HuffmanNode(freq, symbol) for symbol, freq in sorted_symbols
    )

    # Queue 2: Internal nodes (sorted by construction)
    internals = deque()

    def get_minimum():
        """Get the minimum node from either queue."""
        if not leaves:
            return internals.popleft()
        if not internals:
            return leaves.popleft()
        # Both non-empty: compare fronts
        if leaves[0].frequency <= internals[0].frequency:
            return leaves.popleft()
        return internals.popleft()

    # Combine until one node remains
    while len(leaves) + len(internals) > 1:
        left = get_minimum()
        right = get_minimum()
        parent = HuffmanNode(
            frequency=left.frequency + right.frequency,
            left=left,
            right=right
        )
        internals.append(parent)

    return internals[0] if internals else leaves[0]


# Usage:
# sorted_input = [('F', 5), ('E', 9), ('C', 12), ('B', 13), ('D', 16), ('A', 45)]
# root = build_huffman_tree_linear(sorted_input)
```

The O(n) algorithm shines when frequencies are already sorted or when using fixed, pre-sorted alphabets (like ASCII). In general-purpose compression, the O(n log n) heap version is simpler and the difference is negligible for typical alphabet sizes (n ≤ 256).
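As a sanity check, both builders should produce trees with the same total weighted path length (224 bits for our running example), even though tie-breaking may assign different individual codes. Below is a sketch reusing `build_huffman_tree`, `build_huffman_tree_linear`, and `extract_codes` from above:

```python
freqs = {'F': 5, 'E': 9, 'C': 12, 'B': 13, 'D': 16, 'A': 45}
sorted_input = sorted(freqs.items(), key=lambda kv: kv[1])

heap_codes = extract_codes(build_huffman_tree(freqs))
linear_codes = extract_codes(build_huffman_tree_linear(sorted_input))

def total_bits(codes):
    """Weighted path length: sum of frequency x code length."""
    return sum(freqs[s] * len(codes[s]) for s in codes)

print(total_bits(heap_codes), total_bits(linear_codes))  # 224 224
```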
Let's trace another example to solidify understanding. Consider building a Huffman tree for the message 'ABRACADABRA'.
Step 1: Count Frequencies
| Character | Count | Relative Frequency |
|---|---|---|
| A | 5 | 45.5% |
| B | 2 | 18.2% |
| R | 2 | 18.2% |
| C | 1 | 9.1% |
| D | 1 | 9.1% |
```
# Building Huffman Tree for "ABRACADABRA"

# Initial heap: [C:1, D:1, B:2, R:2, A:5]

# Step 1: Combine C:1 + D:1 → (2)
# Heap: [B:2, R:2, (2), A:5]
#
#   (2)
#   / \
# [C] [D]

# Step 2: Combine R:2 + B:2 → (4) [or B:2 + (2) depending on tie-break]
# Let's say we combine R and B:
# Heap: [(2), A:5, (4)]
#
#   (4)
#   / \
# [R] [B]

# Step 3: Combine (2) + (4) → (6)
# Heap: [A:5, (6)]
#
#      (6)
#     /   \
#   (2)   (4)
#   / \   / \
# [C] [D][R] [B]

# Step 4: Combine A:5 + (6) → (11) - ROOT
#
#     (11)
#     /  \
# [A:5]  (6)
#        /  \
#      (2)  (4)
#      / \  / \
#    [C][D][R][B]

# Resulting codes:
#   A: 0   (1 bit)
#   C: 100 (3 bits)
#   D: 101 (3 bits)
#   R: 110 (3 bits)
#   B: 111 (3 bits)

# Encoded "ABRACADABRA":
#   A B   R   A C   A D   A B   R   A
#   0 111 110 0 100 0 101 0 111 110 0
#   = 01111100100010101111100
#   = 23 bits

# Fixed length (3 bits for 5 symbols) would be:
#   11 characters × 3 bits = 33 bits

# Compression: 23/33 = 69.7% of original (30.3% savings!)
```

Notice how 'A', which appears 5 times, gets a 1-bit code, while rare characters like 'C' and 'D' get 3-bit codes. The total encoding is 23 bits instead of 33: a 30% reduction just from respecting character frequencies!
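You can reproduce this result with the implementation from earlier on this page. Tie-breaking may assign different individual codes than the trace above, but the total length is always the optimal 23 bits:

```python
text = "ABRACADABRA"
encoded, codes = huffman_encode(text)

print(len(encoded))                            # 23 (optimal, regardless of ties)
print(len(text) * 3)                           # 33 bits at fixed length
print(huffman_decode(encoded, codes) == text)  # True: round-trip is lossless
```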
We've thoroughly explored the Huffman tree construction algorithm. Let's consolidate the key insights:

- The tree is built bottom-up: repeatedly extract the two lowest-frequency nodes, merge them into a new internal node, and reinsert it.
- A min-heap priority queue makes each extraction and insertion O(log n), giving O(n log n) total construction time.
- Codes are read off the root-to-leaf paths ('0' for left, '1' for right), so frequent symbols end up near the root with short codes.
- Edge cases such as a single-symbol alphabet and frequency ties need explicit handling for correct, reproducible output.
What's Next:
We've seen how to build the Huffman tree, but why does always combining the two smallest nodes lead to an optimal tree? In the next page, we'll examine the greedy choice property in detail—the theoretical foundation that guarantees Huffman's algorithm always finds the minimum average code length.
You now understand the complete Huffman tree construction algorithm: initializing leaf nodes, using a priority queue to repeatedly combine minimum nodes, and extracting codes from the final tree. You can implement this algorithm from scratch and trace through it step-by-step for any input.