On the previous page, we established that our goal is to find a prefix-free binary code with minimum average length. We framed this as finding a binary tree that minimizes weighted path length. But with the number of possible tree structures growing exponentially, how can we efficiently find the optimal one?
The answer lies in one of the most beautiful greedy algorithms in computer science: Huffman's algorithm. Rather than searching through all possible trees, Huffman discovered that building the tree bottom-up by repeatedly combining the two lowest-frequency nodes always produces an optimal result.
This page walks through the complete construction process, from intuition to implementation, ensuring you understand not just what the algorithm does but why each step is necessary.
By the end of this page, you will be able to construct a Huffman tree by hand for any frequency distribution, implement the algorithm in code using a priority queue, trace through the algorithm step-by-step, and extract the optimal codes from the completed tree.
Huffman's key insight can be stated simply:
The two symbols with the lowest frequencies should be siblings in the optimal tree, as deep as possible.
Why? Because in a full binary tree, the deepest leaves have the longest codes. By placing the least frequent symbols at the bottom, we ensure that the longest codes are assigned to symbols that appear least often—exactly what we want to minimize average code length.
The Greedy Strategy
Huffman's algorithm works by repeatedly finding the two lowest-frequency nodes and combining them into a new internal node whose frequency is the sum of its children. This new node then participates in future combinations.
The process continues until only one node remains—the root of the complete Huffman tree.
Most tree algorithms work top-down: start at the root and descend. Huffman works bottom-up: start with leaves (individual symbols) and build upward by combining. This reversal is what makes the greedy approach work.
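Before walking through a full example, here is the core loop in compact form. This is a minimal sketch, not the full implementation (that appears later on this page): it assumes a min-heap keyed by frequency, represents trees as nested tuples just to stay short, and the name `huffman_skeleton` is invented for this illustration.

```python
import heapq
from itertools import count

def huffman_skeleton(frequencies):
    """Core greedy loop: repeatedly merge the two lowest-frequency trees."""
    tie = count()  # tiebreaker so equal frequencies never compare trees directly
    heap = [(freq, next(tie), sym) for sym, freq in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)  # lowest frequency
        f2, _, t2 = heapq.heappop(heap)  # second lowest
        # The merged node rejoins the pool and can be merged again later
        heapq.heappush(heap, (f1 + f2, next(tie), (t1, t2)))
    return heap[0][2]  # the tree, as nested tuples

# Example: the six-symbol distribution traced below
print(huffman_skeleton({'A': 45, 'B': 13, 'C': 12, 'D': 16, 'E': 9, 'F': 5}))
```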
The Algorithm at a Glance
Let's trace through Huffman's algorithm with a concrete example. Consider a message where characters appear with the following frequencies:
| Character | Frequency |
|---|---|
| A | 45 |
| B | 13 |
| C | 12 |
| D | 16 |
| E | 9 |
| F | 5 |
Initial State: Six leaf nodes in our priority queue, ordered by frequency:
```
# Step 0: Initial priority queue (min-heap by frequency)

# Queue: [F:5, E:9, C:12, B:13, D:16, A:45]

# Each entry is a leaf node for the corresponding symbol

# Forest of 6 individual leaf nodes:

# [F:5] [E:9] [C:12] [B:13] [D:16] [A:45]
```

Step 1: Extract the two minimum nodes (F:5 and E:9). Combine them into a new internal node with frequency 5+9=14. Insert the new node back into the queue.
```
# Step 1: Combine F and E

# Extract: F:5, E:9
# Create internal node: (14) with children F, E
# Insert (14) back into queue

# Queue: [C:12, B:13, (14), D:16, A:45]

# Current forest:
#
#    (14)
#    /  \
# [F:5] [E:9]
#
# [C:12] [B:13] [D:16] [A:45]
```

Step 2: Extract C:12 and B:13. Combine into node with frequency 25.
```
# Step 2: Combine C and B

# Extract: C:12, B:13
# Create internal node: (25) with children C, B
# Insert (25) back into queue

# Queue: [(14), D:16, (25), A:45]

# Current forest:
#
#    (14)         (25)
#    /  \         /  \
# [F:5] [E:9] [C:12] [B:13]
#
# [D:16] [A:45]
```

Step 3: Extract (14) and D:16. Combine into node with frequency 30.
```
# Step 3: Combine (14) and D

# Extract: (14), D:16
# Create internal node: (30) with children (14), D
# Insert (30) back into queue

# Queue: [(25), (30), A:45]

# Current forest:
#
#       (30)
#       /  \
#    (14)  [D:16]
#    /  \
# [F:5] [E:9]
#
#    (25)
#    /  \
# [C:12] [B:13]
#
# [A:45]
```

Step 4: Extract (25) and (30). Combine into node with frequency 55.
```
# Step 4: Combine (25) and (30)

# Extract: (25), (30)
# Create internal node: (55) with children (25), (30)
# Insert (55) back into queue

# Queue: [A:45, (55)]

# Current forest:
#
#          (55)
#         /    \
#      (25)    (30)
#      /  \    /  \
# [C:12][B:13](14)[D:16]
#             /  \
#          [F:5] [E:9]
#
# [A:45]
```

Step 5 (Final): Extract A:45 and (55). Combine into root node with frequency 100.
```
# Step 5: Combine A and (55) - FINAL STEP

# Extract: A:45, (55)
# Create root node: (100) with children A, (55)
# Queue now has only one node - we're done!

# COMPLETE HUFFMAN TREE:
#
#         (100)
#         /   \
#    [A:45]   (55)
#            /    \
#         (25)    (30)
#         /  \    /  \
#    [C:12][B:13](14)[D:16]
#                /  \
#            [F:5] [E:9]

# Reading codes (path from root, left=0, right=1):
#   A: 0    (just left)
#   C: 100  (right, left, left)
#   B: 101  (right, left, right)
#   F: 1100 (right, right, left, left)
#   E: 1101 (right, right, left, right)
#   D: 111  (right, right, right)
```

Notice how the most frequent symbol (A:45) ends up closest to the root with a 1-bit code, while the least frequent symbols (F:5, E:9) end up deepest with 4-bit codes. This is precisely what we want for minimal average code length!
Once the Huffman tree is built, extracting the codes is straightforward. We perform a traversal from the root to each leaf, recording '0' for each left edge and '1' for each right edge.
The Resulting Codes:
| Character | Frequency | Code | Code Length |
|---|---|---|---|
| A | 45 | 0 | 1 |
| B | 13 | 101 | 3 |
| C | 12 | 100 | 3 |
| D | 16 | 111 | 3 |
| E | 9 | 1101 | 4 |
| F | 5 | 1100 | 4 |
Verifying Prefix-Free Property:
Let's confirm no code is a prefix of another:
Code 0 (for A) is the only code that starts with 0, so it cannot be a prefix of any other code. The 3-bit codes 100, 101, and 111 are pairwise distinct, and none of them equals 110, the 3-bit prefix shared by the 4-bit codes 1100 and 1101.

✓ Valid prefix-free code!
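If you prefer to check this mechanically, a short helper can compare every ordered pair of codes. This is just a sketch; `is_prefix_free` is a name made up for this page, not part of the implementation below.

```python
def is_prefix_free(codes: dict) -> bool:
    """Return True if no code is a proper prefix of another code."""
    values = list(codes.values())
    return not any(
        i != j and values[j].startswith(values[i])
        for i in range(len(values))
        for j in range(len(values))
    )

huffman_codes = {'A': '0', 'B': '101', 'C': '100',
                 'D': '111', 'E': '1101', 'F': '1100'}
print(is_prefix_free(huffman_codes))           # True
print(is_prefix_free({'A': '0', 'B': '01'}))   # False: '0' is a prefix of '01'
```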
Calculating Average Code Length:
Total characters = 45 + 13 + 12 + 16 + 9 + 5 = 100
Average = (45×1 + 13×3 + 12×3 + 16×3 + 9×4 + 5×4) / 100 = (45 + 39 + 36 + 48 + 36 + 20) / 100 = 224 / 100 = 2.24 bits per symbol
Compare to fixed-length: ⌈log₂(6)⌉ = 3 bits per symbol
Compression achieved: (3 - 2.24) / 3 = 25.3% savings!
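The same arithmetic in a few lines of Python, using the frequencies and codes from the tables above (a quick sketch for checking the numbers):

```python
freqs = {'A': 45, 'B': 13, 'C': 12, 'D': 16, 'E': 9, 'F': 5}
codes = {'A': '0', 'B': '101', 'C': '100', 'D': '111', 'E': '1101', 'F': '1100'}

total = sum(freqs.values())                                 # 100
avg = sum(freqs[s] * len(codes[s]) for s in freqs) / total  # 2.24
fixed = 3                                                   # ceil(log2(6))
print(f"{avg} bits/symbol, savings {(fixed - avg) / fixed:.1%}")  # 2.24, 25.3%
```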
```python
def extract_codes(root, current_code="", codes=None):
    """
    Extract Huffman codes by traversing the tree.

    Args:
        root: The root of the Huffman tree
        current_code: The code built so far (used in recursion)
        codes: Dictionary to store symbol -> code mappings

    Returns:
        Dictionary mapping each symbol to its Huffman code
    """
    if codes is None:
        codes = {}

    if root is None:
        return codes

    # If this is a leaf node (has a symbol), record the code
    if root.symbol is not None:
        # Edge case: single-symbol alphabet gets code "0"
        codes[root.symbol] = current_code if current_code else "0"
        return codes

    # Recurse: left child gets '0', right child gets '1'
    extract_codes(root.left, current_code + "0", codes)
    extract_codes(root.right, current_code + "1", codes)
    return codes


# After building the Huffman tree, call:
# codes = extract_codes(root)
```

The choice of '0' for left and '1' for right is an arbitrary convention. Some implementations use the opposite. What matters is consistency: whichever convention you choose for encoding must be used for decoding.
Now let's implement the complete Huffman coding algorithm. The key data structures are a `HuffmanNode` class (leaves carry a symbol; internal nodes have `symbol=None`) and a min-heap, via Python's `heapq` module, serving as the priority queue.
```python
import heapq
from collections import Counter
from typing import Dict, Optional


class HuffmanNode:
    """
    Node in a Huffman tree.

    Leaf nodes have a symbol; internal nodes have symbol=None.
    All nodes have a frequency (weight).
    """
    def __init__(
        self,
        frequency: int,
        symbol: Optional[str] = None,
        left: Optional['HuffmanNode'] = None,
        right: Optional['HuffmanNode'] = None
    ):
        self.frequency = frequency
        self.symbol = symbol
        self.left = left
        self.right = right

    # Define comparison for heap ordering
    # We compare by frequency; ties broken arbitrarily
    def __lt__(self, other: 'HuffmanNode') -> bool:
        return self.frequency < other.frequency

    def is_leaf(self) -> bool:
        return self.symbol is not None


def build_huffman_tree(frequencies: Dict[str, int]) -> HuffmanNode:
    """
    Build a Huffman tree from symbol frequencies.

    Args:
        frequencies: Dict mapping symbols to their frequencies

    Returns:
        The root node of the Huffman tree

    Time Complexity: O(n log n) where n = number of symbols
    Space Complexity: O(n) for the heap and tree nodes
    """
    if not frequencies:
        raise ValueError("Cannot build Huffman tree for empty alphabet")

    # Handle single-symbol edge case
    if len(frequencies) == 1:
        symbol, freq = next(iter(frequencies.items()))
        # Create a tree with just one leaf
        return HuffmanNode(freq, symbol)

    # Step 1: Create leaf nodes and add to min-heap
    heap = []
    for symbol, freq in frequencies.items():
        node = HuffmanNode(freq, symbol)
        heapq.heappush(heap, node)

    # Step 2: Repeatedly combine two smallest nodes
    while len(heap) > 1:
        # Extract two minimum nodes
        left = heapq.heappop(heap)
        right = heapq.heappop(heap)

        # Create parent with combined frequency
        parent = HuffmanNode(
            frequency=left.frequency + right.frequency,
            symbol=None,  # Internal node, no symbol
            left=left,
            right=right
        )

        # Insert parent back into heap
        heapq.heappush(heap, parent)

    # Step 3: Return the root (last remaining node)
    return heap[0]


def extract_codes(node: HuffmanNode, prefix: str = "") -> Dict[str, str]:
    """
    Extract Huffman codes by traversing the tree.

    Args:
        node: Current node in traversal
        prefix: Code built so far on path from root

    Returns:
        Dict mapping each symbol to its Huffman code
    """
    codes = {}
    if node.is_leaf():
        # Leaf node: record the code (use "0" for single-symbol case)
        codes[node.symbol] = prefix if prefix else "0"
    else:
        # Internal node: recurse on children
        if node.left:
            codes.update(extract_codes(node.left, prefix + "0"))
        if node.right:
            codes.update(extract_codes(node.right, prefix + "1"))
    return codes


def huffman_encode(text: str) -> tuple[str, Dict[str, str]]:
    """
    Encode text using Huffman coding.

    Args:
        text: The string to encode

    Returns:
        Tuple of (encoded_bits, code_table)
    """
    # Count frequencies
    frequencies = Counter(text)

    # Build tree and extract codes
    root = build_huffman_tree(dict(frequencies))
    codes = extract_codes(root)

    # Encode the text
    encoded = ''.join(codes[char] for char in text)
    return encoded, codes


def huffman_decode(encoded: str, codes: Dict[str, str]) -> str:
    """
    Decode Huffman-encoded bits using the code table.

    Args:
        encoded: String of '0' and '1' representing encoded data
        codes: Dict mapping symbols to their codes

    Returns:
        Decoded string
    """
    # Build reverse lookup: code -> symbol
    reverse_codes = {code: symbol for symbol, code in codes.items()}

    result = []
    current = ""
    for bit in encoded:
        current += bit
        if current in reverse_codes:
            result.append(reverse_codes[current])
            current = ""

    if current:
        # Leftover bits indicate invalid encoding
        raise ValueError(f"Invalid encoding: leftover bits '{current}'")

    return ''.join(result)


# Example usage
if __name__ == "__main__":
    # Example 1: Simple message
    text = "ABRACADABRA"
    encoded, codes = huffman_encode(text)
    decoded = huffman_decode(encoded, codes)

    print("=== Huffman Coding Demo ===")
    print(f"Original: {text}")
    print(f"Codes: {codes}")
    print(f"Encoded: {encoded}")
    print(f"Decoded: {decoded}")
    print(f"Original bits (8-bit ASCII): {len(text) * 8}")
    print(f"Encoded bits: {len(encoded)}")
    print(f"Compression ratio: {len(encoded) / (len(text) * 8):.1%}")
```

The priority queue (min-heap) is crucial for efficiency. Each extraction and insertion takes O(log n) time. With n-1 iterations, the total time is O(n log n). Without a heap, a naive linear search for the minimum would make the algorithm O(n²).
Let's analyze the time and space complexity of Huffman's algorithm in detail.
Let n = size of the alphabet (number of distinct symbols)
Time Complexity: O(n log n)
| Operation | Frequency | Cost per Operation | Total |
|---|---|---|---|
| Build initial heap | 1 | O(n) | O(n) |
| Extract minimum | 2(n-1) | O(log n) | O(n log n) |
| Insert new node | n-1 | O(log n) | O(n log n) |
| Extract codes (tree traversal) | 1 | O(n) | O(n) |
| Overall | | | O(n log n) |
Space Complexity: O(n). The heap never holds more than n nodes, and the finished tree contains exactly 2n−1 nodes (n leaves plus n−1 internal nodes).
Encoding a message of length m: O(m) time once the code table is built, since each character becomes a single dictionary lookup.
Decoding a message: time linear in the number of encoded bits, since each bit advances the decoder by exactly one step.
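To make the per-bit cost concrete, here is an alternative decoder that walks the tree directly instead of using the reverse code table. This is a sketch, assuming the `HuffmanNode` class from the implementation above; it skips the single-symbol edge case for brevity.

```python
def huffman_decode_tree(encoded: str, root: HuffmanNode) -> str:
    """Decode by walking the tree: O(1) work per bit, O(m) total for m bits."""
    result = []
    node = root
    for bit in encoded:
        node = node.left if bit == '0' else node.right  # one edge per bit
        if node.is_leaf():
            result.append(node.symbol)
            node = root  # next symbol starts again at the root
    return ''.join(result)
```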
In practice, Huffman tables are often pre-computed and standardized (like in JPEG). Building the tree is a one-time cost; the ongoing cost is the O(m) encoding/decoding of data, where m is the message length.
A robust Huffman implementation must handle several edge cases correctly.
```python
# Edge Case 1: Single symbol
text = "AAAAA"
encoded, codes = huffman_encode(text)
print(f"Single symbol: {codes}")  # {'A': '0'}

# Edge Case 2: Two symbols with equal frequency
text = "ABABAB"
encoded, codes = huffman_encode(text)
print(f"Equal frequency: {codes}")  # e.g., {'A': '0', 'B': '1'}

# Edge Case 3: Highly skewed distribution
text = "A" * 99 + "B"  # 99 A's and 1 B
encoded, codes = huffman_encode(text)
print(f"Skewed: {codes}")  # A gets shorter code

# Edge Case 4: All characters once (uniform distribution)
text = "ABCDEFGH"
encoded, codes = huffman_encode(text)
# With 8 symbols, all codes will be 3 bits (since log₂(8) = 3)
print(f"Uniform: {codes}")
```

When two nodes have equal frequency, the choice of which to extract first is arbitrary for optimality but matters for reproducibility. If you need identical codes across runs, implement deterministic tie-breaking (e.g., by symbol value).
While the priority queue approach is standard, there are other ways to implement Huffman's algorithm, each with different trade-offs.
```python
from collections import deque


def build_huffman_tree_linear(sorted_symbols: list) -> HuffmanNode:
    """
    Build Huffman tree in O(n) time when input is pre-sorted.

    Args:
        sorted_symbols: List of (symbol, frequency) sorted by frequency

    Key insight: Internal nodes are created in non-decreasing order,
    so they're automatically sorted as we create them.
    """
    # Queue 1: Original leaf nodes (already sorted)
    leaves = deque(
        HuffmanNode(freq, symbol) for symbol, freq in sorted_symbols
    )

    # Queue 2: Internal nodes (sorted by construction)
    internals = deque()

    def get_minimum():
        """Get the minimum node from either queue."""
        if not leaves:
            return internals.popleft()
        if not internals:
            return leaves.popleft()
        # Both non-empty: compare fronts
        if leaves[0].frequency <= internals[0].frequency:
            return leaves.popleft()
        return internals.popleft()

    # Combine until one node remains
    while len(leaves) + len(internals) > 1:
        left = get_minimum()
        right = get_minimum()
        parent = HuffmanNode(
            frequency=left.frequency + right.frequency,
            left=left,
            right=right
        )
        internals.append(parent)

    return internals[0] if internals else leaves[0]


# Usage:
# sorted_input = [('F', 5), ('E', 9), ('C', 12), ('B', 13), ('D', 16), ('A', 45)]
# root = build_huffman_tree_linear(sorted_input)
```

The O(n) algorithm shines when frequencies are already sorted or when using fixed, pre-sorted alphabets (like ASCII). In general-purpose compression, the O(n log n) heap version is simpler and the difference is negligible for typical alphabet sizes (n ≤ 256).
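As a sanity check, both builders should produce trees with the same total weighted path length (224 bits for our running example), even though tie-breaking may assign different individual codes. Below is a sketch reusing `build_huffman_tree`, `build_huffman_tree_linear`, and `extract_codes` from above:

```python
freqs = {'F': 5, 'E': 9, 'C': 12, 'B': 13, 'D': 16, 'A': 45}
sorted_input = sorted(freqs.items(), key=lambda kv: kv[1])

heap_codes = extract_codes(build_huffman_tree(freqs))
linear_codes = extract_codes(build_huffman_tree_linear(sorted_input))

def total_bits(codes):
    """Weighted path length: sum of frequency x code length."""
    return sum(freqs[s] * len(codes[s]) for s in codes)

print(total_bits(heap_codes), total_bits(linear_codes))  # 224 224
```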
Let's trace another example to solidify understanding. Consider building a Huffman tree for the message 'ABRACADABRA'.
Step 1: Count Frequencies
| Character | Count | Relative Frequency |
|---|---|---|
| A | 5 | 45.5% |
| B | 2 | 18.2% |
| R | 2 | 18.2% |
| C | 1 | 9.1% |
| D | 1 | 9.1% |
```
# Building Huffman Tree for "ABRACADABRA"

# Initial heap: [C:1, D:1, B:2, R:2, A:5]

# Step 1: Combine C:1 + D:1 → (2)
# Heap: [B:2, R:2, (2), A:5]
#
#   (2)
#   / \
# [C] [D]

# Step 2: Combine R:2 + B:2 → (4) [or B:2 + (2) depending on tie-break]
# Let's say we combine R and B:
# Heap: [(2), A:5, (4)]
#
#   (4)
#   / \
# [R] [B]

# Step 3: Combine (2) + (4) → (6)
# Heap: [A:5, (6)]
#
#      (6)
#     /   \
#   (2)   (4)
#   / \   / \
# [C] [D][R] [B]

# Step 4: Combine A:5 + (6) → (11) - ROOT
#
#     (11)
#     /  \
# [A:5]  (6)
#        /  \
#      (2)  (4)
#      / \  / \
#    [C][D][R][B]

# Resulting codes:
#   A: 0   (1 bit)
#   C: 100 (3 bits)
#   D: 101 (3 bits)
#   R: 110 (3 bits)
#   B: 111 (3 bits)

# Encoded "ABRACADABRA":
#   A B   R   A C   A D   A B   R   A
#   0 111 110 0 100 0 101 0 111 110 0
#   = 01111100100010101111100
#   = 23 bits

# Fixed length (3 bits for 5 symbols) would be:
#   11 characters × 3 bits = 33 bits

# Compression: 23/33 = 69.7% of original (30.3% savings!)
```

Notice how 'A', which appears 5 times, gets a 1-bit code, while rare characters like 'C' and 'D' get 3-bit codes. The total encoding is 23 bits instead of 33: a 30% reduction just from respecting character frequencies!
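You can reproduce this result with the implementation from earlier on this page. Tie-breaking may assign different individual codes than the trace above, but the total length is always the optimal 23 bits:

```python
text = "ABRACADABRA"
encoded, codes = huffman_encode(text)

print(len(encoded))                            # 23 (optimal, regardless of ties)
print(len(text) * 3)                           # 33 bits at fixed length
print(huffman_decode(encoded, codes) == text)  # True: round-trip is lossless
```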
We've thoroughly explored the Huffman tree construction algorithm. Let's consolidate the key insights:

- The tree is built bottom-up: repeatedly extract the two lowest-frequency nodes, merge them into a new internal node, and reinsert it.
- A min-heap priority queue makes each extraction and insertion O(log n), giving O(n log n) total construction time.
- Codes are read off the root-to-leaf paths ('0' for left, '1' for right), so frequent symbols end up near the root with short codes.
- Edge cases such as a single-symbol alphabet and frequency ties need explicit handling for correct, reproducible output.
What's Next:
We've seen how to build the Huffman tree, but why does always combining the two smallest nodes lead to an optimal tree? In the next page, we'll examine the greedy choice property in detail—the theoretical foundation that guarantees Huffman's algorithm always finds the minimum average code length.
You now understand the complete Huffman tree construction algorithm: initializing leaf nodes, using a priority queue to repeatedly combine minimum nodes, and extracting codes from the final tree. You can implement this algorithm from scratch and trace through it step-by-step for any input.