Tries are elegant data structures. They enable O(m) lookup, insertion, and prefix queries where m is the length of the key—operations that hash tables cannot match for prefix-based problems. But elegance comes at a price: standard tries can be enormously wasteful with memory.
Consider storing the words "internationalization", "internationalize", and "international" in a standard trie. These three words share the prefix "international" (13 characters), yet the paths through the tree diverge only at the very end. A standard trie creates separate nodes for every single character, including long unary chains where nodes have exactly one child.
This observation leads to a powerful question: What if we could collapse those single-child chains into single nodes?
The answer is trie compression—a family of techniques that transform space-hungry standard tries into compact, efficient structures while preserving all their operational benefits.
By the end of this page, you will understand:
• Why standard tries waste space and where inefficiency arises
• The fundamental principle of trie compression
• How compressed tries maintain O(m) operational complexity
• The theoretical framework for analyzing compression benefits
• Key terminology: compressed trie, Patricia trie, radix tree
Before we can appreciate compression, we must thoroughly understand the problem it solves. Standard tries—also called prefix trees—store strings by creating one node per character along the path from root to leaf.
The Structure of a Standard Trie:
Each node in a standard trie contains:
• An array (or map) of child pointers, one slot per alphabet symbol
• A boolean flag marking whether a stored string ends at this node
• Optionally, auxiliary data such as counts or associated values

For a trie storing strings over a 26-letter alphabet using array-based children, each node requires 26 child pointers at 8 bytes each (208 bytes), plus the end-of-word flag, object headers, and alignment padding.

Total per node: potentially 250+ bytes on modern systems.
If you store n strings of average length m, a standard trie might create O(n × m) nodes. For a dictionary of 100,000 English words averaging 8 characters, that's potentially 800,000 nodes × 250+ bytes = 200+ MB just for the structure—far exceeding the actual string data!
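As a rough sketch of where those bytes go (illustrative only; it assumes lowercase ASCII keys and 8-byte references, and actual per-node cost varies by language runtime), a standard trie node looks like this:

```python
ALPHABET_SIZE = 26  # assumption: lowercase English letters only

class StandardTrieNode:
    """One node per character: a fixed array of child slots plus a flag."""
    def __init__(self):
        # 26 references at 8 bytes each is already ~208 bytes,
        # before counting the end-of-word flag, object header, and padding.
        self.children = [None] * ALPHABET_SIZE
        self.is_end_of_word = False
```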
Why This Happens: The Unary Chain Problem
The fundamental inefficiency arises from unary chains—sequences of nodes where each node has exactly one child. Consider storing these three words:
"apple"
"application"
"apply"
The paths look like this in a standard trie:
root
└── a
    └── p
        └── p
            └── l
                ├── e ✓ (end: "apple")
                ├── i
                │   └── c
                │       └── a
                │           └── t
                │               └── i
                │                   └── o
                │                       └── n ✓ (end: "application")
                └── y ✓ (end: "apply")
Notice the path from 'l' to 'i' to 'c' to 'a' to 't' to 'i' to 'o' to 'n' for "application". Each of those intermediate nodes (from 'i' down to 'o') has exactly one child. They perform no disambiguation—they simply extend a path. Yet each consumes the full memory footprint of a trie node.
| Node Character | Number of Children | Classification | Memory Justification |
|---|---|---|---|
| root | 1 | Unary | Could be compressed |
| a | 1 | Unary | Could be compressed |
| p (first) | 1 | Unary | Could be compressed |
| p (second) | 1 | Unary | Could be compressed |
| l | 3 | Branching | Must remain—disambiguates paths |
| e | 0 | Leaf | Must remain—marks word end |
| i | 1 | Unary | Could be compressed |
| c | 1 | Unary | Could be compressed |
| ... through 'o' | 1 | Unary | Could be compressed |
| n | 0 | Leaf | Must remain—marks word end |
| y | 0 | Leaf | Must remain—marks word end |
In this example, 10 out of 14 nodes (approximately 70%) are either unary (single child) or could benefit from compression. Only branching nodes and leaf nodes with end markers are strictly necessary to distinguish between stored strings.
The Pattern Recognition:
Unary chains are especially prevalent when:
• Stored strings share long common prefixes (URLs, file paths, domain names)
• Strings are long relative to how many strings are stored
• The string set is sparse within the space of all possible strings
In extreme cases—such as storing a single very long string—a standard trie degenerates into a linked list, with zero space benefit over simply storing the string directly.
The core insight of trie compression is deceptively simple: Replace every maximal unary chain with a single edge labeled with multiple characters.
Instead of creating nodes a→p→p→l where each arrow represents a single character, we create a single edge labeled "appl" from root to a node representing that entire prefix.
Before Compression (Standard Trie):
Each edge represents ONE character:
root ─a→ ● ─p→ ● ─p→ ● ─l→ ●
After Compression:
Edges represent STRINGS:
root ─"appl"→ ●
The compressed node retains all the semantics of the original path—it represents the complete prefix "appl"—while consuming the memory of only one node instead of four.
After compression, every internal node (non-leaf) must either have at least two children or mark the end of a stored word. If an unmarked internal node had only one child, we would have compressed it into its parent's edge. This invariant is what makes compressed tries efficient—no node exists without purposeful branching or a word boundary.
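To make the representation concrete, here is a minimal node sketch (hypothetical names, chosen to match the layout the search code later on this page assumes): outgoing edges are keyed by their full label string.

```python
class CompressedTrieNode:
    """A compressed-trie node: outgoing edges are keyed by their full
    multi-character label rather than by a single character."""
    def __init__(self):
        self.children = {}           # edge label (str) -> CompressedTrieNode
        self.is_end_of_word = False  # True if a stored string ends here
```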
Visualizing the Transformation:
Let's see our earlier example after compression:
Standard Trie (before):
root
└── a
    └── p
        └── p
            └── l
                ├── e ✓
                ├── i
                │   └── c
                │       └── a
                │           └── t
                │               └── i
                │                   └── o
                │                       └── n ✓
                └── y ✓
Compressed Trie (after):
root
└──"appl"──→ ●
             ├──"e"──→ ● ✓ ("apple")
             ├──"ication"──→ ● ✓ ("application")
             └──"y"──→ ● ✓ ("apply")
Node count reduction: 14 nodes → 5 nodes (64% reduction)
The compressed trie preserves all functionality: we can still search for "apple", check if a prefix "app" exists, and enumerate all words starting with "appl". But we do so with fewer than half the nodes.
To reason precisely about compressed tries, we need formal definitions. This terminology appears throughout the literature and is essential for understanding research papers, library documentation, and technical discussions.
Definition: Compressed Trie (Patricia Trie)
Given a set S of strings over alphabet Σ, a compressed trie T for S is a rooted tree where:
• Every edge is labeled with a nonempty string over Σ
• The labels of edges leaving the same node begin with distinct characters
• Every internal node other than the root either has at least two children or marks the end of a string in S
• Each string in S is spelled out by concatenating the edge labels along exactly one root-to-node path, and that node carries an end-of-string mark
A Patricia trie (Practical Algorithm to Retrieve Information Coded in Alphanumeric) is technically a specific implementation variant of compressed tries, but the terms are often used interchangeably in practice.
Definition: Radix Tree
A radix tree (also called radix trie) is a compressed trie where edge labels are constrained further. In a radix-2 tree (binary radix tree), each edge label is a sequence of bits. More generally, a radix-k tree operates over an alphabet of k symbols.
The term "radix tree" emphasizes the underlying number system interpretation, which is particularly relevant for:
| Term | Definition | Common Usage |
|---|---|---|
| Standard Trie | One node per character; edges labeled with single characters | Basic implementations, educational contexts |
| Compressed Trie | Edges may be multi-character; no unary chains | General term for space-optimized tries |
| Patricia Trie | Specific compressed trie variant storing skip counts | Networking, older literature |
| Radix Tree | Compressed trie emphasizing radix/number base interpretation | Kernel code, IP routing, system software |
| Compact Trie | Synonym for compressed trie | Some textbooks and papers |
You'll encounter "Patricia trie," "radix tree," and "compressed trie" used inconsistently in different contexts. The Linux kernel uses "radix tree." Academic papers often use "Patricia trie." Modern libraries may simply say "compact trie." They all refer to essentially the same structure with minor implementation variations. Focus on understanding the core concept rather than memorizing name distinctions.
Key Structural Properties:
Let T be a compressed trie storing n distinct strings. The following properties hold:
Maximum internal nodes: T has at most n-1 internal nodes.
Maximum total nodes: T has at most 2n-1 nodes.
Node bound is tight: The bound 2n-1 is achieved when every internal node has exactly two children (binary branching throughout); see the counting argument below. If instead all strings differ at their very first character, the trie has only n+1 nodes: a root with n leaf children.
Space complexity: O(n) nodes regardless of string lengths.
These properties guarantee that compressed tries scale with the number of strings, not with the total length of strings. For datasets with long strings sharing common prefixes, this difference can be enormous.
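For readers who want the counting argument behind these bounds, here is a sketch (simplifying by assuming each stored string ends at a leaf and every internal node branches at least two ways):

```latex
% At most n leaves (one per string) and I internal nodes,
% each internal node having at least two children:
\begin{aligned}
\#\text{edges} &= \#\text{nodes} - 1 = (n + I) - 1,\\
\#\text{edges} &\ge 2I \quad\text{(two or more children per internal node)},\\
n + I - 1 &\ge 2I \;\Longrightarrow\; I \le n - 1,\\
\#\text{nodes} &\le n + (n - 1) = 2n - 1.
\end{aligned}
```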
The elegance of compressed tries is that they maintain the same O(m) time complexity for all operations where m is the key length—despite the structural transformation. Let's understand why.
Search Operation:
To search for key k in a compressed trie:
1. Start at the root with position i = 0 in k.
2. Among the current node's outgoing edges, look for one whose label matches the next characters of k (at most one can match, since sibling labels begin with distinct characters).
3. If the entire edge label matches, advance i by the label's length and descend to that child. If k ends partway through the label, or a character differs, the exact key is not stored.
4. When all m characters are consumed, report whether the current node is marked as end-of-word.
Time Complexity Analysis:
The key insight: we compare each character of k at most once. In standard tries, we compare one character per node visited. In compressed tries, we may compare multiple characters per node, but the total comparisons across all nodes equals exactly m. The path length (number of nodes) may be much shorter, but we spend more time at each node—the product is the same.
```python
def search(root, key):
    """
    Search for 'key' in a compressed trie.
    Returns True if key exists as a complete word.
    Time: O(m) where m = len(key)
    """
    node = root
    i = 0  # Position in key
    while i < len(key):
        # Find matching outgoing edge
        found_edge = None
        for edge_label, child in node.children.items():
            if key[i:].startswith(edge_label):
                # Edge label is a prefix of remaining key
                found_edge = (edge_label, child)
                break
            if edge_label.startswith(key[i:]):
                # Key ends within this edge (prefix exists)
                return False  # Key itself doesn't end at node
        if found_edge is None:
            return False  # No matching edge
        edge_label, child = found_edge
        i += len(edge_label)  # Advance by entire edge label
        node = child
    # We've consumed entire key; check if it's marked as word
    return node.is_end_of_word
```

Insert Operation:
Insertion in compressed tries is more involved than in standard tries because we may need to split edges:
1. Walk down as in search, matching edge labels against the key.
2. If the key diverges partway through an edge label, split that edge: create a new internal node at the divergence point, re-attach the existing subtree under the remainder of the old label, and add the unmatched suffix of the key as a new edge.
3. If the key is exhausted partway through an edge label, split the edge there and mark the new node as end-of-word.
4. If the key extends beyond an existing node with no matching edge, add a single new edge labeled with the entire remaining suffix.
(A code sketch follows the example below.)
Example: Inserting "apply" into a trie containing "application"
Before:
root ──"application"──→ ● ✓
The key "apply" matches "appl" but diverges at 'y' vs 'i'. We split:
After:
root ──"appl"──→ ●
├──"ication"──→ ● ✓
└──"y"──→ ● ✓
We created a new internal node at "appl", split the original edge, and added the new suffix "y".
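Here is a hedged sketch of insertion with edge splitting, reusing the hypothetical CompressedTrieNode from earlier; the essential moves are the common-prefix scan and the re-keying of the split edge:

```python
def insert(root, key):
    """Insert 'key' into a compressed trie, splitting edges as needed."""
    node, i = root, 0
    while i < len(key):
        for edge_label, child in list(node.children.items()):
            # Length of the common prefix of the edge label and the rest of key
            j = 0
            while (j < len(edge_label) and i + j < len(key)
                   and edge_label[j] == key[i + j]):
                j += 1
            if j == 0:
                continue              # no overlap with this edge; try the next
            if j < len(edge_label):
                # Partial match: split the edge at the divergence point
                mid = CompressedTrieNode()
                mid.children[edge_label[j:]] = child
                del node.children[edge_label]
                node.children[edge_label[:j]] = mid
                child = mid
            i += j
            node = child
            break
        else:
            # No edge shares a first character: attach the rest as a new leaf
            leaf = CompressedTrieNode()
            leaf.is_end_of_word = True
            node.children[key[i:]] = leaf
            return
    node.is_end_of_word = True
```

Tracing "apply" against a trie holding only "application" reproduces the split above: the common prefix is "appl" (j = 4), the old edge is re-keyed to "ication", and "y" is attached as a fresh leaf.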
While search and lookup remain O(m), the constant factors for insertion and deletion are higher in compressed tries due to edge splitting and merging. This tradeoff is usually worthwhile because the space savings outweigh the insertion overhead—especially when reads vastly outnumber writes.
Delete Operation:
Deletion requires the inverse of splitting—merging edges when a node becomes unary:
1. Locate the key's node as in search and clear its end-of-word mark.
2. If the node is now an unmarked leaf, delete it along with its incoming edge.
3. If the node (or, after step 2, its parent) is now unmarked with exactly one remaining child, and is not the root, splice it out by concatenating its incoming edge label with its child's label.
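A hedged sketch of that logic, again assuming the hypothetical CompressedTrieNode layout:

```python
def delete(root, key):
    """Remove 'key' if present; merge edges when a node becomes unary.
    Returns True if the key was found and removed."""
    path = []                         # (parent, edge_label) pairs walked
    node, i = root, 0
    while i < len(key):
        for edge_label, child in node.children.items():
            if key[i:].startswith(edge_label):
                path.append((node, edge_label))
                node, i = child, i + len(edge_label)
                break
        else:
            return False              # no matching edge: key absent
    if not node.is_end_of_word:
        return False                  # path exists, but key was never stored
    node.is_end_of_word = False
    if not node.children and path:    # now an unmarked dead leaf: prune it
        parent, label = path.pop()
        del parent.children[label]
        node = parent
    if len(node.children) == 1 and not node.is_end_of_word and path:
        parent, label = path[-1]      # unmarked unary node: splice it out
        (child_label, child), = node.children.items()
        del parent.children[label]
        parent.children[label + child_label] = child
    return True
```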
Prefix Search (StartsWith):
Prefix search follows the same logic as key search but succeeds whenever the prefix is fully matched, even if the match ends partway through an edge label:
Time complexity remains O(p) where p is prefix length.
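A sketch of startsWith under the same assumptions; the only difference from exact search is that running out of prefix characters mid-edge counts as success:

```python
def starts_with(root, prefix):
    """Return True if any stored string begins with 'prefix'."""
    node, i = root, 0
    while i < len(prefix):
        for edge_label, child in node.children.items():
            if prefix[i:].startswith(edge_label):
                node, i = child, i + len(edge_label)  # consume whole edge
                break
            if edge_label.startswith(prefix[i:]):
                return True                           # prefix ends mid-edge
        else:
            return False                              # no edge matches
    return True                                       # prefix ends at a node
```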
The primary motivation for compression is space efficiency. Let's quantify the improvement precisely.
Standard Trie Space:
For a set S of n strings with total length L (summing all character counts), the trie contains up to L + 1 nodes in the worst case of no shared prefixes: one node per character, plus the root. Each node stores an array of |Σ| child pointers, for O(L × |Σ|) space overall.
With 26 lowercase letters and 8-byte pointers, that's 208 bytes per node.
Compressed Trie Space:
For the same set S, the compressed trie has at most 2n − 1 nodes. Each node stores its children plus edge labels whose characters total at most L, for O(n × |Σ| + L) space overall.
| Metric | Standard Trie | Compressed Trie | Savings |
|---|---|---|---|
| Node count | O(L) | O(n) | O(L/n) = O(avg length) |
| Children storage per node | O(|Σ|) | O(|Σ|) or O(branching) | Same or better |
| Edge label storage | Implicit (0) | O(L) total | New overhead |
| Overall space | O(L × |Σ|) | O(n × |Σ| + L) | Significant when L >> n |
When Is Compression Most Beneficial?
The space advantage of compressed tries is greatest when:
Strings share long common prefixes: URLs ("https://www.example.com/..."), domain names, file paths, etc.
Average string length is large: When L >> n (equivalently, when the average length L/n is much greater than 1), the reduction from O(L) nodes to O(n) nodes is dramatic.
Alphabet is large: With Unicode or extended ASCII, each node's children array is expensive. Fewer nodes = less waste.
String set is sparse relative to possible strings: Storing 1,000 words out of the roughly 10^21 possible 15-character lowercase strings (26^15 ≈ 1.7 × 10^21).
Numerical Example:
Storing 100,000 URLs averaging 60 characters (L = 6,000,000 characters total):
• Standard trie: up to roughly 6,000,000 nodes; at 250+ bytes each, on the order of 1.5 GB of structure.
• Compressed trie: at most 2 × 100,000 − 1 ≈ 200,000 nodes (roughly 50 MB), plus edge-label storage bounded by the 6 MB of raw characters, or less with the reference trick below.
A clever implementation doesn't store edge labels as copied substrings. Instead, it stores (start_index, length) references into the original string. This way, edge label storage becomes O(n) integers rather than O(L) characters. The original strings must be retained, but they're likely kept anyway for retrieval.
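A minimal sketch of that trick, with hypothetical names: each edge carries a (string_id, start, length) triple into a shared corpus instead of owning a copied label.

```python
class RefNode:
    """Node whose edge labels are references into the stored strings."""
    def __init__(self):
        # first character -> ((string_id, start, length), child node)
        self.edges = {}
        self.is_end_of_word = False

corpus = ["application", "apply"]     # original strings, retained anyway

def edge_label(ref):
    """Materialize a label on demand from its (string_id, start, length)."""
    string_id, start, length = ref
    return corpus[string_id][start:start + length]
```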
Theory is essential, but intuition comes from working through examples. Let's trace several scenarios to internalize how compressed tries behave.
Example 1: Progressive Insertions
Inserting words one at a time into an initially empty compressed trie:
Insert "test":
root ──"test"──→ ● ✓
Single word = single edge from root to a marked leaf.
Insert "testing":
root ──"test"──→ ●
├──""──→ ● ✓ (end marker for "test" moves here)
└──"ing"──→ ● ✓
Wait—this isn't quite right. Let's reconsider.
Actually, "test" is a prefix of "testing". The trie becomes:
root ──"test"──→ ● ✓ ("test")
└──"ing"──→ ● ✓ ("testing")
The node for "test" is marked AND has a child. No need for epsilon edges.
Insert "team": "team" shares only "te" with "test"/"testing". We split:
root ──"te"──→ ●
├──"st"──→ ● ✓ ("test")
│ └──"ing"──→ ● ✓ ("testing")
└──"am"──→ ● ✓ ("team")
Insert "tea": "tea" shares "tea" with "team". We split:
root ──"te"──→ ●
├──"st"──→ ● ✓
│ └──"ing"──→ ● ✓
└──"a"──→ ● ✓ ("tea")
└──"m"──→ ● ✓ ("team")
Node count: 6 nodes for 4 words. A standard trie would need 10 nodes: the root plus one node for each of the nine character positions (t, e, s, t, i, n, g, a, m).
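Assuming the insert and search sketches from earlier on this page, this trace can be checked mechanically:

```python
root = CompressedTrieNode()
for word in ["test", "testing", "team", "tea"]:
    insert(root, word)

def count_nodes(node):
    """Count all nodes, including the root."""
    return 1 + sum(count_nodes(c) for c in node.children.values())

print(count_nodes(root))        # 6, matching the diagram above
print(search(root, "testing"))  # True
print(search(root, "tes"))      # False: "tes" ends mid-edge
```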
Example 2: Edge Cases
What happens with special inputs?
Single character strings: {"a", "b", "c"}
root
├──"a"──→ ● ✓
├──"b"──→ ● ✓
└──"c"──→ ● ✓
No compression benefit—every string differs at first character. 4 nodes total.
Identical prefixes with single-character differences: {"aa", "ab", "ac"}
root ──"a"──→ ●
├──"a"──→ ● ✓
├──"b"──→ ● ✓
└──"c"──→ ● ✓
5 nodes, the same as a standard trie (root→a→{a,b,c}). When strings diverge immediately after a single shared character, there is no unary chain to collapse.
One very long string: {"supercalifragilisticexpialidocious"}
root ──"supercalifragilisticexpialidocious"──→ ● ✓
2 nodes total! Standard trie: 35 nodes.
It's worth pausing to understand why trie operations still work correctly after compression. The transformation might seem to lose information, but it doesn't.
The Fundamental Preservation:
A compressed trie preserves exactly one thing: the set of root-to-leaf (or root-to-end-marker) paths. Every distinct path through the standard trie—concatenating character labels—becomes exactly one distinct path through the compressed trie—concatenating edge labels.
Why Search Still Works:
Searching for key k means finding whether a path spelling k exists:
The path exists if and only if the sequence of characters matches. Whether we match one character per step or multiple characters per step, the total match is identical.
Why Prefix Queries Still Work:
For a prefix p: we descend exactly as in key search, consuming characters of p against edge labels. If p is fully consumed, whether at a node or partway along an edge label, then at least one stored string begins with p, and the subtree below that point enumerates every such string.
The only complication is handling mid-edge termination, but this is straightforward to implement.
The compression is lossless. We haven't discarded any information about which strings are stored—we've merely changed the encoding. What was represented as "node existence" is now represented as "substring within edge label." The total information content is identical; only the representation is more compact.
Correctness Invariants:
A correctly implemented compressed trie maintains these invariants after every operation:
• Every edge label is a nonempty string.
• No two edges leaving the same node begin with the same character.
• Every non-root node either is marked as end-of-word or has at least two children (no unmarked unary nodes, no unmarked dead leaves).
• Concatenating edge labels along the path to each marked node spells exactly one stored string, and every stored string is spelled by exactly one such path.
Violating any invariant leads to incorrect behavior: duplicate first characters make edge selection ambiguous; an empty label permits a step that consumes no input; an unmarked unary node forfeits the space guarantees and confuses merge logic during deletion; and a wrong end mark either hides a stored word or invents one that was never inserted.
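A small test-time checker for these invariants, again assuming the hypothetical node layout from earlier, can catch such bugs early:

```python
def check_invariants(node, is_root=True):
    """Recursively assert the compressed-trie invariants (testing aid)."""
    assert all(node.children), "edge labels must be nonempty"
    first_chars = [label[0] for label in node.children]
    assert len(first_chars) == len(set(first_chars)), \
        "sibling edge labels must start with distinct characters"
    if not is_root and not node.is_end_of_word:
        assert len(node.children) >= 2, \
            "every unmarked non-root node must branch at least two ways"
    for child in node.children.values():
        check_invariants(child, is_root=False)
```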
We've established a comprehensive understanding of what trie compression is and why it matters. Let's consolidate the key insights:
• Standard tries spend one node per character, so unary chains inflate them to O(L) nodes.
• Compression replaces every maximal unary chain with a single multi-character edge.
• The result has at most 2n − 1 nodes, while search, insert, delete, and prefix queries all remain O(m).
• "Compressed trie," "Patricia trie," and "radix tree" are near-synonyms naming minor variants of this one idea.
What's Next:
Now that we understand the what and why of trie compression, the next page dives into the how: the detailed mechanics of compacting single-child chains. We'll walk through the compression algorithm itself, exploring edge splitting, label management, and the subtle considerations that make implementations robust.
You now understand the fundamental concept of trie compression—why standard tries waste space on unary chains and how compressed tries eliminate this inefficiency while preserving O(m) operations. Next, we'll explore the specific techniques for compacting single-child chains into efficient edge-labeled structures.