Tries are elegant data structures. They enable O(m) lookup, insertion, and prefix queries where m is the length of the key—operations that hash tables cannot match for prefix-based problems. But elegance comes at a price: standard tries can be enormously wasteful with memory.
Consider storing the words "internationalization", "internationalize", and "international" in a standard trie. These three words share the prefix "international" (13 characters), yet the paths through the tree diverge only at the very end. A standard trie creates separate nodes for every single character, including long unary chains where nodes have exactly one child.
This observation leads to a powerful question: What if we could collapse those single-child chains into single nodes?
The answer is trie compression—a family of techniques that transform space-hungry standard tries into compact, efficient structures while preserving all their operational benefits.
By the end of this page, you will understand:
• Why standard tries waste space and where inefficiency arises
• The fundamental principle of trie compression
• How compressed tries maintain O(m) operational complexity
• The theoretical framework for analyzing compression benefits
• Key terminology: compressed trie, Patricia trie, radix tree
Before we can appreciate compression, we must thoroughly understand the problem it solves. Standard tries—also called prefix trees—store strings by creating one node per character along the path from root to leaf.
The Structure of a Standard Trie:
Each node in a standard trie contains:
• An array (or map) of child pointers, one slot per alphabet symbol
• A boolean flag marking whether a stored string ends at this node
• Optionally, auxiliary data such as counts or associated values

For a trie storing strings over a 26-letter alphabet using array-based children, each node requires 26 child pointers at 8 bytes each (208 bytes), plus the end-of-word flag, object headers, and alignment padding.

Total per node: potentially 250+ bytes on modern systems.
If you store n strings of average length m, a standard trie might create O(n × m) nodes. For a dictionary of 100,000 English words averaging 8 characters, that's potentially 800,000 nodes × 250+ bytes = 200+ MB just for the structure—far exceeding the actual string data!
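As a rough sketch of where those bytes go (illustrative only; it assumes lowercase ASCII keys and 8-byte references, and actual per-node cost varies by language runtime), a standard trie node looks like this:

```python
ALPHABET_SIZE = 26  # assumption: lowercase English letters only

class StandardTrieNode:
    """One node per character: a fixed array of child slots plus a flag."""
    def __init__(self):
        # 26 references at 8 bytes each is already ~208 bytes,
        # before counting the end-of-word flag, object header, and padding.
        self.children = [None] * ALPHABET_SIZE
        self.is_end_of_word = False
```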
Why This Happens: The Unary Chain Problem
The fundamental inefficiency arises from unary chains—sequences of nodes where each node has exactly one child. Consider storing these three words:
"apple"
"application"
"apply"
The paths look like this in a standard trie:
root
└── a
    └── p
        └── p
            └── l
                ├── e ✓ (end: "apple")
                ├── i
                │   └── c
                │       └── a
                │           └── t
                │               └── i
                │                   └── o
                │                       └── n ✓ (end: "application")
                └── y ✓ (end: "apply")
Notice the path from 'l' to 'i' to 'c' to 'a' to 't' to 'i' to 'o' to 'n' for "application". Each of those intermediate nodes (from 'i' down to 'o') has exactly one child. They perform no disambiguation—they simply extend a path. Yet each consumes the full memory footprint of a trie node.
| Node Character | Number of Children | Classification | Memory Justification |
|---|---|---|---|
| root | 1 | Unary | Could be compressed |
| a | 1 | Unary | Could be compressed |
| p (first) | 1 | Unary | Could be compressed |
| p (second) | 1 | Unary | Could be compressed |
| l | 3 | Branching | Must remain—disambiguates paths |
| e | 0 | Leaf | Must remain—marks word end |
| i | 1 | Unary | Could be compressed |
| c | 1 | Unary | Could be compressed |
| ... through 'o' | 1 | Unary | Could be compressed |
| n | 0 | Leaf | Must remain—marks word end |
| y | 0 | Leaf | Must remain—marks word end |
In this example, 10 out of 14 nodes (approximately 70%) are either unary (single child) or could benefit from compression. Only branching nodes and leaf nodes with end markers are strictly necessary to distinguish between stored strings.
The Pattern Recognition:
Unary chains are especially prevalent when:
• Stored strings share long common prefixes (URLs, file paths, domain names)
• Strings are long relative to how many strings are stored
• The string set is sparse within the space of all possible strings
In extreme cases—such as storing a single very long string—a standard trie degenerates into a linked list, with zero space benefit over simply storing the string directly.
The core insight of trie compression is deceptively simple: Replace every maximal unary chain with a single edge labeled with multiple characters.
Instead of creating nodes a→p→p→l where each arrow represents a single character, we create a single edge labeled "appl" from root to a node representing that entire prefix.
Before Compression (Standard Trie):
Each edge represents ONE character:
root ─a→ ● ─p→ ● ─p→ ● ─l→ ●
After Compression:
Edges represent STRINGS:
root ─"appl"→ ●
The compressed node retains all the semantics of the original path—it represents the complete prefix "appl"—while consuming the memory of only one node instead of four.
After compression, every internal node (non-leaf) must either have at least two children or mark the end of a stored word. If an unmarked internal node had only one child, we would have compressed it into its parent's edge. This invariant is what makes compressed tries efficient—no node exists without purposeful branching or a word boundary.
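To make the representation concrete, here is a minimal node sketch (hypothetical names, chosen to match the layout the search code later on this page assumes): outgoing edges are keyed by their full label string.

```python
class CompressedTrieNode:
    """A compressed-trie node: outgoing edges are keyed by their full
    multi-character label rather than by a single character."""
    def __init__(self):
        self.children = {}           # edge label (str) -> CompressedTrieNode
        self.is_end_of_word = False  # True if a stored string ends here
```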
Visualizing the Transformation:
Let's see our earlier example after compression:
Standard Trie (before):
root
└── a
    └── p
        └── p
            └── l
                ├── e ✓
                ├── i
                │   └── c
                │       └── a
                │           └── t
                │               └── i
                │                   └── o
                │                       └── n ✓
                └── y ✓
Compressed Trie (after):
root
└──"appl"──→ ●
             ├──"e"──→ ● ✓ ("apple")
             ├──"ication"──→ ● ✓ ("application")
             └──"y"──→ ● ✓ ("apply")
Node count reduction: 14 nodes → 5 nodes (64% reduction)
The compressed trie preserves all functionality: we can still search for "apple", check if a prefix "app" exists, and enumerate all words starting with "appl". But we do so with fewer than half the nodes.
To reason precisely about compressed tries, we need formal definitions. This terminology appears throughout the literature and is essential for understanding research papers, library documentation, and technical discussions.
Definition: Compressed Trie (Patricia Trie)
Given a set S of strings over alphabet Σ, a compressed trie T for S is a rooted tree where:
• Every edge is labeled with a nonempty string over Σ
• The labels of edges leaving the same node begin with distinct characters
• Every internal node other than the root either has at least two children or marks the end of a string in S
• Each string in S is spelled out by concatenating the edge labels along exactly one root-to-node path, and that node carries an end-of-string mark
A Patricia trie (Practical Algorithm to Retrieve Information Coded in Alphanumeric) is technically a specific implementation variant of compressed tries, but the terms are often used interchangeably in practice.
Definition: Radix Tree
A radix tree (also called radix trie) is a compressed trie where edge labels are constrained further. In a radix-2 tree (binary radix tree), each edge label is a sequence of bits. More generally, a radix-k tree operates over an alphabet of k symbols.
The term "radix tree" emphasizes the underlying number system interpretation, which is particularly relevant for:
| Term | Definition | Common Usage |
|---|---|---|
| Standard Trie | One node per character; edges labeled with single characters | Basic implementations, educational contexts |
| Compressed Trie | Edges may be multi-character; no unary chains | General term for space-optimized tries |
| Patricia Trie | Specific compressed trie variant storing skip counts | Networking, older literature |
| Radix Tree | Compressed trie emphasizing radix/number base interpretation | Kernel code, IP routing, system software |
| Compact Trie | Synonym for compressed trie | Some textbooks and papers |
You'll encounter "Patricia trie," "radix tree," and "compressed trie" used inconsistently in different contexts. The Linux kernel uses "radix tree." Academic papers often use "Patricia trie." Modern libraries may simply say "compact trie." They all refer to essentially the same structure with minor implementation variations. Focus on understanding the core concept rather than memorizing name distinctions.
Key Structural Properties:
Let T be a compressed trie storing n distinct strings. The following properties hold:
Maximum internal nodes: T has at most n-1 internal nodes.
Maximum total nodes: T has at most 2n-1 nodes.
Node bound is tight: The bound 2n-1 is achieved when every internal node has exactly two children (binary branching throughout); see the counting argument below. If instead all strings differ at their very first character, the trie has only n+1 nodes: a root with n leaf children.
Space complexity: O(n) nodes regardless of string lengths.
These properties guarantee that compressed tries scale with the number of strings, not with the total length of strings. For datasets with long strings sharing common prefixes, this difference can be enormous.
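For readers who want the counting argument behind these bounds, here is a sketch (simplifying by assuming each stored string ends at a leaf and every internal node branches at least two ways):

```latex
% At most n leaves (one per string) and I internal nodes,
% each internal node having at least two children:
\begin{aligned}
\#\text{edges} &= \#\text{nodes} - 1 = (n + I) - 1,\\
\#\text{edges} &\ge 2I \quad\text{(two or more children per internal node)},\\
n + I - 1 &\ge 2I \;\Longrightarrow\; I \le n - 1,\\
\#\text{nodes} &\le n + (n - 1) = 2n - 1.
\end{aligned}
```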
The elegance of compressed tries is that they maintain the same O(m) time complexity for all operations where m is the key length—despite the structural transformation. Let's understand why.
Search Operation:
To search for key k in a compressed trie:
1. Start at the root with position i = 0 in k.
2. Among the current node's outgoing edges, look for one whose label matches the next characters of k (at most one can match, since sibling labels begin with distinct characters).
3. If the entire edge label matches, advance i by the label's length and descend to that child. If k ends partway through the label, or a character differs, the exact key is not stored.
4. When all m characters are consumed, report whether the current node is marked as end-of-word.
Time Complexity Analysis:
The key insight: we compare each character of k at most once. In standard tries, we compare one character per node visited. In compressed tries, we may compare multiple characters per node, but the total comparisons across all nodes equals exactly m. The path length (number of nodes) may be much shorter, but we spend more time at each node—the product is the same.
```python
def search(root, key):
    """
    Search for 'key' in a compressed trie.
    Returns True if key exists as a complete word.
    Time: O(m) where m = len(key)
    """
    node = root
    i = 0  # Position in key
    while i < len(key):
        # Find matching outgoing edge
        found_edge = None
        for edge_label, child in node.children.items():
            if key[i:].startswith(edge_label):
                # Edge label is a prefix of remaining key
                found_edge = (edge_label, child)
                break
            if edge_label.startswith(key[i:]):
                # Key ends within this edge (prefix exists)
                return False  # Key itself doesn't end at node
        if found_edge is None:
            return False  # No matching edge
        edge_label, child = found_edge
        i += len(edge_label)  # Advance by entire edge label
        node = child
    # We've consumed entire key; check if it's marked as word
    return node.is_end_of_word
```

Insert Operation:
Insertion in compressed tries is more involved than in standard tries because we may need to split edges:
1. Walk down as in search, matching edge labels against the key.
2. If the key diverges partway through an edge label, split that edge: create a new internal node at the divergence point, re-attach the existing subtree under the remainder of the old label, and add the unmatched suffix of the key as a new edge.
3. If the key is exhausted partway through an edge label, split the edge there and mark the new node as end-of-word.
4. If the key extends beyond an existing node with no matching edge, add a single new edge labeled with the entire remaining suffix.
(A code sketch follows the example below.)
Example: Inserting "apply" into a trie containing "application"
Before:
root ──"application"──→ ● ✓
The key "apply" matches "appl" but diverges at 'y' vs 'i'. We split:
After:
root ──"appl"──→ ●
├──"ication"──→ ● ✓
└──"y"──→ ● ✓
We created a new internal node at "appl", split the original edge, and added the new suffix "y".
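Here is a hedged sketch of insertion with edge splitting, reusing the hypothetical CompressedTrieNode from earlier; the essential moves are the common-prefix scan and the re-keying of the split edge:

```python
def insert(root, key):
    """Insert 'key' into a compressed trie, splitting edges as needed."""
    node, i = root, 0
    while i < len(key):
        for edge_label, child in list(node.children.items()):
            # Length of the common prefix of the edge label and the rest of key
            j = 0
            while (j < len(edge_label) and i + j < len(key)
                   and edge_label[j] == key[i + j]):
                j += 1
            if j == 0:
                continue              # no overlap with this edge; try the next
            if j < len(edge_label):
                # Partial match: split the edge at the divergence point
                mid = CompressedTrieNode()
                mid.children[edge_label[j:]] = child
                del node.children[edge_label]
                node.children[edge_label[:j]] = mid
                child = mid
            i += j
            node = child
            break
        else:
            # No edge shares a first character: attach the rest as a new leaf
            leaf = CompressedTrieNode()
            leaf.is_end_of_word = True
            node.children[key[i:]] = leaf
            return
    node.is_end_of_word = True
```

Tracing "apply" against a trie holding only "application" reproduces the split above: the common prefix is "appl" (j = 4), the old edge is re-keyed to "ication", and "y" is attached as a fresh leaf.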
While search and lookup remain O(m), the constant factors for insertion and deletion are higher in compressed tries due to edge splitting and merging. This tradeoff is usually worthwhile because the space savings outweigh the insertion overhead—especially when reads vastly outnumber writes.
Delete Operation:
Deletion requires the inverse of splitting—merging edges when a node becomes unary:
1. Locate the key's node as in search and clear its end-of-word mark.
2. If the node is now an unmarked leaf, delete it along with its incoming edge.
3. If the node (or, after step 2, its parent) is now unmarked with exactly one remaining child, and is not the root, splice it out by concatenating its incoming edge label with its child's label.
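A hedged sketch of that logic, again assuming the hypothetical CompressedTrieNode layout:

```python
def delete(root, key):
    """Remove 'key' if present; merge edges when a node becomes unary.
    Returns True if the key was found and removed."""
    path = []                         # (parent, edge_label) pairs walked
    node, i = root, 0
    while i < len(key):
        for edge_label, child in node.children.items():
            if key[i:].startswith(edge_label):
                path.append((node, edge_label))
                node, i = child, i + len(edge_label)
                break
        else:
            return False              # no matching edge: key absent
    if not node.is_end_of_word:
        return False                  # path exists, but key was never stored
    node.is_end_of_word = False
    if not node.children and path:    # now an unmarked dead leaf: prune it
        parent, label = path.pop()
        del parent.children[label]
        node = parent
    if len(node.children) == 1 and not node.is_end_of_word and path:
        parent, label = path[-1]      # unmarked unary node: splice it out
        (child_label, child), = node.children.items()
        del parent.children[label]
        parent.children[label + child_label] = child
    return True
```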
Prefix Search (StartsWith):
Prefix search follows the same logic as key search but succeeds whenever the prefix is fully matched, even if the match ends partway through an edge label:
Time complexity remains O(p) where p is prefix length.
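A sketch of startsWith under the same assumptions; the only difference from exact search is that running out of prefix characters mid-edge counts as success:

```python
def starts_with(root, prefix):
    """Return True if any stored string begins with 'prefix'."""
    node, i = root, 0
    while i < len(prefix):
        for edge_label, child in node.children.items():
            if prefix[i:].startswith(edge_label):
                node, i = child, i + len(edge_label)  # consume whole edge
                break
            if edge_label.startswith(prefix[i:]):
                return True                           # prefix ends mid-edge
        else:
            return False                              # no edge matches
    return True                                       # prefix ends at a node
```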
The primary motivation for compression is space efficiency. Let's quantify the improvement precisely.
Standard Trie Space:
For a set S of n strings with total length L (summing all character counts), the trie contains up to L + 1 nodes in the worst case of no shared prefixes: one node per character, plus the root. Each node stores an array of |Σ| child pointers, for O(L × |Σ|) space overall.
With 26 lowercase letters and 8-byte pointers, that's 208 bytes per node.
Compressed Trie Space:
For the same set S, the compressed trie has at most 2n − 1 nodes. Each node stores its children plus edge labels whose characters total at most L, for O(n × |Σ| + L) space overall.
| Metric | Standard Trie | Compressed Trie | Savings |
|---|---|---|---|
| Node count | O(L) | O(n) | O(L/n) = O(avg length) |
| Children storage per node | O(|Σ|) | O(|Σ|) or O(branching) | Same or better |
| Edge label storage | Implicit (0) | O(L) total | New overhead |
| Overall space | O(L × |Σ|) | O(n × |Σ| + L) | Significant when L >> n |
When Is Compression Most Beneficial?
The space advantage of compressed tries is greatest when:
Strings share long common prefixes: URLs ("https://www.example.com/..."), domain names, file paths, etc.
Average string length is large: When L >> n (equivalently, when the average length L/n is much greater than 1), the reduction from O(L) nodes to O(n) nodes is dramatic.
Alphabet is large: With Unicode or extended ASCII, each node's children array is expensive. Fewer nodes = less waste.
String set is sparse relative to possible strings: Storing 1,000 words out of the roughly 10^21 possible 15-character lowercase strings (26^15 ≈ 1.7 × 10^21).
Numerical Example:
Storing 100,000 URLs averaging 60 characters (L = 6,000,000 characters total):
• Standard trie: up to roughly 6,000,000 nodes; at 250+ bytes each, on the order of 1.5 GB of structure.
• Compressed trie: at most 2 × 100,000 − 1 ≈ 200,000 nodes (roughly 50 MB), plus edge-label storage bounded by the 6 MB of raw characters, or less with the reference trick below.
A clever implementation doesn't store edge labels as copied substrings. Instead, it stores (start_index, length) references into the original string. This way, edge label storage becomes O(n) integers rather than O(L) characters. The original strings must be retained, but they're likely kept anyway for retrieval.
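A minimal sketch of that trick, with hypothetical names: each edge carries a (string_id, start, length) triple into a shared corpus instead of owning a copied label.

```python
class RefNode:
    """Node whose edge labels are references into the stored strings."""
    def __init__(self):
        # first character -> ((string_id, start, length), child node)
        self.edges = {}
        self.is_end_of_word = False

corpus = ["application", "apply"]     # original strings, retained anyway

def edge_label(ref):
    """Materialize a label on demand from its (string_id, start, length)."""
    string_id, start, length = ref
    return corpus[string_id][start:start + length]
```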
Theory is essential, but intuition comes from working through examples. Let's trace several scenarios to internalize how compressed tries behave.
Example 1: Progressive Insertions
Inserting words one at a time into an initially empty compressed trie:
Insert "test":
root ──"test"──→ ● ✓
Single word = single edge from root to a marked leaf.
Insert "testing":
root ──"test"──→ ●
├──""──→ ● ✓ (end marker for "test" moves here)
└──"ing"──→ ● ✓
Wait—this isn't quite right. Let's reconsider.
Actually, "test" is a prefix of "testing". The trie becomes:
root ──"test"──→ ● ✓ ("test")
└──"ing"──→ ● ✓ ("testing")
The node for "test" is marked AND has a child. No need for epsilon edges.
Insert "team": "team" shares only "te" with "test"/"testing". We split:
root ──"te"──→ ●
├──"st"──→ ● ✓ ("test")
│ └──"ing"──→ ● ✓ ("testing")
└──"am"──→ ● ✓ ("team")
Insert "tea": "tea" shares "tea" with "team". We split:
root ──"te"──→ ●
├──"st"──→ ● ✓
│ └──"ing"──→ ● ✓
└──"a"──→ ● ✓ ("tea")
└──"m"──→ ● ✓ ("team")
Node count: 6 nodes for 4 words. A standard trie would need 10 nodes: the root plus one node for each of the nine character positions (t, e, s, t, i, n, g, a, m).
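Assuming the insert and search sketches from earlier on this page, this trace can be checked mechanically:

```python
root = CompressedTrieNode()
for word in ["test", "testing", "team", "tea"]:
    insert(root, word)

def count_nodes(node):
    """Count all nodes, including the root."""
    return 1 + sum(count_nodes(c) for c in node.children.values())

print(count_nodes(root))        # 6, matching the diagram above
print(search(root, "testing"))  # True
print(search(root, "tes"))      # False: "tes" ends mid-edge
```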
Example 2: Edge Cases
What happens with special inputs?
Single character strings: {"a", "b", "c"}
root
├──"a"──→ ● ✓
├──"b"──→ ● ✓
└──"c"──→ ● ✓
No compression benefit—every string differs at first character. 4 nodes total.
Identical prefixes with single-character differences: {"aa", "ab", "ac"}
root ──"a"──→ ●
├──"a"──→ ● ✓
├──"b"──→ ● ✓
└──"c"──→ ● ✓
5 nodes, the same as a standard trie (root→a→{a,b,c}). When strings diverge immediately after a single shared character, there is no unary chain to collapse.
One very long string: {"supercalifragilisticexpialidocious"}
root ──"supercalifragilisticexpialidocious"──→ ● ✓
2 nodes total! Standard trie: 35 nodes.
It's worth pausing to understand why trie operations still work correctly after compression. The transformation might seem to lose information, but it doesn't.
The Fundamental Preservation:
A compressed trie preserves exactly one thing: the set of root-to-leaf (or root-to-end-marker) paths. Every distinct path through the standard trie—concatenating character labels—becomes exactly one distinct path through the compressed trie—concatenating edge labels.
Why Search Still Works:
Searching for key k means finding whether a path spelling k exists:
The path exists if and only if the sequence of characters matches. Whether we match one character per step or multiple characters per step, the total match is identical.
Why Prefix Queries Still Work:
For a prefix p: we descend exactly as in key search, consuming characters of p against edge labels. If p is fully consumed, whether at a node or partway along an edge label, then at least one stored string begins with p, and the subtree below that point enumerates every such string.
The only complication is handling mid-edge termination, but this is straightforward to implement.
The compression is lossless. We haven't discarded any information about which strings are stored—we've merely changed the encoding. What was represented as "node existence" is now represented as "substring within edge label." The total information content is identical; only the representation is more compact.
Correctness Invariants:
A correctly implemented compressed trie maintains these invariants after every operation:
• Every edge label is a nonempty string.
• No two edges leaving the same node begin with the same character.
• Every non-root node either is marked as end-of-word or has at least two children (no unmarked unary nodes, no unmarked dead leaves).
• Concatenating edge labels along the path to each marked node spells exactly one stored string, and every stored string is spelled by exactly one such path.
Violating any invariant leads to incorrect behavior: duplicate first characters make edge selection ambiguous; an empty label permits a step that consumes no input; an unmarked unary node forfeits the space guarantees and confuses merge logic during deletion; and a wrong end mark either hides a stored word or invents one that was never inserted.
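A small test-time checker for these invariants, again assuming the hypothetical node layout from earlier, can catch such bugs early:

```python
def check_invariants(node, is_root=True):
    """Recursively assert the compressed-trie invariants (testing aid)."""
    assert all(node.children), "edge labels must be nonempty"
    first_chars = [label[0] for label in node.children]
    assert len(first_chars) == len(set(first_chars)), \
        "sibling edge labels must start with distinct characters"
    if not is_root and not node.is_end_of_word:
        assert len(node.children) >= 2, \
            "every unmarked non-root node must branch at least two ways"
    for child in node.children.values():
        check_invariants(child, is_root=False)
```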
We've established a comprehensive understanding of what trie compression is and why it matters. Let's consolidate the key insights:
• Standard tries spend one node per character, so unary chains inflate them to O(L) nodes.
• Compression replaces every maximal unary chain with a single multi-character edge.
• The result has at most 2n − 1 nodes, while search, insert, delete, and prefix queries all remain O(m).
• "Compressed trie," "Patricia trie," and "radix tree" are near-synonyms naming minor variants of this one idea.
What's Next:
Now that we understand the what and why of trie compression, the next page dives into the how: the detailed mechanics of compacting single-child chains. We'll walk through the compression algorithm itself, exploring edge splitting, label management, and the subtle considerations that make implementations robust.
You now understand the fundamental concept of trie compression—why standard tries waste space on unary chains and how compressed tries eliminate this inefficiency while preserving O(m) operations. Next, we'll explore the specific techniques for compacting single-child chains into efficient edge-labeled structures.